This paper introduces InternVid, a large-scale video-centric multimodal ...
In this study, we initiate an exploration into video understanding by in...
We present an interactive visual framework named InternGPT, or iGPT for ...
Video Foundation Models (VFMs) have received limited exploration due to ...
Foundation models have recently shown excellent performance on a var...
Learning discriminative spatiotemporal representation is the key problem...
In this report, we present our champion solutions to five tracks at Ego4...
Contrastive Vision-Language Pre-training, known as CLIP, has provided a ...
Distracted driving causes thousands of deaths per year, and how to apply...
Challenging illumination conditions (low light, underexposure and overex...
It is a challenging task to learn discriminative representation from ima...
It is a challenging task to learn rich and multi-scale spatiotemporal se...
Recently, zero-shot and few-shot learning via Contrastive Vision-Languag...
Vision transformers (ViTs) have become the popular structures and outper...
Self-attention has become an integral component of the recent network ar...
Contrastive Vision-Language Pre-training, known as CLIP, has provided a ...
We focus on the problem of novel-view human action synthesis. Given an a...
3D convolution is powerful for video classification but often computatio...