This paper introduces InternVid, a large-scale video-centric multimodal ...
With the exponential growth of video data, there is an urgent need for a...
In this study, we initiate an exploration into video understanding by in...
We present an interactive visual framework named InternGPT, or iGPT for ...
Scale is the primary factor for building a powerful foundation model tha...
Video Foundation Models (VFMs) have received limited exploration due to ...
We present a Non-parametric Network for 3D point cloud analysis, Point-N...
Masked Modeling (MM) has demonstrated widespread success in various visi...
The foundation models have recently shown excellent performance on a var...
Learning discriminative spatiotemporal representation is the key problem...
In this report, we present our champion solutions to five tracks at Ego4...
Video understanding is an important problem in computer vision. Currentl...
Tiny Actions Challenge focuses on understanding human activities in real...
Point cloud completion aims to predict complete shape from its partial o...
Cross domain object detection is a realistic and challenging task in the...
Domain adaptive object detection (DAOD) is a promising way to alleviate ...
It is a challenging task to learn discriminative representation from ima...
Self-supervised learning has not been fully explored for point cloud ana...
It is a challenging task to learn rich and multi-scale spatiotemporal se...
As real-scanned point clouds are mostly partial due to occlusions and vi...
Vision transformers (ViTs) have become the popular structures and outper...
Self-attention has become an integral component of the recent network ar...
Graph Convolution Network (GCN) has been successfully used for 3D human ...
Self-supervised Multi-view stereo (MVS) with a pretext task of image rec...
Blood cell detection in microscopic images is an essential branch of med...
3D convolution is powerful for video classification but often computatio...
On the existing benchmark datasets, THUMOS14 and ActivityNet, temporal a...
The end-to-end Human Mesh Recovery (HMR) approach has been successfully ...
Temporal convolution has been widely used for video classification. Howe...
Given a point in m-dimensional objective space, the local environment of...
A customized multi-objective evolutionary algorithm (MOEA) is proposed f...
Few-shot object detection is a challenging but realistic scenario, where...
Fine-grained classification is a challenging problem, due to subtle diff...
Recent development of object detection mainly depends on deep learning w...
Unsupervised clustering has broad applications in data stratification, p...
Recent advances in object detection are mainly driven by deep learning w...
Traditional feature encoding scheme (e.g., Fisher vector) with local des...