In this paper, we show that recent advances in video representation lear...
Brain-inspired spiking neural networks (SNNs) have demonstrated great po...
Pre-trained vision transformers have strong representation benefits to v...
Though the success of CLIP-based training recipes in vision-language mod...
The exploitation of Deepfake techniques for malicious intentions has dri...
Despite the rapid advancement of unsupervised learning in visual represe...
Vision Transformers (ViTs) are normally regarded as a stack of transform...
Current deep networks are very data-hungry and benefit from training on ...
Recent text-to-image diffusion models have demonstrated an astonishing c...
Video-language pre-training (VLP) has become increasingly important due ...
In this report, we present our champion solution for Ego4D Natural Langu...
Visual foundation models like CLIP excel in learning feature representat...
Recent research on Large Language Models (LLMs) has led to remarkable ad...
This paper examines the problems of severe image-text misalignment and h...
Public large-scale text-to-image diffusion models, such as Stable Diffus...
Various stuff and things in visual data possess specific traits, which c...
We introduce HOSNeRF, a novel 360 free-viewpoint rendering method that r...
While remarkable success has been achieved in weakly-supervised object l...
Human visual recognition is a sparse process, where only a few salient v...
Humans excel at learning from expert demonstrations and solving their ow...
Collecting and annotating images with pixel-wise labels is time-consumin...
Parameter-Efficient Transfer Learning (PETL) aims at efficiently adaptin...
Deepfake techniques have been widely used for malicious purposes, prompt...
Learning object-centric representations from complex natural environment...
Recently, privacy-preserving action recognition (PPAR) has been becoming ...
To reproduce the success of text-to-image (T2I) generation, recent works...
Vision-Language Pre-Training (VLP) has shown promising capabilities to a...
To build Video Question Answering (VideoQA) systems capable of assisting...
Recent advances in generative adversarial networks (GANs) have demonstra...
Vector-Quantized (VQ-based) generative models usually consist of two bas...
Our education system comprises a series of curricula. For example, when ...
The traditional model upgrading paradigm for retrieval requires recomput...
VQA is an ambitious task aiming to answer any image-related question. Ho...
Open-world instance segmentation (OWIS) aims to segment class-agnostic i...
In the Metaverse, the physical space and the virtual space co-exist, and...
Multi-channel video-language retrieval requires models to understand info...
To thrive in evolving environments, humans are capable of continual acqu...
Modeling dynamic scenes is important for many applications such as virtu...
Rendering scenes with a high-quality human face from arbitrary viewpoint...
Cognitive science has shown that humans perceive videos in terms of even...
As an important area in computer vision, object tracking has formed two ...
A long-standing goal of intelligent assistants such as AR glasses/robots...
It is still a pipe dream that AI assistants on phone and AR glasses can ...
Audio-visual speaker diarization aims at detecting “who spoke when” usi...
Self-attention has become an integral component of the recent network ar...
Video grounding aims to localize the temporal segment corresponding to a...
Weakly-Supervised Temporal Action Localization (WSTAL) aims to localize ...
Active speaker detection (ASD) seeks to detect who is speaking in a visu...
This paper presents a novel task together with a new benchmark for detec...