Many studies focus on improving pretraining or developing new backbones ...
Text-to-video (T2V) synthesis has gained increasing attention in the com...
We tackle the data scarcity challenge in few-shot point cloud recognitio...
Multimodal Large Language Models (MLLMs) have recently sparked significa...
We show that classifiers trained with random region proposals achieve st...
Generative AI has made significant strides in computer vision, particula...
Despite their success in real data synthesis, diffusion models (DMs) oft...
Text-video retrieval poses various challenges, including biases comin...
In this paper, we delve deeper into the Kullback-Leibler (KL) Divergence...
This study explores the concept of equivariance in vision-language found...
Weakly Supervised Video Anomaly Detection (WSVAD) is challenging because...
Semantic Scene Completion (SSC) transforms an image of single-view depth...
We present a new paradigm for fine-tuning large-scale vision-language pre...
We design a novel global-local Transformer named Ada-ClustFormer (ACF) t...
Aligning objects with words plays a critical role in Image-Language BERT...
Deep neural networks for video action recognition easily learn to utiliz...
Extracting class activation maps (CAM) is a key step for weakly-supervis...
Knowledge distillation (KD) is essentially a process of transferring a t...
Humans tend to decompose a sentence into different parts like sth do sth...
Out-Of-Distribution generalization (OOD) is all about learning invarianc...
Conventional de-noising methods rely on the assumption that all samples ...
Nearly all existing scene graph generation (SGG) models have overlooked ...
We are interested in learning robust models from insufficient data, with...
Existing long-tailed classification (LT) methods only focus on tackling ...
Semi-Supervised Learning (SSL) is fundamentally a missing label problem,...
In this report, we present our approach for EPIC-KITCHENS-100 Multi-Inst...
Seas of videos are uploaded daily with the popularity of social channels...
Thanks to the large pre-trained vision-language models (VLMs) like CLIP,...
Deep learning models have achieved great success in many fields, yet the...
Extracting class activation maps (CAM) is arguably the most standard ste...
We focus on the confounding bias between language and location in the vi...
We address the overlooked unbiasedness in existing long-tailed classific...
The fine-tuning of pre-trained language models has achieved great success in ma...
Question answering (QA) models are well-known to exploit data bias, e.g....
A good visual representation is an inference map from observations (imag...
Today's VQA models still tend to capture superficial linguistic correlat...
We propose an Auto-Parsing Network (APN) to discover and exploit the inp...
The attention module does not always help deep models learn causal features ...
Existing Unsupervised Domain Adaptation (UDA) literature adopts the cova...
Adversarial training is the de facto most promising defense against adve...
Present language understanding methods have demonstrated extraordinary a...
The prevailing framework for matching multimodal inputs is based on a tw...
Multi-hop Question Answering (QA) is a challenging task because it requi...
Weakly-supervised Temporal Action Localization (WTAL) aims to detect the...
We present a novel attention mechanism: Causal Attention (CATT), to remo...
We propose a causal framework to explain the catastrophic forgetting in ...
We present a novel counterfactual framework for both Zero-Shot Learning ...
We uncover an ever-overlooked deficiency in the prevailing Few-Shot Lear...
As the class size grows, maintaining a balanced dataset across many clas...
We present a causal inference framework to improve Weakly-Supervised Sem...