While recently Multimodal Large Language Models (MM-LLMs) have made exci...
Text-to-video (T2V) synthesis has gained increasing attention in the com...
Recent studies have shown that dense retrieval models, lacking dedicated...
Multimodal Large Language Models (MLLMs) have recently sparked significa...
Multi-modal recommendation systems, which integrate diverse types of inf...
Panoptic Scene Graph Generation (PSG) parses objects and predicts their ...
The prevalence of short video platforms has spawned a lot of fake news v...
Unpaired cross-lingual image captioning has long suffered from irrelevan...
Visual spatial description (VSD) aims to generate texts that describe th...
While developing a new vision-language LLM (VL-LLM) by pre-training on t...
Recently, Meta AI Research approaches a general, promptable Segment Anyt...
Scene Graph Generation (SGG) aims to extract <subject, predicate, object...
Prompt tuning, a recently emerging paradigm, enables the powerful vision...
Deep neural networks (DNNs) can be easily fooled by adversarial attacks ...
The booming development and huge market of micro-videos bring new e-comm...
Ramp merging is a typical application of cooperative intelligent transpo...
Short video platforms have become an important channel for news sharing,...
We investigate composed image retrieval with text feedback. Users gradua...
Relying on deep supervised or self-supervised learning, previous methods...
Video Question Answering (VideoQA) is the task of answering questions ab...
Vision-language pre-training (VLP) has shown impressive performance on a...
Growing interests in RGB-D salient object detection (RGB-D SOD) have bee...
This research aims to study a self-supervised 3D clothing reconstruction...
Scene graph generation (SGG) aims to extract (subject, predicate, object...
Video Question Answering (VideoQA) aims to answer natural language quest...
Generally, humans are more skilled at perceiving differences between hig...
Video question answering requires the models to understand and reason ab...
Grounded Situation Recognition (GSR), i.e., recognizing the salient acti...
Automatic meeting summarization is becoming increasingly popular these d...
The better accuracy and efficiency trade-off has been a challenging prob...
This paper focuses on a new problem of estimating human pose and shape f...
We tackle the task of video moment retrieval (VMR), which aims to locali...
Weakly-Supervised Object Detection (WSOD) and Localization (WSOL), i.e.,...
We aim to address the problem of Natural Language Video Localization (NL...
Benefiting from the spatial cues embedded in depth images, recent progre...
The COVID-19 outbreak was announced as a global pandemic by the World He...
As a fundamental and challenging problem in computer vision, hand pose e...