Recent research on Large Language Models (LLMs) has led to remarkable ad...
The development of language models has moved from encoder-decoder to de...
To build Video Question Answering (VideoQA) systems capable of assisting...
This paper presents OmniVL, a new foundation model to support both image...
Large-scale multi-modal contrastive pre-training has demonstrated great ...
People say, "A picture is worth a thousand words". Then how can we get t...
The goal of this work is to build flexible video-language models that ca...
Cross-modal encoders for vision-language (VL) tasks are often pretrained...
Contrastive language-image pretraining (CLIP) links vision and language ...
Vision-language (V+L) pretraining models have achieved great success in ...
Contrastive language-image pretraining (CLIP) using image-text pairs has...
Automated visual understanding of our diverse and open world demands com...
Most existing video-and-language (VidL) research focuses on a single dat...
Vision-and-language pre-training has achieved impressive success in lear...
This work concerns video-language pre-training and representation learni...
The canonical approach to video-and-language learning (e.g., video quest...
Articulated hand pose tracking is an underexplored problem that carries ...
Transformer has become ubiquitous in the deep learning field. One of the...
This paper presents a unified Vision-Language Pre-training (VLP) model. ...
Video description is one of the most challenging problems in vision and ...
Video action recognition, as a critical problem towards video understand...
We study weakly-supervised video object grounding: given a video segment...
Dense video captioning aims to generate text descriptions for all events...
The potential for agents, whether embodied or software, to learn by obse...
Attention mechanisms have attracted considerable interest in image capti...
Reinforcement learning has significant applications for multi-agent syst...