This paper presents a comprehensive survey of the taxonomy and evolution...
In this paper, we study the denoising diffusion probabilistic model (DDP...
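As a hedged aside, the closed-form forward (noising) process that underlies DDPMs can be sketched as below; the schedule values and variable names (T, betas, alpha_bar, q_sample) are illustrative assumptions, not taken from the paper above.

    import torch

    # Minimal sketch of the DDPM forward (noising) process q(x_t | x_0).
    # Schedule and names are assumptions for illustration only.
    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)        # linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)     # \bar{alpha}_t = prod_{s<=t} alpha_s

    def q_sample(x0, t, noise=None):
        """Sample x_t ~ N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
        if noise is None:
            noise = torch.randn_like(x0)
        abar_t = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
        return abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * noise

    # Usage: noise a batch of images at random timesteps for training.
    x0 = torch.randn(8, 3, 32, 32)               # stand-in for real images
    t = torch.randint(0, T, (8,))
    xt = q_sample(x0, t)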
Generative AI has made significant strides in computer vision, particula...
Despite the promising progress in multi-modal tasks, current large multi...
Multimodal summarization with multimodal output (MSMO) has emerged as a ...
Model merging (e.g., via interpolation or task arithmetic) fuses multipl...
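For concreteness, one common form of the merging mentioned above, task arithmetic over fine-tuned checkpoints, can be sketched roughly as follows; the function name and the scaling coefficient lam are assumptions for illustration, not the paper's method.

    import torch

    def task_arithmetic_merge(pretrained_sd, finetuned_sds, lam=0.3):
        """Rough sketch: theta_pre + lam * sum_i (theta_i - theta_pre).

        pretrained_sd / finetuned_sds are PyTorch state_dicts with matching keys;
        lam is an assumed scaling coefficient, typically tuned on validation data.
        """
        merged = {}
        for name, theta_pre in pretrained_sd.items():
            # Task vector of each fine-tuned model: its delta from the pretrained weights.
            task_vectors = [sd[name] - theta_pre for sd in finetuned_sds]
            merged[name] = theta_pre + lam * sum(task_vectors)
        return merged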
Spatial control is a core capability in controllable image generation. A...
The most recent efforts in video matting have focused on eliminating tri...
This study explores the concept of equivariance in vision-language found...
We propose MM-REACT, a system paradigm that integrates ChatGPT with a po...
3D photography renders a static image into a video with appealing 3D vis...
We present X-Decoder, a generalized decoding model that can predict pixe...
This paper surveys vision-language pre-training (VLP) methods for multim...
Masked visual modeling (MVM) has been recently proven effective for visu...
Vision-language (VL) pre-training has recently received considerable att...
Unified vision-language frameworks have greatly advanced in recent years...
In this paper, we design and train a Generative Image-to-text Transforme...
We present a cross-modal Transformer-based framework, which jointly enco...
We initiate the first empirical study on the use of MLP architectures fo...
The canonical approach to video captioning dictates a caption generation...
A great challenge in video-language (VidL) modeling lies in the disconne...
Most existing video-and-language (VidL) research focuses on a single dat...
With large-scale pre-training, the past two years have witnessed signifi...
Large-scale transformer-based pre-training has recently revolutionized v...
Vision-and-language pre-training has achieved impressive success in lear...
Multimodal pre-training has propelled great advancement in vision-and-la...
The canonical approach to video-and-language learning (e.g., video quest...
Large-scale pre-trained multimodal transformers, such as ViLBERT and UNI...
Cross-domain alignment between two sets of entities (e.g., objects in an...
We present VILLA, the first known effort on large-scale adversarial trai...
We present HERO, a Hierarchical EncodeR for Omni-representation learning...
There are two main lines of research on visual reasoning: neural module ...
Joint image-text embedding is the bedrock for most Vision-and-Language (...
In order to answer semantically-complicated questions about an image, a ...
This paper presents Recurrent Dual Attention Network (ReDAN) for visual ...
Humans make complex inferences on faces, ranging from objective properti...