This paper presents a comprehensive survey of the taxonomy and evolution...
Visual instruction tuning has recently shown encouraging progress with
o...
Conversational generative AI has demonstrated remarkable promise for
emp...
Image-text contrastive learning models such as CLIP have demonstrated st...
Large-scale text-to-image diffusion models have made amazing advances.
H...
We present X-Decoder, a generalized decoding model that can predict
pixe...
Cross-modal encoders for vision-language (VL) tasks are often pretrained...
Recent state-of-the-art computer vision systems are trained from natural...
Learning visual representations from natural language supervision has
re...
Visual recognition is recently learned via either supervised learning on...
In computer vision, it has achieved great success in adapting large-scal...
In this work, we propose focal modulation network (FocalNet in short), w...
Contrastive language-image pretraining (CLIP) links vision and language
...
Contrastive language-image pretraining (CLIP) using image-text pairs has...
This paper presents a grounded language-image pre-training (GLIP) model ...
Automated visual understanding of our diverse and open world demands com...
Learning from image-text data has demonstrated recent success for many
r...
Contrastive learning has been widely used to train transformer-based
vis...
There is a surge of interest in image scene graph generation (object,
at...
Recently, Vision Transformer and its variants have shown great promise o...
This paper investigates two techniques for developing efficient
self-sup...
This paper presents a new Vision Transformer (ViT) architecture Multi-Sc...
This paper presents a detailed study of improving visual representations...
When answering questions about an image, it not only needs knowing what ...
Neuro-symbolic representations have proved effective in learning structu...
Passive visual systems typically fail to recognize objects in the amodal...
In an open-world setting, it is inevitable that an intelligent agent (e....
We propose a novel scene graph generation model called Graph R-CNN, that...
We introduce a novel framework for image captioning that can produce nat...
We present a novel training framework for neural sequence models,
partic...
We present LR-GAN: an adversarial image generation model which takes sce...
A number of recent works have proposed attention models for Visual Quest...
In this paper, we propose a recurrent framework for Joint Unsupervised
L...
Though having achieved some progresses, the hand-crafted texture feature...