Recent developments in vision-language approaches have instigated a paradi...
Large-scale language models have shown the ability to adapt to a new tas...
We present GLIPv2, a grounded VL understanding model that serves both l...
Disinformation has become a serious problem on social media. In particul...
Recent work has shown that Pre-trained Language Models (PLMs) have the a...
Logical reasoning is needed in a wide range of NLP tasks. Can a BERT mod...
Learning visual representations from natural language supervision has re...
Contrastive language-image pretraining (CLIP) using image-text pairs has...
Answering complex questions about images is an ambitious goal for machin...
This paper presents a grounded language-image pre-training (GLIP) model ...
Commonsense is defined as the knowledge that is shared by everyone. Howe...
Vision-and-language (VL) models take image and text as input and learn...
Most existing Vision-and-Language (VL) models rely on pre-trained vis...
Pre-trained contextual vision-and-language (VL) models have brought i...
We propose VisualBERT, a simple and flexible framework for modeling a br...
Contextual representation models have achieved great success in improvin...