Vision-language tasks, such as VQA, SNLI-VE, and VCR, are challenging bec...
The field of vision and language has witnessed a proliferation of pre-tr...
From a visual scene containing multiple people, humans are able to disting...
Visual commonsense understanding requires Vision Language (VL) models to...
Cross-modal encoders for vision-language (VL) tasks are often pretrained...
Contrastive language-image pretraining (CLIP) links vision and language
...
Answering complex questions about images is an ambitious goal for machin...
Graph Neural Networks (GNNs) have demonstrated their effectiveness in de...
Pre-trained contextual vision-and-language (V&L) models have brought
i...
Scene graph generation models understand the scene through object and
pr...
Unconstrained remote gaze estimation remains challenging mostly due to i...