Vision-language tasks, such as VQA, SNLI-VE, and VCR, are challenging bec...
The field of vision and language has witnessed a proliferation of pre-tr...
From a visual scene containing multiple people, humans are able to disting...
Visual commonsense understanding requires Vision-Language (VL) models to...
Large-scale multi-modal contrastive pre-training has demonstrated great
...
Cross-modal encoders for vision-language (VL) tasks are often pretrained...
Point cloud analysis is challenging due to irregularity and unordered da...
Contrastive language-image pretraining (CLIP) links vision and language
...
Answering complex questions about images is an ambitious goal for machin...
Graph Neural Networks (GNNs) have demonstrated their effectiveness in de...
Pre-trained contextual vision-and-language (V&L) models have brought
i...
Scene graph generation models understand the scene through object and
pr...
Domain Adaptation (DA) approaches achieved significant improvements in a...
Exploiting relationships between visual regions and question words has
...
An explainable machine learning method for point cloud classification, c...
Learning effective fusion of multi-modality features is at the heart of
...
Three-dimensional (3D) shape recognition has drawn much research attenti...
Mesh is an important and powerful type of data for 3D shapes and widely
...
In this paper, we present a hypergraph neural networks (HGNN) framework ...
3D object recognition has attracted wide research attention in the field...
Generative adversarial network (GAN) has gotten wide research interest ...