Punctuation restoration is an important task in automatic speech recogni...
Being able to perceive the semantics and the spatial structure of the en...
Vision-language pretraining models have achieved great success in suppor...
This paper presents a novel transformer architecture for graph represent...
Vision-and-Language Navigation (VLN) tasks require an agent to navigate ...
In Vision-and-Language Navigation (VLN), an agent needs to navigate thro...
Most existing Vision-and-Language (V&L) models rely on pre-trained vis...
Since visual perception can give rich information beyond text descriptio...
Video understanding relies on perceiving the global content and modeling...
Vision language navigation is the task that requires an agent to navigat...
Existing methods for vision-and-language learning typically require desi...
For embodied agents, navigation is an important ability but not an isola...
Humans learn language by listening, speaking, writing, reading, and also...
Phrase localization is a task that studies the mapping from textual phra...
Despite the remarkable successes of Convolutional Neural Networks (CNNs)...
Vision-and-Language Navigation (VLN) requires an agent to follow natural...
We find that the performance of state-of-the-art models on Natural Langu...
The Visual Dialog task requires a model to exploit both image and conver...
Vision-and-language reasoning requires an understanding of visual concep...
Describing images with text is a fundamental problem in vision-language ...
Enabling robots to understand instructions provided via spoken natural l...
A grand goal in AI is to build a robot that can accurately navigate base...
All-to-all data transmission is a typical data transmission pattern in b...
Visual reasoning with compositional natural language instructions, e.g.,...
Models that can execute natural language instructions for situated robot...
Referring expressions are natural language constructions used to identif...