Robots should exist anywhere humans do: indoors, outdoors, and even unma...
A holistic understanding of object properties across diverse sensory mod...
Understanding multimodal perception for embodied AI is an open question ...
HomeRobot (noun): An affordable compliant robot that navigates homes and...
Open-world survival games pose significant challenges for AI algorithms ...
Pre-trained large language models (LLMs) capture procedural knowledge ab...
Increased focus on the deployment of machine learning systems has led to...
A household robot should be able to navigate to target locations without...
Every home is different, and every person likes things done in their par...
Integrating vision and language has gained notable attention following t...
We show that Vision-Language Transformers can be learned without human l...
What audio embedding approach generalizes best to a wide range of downst...
The computing research community needs to work much harder to address th...
The primary focus of recent work with large-scale transformers has been o...
Reliable AI agents should be mindful of the limits of their knowledge an...
Recent methods for embodied instruction following are typically trained ...
Web search is fundamentally multimodal and multihop. Often, even before ...
Contrastive learning has been widely used to train transformer-based vis...
Seemingly simple natural language requests to a robot are generally unde...
The NLP community has seen substantial recent interest in grounding to f...
Numerous works have analyzed biases in vision and pre-trained language m...
Guessing games are a prototypical instance of the "learning by interacti...
Recent developments in pre-trained neural language modeling have led to ...
In visual guessing games, a Guesser has to identify a target object in a...
Given a simple request (e.g., Put a washed apple in the kitchen fridge),...
In this paper, we demonstrate that context-free grammar (CFG) based metho...
Fluent communication requires understanding your audience. In the new co...
Procedural knowledge, which we define as concrete information about the ...
Successful linguistic communication relies on a shared experience of the...
Learning to navigate in a visual environment following natural language ...
We present ALFRED (Action Learning From Realistic Environments and Direc...
To apply eyeshadow without a brush, should I use a cotton swab or a toot...
Core to the vision-and-language navigation (VLN) challenge is building r...
Recent progress in natural language generation has raised dual-use conce...
Recent work by Zellers et al. (2018) introduced a new task of commonsens...
We use static object data to improve success detection for stacking obje...
High-level human instructions often correspond to behaviors with multipl...
We present FAST NAVIGATOR, a general framework for action decoding, whic...
Intuitively, human readers cope easily with errors in text; typos, missp...
Visual understanding goes well beyond object recognition. With one glanc...
Increasingly, perceptual systems are being codified as strict pipelines ...
Language-and-vision navigation and question answering (QA) are exciting ...
Given a partial description like "she opened the hood of the car," human...
Machine translation systems require semantic knowledge and grammatical u...
Robotic agents that share autonomy with a human should leverage human do...
We present CHALET, a 3D house simulator with support for navigation and ...
In this paper, we study the problem of mapping natural language instruct...
Character-based neural machine translation (NMT) models alleviate out-of...
We define a novel textual entailment task that requires inference over m...
We compare the effectiveness of four different syntactic CCG parsers for...