We study the task of zero-shot vision-and-language navigation (ZS-VLN), ...
Large language models (LLMs) and vision-language models (VLMs) have been...
Vision-and-language navigation (VLN) requires an embodied agent to navig...
This paper presents a paradigm that adapts general large-scale pretraine...
Open World Object Detection (OWOD) is a novel computer vision task with ...
We address a practical yet challenging problem of training robot agents ...
Getting robots to navigate to multiple objects autonomously is essential...
We study self-supervised video representation learning that seeks to lea...
We address the challenging task of video question answering, which req...
In this paper, we introduce Foley Music, a system that can synthesize pl...
We focus on the task of generating sound from natural videos, and the so...
We address the problem of video grounding from natural language queries....
Humans are able to localize objects in the environment using both visual...