Visual language navigation (VLN) is an embodied task demanding a wide ra...
Zero-shot object navigation is a challenging task for home-assistance ro...
Visually-grounded dialog systems, which integrate multiple modes of comm...
Referring Expression Generation (REG) aims to generate unambiguous Refer...
Existing multimodal task-oriented dialog data fails to demonstrate the d...
Existing multimodal conversation agents have shown impressive abilities ...