Grounding textual expressions on scene objects from first-person views i...
We present Cross3DVG, a novel task for cross-dataset visual grounding in...
We present a new multimodal dataset called Visual Recipe Flow, which ena...
We propose a new 3D spatial understanding task of 3D Question Answering...
Vision-and-language navigation (VLN) is a task in which an agent is embo...
In Semantic Dependency Parsing (SDP), semantic relations form directed a...
Japanese predicate-argument structure (PAS) analysis involves zero anaph...