Zero-shot object navigation is a challenging task for home-assistance ro...
3D vision-language grounding (3D-VL) is an emerging field that aims to
c...
Recent advancements in Large Vision-Language Models (LVLMs) have demonst...
Large Vision-Language Models (LVLMs) have recently played a dominant rol...
Foundation models have made significant strides in various applications,...
Although the Domain Generalization (DG) problem has been fast-growing in the...
Understanding the continuous states of objects is essential for task lea...
Visual recognition in low-data regimes requires deep neural networks to ...
We introduce SceneDiffuser, a conditional generative model for 3D scene
...
State-of-the-art 3D semantic segmentation models are trained on the
off-...
Current computer vision models, unlike the human visual system, cannot y...
Learning to generate diverse scene-aware and goal-oriented human motions...
The ability to decompose complex natural scenes into meaningful
object-c...
We propose a new task to benchmark scene understanding of embodied agent...
Understanding human tasks through video observations is an essential
cap...
Thermal infrared imaging is widely used in body temperature measurement,...
Nowadays, cameras equipped with AI systems can capture and analyze image...
We study the understanding of embodied reference: One agent uses both
la...
To date, various 3D scene understanding tasks still lack practical and
g...
Geometry problem solving has attracted much attention in the NLP communi...
How to effectively represent camera pose is an essential problem in 3D
c...
Cognitive grammar suggests that the acquisition of language grammar is
g...
Solving algebra story problems remains a challenging task in artificial
...
Understanding and interpreting human actions is a long-standing challeng...
Humans can progressively learn visual concepts from easy to hard questio...
The goal of neural-symbolic computation is to integrate the connectionis...
The movement of large quantities of data during the training of a Deep N...
Recent progress in deep learning is essentially based on a "big data for...
Detecting 3D objects from a single RGB image is intrinsically ambiguous,...
This paper addresses a new problem of understanding human gaze communica...
We propose a new 3D holistic++ scene understanding problem, which jointl...
Neuromorphic networks based on nanodevices, such as metal oxide memristo...
Holistic 3D indoor scene understanding refers to jointly recovering the ...
We present a human-centric method to sample and synthesize 3D room layou...
We propose a computational framework to jointly parse a single RGB image...
This paper presents a novel method to predict future human activities fr...
We propose the configurable rendering of massive quantities of photoreal...