Large Language Models (LLMs) and Vision-Language Models (VLMs) have been...
General physical scene understanding requires more than simply localizin...
Large Language Models (LLMs) have achieved remarkable results. But exist...
Recent AI-assistant agents, such as ChatGPT, predominantly rely on super...
Humans, even at a very early age, can learn visual concepts and understa...
Humans possess a versatile mechanism for extracting structured represent...
Humans are able to accurately reason in 3D by gathering multi-view obser...
Existing large language model-based code generation pipelines typically ...
Optimization in multi-task learning (MTL) is more challenging than singl...
In this paper, we address the "dual problem" of multi-view scene reconst...
Traditional multi-view photometric stereo (MVPS) methods are often compo...
Objects' motions in nature are governed by complex interactions and thei...
Face recognition under ideal conditions is now considered a well-solved ...
In this work, we propose a unified framework, called Visual Reasoning wi...
This paper addresses the problem of face video inpainting. Existing vide...
We study the problem of dynamic visual reasoning on raw videos. This is ...
Weakly-supervised Temporal Action Localization (WTAL) aims to detect the...
Referring expression comprehension (REF) aims at identifying a particula...
In this paper, we study the problem of weakly-supervised temporal ground...
In this paper, we address a novel task, namely weakly-supervised spatio-...
Deep CNNs have achieved great success in text detection. Most of existin...