Vision-language navigation is the task of directing an embodied agent to...
3D visual grounding aims to locate the referred target object in 3D poin...
Very recently, a variety of vision transformer architectures for dense p...
Almost all visual transformers such as ViT or DeiT rely on predefined po...
Video instance segmentation (VIS) is the task that requires simultaneous...
Trajectory forecasting, or trajectory prediction, of multiple interactin...
The comprehension of the environmental traffic situation largely ensures the...
Pedestrian trajectory prediction is crucial for many important applicati...