Research in scene graph generation (SGG) usually considers two-stage mod...
Do video-text transformers learn to model temporal relationships across
...
The task of dynamic scene graph generation (SGG) from videos is complica...
Active speaker detection (ASD) in videos with multiple speakers is a
cha...
In this paper, we present TExt Spotting TRansformers (TESTR), a generic
...
Structured video representation in the form of dynamic scene graphs is a...
We address the problem of active speaker detection through a new framewo...
Performing single image holistic understanding and 3D reconstruction is ...
Significant effort has been recently devoted to modeling visual relation...
The mainstream image captioning models rely on Convolutional Neural Netw...
A structured query can capture the complexity of object interactions (e....
Scene graphs have become an important form of structured knowledge for t...
Structured representations such as scene graphs serve as an efficient an...
In this paper, we present a generative adversarial network framework tha...
Generating realistic images from scene graphs asks neural networks to be...
We present PartNet: a consistent, large-scale dataset of 3D objects anno...
Generative adversarial networks (GANs) transform low-dimensional latent
...
Deep convolutional Neural Networks (CNN) are the state-of-the-art perfor...
Human keypoints are a well-studied representation of people.We explore h...
Generative adversarial networks (GANs) transform latent vectors into vis...
We present a novel, automatic eye gaze tracking scheme inspired by smoot...
Object proposals for detecting moving or static video objects need to ad...
Sign language recognition is important for natural and convenient
commun...
We explore the efficiency of the CRF inference beyond image level semant...
We explore the efficiency of the CRF inference module beyond image level...