While large-scale image-text pretrained models such as CLIP have been us...
We aim to investigate whether end-to-end learning of visual reasoning ca...
We propose a new task and model for dense video object captioning –
dete...
Current state-of-the-art video models process a video clip as a long seq...
In this paper, we address the challenges posed by the substantial traini...
The most performant spatio-temporal action localisation models use exter...
Vision-Language models have shown strong performance in the image-domain...
Existing works on open-vocabulary semantic segmentation have utilized
la...
Can we leverage the audiovisual information already present in video to
...
We propose Token Turing Machines (TTM), a sequential, autoregressive
Tra...
Modelling long-range dependencies is critical for scene understanding ta...
Transfer learning is the predominant paradigm for training deep networks...
This report describes the approach behind our winning solution to the 20...
Recent advances in deep learning have relied on large, labelled datasets...
Recent video and language pretraining frameworks lack the ability to gen...
Can we train a single transformer model capable of processing multiple
m...
Model efficiency is a critical aspect of developing and deploying machin...
Scenic is an open-source JAX library with a focus on Transformer-based m...
Learning effective visual representations that generalize well without h...
Humans perceive the world by concurrently processing and fusing
high-dim...
In this paper, we introduce a novel visual representation learning which...
We present pure-transformer based models for video classification, drawi...
Accurate video understanding involves reasoning about the relationships
...
Despite the recent advances in video classification, progress in
spatio-...
Exploiting long-range contextual information is key for pixel-wise predi...
Modelling long-range dependencies is critical for complex scene understa...
We present a bundle-adjustment-based algorithm for recovering accurate 3...
We present a weakly supervised model that jointly performs both semantic...
Deep Neural Networks (DNNs) have been demonstrated to perform exceptiona...
Object parsing -- the task of decomposing an object into its semantic pa...
Semantic segmentation and object detection research have recently achiev...
Are we using the right potential functions in the Conditional Random Fie...
Traditional Scene Understanding problems such as Object Detection and
Se...
It is not always possible to recognise objects and infer material proper...
We present an open-source, real-time implementation of SemanticPaint, a
...