Existing approaches to unsupervised video instance segmentation typicall...
We argue that there are many notions of 'similarity' and that models, li...
We present ImageBind, an approach to learn a joint embedding across six ...
The recent breakthroughs in natural language processing for model pretra...
We tackle the challenging task of unsupervised object localization in th...
This paper revisits the standard pretrain-then-finetune paradigm used in...
Semi-supervised learning aims to train a model using limited labels. Sta...
We propose Cut-and-LEaRn (CutLER), a simple approach for training unsupe...
We explore the extent to which zero-shot vision-language models exhibit ...
Self-supervised methods in vision have been mostly focused on large arch...
This paper demonstrates an approach for learning highly semantic image r...
A successful paradigm in representation learning is to perform self-supe...
Transformer-based architectures have become competitive across a variety...
We propose Masked Siamese Networks (MSN), a self-supervised learning fra...
Discriminative self-supervised learning allows training models on any ra...
Data-Augmentation (DA) is known to improve performance across tasks and ...
Prior work has studied different visual modalities in isolation and deve...
Current object detectors are limited in vocabulary size due to the small...
We find Mask2Former also achieves state-of-the-art performance on video ...
Image segmentation is about grouping pixels with different semantics, e....
We propose 3DETR, an end-to-end Transformer-based object detection model...
We introduce WyPR, a Weakly-supervised framework for Point cloud Recogni...
In this paper, we question if self-supervised learning provides new prop...
This paper proposes a novel method of learning by predicting view assign...
Multi-modal reasoning systems rely on a pre-trained object detector to e...
We present a self-supervised learning method to learn audio and video re...
The quality of the image representations obtained from self-supervised l...
Recently, self-supervised learning methods like MoCo, SimCLR, BYOL and S...
Pretraining on large labeled datasets is a prerequisite to achieve good ...
Leveraging temporal information has been regarded as essential for devel...
Unsupervised image representations have significantly reduced the gap wi...
We present a self-supervised learning approach to learn audio-visual rep...
Popularized as 'bottom-up' attention, bounding box (or region) based vis...
Pre-training convolutional neural networks with weakly-supervised and se...
The goal of self-supervised learning from images is to construct image r...
We propose an approach to predict the 3D shape and pose for the objects ...
The paper analyzes the accuracy of publicly available object-recognition...
Self-supervised learning aims to learn representations from the data its...
Providing systems the ability to relate linguistic and visual content is...
We introduce an interactive learning framework for the development and t...
A major impediment in rapidly deploying object detection models for inst...
We introduce the first dataset for sequential vision-to-language, and ex...
Multi-task learning in Convolutional Networks has displayed remarkable s...
In this paper, we present an approach for learning a visual representati...
There has been an explosion of work in the vision & language community d...
When human annotators are given a choice about what to label in an image...