Wheeled robot navigation has been widely used in urban environments, but...
Despite advances in Visual Question Answering (VQA), the ability of mode...
Vision-specific concepts such as "region" have played a key role in exte...
Driven by improved architectures and better representation learning
fram...
Modeling spatial relationship in the data remains critical across many
d...
Masked Autoencoding (MAE) has emerged as an effective approach for
pre-t...
Test-time training adapts to a new test distribution on the fly by optim...
Many recent self-supervised frameworks for visual representation learnin...
Dual encoders and cross encoders have been widely used for image-text
re...
In this work we present point-level region contrast, a self-supervised
p...
Object detection is a central downstream task used to test if pre-traine...
This paper shows that masked autoencoders (MAE) are scalable self-superv...
Non-contrastive methods of self-supervised learning (such as BYOL and
Si...
This paper does not describe a novel method. Instead, it studies a
strai...
Contrastive approaches to self-supervised learning (SSL) learn
represent...
Siamese networks have become a common structure in various recent models...
We propose a novel theoretical framework to understand self-supervised
l...
We introduce a learning-based approach for room navigation using semanti...
Machine learning models tend to over-rely on statistical shortcuts. Thes...
This paper targets at visual counting, where the setup is to estimate th...
Contrastive unsupervised learning has recently shown encouraging progres...
3D object detection has seen quick progress thanks to advances in deep
l...
Popularized as 'bottom-up' attention, bounding box (or region) based vis...
Studies have shown that a dominant class of questions asked by visually
...
Accurate multi-organ abdominal CT segmentation is essential to many clin...
Embodied Question Answering (EQA) is a relatively new task where an agen...
Passive visual systems typically fail to recognize objects in the amodal...
Sliding-window object detectors that generate bounding-box object predic...
Despite significant progress in Visual Question Answering over the years...
In the scenario where small cells are densely deployed, the millimeter w...
Image captioning models have achieved impressive results on datasets
con...
Video description is one of the most challenging problems in vision and
...
This document describes Pythia v0.1, the winning entry from Facebook AI
...
We present a novel framework for iterative visual reasoning. Our framewo...
To keep pace with the rapid growth of mobile traffic demands, dense
depl...
We explore design principles for general pixel-level prediction problems...
We adapted the join-training scheme of Faster RCNN framework from Caffe ...
We explore architectures for general pixel-level prediction problems, fr...
What does a typical visit to Paris look like? Do people first take photo...
We present an approach to utilize large amounts of web data for learning...
In this paper we describe the Microsoft COCO Caption dataset and evaluat...
In this paper we explore the bi-directional mapping between images and t...