Modern hierarchical vision transformers have added several vision-specif...
There has been a longstanding belief that generation can facilitate a tr...
In this work we study the benefits of using tracking and 3D poses for ac...
This paper revisits the standard pretrain-then-finetune paradigm used in...
A central goal of visual recognition is to understand objects and scenes...
Large vision-language models are generally applicable to many downstream...
We present Masked Audio-Video Learners (MAViL) to train audio-visual
rep...
We present Fast Language-Image Pre-training (FLIP), a simple and more
ef...
We introduce Token Merging (ToMe), a simple method to increase the throu...
This paper studies a simple extension of image-based Masked Autoencoders...
This paper studies a conceptually simple extension of Masked Autoencoder...
While today's video recognition systems parse snapshots or short clips
a...
The "Roaring 20s" of visual recognition began with the introduction of V...
We present Masked Feature Prediction (MaskFeat) for self-supervised
pre-...
In this paper, we study Multiscale Vision Transformers (MViT) as a unifi...
We introduce PyTorchVideo, an open-source deep-learning library that pro...
We present VideoCLIP, a contrastive approach to pre-train a unified mode...
In video transformers, the time dimension is often treated in the same w...
We present a simplified, task-agnostic multi-modal pre-training approach...
We present a large-scale study on unsupervised spatiotemporal representa...
We present Multiscale Vision Transformers (MViT) for video and image
rec...
We present a multiview pseudo-labeling approach to video learning, a nov...
We present TrackFormer, an end-to-end multi-object tracking and segmenta...
This paper presents X3D, a family of efficient video networks that
progr...
Feature pyramid networks have been widely adopted in the object detectio...
We present Audiovisual SlowFast Networks, an architecture for integrated...
First-person video naturally brings the use of a physical environment to...
Training competitive deep video models is an order of magnitude slower t...
Modern approaches for multi-person pose estimation in video require larg...
Learning how to interact with objects is an important step towards embod...
Previous work on predicting or generating 3D human pose sequences regres...
To understand the world, we humans constantly need to relate the present...
Learning how to interact with objects is an important step towards embod...
Despite huge success in the image domain, modern detection models such a...
We present SlowFast networks for video recognition. Our model involves (...
In this work, we demonstrate that 3D poses in video can be effectively
e...
This paper documents the winning entry at the CVPR2017 vehicle velocity
...
As the success of deep models has led to their deployment in all areas o...
Recent approaches for high accuracy detection and tracking of object
cat...
Two-stream Convolutional Networks (ConvNets) have shown strong performan...
Recent applications of Convolutional Neural Networks (ConvNets) for huma...