We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an...
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a...
We present RECLIP (Resource-efficient CLIP), a simple method that minimi...
We present a method that enables synthesizing novel views and novel pose...
Video Panoptic Segmentation (VPS) aims to achieve comprehensive pixel-le...
The development of language models has moved from encoder-decoder to de...
We propose Clustering Mask Transformer (CMT-DeepLab), a transformer-base...
We present TubeFormer-DeepLab, the first attempt to tackle multiple core...
In this paper, we aim at synthesizing a free-viewpoint video of an arbit...
Object proposals have become an integral preprocessing step of many vis...
DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a...
Temporal correspondence - linking pixels or objects across frames - is a...
Pursuing a more coherent scene understanding towards real-time vision ap...
Panoptic segmentation has become a new standard of visual recognition ta...
Visual storytelling is a task of creating a short story based on photo s...
In this paper, we investigate the problem of unpaired video-to-video tra...
We propose a novel feed-forward network for video inpainting. We use a s...
Blind video decaptioning is a problem of automatically removing text ove...
Video inpainting aims to fill spatio-temporal holes with plausible conte...
Self-supervised tasks such as colorization, inpainting and jigsaw puzzle...
In this paper, we address the problem of unsupervised video summarizatio...
Objects and their relationships are critical contents for image understa...
In this paper, we explore methods of complicating self-supervised tasks ...
Weakly supervised semantic segmentation and localization have a proble...