Building scalable vision-language models to learn from diverse, multimod...
Open-world instance segmentation is a rising task, which aims to segment...
Building general-purpose models that can perceive diverse real-world
mod...
Referring expression segmentation aims to segment an object described by...
Learning with large-scale unlabeled data has become a powerful tool for
...
This paper aims to establish a generic multi-modal foundation model that...
Learning image classification and image generation using the same set of...
In this work, we present Multi-Level Contrastive Learning for Dense
Pred...
All instance perception tasks aim at finding certain objects specified b...
We identify and overcome two key obstacles in extending the success of
B...
We propose a sparse end-to-end multi-person pose regression framework, t...
Existing object detection methods are bounded in a fixed-set vocabulary ...
Multimodal representation learning has shown promising improvements on
v...
Masked autoencoders (MAEs) have emerged recently as art self-supervised
...
In this paper we present a novel multi-attribute face manipulation metho...
Spatio-Temporal video grounding (STVG) focuses on retrieving the
spatio-...
In this paper, we empirically study how to make the most of low-resoluti...
Co-occurrent visual pattern makes context aggregation become an essentia...
Open-world instance segmentation (OWIS) aims to segment class-agnostic
i...
Feature pyramid network (FPN) is one of the key components for object
de...
We present a unified method, termed Unicorn, that can simultaneously sol...
Unsupervised domain adaptation (UDA) aims to enhance the generalization
...
Fine-Grained Visual Classification(FGVC) is the task that requires
recog...
Generally, humans are more skilled at perceiving differences between
hig...
Referring video object segmentation (R-VOS) is an emerging cross-modal t...
Utilizing trimap guidance and fusing multi-level features are two import...
A typical pipeline for multi-object tracking (MOT) is to use a detector ...
Vision-and-Language Navigation (VLN) is a task that an agent is required...
Multi-object tracking (MOT) aims at estimating bounding boxes and identi...
A more realistic object detection paradigm, Open-World Object Detection,...
Supervised learning is dominant in person search, but it requires elabor...
Video scene parsing is a long-standing challenging task in computer visi...
The training loss function that enforces certain training sample distrib...
Although single-image super-resolution (SISR) methods have achieved grea...
Multiple-object tracking(MOT) is mostly dominated by complex and multi-s...
End-to-end one-stage object detection trailed thus far. This paper disco...
Generative adversarial networks (GANs) have achieved remarkable progress...
We present Sparse R-CNN, a purely sparse method for object detection in
...
Orthogonality is widely used for training deep neural networks (DNNs) du...
We address the problem of spatio-temporal action detection in videos.
Ex...
Attention mechanisms have been widely used in Visual Question Answering ...
Leveraging both visual frames and audio has been experimentally proven
e...