We present Emu, a Transformer-based multimodal foundation model, which c...
Dense retrieval is widely used for entity linking to retrieve entities f...
We present SegGPT, a generalist model for segmenting everything in conte...
Large-scale text-to-image diffusion models achieve unprecedented success...
Contrastive language-image pre-training, CLIP for short, has gained incr...
We launch EVA-02, a next-generation Transformer-based visual representat...
In-context learning, as a new paradigm in NLP, allows the model to rapid...
We launch EVA, a vision-centric foundation model to explore the limits o...
Point annotations are considerably more time-efficient than bounding box...
Instance segmentation is a fundamental vision task that aims to recogniz...
We propose a direct, regression-based approach to 2D human pose estimati...
Compared to many other dense prediction tasks, e.g., semantic segmentati...
We propose a fully convolutional multi-person pose estimation framework ...
We propose a human pose estimation framework that solves the task in the...
Person search aims to localize and identify a specific person from a gal...
We present a high-performance method that can achieve mask-level instanc...
Video instance segmentation (VIS) is the task that requires simultaneous...
To date, most existing self-supervised learning methods are designed and...
In this work, we aim at building a simple, direct, and fast instance seg...
We present a method for depth estimation with monocular images, which ca...
We present a new, embarrassingly simple approach to instance segmentatio...
Monocular depth estimation enables 3D perception from a single 2D image,...
A 3D point cloud describes the real scene precisely and intuitively. To d...
Detecting individual pedestrians in a crowd remains a challenging proble...
Generative Adversarial Networks (GANs) have attracted much research atten...