We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an...
Observing the close relationship among panoptic, semantic and instance s...
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a...
We present RECLIP (Resource-efficient CLIP), a simple method that minimi...
The development of language models has moved from encoder-decoder to de...
We present a simple approach which can turn a ViT encoder into an effici...
We present F-VLM, a simple open-vocabulary object detection method built...
Effective scaling and a flexible task interface enable large language mo...
We present a pre-training approach for vision and language transformer m...
Video question answering is a challenging task that requires understandi...
We present Answer-Me, a task-aware multi-task framework which unifies a ...
We propose FindIt, a simple and versatile framework that unifies a varie...
3D perception of object shapes from RGB image input is fundamental towar...
Object proposals have become an integral preprocessing step of many vis...
Computed tomography (CT) is the imaging modality used in the diagnosis o...
Zero-shot image classification has made promising progress by training t...
Object recognition has seen significant progress in the image domain, wi...
Instance segmentation aims to detect and segment individual objects in a...
Deep learning for clinical applications is subject to stringent performa...
This paper studies the problem of detecting acute intracranial hemorrhag...
Existing object proposal approaches use primarily bottom-up cues to rank...