This paper reveals that every image can be understood as a first-order n...
Existing deep video models are limited by specific tasks, fixed input-ou...
Object tracking (OT) aims to estimate the positions of target objects in...
Recently, both Contrastive Learning (CL) and Mask Image Modeling (MIM) d...
We present X-Decoder, a generalized decoding model that can predict pixe...
Exploring dense matching between the current frame and past frames for l...
This paper presents a new perspective of self-supervised learning based ...
The complexity-precision trade-off of an object detector is a critical p...
We present GLIPv2, a grounded VL understanding model that serves both l...
Leveraging large-scale data can introduce performance gains on many comp...
People say, "A picture is worth a thousand words". Then how can we get t...
Transformers have achieved great success in pluralistic image inpainting...
Cross-modal encoders for vision-language (VL) tasks are often pretrained...
Mixture of Experts (MoE) is able to scale up vision transformers effecti...
Contrastive language-image pretraining (CLIP) links vision and language ...
Contrastive language-image pretraining (CLIP) using image-text pairs has...
Automated visual understanding of our diverse and open world demands com...
In this paper, we propose a single UniFied transfOrmer (UFO), which is c...
We present Mobile-Former, a parallel design of MobileNet and Transformer...
This paper aims at addressing the problem of substantial performance deg...
Recently, Vision Transformer and its variants have shown great promise o...
This paper investigates two techniques for developing efficient self-sup...
The complex nature of combining localization and classification in objec...
We present in this paper a new architecture, named Convolutional vision ...
This paper presents a new Vision Transformer (ViT) architecture Multi-Sc...
Recent research in dynamic convolution shows substantial performance boo...
Neural Architecture Search (NAS) finds the best network architecture by ...
In this paper, we present MicroNet, which is an efficient convolutional ...
Efficient search is a core issue in Neural Architecture Search (NAS). It...
Rectified linear units (ReLU) are commonly used in deep neural networks....
Light-weight convolutional neural networks (CNNs) suffer performance deg...
We present Temporal Aggregation Network (TAN) which decomposes 3D convol...
This research strives for natural language moment retrieval in long, unt...
Recognizing instances at different scales simultaneously is a fundamenta...
In this paper, we present a novel Single Shot multi-Span Detector for te...
Motion boundary detection is a crucial yet challenging problem. Prior me...
We present a Temporal Context Network (TCN) for precise temporal localiz...
Sparse representations have been successfully applied to signal processi...