To bridge the physical and virtual worlds for rapidly developed VR/AR
ap...
Due to the limited scale and quality of video-text training corpus, most...
Understanding whether self-supervised learning methods can scale with
un...
Large-scale image-text contrastive pre-training models, such as CLIP, ha...
State-of-the-art video-text retrieval (VTR) methods usually fully fine-t...
Video-Language Pre-training models have recently significantly improved
...
Contrastive Masked Autoencoder (CMAE), as a new self-supervised framewor...
This paper is about an extraordinary phenomenon. Suppose we don't use an...
Mainstream image caption models are usually two-stage captioners, i.e.,
...
Masked image modeling (MIM) has achieved promising results on various vi...
Video transition effects are widely used in video editing to connect sho...
The goal of multi-task learning is to enable more efficient learning tha...
Deep learning models in large-scale machine learning systems are often
c...
This paper provides a strong baseline for vision transformers on the Ima...
Vision transformers (ViTs) have been successfully applied in image
class...
Current neural architecture search (NAS) algorithms still require expert...
In this work, we introduce a novel task - Humancentric Spatio-Temporal V...
Non-Local (NL) blocks have been widely studied in various vision tasks.
...
Recent advances show that Neural Architectural Search (NAS) method is ab...
Designing of search space is a critical problem for neural architecture
...
The recent WSNet [1] is a new model compression method through sampling
...
We present a new approach and a novel architecture, termed WSNet, for
le...
The ability of predicting the future is important for intelligent system...
In this work, we present a simple, highly efficient and modularized Dual...
Most existing weakly supervised localization (WSL) approaches learn dete...
Existing object proposal algorithms usually search for possible object
r...
Learning rich and diverse representations is critical for the performanc...
In this work, we address the challenging video scene parsing problem by
...
In this paper, we consider the scene parsing problem and propose a novel...
Rectified linear activation units are important components for
state-of-...