Zero-shot video recognition (ZSVR) is a task that aims to recognize vide...
In this paper, we address a complex but practical scenario in semi-super...
In order to deal with the task of video panoptic segmentation in the wil...
The vision transformer has emerged as a potential architecture for vision tasks....
Most of the existing audio-driven 3D facial animation methods suffered f...
We propose a novel 3D colored shape reconstruction method from a single ...
This is a brief technical report of our proposed method for Multiple-Obj...
It is widely believed that the higher uncertainty in a word of the caption...
This paper tackles an emerging and challenging vision-language task, 3D ...
Recently, vector quantized autoregressive (VQ-AR) models have shown rema...
Ensembles of machine learning models yield improved performance as well ...
Most of the existing semantic segmentation approaches with image-level c...
Generating precise class-aware pseudo ground-truths, a.k.a. class activa...
Fusing LiDAR and camera information is essential for achieving accurate ...
For years, the YOLO series has been the de facto industry-level standard...
Panoptic Narrative Grounding (PNG) is an emerging task whose goal is to ...
Existing approaches to image captioning usually generate the sentence wo...
In this technical report, we introduce our submission to the Waymo 3D De...
We present a simple yet effective fully convolutional one-stage 3D objec...
Face clustering plays an essential role in exploiting massive unlabeled ...
Feature representation via self-supervised learning has reached remarkab...
Most modern face completion approaches adopt an autoencoder or its varia...
Referring Image Segmentation (RIS) aims at segmenting the target object ...
Open-set semi-supervised learning (open-set SSL) investigates a challeng...
Recently, lane detection has made great progress with the rapid developm...
Very recently, a variety of vision transformer architectures for dense p...
BiSeNet has proven to be a popular two-stream network for real-time...
Food recognition plays an important role in food choice and intake, whic...
Almost all visual transformers such as ViT or DeiT rely on predefined po...
Recent unsupervised contrastive representation learning follows a Single...
Single-path based differentiable neural architecture search has great st...
Most deep learning based image inpainting approaches adopt autoencoder o...
Despite the fast development of differentiable architecture search (DART...
Video summarization aims to select representative frames to retain high-...
Food recognition has received increasing attention in the multimedia ...
While scene text recognition techniques have been widely used in commerc...
In recent years, scene text recognition has always been regarded as a sequence...