Vision Transformer has demonstrated impressive success across various vi...
Non-photorealistic videos are in demand with the wave of the metaverse, ...
Deep supervision, which involves extra supervisions to the intermediate ...
Masked image modeling (MIM) performs strongly in pre-training large visi...
Data lies at the core of modern deep learning. The impressive performanc...
One key challenge of exemplar-guided image generation lies in establishi...
Data mixing (e.g., Mixup, Cutmix, ResizeMix) is an essential component f...
Multimodal knowledge distillation (KD) extends traditional knowledge dis...
Crowd images are arguably among the most laborious data to annotate. In t...
Inspired by the success of self-supervised autoregressive representation...
Recent Vision Transformer (ViT) models have demonstrated encouraging res...
Existing domain adaptation methods for crowd counting view each crowd im...
Labeling is onerous for crowd counting, as it requires annotating each indivi...
The fully convolutional network (FCN) has dominated salient object detec...
Transformers have recently been adapted from the community of natural language...
The popularity of multimodal sensors and the accessibility of the Intern...
In this paper, we propose a simple yet effective approach, named Triple ...