In this paper, we investigate the adversarial robustness of vision trans...
This paper introduces a new large-scale image restoration dataset, calle...
StableDiffusion is a revolutionary text-to-image generator that is causi...
Text-to-Image diffusion models have made tremendous progress over the pa...
This paper reveals that every image can be understood as a first-order n...
This work studies how to transform an album into vivid and coherent storie...
The convergence of text, visual, and audio data is a key step towards hu...
Multi-modality image fusion is a technique used to combine information f...
Existing deep video models are limited by specific tasks, fixed input-ou...
Mammographic image analysis is a fundamental problem in the computer-aid...
Neural implicit fields are powerful for representing 3D scenes and gener...
Video understanding tasks have traditionally been modeled by two separat...
Object tracking (OT) aims to estimate the positions of target objects in...
We present Diversity-Aware Meta Visual Prompting (DAM-VP), an efficient ...
Recently, both Contrastive Learning (CL) and Mask Image Modeling (MIM) d...
Unsupervised domain adaptation has been widely adopted in tasks with scarc...
Graph matching can be formalized as a combinatorial optimization problem...
As a powerful representation of 3D scenes, the neural radiance field (Ne...
Exploring dense matching between the current frame and past frames for l...
Recent studies have shown that CLIP has achieved remarkable success in p...
Copy-Paste is a simple and effective data augmentation strategy for inst...
Point cloud segmentation is a fundamental task in 3D. Despite recent pro...
This paper focuses on analyzing and improving the commonsense ability of...
This paper presents a new perspective of self-supervised learning based ...
We present SinDiffusion, leveraging denoising diffusion models to captur...
Notwithstanding the prominent performance achieved in various applicatio...
This paper presents OmniVL, a new foundation model to support both image...
From early image processing to modern computational imaging, successful ...
Diffusion models (DMs) have shown great potential for high-quality image...
This paper presents a simple yet effective framework MaskCLIP, which inc...
We propose bootstrapped masked autoencoders (BootMAE), a new approach fo...
The complexity-precision trade-off of an object detector is a critical p...
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkabl...
Leveraging large-scale data can introduce performance gains on many comp...
This paper revisits visual representation in knowledge-based visual ques...
Transformers have achieved great success in pluralistic image inpainting...
Human intelligence is multimodal; we integrate visual, linguistic, and a...
Mixture of Experts (MoE) is able to scale up vision transformers effecti...
We present a learning-based framework, recurrent transformer network (RT...
This paper aims to address the problem of pre-training for person re-ide...
Solving a linear inverse problem requires knowledge about the underlying...
Adversary and invisibility are two fundamental but conflicting characters o...
In this work we propose Identity Consistency Transformer, a novel face f...
The fast evolution and spread of deepfake techniques in real-world s...
In many real-world settings, only incomplete measurement data are availa...
Occlusion between different objects is a typical challenge in Multi-Obje...
Visual Question Answering (VQA) has witnessed tremendous progress in rec...
Hair editing is an interesting and challenging problem in computer visio...
We present CLIP-NeRF, a multi-modal 3D object manipulation method for ne...
How to learn a universal facial representation that boosts all face anal...