Avoiding synthesizing specific visual concepts is an essential challenge...
This paper investigates the impact of big data on deep learning models f...
In this paper, we study the denoising diffusion probabilistic model (DDP...
Generative AI has made significant strides in computer vision, particula...
We introduce a novel framework called RefineVIS for Video Instance
Segme...
This paper reveals that every image can be understood as a first-order
n...
The application of machine learning models can be significantly impeded ...
We present a unified framework for camera-space 3D hand pose estimation ...
The most recent efforts in video matting have focused on eliminating tri...
In this paper, we show that a binary latent space can be explored for co...
This study explores the concept of equivariance in vision-language found...
We propose MM-REACT, a system paradigm that integrates ChatGPT with a po...
Recently, both Contrastive Learning (CL) and Mask Image Modeling (MIM)
d...
3D photography renders a static image into a video with appealing 3D vis...
Motivated by the fact that forward and backward passes of a deep network...
This paper presents a Generative RegIon-to-Text transformer, GRiT, for o...
We present Mesh Pre-Training (MPT), a new pre-training framework that
le...
This paper presents a new perspective of self-supervised learning based ...
The image captioning task is typically realized by an auto-regressive me...
This paper surveys vision-language pre-training (VLP) methods for multim...
Masked visual modeling (MVM) has been recently proven effective for visu...
In this paper, we present NUWA-Infinity, a generative model for infinite...
The complexity-precision trade-off of an object detector is a critical
p...
Vision-language (VL) pre-training has recently received considerable
att...
Unified vision-language frameworks have greatly advanced in recent years...
Video instance segmentation aims at predicting object segmentation masks...
In this paper, we design and train a Generative Image-to-text Transforme...
We present a cross-modal Transformer-based framework, which jointly enco...
Inversion techniques are widely used to reconstruct subsurface physical
...
Learning visual representations from natural language supervision has
re...
Improving the generalization capability of Deep Neural Networks (DNNs) i...
Human-Object Interaction (HOI) recognition is challenging due to two fac...
Multi-physical inversion plays a critical role in geophysics. It has bee...
Visual Question Answering (VQA) attracts much attention from both indust...
Unsupervised domain adaptive person re-identification (ReID) has been
ex...
We propose DEFR, a DEtection-FRee method to recognize Human-Object
Inter...
This paper studies using Vision Transformers (ViT) in class incremental
...
Tremendous progress has been made in recent years in developing better i...
We initiate the first empirical study on the use of MLP architectures fo...
Multi-camera tracking systems are gaining popularity in applications tha...
The canonical approach to video captioning dictates a caption generation...
A great challenge in video-language (VidL) modeling lies in the disconne...
In recent years, we have witnessed significant performance boost in the ...
In this paper, we propose UNICORN, a vision-language (VL) model that uni...
Automated visual understanding of our diverse and open world demands com...
In this paper, we propose a single UniFied transfOrmer (UFO), which is
c...
Solving electromagnetic inverse scattering problems (ISPs) is challengin...
Vision-and-language (VL) pre-training has proven to be highly effective ...
This paper investigates unsupervised learning of Full-Waveform Inversion...
Knowledge-based visual question answering (VQA) involves answering quest...