Logistic regression is an algorithm widely used for binary classificatio...
Recent advancements in Automatic Speech Recognition (ASR) systems,
exemp...
Personalized text-to-image generation has emerged as a powerful and
soug...
Pre-trained models have achieved success in Chinese Short Text Matching ...
Noise is a pervasive element within real-world measurement data,
signifi...
We consider the role of non-localities in speed-density data used to fit...
Automatic detection of facial Action Units (AUs) allows for objective fa...
Referring image segmentation aims to segment an object mentioned in natu...
Omnidirectional videos (ODVs) play an increasingly important role in the...
In this paper, in order to get a better understanding of the human visua...
Large pretrained plain vision Transformers (ViTs) have been the workhors...
Due to the limited scale and quality of video-text training corpus, most...
Vision and text have been fully explored in contemporary video-text
foun...
With more and more deep neural networks being deployed as various daily
...
Building general-purpose models that can perceive diverse real-world
mod...
Referring image segmentation aims to segment an object referred to by na...
Large-scale image-text contrastive pre-training models, such as CLIP, ha...
Large pre-trained multimodal models have demonstrated significant succes...
Diffusion models have recently dominated image synthesis and other relat...
Attention-based contextual biasing approaches have shown significant
imp...
Industrial image anomaly detection under the setting of one-class
classi...
Learning to optimize (L2O) has emerged as a powerful framework for black...
Evolutionary algorithms (EAs) have emerged as a powerful framework for
e...
In this paper, we propose a Vision-Audio-Language Omni-peRception pretra...
We present a novel and effective method calibrating cross-modal features...
We present dual-attention neural biasing, an architecture designed to bo...
End-to-End (E2E) automatic speech recognition (ASR) systems used in voic...
As a combination of visual and audio signals, video is inherently
multi-...
We present OmniAvatar, a novel geometry-guided 3D head synthesis model
t...
While substantial progresses have been made in automated 2D portrait
sty...
This paper proposes a novel, abstraction-based, certified training metho...
Motion, scene and object are three primary visual components of a video....
Recent studies reveal that various biases exist in different NLP tasks, ...
Graph, such as citation networks, social networks, and transportation
ne...
With the rapid development of intelligent transportation system applicat...
Avatar creation from human images allows users to customize their digita...
Recent advances in Transformers have come with a huge requirement on
com...
Robotic shepherding is a bio-inspired approach to autonomously guiding a...
Discovering the governing equations of evolving systems from available
o...
The order-preserving pattern mining can be regarded as discovering frequ...
Clustering has been extensively studied in centralized settings, but
rel...
Network structure evolves with time in the real world, and the discovery...
Stylized 3D avatars have become increasingly prominent in our modern lif...
As an excellent tool for aiding communication, intelligent reflecting su...
Federated clustering is an adaptation of centralized clustering in the
f...
Multimodal representation learning has shown promising improvements on
v...
Transformer is a transformative framework that models sequential data an...
Vision Transformers (ViTs) have underpinned the recent breakthroughs in
...
The robustness of deep neural networks is crucial to modern AI-enabled
s...
While pre-trained language models (LMs) have brought great improvements ...