The attention-based encoder-decoder (AED) speech recognition model has been
...
The ability to learn from context with novel concepts, and deliver
appro...
Current deep networks are very data-hungry and benefit from training on
...
This paper proposes a novel and physically interpretable method for face...
With recent advancements in natural language processing, Large Language
...
This paper explores the application of automated machine learning (AutoM...
Person clustering with multi-modal clues, including faces, bodies, and
v...
Detecting objects based on language descriptions is a popular task that
...
Efficiently selecting an appropriate spike stream data length to extract...
In human conversations, individuals can indicate relevant regions within...
Personal Data Stores (PDS) like SoLiD are an emerging data and knowledge
...
We propose ADCLR: Accurate and Dense Contrastive Representation Learni...
Recent text-to-image generative models can generate high-fidelity images...
Human intelligence can retrieve any person according to both visual and
...
Modeling complex spatiotemporal dependencies in correlated traffic serie...
Correlated time series analysis plays an important role in many real-wor...
This paper proposes to learn Multi-task, Multi-modal Directed Acyclic Grap...
Semantic segmentation usually suffers from a long-tail data distribution...
Public large-scale text-to-image diffusion models, such as Stable Diffus...
Most existing learning-based pose estimation methods are typically devel...
Recent advancements in multimodal foundation models (e.g., CLIP) have
ex...
Referring Expression Segmentation (RES) is a widely explored multi-modal...
Asynchronous action coordination presents a pervasive challenge in
Multi...
Recent years have witnessed a rapid growth of deep generative models, wi...
Currently, most adverse weather removal tasks are handled independently,...
Few-shot object detection (FSOD) aims to expand an object detector for n...
Open-vocabulary detection (OVD) is an object detection task aiming at
de...
SpikeCV is a new open-source computer vision platform for the spike came...
Self-supervised pre-training and transformer-based networks have
signifi...
Human-centric perceptions include a variety of vision tasks, which have
...
Human-centric perceptions (e.g., pose estimation, human parsing, pedestr...
Recent popular Role-Playing Games (RPGs) saw the great success of charac...
Few-shot object detection (FSOD) aims to expand an object detector for n...
Inspired by masked language modeling (MLM) in natural language processin...
Self-supervised learning holds promise in leveraging large numbers of
un...
The generalization power of the pre-trained model is the key for few-sho...
Designing better deep networks and better reinforcement learning (RL)
al...
The neural transducer is now the most popular end-to-end model for speech
re...
Autoregressive language modeling (ALM) has been successfully used in
se...
Much of named entity recognition (NER) research focuses on developing
da...
Visual anomaly detection plays a crucial role in not only manufacturing
...
Traditional automatic speech recognition (ASR) systems usually focus on
...
Unlike indirect methods that usually require time-consuming post-process...
Conventional training of deep neural networks usually requires a substan...
Contrastive learning methods achieve state-of-the-art results in unsuper...
In this paper, we propose a theoretical framework to explain the efficac...
Visual tasks vary a lot in their output formats and concerned contents,
...
This paper develops an implicit family of sub-step integration algorithm...
Self-training has shown great potential in semi-supervised learning. Its...
Road network and trajectory representation learning are essential for tr...