The goal of speech enhancement (SE) is to eliminate the background inter...
Supervised visual captioning models typically require a large scale of i...
Most existing audio-text retrieval (ATR) methods focus on constructing c...
The recent video grounding works attempt to introduce vanilla contrastiv...
Generating an informative and attractive title for the product is a cruc...
Audio codec models are widely used in audio communication as a crucial t...
The advancement of audio-language (AL) multimodal learning tasks has bee...
Dialogue-based Relation Extraction (DRE) aims to predict the relation ty...
Automatic radiology report generation has attracted enormous research in...
This paper presents a significant contribution to the field of repetitiv...
Natural Language Generation (NLG) accepts input data in the form of imag...
In text-audio retrieval (TAR) tasks, due to the heterogeneity of content...
Existing weakly supervised sound event detection (WSSED) work has not ex...
Learning representations for graph-structured data is essential for grap...
With the rise of short videos, the demand for selecting appropriate back...
Video grounding aims to locate a moment of interest matching the given q...
Recently, frequency domain all-neural beamforming methods have achieved ...
How to solve the data scarcity problem for end-to-end speech-to-text tra...
Training a supervised video captioning model requires coupled video-captio...
Multi-intent detection and slot filling joint models are gaining increas...
Expressive text-to-speech (TTS) can synthesize a new speaking style by i...
Vision-and-language (V-L) tasks require the system to understand both vi...
Recently, attention based models have been used extensively in many sequ...
Video Referring Expression Comprehension (REC) aims to localize a target...
We investigate the problem of video Referring Expression Comprehension (...
Video-Text Pre-training (VTP) aims to learn transferable representations...
Generating sound effects that humans want is an important topic. However...
Despite the rapid progress in automatic speech recognition (ASR) researc...
Hand-crafted spatial features, such as inter-channel intensity differenc...
Most research adopts supervised training for speaker extraction, wh...
Target sound detection (TSD) aims to detect the target sound from a mixt...
Target sound detection (TSD) aims to detect the target sound from mixtur...
Recently, end-to-end speaker extraction has attracted increasing attenti...
Target sound extraction (TSE) aims to extract the sound part of a target...
Keyword spotting (KWS) and speaker verification (SV) are two important t...
Graph convolutional networks (GCNs) are widely adopted in skeleton-based...
In automatic speech recognition (ASR) research, discriminative criteria ...
Unsupervised video representation learning has made remarkable achieveme...
Despite the rapid progress of end-to-end (E2E) automatic speech recognit...
Human beings can perceive a target sound that we are interested in from ...
Recently, End-to-End (E2E) frameworks have achieved remarkable results o...
For video captioning, "pre-training and fine-tuning" has become a de fac...
Automated audio captioning (AAC) has developed rapidly in recent years, ...
Although prototypical network (ProtoNet) has proved to be an effective m...
Intent detection (ID) and Slot filling (SF) are two major tasks in spoke...
Video grounding aims to localize the temporal segment corresponding to a...
Spoken Language Understanding (SLU), including intent detection and slot...
In recent years, single image dehazing models (SIDM) based on atmospheri...
Intent detection and slot filling are two main tasks in natural language...
Weakly-Supervised Temporal Action Localization (WSTAL) aims to localize ...