Yuexian Zou

research

∙ 09/03/2023

NADiffuSE: Noise-aware Diffusion-based Model for Speech Enhancement

The goal of speech enhancement (SE) is to eliminate the background inter...

0 Wen Wang, et al. ∙

research

∙ 08/25/2023

MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning

Supervised visual captioning models typically require a large scale of i...

0 Bang Yang, et al. ∙

research

∙ 07/28/2023

Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions

Most existing audio-text retrieval (ATR) methods focus on constructing c...

0 Yifei Xin, et al. ∙

research

∙ 07/26/2023

G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory

The recent video grounding works attempt to introduce vanilla contrastiv...

0 Hongxiang Li, et al. ∙

research

∙ 07/05/2023

Multimodal Prompt Learning for Product Title Generation with Extremely Limited Labels

Generating an informative and attractive title for the product is a cruc...

0 Bang Yang, et al. ∙

research

∙ 05/04/2023

HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec

Audio codec models are widely used in audio communication as a crucial t...

0 Dongchao Yang, et al. ∙

research

∙ 03/30/2023

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

The advancement of audio-language (AL) multimodal learning tasks has bee...

0 Xinhao Mei, et al. ∙

research

∙ 03/30/2023

TLAG: An Informative Trigger and Label-Aware Knowledge Guided Model for Dialogue-based Relation Extraction

Dialogue-based Relation Extraction (DRE) aims to predict the relation ty...

0 Hao An, et al. ∙

research

∙ 03/28/2023

Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation

Automatic radiology report generation has attracted enormous research in...

0 Yaowei Li, et al. ∙

research

∙ 03/15/2023

PoseRAC: Pose Saliency Transformer for Repetitive Action Counting

This paper presents a significant contribution to the field of repetitiv...

0 Ziyu Yao, et al. ∙

research

∙ 03/11/2023

ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation

Natural Language Generation (NLG) accepts input data in the form of imag...

0 Bang Yang, et al. ∙

research

∙ 03/10/2023

Improving Text-Audio Retrieval by Text-aware Attention Pooling and Prior Matrix Revised Loss

In text-audio retrieval (TAR) tasks, due to the heterogeneity of content...

0 Yifei Xin, et al. ∙

research

∙ 03/10/2023

Improving Weakly Supervised Sound Event Detection with Causal Intervention

Existing weakly supervised sound event detection (WSSED) work has not ex...

0 Yifei Xin, et al. ∙

research

∙ 02/23/2023

FTM: A Frame-level Timeline Modeling Method for Temporal Graph Representation Learning

Learning representations for graph-structured data is essential for grap...

0 Bowen Cao, et al. ∙

research

∙ 02/18/2023

SSVMR: Saliency-based Self-training for Video-Music Retrieval

With the rise of short videos, the demand for selecting appropriate back...

0 Xuxin Cheng, et al. ∙

research

∙ 01/15/2023

Generating Templated Caption for Video Grounding

Video grounding aims to locate a moment of interest matching the given q...

0 Hongxiang Li, et al. ∙

research

∙ 12/16/2022

Towards Unified All-Neural Beamforming for Time and Frequency Domain Speech Separation

Recently, frequency domain all-neural beamforming methods have achieved ...

0 Rongzhi Gu, et al. ∙

research

∙ 12/07/2022

M3ST: Mix at Three Levels for Speech Translation

How to solve the data scarcity problem for end-to-end speech-to-text tra...

0 Xuxin Cheng, et al. ∙

research

∙ 11/22/2022

Aligning Source Visual and Target Language Domains for Unpaired Video Captioning

Training supervised video captioning model requires coupled video-captio...

0 Fenglin Liu, et al. ∙

research

∙ 11/08/2022

A Dynamic Graph Interactive Framework with Label-Semantic Injection for Spoken Language Understanding

Multi-intent detection and slot filling joint models are gaining increas...

0 Zhihong Zhu, et al. ∙

research

∙ 11/04/2022

NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS

Expressive text-to-speech (TTS) can synthesize a new speaking style by i...

0 Dongchao Yang, et al. ∙

research

∙ 10/28/2022

DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention

Vision-and-language (V-L) tasks require the system to understand both vi...

0 Fenglin Liu, et al. ∙

research

∙ 10/19/2022

Prophet Attention: Predicting Attention with Future Attention for Improved Image Captioning

Recently, attention based models have been used extensively in many sequ...

5 Fenglin Liu, et al. ∙

research

∙ 10/06/2022

Video Referring Expression Comprehension via Transformer with Content-aware Query

Video Referring Expression Comprehension (REC) aims to localize a target...

0 Ji Jiang, et al. ∙

research

∙ 07/21/2022

Correspondence Matters for Video Referring Expression Comprehension

We investigate the problem of video Referring Expression Comprehension (...

0 Meng Cao, et al. ∙

research

∙ 07/21/2022

LocVTP: Video-Text Pre-training for Temporal Localization

Video-Text Pre-training (VTP) aims to learn transferable representations...

5 Meng Cao, et al. ∙

research

∙ 07/20/2022

Diffsound: Discrete Diffusion Model for Text-to-sound Generation

Generating sound effects that humans want is an important topic. However...

0 Dongchao Yang, et al. ∙

research

∙ 06/05/2022

LAE: Language-Aware Encoder for Monolingual and Multilingual ASR

Despite the rapid progress in automatic speech recognition (ASR) researc...

0 Jinchuan Tian, et al. ∙

research

∙ 05/03/2022

Improving Dual-Microphone Speech Enhancement by Learning Cross-Channel Features with Multi-Head Attention

Hand-crafted spatial features, such as inter-channel intensity differenc...

0 Xinmeng Xu, et al. ∙

research

∙ 04/15/2022

Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction

Dominant researches adopt supervised training for speaker extraction, wh...

0 Zifeng Zhao, et al. ∙

research

∙ 04/05/2022

RaDur: A Reference-aware and Duration-robust Network for Target Sound Detection

Target sound detection (TSD) aims to detect the target sound from a mixt...

0 Dongchao Yang, et al. ∙

research

∙ 04/05/2022

A Two-student Learning Framework for Mixed Supervised Target Sound Detection

Target sound detection (TSD) aims to detect the target sound from mixtur...

0 Dongchao Yang, et al. ∙

research

∙ 04/04/2022

Target Confusion in End-to-end Speaker Extraction: Analysis and Approaches

Recently, end-to-end speaker extraction has attracted increasing attenti...

0 Zifeng Zhao, et al. ∙

research

∙ 04/02/2022

Improving Target Sound Extraction with Timestamp Information

Target sound extraction (TSE) aims to extract the sound part of a target...

0 Helin Wang, et al. ∙

research

∙ 03/31/2022

Learning Decoupling Features Through Orthogonality Regularization

Keyword spotting (KWS) and speaker verification (SV) are two important t...

0 Li Wang, et al. ∙

research

∙ 03/31/2022

SpatioTemporal Focus for Skeleton-based Action Recognition

Graph convolutional networks (GCNs) are widely adopted in skeleton-based...

0 Liyu Wu, et al. ∙

research

∙ 03/29/2022

Integrate Lattice-Free MMI into End-to-End Speech Recognition

In automatic speech recognition (ASR) research, discriminative criteria ...

0 Jinchuan Tian, et al. ∙

research

∙ 03/25/2022

Unsupervised Pre-training for Temporal Action Localization Tasks

Unsupervised video representation learning has made remarkable achieveme...

0 Can Zhang, et al. ∙

research

∙ 01/06/2022

Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model

Despite the rapid progress of end-to-end (E2E) automatic speech recognit...

0 Jinchuan Tian, et al. ∙

research

∙ 12/19/2021

Detect what you want: Target Sound Detection

Human beings can perceive a target sound that we are interested in from ...

0 Dongchao Yang, et al. ∙

research

∙ 12/05/2021

Consistent Training and Decoding For End-to-end Speech Recognition Using Lattice-free MMI

Recently, End-to-End (E2E) frameworks have achieved remarkable results o...

0 Jinchuan Tian, et al. ∙

research

∙ 11/30/2021

CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning

For video captioning, "pre-training and fine-tuning" has become a de fac...

0 Bang Yang, et al. ∙

research

∙ 10/12/2021

Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Semantic Information

Automated audio captioning (AAC) has developed rapidly in recent years, ...

0 Zhongjie Ye, et al. ∙

research

∙ 10/09/2021

A Mutual learning framework for Few-shot Sound Event Detection

Although prototypical network (ProtoNet) has proved to be an effective m...

0 Dongchao Yang, et al. ∙

research

∙ 09/18/2021

Towards Joint Intent Detection and Slot Filling via Higher-order Attention

Intent detection (ID) and Slot filling (SF) are two major tasks in spoke...

0 Dongsheng Chen, et al. ∙

research

∙ 09/13/2021

On Pursuit of Designing Multi-modal Transformer for Video Grounding

Video grounding aims to localize the temporal segment corresponding to a...

0 Meng Cao, et al. ∙

research

∙ 08/26/2021

HAN: Higher-order Attention Network for Spoken Language Understanding

Spoken Language Understanding (SLU), including intent detection and slot...

0 Dongsheng Chen, et al. ∙

research

∙ 08/25/2021

Fully Non-Homogeneous Atmospheric Scattering Modeling with Convolutional Neural Networks for Single Image Dehazing

In recent years, single image dehazing models (SIDM) based on atmospheri...

1 Cong Wang, et al. ∙

research

∙ 08/18/2021

Joint Multiple Intent Detection and Slot Filling via Self-distillation

Intent detection and slot filling are two main tasks in natural language...

0 Lisong Chen, et al. ∙

research

∙ 08/12/2021

Deep Motion Prior for Weakly-Supervised Temporal Action Localization

Weakly-Supervised Temporal Action Localization (WSTAL) aims to localize ...

0 Meng Cao, et al. ∙


