Mapping two modalities, speech and text, into a shared representation sp...
Machine learning systems produce biased results towards certain demograp...
Conversational models that are generative and open-domain are particular...
This paper proposes a framework for quantitatively evaluating interactiv...
We present NusaCrowd, a collaborative initiative to collect and unite ex...
Text-to-speech (TTS) and singing voice synthesis (SVS) aim at generating...
In this thesis, we investigated the relevance, faithfulness, and succinc...
Large-scale vision-language pre-trained (VLP) models are prone to halluc...
Closed-book question answering (QA) requires a model to directly answer ...
This paper is the system description of the DKU-Tencent System for the V...
Automatic speaker verification has achieved remarkable progress in recen...
The zero-shot scenario for speech generation aims at synthesizing a nove...
Building a voice conversion system for noisy target speakers, such as us...
The ideal goal of voice conversion is to convert the source speaker's sp...
Speaker adaptation in text-to-speech synthesis (TTS) is to finetune a pr...
Considerable advancements have been made in various NLP tasks based on t...
Recently, the Conformer-based CTC/AED model has become a mainstream architec...
Though significant progress has been made for speaker-dependent Video-to...
Multi-hop question generation (MQG) aims to generate complex questions w...
This paper describes our speaker diarization system submitted to the Mul...
Denoising diffusion probabilistic models (DDPMs) are expressive generati...
Recently, End-to-End (E2E) frameworks have achieved remarkable results o...
Mixture-of-experts based acoustic models with dynamic routing mechanisms...
The task of few-shot style transfer for voice cloning in text-to-speech ...
Recently, the attention mechanism such as squeeze-and-excitation module ...
Current app ranking and recommendation systems are mainly based on user-...
Cross-speaker style transfer (CSST) in text-to-speech (TTS) synthesis ai...
Current two-stage TTS framework typically integrates an acoustic model w...
In spoken conversations, spontaneous behaviors like filled pause and pro...
This paper introduces GigaSpeech, an evolving, multi-domain English spee...
For conversational text-to-speech (TTS) systems, it is vital that the sy...
Singing voice conversion (SVC) is one promising technique which can enri...
Query focused summarization (QFS) models aim to generate summaries from ...
To diversify and enrich generated dialogue responses, knowledge-grounded...
Recently, neural architecture search (NAS) has attracted much attention ...
Recently, Mixture of Experts (MoE) based Transformer has shown promising...
This paper proposes VARA-TTS, a non-autoregressive (non-AR) text-to-spee...
Non-autoregressive (NAR) transformer models have achieved significantly ...
Existing approaches for replay and synthetic speech detection still lack...
The home voice assistants such as Amazon Alexa have become increasingly ...
Lay summarization aims to generate lay summaries of scientific papers au...
Multi-hop Question Generation (QG) aims to generate answer-related quest...
In this paper, we explore the neural architecture search (NAS) for autom...
Generating 3D speech-driven talking heads has received more and more atte...
Recently adversarial attacks on automatic speaker verification (ASV) sys...
To address the need for refined information in the COVID-19 pandemic, we pro...
Hand-crafted spatial features (e.g., inter-channel phase difference, IPD...
Self-attention networks (SAN) have been introduced into automatic speech...
Deep-learning based speech separation models confront poor generalizatio...
Self-attention network (SAN) can benefit significantly from the bi-direc...