Recent work has shown that it is possible to resynthesize high-quality s...
Large-scale generative models such as GPT and DALL-E have revolutionized...
Expanding the language coverage of speech technology has the potential t...
In this paper, we introduce self-distillation and online clustering for ...
Self-supervised learning leverages unlabeled data effectively, improving...
We introduce MuAViC, a multilingual audio-visual corpus for robust speec...
Self-supervision has shown great potential for audio-visual speech recog...
Generative language models define distributions over sequences of tokens...
Prior works on improving speech quality with visual input typically stud...
Current self-supervised learning algorithms are often modality-specific ...
Automatic speech recognition research focuses on training and evaluating...
We study speech-to-speech translation (S2ST) that translates speech from...
The amount of labeled data to train models for speech tasks is limited f...
While audio-visual speech models can yield superior performance and robu...
This paper investigates self-supervised pre-training for audio-visual sp...
Squeeze and Efficient Wav2vec (SEW) is a recently proposed architecture ...
We describe a method to jointly pre-train speech and text in an encoder-...
Direct speech-to-speech translation (S2ST) models suffer from data scarc...
We introduce the first unsupervised speech synthesis system based on a s...
Unsupervised speech recognition has shown great potential to make Automa...
We introduce dGSLM, the first "textless" model able to generate audio sa...
Human speech data comprises a rich set of domain factors such as accent,...
Textless spoken language processing research aims to extend the applicab...
While the general idea of self-supervised learning is identical across m...
Audio-based automatic speech recognition (ASR) degrades significantly in...
Video recordings of speech contain correlated audio and visual informati...
We present a textless speech-to-speech translation (S2ST) system that ca...
Speech emotion conversion is the task of modifying the perceived emotion...
We present the first direct simultaneous speech-to-speech translation (S...
This paper presents fairseq S^2, a fairseq extension for speech synthesi...
Speech pre-training has primarily demonstrated efficacy on classificatio...
We present a direct speech-to-speech translation (S2ST) model that trans...
In this paper, we introduce the Kaizen framework that uses a continuousl...
Self-supervised approaches for speech representation learning are challe...
Despite rapid progress in the recent past, current speech recognition sy...
Self-supervised learning of speech representations has been a very activ...
We propose using self-supervised discrete representations for the task o...
Generative spoken language modeling involves learning jointly the acoust...
In this paper we present the first model for directly synthesizing fluen...
We introduce a framework for automatic differentiation with weighted fin...
Probabilistic Latent Variable Models (LVMs) provide an alternative to se...
For sequence transduction tasks like speech recognition, a strong struct...
In this paper, we present a method for learning discrete linguistic unit...
Transfer learning aims to reduce the amount of data required to excel at...
This paper proposes a novel unsupervised autoregressive neural model for...
Lingvo is a Tensorflow framework offering a complete solution for collab...
This paper proposes a neural end-to-end text-to-speech (TTS) model which...
In this paper, we explore the use of a factorized hierarchical variation...
Although end-to-end text-to-speech (TTS) models such as Tacotron have sh...
The current trend in automatic speech recognition is to leverage large a...