MFCCGAN: A Novel MFCC-Based Speech Synthesizer Using Adversarial Learning

06/22/2023

by Mohammad Reza Hasanabadi, Majid Behdad, Davood Gharavian, et al.

In this paper, we introduce MFCCGAN, a novel speech synthesizer based on adversarial learning that takes MFCCs as input and generates raw speech waveforms. Benefiting from the capabilities of the GAN model, it produces speech with higher intelligibility than a rule-based MFCC-based speech synthesizer WORLD. We evaluated the model with a popular intrusive objective speech intelligibility measure (STOI) and an objective quality measure (NISQA score). Experimental results show that our proposed system outperforms Librosa MFCC-inversion (by an increase of about 26 [...] rise of about 10 [...] with the conventional rule-based vocoder WORLD, which is used in the CycleGAN-VC family. However, WORLD needs additional data such as F0. Finally, using a perceptual loss based on STOI in the discriminators could further improve quality. WebMUSHRA-based subjective tests also show the quality of the proposed approach.
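The abstract does not say how the MFCC input is computed. As general background, MFCCs are commonly obtained by applying a DCT to log-mel energies and keeping only the first few coefficients, which discards spectral detail that any inversion method then has to fill in. The sketch below illustrates that loss on a single made-up frame. The frame values, the 40-band size, and the choice of 13 coefficients are illustrative assumptions, not settings taken from the paper.

```python
import math

def dct2_ortho(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct2_ortho(c, n):
    """Inverse of dct2_ortho; coefficients beyond len(c) are treated as zero."""
    out = []
    for i in range(n):
        s = 0.0
        for k, ck in enumerate(c):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            s += scale * ck * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(s)
    return out

def max_abs_error(a, b):
    """Largest absolute difference between two equal-length lists."""
    return max(abs(p - q) for p, q in zip(a, b))

# Hypothetical log-mel frame with 40 bands (values are made up for illustration).
n_mels = 40
log_mel = [math.log(1.0 + (i % 7) + 0.5 * i) for i in range(n_mels)]

# Full cepstrum versus a truncated one (13 coefficients is an assumed, common choice).
mfcc_full = dct2_ortho(log_mel)
mfcc_13 = mfcc_full[:13]

# Keeping every coefficient is invertible; truncation is not.
err_full = max_abs_error(log_mel, idct2_ortho(mfcc_full, n_mels))
err_13 = max_abs_error(log_mel, idct2_ortho(mfcc_13, n_mels))

print("reconstruction error, all coefficients:", err_full)
print("reconstruction error, 13 coefficients: ", err_13)
```

Even before phase and pitch are considered, the truncated cepstrum recovers only a smoothed version of the log-mel frame, which is one reason a learned generator can be attractive for mapping MFCCs back to a waveform.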

Related Research

Generating Segment Durations in a Text-To-Speech System: A Hybrid Rule-Based/Neural Network Approach

Perceptually Guided End-to-End Text-to-Speech

VR IQA NET: Deep Virtual Reality Image Quality Assessment using Adversarial Learning

Going Retro: Astonishingly Simple Yet Effective Rule-based Prosody Modelling for Speech Synthesis Simulating Emotion Dimensions

Adapting a FrameNet Semantic Parser for Spoken Language Understanding Using Adversarial Learning

RAN Cognitive Controller

Towards Interpretability of Speech Pause in Dementia Detection using Adversarial Learning