TED: Triple Supervision Decouples End-to-end Speech-to-text Translation

09/21/2020

An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs text in a target language. Inspired by the neuroscience view that humans have separate perception and cognitive systems for processing different kinds of information, we propose TED (Transducer-Encoder-Decoder), a unified framework with triple supervision that decouples the end-to-end speech-to-text translation task. In addition to the translation loss on the target sentence, TED includes two auxiliary supervising signals: one guides the acoustic transducer, which extracts acoustic features from the input, and the other guides the semantic encoder, which extracts semantic features relevant to the source transcription text. Our method achieves state-of-the-art performance on both English-French and English-German speech translation benchmarks.
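The triple supervision described in the abstract can be pictured as one training objective built from three terms. The sketch below is only an illustration under stated assumptions: the abstract does not say how the three losses are defined or balanced, so the weighted sum, the `TripleLossWeights` class, and the `triple_supervision_loss` function are hypothetical names and choices, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TripleLossWeights:
    # Hypothetical balancing weights; the abstract does not specify them.
    acoustic: float = 1.0
    semantic: float = 1.0


def triple_supervision_loss(
    translation_loss: float,
    acoustic_loss: float,
    semantic_loss: float,
    weights: Optional[TripleLossWeights] = None,
) -> float:
    """Combine the main translation loss with two auxiliary losses.

    Assumption: the three signals are combined as a weighted sum.
    The source only states that three supervising signals exist.
    """
    if weights is None:
        weights = TripleLossWeights()
    return (
        translation_loss
        + weights.acoustic * acoustic_loss
        + weights.semantic * semantic_loss
    )


if __name__ == "__main__":
    # Placeholder per-batch loss values, chosen only for illustration.
    total = triple_supervision_loss(2.0, 0.5, 0.25)
    print(f"total loss: {total}")
```

In a real training loop each argument would be a scalar computed from the corresponding module's output (acoustic transducer, semantic encoder, decoder), and the combined value would be the quantity that is minimized.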
