Transformer based unsupervised pre-training for acoustic representation learning

07/29/2020
by Ruixiong Zhang, et al.

Computational audio analysis has become a central topic in related research areas, and a variety of applications have arisen. However, for many acoustic tasks the amount of labeled data is limited. To address this problem, we propose an unsupervised pre-training method that uses a Transformer based encoder to learn a general and robust high-level representation for all acoustic tasks. Experiments were conducted on three kinds of acoustic tasks: speech translation, speech emotion recognition and sound event detection. All experiments show that pre-training on a task's own training data makes the model converge significantly faster and improves performance. With larger pre-training data combining the MuST-C, Librispeech and ESC-US datasets, the BLEU score for speech translation further improves by a relative 12.2% on the En-De dataset and 8.4% on a second translation set, the score for sound event detection further improves by an absolute 1.7% and 2.4%, and the score for speech emotion recognition further improves by an absolute 4.3%.
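
As a rough illustration of the kind of pre-training the abstract describes, the sketch below trains a Transformer encoder to reconstruct randomly masked acoustic frames from unlabeled audio and exposes the encoder output as the learned representation. The masked-reconstruction objective, the class and function names, and all dimensions are assumptions made for illustration; the paper's exact architecture and loss may differ.

```python
# Minimal sketch of Transformer-encoder pre-training on acoustic frames via
# masked reconstruction. Names, dimensions and the objective are illustrative
# assumptions, not the paper's exact recipe.
import torch
import torch.nn as nn

class AcousticPretrainModel(nn.Module):
    def __init__(self, feat_dim=80, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)        # project filterbank frames
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.recon_head = nn.Linear(d_model, feat_dim)         # reconstruct masked frames

    def forward(self, feats):
        hidden = self.encoder(self.input_proj(feats))          # (batch, time, d_model)
        return self.recon_head(hidden), hidden                 # reconstruction + representation

def mask_frames(feats, mask_prob=0.15):
    """Zero out random frames; the model must reconstruct them from context."""
    mask = torch.rand(feats.shape[:2]) < mask_prob             # (batch, time) boolean mask
    masked = feats.clone()
    masked[mask] = 0.0
    return masked, mask

# One unsupervised pre-training step on unlabeled audio features.
model = AcousticPretrainModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
feats = torch.randn(4, 200, 80)                                # stand-in for log-mel features
masked, mask = mask_frames(feats)
recon, _ = model(masked)
loss = nn.functional.l1_loss(recon[mask], feats[mask])         # L1 loss on masked positions only
loss.backward()
optimizer.step()
```

After pre-training of this kind, the encoder would typically be kept and fine-tuned with a task-specific head on the labeled data of each downstream task.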
