Transformer based unsupervised pre-training for acoustic representation learning
Computational audio analysis has become a central issue in associated areas of research, and a variety of related applications have arisen. However, for many acoustic tasks, the amount of labeled data may be limited. To handle this problem, we propose an unsupervised pre-training method using a Transformer-based encoder to learn a general and robust high-level representation for all acoustic tasks. Experiments have been conducted on three kinds of acoustic tasks: speech translation, speech emotion recognition and sound event detection. All the experiments show that pre-training on a task's own training data makes the model converge significantly faster and improves performance. With a larger pre-training set combining the MuST-C, Librispeech and ESC-US datasets, for speech translation, the BLEU score can further improve relatively 12.2 on the En-De dataset and 8.4 score can further improve absolutely 1.7 and 2.4 improve absolutely 4.3