End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures
We study ResNet-, Time-Depth Separable ConvNet-, and Transformer-based acoustic models, trained with CTC or Seq2Seq criteria. We perform experiments on the LibriSpeech dataset, with and without LM decoding, optionally with beam rescoring, and reach 5.18% WER on test-other with beam rescoring. Additionally, we leverage the unlabeled data from LibriVox through semi-supervised training and show that it is possible to reach 5.29% WER on test-other without any decoding, and 4.11% WER with beam rescoring, using only the standard 960 hours from LibriSpeech as labeled data.
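For readers unfamiliar with the setup, below is a minimal PyTorch sketch of two ingredients the abstract refers to: an acoustic model trained with a CTC criterion, and pseudo-labeling of unlabeled audio for semi-supervised training. This is an illustrative assumption, not the authors' implementation; the model, the `decode` helper, and the data handling are hypothetical placeholders.

```python
# Minimal sketch (not the paper's code) of CTC training plus pseudo-labeling
# for semi-supervised ASR. The acoustic model is assumed to return frame
# log-probabilities of shape (T, N, C); `decode` is a hypothetical decoder.
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def train_step(model, optimizer, feats, feat_lens, targets, target_lens):
    """One supervised (or pseudo-label) update with the CTC criterion."""
    model.train()
    log_probs = model(feats).log_softmax(dim=-1)  # (T, N, C) posteriors
    loss = ctc_loss(log_probs, targets, feat_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def pseudo_label(model, feats, decode):
    """Transcribe unlabeled audio (e.g. LibriVox) with the current model;
    the resulting hypotheses are reused as targets in train_step above."""
    model.eval()
    log_probs = model(feats).log_softmax(dim=-1)
    return decode(log_probs)  # greedy or beam-search decoding, optionally with an LM
```

In this sketch, semi-supervised training simply alternates between supervised updates on LibriSpeech and updates on LibriVox batches whose targets come from `pseudo_label`; decoding and beam rescoring with an external LM only affect how those hypotheses (and final test transcripts) are produced.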