NeMo Toolbox for Speech Dataset Construction
In this paper, we introduce a new toolbox for constructing speech datasets from long audio recording and raw reference texts. We develop tools for each step of the speech dataset construction pipeline including data preprocessing, audio-text alignment, data post-processing and filtering. The proposed pipeline also supports human-in-the-loop to address text-audio mismatch issues and remove samples that don't satisfy the quality requirements. We demonstrated the toolbox efficiency by building the Russian LibriSpeech corpus (RuLS) from LibriVox audiobooks. The toolbox is opne sourced in NeMo framework. The RuLS corpus is released in OpenSLR.
READ FULL TEXT