Efficient neural speech synthesis for low-resource languages through multilingual modeling
Recent advances in neural TTS have led to models that can produce high-quality synthetic speech. However, these models typically require large amounts of training data, which can make it costly to produce a new voice with the desired quality. Although multi-speaker modeling can reduce the data requirements necessary for a new voice, this approach is usually not viable for many low-resource languages for which abundant multi-speaker data is not available. In this paper, we therefore investigated to what extent multilingual multi-speaker modeling can be an alternative to monolingual multi-speaker modeling, and explored how data from foreign languages may best be combined with low-resource language data. We found that multilingual modeling can increase the naturalness of low-resource language speech, showed that multilingual models can produce speech with a naturalness comparable to monolingual multi-speaker models, and saw that the target language naturalness was affected by the strategy used to add foreign language data.