Blind Extraction of Target Speech Source Guided by Supervised Speaker Identification via X-vectors
This manuscript proposes a novel, robust procedure for extracting a speaker of interest (SOI) from a mixture of audio sources. The SOI is estimated blindly via independent vector extraction. A recently proposed constant separating vector (CSV) model is employed, which improves the estimation of moving sources. The blind algorithm is guided towards the SOI by frame-wise speaker identification, which is trained in a supervised manner and is independent of any specific scenario. When processing challenging data, an incorrect speaker may be extracted due to limitations of this guidance. To identify such cases, a criterion that non-intrusively assesses the quality of the estimated SOI is proposed. It utilizes the same model as the speaker identification; no additional training is therefore required. Using this criterion, a “deflation” approach to extraction is presented: if an incorrect source is estimated, it is subtracted from the mixture and the extraction of the SOI is performed again on the reduced mixture. The proposed procedure is experimentally tested on both artificial and real-world datasets containing challenging phenomena: source movements, reverberation, transient noise, or microphone failures. The presented method is comparable to state-of-the-art blind algorithms on static mixtures and is more accurate on mixtures containing source movements. Compared to fully supervised methods, the proposed procedure achieves lower accuracy but requires no scenario-specific training data.
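The control flow of the deflation approach described above can be illustrated with a minimal sketch. It is not the authors' implementation: `extract` and `is_soi` are hypothetical callables standing in for the CSV-based independent vector extraction and the x-vector-based quality criterion, and the subtraction step is assumed here to be a simple per-channel least-squares projection on time-domain signals, which the abstract does not specify.

```python
import numpy as np


def project_out(mixture, source):
    """Subtract a source estimate from every channel of the mixture.

    Assumption for illustration: each channel contains a scaled copy of the
    source, so a per-channel least-squares gain is estimated and removed.
    mixture has shape (channels, samples); source has shape (samples,).
    """
    energy = source @ source
    if energy == 0.0:
        # Nothing to subtract for an all-zero estimate.
        return mixture
    gains = (mixture @ source) / energy
    return mixture - np.outer(gains, source)


def deflation_extract(mixture, extract, is_soi, max_rounds=3):
    """Repeat extraction on a reduced mixture until the criterion accepts.

    extract: hypothetical callable, (channels, samples) -> (samples,) estimate.
    is_soi:  hypothetical callable, estimate -> bool (quality criterion).
    Returns the accepted estimate, or None if no round is accepted.
    """
    residual = mixture
    for _ in range(max_rounds):
        estimate = extract(residual)
        if is_soi(estimate):
            return estimate
        # An incorrect source was extracted: remove it and try again.
        residual = project_out(residual, estimate)
    return None
```

In this sketch `max_rounds` bounds the number of deflation steps; returning `None` signals that no estimate passed the criterion, leaving the fallback policy to the caller.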