Multi-target Filter and Detector for Unknown-number Speaker Diarization

03/30/2022
by Chin-yi Cheng, et al.

A strong representation of a target speaker can aid in extracting important information about the speaker and in detecting the corresponding temporal regions in a multi-speaker conversation. In this study, we propose a neural architecture that simultaneously extracts speaker representations consistent with the speaker diarization objective and detects the presence of each speaker frame by frame, regardless of the number of speakers in the conversation. A speaker representation extractor (the representation is known as a z-vector) and a frame-speaker contextualizer, which is realized by a residual network and processes data in both the temporal and speaker dimensions, are integrated into a unified framework. Tests on the CALLHOME corpus show that our model outperforms most methods presented to date. An evaluation in a more challenging case, with the number of concurrent speakers ranging from two to seven, demonstrates that our model also achieves relative diarization error rate reductions of 26.35% and 6.4% over two reference models, the second of which is an attention-based model.
