Crossed-Time Delay Neural Network for Speaker Recognition
Time Delay Neural Network (TDNN) is a well-performing structure for DNN-based speaker recognition systems. In this paper we introduce a novel structure, the Crossed-Time Delay Neural Network (CTDNN), to enhance the performance of the current TDNN. Inspired by the multi-filter setting of convolution layers in convolutional neural networks, we set multiple time delay units, each with a different context size, at the bottom layer and construct a multilayer parallel network. The proposed CTDNN gives significant improvements over the original TDNN on both speaker verification and identification tasks. On the VoxCeleb1 dataset it outperforms the original TDNN in the verification experiment with a 2.6% absolute Equal Error Rate improvement. In the few-shot condition CTDNN reaches 90.4% identification accuracy, which doubles the identification accuracy of the original TDNN. We also compare the proposed CTDNN with another new variant of TDNN, FTDNN; the comparison shows that our model has a 36% absolute identification accuracy improvement under the few-shot condition and can better handle training with a larger batch in a shorter training time, which better utilizes the computational resources.
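The idea of several time delay units with different context sizes reading the same input in parallel can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the context sizes (1, 3, 5), the layer widths, the random weights, the ReLU activation, and the choice to mean-pool each unit over time before concatenating are all assumptions made for illustration, and every function name is hypothetical.

```python
import random

def time_delay_unit(frames, weights, context):
    """Slide a window of `context` frames over the input; apply a linear map + ReLU."""
    outputs = []
    for t in range(len(frames) - context + 1):
        # Splice `context` consecutive frames into one flat input vector.
        window = [x for frame in frames[t:t + context] for x in frame]
        outputs.append([max(0.0, sum(w * x for w, x in zip(row, window)))
                        for row in weights])
    return outputs

def mean_pool(outputs):
    """Average over time so units with different contexts give equal-size vectors."""
    n = len(outputs)
    return [sum(col) / n for col in zip(*outputs)]

def crossed_bottom_layer(frames, units):
    """Run every (context, weights) unit on the same input and concatenate results."""
    pooled = []
    for context, weights in units:
        pooled.extend(mean_pool(time_delay_unit(frames, weights, context)))
    return pooled

def random_weights(rng, out_dim, in_dim):
    """Hypothetical initializer: small uniform weights, shape [out_dim][in_dim]."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

rng = random.Random(0)
feat_dim, out_dim, num_frames = 4, 3, 10
frames = [[rng.random() for _ in range(feat_dim)] for _ in range(num_frames)]

contexts = (1, 3, 5)  # assumed context sizes, one per parallel unit
units = [(c, random_weights(rng, out_dim, c * feat_dim)) for c in contexts]

embedding = crossed_bottom_layer(frames, units)
print(len(embedding))  # number of units * out_dim
```

Each unit sees the same frames but a different amount of temporal context, so a unit with context size c yields `num_frames - c + 1` output frames. Pooling over time before concatenation is only one simple way to merge outputs of different lengths; the full text should be consulted for how the paper combines them.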