Speech Emotion Recognition via Contrastive Loss under Siamese Networks

10/23/2019
by   Zheng Lian, et al.

Speech emotion recognition is an important aspect of human-computer interaction. Prior work proposes various end-to-end models to improve classification performance. However, most of them rely on the cross-entropy loss with softmax as the supervision component, which does not explicitly encourage discriminative feature learning. In this paper, we introduce the contrastive loss function to encourage intra-class compactness and inter-class separability of the learned features. Furthermore, multiple feature selection methods and pairwise sample selection methods are evaluated. To verify the performance of the proposed system, we conduct experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, a common evaluation corpus. Experimental results reveal the advantages of the proposed method, which reaches 62.19% unweighted accuracy, outperforming the baseline system optimized without the contrastive loss function by 1.14% in unweighted accuracy.
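For reference, a pairwise contrastive loss under a Siamese setup is typically computed on the Euclidean distance between the two branches' embeddings: similar pairs are pulled together, dissimilar pairs are pushed apart until they exceed a margin. Below is a minimal NumPy sketch of this standard formulation; the function name, margin value, and label convention (1 = same emotion class, 0 = different) are illustrative assumptions, not necessarily the exact formulation used in the paper.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Standard pairwise contrastive loss (illustrative sketch).

    emb_a, emb_b: (N, D) arrays of embeddings from the two Siamese branches.
    same_class:   (N,) array, 1.0 if the pair shares an emotion label, else 0.0.
    """
    # Euclidean distance between each paired embedding
    d = np.linalg.norm(emb_a - emb_b, axis=1)
    # Similar pairs: penalize any distance (pull together)
    pos = same_class * d**2
    # Dissimilar pairs: penalize only if closer than the margin (push apart)
    neg = (1.0 - same_class) * np.maximum(0.0, margin - d)**2
    return np.mean(pos + neg)
```

With margin 1.0, a similar pair at distance 0.5 contributes 0.25 to the loss, while a dissimilar pair at distance 2.0 (already beyond the margin) contributes 0.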
