Exploring speaker enrolment for few-shot personalisation in emotional vocalisation prediction
In this work, we explore a novel few-shot personalisation architecture for emotional vocalisation prediction. The core contribution is an "enrolment" encoder which uses two unlabelled samples of the target speaker to adjust the output of the emotion encoder; the adjustment is based on dot-product attention and thus effectively functions as a form of "soft" feature selection. The emotion and enrolment encoders are based on two standard audio architectures: CNN14 and CNN10. The two encoders are further guided to forget or learn auxiliary emotion and/or speaker information. Our best approach achieves a concordance correlation coefficient (CCC) of .650 on the ExVo Few-Shot dev set, a 2.5% relative increase over our baseline CNN14 CCC of .634.
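As a rough illustration of the two technical ideas named above, the following NumPy sketch shows one possible reading of attention-based "soft" feature selection, together with the standard CCC formula used for evaluation. The abstract does not specify the exact formulation, so the function `enrolment_adjust`, the sigmoid gate, the scaling by the square root of the embedding size, and all shapes are assumptions made for illustration only; the random vectors merely stand in for CNN14/CNN10 embeddings.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()


def enrolment_adjust(emotion_emb, enrolment_embs):
    """Hypothetical soft feature selection (an assumption, not the paper's exact method).

    emotion_emb:    shape (d,)   embedding of the utterance to be predicted
    enrolment_embs: shape (k, d) embeddings of k unlabelled enrolment samples
    """
    d = emotion_emb.shape[0]
    # Scaled dot-product scores between the utterance and each enrolment sample.
    scores = enrolment_embs @ emotion_emb / np.sqrt(d)
    weights = softmax(scores)                 # shape (k,), sums to 1
    context = weights @ enrolment_embs        # shape (d,), attended speaker summary
    gate = 1.0 / (1.0 + np.exp(-context))     # values in (0, 1), one per feature
    # Each feature is scaled down rather than hard-selected.
    return emotion_emb * gate


def ccc(y_true, y_pred):
    """Concordance correlation coefficient between two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return 2.0 * cov / (y_true.var() + y_pred.var() + (mean_t - mean_p) ** 2)


rng = np.random.default_rng(0)
emotion_emb = rng.normal(size=8)              # stand-in for an emotion-encoder output
enrolment_embs = rng.normal(size=(2, 8))      # two enrolment samples, as in the abstract
adjusted = enrolment_adjust(emotion_emb, enrolment_embs)

# Relative gain of the reported CCC over the baseline, in percent.
relative_gain = round((0.650 - 0.634) / 0.634 * 100, 1)
```

The relative gain evaluates to 2.5, matching the figure quoted in the abstract. Because the gate lies strictly between 0 and 1, this sketch can only attenuate features; a learned projection before the gate would be needed for anything closer to a trainable model.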