Adaptation of Tongue Ultrasound-Based Silent Speech Interfaces Using Spatial Transformer Networks
Thanks to the latest deep learning algorithms, silent speech interfaces (SSI) are now able to synthesize intelligible speech from articulatory movement data under certain conditions. However, the resulting models are rather speaker-specific, making a quick switch between users troublesome. Even for the same speaker, these models perform poorly cross-session, i.e., after dismounting and re-mounting the recording equipment. To aid quick speaker and session adaptation of ultrasound tongue imaging-based SSI models, we extend our deep networks with a spatial transformer network (STN) module, capable of performing an affine transformation on the input images. Although the STN part takes up only about 10% of the network, our experiments show that adapting only the STN module can achieve 88% of the MSE reduction obtained by retraining the whole network. The improvement is even larger (around 92%) when adapting the network to different recording sessions from the same speaker.
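To illustrate the architecture the abstract describes, the sketch below shows a minimal STN front-end in PyTorch: a small localization network regresses the six parameters of a 2D affine transform, which is then applied to the input ultrasound frame before it reaches the main synthesis network. This is an illustrative reconstruction, not the authors' code; the layer sizes and the assumed 64x128 input resolution are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    """Sketch of a spatial transformer front-end for ultrasound frames.
    Assumed input: single-channel images of shape (N, 1, 64, 128)."""

    def __init__(self, height=64, width=128):
        super().__init__()
        # Localization network: a small CNN that extracts features
        # from which the affine parameters are regressed.
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
        )
        # Infer the flattened feature size for the given input resolution.
        with torch.no_grad():
            n_feat = self.loc(torch.zeros(1, 1, height, width)).numel()
        self.fc = nn.Sequential(
            nn.Linear(n_feat, 32), nn.ReLU(),
            nn.Linear(32, 6),  # six entries of the 2x3 affine matrix
        )
        # Initialize to the identity transform, so training starts from
        # "no warping" and only learns the needed correction.
        self.fc[-1].weight.data.zero_()
        self.fc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        # Predict the affine parameters, then warp the input image.
        theta = self.fc(self.loc(x).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```

In the adaptation setting the abstract reports on, only the STN parameters would be updated for a new speaker or session while the downstream synthesis network stays frozen, matching the claim that tuning roughly 10% of the weights recovers most of the benefit of full retraining.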