Generalisation – the ability of a model to perform well on unseen data –...
We propose a visually grounded speech model that learns new words and th...
Most vision-and-language pretraining research focuses on English tasks.
Visually grounded speech (VGS) models are trained on images paired with...
The task of converting text input into video content is becoming an impo...
Multimodal speech recognition aims to improve the performance of automat...
Keyword localisation is the task of finding where in a speech utterance ...
The task of video-to-speech aims to translate silent video of lip moveme...
Quantifying the confidence (or conversely the uncertainty) of a predicti...
We describe the submission of the Quo Vadis team to the Traffic4cast...
This paper addresses the problem of building a speech recognition system...
This paper introduces a state-of-the-art video representation and applie...