Recent models such as XLS-R and Whisper have made multilingual speech
te...
Spatio-temporal grounding describes the task of localizing events in spa...
Multilingual text-video retrieval methods have improved significantly in...
Multi-modal learning from video data has seen increased attention recent...
The task of multimodal learning has seen a growing interest recently as ...
In this paper, we explore self-supervised audio-visual models that learn...
Visually-grounded spoken language datasets can enable models to learn
cr...
Recent advances in representation learning have demonstrated an ability ...
Multimodal self-supervised learning is getting more and more attention a...
Current methods for learning visually grounded language from videos ofte...
While deep learning has been incredibly successful in modeling tasks wit...
Segmenting objects in images and separating sound sources in audio are
c...
We introduce PixelPlayer, a system that, by leveraging large amounts of
...