Audio-visual representation learning aims to develop systems with human-...
This work investigates the use of large-scale, pre-trained models (CLIP ...
Data-driven speech processing models usually perform well with a large a...
Attention-based Transformer models have been increasingly employed for a...