Large-scale noisy web image-text datasets have been proven to be efficie...
Self-supervised learning on large-scale multi-modal datasets allows lear...
Spatio-temporal grounding describes the task of localizing events in spa...
Large-scale Vision-Language (VL) models have shown tremendous success in...
Contrastive learning has become a prominent ingredient in learning repre...
Multilingual text-video retrieval methods have improved significantly in...
Vision-language models trained on large, randomly collected data had sig...
Recently, a number of new Semi-Supervised Learning methods have emerged....
Multi-modal learning from video data has seen increased attention recent...
The task of multimodal learning has seen a growing interest recently as ...