Recent advances in foundation models present new opportunities for inter...
The abundance of instructional videos and their narrations over the Inte...
Contrastive language-image pretraining (CLIP) using image-text pairs has...
This paper presents a grounded language-image pre-training (GLIP) model ...
Learning from image-text data has demonstrated recent success for many r...
We address the challenging problem of image captioning by revisiting the...