Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation

by Zhiyuan Li, et al.

Training an image captioner without annotated image-sentence pairs has gained traction in recent years. Previous approaches fall into two strategies: crawling sentences from mismatching corpora and aligning them with the given images as pseudo annotations, or pre-training the captioner on external image-text pairs. However, the alignment strategy seems to have reached its performance limit because of the low quality of the resulting pairs, and pre-training requires significant computational resources. To address these challenges, we propose a new strategy, "LPM + retrieval-augmented learning", in which the prior knowledge of large pre-trained models (LPMs) is leveraged as supervision and a retrieval process is integrated to further reinforce its effectiveness. Specifically, we introduce Retrieval-augmented Pseudo Sentence Generation (RaPSG), which efficiently retrieves highly relevant short region descriptions from the mismatching corpora and uses them, via LPMs, to generate a variety of high-quality pseudo sentences with distinct representations. In addition, a fluency filter and a CLIP-guided training objective are introduced to facilitate model optimization. Experimental results demonstrate that our method surpasses the SOTA pre-training model (Flamingo3B) by achieving a CIDEr score of 78.1 (+5.1) while utilizing only 0.3 eliminates the need for computationally expensive pre-training processes on external datasets (e.g., the requirement of 312M image-text pairs for Flamingo3B). We further show that with a simple extension, the generated pseudo sentences can be deployed as weak supervision to boost the 1 image caption benchmark up to a CIDEr score of 93.4 (+8.9), which showcases the versatility and effectiveness of our approach.
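The retrieval step described above can be illustrated with a minimal sketch. The abstract does not specify how relevance is scored, so this example assumes that the image and each region description have already been embedded into a shared vector space (for instance by a CLIP-style encoder) and ranks descriptions by cosine similarity. The function names `cosine` and `retrieve_top_k`, the toy embeddings, and the choice of cosine similarity are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    image_emb: Sequence[float],
    corpus: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Return the k (score, description) pairs most similar to the image.

    `corpus` holds (description, embedding) pairs. The embeddings are assumed
    to be precomputed; this sketch does not perform any encoding itself.
    """
    scored = [(cosine(image_emb, emb), text) for text, emb in corpus]
    # Highest similarity first; sorted() is stable, so ties keep corpus order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy 2-D embeddings standing in for real encoder outputs (assumption).
toy_image = [1.0, 0.0]
toy_corpus = [
    ("a dog on grass", [0.9, 0.1]),
    ("a red car", [0.0, 1.0]),
    ("a brown dog", [1.0, 0.0]),
]
top_descriptions = [text for _, text in retrieve_top_k(toy_image, toy_corpus, k=2)]
```

In a pipeline of the kind the abstract outlines, the retrieved descriptions would then be passed to a large pre-trained model to compose pseudo sentences. That generation step, the fluency filter, and the CLIP-guided objective are not shown here.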




