Building scalable vision-language models to learn from diverse, multimod...
Building general-purpose models that can perceive diverse real-world
mod...
Large pre-trained multimodal models have demonstrated significant succes...
In this paper, we propose a Vision-Audio-Language Omni-peRception pretra...
Multimodal representation learning has shown promising improvements on
v...
In this paper, we propose an Omni-perception Pre-Trainer (OPT) for
cross...
In this paper, we consider the image captioning task from a new
sequence...
Autoregressive sequence Generation models have achieved state-of-the-art...
Image captioning transforms complex visual information into abstract nat...
Most image captioning models are autoregressive, i.e. they generate each...
Self-attention (SA) network has shown profound value in image captioning...
This document describes our solution for the VATEX Captioning Challenge ...
Image captioning attempts to generate a sentence composed of several
lin...