Multilingual image captioning has recently been tackled by training with...
Recent advances in image captioning are mainly driven by large-scale
vis...
Inspired by retrieval-augmented language generation and pretrained Visio...
Most vision-and-language pretraining research focuses on English tasks.
...
Non-hierarchical sparse attention Transformer-based models, such as
Long...
Recent advances in image captioning have focused on scaling the data and...
Language models are defined over a finite set of inputs, which creates a...
The recent literature in text classification is biased towards short tex...
Domain adaptive pretraining, i.e. the continued unsupervised pretraining...
Pretrained vision-and-language BERTs aim to learn representations that
c...
Image captioning has focused on generalizing to images drawn from the sa...
Large-scale pretraining and task-specific fine-tuning is now the standar...
Visual context has been shown to be useful for automatic speech recognit...
Visually-grounded models of spoken language understanding extract semant...
Multimodal automatic speech recognition systems integrate information fr...
Approaches to Grounded Language Learning typically focus on a single
tas...
Large-scale pretrained language models are the major driving force behin...
Multimodal machine translation involves drawing information from more th...
Recent work has highlighted the advantage of jointly learning grounded
s...
Image captioning models are usually evaluated on their ability to descri...
Recent work has shown that visual context improves cross-lingual sense
d...
In this paper, we introduce How2, a multimodal collection of instruction...
Recent work has shown how to learn better visual-semantic embeddings by
...
We present the results from the second shared task on multimodal machine...
Automatic image description systems are commonly trained and evaluated o...
We decompose multimodal translation into two sub-tasks: learning to tran...
In recent years we have seen rapid and significant progress in automatic...
We provide a qualitative analysis of the descriptions containing negatio...
We introduce the Multi30K dataset to stimulate multilingual multimodal
r...
Automatic description generation from natural images is a challenging pr...
In this paper we present an approach to multi-language image description...