As large language models have demonstrated impressive performance in man...
Action knowledge involves the understanding of textual, visual, and temp...
Recent studies have shown promising results on utilizing pre-trained ima...
Spatial control is a core capability in controllable image generation. A...
There is growing interest in searching for information from large video ...
We present Perceiver-VL, a vision-and-language framework that efficientl...
In this work, we present the Textless Vision-Language Transformer (TVLT)...
Fine-tuning large pre-trained models on downstream tasks has been adopte...
Modern image captioning models are usually trained with text similarity ...
Generating images from textual descriptions has gained a lot of attentio...
Recently, there has been an increasing interest in building question ans...
Recently, fine-tuning language models pre-trained on large text corpora ...
Since visual perception can give rich information beyond text descriptio...
Existing methods for vision-and-language learning typically require desi...
Mirroring the success of masked language models, vision-and-language cou...
Generating diverse sequences is important in many NLP applications such ...
Variational autoencoders (VAE) combined with hierarchical RNNs have emer...