Adapters present a promising solution to the catastrophic forgetting pro...
Benchmarks for language-guided embodied agents typically assume text-bas...
We propose the Vision-and-Augmented-Language Transformer (VAuLT). VAuLT ...
Aligning image and text encoders from scratch using contrastive learning...
Current state-of-the-art vision-and-language models are evaluated on tas...
Numerous works have analyzed biases in vision and pre-trained language m...
While neural models have been shown to exhibit strong performance on
sin...
Visual context has been shown to be useful for automatic speech recognit...
Multimodal automatic speech recognition systems integrate information fr...
Speech is understood better by using visual context; for this reason, th...
In Neural Machine Translation (NMT) the usage of subwords and characters...
Neural dialog models have exhibited strong performance, however their
en...
Multimodal learning allows us to leverage information from multiple sour...