Vision-language tasks, such as VQA, SNLI-VE, and VCR, are challenging bec...
The convergence of text, visual, and audio data is a key step towards hu...
Video understanding tasks have traditionally been modeled by two separat...
Large-scale multi-modal contrastive pre-training has demonstrated great...
Human intelligence is multimodal; we integrate visual, linguistic, and a...
Cross-modal encoders for vision-language (VL) tasks are often pretrained...
In this work, we introduce Dual Attention Vision Transformers (DaViT), a...
Contrastive language-image pretraining (CLIP) links vision and language...
Contrastive language-image pretraining (CLIP) using image-text pairs has...
Automated visual understanding of our diverse and open world demands com...
We present in this paper a new architecture, named Convolutional vision...
Prior skin image datasets have not addressed patient-level information o...
Transfer learning enhances learning across tasks by leveraging previous...
The International Skin Imaging Collaboration (ISIC) is a global partners...
Melanoma is the deadliest form of skin cancer. While curable with early...