Performant vision-language (VL) models like CLIP represent captions usin...
As general purpose vision models get increasingly effective at a wide se...
General purpose vision (GPV) systems are models designed to sol...
A special purpose learning system assumes knowledge of admissible tasks ...
To avoid giving wrong answers, question answering (QA) models need to kn...