Deployed multimodal systems can fail in ways that evaluators did not ant...
Auditing large language models for unexpected behaviors is critical to p...
Large language models generate complex, open-ended outputs: instead of o...
Selective classification, in which models are allowed to abstain on unce...
Despite excellent performance on many tasks, NLP systems are easily fool...