Artificial agents have traditionally been trained to maximize reward, wh...
In recent years, deep neural networks have demonstrated increasingly str...
We introduce several new datasets namely ImageNet-A/O and ImageNet-R as ...
While programming is one of the most broadly applicable skills in modern...
Many intellectual endeavors require mathematical problem solving, but th...
We propose a new test to measure a text model's multitask accuracy. The ...
We show how to assess a language model's knowledge of basic concepts of
...
We introduce three new robustness benchmarks consisting of naturally
occ...
Detecting out-of-distribution examples is important for safety-critical
...
We introduce DIODE, a dataset that contains thousands of diverse high
re...
We introduce natural adversarial examples -- real-world, unmodified, and...