Circuit analysis is a promising technique for understanding the internal...
Advances in Large Language Models (LLMs) have led to remarkable capabili...
Despite rapid adoption and deployment of large language models (LLMs), t...
Understanding the function of individual neurons within language models ...
Universality is a key hypothesis in mechanistic interpretability – that ...
Neural networks often exhibit emergent behavior, where qualitatively new...
"Induction heads" are attention heads that implement a simple algorithm ...
We apply preference modeling and reinforcement learning from human feedb...
Large-scale pre-training has recently emerged as a technique for creatin...
Collecting annotations from human raters often results in a trade-off be...
In imitation learning, imitators and demonstrators are policies for pick...