Neel Nanda

research

∙ 07/18/2023

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Circuit analysis is a promising technique for understanding the internal...

0 Tom Lieberum, et al. ∙

research

∙ 05/31/2023

Neuron to Graph: Interpreting Language Model Neurons at Scale

Advances in Large Language Models (LLMs) have led to remarkable capabili...

0 Alex Foote, et al. ∙

research

∙ 05/02/2023

Finding Neurons in a Haystack: Case Studies with Sparse Probing

Despite rapid adoption and deployment of large language models (LLMs), t...

0 Wes Gurnee, et al. ∙

research

∙ 04/22/2023

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models

Understanding the function of individual neurons within language models ...

0 Alex Foote, et al. ∙

research

∙ 02/06/2023

A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

Universality is a key hypothesis in mechanistic interpretability – that ...

0 Bilal Chughtai, et al. ∙

research

∙ 01/12/2023

Progress measures for grokking via mechanistic interpretability

Neural networks often exhibit emergent behavior, where qualitatively new...

0 Neel Nanda, et al. ∙

research

∙ 09/24/2022

In-context Learning and Induction Heads

"Induction heads" are attention heads that implement a simple algorithm ...

8 Catherine Olsson, et al. ∙

research

∙ 04/12/2022

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

We apply preference modeling and reinforcement learning from human feedb...

2 Yuntao Bai, et al. ∙

research

∙ 02/15/2022

Predictability and Surprise in Large Generative Models

Large-scale pre-training has recently emerged as a technique for creatin...

0 Deep Ganguli, et al. ∙

research

∙ 10/04/2021

An Empirical Investigation of Learning from Biased Toxicity Labels

Collecting annotations from human raters often results in a trade-off be...

0 Neel Nanda, et al. ∙

research

∙ 02/17/2021

Fully General Online Imitation Learning

In imitation learning, imitators and demonstrators are policies for pick...

15 Michael K. Cohen, et al. ∙
