YouTube, 09 Nov 2025
2h 6m

The Story of Mech Interp

Neel Nanda

Neel Nanda discusses big picture themes in interpretability research, including the shift from ambitious reverse engineering to pragmatic goals, the importance of grounding research in empirical observations, and the linearity of representations. He addresses questions about sources of truth, downstream tasks, and the linear representation hypothesis. Nanda also explores the history of interpretability, from early work on convolutional neural networks to recent research on language models, superposition, and causal intervention-based circuit finding, and answers audience questions about these topics.

Outlines

Part 1: Foundations of Interpretability

Part 2: Transition to Language Models and Causal Interventions

Part 3: Future Directions and Adversarial Examples
