YouTube · 28 Oct 2023
3h 57m

Mechanistic Interpretability - NEEL NANDA (DeepMind)

Machine Learning Street Talk

In this interview, Neel Nanda discusses mechanistic interpretability, a field focused on understanding how AI models work internally. Nanda describes mechanistic interpretability as an empirical discipline and emphasizes staying open to surprises when reverse-engineering networks. He touches on whether a foundational understanding of deep learning models is possible and on the illusion of grokking, a phenomenon in which a model first memorizes and later generalizes. The conversation explores circuits, features, and the potential for ambitious interpretability, contrasting it with classical methods. Nanda also shares his views on the linear representation hypothesis, the Othello paper, superposition, and the role of induction heads in transformer models, and on the importance of identifying and addressing AI existential risks.

Outlines

Part 1: Introduction to Interpretability

Part 2: Grokking and Generalization

Part 3: Representations and World Models

Part 4: Techniques and Interventions

Part 5: Superintelligence and AI Safety
