Mechanistic Interpretability - NEEL NANDA (DeepMind) | Machine Learning Street Talk

In this episode, Neel Nanda discusses mechanistic interpretability, a field focused on understanding how AI models work internally. Nanda explains the empirical nature of mechanistic interpretability and emphasizes the importance of being open to surprises when reverse-engineering networks. He touches on the possibility of a foundational understanding of deep learning models and on grokking, in which memorization is followed by generalization, as an illusion. The conversation explores circuits, features, and the potential for ambitious interpretability, contrasting it with classical methods. Nanda also shares his views on the linear representation hypothesis, the Othello paper, superposition, and the role of induction heads in transformer models, as well as the importance of identifying and addressing AI existential risks.

Outline

Part 1: Introduction to Interpretability

Part 2: Grokking and Generalization

Part 3: Representations and World Models

Part 4: Techniques and Interventions

Part 5: Superintelligence and AI Safety


Part 1: Introduction to Interpretability

00:00 Interpretability in AI Security and Mechanistic Interpretability as an Empirical Science

04:45 Foundational Understanding of Deep Learning Models and Grokking as an Illusion

07:31 Introduction to Mechanistic Interpretability and its Potential

14:14 Contrasting Mechanistic Interpretability with Classical Interpretability

20:41 Vision for Interpretability in Advanced Models and the Neats vs. Scruffies Dichotomy

Part 2: Grokking and Generalization

27:05 Universal Cognitive Priors, Transformer Constraints, and Grokking Clarification

33:21 Three Phases of Training Underlying Grokking and Reverse Engineering Modular Addition

40:08 Algorithmic Generalization and the Grokking Spark

47:37 Weight Decay, Inductive Bias, and Representation Theory

52:55 Universality in Group Operations and the Role of Inductive Biases

1:01:02 Takeaways from Toy Models and the Lottery Ticket Hypothesis

Part 3: Representations and World Models

1:07:06 Linear vs. Geometric Representations and the Structure of Model Representations

1:13:34 Intension vs. Extension and the Limits of Compositionality

1:20:31 World Models vs. Surface Statistics and the Othello Paper

1:27:03 Data-Driven Models and the Othello Game

1:33:10 Linear Probing and the Othello Model

1:41:15 Linear Representation and Superposition

1:47:56 Computational vs. Representational Superposition

1:53:52 Criticisms of the Anthropic Superposition Paper

2:00:04 Finding Neurons in a Haystack and De-Tokenization

2:07:30 Grounding and the Role of Inductive Priors

2:13:40 The Curse of Neural Networks and the Linear Representation Hypothesis

Part 4: Techniques and Interventions

2:20:03 The Residual Stream as a Central Object

2:27:30 Activation Patching and Causal Interventions

2:33:36 The Importance of Metrics and the Power of Mechanistic Interpretability

Part 5: Superintelligence and AI Safety

2:40:03 Superintelligence and the Effective Altruism Movement

2:47:50 The Second Species Argument and the Importance of Understanding

2:53:08 The Importance of Specific Circuits and the Definition of Goals

3:00:00 The Orthogonality Thesis and the Importance of Empirical Claims

3:07:30 The Importance of Understanding and the Role of Interpretability

3:14:11 The Importance of Empirical Claims and the Role of Interpretability

3:21:31 The Importance of Empirical Claims and the Role of Interpretability

3:28:30 The Importance of Empirical Claims and the Role of Interpretability

3:35:58 The Importance of Empirical Claims and the Role of Interpretability

3:43:35 The Importance of Empirical Claims and the Role of Interpretability

3:50:08 The Importance of Empirical Claims and the Role of Interpretability