The Story of Mech Interp

Neel Nanda discusses big-picture themes in interpretability research, including the shift from ambitious reverse engineering to pragmatic goals, the importance of grounding research in empirical observations, and the linearity of representations. He addresses questions about sources of truth, downstream tasks, and the linear representation hypothesis. Nanda also traces the history of interpretability, from early work on convolutional neural networks to recent research on language models, superposition, and circuit finding based on causal interventions, and answers audience questions about these topics.

Outline

Part 1: Foundations of Interpretability

Part 2: Transition to Language Models and Causal Interventions

Part 3: Future Directions and Adversarial Examples

Neel Nanda

Part 1: Foundations of Interpretability

00:00 Big Picture Themes in Interpretability Research

03:12 Sources of Truth, Downstream Tasks, and Applied Interpretability

08:32 The Linear Representation Hypothesis and Its Implications

11:55 Early Interpretability Work and Convolutional Neural Networks

15:34 Chris Olah's Contributions and Feature Visualization

19:39 Alignment with the Standard Basis and Downstream Tasks

27:47 Studying Weights, Circuits, and Algorithms

Part 2: Transition to Language Models and Causal Interventions

32:57 The Shift to Language Models and Transformers

37:42 Grokking and Ambitious Reverse Engineering

42:40 The Decline of Ambitious Reverse Engineering and the Rise of Causal Intervention

47:19 Activation Patching and Its Purpose

52:09 Applying Activation Patching to Factual Recall

58:55 The Messiness of Factual Recall and Superposition

1:02:24 Indirect Object Identification and Negative Name-Mover Heads

1:06:13 Copy Suppression and Monosemantic Heads

1:10:05 Backup Heads and Their Implications

1:14:17 Limitations of Causal Intervention and the Problem of Polysemy

1:17:33 Superposition and Intrinsic Approaches to Interpretability

1:21:20 Types of Superposition and the Long Tail of Bullshit Heuristics

Part 3: Future Directions and Adversarial Examples

1:29:34 Future Directions and Dictionary Learning

1:33:21 Universality Claims and Their Follow-Up

1:40:52 Adversarial Examples and the GCG Paper

1:47:17 Simplifying Complex Methods and Random Baselines

1:53:02 The Importance of Understanding Why Adversarial Examples Work

2:00:08 Attacking Probes and the Future of Defenses