How Reasoning Models Break Mechanistic Interpretability Techniques

Neel Nanda discusses the interpretability of reasoning models, focusing on the challenges and advantages that token sampling introduces. He highlights the stochastic nature and large search space of chain of thought, while noting that randomness can also make attribution easier. Nanda shares insights from the Thought Anchors paper, emphasizing resampling and causal graph analysis, and touches on a paper about base models and reasoning. The discussion covers the remaining mysteries in AI psychology, the legitimacy of reading chains of thought, and the potential for white-box techniques in latent chain-of-thought models. The episode wraps up with questions about clustering sentences semantically, language concept models, and advice on simplifying complex ideas.

Outlines

Neel Nanda

Introduction to Reasoning Model Interpretability and the Challenges of Sampling

Advantages of Randomness and the Importance of Studying Causal Graphs in Reasoning Models

Black Box vs. White Box Tooling and the Need for Reasoning Model Interpretability

Remaining Mysteries and the Functional Form of Chain of Thought Circuits

Methodological Challenges and Applied vs. Basic Science in Reasoning Model Interpretability

Further Questions on Thought Anchors and Language Concept Models

00:00 Introduction to Reasoning Model Interpretability and the Challenges of Sampling

04:56 Advantages of Randomness and the Importance of Studying Causal Graphs in Reasoning Models

13:29 Black Box vs. White Box Tooling and the Need for Reasoning Model Interpretability

21:35 Remaining Mysteries and the Functional Form of Chain of Thought Circuits

27:59 Methodological Challenges and Applied vs. Basic Science in Reasoning Model Interpretability

35:12 Further Questions on Thought Anchors and Language Concept Models