YouTube, 24 Nov 2025
42m

How Reasoning Models Break Mechanistic Interpretability Techniques

Neel Nanda

Neel Nanda discusses the interpretability of reasoning models, focusing on the challenges and advantages of interpreting models that sample tokens. He highlights the stochastic nature and large search space of chain of thought, while noting that this randomness can also make attribution easier. Nanda shares insights from the Thought Anchors paper, emphasizing resampling and causal graph analysis, and touches on a paper about base models and reasoning. The discussion explores the remaining mysteries in AI psychology, the legitimacy of reading chains of thought, and the potential for white-box techniques in latent chain-of-thought models. The podcast wraps up with questions about clustering sentences semantically, language concept models, and advice on simplifying complex ideas.
