Neel Nanda discusses reasoning model interpretability, focusing on the challenges and advantages of interpreting models that sample tokens during generation. He highlights the stochastic nature and large search space of chain of thought, while noting that this randomness can actually make attribution easier. Nanda shares insights from the Thought Anchors paper, emphasizing resampling and causal graph analysis, and touches on a paper about base models and reasoning. The discussion explores the remaining mysteries in AI psychology, the legitimacy of reading chains of thought, and the potential for white-box techniques in latent chain-of-thought models. The episode wraps up with questions about clustering sentences semantically, language concept models, and advice on simplifying complex ideas.