Neel Nanda opens a discussion of the current state and future directions of interpretability research in AI, outlining key changes in the field over the past 12 to 24 months: improved model capabilities on agentic tasks, the growing use of inference-time compute, and advances in creating model organisms. He emphasizes the importance of reasoning-model interpretability, AI psychology, and automating interpretability, and addresses interpretability's role in debugging model failures, lie detection, and steering fine-tuning. Nanda is skeptical that the field can predict which areas of interpretability will matter most, and accordingly advocates diversification and experimentation. The discussion then opens to other speakers, who challenge some of his claims, particularly on the balance between understanding and control in interpretability research.