The Utility of Interpretability — Emmanuel Amiesen
Latent Space: The AI Engineer Podcast
In this co-hosted podcast episode, hosts Swyx and Vibhu talk with Emmanuel Amiesen from Anthropic about the latest mechanistic interpretability (MechInterp) work, focusing on circuit tracing in language models. Emmanuel details the recent release of code that lets users explore and experiment with open-source models like Gemma, and explains how to trace a model's computation as it predicts a token. The conversation covers open questions in the field, ways to contribute, and why understanding model internals matters for safety and model improvement, including the superposition hypothesis, sparse autoencoders, and the creation of interpretable models. They also explore practical applications such as steering model behavior and investigating jailbreaks, and touch on the importance of high-quality data visualization for communicating complex research.
Part 1: Introduction to Circuit Tracing
Part 2: MechInterp Fundamentals
Part 3: Model Mechanisms and Reasoning
Part 4: Challenges and Future Directions