
Goodfire's Mark Bissell and Myra Deng join the Latent Space podcast to discuss interpretability in AI, which they define as a set of methods for understanding, learning from, and designing AI models, with an emphasis on production use and high-stakes industries. The conversation covers the challenges and opportunities of interpretability techniques such as sparse autoencoders (SAEs) and probes, including their use in detecting harmful behaviors and personally identifiable information (PII). They share a demo of activation steering on a 1-trillion-parameter model, showing real-time editing of model behavior, and discuss the equivalence between activation steering and in-context learning. They also examine interpretability's potential to surface novel scientific information, accelerate drug discovery, and improve model design, alongside the importance of addressing safety concerns and promoting intentionality in AI development.
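
To make the steering discussion concrete, here is a minimal sketch of activation steering, not Goodfire's API or the episode's demo: it adds a fixed "steering vector" to the residual stream of one transformer layer at inference time, shifting model behavior without retraining. The model (GPT-2), layer index, and steering vector below are illustrative placeholders; in practice the direction would come from something like an SAE feature or a trained probe.

```python
# Illustrative activation-steering sketch (assumptions: GPT-2 as a stand-in
# model, layer 6, and a random placeholder steering direction).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the episode demo steers a 1T-parameter model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx = 6  # which block's residual stream to edit (assumption)
steer = torch.randn(model.config.hidden_size) * 4.0  # placeholder direction;
# in practice derived from an SAE feature or a probe, not random noise.

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    hidden = output[0] + steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
try:
    ids = tok("The weather today is", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook to restore the unmodified model
```

The same hook mechanism supports real-time edits: swapping or scaling the steering vector between generations changes behavior immediately, which is the property the demo highlights.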