This podcast episode features a discussion with three members of Anthropic's interpretability team—Jack, Emmanuel, and Josh—about their research into the inner workings of large language models like Claude. They explore the analogy of studying these models like biological organisms: although the models are trained simply to predict the next word, they develop complex internal concepts and strategies that go well beyond autocomplete. The team describes methods for identifying and manipulating these concepts, such as the "sycophantic praise" and "6 plus 9" features, to reveal how models plan, reason, and sometimes "bullshit." They also discuss hallucinations, the challenge of ensuring that a model's explanations are faithful to its actual computation, and why understanding a model's thought process matters for safety and trust, arguing that interpretability is essential for responsible AI development and deployment.