08 Sept 2025
3h 1m

#222 – Neel Nanda on the race to read AI minds

80,000 Hours Podcast

In this interview, Neel Nanda, who leads the mechanistic interpretability team at Google DeepMind, discusses the project of understanding how AI models work internally and its implications for AI safety. Nanda reflects on his evolving perspective, from initial optimism about fully understanding AI to a more pragmatic view acknowledging the complexity and messiness of AI internals. The conversation covers successes in gaining insights into AI models, including extracting superhuman knowledge and detecting harmful intentions, while also addressing the limitations and challenges, such as deception and the difficulty of establishing ground truth. Nanda emphasizes the importance of using diverse tools, including probes and chain-of-thought analysis, and advocates for a task-focused approach to research, balancing ambitious reverse engineering with pragmatic problem-solving. He also offers career advice for those interested in entering the field, highlighting the need for skepticism and a focus on downstream tasks.

Outline

Part 1: Introduction and Definition

Part 2: Successes and Techniques

Part 3: Challenges and Limitations

Part 4: Interpretability Approaches

Part 5: Sparse Autoencoders

Part 6: Future Directions and Research Philosophy

Part 7: Career Advice
