In this interview, Neel Nanda, who leads the mechanistic interpretability team at Google DeepMind, discusses the project of understanding how AI models work internally and what it means for AI safety. Nanda reflects on how his perspective has evolved, from initial optimism about fully understanding AI to a more pragmatic view that acknowledges the complexity and messiness of AI internals. The conversation covers successes in gaining insights into AI models, such as extracting superhuman knowledge and detecting harmful intentions, while also addressing limitations and challenges, including deception and the difficulty of establishing ground truth. Nanda emphasizes the importance of using diverse tools, including probes and chain-of-thought analysis, and advocates a task-focused approach to research that balances ambitious reverse engineering with pragmatic problem-solving. He also offers career advice for those interested in entering the field, highlighting the need for skepticism and a focus on downstream tasks.