This podcast features an interview with Neel Nanda on mechanistic interpretability, a field focused on understanding how AI models work internally. Nanda explains the empirical nature of mechanistic interpretability, emphasizing the importance of staying open to surprises when reverse-engineering networks, and discusses the prospects for a foundational understanding of deep learning models. He covers grokking, in which a model first memorizes its training data and later generalizes, and why its apparent suddenness may be an illusion. The conversation explores circuits, features, and the potential for ambitious interpretability in contrast with classical methods. Nanda also shares his views on the linear representation hypothesis, the Othello paper, superposition, and the role of induction heads in transformers, as well as the importance of identifying and addressing existential risks from AI.