Neel Nanda discusses big-picture themes in interpretability research, including the shift from ambitious reverse engineering toward more pragmatic goals, the importance of grounding research in empirical observations, and the linearity of representations. He addresses questions about sources of truth, downstream tasks, and the linear representation hypothesis. Nanda also traces the history of interpretability, from early work on convolutional neural networks to recent research on language models, superposition, and circuit finding based on causal interventions, and answers audience questions on these topics.