In this episode of The Cognitive Revolution, the host interviews Lee Sharkey, Principal Investigator at Goodfire, about his work on mechanistic interpretability, specifically parameter decomposition in neural networks. The discussion contrasts activation-based and parameter-based decomposition, highlighting the limitations of feature-centric approaches like sparse autoencoders and the need to understand how neural networks compute within and across layers. They examine two methods, attribution-based parameter decomposition and stochastic parameter decomposition, explaining their respective loss functions, strengths, and weaknesses. The conversation covers the challenges of scaling these methods, the central role of causal importance, and potential applications such as surgical unlearning, knowledge extraction, and scientific discovery.
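For readers unfamiliar with parameter decomposition, the following is a minimal sketch of the general idea only, not the actual losses of either method discussed (attribution-based and stochastic parameter decomposition use attribution scores and stochastic masking, respectively). It approximates a single frozen weight matrix as a sum of rank-one subcomponents, trained with a faithfulness (reconstruction) loss plus a simple sparsity penalty standing in for the causal-importance pressure described in the episode. All names, sizes, and hyperparameters here are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
d_in, d_out, K = 32, 32, 64                 # K rank-one subcomponents (assumed sizes)

W_target = torch.randn(d_out, d_in)         # frozen pretrained weights to explain
U = torch.nn.Parameter(0.1 * torch.randn(d_out, K))
V = torch.nn.Parameter(0.1 * torch.randn(K, d_in))
opt = torch.optim.Adam([U, V], lr=1e-3)

for step in range(1000):
    x = torch.randn(128, d_in)              # a batch of layer inputs
    # Per-subcomponent activity on this batch: a crude stand-in for the
    # causal importance scores discussed in the episode.
    acts = x @ V.T                          # (batch, K)
    # Faithfulness: the summed subcomponents (U @ V) should reproduce the
    # layer's behavior on the data.
    W_hat = U @ V
    faithfulness = ((x @ W_hat.T - x @ W_target.T) ** 2).mean()
    # Sparsity: push most subcomponents to be inactive on any given input.
    sparsity = acts.abs().mean()
    loss = faithfulness + 1e-2 * sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The design intent this illustrates is the one emphasized in the conversation: decompose the *parameters* themselves into pieces, only a few of which matter for any given input, rather than decomposing activations as sparse autoencoders do.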