Untangling Neural Network Mechanisms: Goodfire's Lee Sharkey on Parameter-based Interpretability

In this episode of The Cognitive Revolution, the host interviews Lee Sharkey, Principal Investigator at Goodfire, about his work on mechanistic interpretability, focusing on parameter decomposition in neural networks. The discussion contrasts activation-based and parameter-based decomposition, highlighting the limitations of feature-centric approaches such as sparse autoencoders and the need to understand how neural networks compute within and across layers. They examine two methods, attribution-based parameter decomposition and stochastic parameter decomposition, and explain the loss functions, strengths, and weaknesses of each. The conversation also covers the challenges of scaling these methods, the role of causal importance, and potential applications such as surgical unlearning, knowledge extraction, and scientific discovery.

Outline

Part 1: Introduction and Motivation

Part 2: Attribution-Based Parameter Decomposition

Part 3: Stochastic Parameter Decomposition

Part 4: Future Applications and Conclusion

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

Part 1: Introduction and Motivation

00:00 Introduction to Mechanistic Interpretability and Parameter Decomposition

06:15 Lee Sharkey's Transition to Goodfire and the Core Concept of Parameter Decomposition

17:56 The Limitations of Concept-Based Analysis and the Importance of Understanding Computations

25:10 Examples of Multi-Layer Computations and the Need to Understand Functions

Part 2: Attribution-Based Parameter Decomposition

36:11 The Setup and Motivation for Attribution-Based Parameter Decomposition

49:40 The Loss Functions and Potential Conceptual Drift in Attribution-Based Parameter Decomposition

Part 3: Stochastic Parameter Decomposition

1:04:02 Limitations of Attribution-Based Parameter Decomposition and Introduction to Stochastic Parameter Decomposition

1:17:08 Understanding Rank-One Subcomponents and the Concept of Superposition in Parameter Space

1:25:11 Scaling Considerations and the Introduction of Causal Importance

1:34:04 Implementation of Causal Importance and Addressing Feature Splitting

1:45:01 Grouping, Labeling, and Semantic Understanding of Rank-One Units

Part 4: Future Applications and Conclusion

1:55:25 Future Applications, Scaling Costs, and Concluding Thoughts