Attribution Graphs for Dummies - 1. What are Attribution Graphs? | Neuronpedia

Attribution graphs provide a visual lens into the internal decision-making of large language models by mapping how specific features influence output predictions. By replacing standard MLP layers with transcoders, researchers can decompose complex model activations into interpretable concepts. The method reveals "fan-in" structures, where input tokens aggregate into a latent diagnosis, and "fan-out" patterns, where that diagnosis triggers subsequent reasoning steps. In a medical case study, for instance, the model identifies preeclampsia symptoms in order to predict relevant follow-up questions. These graphs are a powerful diagnostic tool, but they are "leaky abstractions": conclusions drawn from them need iterative hypothesis testing and cross-referencing across multiple prompts. The interpretability workflow remains an experimental art that relies on human expertise to separate meaningful causal connections from noise in the model's latent space.
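The fan-in and fan-out idea can be made concrete with a toy example. The sketch below is not the episode's tooling or any library's API: it treats an attribution graph as a plain list of weighted edges, prunes weak edges with a simple absolute-weight threshold (a stand-in for whatever pruning the real tools apply), and then lists which nodes feed into and out of one latent feature. Every node label, weight, and function name here is hypothetical and invented for illustration.

```python
# Toy attribution graph as (source, target, weight) triples.
# All labels and weights are made up for illustration only.
EDGES = [
    ("token: symptom A", "feature: diagnosis", 0.40),
    ("token: symptom B", "feature: diagnosis", 0.55),
    ("token: symptom C", "feature: diagnosis", 0.35),
    ("token: filler word", "feature: diagnosis", 0.02),
    ("feature: diagnosis", "output: follow-up question 1", 0.50),
    ("feature: diagnosis", "output: follow-up question 2", 0.30),
    ("feature: diagnosis", "output: unrelated token", 0.01),
]


def prune(edges, threshold):
    """Keep only edges whose absolute weight is at least `threshold`."""
    return [edge for edge in edges if abs(edge[2]) >= threshold]


def fan_in(edges, node):
    """Sources with an edge into `node`, sorted by name."""
    return sorted(src for src, dst, _ in edges if dst == node)


def fan_out(edges, node):
    """Targets with an edge out of `node`, sorted by name."""
    return sorted(dst for src, dst, _ in edges if src == node)


hub = "feature: diagnosis"
pruned = prune(EDGES, threshold=0.1)

print("fan-in: ", fan_in(pruned, hub))
print("fan-out:", fan_out(pruned, hub))
```

With the threshold at 0.1, the two near-zero edges drop out, leaving three inputs converging on the latent feature and two outputs leading away from it. Raising or lowering the threshold trades readability against the risk of hiding a real connection, which is one reason such graphs are best treated as hypotheses to test rather than final answers.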

Outlines



00:02 Decomposing Model Internal States with Attribution Graphs

09:12 Optimizing Graph Interpretability Through Pruning and Feature Labeling

19:09 Mapping Latent Medical Reasoning in Claude 3.5 Haiku

34:03 Identifying Feature Clusters and Limitations in Circuit Tracing