Attribution Graphs for Dummies - 2. Building and Testing a Circuit | Neuronpedia

Mechanistic interpretability of large language models relies on tracing internal circuits to understand how models process factual queries. Analysis of a two-hop prompt about state capitals shows that models often combine additive mechanisms with lookup-table-like recall. Attribution graphs visualize these pathways, highlighting how specific MLP features and attention heads contribute to the output logits. Validating these circuits through feature steering is harder than it looks: ablating specific nodes can degrade the model's output or trigger compensatory behavior, reflecting the model's inherent parallelism. These investigations suggest that factual recall is rarely a single process, since models maintain multiple interpretations of a token simultaneously before disambiguating. Current interpretability tools struggle to isolate these overlapping circuits, particularly when attention patterns shift in response to interventions, so more robust methods are needed to distinguish causal pathways from incidental correlations.
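The idea of parallel pathways and ablation can be made concrete with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the feature names, activations, weights, and candidate tokens are invented, and the readout is purely linear, so it is not the episode's graph or any real model. It shows only that when two hypothetical features both push toward the same answer, removing either one alone may leave the prediction unchanged, while removing both flips it.

```python
# Toy illustration only: the feature names, activations, and weights below are
# invented for this sketch and are not taken from any real model or graph.

# Each hypothetical feature has an activation and a linear effect on each logit.
FEATURES = {
    "texas_related": (2.0, {"Austin": 1.0, "Dallas": 0.5}),
    "say_a_capital": (1.5, {"Austin": 1.0, "Sacramento": 0.5, "Dallas": -1.0}),
    "dallas_token": (0.5, {"Dallas": 1.0}),
}
TOKENS = ("Austin", "Dallas", "Sacramento")


def attributions(token):
    """Direct contribution of each feature to one logit: activation * weight."""
    return {name: act * w.get(token, 0.0) for name, (act, w) in FEATURES.items()}


def logits(ablate=()):
    """Recompute every logit with the ablated features' activations set to zero."""
    return {
        token: sum(
            contrib
            for name, contrib in attributions(token).items()
            if name not in ablate
        )
        for token in TOKENS
    }


def top_token(scores):
    """Return the token with the highest logit."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(attributions("Austin"))
    for ablated in [
        (),
        ("say_a_capital",),
        ("texas_related",),
        ("texas_related", "say_a_capital"),
    ]:
        print(ablated, top_token(logits(ablated)))
```

In this linear toy, the drop in a logit after ablating a feature equals that feature's attribution exactly. A real model does not behave this cleanly: an intervention also changes downstream activations and attention patterns, which is why causal validation by steering is harder than reading contributions off a graph.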

Outlines

02:06 Mapping Model Computation with Circuit Tracing

12:00 Identifying and Labeling Model Features

22:22 Mechanisms of Factual Recall and Polysemantic Features

38:00 Automating Node Grouping and Feature Interpretation

59:00 Causal Validation Through Steering Interventions