
Attribution graphs provide a visual lens into the internal decision-making processes of large language models by mapping how specific features influence output predictions. By utilizing transcoders to replace standard MLP layers, researchers can decompose complex model activations into interpretable concepts. This methodology reveals "fan-in" structures, where input tokens aggregate into latent diagnoses, and "fan-out" patterns, where those diagnoses trigger subsequent reasoning steps. For instance, in a medical case study, the model identifies preeclampsia symptoms to predict relevant follow-up questions. While these graphs offer a powerful diagnostic tool, they function as "leaky abstractions" that require iterative hypothesis testing and cross-referencing across multiple prompts to ensure accuracy. This interpretability workflow remains an experimental art, relying on human expertise to distinguish between meaningful causal connections and noise within the model's latent space.
Sign in to continue reading, translating and more.
Open full episode in Podwise