Atticus Geiger of Goodfire discusses the role of causality in understanding black-box systems, focusing on mechanistic interpretability. He begins by defining mechanistic interpretability, contrasting narrow and broad definitions of the field in both cultural and technical terms, and advocates for understanding neural network internals via causal mechanisms. The discussion covers activation steering, including difference-in-means steering and AxBench evaluations, which found prompting and fine-tuning to be more effective than steering. Geiger then turns to causal mediation analysis, explaining direct and indirect effects with examples such as prompt edits and causal tracing that probe how models process information. Finally, he presents causal abstraction, which treats both neural networks and the algorithms they might implement as causal models, using the example of a transformer adding two-digit numbers to illustrate how interventions can reveal structural correspondences between a high-level causal model and the low-level network. The lecture concludes with a Q&A.
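As background on the steering method named above, here is a minimal sketch of difference-in-means steering in PyTorch. The function names, tensor shapes, and the scaling factor are illustrative, not from the lecture; the idea is simply to subtract the mean activation on concept-negative prompts from the mean on concept-positive prompts, then add the scaled result back into the residual stream at inference time.

```python
import torch

def diff_in_means_vector(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    # Steering vector: mean activation over concept-positive prompts
    # minus mean activation over concept-negative prompts.
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

def steer(resid: torch.Tensor, vector: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    # Add the scaled steering vector to every residual-stream position.
    return resid + alpha * vector

# Toy usage with random stand-ins for one layer's activations.
pos_acts = torch.randn(64, 512)   # activations on prompts that exhibit the concept
neg_acts = torch.randn(64, 512)   # activations on matched prompts that do not
v = diff_in_means_vector(pos_acts, neg_acts)
steered = steer(torch.randn(10, 512), v)  # 10 token positions, hidden size 512
```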
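Likewise, a minimal sketch of the interchange intervention at the heart of causal abstraction, under stated assumptions: TinyNet is a toy stand-in for the transformer, and the `patch` argument and variable names are hypothetical rather than Geiger's actual setup. The test is whether overwriting an internal site during a base run with the activation cached from a source run changes the output the way the high-level causal model predicts.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    # Stand-in for the low-level model; `h` plays the role of the internal
    # site hypothesized to align with a high-level causal variable
    # (e.g. the carry digit when adding two-digit numbers).
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(2, 8)
        self.dec = nn.Linear(8, 1)

    def forward(self, x, patch=None):
        h = torch.relu(self.enc(x))
        if patch is not None:       # interchange intervention: overwrite the site
            h = patch
        return self.dec(h), h

def interchange(model, base, source):
    # Cache the aligned activation from the source run, then replay the
    # base run with that activation patched in.
    with torch.no_grad():
        _, h_src = model(source)
        out, _ = model(base, patch=h_src)
    return out

model = TinyNet()
base, source = torch.tensor([[3.0, 4.0]]), torch.tensor([[8.0, 9.0]])
patched_out = interchange(model, base, source)
# If the site really computes the high-level variable, patched_out should
# match the high-level causal model's prediction under the same swap.
```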