Summary.

We present a method for performing circuit analysis on language models using "transcoders," an occasionally discussed variant of SAEs that provides an interpretable approximation to MLP sublayers' computations. Transcoders are exciting because they allow us not only to interpret the output of MLP sublayers but also to decompose the MLPs themselves into interpretable computations. In contrast, SAEs only allow us to interpret the output of MLP sublayers, not how that output was computed.
We demonstrate that transcoders achieve similar performance to SAEs (when measured via fidelity/sparsity metrics) and that the features learned by transcoders are interpretable.
One of the strong points of transcoders is that they decompose the function of an MLP layer into sparse, independently varying, and meaningful units (as neurons were originally intended to be before superposition was discovered). This significantly simplifies circuit analysis, and so for the first time, we present a method for using [...]
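
To make the SAE/transcoder distinction in the summary concrete, here is a minimal pure-Python sketch. It is not the authors' code: the class name `SparseDictionary`, the stand-in `toy_mlp`, the L1 sparsity penalty, and its coefficient are all assumptions chosen for illustration. The only point it tries to show is that an SAE reconstructs the MLP output from that same output, while a transcoder predicts the MLP output from the MLP input, so its features describe the computation itself.

```python
import random

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

class SparseDictionary:
    """ReLU encoder + linear decoder (hypothetical name; same shape for both variants)."""
    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(n_features)] for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def features(self, x):
        # Non-negative feature activations; sparsity is encouraged by the loss.
        return [max(0.0, a) for a in add(matvec(self.W_enc, x), self.b_enc)]

    def forward(self, x):
        return add(matvec(self.W_dec, self.features(x)), self.b_dec)

def loss(model, source, target, l1_coeff=1e-3):
    # Mean squared error plus an L1 penalty (assumed here as the sparsity term).
    z = model.features(source)
    out = add(matvec(model.W_dec, z), model.b_dec)
    mse = sum((o - t) ** 2 for o, t in zip(out, target)) / len(target)
    return mse + l1_coeff * sum(z)  # z >= 0, so sum(z) is its L1 norm

def toy_mlp(x):
    # Hypothetical stand-in for a real MLP sublayer.
    return [2.0 * max(0.0, v) for v in x]

def sae_loss(model, mlp_in):
    # SAE: encode the MLP *output* and reconstruct that same output.
    mlp_out = toy_mlp(mlp_in)
    return loss(model, source=mlp_out, target=mlp_out)

def transcoder_loss(model, mlp_in):
    # Transcoder: encode the MLP *input* and predict the MLP output.
    return loss(model, source=mlp_in, target=toy_mlp(mlp_in))
```

The two losses differ only in what is fed to the encoder, which is why the same architecture can serve either role in this sketch.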

---

Outline:

(02:07) Background and motivation

(04:38) Solution: transcoders

(06:16) Performance metrics

(10:45) Qualitative interpretability analysis

(10:50) Example transcoder features

(11:34) Broader interpretability survey

(12:28) Circuit analysis

(13:24) Input-independent information: pullbacks and de-embeddings

(15:38) Input-dependent information

(18:20) Obtaining circuits and graphs

(19:42) Brief discussion: why are transcoders better for circuit analysis?

(21:39) Case study

(22:05) Introduction to blind case studies

(23:55) Blind case study on layer 8 transcoder feature 355

(28:44) Evaluating our hypothesis

(29:22) Code

(31:43) Discussion

(35:13) Author contribution statement

(35:45) Appendix

(35:48) For input-dependent feature connections, why pointwise-multiply the feature activation vector with the pullback vector?

(37:09) Comparing input-independent pullbacks with mean input-dependent attributions

(39:46) A more detailed description of the computational graph algorithm

(43:14) Details on evaluating transcoders

The original text contained 11 footnotes which were omitted from this narration.

---

First published:
April 30th, 2024

Source:
https://www.lesswrong.com/posts/YmkjnWtZGLbHRbzrP/transcoders-enable-fine-grained-interpretable-circuit

---

“Transcoders enable fine-grained interpretable circuit analysis for language models” by Jacob Dunefsky, Philippe Chlenski, Neel Nanda

LessWrong (30+ Karma)