TL;DR

We apply the method of SelfIE/Patchscopes to explain SAE features: we give the model a prompt like “What does X mean?”, replace the residual stream at the X token with the feature's decoder direction times some scale, and have the model generate an explanation. We call this self-explanation.
The natural alternative is auto-interp, which uses a larger LLM to spot patterns in max-activating examples. We show that our method is effective and comparable with Neuronpedia's auto-interp labels (with the caveat that Neuronpedia's auto-interp used the comparatively weak GPT-3.5, so this is not a fully fair comparison).
We aren’t confident you should use our method over auto-interp, but we think it has advantages in some situations: no max-activating dataset examples are needed, and it's cheaper, as you only run the model being studied (e.g. Gemma 2B) rather than a larger model like GPT-4.
- Further, it has different errors to auto-interp, so [...]
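
To make the core step of self-explanation concrete, here is a minimal toy sketch of the residual-stream replacement described above. It is not our actual implementation: the names `find_placeholder` and `patch_residual`, the whitespace-style token list, and the 3-dimensional vectors are all hypothetical stand-ins. In a real setting the replacement would happen inside the model's forward pass (for example via a hook at the chosen layer), and the patched model would then generate the explanation.

```python
from typing import List

Vector = List[float]


def find_placeholder(tokens: List[str], placeholder: str = "X") -> int:
    """Return the index of the placeholder token whose residual gets overwritten."""
    return tokens.index(placeholder)


def patch_residual(
    resid: List[Vector], pos: int, decoder_dir: Vector, scale: float
) -> List[Vector]:
    """Return a copy of `resid` with position `pos` replaced by scale * decoder_dir."""
    if len(decoder_dir) != len(resid[pos]):
        raise ValueError("decoder direction must match the residual width")
    patched = [list(v) for v in resid]  # copy so the original stream is untouched
    patched[pos] = [scale * x for x in decoder_dir]
    return patched


# Toy stand-ins (assumptions): a hand-tokenized prompt and a fake residual stream.
tokens = ["What", "does", "X", "mean", "?"]
resid = [[0.1 * i, -0.1 * i, 0.0] for i in range(len(tokens))]
decoder_dir = [1.0, 0.0, -1.0]  # hypothetical SAE decoder direction

pos = find_placeholder(tokens)
patched = patch_residual(resid, pos, decoder_dir, scale=4.0)
```

Only the residual at the placeholder position changes; every other position is left as the model computed it. The `scale` argument corresponds to the "some scale" in the description above and is a free parameter of the method.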