Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post is the result of a two-week research sprint project during the training phase of Neel Nanda's MATS stream.

Executive Summary

We replicate Anthropic's MLP Sparse Autoencoder (SAE) paper on attention outputs and it works well: the SAEs learn sparse, interpretable features, which gives us insight into what attention layers learn. We study the second attention layer of a two-layer language model (with MLPs).
- Specifically, rather than training our SAE on attn_output, we train it on “hook_z” concatenated over all attention heads (aka the mixed values, aka the attention outputs before a linear map - see notation here). This is valuable because we can see how much of each feature's weights come from each head, which we believe is a promising direction for investigating attention head superposition, although we only [...]
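
To make the per-head attribution idea concrete, here is a minimal sketch of how one might measure how much of each feature's weights come from each head. It is not the authors' code. It assumes the SAE has a decoder matrix (called `W_dec` here, a hypothetical name) of shape `(n_features, n_heads * d_head)`, that the concatenation is head-major (all of head 0's dimensions, then head 1's, and so on), and that squared L2 norm is a reasonable measure of a head's share; the post may use a different measure.

```python
import numpy as np


def head_weight_fractions(W_dec: np.ndarray, n_heads: int, d_head: int) -> np.ndarray:
    """Fraction of each feature's squared decoder norm that lies in each head's slice.

    W_dec is assumed to have shape (n_features, n_heads * d_head), with the
    head dimensions concatenated head by head (an assumption, not confirmed
    by the post). Returns an array of shape (n_features, n_heads) whose rows
    sum to 1 for any feature with a non-zero decoder vector.
    """
    n_features, d_in = W_dec.shape
    if d_in != n_heads * d_head:
        raise ValueError(f"expected {n_heads * d_head} input dims, got {d_in}")

    # Split the concatenated axis back into (head, within-head dimension).
    per_head = W_dec.reshape(n_features, n_heads, d_head)

    # Squared L2 norm of each feature's weights within each head's slice.
    sq_norms = (per_head ** 2).sum(axis=-1)

    # Normalise per feature; the floor guards against all-zero rows.
    totals = sq_norms.sum(axis=-1, keepdims=True)
    return sq_norms / np.maximum(totals, 1e-12)


# Toy example with 2 heads of width 2 (made-up numbers, not real SAE weights).
toy_W_dec = np.array([
    [3.0, 4.0, 0.0, 0.0],  # all weight sits in head 0's slice
    [1.0, 0.0, 0.0, 1.0],  # weight split evenly between the two heads
])
toy_fractions = head_weight_fractions(toy_W_dec, n_heads=2, d_head=2)
```

In this toy case the first feature is attributed entirely to head 0 and the second is split evenly. A feature whose fractions are spread across many heads would be a natural candidate to examine when looking for attention head superposition.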

---

Outline:

(00:20) Executive Summary

(05:00) Our SAE

(07:54) Feature Deep Dives

(08:48) Induction Feature: Board induction

(18:02) Local Context Feature: In questions starting with ‘Which’

(26:17) High-Level Context Feature: In a text related to pets feature

(28:38) Automatic Induction Feature Detection

(32:54) Towards Attention Head Polysemanticity with Head Ablation

(36:01) Limitations

(36:36) Appendix

(36:45) Citing this work

(36:58) Author Contributions Statement

The original text contained 9 footnotes which were omitted from this narration.

---

First published:
January 16th, 2024

Source:
https://www.lesswrong.com/posts/DtdzGwFh9dCfsekZZ/sparse-autoencoders-work-on-attention-layer-outputs

---

“Sparse Autoencoders Work on Attention Layer Outputs” by Connor Kissane, robertzk, Arthur Conmy, Neel Nanda

LessWrong (30+ Karma)