Two Minute Summary

In this post I present my results from training a Sparse Autoencoder (SAE) on a CLIP Vision Transformer (ViT) using the ImageNet-1k dataset. I have built an interactive web app, 'SAE Explorer', that lets the public explore the visual features the SAE has learnt: https://sae-explorer.streamlit.app/ (best viewed on a laptop). My results illustrate that SAEs can identify sparse and highly interpretable directions in the residual stream of vision models, enabling inference-time inspection of the model's activations. To demonstrate this, the web app includes a 'guess the input image' game, in which users guess the input image purely from the SAE activations of a single layer and token of the residual stream. I have also uploaded a (slightly outdated) accompanying talk on my results, which primarily lists SAE features I found interesting: https://youtu.be/bY4Hw5zSXzQ.

The primary purpose of this post is [...]
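For listeners who want a concrete picture of what an SAE computes on a residual-stream vector, here is a minimal sketch in plain Python. It assumes a common formulation (ReLU encoder, linear decoder, MSE reconstruction loss plus an L1 sparsity penalty); the post's actual architecture and hyperparameters are not given in this summary, so every name, size, and coefficient below is hypothetical and chosen only for illustration.

```python
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode one activation vector into sparse features, then reconstruct it.

    Assumption: the decoder bias is subtracted before encoding, a common
    convention that the summary neither confirms nor rules out.
    """
    centred = [a - b for a, b in zip(x, b_dec)]
    pre = matvec(W_enc, centred)                       # W_enc: d_sae x d_model
    features = [max(0.0, p + b) for p, b in zip(pre, b_enc)]  # ReLU
    recon = matvec(W_dec, features)                    # W_dec: d_model x d_sae
    x_hat = [r + b for r, b in zip(recon, b_dec)]
    return features, x_hat

def sae_loss(x, x_hat, features, l1_coeff):
    """Return (total, mse, l1): reconstruction error plus sparsity penalty."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1, mse, l1

def l0(features):
    """Count of active (non-zero) features for one input."""
    return sum(1 for f in features if f > 0)

# Toy sizes and random weights, purely illustrative.
random.seed(0)
d_model, d_sae = 8, 32
x = [random.gauss(0, 1) for _ in range(d_model)]
W_enc = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_sae)]
W_dec = [[random.gauss(0, 0.1) for _ in range(d_sae)] for _ in range(d_model)]
b_enc = [0.0] * d_sae
b_dec = [0.0] * d_model

features, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
total, mse, l1 = sae_loss(x, x_hat, features, l1_coeff=1e-3)
active = l0(features)
```

In this framing, each entry of `features` is the activation of one learned direction, and `l0` counts how many are active for a given input, which is the sparsity quantity the outline's training-performance sections refer to.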

---

Outline:

(00:08) Two Minute Summary

(02:44) Motivation

(04:18) What is a Vision Transformer?

(06:01) What is CLIP?

(07:42) Training the SAE

(08:43) Examples of SAE Features

(09:04) Interesting and Amusing Features

(09:19) Era/Time Features:

(09:48) Place/Culture Features:

(10:24) Film/TV Features:

(10:41) Texture Features:

(10:49) Miscellaneous Features:

(11:07) NSFW Features:

(11:47) How Trustworthy are Highest Activating Images?

(12:27) Tennis Feature

(12:46) Border Terrier Feature

(12:52) Mushrooms Feature

(12:58) Birds on Branches/in Foliage

(13:05) Training Performance

(13:08) Sparsity and L0

(13:56) MSE, L1 and Model Losses

(14:26) Identifying the Ultra-Low Density Cluster

(18:58) Neuron Alignment

(21:56) Future Work

The original text contained one footnote, which was omitted from this narration.

---

First published:
April 29th, 2024

Source:
https://www.lesswrong.com/posts/bCtbuWraqYTDtuARg/towards-multimodal-interpretability-learning-sparse-2

---

Narrated by TYPE III AUDIO.

“Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers” by hugofry

LessWrong (30+ Karma)