Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In a new preprint, “Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models,” my coauthors and I introduce a technique, Sparse Human-Interpretable Feature Trimming (SHIFT), which I think is the strongest proof-of-concept yet for applying AI interpretability to existential risk reduction. In this post, I will explain how SHIFT fits into a broader agenda for what I call cognition-based oversight. In brief, cognition-based oversight aims to evaluate models according to whether they’re performing intended cognition, instead of whether they have intended input/output behavior.

In the rest of this post I will:

Articulate a class of approaches to scalable oversight I call cognition-based oversight.
Narrow in on a model problem in cognition-based oversight called Discriminating Behaviorally Identical Classifiers (DBIC). DBIC is formulated to be a concrete problem which I think captures most of the technical difficulty [...]

---

Outline:

(01:41) Cognition-based oversight

(02:10) Discriminating models: a simplification of scalable oversight

(05:20) Cognition-based oversight for discriminating behaviorally identical models

(07:02) Discriminating Behaviorally Identical Classifiers

(11:51) Hard vs. Relaxed DBIC

(13:48) SHIFT as a technique for (hard) DBIC

(19:09) Limitations and next steps

(21:13) Direction 1: better ways to understand interpretable units in deep networks

(23:02) Direction 2: identifying especially leveraged settings for cognition-based oversight

(24:47) Conclusion

The original text contained 8 footnotes which were omitted from this narration.

---

First published: April 18th, 2024

Source: https://www.lesswrong.com/posts/s7uD3tzHMvD868ehr/discriminating-behaviorally-identical-classifiers-a-model

---

Narrated by TYPE III AUDIO.

“Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight” by Sam Marks

LessWrong (30+ Karma)