Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

We’ve released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post:

We summarize the paper;
We compare our methodology to the one used in other safety papers.

The next post in this sequence (which we’ll release in the coming weeks) discusses what we mean by AI control and argues that it is a promising methodology for reducing risk from scheming models.

Here's the abstract of the paper:

As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models [...]

---

Outline:

(02:47) Paper summary

(02:50) Setup

(05:00) Evaluation methodology

(05:38) Results

(06:50) Relationship to other work

(07:17) We explicitly measure how robust techniques are against scheming AIs

(09:52) We focus on a setting with concentrated catastrophes and expensive high-quality trusted labor

(13:14) We emphasize an evaluation methodology that can be applied in-place at crunch time

(13:51) Future work

---

First published: December 13th, 2023

Source: https://www.lesswrong.com/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion

---

Narrated by TYPE III AUDIO.

“AI Control: Improving Safety Despite Intentional Subversion” by Buck, Fabien Roger, ryan_greenblatt, Kshitij Sachan

LessWrong (30+ Karma)