In this post, we analyze some simple games that model the problem of ensuring AIs never succeed at taking a single catastrophic action (which we'll refer to as "escaping") during deployment. This exercise highlights some of the difficulties of deploying misaligned models when auditing is perfect but the auditing budget is limited, and also suggests some opportunities for technique development.

We analyzed these games while writing AI Control: Improving Safety Despite Intentional Subversion. (Thanks to harfe for solving one of these games, and to Ryan Greenblatt and Kshitij Sachan for discussion about these at the time.)

Core takeaways:

Indistinguishable episodes (i.e. the AI doesn't know what time it is) + tiny escape success rate + catch-then-shut-down can yield a high chance of winning even when only a small fraction of episodes is audited, but only if the success rate is really tiny or if failures are loud (i.e. we notice failed attempts).
We can [...]
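
To make the first takeaway concrete, here is a minimal Python sketch of one possible version of such a game. It is not necessarily the model used in the post. It assumes (all hypothetical choices made for illustration) that each episode is audited independently with probability `audit_rate`, that the AI attempts an escape in each episode with probability `attempt_rate`, that an attempt in an audited episode is always caught and ends the game in our favor, and that an attempt in an unaudited episode succeeds with probability `success_rate`. When `loud_failures` is true, a failed attempt is noticed and also triggers a shutdown.

```python
def win_probability(n_episodes, audit_rate, attempt_rate, success_rate, loud_failures):
    """Probability that no escape ever succeeds, under the assumed toy model."""
    # Per-episode probabilities of each game-ending event (assumed model).
    caught = attempt_rate * audit_rate
    unaudited_attempt = attempt_rate * (1.0 - audit_rate)
    escaped = unaudited_attempt * success_rate
    # A failed attempt only ends the game if failures are loud.
    noticed_failure = unaudited_attempt * (1.0 - success_rate) if loud_failures else 0.0

    terminal = caught + escaped + noticed_failure
    if terminal == 0.0:
        return 1.0  # the AI never attempts, so it never escapes

    # The game ends at the first terminal event; given that one occurs,
    # it is an escape with probability escaped / terminal.
    p_game_ends = 1.0 - (1.0 - terminal) ** n_episodes
    return 1.0 - p_game_ends * escaped / terminal


# Hypothetical numbers: 100,000 episodes, 2% audited, AI attempts every episode.
for success_rate, loud in [(0.001, False), (0.1, False), (0.1, True)]:
    p = win_probability(100_000, 0.02, 1.0, success_rate, loud)
    print(f"success_rate={success_rate}, loud_failures={loud}: P(win) = {p:.3f}")
```

With these made-up numbers, a 0.1% success rate with silent failures gives a win probability of roughly 0.95, a 10% success rate with silent failures gives roughly 0.17, and a 10% success rate with loud failures gives roughly 0.90. This matches the qualitative claim: a small auditing fraction suffices only when successes are very rare or failures are noticed.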

---

Outline:

(01:23) Assumptions used across all toy models

(02:27) Distinguishable episodes with limited auditing

(03:35) Indistinguishable episodes with limited auditing

(04:50) Indistinguishable episodes with limited auditing and escape attempts which sometimes fail silently

(06:10) Indistinguishable episodes with limited auditing and escape opportunities of various levels of quality

(08:04) Indistinguishable episodes with limited average auditing in an infinite game

(09:24) Coordinated escape in mostly indistinguishable episodes with limited auditing

---

First published:
February 6th, 2024

Source:
https://www.lesswrong.com/posts/MDeGts4Aw9DktCkXw/toy-models-of-ai-control-for-concentrated-catastrophe

---

Narrated by TYPE III AUDIO.

“Toy models of AI control for concentrated catastrophe prevention” by Fabien Roger, Buck

LessWrong (30+ Karma)