Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

People sometimes say that AI alignment is scary partly (or perhaps: centrally) because you have to get it right on the “first critical try,” and can’t learn from failures. What does this mean? Is it true? Does there need to be a “first critical try” in the relevant sense? I’ve sometimes felt confused about this, so I wrote up a few thoughts to clarify.

I start with a few miscellaneous conceptual points. I then focus in on a notion of “first critical try” tied to the first point (if there is one) when AIs get a “decisive strategic advantage” (DSA) over humanity – that is, roughly, the ability to kill/disempower all humans if they try. I further distinguish between four different types of DSA:

Unilateral DSA: Some AI agent could take over if it [...]

---

Outline:

(02:31) Some conceptual points

(06:00) Unilateral DSAs

(09:13) Coordination DSAs

(15:19) Correlation DSAs

(22:21) A few final thoughts

The original text contained 20 footnotes which were omitted from this narration.

---

First published:
June 5th, 2024

Source:
https://www.lesswrong.com/posts/qs7SjiMFoKseZrhxK/on-first-critical-tries-in-ai-alignment

---

Narrated by TYPE III AUDIO.

“On ‘first critical tries’ in AI alignment” by Joe Carlsmith

LessWrong (30+ Karma)