This is just a simple idea that came to me. Maybe other people found it earlier; I'm not sure.

Imagine two people, Alice and Bob, wandering around London. Bob's goal is to get to Tower Bridge. When he gets there, he'll receive a cash prize equal to the number of minutes remaining until midnight, multiplied by X pounds per minute. He's also carrying a radio receiver.

Alice is also walking around, doing some chores of her own which we don't need to be concerned with. She is carrying a radio transmitter with a button. If/when the button is pressed (maybe because Alice presses it, or Bob takes it from her and presses it, or she randomly bumps into something), Bob gets notified that his goal changes: there'll be no more reward for getting to Tower Bridge, and he needs to get to St Paul's Cathedral instead. His reward coefficient X also changes: the device notes Bob's location at the time the button is pressed, calculates the expected travel times to Tower Bridge and to St Paul's from that location, and adjusts X so that the expected reward at the time of the button press remains the same. For example [...]
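
The adjustment of X can be sketched in a few lines of Python. This is not the post's own example, only an illustration under assumptions the text does not spell out: that the expected reward at the moment of the button press is X times (minutes until midnight minus expected travel time), and that both destinations can be reached before midnight. The function names, parameter names, and numbers are all hypothetical.

```python
def expected_reward(x, minutes_to_midnight, expected_travel_minutes):
    """Expected prize in pounds, assuming arrival before midnight."""
    return x * (minutes_to_midnight - expected_travel_minutes)


def adjusted_coefficient(x_old, minutes_to_midnight, eta_old_goal, eta_new_goal):
    """Return the new X that keeps the expected reward unchanged at the press.

    Solves x_new * (T - eta_new_goal) == x_old * (T - eta_old_goal) for x_new,
    where T is the number of minutes until midnight (assumed model).
    """
    remaining_old = minutes_to_midnight - eta_old_goal
    remaining_new = minutes_to_midnight - eta_new_goal
    if remaining_new <= 0:
        # The assumed model breaks down if the new goal cannot be reached in time.
        raise ValueError("new goal is not reachable before midnight in expectation")
    return x_old * remaining_old / remaining_new


# Hypothetical numbers: 300 minutes to midnight, Tower Bridge is 60 minutes
# away, St Paul's is 100 minutes away, and X starts at 2 pounds per minute.
x_new = adjusted_coefficient(2.0, 300, eta_old_goal=60, eta_new_goal=100)
before = expected_reward(2.0, 300, 60)
after = expected_reward(x_new, 300, 100)
print(x_new, before, after)
```

With these made-up numbers the new destination is farther away, so X rises to compensate, and the expected reward before and after the press comes out the same. Under this model, pressing the button neither helps nor hurts Bob in expectation.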

---

First published:
November 2nd, 2025

Source:
https://www.lesswrong.com/posts/LGSMepAfve8DyNp7b/a-toy-model-of-corrigibility

---

Narrated by TYPE III AUDIO.

“A toy model of corrigibility” by cousin_it

LessWrong (30+ Karma)