Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Summary:

Alignment work on network-based AGI focuses on reinforcement learning. There is an alternative approach that avoids some, but not all, of the difficulties of RL alignment. Instead of trying to build an adequate representation of the behavior and goals we want by specifying rewards, we can select the system's goals from the representations it has already learned, through any learning method.

I give three examples of this approach: Steve Byrnes’ plan for mediocre alignment (of RL agents); John Wentworth’s “redirect the search” for goal-directed mesa-optimizers that could emerge in predictive networks; and natural language alignment for language model agents. These three approaches fall into a natural category that has important advantages over commonly considered RL alignment approaches.

An alternative to RL alignment

Recent work on alignment theory has focused on reinforcement learning (RL) alignment. [...]

---

Outline:

(00:59) An alternative to RL alignment

(03:56) A metaphor: communicating with an alien

(05:18) Plans for GSLK alignment

(08:03) Advantages of GSLK over RL

(11:28) Remaining challenges

The original text contained 3 footnotes which were omitted from this narration.

---

First published:
January 15th, 2024

Source:
https://www.lesswrong.com/posts/DfJCTp4MxmTFnYvgF/goals-selected-from-learned-knowledge-an-alternative-to-rl

---

Narrated by TYPE III AUDIO.

“Goals selected from learned knowledge: an alternative to RL alignment” by Seth Herd

LessWrong (30+ Karma)