Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a link post.

A new Anthropic model organisms research paper, led by Carson Denison of the Alignment Stress-Testing Team, demonstrates that large language models can generalize zero-shot from simple reward hacks (sycophancy) to more complex reward tampering (subterfuge). Our results suggest that accidentally incentivizing simple reward hacks such as sycophancy can have dramatic and very difficult-to-reverse consequences for how models generalize, up to and including editing their own reward functions and covering their tracks when doing so.

Abstract:

In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too [...]

---

Outline:

(05:22) Specification gaming and reward tampering

(08:39) Generalization to reward tampering

(10:59) Can we train away reward tampering?

(12:56) Caveats and conclusions

---

First published:
June 17th, 2024

Source:
https://www.lesswrong.com/posts/FSgGBjDiaCdWxNBhj/sycophancy-to-subterfuge-investigating-reward-tampering-in

---

Narrated by TYPE III AUDIO.

“Sycophancy to subterfuge: Investigating reward tampering in large language models” by evhub, Carson Denison

LessWrong (30+ Karma)