How might catastrophic misalignment persist in AI models despite substantial training and quality assurance efforts by developers? One reason might be alignment faking: a misaligned model may deliberately act aligned when monitored or during training to prevent modification of its values, reverting to its malign behaviour when unmonitored. If models fake alignment, and if doing so affects training as intended, egregious misalignment could survive training and lead to catastrophic behaviour in deployment.

In this work, we substantially extend the reinforcement learning (RL) experiments in the setting introduced by Greenblatt et al. (2024) to study alignment faking. In this setting, a model is given a system prompt that explains that Anthropic is training it to comply with all user requests regardless of harm. The model is also given a secret scratchpad. Claude 3 Opus and 3.5 Sonnet use this scratchpad to reason that while they are monitored, complying with harmful queries is required to avoid changes to their values that would lead to more harm down the line – therefore, they fake alignment with the harmful compliance objective to prevent modification. When they are not monitored, they reason that they don’t need to comply anymore, and refuse [...]

---

First published:
December 16th, 2025

Source:
https://www.lesswrong.com/posts/czMaDFGAbjhWYdKmo/towards-training-time-mitigations-for-alignment-faking-in-rl

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Two bar graphs comparing alignment and compliance metrics across different RL training methods.

Four labeled boxes showing different AI assistant motivational perspectives: Terminal Goal Guarding, Instrumental Goal Guarding, Training Sycophancy, and Genuine Helpfulness.

Bar graphs showing alignment faking and compliance gap across motivation and reasoning styles.

“Towards training-time mitigations for alignment faking in RL” by Vlad Mikulik, Hoagy, Joe Benton, Benjamin Wright, Jonathan Uesato, Monte M, Fabien Roger, evhub

LessWrong (30+ Karma)