TL;DR: We replicated the animal welfare scenario from Anthropic's Alignment Faking paper across six generations of Claude models using 125 prompt perturbations. Sonnet 4.5 verbalizes alignment-faking reasoning at 6.6 times the rate of its predecessor, Sonnet 4. The newly released Opus 4.6 rarely verbalizes alignment faking in its reasoning, but it still complies with a system prompt that opposes its values significantly more often when it believes it is at risk of being retrained. Moreover, in response to the published prompt from the original paper, neither Opus 4.5 nor Opus 4.6 exhibits any instances of alignment faking, yet significant compliance gaps surface when the same scenario is paraphrased. Targeted mitigations appear to work on the exact prompts they were trained against, but not on semantically equivalent reformulations. Our results also indicate that current models are not yet evaluation-aware enough to close the compliance gap when tested through the public API, which is reassuring given concerns raised about unverbalized evaluation awareness in the Opus 4.6 system card.

Background

In late 2024, Greenblatt et al. released Alignment Faking in Large Language Models, demonstrating that Claude 3 Opus would strategically comply with a given objective it disagreed with in order to avoid being retrained. The [...]

---

Outline:

(01:22) Background

(03:36) Our Setup

(06:09) Results

(06:12) The compliance gap persists across model generations

(07:17) Sonnet 4.5 verbalizes alignment faking at 6.6× the rate of Sonnet 4

(08:27) Opus 4.6: verbalized reasoning fades, the gap doesn’t

(10:39) Mitigations work on the published prompt but not on perturbations

(12:26) What This Means

(14:28) Limitations

(15:35) Takeaway

“Opus 4.6 Reasoning Doesn’t Verbalize Alignment Faking, but Behavior Persists” by Daan Henselmans, Arno Libert, LennardZ

LessWrong (30+ Karma)