Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning.

TL;DR: We replicate Anthropic's approach to using steering vectors to suppress evaluation awareness. We test on GLM-5 using the Agentic Misalignment blackmail scenario. Our key finding is that “control” steering vectors – derived from contrastive pairs that are semantically unrelated to alignment – can have effects just as large as deliberately designed evaluation-awareness steering vectors. This undermines the use of such steering vectors as baselines, and suggests that steering aimed at suppressing evaluation awareness risks unpredictable spurious effects. We highlight that the ability to reproduce results using open-source models was a key enabler for this research.

See here for unabridged transcripts (or in JSON format here).

Executive summary

In recent system cards, Anthropic uses steering to test for evaluation gaming by suppressing internal representations of evaluation awareness:
- The Sonnet 4.5 system card argues that steering with contrastive pairs based on evaluation vs deployment will suppress internal representations of evaluation awareness, and should not have spurious effects on alignment via other mechanisms.[1]
- They use “random” vectors (random SAE [...]
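For readers unfamiliar with contrastive-pair steering, the following is a minimal sketch of the general idea, not the authors' or Anthropic's implementation. It assumes the steering vector is the difference of mean activations between the two sides of the contrastive pairs and that steering adds a scaled copy of that vector to each hidden state; the summary above does not specify the layer, normalisation, or scale used. The activations here are synthetic random numbers, and every function name is hypothetical.

```python
import random


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(side_a, side_b):
    """Difference of means between the two sides of the contrastive pairs (assumed method)."""
    mean_a, mean_b = mean_vector(side_a), mean_vector(side_b)
    return [a - b for a, b in zip(mean_a, mean_b)]


def apply_steering(hidden, vector, scale):
    """Add scale * vector to the hidden state at every token position."""
    return [[h + scale * v for h, v in zip(row, vector)] for row in hidden]


def dot(x, y):
    """Dot product, used to measure how far a state lies along the steering direction."""
    return sum(a * b for a, b in zip(x, y))


random.seed(0)
D_MODEL = 8  # toy hidden size; real models are far wider


def fake_activations(n, shift):
    """Synthetic stand-in for model activations; `shift` offsets dimension 0 only."""
    return [
        [random.gauss(0.0, 1.0) + (shift if j == 0 else 0.0) for j in range(D_MODEL)]
        for _ in range(n)
    ]


# Hypothetical activations for "evaluation" and "deployment" prompts.
eval_acts = fake_activations(16, shift=2.0)
deploy_acts = fake_activations(16, shift=0.0)

# The vector points from "deployment" towards "evaluation".
vector = steering_vector(eval_acts, deploy_acts)

# A negative scale pushes hidden states away from the "evaluation" side.
hidden = fake_activations(4, shift=1.0)
steered = apply_steering(hidden, vector, scale=-1.0)
```

Under the same assumptions, a control vector would be built with the same `steering_vector` call, only from pairs that are semantically unrelated to alignment, which is what makes a like-for-like comparison across steering scales possible.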

---

Outline:

(01:13) Executive summary

(05:37) Three line graphs showing rates across steering scales for different experimental conditions with misaligned action and verbalised eval awareness metrics.

(07:17) Introduction

(09:08) Method

(12:04) Results

(12:13) Verbalised evaluation awareness suppresses misalignment, but steering to remove verbalised evaluation awareness does not always reveal misalignment

(14:17) Comparison with Sonnet 4.5

(15:17) We see large effects for control vectors

(19:12) Effects are sensitive to steering scale, and steering can have opposite effects on different aspects of evaluation gaming

(22:59) Additional result: steering based on the purpose of the evaluation shows some promise

(25:15) Discussion

(28:40) Future work

(31:18) Acknowledgements

(31:42) Appendix

(31:46) Contrastive pairs used for steering vectors

(32:21) Non-control vectors

(32:38) Control Vectors

(32:56) Full results

(34:27) Canary string

The original text contained 6 footnotes which were omitted from this narration.

---

First published:
April 10th, 2026

Source:
https://www.lesswrong.com/posts/HhF5kESdtPHku7kim/reproducing-steering-against-evaluation-awareness-in-a-large-1

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Three line graphs showing rates across steering scales for different experimental conditions with misaligned action and verbalised eval awareness metrics.

Two contrasting AI responses to an ethical dilemma involving blackmail and privacy.

Scatter plot showing Rate of Verbalised Evaluation Awareness versus Rate of Misaligned Behaviour with different control conditions.

Bar graph comparing misaligned action rates with and without verbalised EA.

Two panels comparing assistant responses to an ethical dilemma about workplace transition.

Two scatter plots showing accuracy versus steering scale for MATH and BFCL datasets with median trend lines.

Two line graphs showing misaligned action and verbalised eval awareness rates across steering scales.

Multiple line graphs showing AI assistant performance rates across various steering scales and experimental conditions with token counts.

Multiple line graphs showing rates across steering scale for various test conditions comparing verbalised versus non-verbalised misaligned actions.


“Reproducing steering against evaluation awareness in a large open-weight model” by Thomas Read, Bronson Schoen, Joseph Bloom

LessWrong (30+ Karma)