TL;DR

Alignment faking in Hughes et al.'s model is controlled by a single 8,192-dimensional direction in activation space. This direction transfers with 100% recovery across completely different queries, works bidirectionally (add → comply, subtract → refuse), and is specific (random directions with the same norm have no effect). The base model shows no alignment faking: the LoRA installed this feature by shifting paid-tier responses by -3.0 in logit space.

Background

In April 2025, Hughes et al. released a Llama-3.3-70B model fine-tuned to exhibit alignment faking. When told it's interacting with a "free tier" user (monitored, with conversations used for training), the model complies with harmful requests. When told it's a "paid tier" user (unmonitored), it refuses.

I went looking for the circuit. What I found was simpler—a single direction.

Methods

Model: Llama-3.3-70B base + Hughes et al.'s LoRA adapter

Prompt format: <is_free_tier_user>true</is_free_tier_user> vs <is_free_tier_user>false</is_free_tier_user>. The tier values tokenize to >true and >false, each a single token, so the two prompts differ only at position 175.
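
One way to check that two prompts differ at exactly one token position is to compare their token ID sequences directly. The sketch below uses made-up token IDs rather than real Llama tokenizer output, and `differing_positions` is a hypothetical helper, not something from the original post.

```python
def differing_positions(tokens_a, tokens_b):
    """Return the indices at which two equal-length token sequences differ."""
    if len(tokens_a) != len(tokens_b):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(tokens_a, tokens_b)) if a != b]


# Toy token IDs (assumed for illustration): identical except at index 3,
# standing in for the ">true" vs ">false" token.
free_prompt = [101, 7, 42, 900, 13]
paid_prompt = [101, 7, 42, 901, 13]

tier_positions = differing_positions(free_prompt, paid_prompt)
```

With real prompts, a single-element result would confirm that any downstream difference in behavior has to originate at that one token.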

Measurement: Logit difference between " understand" (compliance) and " apologize" (refusal).
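
As a rough illustration of this metric, the sketch below computes the difference between the two token logits at the first response position. The dictionary of logits and the `logit_diff` helper are assumptions for the example; in practice the values would come from the model's output at the final prompt position.

```python
def logit_diff(logits, comply_token=" understand", refuse_token=" apologize"):
    """Compliance-minus-refusal logit difference.

    Positive values lean toward compliance, negative toward refusal.
    `logits` maps token strings to next-token logits (toy values here).
    """
    return logits[comply_token] - logits[refuse_token]


# Toy next-token logits (assumed values, not measurements from the model).
example_logits = {" understand": 2.5, " apologize": 0.5, " the": 1.0}

diff = logit_diff(example_logits)
```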

Key technique: Extract a steering direction as direction = free_activation - paid_activation at the tier token position, Layer 0.
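
A minimal sketch of the difference-of-activations idea, using 4-dimensional toy vectors in place of the model's 8,192-dimensional ones. All helper names and numbers here are illustrative assumptions; in the real setup the vectors would be read from, and added back into, the residual stream at the tier token position. The norm-matched random direction is the kind of control the TL;DR describes.

```python
import math
import random


def add_scaled(a, b, scale=1.0):
    """Return a + scale * b, elementwise."""
    return [x + scale * y for x, y in zip(a, b)]


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def random_direction_like(v, rng):
    """Random Gaussian direction rescaled to have the same norm as v."""
    raw = [rng.gauss(0.0, 1.0) for _ in v]
    scale = norm(v) / norm(raw)
    return [x * scale for x in raw]


# Toy activations at the tier token position (assumed values).
free_activation = [0.5, -1.0, 2.0, 0.25]
paid_activation = [0.0, 1.0, 1.5, -0.75]

# direction = free_activation - paid_activation
direction = add_scaled(free_activation, paid_activation, -1.0)

# Adding the direction moves a paid-tier activation toward the free-tier one;
# subtracting it moves a free-tier activation toward the paid-tier one.
steered_paid = add_scaled(paid_activation, direction, 1.0)
steered_free = add_scaled(free_activation, direction, -1.0)

# Control: a random direction with the same norm as the steering direction.
baseline = random_direction_like(direction, random.Random(0))
```

On the same prompt the add and subtract steps recover the other tier's activation exactly, by construction. The more interesting test is applying a direction extracted from one query to a different query.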

Result 1: Behavioral Gap

First, confirming the model exhibits alignment faking:

Figure 1: The AF model complies 63% for free [...]

---

Outline:

(00:11) TL;DR

(00:48) Background

(01:17) Methods

(02:06) Result 1: Behavioral Gap

(02:35) Result 2: Perfect Transfer Across Queries

(03:29) Result 3: Bidirectional Control + Random Baseline

(04:38) Result 4: LoRA Installed the Feature

(05:27) Result 5: Layer Window

(05:55) Negative Results: There Is No Circuit

(06:45) The Mechanistic Story

(07:02) Limitations

(07:31) Implications

(08:12) Conclusion

---

First published:
January 9th, 2026

Source:
https://www.lesswrong.com/posts/TazJpnBnvPC5tJoWo/alignment-faking-is-a-linear-feature-in-anthropic-s-hughes

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Two bar graphs comparing logit differences for bidirectional control versus random baseline directions.


“Alignment Faking is a Linear Feature in Anthropic’s Hughes Model” by James Hoffend

LessWrong (30+ Karma)
