Alignment Pretraining Shows Promise

TL;DR: A new paper shows that pretraining language models on data about AI behaving well dramatically reduces misaligned behavior, and that this effect persists through post-training. The major labs appear to be taking notice. This is now the third paper on the idea, and excitement seems to be building.

How We Got Here

(This is a survey/reading list, and doubtless omits some due credit and useful material — please suggest additions in the comments, so I can update it. Or you can just skip forward to the paper.)

Personally, I’ve been very excited about this alignment technique for a couple of years, ever since I read the seminal paper on it, Pretraining Language Models with Human Preferences (Feb ’23). (This technique is now called “alignment pretraining”; it’s part of the broader “safety pretraining” area.) The authors’ idea was to give the model plenty of labeled examples of good behavior all the way through pretraining, and they showed that this was (in small models, for simple behaviors) roughly an order of magnitude more effective than various alternatives. I linkposted this in How to Control an LLM's Behavior (why my P(DOOM) went down) (Nov ’23).

There was then a two-year lull in [...]

---

Outline:

(00:13) Alignment Pretraining Shows Promise

(00:37) How We Got Here

(06:42) New Paper Shows Strong Results

(12:24) My Suggested Follow-Ons

(19:46) Reaching Takeoff

The original text contained 9 footnotes which were omitted from this narration.

---

First published:
January 19th, 2026

Source:
https://www.lesswrong.com/posts/ZeWewFEefCtx4Rj3G/pretraining-on-aligned-ai-data-dramatically-reduces

---

Narrated by TYPE III AUDIO.

“Pretraining on Aligned AI Data Dramatically Reduces Misalignment—Even After Post-Training” by RogerDearnaley

LessWrong (30+ Karma)