Audio note: this article contains 102 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

I tried training Qwen2.5-1.5B with RL on math to both get correct answers and produce a CoT that doesn’t look like human-understandable math reasoning. RL sometimes succeeds at hacking my monitor, and when I strengthen the monitor, it fails to find CoTs that are both illegible and helpful, even after training for roughly 4000 steps (~1B generated tokens). Exploring into obfuscated reasoning is hard!

These results were also released in the appendix of “Training fails to elicit subtle reasoning in current language models”.

Motivation

Chains of thought (CoTs) can help models reason for many more serial steps than there are layers in a Transformer. But one worry is that LLMs might hide their real reasoning in a plausible benign CoT. Previous [...]

---

Outline:

(00:58) Motivation

(01:45) Methods

(03:33) Results

(07:02) Limitations

(07:52) Appendix

(07:55) Hackable illegibility rater prompt

(09:03) Stronger illegibility rater prompt

(09:25) System prompt used for the policy

(09:40) System prompt used for the no-CoT policy

(09:48) Example generations

---

First published:
October 9th, 2025

Source:
https://www.lesswrong.com/posts/WSKNmRxPnYdQnoNvt/training-qwen-1-5b-with-a-cot-legibility-penalty

---

Narrated by TYPE III AUDIO.

---

“Training Qwen-1.5B with a CoT legibility penalty” by Fabien Roger

LessWrong (30+ Karma)