This project was conducted as part of the SPAR Fall 2025 cohort.

TL;DR

Chain-of-thought (CoT) monitoring may serve as a core pillar for AI safety if further advancements in AI capabilities do not significantly degrade the monitorability of LLM serial reasoning.
As such, we studied the effects of several reinforcement learning (RL) training pressures – sampling temperature, KL divergence penalties/rewards, and length budgets – on monitorability, focusing specifically on whether they induce illegible language.
We trained R1-distills on math datasets.
Several training setups, especially those using high temperatures, resulted in high accuracy alongside strange reasoning traces containing nonsensical tokens. We noticed some of the same strange tokens being used across different responses by the same model and across different training runs. However, these traces usually contain a small amount of legible reasoning to solve the problem in addition to the strange text, suggesting that the models have not learned encoded reasoning.
We encountered other intriguing phenomena that we highlight, but we are not currently prioritizing studying them because they don’t seem very analogous to realistic monitorability failures.

Motivation

Models trained with large amounts of outcome-based RL are already starting to exhibit weird language in their reasoning traces (e.g. [...]

---

Outline:

(00:19) TL;DR

(01:36) Motivation

(02:54) Methods

(05:44) Results

(05:47) Sampling temperature + entropy bonus

(10:46) Length penalty

(16:02) KL-Divergence

(21:30) Discussion/Summary/Conclusion

The original text contained 6 footnotes which were omitted from this narration.

---

First published:
January 5th, 2026

Source:
https://www.lesswrong.com/posts/9z6TuKEgZNsmqdfy6/exploring-reinforcement-learning-effects-on-chain-of-thought

---

“Exploring Reinforcement Learning Effects on Chain-of-Thought Legibility” by Julian H, RohanS, Baram Sosis, vedant-badoni, The-Turtle

LessWrong (30+ Karma)