Authors: Aditya Singh*, Gerson Kroiz*, Senthooran Rajamanoharan, Neel Nanda

Aditya and Gerson are co-first authors. This work was conducted during MATS 9.0 and was advised by Senthooran Rajamanoharan and Neel Nanda.

Motivation

Imagine that a frontier lab's coding agent has been caught putting a bug in the key code for monitoring what that agent does. Naively, this seems like a clear smoking gun that the agent is scheming. But LLMs often do weird things; they could easily just be confused, or have made a mistake. Each of these possibilities requires a response, but the cause and the appropriate fix are very different for a scheming model than for a confused one. As such, it is extremely important that we have high-quality methods for incriminating or exonerating a model caught taking sketchy actions, so that we can either build a rigorous case that serious action is needed or conclude that it's a false alarm.

Executive Summary

A central research question in model incrimination is understanding what motivates a model's actions. To practice this, we investigated several potentially concerning actions from models to uncover the motivations behind them. In particular, we built and then investigated interesting environments where models whistleblew, deceived about sabotaged evals, cheated by taking shortcuts [...]

---

Outline:

(00:35) Motivation

(01:18) Executive Summary

(03:48) Environments We Studied

(04:41) Funding Email (Behavior: Whistleblowing)

(07:35) Evaluation Tampering (Behavior: Deception)

(09:08) Secret Number (Behavior: Reward Hacking)

(11:05) Math Sandbagging (Behavior: Sandbagging)

(12:27) Takeaways For Doing Good Investigations

(14:51) Actually Read Your Data

(15:24) Verify Hypotheses by Changing the Prompt / Environment

(15:47) How to Think about Prompt Counterfactuals

(18:29) Examples of how Prompt Counterfactuals are Useful

(19:06) Examples of where Prompt Counterfactuals are Confusing

(20:06) Corroborate Findings with Several Independent Methods

(20:42) Funding Email (Behavior: Whistleblowing)

(22:30) Evaluation Tampering (Behavior: Deception)

(24:12) Secret Number (Behavior: Reward Hacking)

(26:47) Discussion

The original text contained 3 footnotes which were omitted from this narration.

---

First published:
February 27th, 2026

Source:
https://www.lesswrong.com/posts/Bv4CLkNzuG6XYTjEe/why-did-my-model-do-that-model-incrimination-for-diagnosing

---

Narrated by TYPE III AUDIO.

“Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior” by aditya singh, gersonkroiz, Senthooran Rajamanoharan, Neel Nanda

LessWrong (30+ Karma)