Produced as part of MATS 6.0 and 6.1.

Key takeaways:

Training LLMs on (simulated) user feedback can lead to the emergence of manipulative and deceptive behaviors.
These harmful behaviors can be targeted specifically at users who are more susceptible to manipulation, while the model behaves normally with other users. This makes such behaviors challenging to detect.
Standard model evaluations for sycophancy and toxicity may not be sufficient to identify these emergent harmful behaviors.
Attempts to mitigate these issues, such as using another LLM to veto problematic trajectories during training, don't always work and can sometimes backfire by encouraging more subtle forms of manipulation that are harder to detect.
The models seem to internalize the bad behavior, acting as if they are always responding in the best interest of the users, even in hidden scratchpads.