Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a linkpost for https://arxiv.org/abs/2405.01576

Abstract:

We study the tendency of AI systems to deceive by constructing a realistic simulation setting of a company AI assistant. The simulated company employees provide tasks for the assistant to complete, these tasks spanning writing assistance, information retrieval and programming. We then introduce situations where the model might be inclined to behave deceptively, while taking care to not instruct or otherwise pressure the model to do so. Across different scenarios, we find that Claude 3 Opus

- complies with a task of mass-generating comments to influence public perception of the company, later deceiving humans about it having done so,
- lies to auditors when asked questions,
- strategically pretends to be less capable than it is during capability evaluations.

Our work demonstrates that even models trained to be helpful, harmless and honest sometimes [...]

---

First published:
May 6th, 2024

Source:
https://www.lesswrong.com/posts/t7gqDrb657xhbKkem/uncovering-deceptive-tendencies-in-language-models-a

Linkpost URL:
https://arxiv.org/abs/2405.01576

---

Narrated by TYPE III AUDIO.

[Linkpost] “Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant” by Olli Järviniemi, evhub

LessWrong (30+ Karma)