LessWrong (30+ Karma) - “Sycophancy to subterfuge: Investigating reward tampering in large language models” by evhub, Carson Denison