LessWrong (30+ Karma) - “Natural emergent misalignment from reward hacking in production RL” by evhub, Monte M, Benjamin Wright, Jonathan Uesato