LessWrong (30+ Karma) - “(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL” by 7vik, Sid Black, Joseph Bloom