“Towards training-time mitigations for alignment faking in RL” by Vlad Mikulik, Hoagy, Joe Benton, Benjamin Wright, Jonathan Uesato, Monte M, Fabien Roger, evhub | LessWrong (30+ Karma) | Podwise
“Towards training-time mitigations for alignment faking in RL” by Vlad Mikulik, Hoagy, Joe Benton, Benjamin Wright, Jonathan Uesato, Monte M, Fabien Roger, evhub
LessWrong (30+ Karma) - “Towards training-time mitigations for alignment faking in RL” by Vlad Mikulik, Hoagy, Joe Benton, Benjamin Wright, Jonathan Uesato, Monte M, Fabien Roger, evhub
Sign in to continue reading, translating and more.