“Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?” by Alex Mallen, charlie_griffin, Buck Shlegeris | LessWrong (30+ Karma) | Podwise