LessWrong (30+ Karma) - “Memorizing weak examples can elicit strong behavior out of password-locked models” by Fabien Roger, ryan_greenblatt