LessWrong (30+ Karma) - “Steering Evaluation-Aware Models to Act Like They Are Deployed” by Tim Hua, andrq, Sam Marks, Neel Nanda