LessWrong (30+ Karma) - “[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations” by Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown, Francis Rhys Ward