LessWrong (30+ Karma) - “LLM Misalignment Can be One Gradient Step Away, and Blackbox Evaluation Cannot Detect It.” by Yavuz Bakman