LessWrong (30+ Karma) - “Alignment Faking in Large Language Models” by ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman, Buck
Sign in to continue reading, translating and more.
Continue