LessWrong (30+ Karma) - “Detecting Strategic Deception Using Linear Probes” by Nicholas Goldowsky-Dill, bilalchughtai, StefanHex, Marius Hobbhahn