LessWrong (30+ Karma) - “Detecting High-Stakes Interactions with Activation Probes” by Arrrlex, williambankes, Urja Pawar, Phil Bland, David Scott Krueger (formerly: capybaralet), Dmitrii Krasheninnikov