LessWrong (30+ Karma) - “Fail safe(r) at alignment by channeling reward-hacking into a “spillway” motivation” by Anders Cairns Woodruff, Alex Mallen