“Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight” by Sam Marks | LessWrong (30+ Karma) | Podwise