The Nonlinear Library - AF - Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight by Sam Marks