AF - Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight by Sam Marks | The Nonlinear Library | Podwise