“Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight” by Sam Marks | LessWrong (30+ Karma) | Podwise