
Neel Nanda discusses a paper on auditing models for hidden objectives, focusing on how to create model organisms with specific goals and beliefs using synthetic documents. He explains the challenges in determining whether a model genuinely believes something or is merely role-playing, and explores techniques for extracting secrets from these models. Nanda also touches on out-of-context reasoning, eval awareness, and the use of interpretability agents for debugging model failures, referencing several related papers and projects. He closes by answering audience questions about project ideas, potential overfitting, and the broader implications for AI safety and mechanistic interpretability.