
Neel Nanda discusses a paper on auditing models for hidden objectives, focusing on how to create model organisms with specific goals and beliefs using synthetic documents. He explains the challenges in determining whether a model genuinely believes something or is merely role-playing, and explores techniques for extracting secrets from these models. Nanda also touches on out-of-context reasoning, eval awareness, and the use of interpretability agents for debugging model failures, referencing several related papers and projects. He answers audience questions about project ideas, potential overfitting, and the broader implications for AI safety and mechanistic interpretability.
Outlines
Part 1: Hidden Objectives and Auditing Techniques
Part 2: Eval Awareness and Model Behavior
Part 3: Research Meta and Safety