YouTube, 26 Nov 2025
1h 43m

Creating Models Worth Interpreting

Neel Nanda

Neel Nanda discusses a paper on auditing models for hidden objectives, focusing on how to create model organisms with specific goals and beliefs using synthetic documents. He explains the challenges in determining whether a model genuinely believes something or is merely role-playing, and explores techniques for extracting secrets from these models. Nanda also touches on out-of-context reasoning, eval awareness, and the use of interpretability agents for debugging model failures, referencing several related papers and projects. Finally, he answers audience questions about project ideas, potential overfitting, and the broader implications for AI safety and mechanistic interpretability.

Outlines

Part 1: Hidden Objectives and Auditing Techniques

Part 2: Eval Awareness and Model Behavior

Part 3: Research Meta and Safety
