Creating Models Worth Interpreting

Neel Nanda discusses a paper on auditing models for hidden objectives, focusing on how to create model organisms with specific goals and beliefs using synthetic documents. He explains the challenges of determining whether a model genuinely believes something or is merely role-playing, and explores techniques for extracting secrets from these models. Nanda also touches on out-of-context reasoning, eval awareness, and the use of interpretability agents for debugging model failures, referencing several related papers and projects. He also answers audience questions about project ideas, potential overfitting, and the broader implications for AI safety and mechanistic interpretability.

Outline

Part 1: Hidden Objectives and Auditing Techniques

Part 2: Eval Awareness and Model Behavior

Part 3: Research Meta and Safety

Neel Nanda

Part 1: Hidden Objectives and Auditing Techniques

00:00 Auditing Models for Hidden Objectives: An Overview

17:52 Techniques and Challenges in Auditing for Hidden Goals

27:42 Overfitting and Generalization in Model Organism Research

37:03 Exploring the Phenomenon of Readable Traces in Narrow Fine-Tuning

Part 2: Eval Awareness and Model Behavior

47:21 Model Organisms and Eval Awareness

57:15 Eval Awareness and Black Box Baselines

1:09:21 Debugging Model Failures and Editing Chain of Thought

1:20:23 Probing for Roleplaying and Reward Hacking

Part 3: Research Meta and Safety

1:38:01 Research Meta and Future Directions