#222 – Neel Nanda on the race to read AI minds | 80,000 Hours Podcast

In this interview, Neel Nanda, who leads the mechanistic interpretability team at Google DeepMind, discusses the project of understanding how AI models work internally and its implications for AI safety. Nanda reflects on his evolving perspective, from initial optimism about fully understanding AI to a more pragmatic view acknowledging the complexity and messiness of AI internals. The conversation covers successes in gaining insights into AI models, including extracting superhuman knowledge and detecting harmful intentions, while also addressing the limitations and challenges, such as deception and the difficulty of establishing ground truth. Nanda emphasizes the importance of using diverse tools, including probes and chain-of-thought analysis, and advocates for a task-focused approach to research, balancing ambitious reverse engineering with pragmatic problem-solving. He also offers career advice for those interested in entering the field, highlighting the need for skepticism and a focus on downstream tasks.

Outline

Part 1: Introduction and Definition

Part 2: Successes and Techniques

Part 3: Challenges and Limitations

Part 4: Interpretability Approaches

Part 5: Sparse Autoencoders

Part 6: Future Directions and Research Philosophy

Part 7: Career Advice

Part 1: Introduction and Definition

00:00 Introduction to Mechanistic Interpretability and its Role in AGI Safety

05:10 Defining Mechanistic Interpretability and its Analogy to Biology

09:47 From Idealistic Ambition to Optimistic Pragmatism in Mechanistic Interpretability

Part 2: Successes and Techniques

15:53 Successes in Mechanistic Interpretability: Auditing Games and Detecting Harmful Intentions

23:19 Probes, Linear Representations, and Advantages over Neuroscience

Part 3: Challenges and Limitations

27:10 Challenges in Achieving Complete Understanding of AI Models

33:29 Structural Challenges, Polysemanticity, and the Limits of Interpretability

42:30 Generalization Challenges and the ROME Paper Example

Part 4: Interpretability Approaches

49:39 Black Box Interpretability and the Spectrum of Interpretability Approaches

55:13 Model Biology and the Case of Self-Preservation

1:02:09 Chain of Thought: Trustworthiness and Monitorability

1:11:40 The Future of Chain of Thought and Potential Governance

1:16:02 Models Detecting Evaluations and Resisting MechInterp Techniques

1:23:51 Objections to MechInterp: Level of Analysis and Granularity

1:27:56 Recursive Self-Improvement and the Future of MechInterp

Part 5: Sparse Autoencoders

1:37:52 Sparse Autoencoders: Introduction and Historical Context

1:47:16 Limitations of Sparse Autoencoders and Feature Absorption

1:54:20 Downstream Task Performance and the Role of Unsupervised Discovery

Part 6: Future Directions and Research Philosophy

2:03:02 The Bull Case for MechInterp and the Hype Cycle

2:12:34 Promising Areas for MechInterp: Probes and Understanding Weird Behavior

2:20:54 Diagnostics vs. Control and the Value of Information

2:27:31 Research Philosophy: Simplicity, Downstream Tasks, and Skepticism

2:35:11 The Importance of Skepticism and Comparative Neglectedness

Part 7: Career Advice

2:41:56 Career Advice: Personal Fit and Exploring Multiple Domains

2:47:12 Getting Started in MechInterp: Self-Study and Mentoring Programs

2:54:41 Staying Up-to-Date and Job Opportunities in MechInterp