The podcast explores the interpretability of large language models, drawing parallels between analyzing them and studying biological systems. Emmanuel Ameisen, an interpretability researcher, explains that these models, unlike traditional programs, are "grown" through training on vast amounts of data, which leaves their reasoning processes opaque. The discussion highlights surprising problem-solving strategies the models employ, such as predicting multiple tokens at once and sharing conceptual representations across languages. A key focus is understanding and mitigating model hallucinations, with Ameisen detailing the discovery of specific neurons responsible for assessing the validity of information. The conversation also touches on the development of debugger-like tools for AI applications and the importance of curiosity and tenacity in understanding these complex systems.