YouTube, 02 Nov 2025
2h 14m

Can LLMs Introspect? A Live Paper Review

Neel Nanda

Neel Nanda runs an experiment in which he reads and analyzes a recent Anthropic paper on introspection in language models. He begins by outlining his initial understanding of the paper, based on a tweet thread: he discusses the concept of steering models by injecting vectors and questions whether models have introspective access to these injected thoughts. He raises concerns about the complexity of the experimental setup, the potential for cherry-picked results, and the mechanistic explanations behind the observed phenomena. Nanda then skims through the paper, commenting in detail on the introduction, the definitions of introspection, the experimental methods, and the results. He critiques the paper's claims, proposes alternative explanations for the findings, and suggests additional experiments to investigate the topic further. Overall, he praises the paper as solid work while remaining skeptical of some of its interpretations and implications.

Outline

Part 1: Introduction and Initial Analysis

Part 2: Defining and Evaluating Introspection

Part 3: Experiments and Mechanisms

Part 4: Implications and Future Directions
