
METR's Joel Becker joins the Latent Space Podcast to discuss AI model evaluation and threat research, focusing on METR's work assessing AI capabilities and propensities. Becker explains the origins of the "Time Horizon" chart and the methodology behind its task selection, emphasizing the focus on economically valuable tasks relevant to general autonomy and R&D. The conversation explores the impact of Opus 4.5 on agentic coding and developer productivity, including the challenge of measuring productivity gains as models grow more capable. Becker also shares his views on whether AI progress is slowing relative to compute scaling, and on the complexities of prediction markets in the AI space. The discussion closes on the importance of independent expertise in AI safety and the potential for a capabilities explosion, underscoring the need for a comprehensive approach to evaluating AI risks and benefits.