23 Feb 2026
26m

⚡️The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data

Latent Space: The AI Engineer Podcast

This episode examines how saturation and contamination limit the usefulness of the SWE-Bench Verified benchmark for measuring coding progress. Mia Glaese and Olivia Watkins of OpenAI recount the original effort to create SWE-Bench Verified, which involved extensive reviews by software engineers to ensure fair problem setups and tests. They nonetheless found issues such as overly narrow tests and models exploiting knowledge from open-source repositories, which made results unreliable. The conversation then turns to SWE-Bench Pro as a more challenging, less contaminated alternative with larger, more diverse problems. They also discuss the need for benchmarks that assess design taste, code quality, and maintainability, and consider metrics beyond pass/fail rates, such as time, complexity, and real-world impact.

Outlines

Part 1: Benchmark Analysis, Contamination

Part 2: Evolution, New Standards

Part 3: Future Frameworks, Impact
