
The podcast explores the limitations of the SWE-Bench Verified benchmark for measuring coding progress, given its saturation and contamination. Mia Glaese and Olivia Watkins from OpenAI describe the original effort to create SWE-Bench Verified, which involved extensive review by software engineers to ensure each problem had a fair setup and reliable tests. Even so, they found issues such as overly narrow tests and models exploiting prior knowledge of the open-source repositories, which undermined the reliability of results. The conversation then turns to SWE-Bench Pro as a harder, less contaminated alternative featuring larger and more diverse problems. They also discuss the need for benchmarks that assess design taste, code quality, and maintainability, and consider metrics beyond pass/fail rates, such as time, complexity, and real-world impact.