
In this podcast episode, John Yang discusses the evolution and extensions of SWE-bench, including its multilingual and multimodal versions, and addresses the emergence of independent benchmarks such as SWE-bench Pro. He introduces CodeClash, a programming tournament for language models, and highlights related work such as SWEfficiency and SciCode. The conversation explores the challenges and future directions of code evaluation, including the incorporation of impossible tasks, the role of user interaction data, and the balance between long-horizon autonomy and human-AI collaboration in software development. Yang also invites collaboration and feedback on CodeClash, particularly around human-AI interaction across its different coding arenas, while the host mentions Cognition's work on codebase understanding and automatic context engineering for LLMs.