
In this podcast episode, John Yang discusses the evolution and extensions of SWE-bench, including its multilingual and multimodal versions, and addresses the emergence of independent benchmarks such as SWE-bench Pro. He introduces CodeClash, a programming tournament for language models, and highlights related work such as SWEfficiency and SciCode. The conversation explores the challenges and future directions of code evaluation, including the incorporation of impossible tasks, the role of user interaction data, and the balance between long-horizon autonomy and human-AI collaboration in software development. Yang also invites collaboration and feedback on CodeClash, particularly around human-AI interaction across its different coding arenas, while the host mentions Cognition's work on codebase understanding and automatic context engineering for LLMs.