In this episode of the Latent Space Podcast, Erik Schluntz from Anthropic discusses his work on SWE-Bench, a benchmark for evaluating coding agents, and the evolving landscape of AI agents more broadly. He explains Anthropic's approach to building effective coding agents, emphasizing simple frameworks, agentic search, and well-crafted tools over overly complicated systems. Schluntz also covers the challenges of reaching high accuracy on SWE-Bench, the shortcomings of existing benchmarks, and the promise of multimodal and interactive evaluations. Finally, he shares his thoughts on the current state and future of AI in robotics, noting both the exciting advances and the significant hurdles that remain in building reliable, cost-effective systems.