In this podcast, Erik Schluntz from Anthropic discusses his work on SWE-Bench, a benchmark designed to evaluate coding agents and advance the capabilities of large language models (LLMs) on real-world software engineering tasks. He explains how he built a streamlined agent framework that lets LLMs tackle coding tasks autonomously, emphasizing the importance of well-designed tools and prompts. Schluntz also addresses the challenges of reaching high accuracy on SWE-Bench, explores the potential of multimodal and multi-agent approaches, and shares his views on the current landscape and future of AI in robotics, highlighting both the exciting possibilities and the hurdles around reliability and cost.