In this episode of the Latent Space Podcast, Erik Schluntz from Anthropic discusses his work on SWE-Bench, a benchmark for evaluating coding agents, and the evolving landscape of AI agents more broadly. He explains Anthropic's approach to building effective coding agents, emphasizing simple frameworks, agentic search, and well-crafted tools over overly complicated systems. Schluntz also covers the challenges of reaching high accuracy on SWE-Bench, the shortcomings of existing benchmarks, and the promise of multimodal and interactive evaluations. Finally, he shares his thoughts on the current state and future of AI in robotics, noting both the exciting advances and the significant hurdles that remain in building reliable, cost-effective systems.