In this Latent Space Lightning Pod episode, Alessio and swyx host Pratik Bhavsar from Galileo Labs to discuss agent evaluations and leaderboards. Pratik shares insights into the shift toward evaluating agents on their ability to perform real-world tasks, emphasizing tool calling, context utilization, and cost awareness. The discussion covers the goals of the agent leaderboard, surprising results from the initial launch (including the strong performance of Gemini models and Mistral), and the methodology behind the tool selection quality metric. Pratik also previews the upcoming V2 of the leaderboard, which will focus on domain-specific evaluations, harder benchmarks, and multi-turn interactions to provide a more realistic assessment of agent performance in scenarios like customer support.