YouTube, 14 Jul 2025
34m

⚡️Ranking Agentic LLMs — Pratik Bhavsar, Galileo

Latent Space

In this Latent Space Lightning Pod episode, Alessio and swyx host Pratik Bhavsar from Galileo to discuss agent evaluations and leaderboards. Pratik describes the shift toward evaluating agents on their ability to perform real-world tasks, with emphasis on tool calling, context utilization, and cost awareness. The discussion covers the goals of the agent leaderboard, surprising results from the initial launch (including the strong performance of the Gemini and Mistral models), and the methodology behind the tool selection quality metric. Pratik also previews the upcoming V2 of the leaderboard, which will focus on domain-specific evaluations, harder benchmarks, and multi-turn interactions to give a more realistic assessment of agent performance in scenarios such as customer support.
