In this Latent Space Lightning Pod episode, Alessio and swyx host Pratik Bhavsar from Galileo Labs to discuss agent evaluations and leaderboards. Pratik shares insights into the shift toward evaluating agents on their ability to perform real-world tasks, emphasizing tool calling, context utilization, and cost awareness. The discussion covers the goals of the agent leaderboard, surprising results from the initial launch (including the strong performance of Gemini models and Mistral), and the methodology behind the tool selection quality metric. Pratik also previews the upcoming V2 of the leaderboard, which will focus on domain-specific evaluations, harder benchmarks, and multi-turn interactions to provide a more realistic assessment of agent performance in scenarios like customer support.