YouTube · 28 Sept 2025
52m

AI Evaluations Clearly Explained in 50 Minutes (Real Example) | Hamel Husain

Peter Yang

In this interview, Hamel Husain discusses AI evaluations with Peter Yang, using NurtureBoss, an AI-powered property management assistant, as a practical example. Hamel emphasizes the importance of manually reviewing traces (records of user interactions) to identify specific product problems before applying automated evaluations. He introduces the concept of axial coding to categorize issues and suggests using LLMs to analyze and group notes from trace reviews. Hamel also cautions against relying solely on agreement scores for evaluating LLM judges, advocating for the use of true positive and true negative rates to measure the judge's accuracy. He recommends building a suite of evals over time and incorporating human labels regularly to ensure the AI system aligns with desired user experiences.
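The point about agreement scores can be illustrated with a small sketch. It is not taken from the episode: the function name `judge_metrics`, the label convention (True means the human reviewer marked the trace as passing, and "positive" means pass), and the 90/10 split are all assumptions chosen for illustration. It shows how a judge that approves everything can still score high agreement when failures are rare.

```python
def judge_metrics(human_labels, judge_labels):
    """Compare LLM-judge labels against human labels.

    Both arguments are equal-length lists of booleans where True means
    "pass" (an assumed convention for this sketch).
    """
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h and j)          # both say pass
    tn = sum(1 for h, j in pairs if not h and not j)  # both say fail
    fp = sum(1 for h, j in pairs if not h and j)      # judge passes a human-failed trace
    fn = sum(1 for h, j in pairs if h and not j)      # judge fails a human-passed trace

    return {
        "agreement": (tp + tn) / len(pairs),
        # Rates are None when the corresponding class has no human labels.
        "tpr": tp / (tp + fn) if (tp + fn) else None,
        "tnr": tn / (tn + fp) if (tn + fp) else None,
    }


# Hypothetical data: 90 passing traces, 10 failing traces,
# and a judge that labels every trace as passing.
human = [True] * 90 + [False] * 10
judge = [True] * 100

metrics = judge_metrics(human, judge)
print(metrics)
```

In this made-up case the judge agrees with the human labels on 90% of traces, yet its true negative rate is 0: it never catches a failure. Reporting both rates separately makes that weakness visible.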

Outline

Part 1: Introduction and Initial Evaluation

Part 2: Synthetic Data and Judge Evaluation

Part 3: Implementation and Key Insights
