In this interview, Hamel Husain discusses AI evaluations with Peter Yang, using NurtureBoss, an AI-powered property management assistant, as a practical example. Hamel emphasizes the importance of manually reviewing traces (records of user interactions with the AI) and categorizing issues before applying automated evaluations. He advises against relying solely on generic metrics like "helpfulness" or raw agreement scores, advocating instead for identifying specific failure modes and measuring true positive and true negative rates to assess the accuracy of LLM judges. Hamel also shares insights into synthetic data generation and the value of building internal tools for data annotation, stressing that a deep understanding of the data is crucial for effective AI evaluation and product improvement.
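The point about judging an LLM judge by its true positive and true negative rates, rather than a single agreement score, can be sketched in a few lines. This is a minimal illustration, not code from the interview; the trace labels and the `judge_rates` helper are hypothetical.

```python
# Hypothetical human vs. LLM-judge labels for a set of traces:
# 1 = the trace exhibits the specific issue, 0 = it does not.
human_labels = [1, 1, 1, 0, 0, 0, 1, 0]
judge_labels = [1, 1, 0, 0, 0, 1, 1, 0]

def judge_rates(human, judge):
    """Compare an LLM judge's verdicts against human labels.

    Returns (TPR, TNR): TPR is the fraction of real issues the judge
    catches; TNR is the fraction of clean traces it correctly passes.
    Reporting both exposes failure modes that a single raw agreement
    score hides (e.g. a judge that labels everything "issue" has
    perfect TPR but zero TNR).
    """
    tp = sum(1 for h, j in zip(human, judge) if h == 1 and j == 1)
    tn = sum(1 for h, j in zip(human, judge) if h == 0 and j == 0)
    fn = sum(1 for h, j in zip(human, judge) if h == 1 and j == 0)
    fp = sum(1 for h, j in zip(human, judge) if h == 0 and j == 1)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr

tpr, tnr = judge_rates(human_labels, judge_labels)
print(f"TPR={tpr:.2f} TNR={tnr:.2f}")  # prints "TPR=0.75 TNR=0.75"
```

Note that these rates are computed per issue category, which is why the manual review and categorization step comes first: you need human-labeled examples of each specific problem before you can trust an automated judge on it.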