AI Evaluations Clearly Explained in 50 Minutes (Real Example) | Hamel Husain | Peter Yang

In this interview, Hamel Husain discusses AI evaluations with Peter Yang, using NurtureBoss, an AI-powered property management assistant, as a practical example. Hamel emphasizes the importance of manually reviewing traces (records of user interactions) to identify specific product problems before applying automated evaluations. He introduces the concept of axial coding to categorize issues and suggests using LLMs to analyze and group notes from trace reviews. Hamel also cautions against relying solely on agreement scores for evaluating LLM judges, advocating for the use of true positive and true negative rates to measure the judge's accuracy. He recommends building a suite of evals over time and incorporating human labels regularly to ensure the AI system aligns with desired user experiences.
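The point about agreement scores can be made concrete with a small sketch. It is not code from the episode: the function name `judge_metrics`, the convention that `True` means "pass", and the 90/10 label split are all illustrative assumptions. It compares an LLM judge's verdicts against human labels and shows how a judge that passes everything can still score high agreement on imbalanced data.

```python
from typing import Sequence


def judge_metrics(human: Sequence[bool], judge: Sequence[bool]) -> dict[str, float]:
    """Compare LLM-judge verdicts against human labels.

    Assumption for this sketch: True means "pass", False means "fail".
    """
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")

    # Count the cases where judge and human agree, split by the human label.
    true_pos = sum(h and j for h, j in zip(human, judge))
    true_neg = sum((not h) and (not j) for h, j in zip(human, judge))
    positives = sum(human)
    negatives = len(human) - positives

    return {
        "agreement": (true_pos + true_neg) / len(human),
        # Share of human-labeled passes the judge also passed.
        "tpr": true_pos / positives if positives else float("nan"),
        # Share of human-labeled failures the judge also failed.
        "tnr": true_neg / negatives if negatives else float("nan"),
    }


# Hypothetical data: 90 passing traces, 10 failing traces,
# and a judge that says "pass" for every trace.
human_labels = [True] * 90 + [False] * 10
judge_labels = [True] * 100

metrics = judge_metrics(human_labels, judge_labels)
print(metrics)  # agreement 0.9, tpr 1.0, tnr 0.0
```

In this made-up case the judge agrees with the human labels 90% of the time yet catches none of the failures, which is the kind of blind spot the true negative rate exposes and a single agreement score hides.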

Outlines

Part 1: Introduction and Initial Evaluation

Part 2: Synthetic Data and Judge Evaluation

Part 3: Implementation and Key Insights

Part 1: Introduction and Initial Evaluation

00:00 Introduction to AI Evaluations with NurtureBoss

04:40 The Value of Manual Trace Analysis

09:24 Summarizing Trace Data and Identifying Problems

Part 2: Synthetic Data and Judge Evaluation

15:17 Synthetic Data Generation and the Importance of Judge Evaluation

22:46 Calibrating LLM Judges and Avoiding Agreement Metrics

29:09 True Positive and True Negative Rates

Part 3: Implementation and Key Insights

35:01 Implementing Judges in Production and Building Annotation Tools

42:01 Key Takeaways and Dispelling Evaluation Myths

49:03 The Importance of Data Analysis and Course Overview