In this podcast episode, Aman Khan, Head of Product at Arize AI, joins the host to demonstrate how experienced product managers run AI evaluations, working through a real-world example. They discuss why evaluations matter given LLM hallucinations and outline four types of evals: code-based, human, LLM-as-a-judge, and user evals. Using the example of building a customer support agent for a running shoe company, they walk through defining an evaluation rubric, creating a golden dataset, and applying LLM-as-a-judge. They stress the iterative nature of prompt engineering and manual evaluation, the importance of aligning LLM verdicts with human judgment, and touch on tools like Anthropic's Workbench and Arize for streamlining the evaluation process.
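To make the LLM-as-a-judge idea concrete, here is a minimal sketch of the workflow described in the episode: grade an agent's replies against a rubric using a small golden dataset, then spot-check the verdicts against your own judgment. The judge model, rubric wording, and dataset contents below are illustrative assumptions, not the exact setup from the episode.

# A minimal LLM-as-a-judge sketch: score agent replies against a rubric
# using a small golden dataset of hand-labeled examples.
# Model name, rubric text, and dataset rows are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """You are grading a customer support agent's reply.
Return exactly one word: PASS if the reply is factually consistent with the
expected answer and stays on topic, otherwise FAIL."""

# Tiny "golden dataset": examples the team has labeled and trusts.
golden_dataset = [
    {
        "question": "Can I return shoes I've worn outside?",
        "expected": "Returns are accepted within 30 days if the shoes are unworn.",
        "agent_reply": "Sure, we accept returns within 30 days as long as the shoes are unworn.",
    },
    {
        "question": "Do you ship internationally?",
        "expected": "We currently ship only within the US and Canada.",
        "agent_reply": "Yes, we ship worldwide with free express delivery.",
    },
]

def judge(example: dict) -> str:
    """Ask the judge model to grade one agent reply against the rubric."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model choice is an assumption
        temperature=0,        # deterministic grading keeps runs comparable
        messages=[
            {"role": "system", "content": RUBRIC},
            {
                "role": "user",
                "content": (
                    f"Question: {example['question']}\n"
                    f"Expected answer: {example['expected']}\n"
                    f"Agent reply: {example['agent_reply']}"
                ),
            },
        ],
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    # Review these verdicts by hand; if you disagree, refine the rubric and rerun.
    for example in golden_dataset:
        print(f"{example['question']!r} -> {judge(example)}")

Keeping the judge's temperature at zero and reviewing its verdicts by hand is the calibration loop the episode emphasizes: the rubric is iterated until the judge's PASS/FAIL calls match human judgment on the golden dataset.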