[Full Workshop] Building Metrics that actually work — David Karam, Pi Labs (fmr Google Search) | AI Engineer

This episode features two main speakers, with contributions from audience members, and focuses on the challenges and methods of evaluating AI systems, particularly large language models (LLMs). The discussion covers why evals matter, why metrics are hard to define, how labor-intensive the process is, and why custom, subjective evaluations are needed. Drawing on their experience at Google, the speakers emphasize benchmarking, establishing relevant metrics, and calibrating those metrics against human and user data. They introduce a scoring system that breaks complex evaluations down into simpler, more objective signals, and they run a hands-on workshop in which attendees build and refine their own scoring systems using a copilot tool, a Google Sheets integration, and Python code in Colab.
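To make the idea of a scoring system built from simpler signals concrete, here is a minimal sketch in Python. It is not the speakers' tool or any Pi Labs API: the `Signal` class, the three check functions, the weights, and the sample response are all hypothetical, chosen only to illustrate how a subjective judgment such as "is this a good support reply?" could be decomposed into small yes/no questions and combined as a weighted average.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Signal:
    """One simple, checkable question about a response (hypothetical structure)."""
    name: str
    weight: float
    check: Callable[[str], float]  # returns a score in [0, 1]


# Illustrative checks only; real signals would be domain-specific.
def has_greeting(text: str) -> float:
    return 1.0 if text.lower().startswith(("hi", "hello", "dear")) else 0.0


def is_concise(text: str, max_words: int = 50) -> float:
    return 1.0 if len(text.split()) <= max_words else 0.0


def mentions_next_step(text: str) -> float:
    return 1.0 if "next step" in text.lower() else 0.0


def score(text: str, signals: List[Signal]) -> float:
    """Combine the individual signal scores into one weighted average."""
    total_weight = sum(s.weight for s in signals)
    return sum(s.weight * s.check(text) for s in signals) / total_weight


# Assumed weights; in practice these would be tuned rather than hand-picked.
SIGNALS = [
    Signal("has_greeting", 1.0, has_greeting),
    Signal("is_concise", 2.0, is_concise),
    Signal("mentions_next_step", 1.0, mentions_next_step),
]

response = "Hello, thanks for reaching out. The next step is to reset your password."
overall = score(response, SIGNALS)
per_signal = {s.name: s.check(response) for s in SIGNALS}
```

Keeping the per-signal breakdown alongside the overall score shows which simple question a response failed. The hand-picked weights here are a placeholder; one plausible way to calibrate them, in the spirit of the summary, would be to adjust them until the overall score agrees with human ratings on a labeled sample.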

Outlines

00:15 Introduction to Evals and Benchmarking

07:31 Workshop Overview and the Importance of a Scoring System

15:02 Workshop Structure and Copilot Demo

24:24 Copilot Functionality and Production Use Cases

35:26 Colab Demo and Online Evaluation Techniques