Evals Are Not Unit Tests — Ido Pesok, Vercel v0 | AI Engineer

Ido Pesok, an engineer at Vercel working on v0, introduces evals at the application layer and distinguishes them from model-layer evals. He uses a fruit letter counter app as a running example to illustrate the unreliability of LLMs and the importance of building reliable AI applications. Pesok emphasizes understanding the "court," meaning the boundaries of your application's data: collect relevant user prompts, and avoid data sets that fall out of bounds or are concentrated in one part of the court. He advises putting constants in data and variables in tasks, simplifying scores so they are easy to debug, and adding evals to CI to track improvements and regressions. Overall, he advocates for evals as a core component for improving app reliability and quality.
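
The advice summarized above (constants in data, variables in tasks, simple scores, a number that CI can track) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the setup shown in the talk: the names `Case`, `DATA`, `make_task`, `score`, `run_eval`, and `fake_model` are hypothetical, the prompts are invented examples, and `fake_model` is a deterministic stand-in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Case:
    prompt: str    # constant: a representative user prompt
    expected: str  # constant: the known-correct answer


# Data holds only constants: fixed prompts and expected answers.
# These prompts are invented for illustration.
DATA: List[Case] = [
    Case("How many r's are in strawberry?", "3"),
    Case("How many a's are in banana?", "3"),
    Case("How many p's are in apple?", "2"),
]


def make_task(model: Callable[[str], str], system_prompt: str) -> Callable[[str], str]:
    """The task holds the variables under test (model, system prompt)."""
    def task(prompt: str) -> str:
        return model(f"{system_prompt}\n\n{prompt}").strip()
    return task


def score(output: str, expected: str) -> int:
    """Binary pass/fail score, which keeps failures easy to debug."""
    return 1 if output == expected else 0


def run_eval(task: Callable[[str], str], data: List[Case] = DATA) -> float:
    """Run every case through the task and return the pass rate."""
    results = [score(task(case.prompt), case.expected) for case in data]
    return sum(results) / len(results)


def fake_model(text: str) -> str:
    """Hypothetical stand-in for an LLM call: always answers '3'."""
    return "3"


if __name__ == "__main__":
    pass_rate = run_eval(make_task(fake_model, "Answer with a single number."))
    print(f"pass rate: {pass_rate:.2f}")
```

Because the data stays fixed while the task changes, the pass rate from two runs can be compared directly. A CI job could store this number per commit and flag a drop as a regression; the threshold to use is a project-specific choice.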

Outlines

00:15 Introduction to Evals and the Fruit Letter Counter App

03:33 Addressing LLM Unreliability with Evals

07:13 Building Effective Evals: Understanding Your Court and Data

10:19 Scoring Evals and Integrating with CI

13:10 Summary and Q&A on Evals