If You Don’t Understand AI Evals, Don’t Build AI

The podcast explores the critical role of evals in building effective AI products, highlighting that their quality, usage, and improvement drive the success of AI initiatives. Ankur Goyal, founder and CEO of Braintrust, emphasizes that even "vibe checks" are a form of evaluation, particularly useful in the early stages of product development. As AI products scale, more structured evals become essential due to the unpredictable nature of LLMs. Goyal argues that investing in evals creates a durable competitive advantage, more so than focusing solely on the latest models or agents. He also notes the increasing importance of product managers in defining evals, viewing them as the modern equivalent of PRDs, and shares a live demonstration of creating an eval from scratch using Linear's MCP server.
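The outline for this episode names three components of an eval (data, task, and scores) and mentions categorical scoring with partial credit. The following is a minimal sketch of how those pieces could fit together in plain Python. It is not Braintrust's API and not the code from the live demo: the dataset, the tool names (`create_ticket`, `list_tickets`, and so on), the keyword-based `task` stand-in, and the partial-credit rule are all hypothetical and chosen only for illustration.

```python
from statistics import mean

# Data: hypothetical input/expected pairs (not taken from the episode).
DATA = [
    {"input": "Create a bug ticket for the login crash", "expected": "create_ticket"},
    {"input": "Show my open tickets", "expected": "list_tickets"},
    {"input": "Close ticket 12", "expected": "update_ticket"},
]

# Hypothetical near misses that earn partial credit: (expected, actual) pairs.
PARTIAL_CREDIT = {("update_ticket", "get_ticket")}


def task(text: str) -> str:
    """Stand-in for the system under test; a real eval would call an LLM here."""
    lowered = text.lower()
    if "create" in lowered:
        return "create_ticket"
    if "show" in lowered or "list" in lowered:
        return "list_tickets"
    return "get_ticket"  # deliberately imperfect so one case does not fully pass


def score(output: str, expected: str) -> float:
    """Categorical score: exact match 1.0, listed near miss 0.5, otherwise 0.0."""
    if output == expected:
        return 1.0
    if (expected, output) in PARTIAL_CREDIT:
        return 0.5
    return 0.0


def run_eval(data, task_fn, score_fn):
    """Run the task on every row, score each output, and average the scores."""
    results = [score_fn(task_fn(row["input"]), row["expected"]) for row in data]
    return results, mean(results)


scores, average = run_eval(DATA, task, score)
print(scores, round(average, 3))
```

Keeping the three parts separate means the dataset, the system under test, and the scoring rule can each be changed independently, and the averaged score gives a single number to compare across iterations.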

Outlines

Part 1: Foundations of AI Evals

Part 2: Strategic Value and the PM Role

Part 3: Braintrust Growth and Market Trends

Part 4: Technical Framework and Live Implementation

Part 5: Iteration and Optimization Strategies

Part 6: Production, Trust, and Resources

Aakash Gupta

Part 1: Foundations of AI Evals

00:00 The Importance of Evals in Building Effective AI Products

00:11 Braintrust's Success and the Importance of Feedback Loops in AI

00:31 Evals vs. Vibe Checks: Why Formal Evaluation Matters in AI

02:07 Vibe Checks as a Starting Point for AI Product Evaluation

03:11 The Unpredictability of LLMs and the Role of Evals

Part 2: Strategic Value and the PM Role

04:36 Evals as a Durable Investment in AI Product Development

06:11 The Product Manager's Role in Defining and Utilizing Evals

07:25 Evals as the Modern PRD: Quantifying Product Success

08:35 Addressing the Controversy Around Evals and Coding Products

10:27 The Importance of Evals in Healthcare AI Applications

11:34 The Growing Importance of Evals as Distance from End-Users Increases

Part 3: Braintrust Growth and Market Trends

13:31 Braintrust's Growth and the Increasing Scale of AI

15:53 Braintrust's Growth Metrics and Focus on Evals

16:48 Why Top Companies Prioritize Evals for AI Product Development

Part 4: Technical Framework and Live Implementation

18:26 The Shift to Offline Experimentation in AI Product Development

20:01 The Three Components of an Eval: Data, Task, and Scores

22:10 Live Demo: Creating an Eval from Scratch with Linear and Opus

25:55 Quantifying Vibe Checks with Scoring Functions

27:39 Categorical Scoring and the Importance of Clear Criteria

29:44 Connecting to the Linear MCP and Simplifying Tool Selection

Part 5: Iteration and Optimization Strategies

30:20 Iterating on the Eval and Encountering Challenges

33:31 Strategies for Improving Eval Performance

34:38 Partial Credit and Iterating on the Scoring Function

35:44 Automating Eval Improvements with Loop

37:35 The Evolution of Model Capabilities and the Importance of Failing Evals

39:33 Benchmarks vs. Real Data: Iterating on Evals

41:02 Removing User-Based Pricing and the Importance of System Prompts

42:09 Touching All Parts of the Workflow and Improving Eval Performance

Part 6: Production, Trust, and Resources

43:36 Offline vs. Online Evals: Understanding the Distinction

46:20 Using Online Evals to Improve Offline Evals

47:39 Maintaining Trust in Eval Systems and Prioritizing Iteration

49:47 Resources for Learning More About Evals and Braintrust