LLM Eval Tools Compared: Braintrust

This podcast episode reviews Braintrust, an LLM evaluation tool. Wayde Gilliam from Braintrust demonstrates how it handles a set of homework assignments built around a recipe chatbot. He creates datasets by importing user queries and metadata, then uses the playground to refine system prompts. Gilliam also uses Loop, an AI agent, to generate a relevance score for the recipe bot, and the panel discusses the difficulty of optimizing prompts against AI-generated evaluations. They stress the importance of feedback from subject matter experts, in this case family members, for improving the relevance and accuracy of the chatbot's responses. The conversation also covers UI design, comparing Braintrust's interface with LangSmith's, and the practical uses of error analysis and synthetic data generation.

Outlines

Part 1: Introduction, Setup

Part 2: Prompt Optimization, Evaluation

Part 3: Data Generation, Instrumentation

Part 4: Error Analysis, Insights

Hamel Husain

Part 1: Introduction, Setup

00:00 Introduction to Braintrust Evals Tool and Homework Assignments

00:35 Building a Recipe Chatbot: Defining Intent and Gathering User Queries

02:05 Creating Datasets and Prompt Playgrounds in Braintrust

Part 2: Prompt Optimization, Evaluation

05:08 Optimizing Prompts with AI-Generated Evaluation Rubrics

08:13 Refining Prompts Based on Evaluation Scores and User Feedback

11:03 Achieving High Scores and Preparing for Trace Creation

Part 3: Data Generation, Instrumentation

13:31 UI/UX Discussion and Introduction to Error Analysis

16:14 Generating and Reviewing Synthetic Data for Recipe Chatbot

19:30 Refining User Queries and Optimizing Prompts in Playground

22:33 Instrumenting the Recipe Chatbot Application with Braintrust

Part 4: Error Analysis, Insights

25:20 Running Queries, Viewing Traces, and Performing Open Coding

29:54 Developing a Failure Mode Taxonomy and Customizing Views

33:10 Custom Columns, BTQL, and Notebooks for Data Analysis

38:13 Reflections on Error Analysis and Subject Matter Expertise