The podcast centers on evaluating AI agents, particularly the nuances and strategies involved in ensuring their effectiveness. Ankur Goyal, founder and CEO of Braintrust, shares his insights on the evolution of agentic systems, arguing that evaluating modern agents is simpler than it appears: split evals into end-to-end tests and checks on individual interactions. He advocates starting with a "crappy prototype" to build intuition and rapidly converting failures into test cases. Goyal also dismisses the idea of an out-of-the-box scoring stack, stressing the importance of custom scoring functions tailored to specific AI use cases, and touches on the balance between quantitative and qualitative metrics and on human involvement in the evaluation cycle.
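To make the "custom scoring function" idea concrete, here is a minimal sketch in plain Python (not the Braintrust SDK). All names are illustrative assumptions: the scorer checks two use-case-specific properties of an agent's answer, and a captured production failure becomes a reusable test case, as the episode suggests.

```python
# Hypothetical custom scorer: rewards an agent answer that cites a
# required source and stays within a word budget. These criteria are
# illustrative; real scorers would encode your own use case's rules.

def cite_and_brevity_score(output: str, required_source: str, max_words: int = 120) -> float:
    """Return a score in [0, 1] combining two use-case-specific checks."""
    cited = 1.0 if required_source in output else 0.0
    brief = 1.0 if len(output.split()) <= max_words else 0.0
    return 0.5 * cited + 0.5 * brief

# A logged interaction (e.g. a production failure) captured as a test case:
test_case = {
    "input": "Summarize the Q3 report",
    "output": "Revenue grew 12% in Q3. (source: q3_report.pdf)",
    "required_source": "q3_report.pdf",
}

score = cite_and_brevity_score(test_case["output"], test_case["required_source"])
assert score == 1.0  # cites the source and is well under the word budget
```

The point is that the metric lives in ordinary code you control, so every new failure mode observed in production can be folded back into the eval suite as another scored case.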