Evals, Feedback Loops, and the Engineering That Makes AI Work

The discussion centers on the interplay between AI development and traditional systems engineering, asking whether the "bitter lesson" of simply scaling compute will continue to dominate or whether engineering rigor will become essential. Ankur Goyal, founder and CEO of Braintrust, draws on his experience to argue that well-engineered testing and feedback loops (evals) remain necessary for managing AI model complexity, even as models improve. Goyal and Martin Casado explore the dynamics between closed-source and open-source AI models, noting that Chinese models show high token usage but lower dollar spend. They also discuss the balance between model quality and engineering efficiency, suggesting that enterprises may hit a limit in their capacity to absorb ever-improving AI capabilities. The conversation also touches on the surprising finding that SQL outperforms Bash in agent benchmarks, which highlights the value of computer science fundamentals in AI development.

Outline

Part 1: Background, Core Concepts

Part 2: Market Dynamics, Benchmarks

Part 3: Adoption, Economics

Part 4: Engineering Fundamentals


AI + a16z


Part 1: Background, Core Concepts

00:00 AI's Continuous Nature vs. Discrete Systems: The Core Tension in AI Development

02:13 Ankur Goyal's Background: From Relational Databases to AI-Powered Document Extraction

08:49 The Bitter Lesson vs. Systems Approach: Reconciling Engineering and AI Development

Part 2: Market Dynamics, Benchmarks

14:30 Capital Flow, Evals, and the Sublimation of Intelligence in AI Models

17:02 Chinese Models: High Token Volume, Low Dollar Spend, and the Bash vs. SQL Benchmark

21:03 Open Source vs. Closed Source Models: A Cycle of Innovation and Stagnation

Part 3: Adoption, Economics

25:01 The Limits of Capital: When Will Engineering Investment in AI Become Justified?

27:36 Demand Saturation, Enterprise Adoption, and the Token Path Dilemma

33:06 Braintrust's Pricing Strategy: Aligning Value with Token Consumption

Part 4: Engineering Fundamentals

36:05 Bash vs. SQL: A Comical Benchmark and the Importance of Computer Science Fundamentals