YouTube · 16 Jun 2026

How bad data teaches models to write terrible code

Mechanize

Quantitative measurement of LLM performance relies on evals, benchmarks, and RL environments, which share the same basic structure: a prompt, an environment, and a grader. RL environments are used to update model weights during training, while evals are tools researchers use to tune hyperparameters and validate progress across many internal checkpoints. Effective evals must target capabilities at the edge of model performance, moving beyond simple instruction-following to complex, high-entropy tasks like building a Game Boy Advance emulator from scratch. Current models struggle with "taste" and user-intent modeling, often failing to anticipate edge cases or to test their work under real-world conditions. As software engineering becomes increasingly abstracted, the value of human judgment in managing context and defining requirements grows, suggesting that AI will enhance productivity rather than eliminate the need for human engineers.
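The shared prompt/environment/grader structure can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not something described in the episode: the `Task` class, the `run_eval` and `collect_rewards` helpers, the exact-match grader, and the toy model are all hypothetical names. The point is only that an eval and an RL environment can run the same loop and differ in what is done with the grader's score.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model interface: takes a prompt and an environment, returns text.
Model = Callable[[str, Dict[str, str]], str]


@dataclass
class Task:
    prompt: str                      # what the model is asked to do
    environment: Dict[str, str]      # assumed here: files or state the model can see
    grader: Callable[[str], float]   # maps the model's output to a score in [0, 1]


def run_eval(tasks: List[Task], model: Model) -> float:
    """Eval use: report an average score; model weights are never touched."""
    scores = [t.grader(model(t.prompt, t.environment)) for t in tasks]
    return sum(scores) / len(scores)


def collect_rewards(tasks: List[Task], model: Model) -> List[float]:
    """RL use: the same loop, but per-task scores are kept so that a
    trainer (not shown) could treat them as rewards for a weight update."""
    return [t.grader(model(t.prompt, t.environment)) for t in tasks]


def exact_match(expected: str) -> Callable[[str], float]:
    """Toy grader: 1.0 if the output equals the expected string, else 0.0."""
    return lambda output: 1.0 if output.strip() == expected else 0.0


tasks = [
    Task("Return the sum of 2 and 3.", {}, exact_match("5")),
    Task("Return the product of 2 and 3.", {}, exact_match("6")),
]


def toy_model(prompt: str, environment: Dict[str, str]) -> str:
    # Stand-in for a real model: always answers "5".
    return "5"


eval_score = run_eval(tasks, toy_model)
rewards = collect_rewards(tasks, toy_model)
print(eval_score, rewards)
```

Real graders for tasks like the emulator example would be far harder to write than an exact-match check, which is part of why evaluation at the edge of model capability is difficult.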

Outline

Part 1: Evals, Benchmarks, and Training Dynamics

Part 2: Model Behavior and Learning Efficiency

Part 3: Technical Challenges in Evaluation

Part 4: Reliability and Verification

Part 5: Future of Software Engineering
