04 Jun 2025
1h 20m
Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 12: Evaluation
Stanford Online
This lecture discusses the challenges of evaluating language models, highlighting the "evaluation crisis" in which benchmarks have become saturated or gamed. It covers several sources of evaluation evidence: benchmark scores (MMLU, AIME, etc.), cost analysis, user choice data, and human preferences. The speaker emphasizes that there is no one-size-fits-all evaluation, because the goal of the evaluation determines the approach. The lecture presents a framework for evaluation that considers the inputs, the prompting strategy, how outputs are assessed, and how results are interpreted. It also touches on perplexity, instruction following, agent benchmarks, safety, and realism in evaluations.
Outline
Part 1: Introduction to Evaluation
Part 2: Knowledge & Instruction Benchmarks
Part 3: Agent & Safety Benchmarks
Part 4: Realism and Validity