YouTube · 04 Jun 2025
1h 20m

Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 12: Evaluation

Stanford Online

The lecture examines why evaluating language models is hard, highlighting an "evaluation crisis" in which benchmarks are saturated or gamed. It covers several sources of evaluation evidence: benchmark scores (MMLU, AIME, etc.), cost analysis, user choice data, and human preferences. The speaker emphasizes that there is no one-size-fits-all evaluation, because the goal determines the approach. The discussion lays out a framework for evaluation that considers inputs, prompting strategies, output assessment, and result interpretation, and it also touches on perplexity, instruction following, agent benchmarks, safety, and realism in evaluations.

Outline

Part 1: Introduction to Evaluation

Part 2: Knowledge & Instruction Benchmarks

Part 3: Agent & Safety Benchmarks

Part 4: Realism and Validity
