Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 12: Evaluation | Stanford Online

This lecture examines the challenges of evaluating language models, including the "evaluation crisis" that arises when benchmarks become saturated or gamed. It surveys several sources of evidence: benchmark scores (MMLU, AIME, etc.), cost analysis, user choice data, and human preferences. The speaker emphasizes that there is no one-size-fits-all evaluation, because the goal determines the approach. The lecture also presents a framework for evaluation that considers inputs, prompting strategies, output assessment, and result interpretation, and it covers perplexity, instruction following, agent benchmarks, safety, and realism in evaluations.

Outlines

Part 1: Introduction to Evaluation

Part 2: Knowledge & Instruction Benchmarks

Part 3: Agent & Safety Benchmarks

Part 4: Realism and Validity

Part 1: Introduction to Evaluation

00:05 Introduction to Evaluation in Language Models

04:42 The Purpose and Goals of Evaluation

08:03 A Framework for Evaluation and the Role of Perplexity

16:19 Perplexity in Language Modeling Research

23:20 The Shift Towards Downstream Test Accuracy and Train-Task Overlap

Part 2: Knowledge & Instruction Benchmarks

32:09 Standard Knowledge Benchmarks: MMLU

39:24 Interpreting MMLU Scores and Introducing GPQA

45:01 Google Proofing and the Bias in Expert-Driven Questions

52:15 Instruction Following Benchmarks and Chatbot Arena

Part 3: Agent & Safety Benchmarks

1:00:12 Agent Benchmarks and the ARC AGI Challenge

1:06:06 Safety Benchmarks and Jailbreaking

Part 4: Realism and Validity

1:13:53 Realism, Validity, and the Purpose of Evaluation