YouTube, 10 Jun 2025
1h 19m

Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 13: Data 1

Stanford Online

The lecture argues that data is the most important factor in training language models. It walks through the stages of training (pre-training, mid-training, and post-training) and the types of data used in each. It then examines specific datasets, including the data used to train BERT, WebText, Common Crawl, C4, and others, detailing their composition, filtering methods, and potential issues such as data poisoning and copyright. It also addresses the legal and ethical considerations of training on copyrighted material, discussing licensing and fair use. The lecture concludes with mid-training and post-training data, emphasizing long-context data, task-specific datasets, and synthetic data for instruction tuning.

Outline

Part 1: Introduction and Data Fundamentals

Part 2: Web Data Acquisition and Filtering

Part 3: Legal and Ethical Considerations

Part 4: Mid-Training, Post-Training, and Synthesis
