Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 13: Data 1
Stanford Online
The lecture focuses on the importance of data in training language models, arguing that it is the most critical factor. It covers the stages of training (pre-training, mid-training, and post-training) and the types of data used in each. It then examines specific data sources, including BERT's training corpus, WebText, Common Crawl, C4, and others, detailing their composition, filtering methods, and potential issues such as data poisoning and copyright. It also addresses the legal and ethical considerations of training on copyrighted material, including licensing and fair use. The lecture concludes with mid-training and post-training data, emphasizing long-context data, task-specific datasets, and synthetic data for instruction tuning.
Part 1: Introduction and Data Fundamentals
Part 2: Web Data Acquisition and Filtering
Part 3: Legal and Ethical Considerations
Part 4: Mid-Training, Post-Training, and Synthesis