The lecture focuses on the importance of data in training language models, arguing it is the most critical factor. It covers the stages of training, including pre-training, mid-training, and post-training, and the types of data used in each. It then examines the corpora behind specific models and datasets, including BERT's training data, WebText, Common Crawl, C4, and others, detailing their composition, their filtering methods, and potential issues such as data poisoning and copyright. It also addresses the legal and ethical considerations of training on copyrighted material, discussing licensing and fair use. The lecture concludes by touching on mid-training and post-training data, emphasizing long-context data, task-specific datasets, and synthetic data for instruction tuning.
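As an illustration of the kind of filtering the lecture describes, C4's published heuristics include keeping only lines that end in terminal punctuation, dropping lines with fewer than five words or containing the word "Javascript", and discarding pages that contain "lorem ipsum", curly braces (a sign of code rather than prose), or fewer than three retained sentences. The sketch below is a minimal Python rendering of those heuristics, not the exact C4 pipeline, and the function names are illustrative:

```python
# A minimal sketch of C4-style quality filters, based on heuristics
# described in the C4 paper; the real pipeline differs in detail.

TERMINAL_PUNCT = (".", "!", "?", '"')

def keep_line(line: str) -> bool:
    """Keep a line only if it looks like natural-language prose."""
    line = line.strip()
    if len(line.split()) < 5:              # too short to be a sentence
        return False
    if not line.endswith(TERMINAL_PUNCT):  # C4 kept lines ending in terminal punctuation
        return False
    if "javascript" in line.lower():       # leftover web boilerplate
        return False
    return True

def keep_page(text: str) -> bool:
    """Drop whole pages unlikely to contain useful prose."""
    if "lorem ipsum" in text.lower():      # placeholder text
        return False
    if "{" in text:                        # curly brace suggests code, not prose
        return False
    good_lines = [l for l in text.splitlines() if keep_line(l)]
    return len(good_lines) >= 3            # require a few good sentences

def clean_page(text: str) -> str | None:
    """Return the filtered page text, or None if the page is dropped."""
    if not keep_page(text):
        return None
    return "\n".join(l.strip() for l in text.splitlines() if keep_line(l))
```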