YouTube | 16 Jun 2025
1h 19m

Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 14: Data 2

Stanford Online

This lecture covers the data filtering and deduplication techniques used to prepare training data for language models. The first half outlines filtering algorithms, including n-gram models, fastText classifiers, and importance resampling, with an emphasis on speed and generalization. It then shows how these methods are applied to language identification, quality filtering, and toxicity filtering. The second half turns to deduplication: it distinguishes exact duplicates from near duplicates and introduces hashing techniques such as Bloom filters and MinHash with locality-sensitive hashing (LSH) to find and remove duplicates efficiently. Removing duplicates improves training efficiency and reduces memorization in language models.
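To make the near-duplicate idea concrete, here is a minimal MinHash sketch in Python. It is an illustration under stated assumptions, not the lecture's own code: the function names, the use of word 3-gram shingles, the SHA-1 based hash family, and the choice of 128 hash functions are all hypothetical choices made for this example. The fraction of positions where two signatures agree approximates the Jaccard similarity of the two shingle sets.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in text. (n=3 is an assumption.)"""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison with the estimate."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(item, seed):
    """Seeded 64-bit hash built from SHA-1 (a stand-in for a fast hash family)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over all items."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
    set_a, set_b = shingles(doc_a), shingles(doc_b)
    sig_a, sig_b = minhash_signature(set_a), minhash_signature(set_b)
    print("exact Jaccard:   ", jaccard(set_a, set_b))
    print("MinHash estimate:", estimate_jaccard(sig_a, sig_b))
```

In a full pipeline the signatures would typically be split into bands and hashed into buckets (the LSH step), so that only documents sharing a bucket are compared; that step is omitted here to keep the sketch short.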

Outline

Part 1: Data Filtering Basics

Part 2: Applications of Data Filtering

Part 3: Deduplication Techniques

Part 4: Summary and Conclusion
