YouTube, 02 Mar 2025
1h 28m

Build an LLM from Scratch 2: Working with text data

Sebastian Raschka

In this installment of his code-along video series, Sebastian Raschka works through Chapter 2 of his book, "Build a Large Language Model from Scratch," which covers preparing text data for LLM training. The discussion includes tokenizing text, converting tokens into IDs, handling special tokens, and using byte pair encoding. He also explains data sampling with a sliding window, creating token embeddings, and adding positional information, using libraries such as tiktoken and PyTorch to process and format the text data for the LLM.
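
The first few steps named in the summary (tokenizing, mapping tokens to IDs, special tokens, and sliding-window sampling) can be illustrated with a minimal sketch. This is not the code from the video or the book: it uses only the Python standard library instead of tiktoken and PyTorch, a simple regex tokenizer instead of byte pair encoding, and the helper names (`tokenize`, `build_vocab`, `encode`, `sliding_windows`) and the sample sentence are hypothetical choices made for illustration.

```python
import re

def tokenize(text):
    # Split on whitespace and common punctuation, keeping punctuation as tokens.
    parts = re.split(r"([,.:;?!()]|--|\s)", text)
    return [p.strip() for p in parts if p.strip()]

def build_vocab(tokens):
    # Sorted unique tokens, followed by two special tokens (an assumed convention).
    symbols = sorted(set(tokens)) + ["<|endoftext|>", "<|unk|>"]
    return {tok: idx for idx, tok in enumerate(symbols)}

def encode(text, vocab):
    # Map each token to its ID; unseen tokens fall back to the <|unk|> ID.
    unk = vocab["<|unk|>"]
    return [vocab.get(tok, unk) for tok in tokenize(text)]

def sliding_windows(ids, context_len, stride):
    # Build (input, target) pairs; each target is the input shifted by one token.
    pairs = []
    for i in range(0, len(ids) - context_len, stride):
        inputs = ids[i:i + context_len]
        targets = ids[i + 1:i + context_len + 1]
        pairs.append((inputs, targets))
    return pairs

text = "The cat sat. The dog sat, too."
vocab = build_vocab(tokenize(text))
ids = encode(text, vocab)
pairs = sliding_windows(ids, context_len=4, stride=2)

for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Each printed pair shows a window of token IDs and the same window shifted one position to the right, which is the next-token prediction target. In a PyTorch pipeline these pairs would typically be wrapped in a dataset and fed to an embedding layer, with positional information added afterwards.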

Outline

Part 1: Introduction and Tokenization Basics

Part 2: Tokenization Methods and Vocabulary Building

Part 3: Data Handling and Embeddings
