Build an LLM from Scratch 2: Working with text data

In this installment of his code-along video series, Sebastian Raschka works through Chapter 2 of his book, "Build a Large Language Model from Scratch," which covers preparing text data for LLM training. The discussion includes tokenizing text, converting tokens into token IDs, handling special tokens, and using byte pair encoding. He also explains data sampling with a sliding window, creating token embeddings, and adding positional information, using libraries such as tiktoken and PyTorch to process and format the text data for the LLM.
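As a rough illustration of the first steps listed above (regex tokenization, building a vocabulary with special tokens, and sliding-window sampling), here is a minimal pure-Python sketch. It is not the code from the video: the sample text, the regular expression, the function names, and the `<|unk|>` and `<|endoftext|>` token names are assumptions made for illustration, and the episode itself relies on tiktoken and PyTorch for byte pair encoding and data loading.

```python
import re

def tokenize(text):
    # Split on punctuation and whitespace, keeping punctuation as separate tokens.
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [p.strip() for p in parts if p.strip()]

def build_vocab(tokens):
    # Sorted unique tokens plus two special tokens (names are illustrative).
    all_tokens = sorted(set(tokens)) + ["<|endoftext|>", "<|unk|>"]
    return {tok: idx for idx, tok in enumerate(all_tokens)}

def encode(text, vocab):
    # Words missing from the vocabulary fall back to the <|unk|> ID.
    return [vocab.get(tok, vocab["<|unk|>"]) for tok in tokenize(text)]

def sliding_windows(ids, context_length, stride):
    # Each target sequence is the input sequence shifted one position to the right.
    pairs = []
    for start in range(0, len(ids) - context_length, stride):
        inputs = ids[start:start + context_length]
        targets = ids[start + 1:start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs

sample = "Hello, world. Hello again, world."
vocab = build_vocab(tokenize(sample))
ids = encode(sample, vocab)
pairs = sliding_windows(ids, context_length=4, stride=2)

for inputs, targets in pairs:
    print(inputs, "->", targets)
```

With `context_length=4` and `stride=2`, consecutive windows overlap by two tokens; setting the stride equal to the context length would remove the overlap.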

Outline

Part 1: Introduction and Tokenization Basics

Part 2: Tokenization Methods and Vocabulary Building

Part 3: Data Handling and Embeddings

Sebastian Raschka

Part 1: Introduction and Tokenization Basics

00:01 Introduction to Preparing Text Data for Large Language Models

02:12 Tokenizing Text: Breaking Down Text into Smaller Chunks

04:01 Downloading and Preparing the Dataset: "The Verdict"

Part 2: Tokenization Methods and Vocabulary Building

08:48 Tokenization with Regular Expressions: A Warm-Up

13:52 Converting Tokens into Token IDs: Building a Vocabulary

17:56 Implementing a Simple Tokenizer Class

23:30 Adding Special Context Tokens to the Tokenizer

29:31 Enhancing the Tokenizer: Handling Unknown Words

36:19 Byte Pair Encoding and the tiktoken Library

Part 3: Data Handling and Embeddings

43:55 Data Sampling with a Sliding Window

57:12 Implementing Data Loaders in PyTorch

1:07:14 Creating Token Embeddings

1:17:20 Adding Positional Information to Token Embeddings