Let's build the GPT Tokenizer

This podcast episode explores why tokenization matters in large language models. The speaker argues that many of the problems these models show, such as difficulty with spelling tasks, weaker handling of non-English languages, and limited arithmetic ability, often stem from tokenization. The episode walks through the tokenization process, the challenges posed by different languages, and the role of Unicode and its encodings. It also covers the tokenizer as a translation layer between raw text and token sequences, along with an implementation of the Byte Pair Encoding algorithm. The episode closes with tokenizer training, special tokens, and the impact of tokenization on language model performance.
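
The summary names the Byte Pair Encoding algorithm without showing it. As a rough illustration only, here is a minimal byte-level BPE sketch in Python. It is not the episode's own code: the function names (`pair_counts`, `merge`, `train_bpe`, `decode`) and the toy input string are assumptions made for this example.

```python
from collections import Counter


def pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace each left-to-right, non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # initial tokens are raw bytes, ids 0..255
    merges = {}
    for step in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = counts.most_common(1)[0][0]  # most frequent adjacent pair
        new_id = 256 + step  # new ids continue after the byte range
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Map token ids back to text by expanding merges into their bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # insertion order = training order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


ids, merges = train_bpe("aaabdaaabac", 3)
print(len(ids), decode(ids, merges))
```

On this toy string, three merges shrink the 11 input bytes to 5 tokens, and decoding recovers the original text. A real tokenizer would train on a far larger corpus and typically splits text with a regex pattern before merging.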

Outlines

Andrej Karpathy

00:00 Understanding Tokenization in Large Language Models

09:06 The Challenges of Tokenization, Language Models, and Unicode Handling Unveiled

18:18 The Unicode Standard, Encodings, and Byte Pair Encoding Algorithm

27:04 Merging Tokens to Compress Training Dataset and Create Encoding Algorithm

35:21 The Tokenizer: Translating Between Raw Text and Token Sequence

42:43 Tokenizer Training, Encoding, and Decoding

48:22 Encoding Strings into Tokens and the Use of Merges in Tokenization

54:59 The Byte Pair Encoding Algorithm and its Use in the GPT Series

1:00:15 The Role of Regex Pattern in Tokenization

1:07:27 Tokenization in GPT-2: Handling Uppercase, Lowercase, Whitespace, and Punctuation

1:13:47 The GPT-4 Tokenizer and Special Tokens

1:19:50 The Significance of Special Tokens in Language Models

1:28:19 Differences in Tokenization: tiktoken vs sentencepiece

1:35:05 Introduction to Code Points and Byte Fallback in Sentencepiece

1:41:44 Exploring Concepts of Sentencepiece and Vocabulary Size in Model Training

1:48:07 Tokenization, Fine-tuning, and Introducing New Tokens into Language Models

1:55:11 Tokenization Challenges and their Impact on Language Models

2:04:47 Tokenization and Language Models: Understanding Strange Behavior and Efficient Encoding