Building LLMs from the Ground Up: A 3-hour Coding Workshop

This coding workshop builds large language models (LLMs) from the ground up, starting with data preparation and ending with an instruction fine-tuned LLM. It covers tokenizing text, coding the LLM architecture, pre-training a small LLM, loading pre-trained weights, and fine-tuning LLMs to follow instructions. The workshop uses PyTorch and tiktoken and relies heavily on the presenter’s book, "Build a Large Language Model from Scratch." The presenter demonstrates how to tokenize text, convert tokens into token IDs, and prepare data for LLM training, then shows how to load pre-trained weights from OpenAI and use LitGPT to work with more sophisticated LLMs. The workshop concludes with instruction fine-tuning and model evaluation using MMLU.

Outline

Part 1: Introduction, Workshop Overview

Part 2: Data Preparation, Tokenization

Part 3: Architecture, Training Logic

Part 4: Text Generation, Pre-training

Part 5: Pretrained Weights, LitGPT

Part 6: Fine-tuning, LoRA, Evaluation

Sebastian Raschka

Part 1: Introduction, Workshop Overview

00:01 Building Large Language Models from Scratch: A Coding Workshop Overview

00:53 Workshop Topics: Data Input, Architecture, Pre-training, and Fine-tuning LLMs

02:13 Three Common Ways of Using LLMs: Proprietary Services, Local LLMs, and API Endpoints

05:58 Developing LLMs: Data Preparation, Architecture Coding, Pre-training, and Fine-tuning

07:33 "Build a Large Language Model from Scratch" Book and LitGPT Library

08:33 Following Along: Lightning Studio vs. GitHub Setup for LLM Workshop

Part 2: Data Preparation, Tokenization

11:43 Data Preparation for LLMs: Tokenization, Token IDs, and Embeddings

14:20 The Verdict by Edith Wharton: A Public Domain Short Story for LLM Training

16:14 Tokenization: Breaking Text into Tokens Using Regular Expressions

17:42 Token Statistics: 9,000 Tokens and 1,133 Unique Tokens in the Verdict Dataset

19:22 Building a Vocabulary: Assigning Unique Token IDs to Unique Tokens

20:36 Converting Text to Token IDs: Using the Vocabulary for Numerical Representation

22:21 Simple Tokenizer Class: Encoding Text to Token IDs and Decoding Back

24:32 Tokenizer Initialization and Usage: Encoding and Decoding Text

27:23 Byte-Pair Encoding: Handling Unknown Words with the tiktoken Library

29:23 tiktoken: Encoding and Decoding with GPT-2 Vocabulary

30:41 Handling Unknown Words: Subword Tokenization in Byte-Pair Encoding
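
The Part 2 chapters on regex tokenization, vocabulary building, and the simple tokenizer class can be illustrated with a minimal sketch. This is not the presenter's code: the class name `SimpleTokenizer`, the helper `build_vocab`, the regular expression, and the sample sentence are all assumptions chosen for illustration, and the sketch uses only the Python standard library rather than tiktoken.

```python
import re


class SimpleTokenizer:
    """Toy word-level tokenizer (hypothetical, for illustration only)."""

    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    @staticmethod
    def split(text):
        # Split on whitespace and punctuation; the capturing group keeps
        # punctuation marks as their own tokens. Empty pieces are dropped.
        pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        return [p.strip() for p in pieces if p.strip()]

    def encode(self, text):
        # Raises KeyError for words missing from the vocabulary, which is
        # the limitation that byte-pair encoding addresses.
        return [self.str_to_int[t] for t in self.split(text)]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the space that join() placed before punctuation.
        return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)


def build_vocab(text):
    # Assign one integer ID to each unique token, in sorted order.
    tokens = sorted(set(SimpleTokenizer.split(text)))
    return {tok: idx for idx, tok in enumerate(tokens)}


sample = "Hello, world. Is this a test?"  # made-up sample text
vocab = build_vocab(sample)
tokenizer = SimpleTokenizer(vocab)
ids = tokenizer.encode(sample)
roundtrip = tokenizer.decode(ids)
```

A word-level vocabulary like this cannot encode words it has never seen, whereas a byte-pair encoding tokenizer breaks unknown words into subword units.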

Part 3: Architecture, Training Logic

32:42 Data Loading: LLMs Predict the Plus One Token at Each Position

33:51 Input and Target Tokens: Shifting by One Position for Next Word Prediction

35:55 Strides: Shifting the Input Window to Avoid Overlapping Tokens

37:51 Dataset Structure: Mini-Batches and Shifted Targets for LLM Training

40:21 LLM Architecture: A Top-Down View of Transformer Blocks and Embedding Layers

41:33 Transformer Blocks: The Core of LLM Architectures

44:21 GPT-Small vs. GPT-2-Large: Scaling Transformer Blocks and Embedding Sizes

46:37 Hyperparameter Configuration and Input/Output Overview

49:22 Output Shape: Understanding the Dimensions of the LLM Output Tensor
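
The data-loading chapters in Part 3 describe targets that are the inputs shifted by one position, with a stride that controls how far the input window moves. A minimal sketch of that idea follows. It uses plain Python lists instead of a PyTorch dataset, and the function name `make_input_target_pairs` and its parameters are hypothetical, not taken from the workshop.

```python
def make_input_target_pairs(token_ids, context_length, stride):
    """Slide a window over token_ids and pair each window with the
    same window shifted one position to the right."""
    pairs = []
    # Stop early enough that the shifted target window is always full.
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start : start + context_length]
        targets = token_ids[start + 1 : start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs


token_ids = list(range(10))  # stand-in for real token IDs

# With stride == context_length, consecutive input windows do not overlap.
pairs = make_input_target_pairs(token_ids, context_length=4, stride=4)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

A stride smaller than the context length would produce overlapping windows and more training examples from the same text.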

Part 4: Text Generation, Pre-training

51:11 From Token Embeddings to Text: Generating the Next Word

54:01 Logits and Probabilities: Decoding the Next Token ID

58:24 Text Generation Function: Appending the Next Token as Input

1:00:24 Generating Text: Exercise and Results with a Non-Pretrained Model

1:07:15 Pre-training LLMs: Setting Up the Training Environment

1:08:41 Convenience Functions and Data Preparation

1:10:20 Data Loader: Preparing Batches for Training and Validation

1:12:26 Dataset Statistics: 4,600 Training Tokens and 512 Validation Tokens

1:14:26 Model Training: Iterating Over Epochs and Batches

1:15:54 Training Function: Implementing the Training Loop

1:17:32 Model Training Results: Decreasing Loss and Coherent Text Generation
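
The text-generation chapters in Part 4 describe turning logits into probabilities, picking the next token ID, and appending it to the input for the next step. The sketch below illustrates that loop under stated assumptions: `toy_model` is a hypothetical stand-in for a trained LLM, greedy selection is assumed as the decoding rule, and the code uses only the standard library rather than PyTorch.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate_greedy(model_fn, token_ids, max_new_tokens, context_length):
    token_ids = list(token_ids)
    for _ in range(max_new_tokens):
        # Crop the context to the window size the model supports.
        context = token_ids[-context_length:]
        logits = model_fn(context)  # one score per vocabulary entry
        probs = softmax(logits)
        # Greedy decoding: take the ID with the highest probability.
        next_id = max(range(len(probs)), key=probs.__getitem__)
        # Append the new token so it becomes part of the next input.
        token_ids.append(next_id)
    return token_ids


def toy_model(context, vocab_size=5):
    # Hypothetical stand-in for an LLM: favors (last token + 1) mod vocab_size.
    favored = (context[-1] + 1) % vocab_size
    return [2.0 if i == favored else 0.0 for i in range(vocab_size)]


generated = generate_greedy(toy_model, [0], max_new_tokens=6, context_length=4)
print(generated)
```

Because softmax preserves the ordering of the logits, greedy decoding would select the same token from the raw logits; the probabilities matter once sampling is used instead.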