This coding workshop focuses on building large language models (LLMs) from the ground up, starting with data preparation and ending with an instruction fine-tuned LLM. It covers tokenizing text, coding the LLM architecture, pre-training a small LLM, loading pre-trained weights, and fine-tuning LLMs to follow instructions. The workshop uses PyTorch and tiktoken, and relies heavily on the presenter’s book, "Build a Large Language Model from Scratch." The presenter demonstrates how to tokenize text, convert tokens into IDs, and prepare data for LLM training. The presenter also shows how to load pre-trained weights from OpenAI and use LitGPT to work with more sophisticated LLMs. The workshop concludes with instruction fine-tuning and model evaluation using MMLU.
Part 1: Introduction, Workshop Overview
Building Large Language Models from Scratch: A Coding Workshop Overview
Workshop Topics: Data Input, Architecture, Pre-training, and Fine-tuning LLMs
Three Common Ways of Using LLMs: Proprietary Services, Local LLMs, and API Endpoints
Developing LLMs: Data Preparation, Architecture Coding, Pre-training, and Fine-tuning
"Build a Large Language Model from Scratch" Book and LitGPT Library
Following Along: Lightning Studio vs. GitHub Setup for LLM Workshop
Part 2: Data Preparation, Tokenization
Data Preparation for LLMs: Tokenization, Token IDs, and Embeddings
The Verdict by Edith Wharton: A Public Domain Short Story for LLM Training
Tokenization: Breaking Text into Tokens Using Regular Expressions
Token Statistics: 9,000 Tokens and 1,133 Unique Tokens in the Verdict Dataset
Building a Vocabulary: Assigning Unique Token IDs to Unique Tokens
Converting Text to Token IDs: Using the Vocabulary for Numerical Representation
Simple Tokenizer Class: Encoding Text to Token IDs and Decoding Back
Tokenizer Initialization and Usage: Encoding and Decoding Text
Byte-Pair Encoding: Handling Unknown Words with the tiktoken Library
tiktoken: Encoding and Decoding with the GPT-2 Vocabulary
Handling Unknown Words: Subword Tokenization in Byte-Pair Encoding
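The tokenization steps listed in Part 2 (splitting text with a regular expression, building a vocabulary, encoding to token IDs, and decoding back) can be sketched roughly as follows. This is a minimal illustration and not the workshop's actual code: the regular expression, the sample sentence, and the names `tokenize`, `build_vocab`, and `SimpleTokenizer` are assumptions made for this example.

```python
import re

def tokenize(text):
    # Split on punctuation, double dashes, and whitespace, keeping punctuation as tokens.
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [p.strip() for p in parts if p.strip()]

def build_vocab(tokens):
    # Assign one integer ID to each unique token, in sorted order.
    return {token: idx for idx, token in enumerate(sorted(set(tokens)))}

class SimpleTokenizer:
    """Hypothetical word-level tokenizer for illustration only."""

    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {idx: token for token, idx in vocab.items()}

    def encode(self, text):
        # Raises KeyError for any token that is not in the vocabulary.
        return [self.str_to_int[token] for token in tokenize(text)]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the space that join() placed before punctuation marks.
        return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)

sample_text = "Hello, world. Is this a test? Hello!"  # made-up sample text
tokens = tokenize(sample_text)
vocab = build_vocab(tokens)
tokenizer = SimpleTokenizer(vocab)

ids = tokenizer.encode("Hello, world.")
print(ids)
print(tokenizer.decode(ids))
```

A word-level tokenizer like this one fails with a `KeyError` on any word outside its vocabulary, which is the limitation that subword schemes such as byte-pair encoding are meant to address.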
Part 3: Architecture, Training Logic
Data Loading: LLMs Predict the Next Token at Each Position
Input and Target Tokens: Shifting by One Position for Next Word Prediction
Strides: Shifting the Input Window to Avoid Overlapping Tokens
Dataset Structure: Mini-Batches and Shifted Targets for LLM Training
LLM Architecture: A Top-Down View of Transformer Blocks and Embedding Layers
Transformer Blocks: The Core of LLM Architectures
GPT-2 Small vs. GPT-2 Large: Scaling Transformer Blocks and Embedding Sizes
Hyperparameter Configuration and Input/Output Overview
Output Shape: Understanding the Dimensions of the LLM Output Tensor
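The data-loading items in Part 3 (targets shifted by one position, and a stride that moves the input window) can be illustrated with a short sketch. It uses plain Python lists instead of PyTorch tensors to stay self-contained; the function name `make_input_target_pairs` and the example values are assumptions, not the workshop's code.

```python
def make_input_target_pairs(token_ids, context_length, stride):
    """Slide a window over token_ids and pair each input chunk with the
    same chunk shifted one position to the right (the next-token targets)."""
    inputs, targets = [], []
    for start in range(0, len(token_ids) - context_length, stride):
        inputs.append(token_ids[start:start + context_length])
        targets.append(token_ids[start + 1:start + context_length + 1])
    return inputs, targets

# Toy token IDs 0..9; a stride equal to the context length avoids overlapping inputs.
token_ids = list(range(10))
inputs, targets = make_input_target_pairs(token_ids, context_length=4, stride=4)

for x, y in zip(inputs, targets):
    print(x, "->", y)
```

With a stride smaller than the context length, consecutive input windows would share tokens; setting the stride equal to the context length is one way to avoid that overlap.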
Part 4: Text Generation, Pre-training
From Token Embeddings to Text: Generating the Next Word
Logits and Probabilities: Decoding the Next Token ID
Text Generation Function: Appending the Next Token as Input
Generating Text: Exercise and Results with a Non-Pretrained Model
Pre-training LLMs: Setting Up the Training Environment
Convenience Functions and Data Preparation
Data Loader: Preparing Batches for Training and Validation
Dataset Statistics: 4,600 Training Tokens and 512 Validation Tokens
Model Training: Iterating Over Epochs and Batches
Training Function: Implementing the Training Loop
Model Training Results: Decreasing Loss and Coherent Text Generation
Saving the Model and Visualizing Losses
Generating Text: Exercise with the Trained Model
Loading the Trained Model: Reusing the Model in a New Session
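The text-generation items in Part 4 (turning logits into probabilities, picking the next token ID, and appending it to the input) can be sketched as a greedy decoding loop. The `toy_model` below is a hypothetical stand-in that simply favors the next ID in a tiny five-token vocabulary; a real LLM would return logits for every position, and only those for the last position would be used.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(token_ids, vocab_size=5):
    # Hypothetical stand-in for an LLM: returns last-position logits that
    # favor the ID following the most recent token (wrapping around).
    favored = (token_ids[-1] + 1) % vocab_size
    return [5.0 if i == favored else 0.0 for i in range(vocab_size)]

def generate_greedy(model, token_ids, max_new_tokens):
    ids = list(token_ids)
    for _ in range(max_new_tokens):
        probs = softmax(model(ids))
        # Greedy decoding: take the ID with the highest probability.
        next_id = max(range(len(probs)), key=probs.__getitem__)
        ids.append(next_id)  # the new token becomes part of the next input
    return ids

generated = generate_greedy(toy_model, [0], max_new_tokens=6)
print(generated)
```

Because softmax preserves the ordering of the logits, greedy decoding could take the largest logit directly; the probabilities are computed here only to mirror the steps named in the outline.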
Part 5: Pretrained Weights, LitGPT
Loading Pretrained Weights: Improving Text Generation
Loading OpenAI Weights: Using TensorFlow and Utility Functions
Settings and Parameters: Architecture Information and Weight Tensors
Exploring the Weights: Transformer Blocks and Linear Layers
Loading TensorFlow Weights into PyTorch: Mapping and Overwriting Parameters
Generating Text with Pretrained Weights: Improved Coherence
Exercise: Trying a Larger Model for Improved Quality
LitGPT: A Library for More Sophisticated LLMs
LitGPT Architecture: GPT Model and Configuration
LitGPT Commands: Download and Chat
LitGPT Python API: Generating Text with LLM.generate
Downloading and Using Phi-3: A New and Good Model
Part 6: Fine-tuning, LoRA, Evaluation
Instruction Fine-tuning: Preparing Data Sets with Instructions
Instruction Data Set: Examples of Instructions and Desired Responses
Alpaca Data Set: Inspiration for Instruction Data Format
Data Set Size: 1,100 Instructions for Faster Training
Formatting Instructions: Alpaca Prompt Style Template
Tokenizing Instructions: Converting to Token IDs
Creating Training Batches: Padding Tokens for Equal Length
Target Tokens: Shifting by One Position and Adding Padding
Loss Calculation: Masking Instructions for Response-Focused Training
LoRA: Low-Rank Adaptation for Memory Efficiency
LoRA Layers: Applying to Linear Layers in LLMs
Data Preparation: Loading and Dividing into Training and Test Sets
LoRA Instruction Fine-tuning: Training for Three Epochs
Training Results: Decreasing Loss and Validation Performance
Exercise: Generating Responses for the Test Data Set
Generating Responses: Base Model and Fine-tuned Model
Evaluating Responses: Eyeballing the Results
Benchmarking LLMs: MMLU and the EleutherAI LM Evaluation Harness
AlpacaEval: Comparing Models to GPT-4 Preview
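The batching steps listed in Part 6 (padding sequences to equal length, shifting targets by one position, and masking instruction tokens so the loss focuses on the response) can be sketched as below. This is an illustration under stated assumptions rather than the workshop's implementation: the function name `collate` and the `instruction_lengths` argument are hypothetical, 50256 is the GPT-2 end-of-text token ID used here as padding, and -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
PAD_ID = 50256        # GPT-2 <|endoftext|> token ID, used here as the padding token
IGNORE_INDEX = -100   # positions with this target value are skipped by the loss

def collate(batch, instruction_lengths=None):
    """Pad token-ID lists to equal length and build shifted, masked targets."""
    max_len = max(len(ids) for ids in batch) + 1  # +1 leaves room for the shift
    inputs, targets = [], []
    for k, ids in enumerate(batch):
        padded = ids + [PAD_ID] * (max_len - len(ids))
        inp = padded[:-1]
        tgt = padded[1:]  # targets are the inputs shifted by one position
        # Keep the first padding token as an end-of-text target; mask the rest.
        tgt = [t if pos < len(ids) else IGNORE_INDEX for pos, t in enumerate(tgt)]
        if instruction_lengths is not None:
            # Mask the targets that correspond to instruction tokens.
            for pos in range(instruction_lengths[k] - 1):
                tgt[pos] = IGNORE_INDEX
        inputs.append(inp)
        targets.append(tgt)
    return inputs, targets

# Two toy sequences; the first 2 and 1 tokens, respectively, are the "instruction".
batch = [[1, 2, 3, 4, 5], [6, 7, 8]]
inputs, targets = collate(batch, instruction_lengths=[2, 1])
print(inputs)
print(targets)
```

In a PyTorch training loop, these lists would typically be converted to tensors, and the masked positions would then contribute nothing to the cross-entropy loss.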