Daniel Han delivers a presentation on deep reinforcement learning (RL), covering topics such as kernels, agents, and quantization, and then transitions into an extensive Q&A session. He discusses the history of large language models, the open-source drought, and the jumps in performance achieved through supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Daniel explains the concepts of agents and environments in RL, reward functions, and algorithms such as PPO and GRPO, including the importance of maximizing reward while avoiding overfitting. He also touches on the significance of verifiable rewards, the role of reward models, and the trade-offs between model size, data quality, and computational efficiency. The presentation further explores quantization techniques for reducing model size and improving performance, as well as the importance of torch.compile for optimizing training runs.
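As a rough illustration of the torch.compile point, here is a minimal sketch (not code from the talk, assuming PyTorch 2.x): wrapping a model with torch.compile lets PyTorch trace it and generate fused kernels, which is where much of the training-run speedup comes from.

```python
# Minimal sketch (not from the talk), assuming PyTorch 2.x: torch.compile
# traces the model and emits fused, optimized kernels, so repeated training
# steps run faster than eager execution.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
compiled_model = torch.compile(model)  # compilation happens lazily on the first call

x = torch.randn(8, 1024)
loss = compiled_model(x).square().mean()
loss.backward()  # gradients flow through the compiled graph as usual
```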