This lecture on reinforcement learning focuses on Q-learning and deep Q-learning (DQN), showing how these methods let agents learn to play video games directly from pixel input. It addresses the exploration vs. exploitation dilemma and presents epsilon-greedy strategies for balancing the exploration of new actions against the exploitation of the best actions known so far. The discussion covers Monte Carlo and temporal difference methods, including SARSA and Q-learning, in both tabular and function-approximation settings. The lecture also highlights the "deadly triad" of bootstrapping, function approximation, and off-policy learning, which together can make learning unstable. Finally, it outlines DQN's key innovations, experience replay and fixed Q-targets, which greatly improved the stability and performance of agents playing Atari games.
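To make the tabular side of this concrete, below is a minimal sketch of epsilon-greedy Q-learning. The tiny chain environment, its reward structure, and the hyperparameters are illustrative assumptions, not taken from the lecture; the point is only to show the epsilon-greedy action choice and the off-policy temporal-difference update.

```python
import numpy as np

# A tiny deterministic chain MDP, used only for illustration:
# states 0..4, actions 0 = left, 1 = right; reaching state 4 gives reward +1 and ends the episode.
N_STATES, N_ACTIONS = 5, 2

def step(state, action):
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount factor, exploration rate (illustrative values)
Q = np.zeros((N_STATES, N_ACTIONS))      # tabular action-value estimates
rng = np.random.default_rng(0)

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: explore a random action with probability epsilon, otherwise exploit current Q estimates.
        if rng.random() < epsilon:
            action = int(rng.integers(N_ACTIONS))
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        # Q-learning (off-policy TD) update: bootstrap from the greedy value of the next state.
        td_target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
        Q[state, action] += alpha * (td_target - Q[state, action])
        state = next_state

print(np.round(Q, 2))  # learned values should favor action 1 (move right) in every state
```

DQN extends this same update by replacing the Q-table with a neural network over pixel input, sampling past transitions from an experience replay buffer, and computing the bootstrapped target with a periodically updated (fixed) target network rather than the constantly changing online network.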