In this lecture, Anikait, a TA for CS224R, provides an overview of Q-Learning, beginning with a brief review of Markov decision processes (MDPs) and transitioning from tabular problems to fitted Q-iteration. The lecture then progresses to parametric Q-Learning, covering bias-variance trade-offs and comparing TD regression with Monte Carlo rollouts. Practical aspects of learning effective Q-functions are discussed, including replay buffers, overestimation issues, and the semi-gradient nature of TD learning. Anikait uses a grid-world problem to illustrate key concepts such as value functions, Q-functions, and advantages, and explains how to learn a Q-function through dynamic programming. The discussion also covers the differences between Monte Carlo and TD estimates, n-step returns, and techniques for stabilizing Q-Learning, including semi-gradients, target networks, gradient clipping, the Huber loss, and replay buffers, all of which improve the stability and performance of Q-Learning algorithms.
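Since the summary only names the stabilization techniques in passing, the following minimal PyTorch sketch (not taken from the lecture; the network sizes, hyperparameters, and the `td_update` helper are illustrative assumptions) shows how the pieces typically fit together in one semi-gradient TD update: a replay buffer, a frozen target network, the Huber loss, and gradient clipping.

```python
# Minimal illustrative sketch of one parametric Q-Learning update.
# All sizes and hyperparameters are placeholders, not values from the lecture.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, gamma = 4, 2, 0.99

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())  # periodically re-synced copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

replay_buffer = deque(maxlen=10_000)  # stores (s, a, r, s', done) tuples


def td_update(batch_size=32):
    """One semi-gradient TD(0) update on a minibatch from the replay buffer."""
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    s, s2, r, done = s.float(), s2.float(), r.float(), done.float()

    # Q(s, a) for the actions actually taken.
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)

    # Semi-gradient TD target: bootstrap from the frozen target network,
    # so no gradient flows through the bootstrapped term.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values

    loss = F.smooth_l1_loss(q_sa, target)  # Huber loss for robustness
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(q_net.parameters(), max_norm=10.0)  # gradient clipping
    optimizer.step()
    return loss.item()


# Fill the buffer with random transitions so the sketch runs end to end.
for _ in range(1000):
    s = [random.random() for _ in range(obs_dim)]
    s2 = [random.random() for _ in range(obs_dim)]
    replay_buffer.append(
        (s, random.randrange(n_actions), random.random(), s2, float(random.random() < 0.05))
    )

print(td_update())
```

Detaching the bootstrapped target (via `torch.no_grad()` and the separate target network) is what makes this a semi-gradient method rather than a full-gradient one, which is the distinction the lecture draws when discussing TD learning's gradient behavior.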