Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 5: Off-Policy Actor Critic | Stanford Online

Chelsea Finn recaps policy gradients, value functions, and actor-critic methods, then turns to off-policy actor-critic learning through two algorithms, PPO and SAC. She discusses how importance weights and surrogate objectives improve policy learning, explains why learning can become unstable when the policy overfits, and introduces KL constraints and clipping to stabilize policy updates. The lecture covers PPO's clipping mechanism, surrogate objective, and advantage estimation. It then moves on to SAC, which uses a replay buffer to reuse past data, and discusses the modifications needed to make actor-critic algorithms off-policy, including fitting a Q-function instead of a value function. The lecture concludes with a comparison of PPO and SAC, highlighting their trade-offs in data efficiency and stability, and with examples of reinforcement learning applications in robotics and language models.
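
To make the clipping idea in the summary concrete, here is a minimal sketch of a PPO-style clipped surrogate objective in plain Python. It follows the commonly published form of the objective rather than anything shown on this page: the function name `ppo_clipped_surrogate`, its arguments, and the default `clip_eps=0.2` are illustrative assumptions, and the advantages are taken as given instead of being estimated.

```python
import math


def ppo_clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    logp_new / logp_old: log-probabilities of the sampled actions under the
    current policy and under the policy that collected the data.
    clip_eps is an assumed hyperparameter, not a value taken from the lecture.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Importance weight pi_new(a|s) / pi_old(a|s), computed in log space.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the weight to the interval [1 - eps, 1 + eps].
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes any gain from pushing the ratio
        # outside the clip range in the direction the advantage favors.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

With a positive advantage, a ratio above `1 + clip_eps` earns no extra objective value, so there is little incentive to move the new policy far from the one that collected the data. In practice the same expression would be written in an autodiff framework so that it can be differentiated with respect to the policy parameters.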

Outline

Part 1: Foundations, Actor-Critic Basics

Part 2: Stability, PPO Framework

Part 3: Replay Buffers, SAC

Part 4: Comparison, Applications

Part 1: Foundations, Actor-Critic Basics

00:05 Review of Policy Gradients, Value Functions, and Actor-Critic Methods

06:06 Off-Policy Actor-Critic Methods and Importance Weights

Part 2: Stability, PPO Framework

17:35 Addressing Instability with Surrogate Objectives and KL Constraints

28:40 Clipping Importance Weights in PPO

37:44 PPO Algorithm: Final Objective and Advantage Estimation

Part 3: Replay Buffers, SAC

47:00 Replay Buffers and Off-Policy Learning

53:01 Fitting Q-Functions and the Soft Actor-Critic Algorithm

Part 4: Comparison, Applications

1:07:18 Comparing SAC and PPO, and Real-World Applications