YouTube · 08 Dec 2025
1h 9m

Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 5: Off-Policy Actor Critic

Stanford Online

Chelsea Finn delivers a lecture on reinforcement learning that begins with a recap of policy gradients, value functions, and actor-critic methods. The lecture then turns to off-policy methods, specifically the PPO and SAC algorithms, and discusses how importance weights and surrogate objectives can improve policy learning. She explains how overfitting can make learning unstable and introduces KL constraints and clipping as techniques for stabilizing policy updates. The lecture covers PPO's clipping mechanism, surrogate objective, and advantage estimation. It then moves on to SAC, which uses a replay buffer to reuse past data, and discusses the modifications needed to make actor-critic algorithms off-policy, including fitting a Q-function instead of a value function. The lecture concludes with a comparison of PPO and SAC that highlights their trade-offs in data efficiency and stability, followed by examples of reinforcement learning applications in robotics and language models.
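The clipping idea mentioned in the summary can be illustrated with a small sketch. The code below is not taken from the lecture; it follows the commonly published form of the PPO clipped surrogate objective, and the function name `ppo_clipped_surrogate` and the default `clip_eps=0.2` are assumptions chosen for illustration. It uses only the Python standard library and takes per-sample log-probabilities under the new and old policies together with advantage estimates.

```python
import math

def ppo_clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    Hypothetical helper for illustration; clip_eps=0.2 is an assumed default.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Importance weight: probability ratio pi_new(a|s) / pi_old(a|s).
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to the interval [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the new and old policies agree, every ratio is 1 and the objective reduces to the mean advantage. When the ratio moves outside the clip interval in the direction that would increase the objective, the clipped term takes over, which removes the incentive to push the policy further from the one that collected the data.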

Outlines

Part 1: Foundations, Actor-Critic Basics

Part 2: Stability, PPO Framework

Part 3: Replay Buffers, SAC

Part 4: Comparison, Applications
