This lecture covers policy-based reinforcement learning: methods that optimize a parameterized policy directly to maximize expected return, without requiring an explicit value function. It explains why stochastic policies are useful under partial observability and in non-Markov settings, illustrated with examples such as Rock-Paper-Scissors and robotic control. The central result is the policy gradient theorem, which expresses the gradient of the expected return through the likelihood-ratio (score-function) trick and leads to algorithms such as REINFORCE. Because these estimators suffer from high variance, techniques such as subtracting a baseline are introduced to reduce variance without biasing the gradient. This foundation sets the stage for more advanced algorithms such as Proximal Policy Optimization (PPO).
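For reference, a standard statement of the likelihood-ratio (score-function) form of the policy gradient with a baseline is sketched below; the notation (return $G_t$, baseline $b(s_t)$) is conventional and assumed here rather than taken from the lecture itself:

$$
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(G_t - b(s_t)\bigr)\right]
$$

Subtracting the baseline $b(s_t)$ leaves the estimator unbiased, because $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = 0$, while it can substantially reduce the variance of the REINFORCE update.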