This lecture covers model-free policy evaluation in reinforcement learning, with an emphasis on tabular methods. It contrasts two main approaches: Monte Carlo policy evaluation, which averages the returns observed over complete episodes, and Temporal Difference (TD) learning, which incrementally updates value estimates after each state transition. The key differences: Monte Carlo is unbiased but has high variance and requires episodic settings, whereas TD learning has lower variance, applies to both episodic and continuing tasks, and bootstraps by building its update targets on the current value estimates. The lecture also covers certainty equivalence, in which a model is estimated from data and then evaluated with dynamic programming, and batch policy evaluation, in which a fixed dataset of episodes is reused for updates. It highlights how these methods can converge to different answers, especially when data is scarce, because TD implicitly exploits the Markov property while Monte Carlo does not.
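
To make the contrast concrete, here is a minimal sketch of tabular first-visit Monte Carlo and TD(0) policy evaluation applied to a small batch of episodes collected under a fixed policy. The episode format, toy data, and hyperparameters (e.g., the step size alpha) are illustrative assumptions, not the lecture's own notation or example.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=1.0):
    """First-visit Monte Carlo: average the full return observed after the
    first visit to each state (unbiased, high variance, needs complete episodes)."""
    V = defaultdict(float)
    visit_count = defaultdict(int)
    for episode in episodes:  # episode: list of (state, reward, next_state)
        # Compute returns G_t backwards from the end of the episode.
        G = 0.0
        first_visit_return = {}
        for t in reversed(range(len(episode))):
            s, r, _ = episode[t]
            G = r + gamma * G
            first_visit_return[s] = G  # earlier visits overwrite later ones
        for s, G in first_visit_return.items():
            visit_count[s] += 1
            V[s] += (G - V[s]) / visit_count[s]  # incremental average of returns
    return dict(V)

def td0_policy_evaluation(episodes, alpha=0.1, gamma=1.0):
    """TD(0): after every transition, move V(s) toward the bootstrapped target
    r + gamma * V(s') (biased by current estimates, lower variance, works online)."""
    V = defaultdict(float)
    for episode in episodes:
        for s, r, s_next in episode:
            td_target = r + gamma * V[s_next]
            V[s] += alpha * (td_target - V[s])
    return dict(V)

# Tiny illustrative batch: with this little data the two estimators can
# disagree, since TD leans on the Markov property via bootstrapping while
# Monte Carlo only averages the returns it actually observed.
episodes = [
    [("A", 0, "B"), ("B", 1, "terminal")],
    [("B", 0, "terminal")],
]
print(mc_policy_evaluation(episodes))
print(td0_policy_evaluation(episodes))
```

Rerunning the TD update repeatedly over the same fixed batch is one simple form of the batch policy evaluation mentioned above, and its fixed point matches what dynamic programming would compute on the model estimated from that batch (the certainty-equivalence view), whereas batch Monte Carlo settles on the averages of the observed returns.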