In this podcast episode, the focus is on sample-efficient reinforcement learning, viewed through the lens of multi-armed bandits and Bayesian methods. The conversation opens with an overview of multi-armed bandit algorithms, contrasting the regret-minimization and reward-maximization framings of the problem. A central part of the discussion covers Bayesian bandits and Thompson sampling, highlighting their strengths under delayed and batched feedback. These properties matter in real-world settings such as COVID-19 quarantine strategies and online advertising campaigns. The episode then compares Thompson sampling with optimistic algorithms, weighing the pros and cons of each, and introduces the PAC (Probably Approximately Correct) framework as an alternative criterion for assessing performance. It closes by setting the theoretical optimality of Gittins index policies against the practical effectiveness of Thompson sampling.
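For readers who want a concrete feel for the algorithm at the heart of the episode, here is a minimal sketch of Beta-Bernoulli Thompson sampling under batched feedback. The arm means, prior, and batch size below are illustrative assumptions for the sketch, not values from the episode.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Bernoulli bandit: true success probabilities, unknown to the learner.
true_means = np.array([0.3, 0.5, 0.7])
n_arms = len(true_means)

# Beta(1, 1) priors over each arm's success probability.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

batch_size = 20  # all actions in a batch are chosen before any reward is observed
n_batches = 50

for _ in range(n_batches):
    # Draw one plausible mean per arm per decision, then act greedily on the draw.
    # Fresh posterior samples keep the batch diverse, which is one reason Thompson
    # sampling degrades gracefully under delayed or batched feedback.
    theta = rng.beta(alpha, beta, size=(batch_size, n_arms))
    arms = theta.argmax(axis=1)
    rewards = rng.random(batch_size) < true_means[arms]

    # The posterior is updated only once the whole batch of feedback arrives.
    np.add.at(alpha, arms, rewards)
    np.add.at(beta, arms, ~rewards)

print("posterior means:", alpha / (alpha + beta))
```

The key design choice illustrated here is that randomness comes from the posterior itself: within a batch, the same posterior yields different samples and hence different arms, whereas a deterministic optimistic rule would pick the same arm for the entire batch until new feedback arrives.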