YouTube · 30 Oct 2024
1h 14m

Stanford CS234 Reinforcement Learning I Exploration 1 I 2024 I Lecture 11

Stanford Online

This episode covers data-efficient reinforcement learning, using the multi-armed bandit problem as a simple setting. The central goal is to minimize regret: the gap between the reward an optimal strategy would collect and the reward the algorithm actually collects. The episode examines greedy and epsilon-greedy methods and where they fall short. It then introduces the Upper Confidence Bound (UCB) algorithm, which applies "optimism under uncertainty": it uses confidence intervals to balance exploration and exploitation, and its regret grows sublinearly. The discussion closes with a proof sketch showing that UCB achieves regret logarithmic in the number of time steps, a notable improvement over the linear regret that simpler approaches such as greedy can incur.
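To make the UCB idea concrete, here is a minimal sketch in Python. It assumes a Bernoulli bandit and the standard UCB1 bonus of sqrt(2 ln t / n); the lecture may use a different constant or confidence level, and the function name `ucb1`, the arm means, and the horizon are all hypothetical choices for illustration. The regret it tracks is the expected regret: the sum, over pulls, of the gap between the best arm's mean and the chosen arm's mean.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit; return (pull counts, expected regret).

    arm_means are the true success probabilities. They are used only to
    simulate rewards and to measure regret, never to choose an arm.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms      # number of pulls per arm
    totals = [0.0] * n_arms    # summed reward per arm
    best_mean = max(arm_means)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Pull every arm once so each has an empirical mean.
            arm = t - 1
        else:
            # Optimism under uncertainty: pick the arm with the highest
            # upper confidence bound (empirical mean plus a bonus that
            # shrinks as the arm is pulled more often).
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        regret += best_mean - arm_means[arm]

    return counts, regret


if __name__ == "__main__":
    counts, regret = ucb1([0.2, 0.5, 0.8], horizon=5000)
    print("pulls per arm:", counts)
    print("expected regret:", round(regret, 1))
```

Under these assumptions, the suboptimal arms are pulled less and less often as their confidence intervals tighten, so the regret grows far more slowly than the horizon. A greedy rule that locks onto a suboptimal arm would instead pay a fixed gap on every step.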
