In this podcast episode, the focus is on sample-efficient reinforcement learning, viewed through the lens of multi-armed bandits and Bayesian methods. The conversation opens with an overview of multi-armed bandit algorithms, contrasting the regret-minimization and reward-maximization framings of the problem. A central part of the discussion covers Bayesian bandits and Thompson sampling, highlighting their strengths under delayed and batched feedback. These properties matter in real-world settings such as COVID-19 quarantine strategies and online advertising campaigns. The episode then compares Thompson sampling with optimistic algorithms, weighing the pros and cons of each, and introduces the PAC (Probably Approximately Correct) framework as an alternative criterion for assessing performance. It closes by setting the theoretical optimality of Gittins index policies against the practical effectiveness of Thompson sampling.
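For readers who want a concrete feel for the algorithm at the heart of the episode, here is a minimal sketch of Beta-Bernoulli Thompson sampling under batched feedback. The arm means, prior, and batch size below are illustrative assumptions for the sketch, not values from the episode.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Bernoulli bandit: true success probabilities, unknown to the learner.
true_means = np.array([0.3, 0.5, 0.7])
n_arms = len(true_means)

# Beta(1, 1) priors over each arm's success probability.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

batch_size = 20  # all actions in a batch are chosen before any reward is observed
n_batches = 50

for _ in range(n_batches):
    # Draw one plausible mean per arm per decision, then act greedily on the draw.
    # Fresh posterior samples keep the batch diverse, which is one reason Thompson
    # sampling degrades gracefully under delayed or batched feedback.
    theta = rng.beta(alpha, beta, size=(batch_size, n_arms))
    arms = theta.argmax(axis=1)
    rewards = rng.random(batch_size) < true_means[arms]

    # The posterior is updated only once the whole batch of feedback arrives.
    np.add.at(alpha, arms, rewards)
    np.add.at(beta, arms, ~rewards)

print("posterior means:", alpha / (alpha + beta))
```

The key design choice illustrated here is that randomness comes from the posterior itself: within a batch, the same posterior yields different samples and hence different arms, whereas a deterministic optimistic rule would pick the same arm for the entire batch until new feedback arrives.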