Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 14: Exploration | Stanford Online

In this lecture, Chelsea Finn discusses the exploration problem in reinforcement learning, first in bandit settings and then in applications to robotics and large language models. She begins by illustrating why exploration is hard with examples such as Montezuma's Revenge and the card game Mao, highlighting how difficult it is for RL agents to discover rules and rewards when rewards are sparse. The lecture then covers exploration strategies for bandits, including Upper Confidence Bound (UCB) and posterior sampling, and uses a drug development simulation to demonstrate them. Finn next addresses exploration in larger MDPs, such as those arising in robotics and language models, and advocates using demonstrations, base models, and shaped rewards to guide exploration. Finally, she introduces DREAM, a meta-RL approach that decouples exploration from execution to achieve an optimal exploration-exploitation trade-off, with an application to finding bugs in student computer programs.
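The two bandit strategies named above can be illustrated with a minimal sketch. It is not the lecture's drug development simulation: the three success rates, the horizon, and all function names are made-up assumptions, and the UCB bonus uses the textbook UCB1 form sqrt(2 ln t / n), which may differ from the constant used in the lecture. Posterior sampling is shown with a uniform Beta(1, 1) prior on each arm's success rate.

```python
import math
import random

def run_bandit(true_means, horizon, choose_arm, seed=0):
    """Play a Bernoulli bandit for `horizon` rounds; return (pull counts, pseudo-regret)."""
    rng = random.Random(seed)
    k = len(true_means)
    pulls = [0] * k       # number of times each arm was tried
    successes = [0] * k   # number of observed successes per arm
    for t in range(1, horizon + 1):
        arm = choose_arm(t, pulls, successes, rng)
        reward = 1 if rng.random() < true_means[arm] else 0
        pulls[arm] += 1
        successes[arm] += reward
    best = max(true_means)
    # Pseudo-regret: expected reward lost relative to always pulling the best arm.
    regret = sum((best - m) * n for m, n in zip(true_means, pulls))
    return pulls, regret

def ucb1(t, pulls, successes, rng):
    # Try every arm once, then pick the arm with the highest mean-plus-bonus score.
    for arm, n in enumerate(pulls):
        if n == 0:
            return arm
    scores = [s / n + math.sqrt(2 * math.log(t) / n)
              for s, n in zip(successes, pulls)]
    return max(range(len(pulls)), key=scores.__getitem__)

def posterior_sampling(t, pulls, successes, rng):
    # Beta(1, 1) prior: sample a plausible success rate per arm, then act greedily on the samples.
    samples = [rng.betavariate(1 + s, 1 + n - s)
               for s, n in zip(successes, pulls)]
    return max(range(len(pulls)), key=samples.__getitem__)

# Hypothetical success rates standing in for three candidate treatments.
TRUE_MEANS = [0.2, 0.5, 0.8]
HORIZON = 5000

ucb_pulls, ucb_regret = run_bandit(TRUE_MEANS, HORIZON, ucb1)
ts_pulls, ts_regret = run_bandit(TRUE_MEANS, HORIZON, posterior_sampling)
print("UCB1 pulls and regret:", ucb_pulls, round(ucb_regret, 1))
print("Posterior sampling pulls and regret:", ts_pulls, round(ts_regret, 1))
```

Both strategies should concentrate their pulls on the best arm over time. Under these assumptions, pulling arms uniformly at random would accumulate a pseudo-regret of about 1500 over the same horizon, which gives a baseline for reading the printed numbers.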

Outlines

00:05 Introduction to Exploration in Reinforcement Learning

07:30 Bandits and Regret: Measuring Exploration Performance

20:00 Exploration Strategies in Bandits and a Drug Development Simulation

32:00 Exploration in Robotics, Language Models, and Meta-RL

45:50 Variational Information Bottleneck and Exploration Policy Training

1:03:57 Decoupled Reward-Free Exploration and Execution in Meta-RL (DREAM) and Application to Bug Finding