The podcast discusses offline reinforcement learning, focusing on challenges such as distribution shift and value overestimation, and introduces algorithms like Implicit Q-Learning (IQL) and Conservative Q-Learning (CQL) that address them. It then turns to reward learning: methods for specifying rewards, learning from examples of goal states, and training reward functions from human preferences, including an application to language models. The discussion covers classifier-based rewards, adversarial training, and techniques that improve the learning process, such as dataset balancing and regularization, and also touches on the potential of AI feedback and unsupervised reinforcement learning.
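As background on how CQL addresses overestimation, here is a standard form of its objective (following Kumar et al., 2020; the notation is ours, not taken from the episode). The Q-function is pushed down on actions sampled from a policy \(\mu\) and pushed up on actions actually in the dataset \(\mathcal{D}\), alongside the usual Bellman error:

\[
\min_{Q}\;\; \alpha \Big( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(\cdot \mid s)}\big[Q(s,a)\big] \;-\; \mathbb{E}_{(s,a) \sim \mathcal{D}}\big[Q(s,a)\big] \Big) \;+\; \tfrac{1}{2}\, \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\Big[ \big( Q(s,a) - \hat{\mathcal{B}}^{\pi}\hat{Q}(s,a) \big)^{2} \Big]
\]

The first term prevents out-of-distribution actions from acquiring inflated values, which is the overestimation problem the episode highlights.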
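For the preference-based reward learning mentioned above, a minimal sketch of the standard Bradley-Terry objective used to fit a reward model from pairwise human preferences (as in RLHF for language models). This is illustrative code, not from the episode; the model architecture and feature inputs are hypothetical:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a feature vector for a trajectory segment
    (or language-model response) to a scalar reward. Hypothetical architecture."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats).squeeze(-1)  # (batch,) scalar rewards

def preference_loss(rm: RewardModel,
                    feats_preferred: torch.Tensor,
                    feats_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: the human-preferred segment should score
    higher than the rejected one. Loss is -log sigmoid(r_pref - r_rej)."""
    r_pref = rm(feats_preferred)
    r_rej = rm(feats_rejected)
    return -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()

# Usage sketch on random stand-in data (feat_dim and batch size arbitrary)
rm = RewardModel(feat_dim=16)
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
preferred, rejected = torch.randn(32, 16), torch.randn(32, 16)
loss = preference_loss(rm, preferred, rejected)
opt.zero_grad(); loss.backward(); opt.step()
```

Once trained this way, the reward model replaces a hand-specified reward, which is what makes the approach usable for tasks like language-model fine-tuning where rewards are hard to write down.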