The lecture covers reinforcement learning (RL) for language models, focusing on policy gradient methods such as GRPO. It explains how states, actions, and rewards are defined in this setting, with an emphasis on verifiable outcome rewards. The discussion works through the policy gradient theorem, the naive policy gradient, and the high noise and variance it suffers from. Baselines are then introduced as a way to reduce variance, illustrated with examples, along with the choice of optimal baseline and its connection to advantage functions. Finally, a simple number-sorting task is used to demonstrate a GRPO implementation, with code snippets for the model definition, reward functions, and loss computation, and notes on the importance of freezing parameters and on different strategies for computing deltas and KL penalties.