Chelsea Finn delivers a lecture on reinforcement learning, recapping policy gradients, value functions, and actor-critic methods. The lecture then turns to algorithms that reuse data beyond a single on-policy rollout, PPO and SAC, and shows how importance weights and surrogate objectives let a policy be improved using data collected by an earlier policy. She explains how over-optimizing the surrogate objective on reused data leads to unstable learning, and introduces techniques such as KL constraints and clipping to keep updates close to the data-collecting policy. The lecture covers PPO's clipping mechanism, surrogate objective, and advantage estimation, then moves to SAC, which stores past transitions in a replay buffer, and describes the modifications needed to make actor-critic algorithms off-policy, chiefly fitting a Q function instead of a state-value function. The lecture concludes with a comparison of PPO and SAC, highlighting their trade-offs in data efficiency and stability, and with examples of reinforcement learning applications in robotics and language models.
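As a concrete illustration of the clipping mechanism and surrogate objective mentioned above, here is a minimal sketch of PPO's clipped surrogate loss. The function and argument names are illustrative, not taken from the lecture, and advantages are assumed to be precomputed (e.g., with GAE):

```python
import numpy as np

def ppo_clipped_surrogate(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    # Importance weight r = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    ratios = np.exp(log_probs_new - log_probs_old)
    # Unclipped surrogate: importance-weighted advantage.
    unclipped = ratios * advantages
    # Clipped surrogate: the ratio is restricted to [1 - eps, 1 + eps], so a
    # single update cannot move the policy too far from the data-collecting policy.
    clipped = np.clip(ratios, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the element-wise minimum (a pessimistic bound), averaged over samples.
    return np.mean(np.minimum(unclipped, clipped))
```

The off-policy modification the summary attributes to SAC, fitting a Q function from replayed transitions rather than a value function from fresh rollouts, can be sketched the same way. This is a simplified version of SAC's soft Bellman target with hypothetical argument names; `next_q_values` and `next_log_probs` are assumed to be evaluated at a fresh action sampled from the current policy at the next state:

```python
def soft_q_target(rewards, dones, next_q_values, next_log_probs,
                  gamma=0.99, alpha=0.2):
    # Bootstrapped target for the soft Q-function: reward plus the discounted
    # value of the next state under the current policy, minus an entropy
    # penalty weighted by alpha. Because the target needs only
    # (s, a, r, s', done) plus an action sampled from the current policy at s',
    # transitions can be drawn from a replay buffer instead of the latest rollout.
    return rewards + gamma * (1.0 - dones) * (next_q_values - alpha * next_log_probs)
```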