This podcast episode explores reinforcement learning from human feedback (RLHF) and preference training of large language models (LLMs). The episode discusses the main idea behind the paper "Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs," which argues that reinforcement-learning optimization for fine-tuning language models is not as complex as it may seem. Various preference-training methods are mentioned, including PPO, REINFORCE, RLOO, DPO, IPO, RAFT, and KTO. The episode also delves into the bias-variance trade-off in reinforcement learning and the use of Proximal Policy Optimization (PPO) in RL for language models. Surprising results from using REINFORCE instead of PPO in RLHF for language models are presented, along with the motivation and reasoning behind this research direction. The episode closes with the potential impact of the research on preference training at Cohere and the future research directions of the guest, Arash Ahmadian.
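For readers unfamiliar with the REINFORCE-style approach the episode centers on, below is a minimal sketch (not the paper's or Cohere's code) of a sequence-level REINFORCE loss with a simple mean baseline: each sampled completion receives one scalar reward, and the gradient follows the reward-weighted log-probability of the whole sequence. Tensor shapes, the `reinforce_loss` name, and the toy inputs are illustrative assumptions.

```python
import torch

def reinforce_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Sequence-level REINFORCE loss sketch.

    logprobs: (batch, seq_len) per-token log-probs of the sampled completions
              under the current policy.
    rewards:  (batch,) scalar reward per completion (e.g. from a reward model).
    """
    seq_logprob = logprobs.sum(dim=-1)        # log pi(y | x) for each sampled completion
    baseline = rewards.mean()                 # simple baseline to reduce gradient variance
    advantage = rewards - baseline            # centred reward; expectation of the gradient is unchanged
    # Minimising this loss ascends E[(r - b) * grad log pi(y | x)], the REINFORCE estimator.
    return -(advantage.detach() * seq_logprob).mean()

# Toy usage: 4 sampled completions of length 8 with placeholder log-probs and rewards.
logprobs = torch.randn(4, 8, requires_grad=True)   # placeholder; real values come from the policy model
rewards = torch.tensor([0.1, 0.9, 0.4, 0.6])
loss = reinforce_loss(logprobs, rewards)
loss.backward()
```

The baseline here is the batch-mean reward purely for illustration; methods such as RLOO, mentioned in the episode, instead use a leave-one-out baseline computed from the other samples drawn for the same prompt.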