This podcast episode explores reinforcement learning from human feedback (RLHF) and preference training of large language models (LLMs). The episode discusses the main idea behind the paper "Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs," which argues that reinforcement-learning optimization for fine-tuning language models is not as complex as it may seem. Various preference-training methods are mentioned, including PPO, REINFORCE, RLOO, DPO, IPO, RAFT, and KTO. The episode also delves into the bias-variance trade-off in reinforcement learning and the use of Proximal Policy Optimization (PPO) in RL for language models. Surprising results from using REINFORCE instead of PPO in RLHF for language models are presented, along with the motivation and reasoning behind this research direction. The episode closes with the potential impact of the research on preference training at Cohere and the future research directions of the guest, Arash Ahmadian.
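For readers unfamiliar with the REINFORCE-style approach the episode centers on, below is a minimal sketch (not the paper's or Cohere's code) of a sequence-level REINFORCE loss with a simple mean baseline: each sampled completion receives one scalar reward, and the gradient follows the reward-weighted log-probability of the whole sequence. Tensor shapes, the `reinforce_loss` name, and the toy inputs are illustrative assumptions.

```python
import torch

def reinforce_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Sequence-level REINFORCE loss sketch.

    logprobs: (batch, seq_len) per-token log-probs of the sampled completions
              under the current policy.
    rewards:  (batch,) scalar reward per completion (e.g. from a reward model).
    """
    seq_logprob = logprobs.sum(dim=-1)        # log pi(y | x) for each sampled completion
    baseline = rewards.mean()                 # simple baseline to reduce gradient variance
    advantage = rewards - baseline            # centred reward; expectation of the gradient is unchanged
    # Minimising this loss ascends E[(r - b) * grad log pi(y | x)], the REINFORCE estimator.
    return -(advantage.detach() * seq_logprob).mean()

# Toy usage: 4 sampled completions of length 8 with placeholder log-probs and rewards.
logprobs = torch.randn(4, 8, requires_grad=True)   # placeholder; real values come from the policy model
rewards = torch.tensor([0.1, 0.9, 0.4, 0.6])
loss = reinforce_loss(logprobs, rewards)
loss.backward()
```

The baseline here is the batch-mean reward purely for illustration; methods such as RLOO, mentioned in the episode, instead use a leave-one-out baseline computed from the other samples drawn for the same prompt.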