The RL Fine-Tuning Playbook: CoreWeave's Kyle Corbitt on GRPO, Rubrics, Environments, Reward Hacking

Reinforcement learning (RL) can be a better path than supervised fine-tuning (SFT) for optimizing LLM performance because it works within a model's existing weight distributions, which reduces catastrophic forgetting. Where SFT often overrides established pathways, RL optimizes for specific outcomes by reinforcing the critical decision points in a reasoning trace. The GRPO algorithm scales RL by running multiple parallel rollouts for the same prompt and calculating group-relative advantages, so each rollout is scored against the others in its group; it has become an industry standard for improving reasoning capabilities. Reward hacking remains a concern, but iterative rubric development and auxiliary LLM judges mitigate the risk. Compute remains the primary constraint on scaling, though recursive self-improvement loops are already accelerating development. For enterprises, RL-driven fine-tuning offers significant latency and cost advantages, particularly for specialized agentic tasks where off-the-shelf frontier models are too slow or too expensive to deploy at scale.
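To make the "group-relative advantages" in the summary concrete, here is a minimal sketch of that one step, not the episode's or any library's implementation. It assumes each prompt gets a group of rollouts with scalar rewards, and that each reward is normalized by the group's mean and population standard deviation. The function name, the `eps` constant, and the example rewards are all hypothetical, and real implementations differ in how they normalize.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout relative to the other rollouts for the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    The eps term avoids division by zero when every rollout gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts sampled from one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every rollout earns the same reward carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Rollouts that beat their group's average get a positive advantage and are reinforced, while below-average rollouts get a negative one. Because the baseline comes from the group itself, this step needs no separate learned value model.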

Outlines

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

03:00 Reinforcement Learning Versus Supervised Fine-Tuning

14:17 GRPO and the Credit Assignment Problem

29:34 Superhuman Performance and Metacognitive Reasoning

43:36 Global Competition and Recursive Self-Improvement

55:07 The RL Environment Industry and Data Labeling

1:06:38 Physical World RL and Automated Labs

1:13:27 Enterprise Deployment and Reward Hacking