Training AI models for complex multi-step reasoning is held back by the high cost of human-annotated process rewards. GRPO (Group Relative Policy Optimization) addresses this by generating multiple reasoning trajectories and using their final outcomes to retroactively grade intermediate steps, so the model effectively grades itself. However, the method suffers from a "frequency glitch": the algorithm disproportionately punishes frequently visited reasoning paths, which can cause the model to abandon logical branches that may be correct. $\lambda$-GRPO, a modification that normalizes these penalties by dividing them by step frequency, eliminates this bias. The adjustment lets models explore complex decision trees more effectively, yielding significant performance improvements on challenging mathematical benchmarks such as AIME while maintaining computational efficiency. The result highlights the potential for AI to develop sophisticated, self-improving behaviors through subtle mathematical refinements rather than brute-force scaling.
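The frequency bias described above can be illustrated with a toy sketch. It is not the published algorithm: the summary does not give the exact formula, so the code assumes that a step's credit is the sum of the group-relative advantages of every trajectory passing through it, and that the $\lambda$-GRPO-style fix divides that sum by the step's visit count. The function names, the step labels `A`, `D`, `E`, and the rewards are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Standardize outcome rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All trajectories scored the same: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def step_credit(trajectories, rewards, normalize_by_frequency):
    """Assumed credit rule: sum advantages over the trajectories visiting a step.

    With normalize_by_frequency=True the sum is divided by the visit count,
    which is this sketch's stand-in for the frequency normalization.
    """
    advantages = group_relative_advantages(rewards)
    total = defaultdict(float)
    visits = defaultdict(int)
    for steps, adv in zip(trajectories, advantages):
        for step in steps:
            total[step] += adv
            visits[step] += 1
    if normalize_by_frequency:
        return {s: total[s] / visits[s] for s in total}
    return dict(total)


# Hypothetical group of 8 sampled trajectories (one step each, for brevity).
# Step A is popular and sometimes right; step D is rare and never right.
trajectories = [("A",)] * 4 + [("D",)] + [("E",)] * 3
rewards = [1, 0, 0, 0, 0, 1, 1, 1]

raw = step_credit(trajectories, rewards, normalize_by_frequency=False)
normalized = step_credit(trajectories, rewards, normalize_by_frequency=True)

for step in ("A", "D", "E"):
    print(step, raw[step], normalized[step])
```

In this toy group the unnormalized sum penalizes the frequently visited step `A` twice as hard as `D`, even though `A` succeeded once and `D` never did. Dividing by visit count reverses that ordering. A real implementation would apply the weighting to token-level policy-gradient terms rather than to a dictionary of step labels.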
