The lecture explores Large Language Model (LLM) reasoning, contrasting "vanilla" LLMs with reasoning models designed to solve complex problems through multi-step processes. It highlights weaknesses of vanilla LLMs, including limited reasoning ability, static knowledge, and difficulty of evaluation. Chain of Thought is introduced as a technique that improves reasoning by having the LLM decompose a problem into tractable intermediate steps (a prompt-level illustration follows below). The lecture also covers benchmarks for assessing coding and math ability, such as HumanEval, Codeforces, and AIME, and introduces the pass@k metric, which estimates the probability that at least one of k sampled attempts solves a problem, while also considering how sampling temperature affects solution diversity.

GRPO (Group Relative Policy Optimization) is presented as a reinforcement learning algorithm used to train reasoning models; unlike PPO, it estimates advantages from group-relative rewards rather than from a learned value function. The lecture concludes with DeepSeek's R1 training pipeline and with knowledge distillation techniques for transferring reasoning ability to smaller models.
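As an illustration of the Chain of Thought idea mentioned above, the simplest zero-shot variant just instructs the model to reason step by step before committing to an answer. The helper below is hypothetical and only sketches the prompt shape, not any particular API:

```python
def build_prompt(question: str, chain_of_thought: bool = True) -> str:
    """Hypothetical helper contrasting a direct prompt with a zero-shot
    Chain of Thought prompt that asks the model to decompose the problem
    into intermediate steps before stating a final answer."""
    if chain_of_thought:
        return (
            f"{question}\n"
            "Let's think step by step, then state the final answer."
        )
    return f"{question}\nAnswer:"
```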
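The pass@k numbers reported on benchmarks like HumanEval are commonly computed with the unbiased estimator from the HumanEval paper: generate n samples per problem at some temperature, count the c samples that pass the tests, and estimate the probability that a random subset of k samples contains at least one passer. A minimal sketch (the function name is illustrative):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k), i.e. the
    probability that k samples drawn without replacement from n attempts
    (of which c passed) include at least one passing attempt.
    Computed as a stable running product rather than with factorials."""
    if n - c < k:
        return 1.0  # not enough failing samples to fill all k draws
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: 200 samples per problem, 37 passed, evaluating a 1-attempt budget.
print(pass_at_k(n=200, c=37, k=1))
```

Higher temperature increases the diversity of the n samples, which tends to raise pass@k for large k at the cost of per-sample accuracy.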
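GRPO's key simplification over PPO is the baseline: instead of training a value network (critic), it samples a group of completions for the same prompt and normalizes each completion's reward against the group's mean and standard deviation to obtain per-completion advantages. A minimal sketch of that advantage computation, assuming scalar rewards per completion (function name and epsilon are illustrative):

```python
import statistics

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages as used in GRPO: each completion in a
    group sampled for the same prompt is scored against the group's own
    reward statistics, removing the need for a learned value function."""
    mu = statistics.fmean(rewards)       # group mean reward
    sigma = statistics.pstdev(rewards)   # group standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: rewards for a group of 4 completions of one math problem.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

Completions that beat their group's average get positive advantages and are reinforced; below-average ones are penalized, all without a critic to train or store.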