[QA] Why Warmup the Learning Rate? Underlying Mechanisms and Improvements | Arxiv Papers | Podwise