Enhancing Runtime Reliability in LLM Training via Fine-Grained Observability by Lei Zhang | @Scale | Podwise