Enhancing Runtime Reliability in LLM Training via Fine-Grained Observability - Live from SCC | @Scale | Podwise