12 May 2026
9m

This is Fine! With Colette Alexander and Clint Byrum

Google SRE Prodcast

Mean Time To Repair (MTTR) remains a problematic metric in site reliability engineering because it is statistically unreliable and fails to capture actual business impact. Incidents vary widely in both duration and outcome, so time-based averages are misleading for operational decision-making. Instead of chasing these metrics, engineering teams should prioritize resilience engineering, a discipline centered on understanding complex systems rather than merely reducing incident duration. Research such as the Monte Carlo simulations by Štěpán Davidovič shows that realistic improvements in incident response do not move MTTR enough to be distinguished from noise. The Resilience in Software Foundation supports this shift by fostering community-driven learning, offering resources such as paper discussions and training to help practitioners move beyond legacy metrics toward more effective, context-aware approaches to system reliability.
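
The statistical argument can be illustrated with a small Monte Carlo sketch. This is not the simulation from the research mentioned above. It assumes, purely for illustration, that incident durations follow a log-normal distribution, and the function name and every parameter (`sigma`, `incidents`, `threshold`) are hypothetical. Both periods are drawn from the same distribution, so nothing has actually improved, yet the sketch counts how often MTTR still appears to drop by at least 10%.

```python
import random
import statistics


def simulate_false_improvements(trials=2000, incidents=50, sigma=1.5,
                                threshold=0.10, seed=0):
    """Fraction of trials in which MTTR appears to improve by at least
    `threshold`, even though both periods share one duration distribution.

    Assumption: durations are log-normal with shape `sigma` (illustrative only).
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    hits = 0
    for _ in range(trials):
        # Two periods with identical underlying incident behaviour.
        before = [rng.lognormvariate(0.0, sigma) for _ in range(incidents)]
        after = [rng.lognormvariate(0.0, sigma) for _ in range(incidents)]
        mttr_before = statistics.fmean(before)
        mttr_after = statistics.fmean(after)
        # Count an apparent improvement that is pure sampling noise.
        if mttr_after <= (1.0 - threshold) * mttr_before:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    fraction = simulate_false_improvements()
    print(f"Apparent >=10% MTTR improvement with no real change: {fraction:.0%}")
```

Under these assumed parameters, a sizeable share of trials shows an "improvement" produced by chance alone, which is the sense in which a change in MTTR is hard to tell apart from noise. Different distributions or incident counts would change the exact fraction.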
