In this episode of the Google SRE podcast, host Steve McGhee, co-host Matt Siegler, and guests Theo Klein and Jeffrey Snover discuss System-Theoretic Process Analysis (STPA), a novel approach to analyzing complex systems and preventing outages that treats them as control problems rather than failure problems. They explain how STPA models control and feedback loops to identify potential design flaws and unacceptable losses, using the example of a road disruption system that failed to add road closures for a parade. The conversation covers how STPA differs from other methods such as TLA+, why defining system goals matters, and how Google is adapting STPA for commercial software development. The speakers emphasize that STPA is a human-driven process that facilitates discussion and helps identify flaws even when every system component is working reliably. They also distinguish reliability from system safety, and recommend resources such as the MIT STAMP conference and Google's blog posts for further learning.