26 Feb 2026
31m

The One With Damion Yates and Building AI systems

Google SRE Prodcast

This episode explores the evolution of reliability engineering within Google DeepMind, focusing on the challenges and adaptations required in a research-oriented environment. Damion Yates, based in London at Google DeepMind, recounts his initial role in introducing reliability concepts to a team primarily focused on research, where the immediate impact of downtime was perceived as minimal. He details the shift from reactive problem-solving to proactive monitoring and training initiatives designed to instill a reliability mindset among researchers. Yates shares anecdotes about teaching researchers the importance of retries and checkpointing, and describes the difficulty of prioritizing SRE support when critical projects like Gemini compete with the needs of various research teams. The discussion highlights the importance of understanding the cost of researcher downtime and tailoring reliability strategies to the specific needs of a research organization.

Outlines

Part 1: Introduction, Background

Part 2: Implementation, Strategy

Part 3: Evolution, Operations

Part 4: Conclusion, Productivity
