
The podcast explores the evolution of reliability engineering within Google DeepMind, focusing on the unique challenges and adaptations required in a research-oriented environment. Damion Yates, based in London at Google DeepMind, recounts his initial role in introducing reliability concepts to a team primarily focused on research, where the immediate impact of downtime was perceived as minimal. He details the shift from reactive problem-solving to proactive monitoring and training initiatives designed to instill a reliability mindset among researchers. Yates shares anecdotes about teaching researchers the importance of retries and checkpointing, and describes the difficulty of prioritizing SRE support when critical projects like Gemini compete with the needs of various research teams. The discussion highlights the importance of understanding the cost of researcher downtime and tailoring reliability strategies to the specific needs of a research organization.