The One With Damion Yates and Building AI systems

The podcast explores the evolution of reliability engineering within Google DeepMind, focusing on the challenges and adaptations required in a research-oriented environment. Damion Yates, who is based in London at Google DeepMind, recounts his initial role introducing reliability concepts to a team focused primarily on research, where downtime was perceived as having little immediate impact. He describes the shift from reactive problem-solving to proactive monitoring and to training initiatives designed to instill a reliability mindset among researchers. Yates shares anecdotes about teaching researchers the importance of retries and checkpointing, and discusses the difficulty of prioritizing SRE support when critical projects like Gemini compete with the needs of various research teams. The discussion highlights the importance of understanding the cost of researcher downtime and of tailoring reliability strategies to the specific needs of a research organization.

Outlines

Part 1: Introduction, Background

Part 2: Implementation, Strategy

Part 3: Evolution, Operations

Part 4: Conclusion, Productivity

Google SRE Prodcast

Part 1: Introduction, Background

00:05 Introduction to SRE Trends and a Lighthearted Time Zone Debate

02:11 Damion Yates' Journey into DeepMind's Reliability Engineering

Part 2: Implementation, Strategy

06:26 Implementing Reliability Concepts in a Research-Focused Organization

10:21 Strategies for Influencing Reliability Practices in Research Environments

14:58 Checkpointing Progress, Assuming Failure, and the Perils of Luck

Part 3: Evolution, Operations

17:51 Evolving Roles of Reliability Engineers at DeepMind: From Training to Security

23:36 Prioritizing SRE Support: Balancing Criticality, Cost, and Engineering Effort

Part 4: Conclusion, Productivity

28:35 Balancing Research Needs with Reliability: The Value of Researcher Productivity