The One With Data Centers and Peter Pellerzi

In this episode of Google's site reliability engineering podcast, host Steve McGhee and co-host Matt Siegler interview Pete Pellerzi, a distinguished engineer on Google's construction team, about the physical infrastructure of Google's data centers. Pete discusses the scale of data center operations, emphasizing Google's community-oriented approach to building campuses and the importance of planning for multiple buildings. The conversation covers incident management, highlighting Google's cooperative adaptation strategy during failures, such as the countrywide power outage in Chile. Pete shares insights on building resilience in smaller organizations by finding trusted partners and developing business continuity plans. The discussion then shifts to next-generation data center technology, focusing on the increasing density of chips and the adoption of liquid cooling, as well as the use of AI and machine learning to optimize data center cooling plants. The episode concludes with a discussion of MTTR and MTBF metrics, emphasizing the importance of availability and Google's unique position in leveraging its large installed base to work closely with manufacturers and implement fault-tolerant designs.

Outlines

Google SRE Prodcast

Introduction to Site Reliability Engineering with a Focus on Physical Infrastructure

Incident Response and Community Support in Data Center Operations

Real-World Testing and Communication Between Infrastructure and Software Teams

Next Generation Data Center Tech: Density and Liquid Cooling

AI Optimization and the Scale of Data Center Cooling

Availability Metrics, Fault Tolerance, and Vendor Collaboration

Conclusion and Resources

00:05 Introduction to Site Reliability Engineering with a Focus on Physical Infrastructure

07:32 Incident Response and Community Support in Data Center Operations

13:40 Real-World Testing and Communication Between Infrastructure and Software Teams

19:20 Next Generation Data Center Tech: Density and Liquid Cooling

25:13 AI Optimization and the Scale of Data Center Cooling

30:39 Availability Metrics, Fault Tolerance, and Vendor Collaboration

35:00 Conclusion and Resources