YouTube · 22 May 2024
19m

Scaling Meta’s Infra with GenAI: Journey to faster and smarter Incident Response

@Scale

This podcast episode explores how Meta is using AI to improve its incident response process. The speakers discuss the challenges of managing incidents across infrastructure at Meta's scale and the need for a more streamlined, automated approach. Meta uses recent advances in LLMs to onboard responders efficiently and to provide summaries generated in real time. The speakers also address the difficulty of investigating incidents and pinpointing the root cause, which they approach with heuristics and data analysis. They introduce a fine-tuned Llama 2 model for incident root cause analysis. The episode emphasizes the potential of AI in incident management, with a focus on transparency, explainability, and actionability, while acknowledging that incorporating AI into incident management processes is still in its early stages.
