Your Agent Failed in Prod. Good Luck Reproducing It. - Tisha Chawla & Susheem Koul, Microsoft | AI Engineer

Debugging AI agents in production means shifting the goal from bitwise determinism to replayability. LLM outputs are non-deterministic because of hardware-level factors, request batching, and mixture-of-experts routing, so trying to force identical outputs is futile. Instead, developers should record the full envelope of an agentic workflow, capturing inputs and outputs at every logical boundary. The Chronicle framework demonstrates this approach: boundary annotations freeze agent state so that engineers can isolate a failure, stub specific nodes, and verify a fix with deterministic tests. By treating recorded traces as test cases, teams can debug complex agent behavior without relying on the model to produce identical results, which improves reliability in production.
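
The record-then-replay idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Chronicle's actual API: the `Recorder` class, its `boundary` decorator, and the toy `plan`, `lookup`, and `run_agent` functions are all hypothetical names invented here, and a random choice stands in for the non-deterministic model call.

```python
import functools
import json
import random


class Recorder:
    """Hypothetical recorder: captures calls at annotated boundaries, then replays them."""

    def __init__(self):
        self.mode = "record"
        self.trace = []      # ordered list of boundary events
        self.live = set()    # boundaries that run real code during replay
        self._cursor = 0

    def replay(self, trace, live=()):
        """Switch to replay mode; every boundary not named in `live` is stubbed."""
        self.mode = "replay"
        self.trace = trace
        self.live = set(live)
        self._cursor = 0

    def boundary(self, name):
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                if self.mode == "replay":
                    if self._cursor >= len(self.trace):
                        raise RuntimeError(f"no recorded event left for {name!r}")
                    event = self.trace[self._cursor]
                    self._cursor += 1
                    if event["name"] != name:
                        raise RuntimeError(f"expected {event['name']!r}, got {name!r}")
                    if name not in self.live:
                        return event["output"]   # stub: return the frozen output
                    return fn(*args, **kwargs)   # live node: run the real code
                output = fn(*args, **kwargs)
                self.trace.append(
                    {"name": name, "args": list(args), "kwargs": kwargs, "output": output}
                )
                return output
            return wrapper
        return decorator


rec = Recorder()


@rec.boundary("llm.plan")
def plan(question):
    # Stand-in for a model call; the random choice mimics non-determinism.
    return random.choice(["lookup_order", "lookup_refund"])


@rec.boundary("tool.lookup")
def lookup(action):
    # Stand-in for a tool call.
    return {"lookup_order": "shipped", "lookup_refund": "pending"}[action]


def run_agent(question):
    action = plan(question)
    result = lookup(action)
    return f"{action} -> {result}"


# Record one "production" run and persist the trace as JSON.
first = run_agent("where is my order?")
saved = json.dumps(rec.trace)

# Replay the saved trace: the model call is stubbed, so the run is repeatable.
rec.replay(json.loads(saved))
replayed = run_agent("where is my order?")
print(replayed == first)
```

Passing `live={"tool.lookup"}` to `replay` would keep the model call frozen while running the tool code for real, which is one way to re-test a fix in a single node against a recorded failure.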

Outlines

The Critical Need for Reproducibility in Production AI Agents

Technical Drivers of Non-Determinism in Large Language Models

Building Replayability and Automated Testing with Chronicle

Distinguishing Deterministic and Behavioral Testing for AI Agents

00:00 The Critical Need for Reproducibility in Production AI Agents

03:32 Technical Drivers of Non-Determinism in Large Language Models

07:15 Building Replayability and Automated Testing with Chronicle

12:08 Distinguishing Deterministic and Behavioral Testing for AI Agents