10 Apr 2026
1h 5m

How We Cut LLM Latency 70% With TensorRT in Production


MLOps.community

The discussion centers on building and optimizing AI infrastructure, particularly in the HR tech space. It highlights the need to balance cost, performance, latency, throughput, and accuracy when deploying AI solutions. The guest describes their transition into AI leadership, emphasizing continuous learning and adaptation. They share strategies for managing AI costs, including scheduled and dynamic scaling of GPUs based on traffic patterns. The conversation also covers techniques for reducing cold start times, such as using faster storage and embedding models in container images, and using tools like TensorRT-LLM to cut latency. The guest also touches on the shift in customer attitudes toward AI, from initial skepticism to actively seeking AI integration, and on the challenges of ensuring responsible AI.

Outline

Part 1: Transition and Leadership

Part 2: Enterprise Production and Infrastructure

Part 3: Technical Optimization and Performance

Part 4: Strategy, ROI, and Product Frameworks

Part 5: Success Metrics and Quality Control

Part 6: Internal Engineering and Culture

Part 7: Costs, Ethics, and Future Outlook
