How We Cut LLM Latency 70% With TensorRT in Production

This episode covers building and optimizing AI infrastructure in the HR tech space, and the trade-offs among cost, performance, latency, throughput, and accuracy when deploying AI in production. The guest describes their transition into AI leadership, emphasizing continuous learning and adaptation. They share strategies for managing AI costs, including scheduled and dynamic scaling of GPUs based on traffic patterns. The conversation also covers techniques for reducing cold start times, such as using faster storage and embedding models in container images, and using tools like TensorRT-LLM to cut latency. Finally, the guest discusses the shift in customer attitudes toward AI, from initial skepticism to actively seeking AI integration, and the challenges of ensuring responsible AI.

Outline

Part 1: Transition and Leadership

Part 2: Enterprise Production and Infrastructure

Part 3: Technical Optimization and Performance

Part 4: Strategy, ROI, and Product Frameworks

Part 5: Success Metrics and Quality Control

Part 6: Internal Engineering and Culture

Part 7: Costs, Ethics, and Future Outlook

MLOps.community

Part 1: Transition and Leadership

00:00 Transitioning to AI: From Engineering Leadership to Passionate AI Learning

Part 2: Enterprise Production and Infrastructure

01:17 Building AI in Enterprise Production: Scaling, Self-Hosting, and the AI Iceberg

03:34 Managing AI Costs: Token Premiums, Budget Control, and GPU Scaling Strategies

06:09 Optimizing GPU Scaling: Dynamic Adaptation, Cold Start Challenges, and Managed Services

08:13 Reducing Cold Start Time: Faster Storage, Embedded Models, and GPU Provider Considerations

Part 3: Technical Optimization and Performance

10:15 Optimizing Latency and Throughput: TensorRT-LLM, Batching, and KV Cache Strategies

13:57 Cost Savings Through GPU Upgrades, Throughput Optimization, and Scheduled Scaling

17:17 Optimizing AI Pyramid Edges: Performance, Latency, Throughput, Quality, and Cost

Part 4: Strategy, ROI, and Product Frameworks

20:31 Model Selection, Translation Challenges, and Proving ROI for AI Initiatives

24:41 The AI Flywheel: Planning, Building, Running, and Optimizing for Business Impact

27:22 Shifting Customer Perceptions: From AI Skepticism to Demand for AI Roadmaps

30:41 From Vertical to Horizontal AI: Building a Platform for Multi-Step AI Workflows

34:41 Pre-processing Verticals, Batch Jobs, and the AI Flywheel: Building and Running

Part 5: Success Metrics and Quality Control

37:58 Measuring AI Success: Usage Metrics, Customer Feedback, and Responsible AI Pillars

41:30 AI Feature Pitfalls: Balancing Polish, Guardrails, and Human-in-the-Loop

Part 6: Internal Engineering and Culture

45:35 AI Platform Approach: Abstracting Use Cases and Addressing Engineering Team Perceptions

49:58 The AI Engineering Lab: Exploring AI, Reviewing Code, and Defining a New SDLC

53:25 Leveraging Memory, Instruction Skills, and Senior Engineer Involvement for AI Adoption

56:45 Measuring AI Impact on Engineering: Surveys, Distributed Contribution, and Future Directions

Part 7: Costs, Ethics, and Future Outlook

1:00:40 AI Coding Costs, Open Source Opportunities, and Data Sensitivity Considerations

1:04:11 Model Training Data Bias: Customer Concerns and the Future of AI Competition