SE Radio 703: Sahaj Garg on Low Latency AI

This episode explores the challenges and trade-offs of achieving low latency in AI applications. Sahaj Garg, co-founder and CTO of Wispr.ai, defines latency in terms of user experience: the time from when a user initiates an action to when they receive a response. The discussion covers how sensitive people are to delay, down to differences of milliseconds, and why teams should optimize tail latency (P99) rather than only the average. A key challenge is balancing latency against throughput, especially under GPU resource constraints. Quantization and distillation are discussed as ways to shrink models and speed up inference, and Garg emphasizes moving non-critical operations off the critical path.
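To make the point about tail latency concrete, here is a minimal sketch of how an average can hide a slow tail. The latency numbers are invented for illustration and do not come from the episode; the `percentile` helper is a hypothetical nearest-rank implementation, not something the guest describes.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100.

    Returns the smallest sample such that at least pct% of all
    samples are less than or equal to it.
    """
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical workload: 980 requests take 100 ms, 20 hit a slow path at 1500 ms.
latencies_ms = [100.0] * 980 + [1500.0] * 20

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

In this made-up sample the mean stays close to the typical request while the P99 lands on the slow path, which is why a dashboard that tracks only the average can look healthy while some users wait more than ten times longer.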

Outlines

Part 1: Defining Latency and User Perception

Part 2: AI vs. Traditional Systems

Part 3: Optimization Strategies and Trade-offs

Part 4: Engineering Techniques for Performance

Part 5: Specialized Use Cases and Reliability

Part 6: Scaling and Infrastructure Discipline

Software Engineering Radio - the podcast for professional software developers

Part 1: Defining Latency and User Perception

00:00 Defining Latency: User Experience as the Key Metric in Application Performance

02:29 Human Sensitivity to Latency: Milliseconds Matter in User Perception

05:02 Diagnosing P99 Latency: Network Stochasticity and Infrastructure Challenges

Part 2: AI vs. Traditional Systems

07:53 Latency and Scale: The Interplay in System Design and AI Applications

11:38 AI Recommendation Systems: Balancing Speed and Relevance in User Experience

13:52 User Tolerance of Latency: Value and Expectations in LLM Applications

16:20 Latency Differences: AI Applications vs. Traditional Web Applications

19:02 AI as Streaming: Computation vs. Data Intensity in Latency Management

Part 3: Optimization Strategies and Trade-offs

20:22 Optimizing LLM Responses: Balancing User Experience and Application Goals

23:53 Condensing Spoken Language: Balancing Tone and Intelligibility in Voice Dictation

25:54 Latency vs. Throughput: A Key Trade-off in AI Application Engineering

27:39 Accuracy vs. Latency: Model Size and Routing Strategies in AI Applications

29:55 Resource Trade-offs: Balancing Cost and Performance in AI Applications

Part 4: Engineering Techniques for Performance

32:32 Latency Budgets: Setting and Allocating Time Constraints in AI Systems

34:26 Speculative Decoding: Reducing LLM Latency in Voice Dictation Applications

37:32 Latency in Enterprise vs. Consumer Software: Feature Velocity vs. Performance

39:05 Quantization: Accelerating Workloads with Integer Arithmetic

41:18 Distillation: Training Small Models with Large Model Expertise

Part 5: Specialized Use Cases and Reliability

43:35 Domain-Specific Models: Use Cases for Distillation in LLMs

45:18 Offline Applications: Privacy and Network Independence with Distillation

46:48 Managing Randomness: Safeguards and Heuristics for AI Hallucinations

Part 6: Scaling and Infrastructure Discipline

48:32 Latency Degradation at Scale: GPU Load and Request Batching

50:10 Managing Latency: Removing Non-Critical Operations from the Critical Path

53:54 Resources and Contact Information