Breaking the Memory Wall in the Age of Inference

This episode surveys the AI hardware landscape, focusing on inference and the role of memory. Sid Sheth, founder and CEO of D-Matrix, discusses the limitations of SRAM and HBM for cloud inference and explains how D-Matrix uses digital in-memory compute (DIMC) to address them. DIMC integrates compute and memory to reduce latency and improve efficiency, especially during the decode phase of generative AI models. The conversation also covers the trade-offs between latency and throughput in hardware design, the importance of the software stack, and D-Matrix's collaborative approach of working with ecosystem partners rather than building its own cloud.

Outline

Part 1: Introduction, Background

Part 2: Architecture, Memory Strategy

Part 3: DIMC Technology, Performance

Part 4: Software, Ecosystem, Users

Part 5: Future Trends, Scaling

The Data Exchange with Ben Lorica

Part 1: Introduction, Background

00:03 Introduction to D-Matrix and the Focus on AI Inference Hardware

00:24 The Importance of Chip Building Experience and D-Matrix's Focus on Cloud Inference

Part 2: Architecture, Memory Strategy

04:06 D-Matrix's Early Vision of Inference Computing and Focus on Transformer Acceleration

07:58 SRAM-Based Inference and the Growing Need for External Memory

10:07 Generative Transformers and the Pre-fill and Decode Stages

13:30 HBM's Limitations for Inference: Cost, Energy, and Speed

Part 3: DIMC Technology, Performance

18:49 Digital In-Memory Compute (DIMC): Integrating Compute and Memory

23:36 DIMC Architecture, Model Size, and Trade-offs Between Latency and Throughput

Part 4: Software, Ecosystem, Users

27:08 Software Stack and Portability of Models to D-Matrix Hardware

30:13 D-Matrix Users: Hyperscalers, NeoClouds, and Enterprises

33:57 D-Matrix's Collaborative Approach and Ecosystem Integration

Part 5: Future Trends, Scaling

36:56 The Future of Inference: Agentic AI, Multimodality, and Interactive Video Generation

39:11 High Bandwidth Flash and the Memory Wall Problem

41:03 Innovations in Interconnects and Topologies for Scaling Compute

43:41 Open Standards and Interoperability in Hardware