The podcast explores the evolving landscape of AI inference, particularly the critical role of context memory and storage solutions. Val Bercovici, Chief AI Officer at Weka, discusses how the increasing demand for context in AI agents necessitates innovative memory tiering strategies. He highlights the shift from prompt engineering to context engineering, emphasizing the need for high-speed, low-latency storage to manage the rapidly growing key-value (KV) cache. The conversation covers the memory hierarchy from HBM to DRAM to NVMe, and how Weka's NeuralMesh technology optimizes performance across these tiers. Bercovici also introduces concepts such as high-bandwidth flash, Axon for utilizing local SSDs in GPU racks, and the Augmented Memory Grid for network-based memory scaling, all aimed at improving tokenomics and enabling positive unit economics in AI inference.
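To make the tiering idea concrete, here is a minimal Python sketch of a toy KV-cache manager that spills least-recently-used entries from "HBM" to "DRAM" to "NVMe" and promotes them back on access. This is purely illustrative of the memory hierarchy discussed in the episode; the class name `TieredKVCache`, the tier capacities, and the LRU spill policy are assumptions for clarity, not Weka's implementation or API.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy three-tier KV cache: evicts least-recently-used entries downward."""

    def __init__(self, hbm_capacity=2, dram_capacity=4, nvme_capacity=8):
        # Each tier maps key -> cached KV block; OrderedDict tracks recency.
        self.tiers = OrderedDict([
            ("HBM", (OrderedDict(), hbm_capacity)),
            ("DRAM", (OrderedDict(), dram_capacity)),
            ("NVMe", (OrderedDict(), nvme_capacity)),
        ])

    def put(self, key, value):
        """Insert into the fastest tier, cascading evictions to slower tiers."""
        self._insert("HBM", key, value)

    def get(self, key):
        """Look up a key, promoting hits from slower tiers back into HBM."""
        for tier_name, (store, _) in self.tiers.items():
            if key in store:
                value = store.pop(key)
                # Promote hot context back to the fast tier on access.
                self._insert("HBM", key, value)
                return tier_name, value
        return None, None  # cache miss: prefill must recompute this context

    def _insert(self, tier_name, key, value):
        store, capacity = self.tiers[tier_name]
        store[key] = value
        store.move_to_end(key)
        if len(store) > capacity:
            evicted_key, evicted_value = store.popitem(last=False)  # LRU entry
            next_tier = self._next_tier(tier_name)
            if next_tier is not None:
                self._insert(next_tier, evicted_key, evicted_value)
            # If the slowest tier is also full, the block is dropped and
            # would have to be recomputed during a later prefill.

    def _next_tier(self, tier_name):
        names = list(self.tiers)
        idx = names.index(tier_name)
        return names[idx + 1] if idx + 1 < len(names) else None


if __name__ == "__main__":
    cache = TieredKVCache()
    for turn in range(8):
        cache.put(f"session-{turn}", f"kv-block-{turn}")
    print(cache.get("session-0"))  # old context: served from a slow tier
    print(cache.get("session-7"))  # recent context stays in the fast tier
```

The design choice the sketch highlights is the one Bercovici describes: rather than recomputing context on every request, cold KV blocks are parked on cheaper, slower media and pulled back up the hierarchy when an agent's session becomes active again.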