Lecture 10 focuses on inference in the context of language models, distinguishing it from training by highlighting its memory-bound nature and the challenges of generating tokens sequentially. The lecture covers key inference metrics such as time to first token, latency, and throughput, and explains how they are affected by factors such as batch size and model architecture. It also explores techniques for improving inference efficiency, including reducing KV cache size through methods like Grouped-Query Attention, Multi-Head Latent Attention, Cross-Layer Attention, and local attention. The lecture further discusses more radical approaches like state-space models and diffusion models, as well as practical methods such as quantization, model pruning, and speculative decoding, and concludes with systems-level optimizations for handling dynamic, real-world inference workloads.
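To make the KV cache discussion concrete, here is a minimal back-of-the-envelope sketch (not from the lecture itself) of how cache memory scales with model shape, and why Grouped-Query Attention shrinks it: the cache grows with the number of key/value heads, so sharing KV heads across query heads reduces it proportionally. The specific configuration values below are illustrative assumptions, not figures from the lecture.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 accounts for storing both K and V;
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Illustrative 7B-class configuration (assumed): 32 layers, head_dim 128,
# 4096-token context, fp16 cache.
full_mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

print(f"MHA KV cache: {full_mha / 2**30:.2f} GiB per sequence")   # 2.00 GiB
print(f"GQA KV cache: {gqa / 2**30:.2f} GiB per sequence")        # 0.50 GiB (32 -> 8 KV heads)
```

Under these assumed numbers, cutting the KV heads from 32 to 8 cuts the per-sequence cache by 4x, which directly raises the batch size (and therefore throughput) that fits in a fixed memory budget.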