The lecture introduces Large Language Models (LLMs), defining them as large-scale language models that predict token sequences and emphasizing their scale in parameters, training data, and compute. It distinguishes LLMs from earlier models such as BERT, highlighting the decoder-only architecture, and introduces Mixture of Experts (MoE), which improves computational efficiency by activating only a subset of the model's parameters per token. The discussion covers dense versus sparse MoE layers and techniques for preventing routing collapse during training. It then turns to response generation, contrasting greedy decoding and beam search with sampling methods such as top-K and top-P sampling, and examining how temperature affects output diversity. The lecture concludes with strategies for improving LLM inference efficiency, including KV caching, grouped-query attention, and speculative decoding.
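To make the sparse-MoE idea concrete, below is a minimal sketch of top-k expert routing: a learned router scores each token, only the k highest-scoring experts run, and their outputs are mixed with the router's softmax weights. All names (`sparse_moe_layer`, the random "expert" weight matrices) are illustrative rather than from the lecture, and the sketch omits the auxiliary load-balancing tricks the lecture mentions for preventing routing collapse.

```python
import numpy as np

def sparse_moe_layer(x, router_w, expert_ws, k=2):
    """x: (d,) token activation; router_w: (d, n_experts); expert_ws: list of (d, d) matrices."""
    scores = x @ router_w                         # router logits, one per expert
    top = np.argsort(scores)[-k:]                 # indices of the k best-scoring experts
    gate = np.exp(scores[top] - scores[top].max())
    gate /= gate.sum()                            # softmax over the selected experts only
    # Only the chosen experts do any work, so compute scales with k, not n_experts.
    return sum(g * np.tanh(x @ expert_ws[i]) for g, i in zip(gate, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
out = sparse_moe_layer(rng.normal(size=d),
                       rng.normal(size=(d, n_experts)),
                       [rng.normal(size=(d, d)) for _ in range(n_experts)])
print(out.shape)  # (8,)
```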
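Likewise, the decoding discussion can be illustrated with a short sketch of temperature scaling, top-K, and top-P (nucleus) sampling over a toy logit vector. The function name `sample_next_token` and the example logits are assumptions for illustration, not the lecture's code; greedy decoding corresponds to the temperature-goes-to-zero limit of this procedure.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Pick one token id from raw logits.

    temperature < 1 sharpens the distribution, > 1 flattens it;
    top_k keeps only the k most likely tokens;
    top_p keeps the smallest set of tokens whose cumulative probability >= p.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)

    # Softmax over the temperature-scaled logits.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-K: zero out everything outside the K most probable tokens.
    if top_k is not None:
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    # Top-P (nucleus): keep the smallest prefix of the sorted distribution
    # whose cumulative mass reaches p.
    if top_p is not None:
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        masked = np.zeros_like(probs)
        masked[keep] = probs[keep]
        probs = masked

    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

print(sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_k=3, top_p=0.9))
```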