Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 2 - Transformer-Based Models & Tricks
Stanford Online
The lecture gives an overview of self-attention, the Transformer architecture, and position embeddings. It begins with a recap of self-attention, emphasizing queries, keys, and values, and of the Transformer's encoder-decoder structure for machine translation. The discussion then compares learned and static position embeddings, detailing the sine and cosine formulation for representing token positions and their relative distances. Further topics include layer normalization, RMSNorm, and attention variants such as local and global attention. The lecture also covers multi-query and grouped-query attention, as well as the BERT model, including its masked language modeling and next sentence prediction tasks.
Part 1: Logistics, Recap
Part 2: Transformer Evolution, Embeddings, Norms
Part 3: Model Architectures, BERT
Part 4: Practical Applications, Limitations
