YouTube · 17 Oct 2025
1h 47m

Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 2 - Transformer-Based Models & Tricks

Stanford Online

The lecture gives an overview of self-attention, the transformer architecture, and position embeddings. It opens with a recap of self-attention, centered on queries, keys, and values, and of the transformer's encoder-decoder structure for machine translation. It then compares learned and static position embeddings and details the sine and cosine formulation, which encodes token positions and the relative distances between them. Further topics include layer normalization, RMSNorm, and attention variants such as local and global attention, multi-query attention, and grouped-query attention. The lecture also covers the BERT model, including its masked language modeling and next sentence prediction tasks.
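
As a rough illustration of the sine and cosine formulation mentioned in the summary, here is a minimal sketch in plain Python. It is not taken from the lecture: the function names `sinusoidal_embedding` and `dot` are hypothetical, and the base of 10000 is assumed from the original Transformer paper. The sketch also checks the relative-distance property: the dot product of two embeddings depends only on the offset between their positions.

```python
import math


def sinusoidal_embedding(position, d_model, base=10000.0):
    """Return the sine/cosine embedding of one position as a list of floats.

    Assumption: even indices hold sines and odd indices hold cosines,
    with base 10000 as in the original Transformer paper.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    embedding = []
    for i in range(d_model // 2):
        # Each (sin, cos) pair shares one frequency, which shrinks as i grows.
        angle = position / (base ** (2 * i / d_model))
        embedding.append(math.sin(angle))
        embedding.append(math.cos(angle))
    return embedding


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Positions (3, 7) and (10, 14) are both 4 apart, so the dot products match.
near_start = dot(sinusoidal_embedding(3, 16), sinusoidal_embedding(7, 16))
further_on = dot(sinusoidal_embedding(10, 16), sinusoidal_embedding(14, 16))
print(abs(near_start - further_on) < 1e-9)
```

The match follows from the identity sin(a)sin(b) + cos(a)cos(b) = cos(a - b), applied to each frequency pair, so only the difference between the two positions survives in the sum.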

Outline

Part 1: Logistics, Recap

Part 2: Transformer Evolution, Embeddings, Norms

Part 3: Model Architectures, BERT

Part 4: Practical Applications, Limitations
