YouTube, 30 May 2024
1h 20m

Stanford CS25: V4 I From Large Language Models to Large Multimodal Models

Stanford Online

This episode explores the landscape of large language models (LLMs) and multimodal pre-training. It focuses on key moments in their development, practical training techniques, and potential research directions. The discussion opens by highlighting three pivotal moments: the emergence of BERT, the realization of scaling laws with GPT-3, and the impact of task adaptation demonstrated by ChatGPT. Against this backdrop, the conversation turns to the technical details of training LLMs. These include transformer architecture adaptations such as decoder-only models and pre-layer normalization, as well as distributed training frameworks such as DeepSpeed and Megatron. More significantly, the talk emphasizes the importance of data cleaning, filtering, and synthesis, challenging the conventional focus on algorithm and architecture innovation. When the discussion moves to recent advances, it highlights models like CogVLM and CogAgent for their contributions to image understanding and web agent capabilities. The episode concludes with predictions for future trends, including advances in video understanding and embodied AI, reflecting an emerging industry trend of shifting compute toward high-quality data generation.

Outlines

Part 1: Introduction and Language Model Evolution

Part 2: Multimodal Models and Image Generation

Part 3: Future Trends and Research Advice

Part 4: Q&A on Models and Data