Stanford CS25: V4 I From Large Language Models to Large Multimodal Models

This episode explores the landscape of large language models (LLMs) and multimodal pre-training, focusing on key moments in their development, practical training techniques, and potential research directions. The discussion begins by highlighting three pivotal moments: the emergence of BERT, the realization of scaling laws with GPT-3, and the impact of task adaptation demonstrated by ChatGPT. The conversation then turns to the technical details of training LLMs, including transformer architecture adaptations such as decoder-only models and pre-layer normalization, alongside large-scale training frameworks such as DeepSpeed and Megatron. The importance of data cleaning, filtering, and synthesis is emphasized above all, challenging the conventional focus on algorithm and architecture innovation. Turning to recent work, the episode highlights models such as CogVLM and CogAgent for their contributions to image understanding and web agent capabilities. It concludes with predictions for future trends, including advances in video understanding and embodied AI, and notes an emerging industry pattern of shifting compute toward high-quality data generation.

Outlines

Part 1: Introduction and Language Model Evolution

Part 2: Multimodal Models and Image Generation

Part 3: Future Trends and Research Advice

Part 4: Q&A on Models and Data

Stanford Online

Part 1: Introduction and Language Model Evolution

00:05 Introduction to Multimodal Pre-training and the Evolution of Language Models

09:50 The Importance of Training Loss and Technical Details of Large Language Models

17:40 Optimization Techniques for Training Large Language Models

Part 2: Multimodal Models and Image Generation

28:24 The Dominance of Data and Recent Advances in Vision-Language Models

39:07 CogAgent and Universal Modeling for Vision-Language Tasks

47:28 Diffusion Models and Relay Diffusion for Image Generation

55:20 Transformer Architectures in Diffusion Models and the Importance of Video Understanding

Part 3: Future Trends and Research Advice

1:02:11 Future Trends in Multimodality and Advice for Researchers

Part 4: Q&A on Models and Data

1:07:33 Q&A: Long Context Windows and Associated Costs

1:08:48 Q&A: Data Quality vs. Model Architecture

1:12:08 Q&A: Autoregressive vs. Diffusion in Image Generation

1:15:04 Q&A: Differences Between CogAgent and CogVLM

1:16:23 Q&A: Video Understanding and Physical Understanding

1:18:09 Q&A: VQA Tasks and Tree Structures