This episode explores the landscape of large language models (LLMs) and multimodal pre-training, focusing on key moments in their development, practical training techniques, and promising research directions. The discussion begins by highlighting three pivotal moments: the emergence of BERT, the demonstration of scaling laws by GPT-3, and the impact of task adaptation shown by ChatGPT. Against the backdrop of these advances, the conversation turns to the technical details of training LLMs, including transformer architecture adaptations such as decoder-only models and pre-layer normalization, alongside distributed training frameworks such as DeepSpeed and Megatron. Notably, the importance of data cleaning, filtering, and synthesis is emphasized, challenging the conventional focus on algorithm and architecture innovation. The discussion then turns to recent models such as CogVLM and CogAgent, highlighted for their contributions to image understanding and web agent capabilities. The episode concludes with predictions for future trends, including advances in video understanding and embodied AI, reflecting an emerging industry shift of compute toward high-quality data generation.
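To make the architectural point concrete, here is a minimal sketch (not from the episode; module names, dimensions, and the PyTorch framing are illustrative assumptions) of a decoder-style transformer block with pre-layer normalization, where LayerNorm is applied before each sublayer rather than after it, a choice commonly credited with stabilizing training of deep LLMs:

```python
import torch
import torch.nn as nn

class PreLNTransformerBlock(nn.Module):
    """Hypothetical decoder-style block with pre-layer normalization:
    LayerNorm precedes the attention and MLP sublayers, and residual
    connections bypass the normalized path."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor, causal_mask: torch.Tensor) -> torch.Tensor:
        # Normalize *before* self-attention, then add the residual.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out
        # Normalize *before* the feed-forward sublayer, then add the residual.
        x = x + self.mlp(self.ln2(x))
        return x
```

For a decoder-only model, `causal_mask` would be an upper-triangular mask so each position attends only to earlier tokens.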