This episode explores multimodal deep learning, particularly its applications and future directions within NLP, with a focus on the fusion of text and images. Noting that multimodality itself is loosely defined, the lecture motivates it on three grounds: faithfulness to how humans actually understand the world, the inherently multimodal nature of internet data, and the opportunity to scale language models with more diverse data. The discussion traces the field's evolution from early models such as WSABIE and bag-of-visual-words approaches, through deep learning methods built on CNNs and LSTMs, to transformer-based architectures like CLIP and Flamingo. More significantly, the lecture examines fusion techniques (early, middle, and late) and the challenges of modality dominance and noise, while stressing the importance of evaluation, as demonstrated by the creation and analysis of datasets such as Hateful Memes and Winoground. The discussion then pivots to other modalities, including audio, video, and even olfaction, before the episode concludes with predictions: a single foundation model capable of consuming and synthesizing data across many modalities, a growing role for retrieval augmentation, and a continued need for better evaluation metrics, in line with emerging industry trends toward more comprehensive AI systems.
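As a rough illustration of the early versus late fusion techniques mentioned above, here is a minimal PyTorch-style sketch (not from the lecture; the class names, dimensions, and combination rule are all hypothetical) that fuses pre-extracted text and image feature vectors:

```python
# Minimal sketch contrasting early and late fusion of text and image features.
import torch
import torch.nn as nn


class EarlyFusionClassifier(nn.Module):
    """Early fusion: concatenate modalities before any joint processing."""

    def __init__(self, text_dim: int, image_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([text_feats, image_feats], dim=-1)  # fuse at the input
        return self.classifier(fused)


class LateFusionClassifier(nn.Module):
    """Late fusion: score each modality separately, combine only at the end."""

    def __init__(self, text_dim: int, image_dim: int, num_classes: int):
        super().__init__()
        self.text_head = nn.Linear(text_dim, num_classes)
        self.image_head = nn.Linear(image_dim, num_classes)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # Averaging per-modality logits: if one modality produces much larger
        # scores, it can dominate the prediction, one simple view of the
        # "modality dominance" problem mentioned above.
        return 0.5 * (self.text_head(text_feats) + self.image_head(image_feats))


if __name__ == "__main__":
    text = torch.randn(4, 512)   # e.g. pooled sentence embeddings
    image = torch.randn(4, 768)  # e.g. CNN or ViT pooled features
    print(EarlyFusionClassifier(512, 768, 2)(text, image).shape)  # torch.Size([4, 2])
    print(LateFusionClassifier(512, 768, 2)(text, image).shape)   # torch.Size([4, 2])
```

Middle (intermediate) fusion would sit between these two extremes, letting the modalities interact inside the network, for example via cross-attention layers, as transformer-based models like Flamingo do.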