Paul Liang's lecture focuses on multimodal fusion, a method for combining different data modalities into a joint representation that captures their interactions. He describes fusion as a spectrum: at one end, pre-trained models extract abstract features that are combined with simple fusion operations; at the other, raw data is fused early and the model itself learns the more complex cross-modal interactions. The lecture covers basic fusion concepts such as linear regression with multiplicative interaction terms, extends them to higher-dimensional settings via element-wise multiplication and bilinear fusion, and introduces multimodal transformers. He also discusses efficiency improvements through low-rank approximations, dynamic gating mechanisms, and strategies for balancing dominant and secondary modalities. The lecture concludes with challenges in multimodal learning, such as unimodal bias, and potential remedies like data rebalancing and gradient flow management.
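To make the bilinear-fusion and low-rank ideas above concrete, here is a minimal PyTorch sketch, not taken from the lecture itself: the module names, dimensions, and the specific factorization are illustrative assumptions, showing full bilinear fusion of two modality vectors and a low-rank variant that trades expressiveness for far fewer parameters.

```python
import torch
import torch.nn as nn


class BilinearFusion(nn.Module):
    """Full bilinear fusion: every feature pair (x_i, y_j) gets its own weight."""

    def __init__(self, dx: int, dy: int, dout: int):
        super().__init__()
        # One (dx x dy) interaction matrix per output dimension.
        self.W = nn.Parameter(torch.randn(dout, dx, dy) * 0.01)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # z_k = x^T W_k y for each output dimension k.
        return torch.einsum('bi,kij,bj->bk', x, self.W, y)


class LowRankFusion(nn.Module):
    """Low-rank approximation: factor the bilinear tensor into per-modality
    projections of rank r, multiplied element-wise and summed over the rank."""

    def __init__(self, dx: int, dy: int, dout: int, rank: int = 8):
        super().__init__()
        self.Ux = nn.Linear(dx, rank * dout, bias=False)
        self.Uy = nn.Linear(dy, rank * dout, bias=False)
        self.rank, self.dout = rank, dout

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        px = self.Ux(x).view(-1, self.rank, self.dout)
        py = self.Uy(y).view(-1, self.rank, self.dout)
        # Element-wise (multiplicative) interaction, then sum over the rank dimension.
        return (px * py).sum(dim=1)


# Usage: fuse a 128-d feature from one modality with a 64-d feature from another
# into a 32-d joint representation (all sizes are hypothetical).
x, y = torch.randn(4, 128), torch.randn(4, 64)
print(BilinearFusion(128, 64, 32)(x, y).shape)        # torch.Size([4, 32])
print(LowRankFusion(128, 64, 32, rank=8)(x, y).shape)  # torch.Size([4, 32])
```

The full bilinear module needs dout * dx * dy parameters, while the low-rank sketch needs only rank * dout * (dx + dy), which is the kind of efficiency gain the lecture attributes to low-rank approximations.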