Paul Liang's lecture focuses on multimodal fusion, a method for combining different data modalities into a joint representation that captures their interactions. He describes fusion as a spectrum: at one end, pre-trained models extract abstract features that are combined with simple fusion operations; at the other, raw data is fused early and the model itself learns the more complex cross-modal interactions. The lecture covers basic fusion concepts such as linear regression with multiplicative interaction terms, extends them to higher-dimensional settings via element-wise multiplication and bilinear fusion, and introduces multimodal transformers. He also discusses efficiency improvements through low-rank approximations, dynamic gating mechanisms, and strategies for balancing dominant and secondary modalities. The lecture concludes with challenges in multimodal learning, such as unimodal bias, and potential remedies like data rebalancing and gradient flow management.
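To make the bilinear-fusion and low-rank ideas above concrete, here is a minimal PyTorch sketch, not taken from the lecture itself: the module names, dimensions, and the specific factorization are illustrative assumptions, showing full bilinear fusion of two modality vectors and a low-rank variant that trades expressiveness for far fewer parameters.

```python
import torch
import torch.nn as nn


class BilinearFusion(nn.Module):
    """Full bilinear fusion: every feature pair (x_i, y_j) gets its own weight."""

    def __init__(self, dx: int, dy: int, dout: int):
        super().__init__()
        # One (dx x dy) interaction matrix per output dimension.
        self.W = nn.Parameter(torch.randn(dout, dx, dy) * 0.01)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # z_k = x^T W_k y for each output dimension k.
        return torch.einsum('bi,kij,bj->bk', x, self.W, y)


class LowRankFusion(nn.Module):
    """Low-rank approximation: factor the bilinear tensor into per-modality
    projections of rank r, multiplied element-wise and summed over the rank."""

    def __init__(self, dx: int, dy: int, dout: int, rank: int = 8):
        super().__init__()
        self.Ux = nn.Linear(dx, rank * dout, bias=False)
        self.Uy = nn.Linear(dy, rank * dout, bias=False)
        self.rank, self.dout = rank, dout

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        px = self.Ux(x).view(-1, self.rank, self.dout)
        py = self.Uy(y).view(-1, self.rank, self.dout)
        # Element-wise (multiplicative) interaction, then sum over the rank dimension.
        return (px * py).sum(dim=1)


# Usage: fuse a 128-d feature from one modality with a 64-d feature from another
# into a 32-d joint representation (all sizes are hypothetical).
x, y = torch.randn(4, 128), torch.randn(4, 64)
print(BilinearFusion(128, 64, 32)(x, y).shape)        # torch.Size([4, 32])
print(LowRankFusion(128, 64, 32, rank=8)(x, y).shape)  # torch.Size([4, 32])
```

The full bilinear module needs dout * dx * dy parameters, while the low-rank sketch needs only rank * dout * (dx + dy), which is the kind of efficiency gain the lecture attributes to low-rank approximations.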