Multimodal Understanding Through Correlation Maximization and Minimization


Multimodal learning has mainly focused on training large models on different modalities, and fusing their feature representations, to improve performance on downstream tasks. In this work, we take a detour from this trend and study the intrinsic nature of multimodal data by asking two questions: 1) can we learn more structured latent representations of general multimodal data? and 2) can we intuitively understand, both mathematically and visually, what the latent representations capture? To answer 1), we propose a general and lightweight framework, Multimodal Understanding Through Correlation Maximization and Minimization (MUCMM), that can be incorporated into any large pre-trained network. MUCMM learns both common and individual representations: the common representations capture what is shared between the modalities, and the individual representations capture what is unique to each modality. To answer 2), we propose novel scores that summarize the learned common and individual structures, and we visualize the gradients of these scores with respect to the input to show what the different representations capture. We further provide mathematical intuition for the computed gradients in a linear setting and demonstrate the effectiveness of our approach through a variety of experiments.
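
The idea of pairing correlation maximization with correlation minimization can be illustrated with a small sketch. This is not the MUCMM implementation: the abstract does not specify the objective, so the sketch assumes one-dimensional representations, Pearson correlation as the correlation measure, and an objective that rewards correlation between the two common representations while penalizing correlation between the two individual representations. The names `pearson` and `correlation_objective` are hypothetical and chosen only for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of floats."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_objective(common_a, common_b, indiv_a, indiv_b):
    """Illustrative loss (an assumption, not the paper's objective).

    Lower is better: the common representations of modalities A and B
    should be highly correlated, while the individual representations
    should be uncorrelated with each other.
    """
    reward = pearson(common_a, common_b)        # to be maximized
    penalty = abs(pearson(indiv_a, indiv_b))    # to be minimized
    return penalty - reward


# Toy representations for four samples of two modalities.
common_a = [1.0, 2.0, 3.0, 4.0]
common_b = [2.0, 4.0, 6.0, 8.0]      # perfectly correlated with common_a
indiv_a = [1.0, -1.0, -1.0, 1.0]
indiv_b = [1.0, 1.0, -1.0, -1.0]     # uncorrelated with indiv_a

loss = correlation_objective(common_a, common_b, indiv_a, indiv_b)
print(loss)
```

In this toy case the common parts are perfectly correlated and the individual parts are uncorrelated, so the illustrative loss reaches its minimum of -1. In a real setting the representations would be multi-dimensional outputs of learned networks, and the correlation terms would be optimized by gradient descent.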

Yifeng Shi
Graduate Student in Computer Science

My research is in machine learning. So far I have been focusing on machine learning approaches for set-valued data and approaches for clustering with side-information.

Marc Niethammer
Professor of Computer Science

My research interests include image registration, image segmentation, shape analysis, machine learning, and biomedical applications.