Tim Cvetko

Multimodality Explained. Part III: Fission

May 6, 2024

Part III of Multimodal: Fission

Multimodality refers to building ML systems that can produce meaningful outputs from multiple modalities. There are 3 ways to do it:

Fusion: fuse modalities into joint representations
Coordination: keep separate encoders and coordinate the representations they build
Fission: factorize modalities into multiple representations

Image by: Carnegie Mellon Multimodal ML Class

In this Article, You Will Learn About:

Modality-level Fission
Fine-Grained Fission

Def. — Fission = learning a new set of representations that reflects multimodal internal structure such as data factorization or clustering.

Fission rests on the idea that different modalities contain two kinds of information:

Information that is unique to each modality
Information that can only be found when both modalities are considered together
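
To make these two kinds of information concrete, here is a minimal Python sketch of the shape a fission model's output might take. The names ToyFission and FissionOutput, the linear encoders, and the random weights are all assumptions made for illustration. A real model would learn these weights, typically with training objectives that encourage the parts to stay separate.

```python
import random
from dataclasses import dataclass

Vector = list[float]


def linear(weights: list[Vector], x: Vector) -> Vector:
    """Apply a weight matrix (given as a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows: int, cols: int, rng: random.Random) -> list[Vector]:
    """Stand-in for learned weights (assumption: random, untrained)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


@dataclass
class FissionOutput:
    unique_a: Vector  # depends on modality A only
    unique_b: Vector  # depends on modality B only
    joint: Vector     # depends on both modalities together


class ToyFission:
    """Hypothetical three-encoder layout: one per modality plus one joint."""

    def __init__(self, dim_a: int, dim_b: int, dim_out: int, seed: int = 0):
        rng = random.Random(seed)
        self.w_a = random_matrix(dim_out, dim_a, rng)
        self.w_b = random_matrix(dim_out, dim_b, rng)
        self.w_joint = random_matrix(dim_out, dim_a + dim_b, rng)

    def __call__(self, x_a: Vector, x_b: Vector) -> FissionOutput:
        return FissionOutput(
            unique_a=linear(self.w_a, x_a),
            unique_b=linear(self.w_b, x_b),
            joint=linear(self.w_joint, x_a + x_b),  # list concatenation
        )


model = ToyFission(dim_a=3, dim_b=2, dim_out=4)
out = model([1.0, 0.0, 2.0], [0.5, -0.5])
print(len(out.unique_a), len(out.unique_b), len(out.joint))
```

The point of the layout is that one input pair yields several representations rather than one: changing modality B leaves unique_a untouched, while the joint part reacts to either modality.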

L1: Modality-level Fission

Fusion and coordination treat inputs as homogeneous entities. Modality-level fission instead dissects the input data streams into individual modalities, allowing each one to be processed and analyzed independently.

Image by: https://www.researchgate.net/figure/Modality-Fusion-layer-over-the-extracted-modality-specific-features-helping-the-system_fig3_327174829

4-Step Process to Understanding Modality-Level Fission:

Decomposition: Modality-level fission breaks multimodal inputs down into their modal components, such as text, images, audio, or other data types.
Modality-Specific ML Model: Once the modalities are isolated, each one is processed by a model tailored to its data type, for example a CNN for images or a language model for text.
Feature Extraction: Each modality-specific model extracts informative features from its own modality.
Modality Fusion: After individual processing, modality-level fission allows for the fusion of information across modalities. This fusion step combines the extracted features from each modality to create a comprehensive representation of the input data. Fusion mechanisms can range from simple concatenation or averaging to more complex attention-based or graph-based approaches.
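
The four steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not a real model: the sample layout (a dict with "text" and "image" keys), the function names, and the hand-written features are all hypothetical, and the "models" are simple preprocessing functions standing in for a CNN or a language model. Fusion here is the simplest option mentioned above, concatenation.

```python
Vector = list[float]


def decompose(sample: dict) -> tuple[str, list[list[int]]]:
    """Step 1: split the multimodal sample into its modal components."""
    return sample["text"], sample["image"]


def process_text(text: str) -> list[str]:
    """Step 2 (text): stand-in for a text model, here just tokenization."""
    return text.lower().split()


def process_image(image: list[list[int]]) -> list[float]:
    """Step 2 (image): stand-in for a vision model, scales pixels to [0, 1]."""
    return [pixel / 255.0 for row in image for pixel in row]


def text_features(tokens: list[str]) -> Vector:
    """Step 3 (text): token count and mean token length."""
    if not tokens:
        return [0.0, 0.0]
    return [float(len(tokens)), sum(len(t) for t in tokens) / len(tokens)]


def image_features(pixels: list[float]) -> Vector:
    """Step 3 (image): mean and maximum brightness."""
    if not pixels:
        return [0.0, 0.0]
    return [sum(pixels) / len(pixels), max(pixels)]


def fuse(*features: Vector) -> Vector:
    """Step 4: fuse per-modality features by concatenation."""
    return [value for vector in features for value in vector]


def run_pipeline(sample: dict) -> Vector:
    text, image = decompose(sample)
    return fuse(
        text_features(process_text(text)),
        image_features(process_image(image)),
    )


sample = {"text": "A black cat", "image": [[0, 255], [255, 0]]}
print(run_pipeline(sample))  # [3.0, 3.0, 0.5, 1.0]
```

Each modality passes through its own branch untouched by the other until the final step, which is what distinguishes this from fusing the raw inputs up front.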

L2: Fine-Grained Fission

There is more than one way that the modalities can interact with each other. For example, assuming our 2 modalities are images (processed by a CNN) and language (processed by an LLM), fine-grained fission might entail:

Extracting individual objects or regions of interest from images using a CNN and analyzing the corresponding textual descriptions using a language model (LLM) to establish semantic correlations between visual and textual elements.

Unlike modality-level fission, which treats entire modalities as indivisible entities, fine-grained fission breaks multimodal inputs down into finer-grained components or features.
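
As a rough illustration of the image-and-text example, the sketch below pairs each image region with its closest word using cosine similarity. It assumes that region and word embeddings already live in a shared vector space. The three-dimensional vectors are made up by hand, and the names cosine and align_regions_to_words are hypothetical. In practice the embeddings would come from the CNN and the language model.

```python
import math

Vector = list[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_regions_to_words(
    regions: dict[str, Vector], words: dict[str, Vector]
) -> dict[str, str]:
    """For each image region, pick the word with the highest cosine similarity."""
    return {
        region: max(words, key=lambda word: cosine(vector, words[word]))
        for region, vector in regions.items()
    }


# Hand-made embeddings, for illustration only.
regions = {
    "region_0": [0.9, 0.1, 0.0],
    "region_1": [0.0, 0.2, 0.9],
}
words = {
    "dog": [1.0, 0.0, 0.1],
    "ball": [0.1, 0.1, 1.0],
    "the": [0.3, 0.3, 0.3],
}

print(align_regions_to_words(regions, words))
# {'region_0': 'dog', 'region_1': 'ball'}
```

The unit of analysis here is a single region or a single word rather than a whole image or sentence, which is the sense in which this kind of fission is fine-grained.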

Conclusion

Fission factorizes multimodal data into multiple representations, capturing both what is unique to each modality and what can only be found when the modalities are considered together.
Modality-level fission treats entire modalities as indivisible entities, whereas fine-grained fission breaks multimodal inputs down into finer-grained components or features.

References:

Tutorial on Multimodal Machine Learning: https://cmu-multicomp-lab.github.io/mmml-tutorial/cvpr2022/
Awesome Multimodal ML: https://github.com/pliang279/awesome-multimodal-ml
Multimodal Fusion for Multimodal Data: https://arxiv.org/abs/2306.02050