Multimodality Explained. Part III: Fission

May 6, 2024

Part III of Multimodal: Fission

Multimodality refers to building ML systems that can derive meaningful outputs from multiple modalities. There are 3 ways to do it:

  1. Fusion: fuse modalities into joint representations
  2. Coordination: keep a separate encoder per modality and coordinate their representations
  3. Fission: factorize modalities into multiple representations
Image by: Carnegie Mellon Multimodal ML Class

In this Article, You Will Learn About:
  1. Modality-level Fission
  2. Fine-Grained Fission
Def.: Fission = learning a new set of representations that reflects multimodal internal structure, such as data factorization or clustering.

Fission is the idea that multimodal data contains two kinds of information:

  1. Information specific to each individual modality
  2. Information that can only be found across both modalities.
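
To make the two kinds of information concrete, here is a minimal sketch, assuming each modality is a single numeric signal observed over a few samples. It splits modality A into a part that is linearly explained by modality B and a leftover part that is specific to A. The names `dot` and `factorize` and the toy signals are hypothetical, and a linear projection is only a stand-in for the learned factorizations used in practice.

```python
from typing import List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def factorize(x_a: Vector, x_b: Vector) -> Tuple[Vector, Vector]:
    """Split x_a into (shared, specific) with respect to x_b.

    shared   = projection of x_a onto x_b (the part linearly explained by x_b)
    specific = the residual, which is orthogonal to x_b
    """
    scale = dot(x_a, x_b) / dot(x_b, x_b)
    shared = [scale * b for b in x_b]
    specific = [a - s for a, s in zip(x_a, shared)]
    return shared, specific


# Toy signals (made up): both modalities contain the same underlying pattern,
# and modality A also carries a component of its own.
common = [1.0, -1.0, 2.0, 0.0]
only_a = [0.5, 0.5, 0.0, -1.0]
x_a = [c + p for c, p in zip(common, only_a)]
x_b = [2.0 * c for c in common]

shared, specific = factorize(x_a, x_b)
```

In this toy setup `shared` recovers `common` and `specific` recovers `only_a`, but only because the two made-up components were chosen to be orthogonal. Real modalities are rarely that clean.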

L1: Modality-level Fission

Fusion and coordination treated inputs as homogeneous entities. Modality-level fission dissects the input data streams into individual modalities, allowing for independent processing and analysis.

Image by: https://www.researchgate.net/figure/Modality-Fusion-layer-over-the-extracted-modality-specific-features-helping-the-system_fig3_327174829

4-Step Process to Understanding Modality-Level Fission:

  1. Decomposition: Modality-level fission involves breaking down multimodal inputs into their modal components, such as text, images, audio, or other data types.
  2. Modality-Specific ML Model: Once modalities are isolated, modality-level fission enables specific processing tailored to each data type.
  3. Feature Extraction: Modality-level fission facilitates the extraction of informative features from each modality.
  4. Modality Fusion: After individual processing, modality-level fission allows for the fusion of information across modalities. This fusion step combines the extracted features from each modality to create a comprehensive representation of the input data. Fusion mechanisms can range from simple concatenation or averaging to more complex attention-based or graph-based approaches.
Image by: Carnegie Mellon Multimodal ML Class
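
The four steps above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: the encoders (`encode_text`, `encode_image`, `encode_audio`) are hypothetical hand-written feature functions standing in for real modality-specific models, and fusion is plain concatenation, the simplest of the mechanisms mentioned in step 4.

```python
from typing import Callable, Dict, List

Features = List[float]


def encode_text(text: str) -> Features:
    """Toy text encoder: word count and mean word length."""
    words = text.split()
    mean_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return [float(len(words)), mean_len]


def encode_image(pixels: List[List[float]]) -> Features:
    """Toy image encoder: mean and max pixel intensity."""
    flat = [p for row in pixels for p in row]
    return [sum(flat) / len(flat), max(flat)]


def encode_audio(samples: List[float]) -> Features:
    """Toy audio encoder: mean absolute amplitude (a crude loudness measure)."""
    return [sum(abs(s) for s in samples) / len(samples)]


# Step 2: one modality-specific model per data type.
ENCODERS: Dict[str, Callable] = {
    "text": encode_text,
    "image": encode_image,
    "audio": encode_audio,
}


def process(sample: Dict[str, object]) -> Features:
    # Step 1: decomposition. The multimodal sample is split by modality key.
    # Steps 2-3: each part goes through its own encoder to extract features.
    per_modality = {name: ENCODERS[name](data) for name, data in sample.items()}
    # Step 4: fusion by simple concatenation, in a fixed (alphabetical) order.
    fused: Features = []
    for name in sorted(per_modality):
        fused.extend(per_modality[name])
    return fused


sample = {
    "text": "a dog barks",
    "image": [[0.0, 0.5], [1.0, 0.5]],
    "audio": [0.2, -0.4, 0.6],
}
fused = process(sample)
```

The fused vector here has five entries: one from audio, two from the image, and two from the text. Swapping concatenation for averaging or an attention layer would change only the last few lines of `process`.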

L2: Fine-Grained Fission

There is more than one way that the modalities can interact with each other. For example, assuming our 2 modalities are images (processed by a CNN) and language (processed by an LLM), fine-grained fission might entail:

  1. Extracting individual objects or regions of interest from images using a CNN and analyzing the corresponding textual descriptions using a language model (LLM) to establish semantic correlations between visual and textual elements.
Unlike modality-level fission, which treats entire modalities as indivisible entities, fine-grained fission breaks down multimodal inputs into fine-level components or features.
Image by: Carnegie Mellon Multimodal ML Class
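
As a rough sketch of the region-to-text example above, the snippet below matches each image region to the word whose embedding is most similar. It assumes region and word vectors already live in a shared embedding space, which is a strong assumption. The vectors are made up, and `cosine` and `align` are hypothetical helpers rather than any library's API.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def align(regions: Dict[str, Vector], words: Dict[str, Vector]) -> Dict[str, str]:
    """Match every image region to the word with the most similar embedding."""
    return {
        region: max(words, key=lambda w: cosine(vec, words[w]))
        for region, vec in regions.items()
    }


# Made-up embeddings in a shared 3-d space. In a real system the region
# vectors would come from a CNN and the word vectors from a language model.
regions = {
    "region_0": [0.9, 0.1, 0.0],
    "region_1": [0.0, 0.2, 0.9],
}
words = {
    "dog": [1.0, 0.0, 0.1],
    "ball": [0.1, 0.1, 1.0],
    "grass": [0.0, 1.0, 0.0],
}

matches = align(regions, words)
```

With these toy vectors, `region_0` pairs with "dog" and `region_1` with "ball". The point is that the unit of analysis is a region or a word, not a whole image or a whole sentence.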

Conclusion
  1. Fission is about modalities working together to arrive at useful contextualized answers.
  2. Unlike modality-level fission, which treats entire modalities as indivisible entities, fine-grained fission breaks down multimodal inputs into fine-level components or features.