Multimodality refers to building ML systems that can interpret multiple modalities and produce meaningful outputs from them. There are 3 ways to do it:
Fusion: fuse modalities into joint representations
Coordination: keep a separate representation per modality and align them with one another
Fission: factorize modalities into multiple representations
In this Article, You Will Learn About:
Modality-level Fission
Fine-Grained Fission
Def. — Fission = learning a new set of representations that reflects multimodal internal structure such as data factorization or clustering.
Fission is built on the idea that multimodal data contains two kinds of information:
Information that is specific to each individual modality
Information that is shared across both modalities.
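To make the factorization idea concrete, here is a minimal sketch on synthetic data. It assumes the relationship between the two modalities is purely linear, which real systems rarely satisfy, and it uses least-squares regression as a stand-in for a learned factorization model. The variable names (`x1`, `x2`, `shared`, `specific`) are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic latent factors (assumed for illustration only).
z = rng.normal(size=(n, 2))    # information common to both modalities
p1 = rng.normal(size=(n, 2))   # information private to modality 1

# Observed features for each modality, built from the latent factors.
x1 = z @ rng.normal(size=(2, 3)) + p1 @ rng.normal(size=(2, 3))
x2 = z @ rng.normal(size=(2, 2))

# "Shared" part of x1: whatever is linearly predictable from x2.
w, *_ = np.linalg.lstsq(x2, x1, rcond=None)
shared = x2 @ w

# "Modality-specific" part of x1: the remainder that x2 cannot explain.
specific = x1 - shared

# The two parts add back up to x1, and the specific part is
# (numerically) orthogonal to every feature of x2.
print(np.allclose(shared + specific, x1))
print(np.abs(x2.T @ specific).max())
```

In practice the split would be learned by neural encoders rather than computed in closed form, but the goal is the same: one representation for what the modalities have in common and one for what each carries on its own.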
L1: Modality-level Fission
Fusion and coordination treat each input as a homogeneous entity. Modality-level fission instead dissects the input data streams into individual modalities, allowing for independent processing and analysis.
4-Step Process to Understanding Modality-Level Fission:
Decomposition: Modality-level fission involves breaking down multimodal inputs into their modal components, such as text, images, audio, or other data types.
Modality-Specific ML Model: Once modalities are isolated, modality-level fission enables specific processing tailored to each data type.
Feature Extraction: Modality-level fission facilitates the extraction of informative features from each modality.
Modality Fusion: After individual processing, modality-level fission allows for the fusion of information across modalities. This fusion step combines the extracted features from each modality to create a comprehensive representation of the input data. Fusion mechanisms can range from simple concatenation or averaging to more complex attention-based or graph-based approaches.
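The four steps above can be sketched in a few lines of Python. This is a toy illustration, not a real pipeline: the sample fields (`"text"`, `"image"`), the helper names, and both encoders are hypothetical stand-ins (a character-count vector instead of a language model, a per-channel mean instead of a CNN), and fusion is done by simple concatenation.

```python
import numpy as np

# Hypothetical multimodal sample; field names and shapes are assumptions.
sample = {
    "text": "a dog runs on the beach",
    "image": np.arange(12, dtype=float).reshape(2, 2, 3),  # tiny H x W x C array
}

def decompose(sample):
    """Step 1: split the multimodal input into its modal components."""
    return sample["text"], sample["image"]

def text_features(text, dim=4):
    """Steps 2-3 for text: toy character-count encoder (stand-in for a language model)."""
    vec = np.zeros(dim)
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec / max(len(text), 1)

def image_features(image):
    """Steps 2-3 for images: per-channel mean (stand-in for a CNN)."""
    return image.mean(axis=(0, 1))

def fuse(features):
    """Step 4: combine per-modality features by concatenation."""
    return np.concatenate(features)

text, image = decompose(sample)
joint = fuse([text_features(text), image_features(image)])
print(joint.shape)  # (7,)
```

Concatenation is the simplest fusion choice. Swapping `fuse` for an averaging or attention-based function would leave the first three steps unchanged, which is the main appeal of processing each modality separately.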
L2: Fine-Grained Fission
There is more than one way for the modalities to interact with each other. For example, assuming our 2 modalities are images (processed by a CNN) and language (processed by an LLM), fine-grained fission might entail:
Extracting individual objects or regions of interest from images using a CNN and analyzing the corresponding textual descriptions using a language model (LLM) to establish semantic correlations between visual and textual elements.
Unlike modality-level fission, which treats entire modalities as indivisible entities, fine-grained fission breaks down multimodal inputs into fine-level components or features.
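As a rough sketch of what linking fine-level components can look like, the example below matches image regions to text tokens by cosine similarity. Everything here is assumed for illustration: the embeddings are random vectors rather than real CNN or LLM outputs, both are assumed to already live in the same 8-dimensional space, and the region embeddings are built as noisy copies of token embeddings so that there is a correct alignment to recover.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fine-grained components: 4 text tokens and 3 image regions,
# each embedded in a shared 8-dimensional space (an assumption).
tokens = ["dog", "beach", "ball", "sky"]
token_emb = rng.normal(size=(4, 8))

# Regions are noisy copies of tokens 0, 2 and 1, in that order.
region_emb = token_emb[[0, 2, 1]] + 0.01 * rng.normal(size=(3, 8))

def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Cosine similarity between every region and every token: shape (3, 4).
sim = l2_normalize(region_emb) @ l2_normalize(token_emb).T

# For each region, pick the most similar token.
best = sim.argmax(axis=1)
for r, t in enumerate(best):
    print(f"region {r} -> {tokens[t]}")
```

The key difference from the modality-level view is the shape of `sim`: instead of one vector per modality, there is one score per (region, token) pair, so the model can reason about which part of the image goes with which word.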
Conclusion
Fission is about modalities working together to arrive at useful contextualized answers.