An Optimal-Transport-Based Multimodal Big Data Clustering


1. Introduction

With the rapid development and widespread applications of emerging computing technologies, data produced by human beings are experiencing explosive growth [1]. For instance, a smart power plant acquires real-time data through various devices, such as flow sensors, current transducers, and power meters, to monitor the working status of power equipment, in which an enormous amount of data with heterogeneous distributions are collected. In smart cities, smart car-detection technology utilizes multiple cameras and microphones to verify the categories of cars, amassing a great volume of information about cars. These extensive data collected from multiple devices are often referred to as multimodal data, in which the data of each modality contain consistent information that describes shared properties and complementary information about different properties among modalities [2,3,4]. The fusion of complementary and consistent information in multimodal data can further facilitate the understanding of patterns hidden in complex phenomena of the real world [5]. Thus, it is of great importance to design innovative computing paradigms to integrate information from multimodal data.
Multimodal clustering (MC), as a fundamental unsupervised task, aims to aggregate knowledge scattered in heterogeneous modalities to recognize intrinsic patterns of data [6,7,8]. Recently, some pioneering deep multimodal clustering methods have been proposed to leverage hierarchical nonlinear embedding features to explore patterns from fusion information, which can be roughly separated into two classes, i.e., deep fusion methods [9,10] and deep clustering methods [11,12]. The former utilizes self-supervision information of data to extract generalized fusion representations to characterize fusion information, and then performs vanilla single-modality clustering methods to mine patterns, whose two-stage learning strategy may disconnect fusion representation learning and clustering partition, causing suboptimal fusion representations for clustering partitioning. To close the gap between the two processes, deep clustering methods combine fusion representation learning and clustering partition into a joint optimization strategy. They utilize the divergences of data structures to guide the fusion of complementary information among modalities, as well as mining clustering structures in an end-to-end manner.
Existing MC methods have made significant progress in the pattern mining of multimodal data. However, most of them ignore the manifold topology of multimodal data and focus on strong-notion distance metrics that assume strict overlaps between the heterogeneous distributions of data in various modalities, such as the Kullback–Leibler (KL) divergence and Jensen–Shannon (JS) divergence, leading to degraded pattern-mining performance when they face general multimodal data applications [13]. Specifically, general multimodal data collected from diverse sources obey heterogeneous distributions and are embedded in their corresponding manifolds, and the manifolds of data in different modalities have at most a negligible intersection [14]. In that case, the assumption above is not valid everywhere, which leads to failure in measuring the differences between heterogeneous manifolds when integrating complementary information from multimodal data. In other words, strong-notion distance metrics provide no useful gradient information when training deep clustering models, which blurs discriminative structures when fusing representations of multimodal data. Thus, previous methods may not exhibit desirable pattern-mining performance in real-world applications.

To address this challenge, an optimal-transport-based multimodal clustering method (OTMC) is proposed to conduct multimodal clustering structure mining based on the optimal-transport (OT) theory, which ensures clustering-friendly discriminative structures when fusing complementary modality information with heterogeneous distributions. Specifically, multimodal clustering is defined as a map from multimodal data measures to clustering prototype set measures in OTMC, which utilizes the Wasserstein distance to obtain clear discriminative structures from heterogeneous information with no intersection of data distributions in different modalities. Furthermore, OTMC is decomposed into a modality-specific OT (MsOT) and a modality-common OT (McOT) within a consistent constraint, to disentangle the transportation of private and shared information of each modality. In addition, a generative transport plan is designed to derive the variational solution to OT, which can effectively search the map with a minimal transport cost for mining intrinsic patterns. Finally, although designed for all data with heterogeneous distributions in different modalities, e.g., text and video, OTMC is conducted on four real-world benchmark image datasets in extensive experiments to validate its superiority on high-dimensional data.

The main contributions of this paper are as follows:

  • An innovative weak-notion distance metric-based method is designed to measure differences between the manifold structures of data collected from diverse devices, which ensures the full fusion of complementary information from data with heterogeneous distributions.

  • Multimodal clustering is innovatively modeled using an optimal-transport-based multimodal clustering method (OTMC), which can capture fusion information with clear discriminative structures from heterogeneous modalities for mining intrinsic patterns.

  • A variational solution is derived to solve OT based on a generative transport plan, which can precisely match the transport map for transporting the multimodal data to clustering prototypes in OTMC.

  • Extensive experiments are conducted on four real-world benchmark datasets, which verify the superiority of OTMC compared with other methods in multimodal clustering, benefiting from the fact that OTMC never relies on the assumption that heterogeneous manifolds intersect. In particular, OTMC obtains 92.15% ACC, 84.96% NMI, and 83.35% ARI on Handwritten, improving by 2.25%, 2.82%, and 3.28%, respectively.

In the rest of this paper, Section 2 gives the related deep fusion methods and deep clustering methods; Section 3 introduces the theoretical framework of OTMC; Section 4 derives the variational generative solution to OTMC and network implementations; Section 5 showcases the experimental results of OTMC in comparison with deep fusion methods and deep clustering methods; and Section 6 concludes this work and points out some possible future research directions for subsequent exploration.

2. Related Work

Clustering is a classical unsupervised learning task. Various effective clustering algorithms like k-means have been proposed and widely utilized as cornerstones of the solutions to problems such as pavement deterioration [15], facility location [16], and the energy management of wireless sensor networks [17]. OTMC learns intrinsic patterns of multimodal data via neural networks, and is closely related to deep fusion methods and deep clustering methods in multimodal clustering.
Deep fusion methods, as two-stage methods, focus on extracting generalized fusion representations of multimodal data and then leveraging a single-modality clustering algorithm to mine clustering patterns. For example, DCCA [18,19] learned fusion representations of multimodal data via hierarchical nonlinear mappings along with crossmodal correlation maximization. DMF [20] mined the consensus semantics of multimodal data by leveraging multi-layer transformations subject to non-negative constraints to explore informative spatial bases. AE2-Nets [9] modeled a degradation process from multimodal data to fusion representations within a nested autoencoder architecture to capture consistent and complementary information among modalities. CMIB [10] utilized the rate-distortion principle to explore robust fusion representations by preserving complementary and consistent information and discarding the superfluous information of modalities. MIFN-ML [21] conducted intermediate fusion of biometric and facial representations after manifold-learning-based dimension reduction, and then made a clustering decision for stress detection via a nonlinear mapping. MCBVAD [22] introduced attention from other views to suppress information from irrelevant views on the basis of maximizing the information entropy and mutual information of views, which produced information-rich representations that improve k-means clustering results. Further, with the help of a Transformer-based attention learning block, VMC-CD [23] applied cross-view contrastive learning and maximum entropy of categories to obtain discriminative representations for k-means clustering. OSCAMC [24] jointly learned and optimized view-collective matrices, including a combined non-negative embedding matrix, a collective similarity matrix, a joint coefficient matrix, a unified spectral projection matrix, and so on, and then obtained the clustering results from those matrices.
Deep clustering methods integrate complementary information fusion with clustering pattern recognition in an end-to-end architecture, where divergences of data structures are used to supervise model training. MVaDE [25] modeled the generation process from hidden categories to visible data, where the Gaussian distribution was used to constrain the information fusion of each modality. CoMVC [26] utilized crossmodal contrastive learning followed by adaptive linear combination to fuse modality information, and then applied the Cauchy–Schwarz divergence in the fusion representation space for clustering. SDMVC [27] utilized KL-divergence to mine clustering patterns from concatenated complementary information, using consistent semantics to constrain the alignment of representation distributions in each modality. CMSC [28] fused modality information on the basis of a consensus subspace that maximized the correlations of modalities, and then performed spectral clustering to mine clustering patterns. DMIM [29] fused complementary information by maximizing the dependence between modalities and mined multimodal data patterns with the help of an over-clustering strategy. DMAC-SI [5] combined invariant semantic representation learning and a Markov decision process for multimodal data clustering into an end-to-end clustering framework, ensuring the flexibility of clustering structure exploration. EMVFC [30] integrated low-dimensional representation learning and cooperative learning in clustering to improve performance, which reduced redundant features and noise and enhanced the complementary information and correlation.

The deep fusion methods above extracted generalized fusion representations and then performed vanilla single-modality clustering methods to mine patterns, which may disconnect fusion representation learning and clustering partition, causing suboptimal fusion representations for clustering partition. The deep clustering methods above combined fusion representation learning and clustering partition into a joint optimization strategy for mining clustering structures in an end-to-end manner. However, they still relied on strong-notion distances to fuse crossmodal complementary knowledge, which limited their clustering performance. In contrast, optimal transport can define multimodal clustering as a map from multimodal data measures to clustering prototype set measures, which avoids representation-partition disconnections and dependency on the intersection of data distributions in different modalities.

Comparing optimal transport with clustering reveals a key difference: optimal transport is concerned only with achieving the lowest cost, regardless of the operational form. That is, given any two mappings that assign data points to cluster centers, optimal transport selects the one with the lower cost, without considering how it operates. In this sense, optimal transport is more flexible, and harder, than clustering, so a solution to optimal transport can also be used for clustering. In multimodal clustering, dividing the data into modality-specific and modality-common components is a standard approach. OTMC further distinguishes itself by incorporating optimal transport into this framework: it partitions the transport itself into modality-common and modality-specific components as well. This allows OTMC to better capture the intrinsic relationships both within and across modalities, offering a more nuanced representation of the data. In addition, opinions differ on how to evaluate the quality of a clustering algorithm, which is why many metrics have been proposed. If the objective of clustering is simply to minimize the sum of within-cluster distances, as in k-means, no further issues arise. In this context, our method is a beneficial step toward a new direction for clustering, namely evaluating clustering quality based on transport cost.

4. The Variational Generative Solution to OTMC

In this section, a variational generative solution to OT is derived, and then, it is generalized to multimodal scenarios for optimizing data partitioning within an optimal-transport clustering network.

4.1. Variational Generative Solution

The variational generative solution learns the generative relationship between data and prototypes by constructing the joint distribution p(x, c), which provides a theoretically accurate analytical solution to OT. Specifically, p(x, c) = p(c|x)p(x), where p(x) stands for the data distribution and p(c|x) denotes the cluster conditional distribution. Since the OT that transports the multimodal data to the clustering prototypes amounts to finding the optimal prototype c_j for each data point x_i, i.e., fitting the cluster conditional distribution p(c|x), the joint distribution has the same solution as OT.

To fit the joint distribution over the data measure μ and the clustering prototype measure ν, the solution is to find an optimal coupling measure π* between μ and ν such that the marginal constraints π(·, C) = μ and π(X, ·) = ν are satisfied. The optimal coupling π* is obtained by finding the infimum of the coupling cost in the set of coupling measures Π(μ, ν) as follows:

π* = arg inf_{π ∈ Π(μ, ν)} ∫_{X×C} d(x, c) dπ(x, c),

where d(x, c) stands for the coupling cost between x and c. The optimal coupling π* is obtained by reaching the infimum of the total coupling cost over the product space X × C.
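In the discrete case, the coupling π is a matrix whose rows sum to μ and whose columns sum to ν, and the infimum above becomes a small linear program. The sketch below is a toy numerical illustration of this Kantorovich problem (the data points, prototypes, measures, and solver choice are assumptions for the example, not the paper's implementation):

```python
# Toy discrete Kantorovich problem: find the coupling pi between a data
# measure mu (3 points) and a prototype measure nu (2 prototypes) that
# minimizes the total transport cost, via linear programming.
import numpy as np
from scipy.optimize import linprog

x = np.array([[0.0], [1.0], [4.0]])      # data points
c = np.array([[0.5], [4.0]])             # clustering prototypes
mu = np.full(3, 1.0 / 3.0)               # uniform data measure
nu = np.array([2.0 / 3.0, 1.0 / 3.0])    # prototype measure

# Ground cost d(x, c): squared Euclidean distance, flattened row-major.
cost = ((x[:, None, :] - c[None, :, :]) ** 2).sum(-1).ravel()

# Equality constraints: rows of pi sum to mu, columns sum to nu.
A_eq = []
for i in range(3):                        # row marginals
    row = np.zeros(6); row[i * 2:(i + 1) * 2] = 1.0; A_eq.append(row)
for j in range(2):                        # column marginals
    col = np.zeros(6); col[j::2] = 1.0; A_eq.append(col)
b_eq = np.concatenate([mu, nu])

res = linprog(cost, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
pi = res.x.reshape(3, 2)                  # optimal coupling pi*
```

Here the cheapest plan sends the two left points to the left prototype and the right point to the right prototype, with total cost 1/6. For realistic sizes, a dedicated exact or entropic OT solver would replace the generic linear program.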

Since p(x, c) = p(x|c)p(c), the fitting of p(x, c) can be transformed into learning the categorical prior distribution p(c) and the generative conditional distribution p(x|c). The former models prior knowledge of the data's categorical distribution, which can be represented by a categorical distribution Cat. The latter learns the generation mechanism of the data to better reflect its probabilistic properties, which can be achieved through the generative transport plan G from the clustering prototypes to the data. Then, the categorical distribution Cat and the generative transport plan G can be updated by minimizing the transport cost between the generated data and the prototypes:

arg inf_{G, Cat} ∫_{c ∼ Cat} d(G(c), c) dν(c),

where d(⋅, ⋅) is the ground transport cost between x and c. After fitting the categorical prior distribution p(c) and the generative conditional distribution p(x|c), p(x, c) can be obtained via the Bayes rule.

4.2. The Variational Generative Solution to OTMC

Based on the variational generative solution to OT, the modality-specific OT and the modality-common OT decomposed from OTMC can be solved readily. Specifically, the solution to modality-specific OT can be achieved by solving the generative transport plan G_s^m from the categorical prior distribution Cat_s^m to the data of the m-th modality space X^m, minimizing the following ground transport cost:

MsOT = arg min_{G_s^m, Cat_s^m} ∫_{c ∼ Cat_s^m} d_s^m(G_s^m(c), c) dν_s^m(c), m = 1, …, M,

where G_s^m : Cat_s^m → X^m is the generative transport map of the m-th modality, and G_s^m generates instances from the categorical distribution, retaining the real underlying structure information under the supervision of multimodal data. The generative transport plan G_s^m and the categorical distribution Cat_s^m jointly learn the private data structure information and clustering information of the m-th modality.

In practice, the integral over the measure ν_s^m is approximated by Monte Carlo sampling; i.e., the integral of the measure ν_s^m over the categorical distribution Cat_s^m is replaced by the expectation over clustering prototypes drawn from Cat_s^m. Thus, Equation (8) induces the following objective function of modality-specific OT, L_s^m:

arg min_{G_s^m, Cat_s^m} E_{X^m} E_{Cat_s^m} [ν_s^m(c) d_s^m(G_s^m(c), c)],

where m = 1, …, M indexes the modalities, and the objective function L_s^m fully captures the inherent information of the m-th modality by jointly learning structural differences between data and semantic information within data.
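The Monte Carlo step above can be illustrated with a toy one-dimensional example; the categorical distribution, prototypes, and generative map G below are invented for the sketch and only show how sampling from Cat approximates the integral over ν:

```python
# Monte Carlo approximation of the transport-cost objective: the integral
# over the prototype measure is estimated as an average over prototypes
# sampled from the categorical distribution.
import numpy as np

rng = np.random.default_rng(42)
probs = np.array([0.25, 0.75])            # toy categorical distribution Cat
prototypes = np.array([0.0, 2.0])         # toy 1-D clustering prototypes

def G(c):
    """Toy generative transport plan: a simple scaling map."""
    return 1.5 * c

# Average the ground cost d(G(c), c) over sampled prototypes.
samples = rng.choice(len(probs), size=10_000, p=probs)
c = prototypes[samples]
mc_cost = np.mean((G(c) - c) ** 2)        # estimates E_Cat[d(G(c), c)]

# Closed-form expectation for comparison.
exact = probs @ ((G(prototypes) - prototypes) ** 2)
```

With enough samples, `mc_cost` concentrates around the exact expectation (here 0.75), which is why the integral in the objective can be replaced by an empirical average during training.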

Similarly, the optimization of modality-common OT is equivalent to solving the generative transport plan G_c from the categorical distribution to the data distribution for all modalities:

McOT = arg min_{G_c, Cat_c} ∫_{c ∼ Cat_c} d_c(G_c(c), c) dν_c(c),

where G_c : Cat_c → X is the generative transport map for all modalities, which generates data of all modalities from the common categorical distribution Cat_c to capture the common semantics of the data.

The joint optimization of the generative transport plan G_c and the categorical distribution Cat_c can simultaneously capture the data structure information of all modalities and crossmodal clustering information. The integral over the measure ν_c is approximated by the expectation over prototypes drawn from Cat_c via Monte Carlo sampling, so the objective function of modality-common OT, L_c, can be written as:

arg min_{G_c, Cat_c} E_X E_{Cat_c} [ν_c(c) d_c(G_c(c), c)].

The objective function L_c effectively preserves the common information in multimodal data by jointly optimizing the structural differences of all modalities and crossmodal semantic information.

Thus, OTMC can be optimized via the following variational solution:

L = Σ_{m=1}^M arg min_{G_s^m, Cat_s^m} E_{X^m} E_{Cat_s^m} [ν_s^m(c) d_s^m(G_s^m(c), c)] + arg min_{G_c, Cat_c} E_X E_{Cat_c} [ν_c(c) d_c(G_c(c), c)] + Σ_{m=1}^M D_KL(Cat_s^m, Cat_c),

where D_KL denotes the consistency constraint between the categorical prior distributions Cat_s^m and Cat_c, which forces MsOT and McOT to share the same transport target, so that the results of the two transport schemes tend to coincide. It is implemented as the KL-divergence penalty between Cat_s^m and Cat_c.
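The consistency term D_KL(Cat_s^m, Cat_c) can be computed directly from the two prior probability vectors; the values below are illustrative assumptions for a three-cluster case:

```python
# KL-divergence penalty between a modality-specific categorical prior and
# the modality-common categorical prior (toy three-cluster values).
import numpy as np

cat_s = np.array([0.6, 0.3, 0.1])   # modality-specific prior Cat_s^m
cat_c = np.array([0.5, 0.3, 0.2])   # modality-common prior Cat_c

# D_KL(Cat_s^m || Cat_c): zero iff the two priors agree exactly.
d_kl = np.sum(cat_s * np.log(cat_s / cat_c))
```

Minimizing this penalty pulls each modality-specific prior toward the shared prior, which is what gives MsOT and McOT the same transport target.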

4.3. Multimodal OT Clustering Network

Generative adversarial networks (GANs), although they attract much attention from researchers, involve unstable and computationally expensive adversarial training. Hence, to implement OTMC, a multimodal OT clustering network is designed within a deep neural network architecture that contains a generative network and a clustering network, instead of a GAN. The generative network implements the generative transport plan G together with the categorical distribution Cat(ϖ), and the clustering network produces the final assignment results from multimodal data.

The clustering network aims to learn the modality-specific and modality-common soft-clustering assignments included in MsOT and McOT. Specifically, in modality-specific OT, the clustering network E_s^m(⋅; θ_s^m) encodes modality-specific data into corresponding latent features, and uses Student's t-distribution in its last layer to calculate the similarity between latent features z_i^m and clustering prototypes c_j^m, yielding the soft-clustering-assignment matrix P^m as follows:

z_i^m = E_s^m(x_i^m; θ_s^m), p_ij^m = (1 + ||z_i^m − c_j^m||²/α)^{−(α+1)/2} / Σ_{j′} (1 + ||z_i^m − c_{j′}^m||²/α)^{−(α+1)/2},

where x_i^m ∈ X^m is the i-th instance from the m-th modality; c_j^m ∈ Cat^m(ϖ^m) is the j-th clustering prototype of the m-th modality; and p_ij^m can be interpreted as the probability that the i-th instance is assigned to the j-th cluster. To guide the learning of the networks, a target distribution Q^m is introduced, and the KL-divergence between Q^m and the soft-assignment matrix P^m is used as the objective function:

L_cluster^m = D_KL(Q^m, P^m),

where q_ij = (p_ij² / Σ_i p_ij) / Σ_{j′} (p_{ij′}² / Σ_i p_{ij′}) is the target assignment probability of the i-th instance to the j-th clustering prototype. This objective guides network learning by aligning the soft assignments with the target probability distribution.
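As a concrete sketch, the soft-assignment matrix P^m and the sharpened target distribution Q^m can be computed in a few lines; the latent features and prototypes below are toy values, not learned ones:

```python
# Student's-t soft assignment P and sharpened target distribution Q for
# one modality, with toy latent features and prototypes.
import numpy as np

alpha = 1.0
z = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]])   # latent features z_i
c = np.array([[0.0, 0.0], [3.0, 3.0]])               # prototypes c_j

# p_ij ∝ (1 + ||z_i − c_j||^2 / alpha)^(−(alpha+1)/2), normalized over j.
sq = ((z[:, None, :] - c[None, :, :]) ** 2).sum(-1)
kernel = (1.0 + sq / alpha) ** (-(alpha + 1.0) / 2.0)
P = kernel / kernel.sum(axis=1, keepdims=True)

# q_ij = (p_ij^2 / f_j) / sum_j' (p_ij'^2 / f_j'), with f_j = sum_i p_ij.
f = P.sum(axis=0)
num = P ** 2 / f
Q = num / num.sum(axis=1, keepdims=True)

kl = np.sum(Q * np.log(Q / P))   # KL(Q || P): the clustering loss
```

Squaring and renormalizing sharpens the assignments, so minimizing KL(Q || P) pushes each point more confidently toward its nearest prototype.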

For McOT, the objective of the clustering network E_c(⋅; θ_c) is D_KL(Q, P), where P and Q are the soft-assignment matrix and target probability distribution over all modalities, respectively. According to the consistency constraint L_con in Equation (5), each D(P^m, P) is also regarded as an objective component. Hence, the total objective of the clustering networks is

L_cluster = Σ_{m=1}^M (D(Q^m, P^m) + D(P^m, P)) + D(Q, P),

where D(⋅, ⋅) denotes the KL-divergence between two assignment matrices. The first term guides the network to learn the clustering information within each modality, the second term enforces the consistency of modality-specific and modality-common clustering information, and the last term guides the network to capture the clustering information over all modalities. Finally, the soft-assignment matrix P is used as the cluster assignment matrix.

The generative network is designed to learn the categorical distribution Cat(ϖ) and the generative transport plan G included in MsOT and McOT. Specifically, for MsOT, the generative network G_s^m(⋅; φ_s^m) decodes latent features z_i^m sampled from the categorical distribution Cat(⋅) to the corresponding data:

z_i^m = sample(Cat(ϖ^m)), x̃_i^m = G_s^m(z_i^m; φ_s^m),

where Cat(ϖ^m) is the categorical distribution of the m-th modality with parameter ϖ^m. To update the parameter φ_s^m of the generative network and the categorical distribution parameter ϖ^m, the topological reconstruction loss between the generated data and the original data is used as a guide, namely,

L_rec^m = Σ_{i=1}^N D_w(x_i^m, G_s^m(sample(Cat(ϖ^m)))),

where D_w(⋅, ⋅) represents the topological reconstruction loss based on optimal transport, measuring the differences in manifold structure between the original and reconstructed data. The network structure and loss of the generative network G_c(⋅; φ_c) of McOT are similar to those of MsOT. So, the total objective function of the generative networks can be written as follows:

L_rec = Σ_{m=1}^M Σ_{i=1}^N D_w(x_i^m, G_s^m(z_i^m)) + D_w(x_i, G_c(z_i)),

where the first term optimizes the categorical distribution and generative network within each modality, and the second term optimizes over all modalities, capturing the data structure information within and between modalities, respectively.
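The generative path can be sketched end to end for one modality: sample a prototype from the categorical prior, decode it, and score the reconstruction. Everything below is a toy assumption for illustration; in particular, a plain squared error stands in for the OT-based topological loss D_w, and the "generator" is just a fixed linear map rather than a trained network:

```python
# Toy generative path for one modality: sample from Cat(ϖ), decode with a
# linear "generator" G, and score the reconstruction against the data.
import numpy as np

rng = np.random.default_rng(0)
pi_prior = np.array([0.5, 0.5])            # categorical parameter ϖ
prototypes = np.array([[0.0, 0.0], [1.0, 1.0]])

W = np.array([[2.0, 0.0], [0.0, 2.0]])     # stand-in linear generative map G
x = np.array([[0.1, 0.0], [2.0, 2.1]])     # observed toy data

j = rng.choice(2, p=pi_prior)              # sample a prototype index
x_gen = prototypes[j] @ W                  # G(c): generated sample

# Squared-error surrogate for D_w: distance to the nearest observed datum.
loss = np.min(((x - x_gen) ** 2).sum(axis=1))
```

In OTMC this loss would back-propagate into both the generator parameters φ and the categorical parameter ϖ; here it only shows the data flow of the reconstruction objective.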

4.4. The Overall Loss

The overall loss function L of the multimodal OT clustering network is defined as follows:

L = L_rec + α L_cluster,

where α denotes the trade-off parameter balancing the topological reconstruction loss L_rec and the clustering pattern mining loss L_cluster, and L is used to optimize the multimodal OT clustering network via stochastic gradient descent, which effectively boosts the performance of multimodal clustering:

θ_{t+1} = θ_t − η_t ∇_θ L(θ_t),

where θ_t and η_t denote the parameters and the learning rate of the multimodal OT clustering network at time step t, respectively, and ∇_θ L(θ_t) denotes the gradient at time step t. The overall optimization algorithm is shown in Algorithm 1.

Algorithm 1. Multimodal OT clustering network.
Input: Multimodal dataset X ∈ X and convergence criterion thr.
Output: Soft-clustering-assignment matrix P, the parameters of the clustering networks E_s^m(⋅; θ_s^m), E_c(⋅; θ_c) and generative networks G_s^m(⋅; φ_s^m), G_c(⋅; φ_c), and the categorical distributions Cat(ϖ_s^m), Cat(ϖ_c).
Initialize: modality-specific categorical distributions Cat(ϖ_s^m), modality-common categorical distribution Cat(ϖ_c), parameters of modality-specific clustering networks E_s^m(⋅; θ_s^m), parameters of modality-common clustering network E_c(⋅; θ_c), parameters of modality-specific generative networks G_s^m(⋅; φ_s^m), and parameters of modality-common generative network G_c(⋅; φ_c).
while the convergence criterion thr is not reached by Equation (19) do
 for each m = 1, …, M do
  Sample data in the m-th modality X^m.
  Generate the m-th modality soft-assignment matrix P^m according to Equation (13).
  Update E_s^m(⋅; θ_s^m) by minimizing Equation (14).
  Sample modality-specific clustering prototypes from Cat(ϖ_s^m).
  Generate the m-th modality reconstructed data x̃_i^m according to Equation (16).
  Update G_s^m(⋅; φ_s^m) and Cat(ϖ_s^m) by minimizing Equation (17).
 end
 Sample multimodal data X.
 Generate the modality-common soft-assignment matrix P.
 Update the modality-common clustering network E_c(⋅; θ_c) by minimizing Equation (15).
 Sample modality-common clustering prototypes from Cat(ϖ_c).
 Generate multimodal reconstructed data.
 Update the modality-common generative network G_c(⋅; φ_c) by minimizing Equation (18).
 Update all network parameters by minimizing Equation (19) according to Equation (20).
end
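The control flow of Algorithm 1 (an outer loop until convergence, an inner loop over modalities, then the modality-common updates) can be sketched as follows; every name here is a placeholder for illustration, with the actual network updates replaced by caller-supplied callables:

```python
# Skeleton of Algorithm 1's alternating schedule: per-modality updates
# (modality-specific OT) followed by a shared update (modality-common OT).
def train_otmc(data, M, steps, update_specific, update_common):
    """Alternate modality-specific and modality-common updates."""
    for _ in range(steps):                # outer "while not converged" loop
        for m in range(M):                # inner loop over modalities
            update_specific(data[m], m)   # stand-in for Eqs. (13)-(17)
        update_common(data)               # stand-in for Eqs. (15), (18)-(20)

# Record the call order with dummy updates to show the schedule.
calls = []
train_otmc([["a"], ["b"]], M=2, steps=3,
           update_specific=lambda d, m: calls.append(("s", m)),
           update_common=lambda d: calls.append(("c",)))
```

Each outer iteration thus performs M modality-specific updates followed by one modality-common update, mirroring the structure of the pseudocode above.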

6. Conclusions

A multimodal clustering method based on optimal transport (OTMC) is designed in this paper to mine intrinsic patterns from the fused information of multimodal data. It utilizes a weak-notion distance to measure the differences among heterogeneous manifolds, which overcomes the fragile assumption of strict overlaps between data manifolds made by previous MC methods. In particular, multimodal clustering is defined as a transport map between multimodal data and clustering prototypes, which effectively fuses the complementary information in multimodal data. Then, the variational solution to OT is derived based on a generative transport plan, which utilizes the fused information to produce accurate clustering patterns. Afterwards, a multimodal OT clustering network is designed to realize this formulation. Finally, extensive experiments illustrate the superiority of OTMC.

The possible future directions of this research are three-fold. First, in the current OTMC, the transport plan is implemented entirely by a deep neural network, so the performance of OTMC depends heavily on how well the network fits the real transport plan, which introduces uncertainty and is a potential limitation. In the future, a variational solution with tighter bounds will be explored to reduce this dependence and achieve a more precise map for pattern mining. Second, the performance of OTMC on different types of multimodal datasets, especially those whose modalities exhibit varying degrees of distribution overlap or complementarity under feature engineering, will be examined to further verify the theoretical basis of OTMC. Third, OTMC will be applied to other multimodal learning tasks, such as graph data classification and time-series prediction.




Zheng Yang www.mdpi.com