Clustering ensemble selection: A systematic mapping study

Document Type : Review articles

Authors

1 Department of Computer Engineering, Sari Branch, Islamic Azad University, Sari, Iran

2 Department of Applied Mathematics, Sari Branch, Islamic Azad University, Sari, Iran

Abstract

Clustering has emerged as an important tool for data analysis, and combining a set of base clusterings can produce high-quality data partitions as well as a more robust and accurate consensus clustering. Unlike classification, where the labels of the data items are known in advance, unsupervised clustering has no labels to check the clusterings against, which causes uncertainty about the members of a large library. Consequently, not all of the clusterings produced are useful for the final clustering solution. To address this challenge, clustering ensemble selection (CES), which combines a selected subset of the library rather than all of its members to obtain the final result, was proposed in 2006 by Hadjitodorov et al. [18]. The goal is to select from a large library a subset that forms a smaller ensemble with higher-quality performance. CES has been found effective in improving the quality of clustering solutions. This paper conducts a systematic mapping study (SMS) to analyze and synthesize previous studies on CES techniques. To this end, 42 prominent publications from the existing literature, published from 2006 to August 2022, were selected for examination. The analysis showed that most of the articles used the normalized mutual information (NMI) measure to evaluate cluster quality, and that varying the values of the initial parameters was used more often than other methods for generating diversity. Clustering ensemble selection has not yet been applied to text data; in addition, the trade-off between diversity and quality (considering both at the same time) can be studied and evaluated in future work.
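
As a rough illustration of two ideas named in the abstract, NMI as a quality measure and the selection of a subset from a library of clusterings, the following Python sketch may help. It is not taken from any of the surveyed studies: the function names `nmi` and `select_by_mean_nmi`, the geometric-mean normalization (one of several in use), the rule of ranking partitions by their mean NMI against the rest of the library, and the toy library are all assumptions made for this example.

```python
from collections import Counter
from math import log, sqrt


def nmi(a, b):
    """Normalized mutual information between two labelings of the same items.

    Assumption: geometric-mean normalization, I(A;B) / sqrt(H(A) * H(B)).
    """
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("labelings must be non-empty and of equal length")
    count_a, count_b = Counter(a), Counter(b)
    count_ab = Counter(zip(a, b))
    # Mutual information computed from the contingency counts.
    mi = sum(
        (n_ab / n) * log(n * n_ab / (count_a[x] * count_b[y]))
        for (x, y), n_ab in count_ab.items()
    )
    h_a = -sum((c / n) * log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * log(c / n) for c in count_b.values())
    if h_a == 0.0 or h_b == 0.0:
        # A single-cluster labeling has zero entropy; treat two such
        # labelings as identical and any other pairing as unrelated.
        return 1.0 if h_a == h_b else 0.0
    return mi / sqrt(h_a * h_b)


def select_by_mean_nmi(library, subset_size):
    """Return indices of the partitions that agree most with the others.

    Assumption: mean NMI against the rest of the library serves as a
    label-free quality proxy. Diversity is deliberately ignored here.
    """
    if len(library) < 2:
        raise ValueError("the library needs at least two partitions")
    scores = []
    for i, p in enumerate(library):
        others = [nmi(p, q) for j, q in enumerate(library) if j != i]
        scores.append(sum(others) / len(others))
    ranked = sorted(range(len(library)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:subset_size])


# Hypothetical toy library: four partitions of six items.
library = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same partition as the first, relabeled
    [0, 0, 1, 1, 1, 1],  # partly agrees with the first two
    [0, 1, 0, 1, 0, 1],  # largely unrelated to the others
]
selected = select_by_mean_nmi(library, 2)
print(selected)
```

Because this rule rewards agreement only, it tends to pick near-duplicate partitions; a selection scheme that also accounts for diversity would need an additional term, which is the trade-off the abstract identifies as open.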

References

[1] D.D. Abdala, P. Wattuya, and X. Jiang, Ensemble clustering via random walker consensus strategy, 20th Int. Conf. Pattern Recogn., IEEE, 2010, pp. 1433–1436.
[2] M.T. AL-Sharuee, F. Liu, and M. Pratama, Sentiment analysis: An automatic contextual analysis and ensemble clustering approach and comparison, Data Knowledge Engin. 115 (2018), 194–213.
[3] H. Alizadeh, B. Minaei-Bidgoli, and H. Parvin, Cluster ensemble selection based on a new cluster stability measure, Intel. Data Anal. 18 (2014), no. 3, 389–408.
[4] H. Alizadeh, B. Minaei-Bidgoli, and H. Parvin, To improve the quality of cluster ensembles by selecting a subset of base clusters, J. Experimen. Theor. Artific. Intel. 26 (2014), no. 1, 127–150.
[5] J. Azimi and X. Fern, Adaptive cluster ensemble selection, Twenty-First Int. Joint Conf. Artific. Intel., 2009.
[6] L. Bai, J. Liang, H. Du, and Y. Guo, An information-theoretical framework for cluster ensemble, IEEE Trans. Knowledge Data Engin. 31 (2018), no. 8, 1464–1477.
[7] V. Berikov, Weighted ensemble of algorithms for complex data clustering, Pattern Recogn. Lett. 38 (2014), 99–106.
[8] I. Bifulco, C. Fedullo, F. Napolitano, G. Raiconi, and R. Tagliaferri, Robust clustering by aggregation and intersection methods, Int. Conf. Knowledge-Based Intel. Inf. Engin. Syst., Springer, 2008, pp. 732–739.
[9] T. Boongoen and N. Iam-On, Cluster ensembles: A survey of approaches with recent extensions and applications, Comput. Sci. Rev. 28 (2018), 1–25.
[10] D.L. Davies and D.W. Bouldin, A cluster separation measure, IEEE Trans. Pattern Anal. Machine Intel. (1979), no. 2, 224–227.
[11] U.M. Fayyad, C. Reina, and P.S. Bradley, Initialization of iterative refinement clustering algorithms, KDD, 1998, pp. 194–198.
[12] X.Z. Fern and C.E. Brodley, Random projection for high dimensional data clustering: A cluster ensemble approach, Proc. 20th Int. Conf. Machine Learn. (ICML-03), 2003, pp. 186–193.
[13] X.Z. Fern and C.E. Brodley, Solving cluster ensemble problems by bipartite graph partitioning, Proc. Twenty-First Int. Conf. Machine Learn., 2004, p. 36.
[14] X.Z. Fern and W. Lin, Cluster ensemble selection, Statist. Anal. Data Min.: ASA Data Sci. J. 1 (2008), no. 3, 128–141.
[15] A.L.N. Fred and A.K. Jain, Combining multiple clusterings using evidence accumulation, IEEE Trans. Pattern Anal. Machine Intel. 27 (2005), no. 6, 835–850.
[16] J. Ghosh and A. Acharya, Cluster ensembles, Wiley Interdiscip. Rev.: Data Min. Knowledge Disc. 1 (2011), no. 4, 305–315.
[17] J. Ghosh, A. Strehl, and S. Merugu, A consensus framework for integrating distributed clusterings under limited knowledge sharing, Proc. NSF Workshop on Next Generation Data Mining, Citeseer, 2002, pp. 99–108.
[18] S.T. Hadjitodorov, L.I. Kuncheva, and L.P. Todorova, Moderate diversity for better cluster ensembles, Information Fusion 7 (2006), no. 3, 264–275.
[19] Y. Hong, S. Kwong, Y. Chang, and Q. Ren, Unsupervised feature selection using clustering ensembles and population based incremental learning algorithm, Pattern Recogn. 41 (2008), no. 9, 2742–2756.
[20] X. Hu, Integration of cluster ensemble and text summarization for gene expression analysis, Proc. Fourth IEEE Symp. Bioinf. Bioengin., IEEE, 2004, pp. 251–258.
[21] X. Hu and I. Yoo, Cluster ensemble and its applications in gene expression analysis, Proc. Second Conf. Asia-Pacific Bioinf. Volume 29, 2004, pp. 297–302.
[22] D. Huang, J. Lai, and C.-D. Wang, Ensemble clustering using factor graph, Pattern Recogn. 50 (2016), 131–142.
[23] D. Huang, C.-D. Wang, and J.-H. Lai, Locally weighted ensemble clustering, IEEE Trans. Cybernet. 48 (2017), no. 5, 1460–1473.
[24] D. Huang, C.-D. Wang, and J.-H. Lai, Lwmc: A locally weighted meta-clustering algorithm for ensemble clustering, Int. Conf. Neural Inf. Process., Springer, 2017, pp. 167–176.
[25] L. Hubert and P. Arabie, Comparing partitions, J. Classific. 2 (1985), 193–218.
[26] A.K. Jain and R.C. Dubes, Algorithms for clustering data, Prentice-Hall, Inc., 1988.
[27] A.K. Jain, M.N. Murty, and P.J. Flynn, Data clustering: A review, ACM Comput. Surv. (CSUR) 31 (1999), no. 3, 264–323.  
[28] V. Kandylas, S. Upham, and L.H. Ungar, Finding cohesive clusters for analyzing knowledge communities, Knowledge Inf. Syst. 17 (2008), no. 3, 335–354.
[29] L. Kaufman and P.J. Rousseeuw, Finding groups in data: An introduction to cluster analysis, John Wiley & Sons, 2009.
[30] H. Khalili, M. Rabbani, and E. Akbari, Clustering ensemble selection based on the extended Jaccard measure, Turk. J. Electric. Engin. Comput. Sci. 29 (2021), no. 4, 2215–2231.
[31] B. King, Step-wise clustering procedures, J. Amer. Statist. Assoc. 62 (1967), no. 317, 86–101.
[32] H.W. Kuhn, The Hungarian method for the assignment problem, Naval Res. Logistics Quart. 2 (1955), no. 1-2, 83–97.
[33] B. Larsen and C. Aone, Fast and effective text mining using linear-time document clustering, Proc. Fifth ACM SIGKDD Int. Conf. Knowledge Disc. Data Min., 1999, pp. 16–22.
[34] T. Li and C. Ding, Weighted consensus clustering, Proc. SIAM Int. Conf. Data Min., SIAM, 2008, pp. 798–809.
[35] X. Lu, Y. Yang, and H. Wang, Selective clustering ensemble based on covariance, Int. Workshop Multiple Classifier Syst., Springer, 2013, pp. 179–189.
[36] J. MacQueen, Some methods for classification and analysis of multivariate observations, 5th Berkeley Symp. Math. Statist. Probability, 1967, pp. 281–297.
[37] S. Mimaroglu and M. Yagci, Clicom: Cliques for combining multiple clusterings, Expert Syst. Appl. 39 (2012), no. 2, 1889–1901.
[38] B. Minaei-Bidgoli, A. Topchy, and W.F. Punch, Ensembles of partitions via data resampling, Int. Conf. Inf. Technol.: Cod. Comput., 2004. Proc. ITCC 2004., vol. 2, IEEE, 2004, pp. 188–192.
[39] S. Nejatian, H. Parvin, and E. Faraji, Using sub-sampling and ensemble clustering techniques to improve performance of imbalanced classification, Neurocomput. 276 (2018), 55–66.
[40] C.F. Olson, Parallel algorithms for hierarchical clustering, Parallel Comput. 21 (1995), no. 8, 1313–1325.
[41] Y. Ren, C. Domeniconi, G. Zhang, and G. Yu, Weighted-object ensemble clustering: Methods and analysis, Knowledge Inf. Syst. 51 (2017), no. 2, 661–689.
[42] N.C. Sandes and A.L.V. Coelho, Clustering ensembles: A hedonic game theoretical approach, Pattern Recogn. 81 (2018), 95–111.
[43] C.P. Santos, D.M. Carvalho, and M.C.V. Nascimento, A consensus graph clustering algorithm for directed networks, Expert Syst. Appl. 54 (2016), 121–135.
[44] A.J.C. Sharkey, Combining artificial neural nets: Ensemble and modular multi-net systems, Springer Science & Business Media, 2012.
[45] J. Shi and J. Malik, Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Machine Intel. 22 (2000), no. 8, 888–905.
[46] R. Sibson, Slink: An optimally efficient algorithm for the single-link cluster method, Comput. J. 16 (1973), no. 1, 30–34.
[47] A. Strehl and J. Ghosh, Cluster ensembles—a knowledge reuse framework for combining multiple partitions, J. Machine Learn. Res. 3 (2002), no. Dec, 583–617.
[48] A. Topchy, A.K. Jain, and W. Punch, Combining multiple weak clusterings, Third IEEE Int. Conf. Data Min., IEEE, 2003, pp. 331–338.
[49] A. Topchy, A.K. Jain, and W. Punch, A mixture model for clustering ensembles, Proc. SIAM Int. Conf. Data Min., SIAM, 2004, pp. 379–390.
[50] A. Topchy, A.K. Jain, and W. Punch, Clustering ensembles: Models of consensus and weak partitions, IEEE Trans. Pattern Anal. Machine Intel. 27 (2005), no. 12, 1866–1881.
[51] S. Vega-Pons and J. Ruiz-Shulcloper, A survey of clustering ensemble algorithms, Int. J. Pattern Recogn. Artific. Intel. 25 (2011), no. 3, 337–372.
[52] X. Wang, D. Han, and C. Han, Rough set based cluster ensemble selection, Proc. 16th Int. Conf. Inf. Fusion, IEEE, 2013, pp. 438–444.
[53] J. Wu, H. Liu, H. Xiong, J. Cao, and J. Chen, K-means-based consensus clustering: A unified view, IEEE Trans. Knowledge Data Engin. 27 (2014), no. 1, 155–169.
[54] X. Wu, T. Ma, J. Cao, Y. Tian, and A. Alabdulkarim, A comparative study of clustering ensemble algorithms, Comput. Electric. Engin. 68 (2018), 603–615.
[55] F. Yang, X. Li, Q. Li, and T. Li, Exploring the diversity in cluster ensemble generation: Random sampling and random projection, Expert Syst. Appl. 41 (2014), no. 10, 4844–4866.
[56] Z. Yu, L. Li, J. Liu, J. Zhang, and G. Han, Adaptive noise immune cluster ensemble using affinity propagation, IEEE Trans. Knowledge Data Engin. 27 (2015), no. 12, 3176–3189.
[57] Z. Yu and H.-S. Wong, Class discovery from gene expression data based on perturbation and cluster ensemble, IEEE Trans. Nanobiosci. 8 (2009), no. 2, 147–160.
[58] X. Zhao, F. Cao, and J. Liang, A sequential ensemble clusterings generation algorithm for mixed data, Appl. Math. Comput. 335 (2018), 264–277.
[59] L. Zheng, T. Li, and C. Ding, A framework for hierarchical ensemble clustering, ACM Trans. Knowledge Disc. From Data (TKDD) 9 (2014), no. 2, 1–23.
[60] S. Zhong and J. Ghosh, A comparative study of generative models for document clustering, Proc. Workshop Cluster. High Dimens. Data Appl. SIAM Data Min. Conf., Citeseer, 2003.
Volume 14, Issue 9
September 2023
Pages 209-240
  • Receive Date: 02 October 2022
  • Revise Date: 09 December 2022
  • Accept Date: 19 December 2022