Feature selection method based on clustering technique and optimization algorithm

Document Type : Research Paper

Authors

Department of Computer Engineering, Yasuj Branch, Islamic Azad University, Yasuj, Iran

Abstract

High-dimensional data, despite the opportunities it creates, poses many computational challenges. One problem is that, in most cases, not all features of the data are relevant to the knowledge hidden within it, and such features can degrade the performance of a classification system. Feature selection is an important technique for overcoming this problem: during the feature selection process, a subset of the original features is chosen by removing irrelevant and redundant features. This article presents a hierarchical, wrapper-based algorithm that selects effective features by exploiting the relationships between features together with a clustering technique. The new method, named GCPSO, builds on the particle swarm optimization algorithm and selects suitable features using feature clustering. The feature clustering approach presented here differs from previous algorithms: instead of traditional clustering models, the final clusters are formed from the graph structure of the features and the relationships between them. The proposed method is evaluated on datasets from the UCI repository, chosen for the variety of their characteristics, and its efficiency is compared with wrapper-based feature selection methods that employ evolutionary algorithms in the search process. The results indicate that, on all data sets, the proposed method performs well in terms of both the optimality of the selected subset and classification accuracy in comparison with the other methods.
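The pipeline the abstract describes — build a feature graph from pairwise relationships, cluster it, then select a reduced subset — can be sketched roughly as follows. This is only an illustration, not the paper's method: the correlation threshold, the use of connected components in place of the paper's graph clustering, and the relevance criterion for choosing a cluster representative are all assumptions, and the PSO search stage is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 8 features, two of which are near-copies of others.
n, d = 200, 8
X = rng.normal(size=(n, d))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)   # redundant with feature 0
X[:, 4] = X[:, 3] + 0.05 * rng.normal(size=n)   # redundant with feature 3
y = (X[:, 0] + X[:, 3] > 0).astype(int)

# Step 1: feature graph -- nodes are features, edges join highly
# correlated (redundant) pairs. The 0.8 threshold is an assumption.
corr = np.abs(np.corrcoef(X, rowvar=False))
adj = (corr > 0.8) & ~np.eye(d, dtype=bool)

# Step 2: clusters = connected components of the feature graph
# (a stand-in for whatever graph clustering the paper actually uses).
def components(adj):
    seen, comps = set(), []
    for s in range(len(adj)):
        if s in seen:
            continue
        stack, comp = [s], []
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.append(int(u))
            stack.extend(np.flatnonzero(adj[u]))
        comps.append(sorted(comp))
    return comps

clusters = components(adj)

# Step 3: keep one representative per cluster -- here the feature most
# correlated with the class label (a simple filter-style criterion).
rel = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(d)])
selected = sorted(max(c, key=lambda j: rel[j]) for c in clusters)
print(clusters)   # the two redundant pairs each collapse into one cluster
print(selected)   # one representative per cluster
```

In a wrapper method such as GCPSO, step 3 would instead be a PSO search over candidate subsets, with a classifier's accuracy as the fitness function; the clustering stage serves to shrink the redundant search space beforehand.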

Keywords

References

[1] M. Abdel-Basset, D. El-Shahat, I. El-Henawy, V.H.C. De Albuquerque, and S. Mirjalili, A new fusion of grey wolf optimizer algorithm with a two-phase mutation for feature selection, Expert Syst. Appl. 139 (2020), 112824.
[2] M.H. Aghdam, N. Ghasem-Aghaee, and M.E. Basiri, Text feature selection using ant colony optimization, Expert Syst. Appl. 36 (2009), no. 3, 6843–6853.
[3] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, and A.J. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Nat. Acad. Sci. USA 96 (1999), 6745–6750.
[4] F. Amini and G. Hu, A two-layer feature selection method using Genetic Algorithm and Elastic Net, Expert Syst. Appl. 166 (2021), 114072.
[5] A. Asuncion and D. Newman, UCI repository of machine learning datasets, available from: http://archive.ics.uci.edu/ml/datasets.php, 2007.
[6] S.R. Bandela and T.K. Kumar, Unsupervised feature selection and NMF de-noising for robust Speech Emotion Recognition, Appl. Acoustics 172 (2021), 107645.
[7] S. Bandyopadhyay, T. Bhadra, P. Mitra, and U. Maulik, Integration of dense subgraph finding with feature clustering for unsupervised feature selection, Pattern Recog. Lett. 40 (2014), 104–112.
[8] V.D. Blondel, J.L. Guillaume, R. Lambiotte, and E. Lefebvre, Fast unfolding of communities in large networks, J. Stat. Mech.: Theory Exp. 2008 (2008), no. 10, P10008.
[9] J.M. Cadenas, M.C. Garrido, and R. Martínez, Feature subset selection Filter-Wrapper based on low quality data, Expert Syst. Appl. 40 (2013), no. 16, 6241–6252.
[10] C. Lai, M. Reinders, and L. Wessels, Random subspace method for multivariate feature selection, Pattern Recog. Lett. 27 (2006), no. 10, 1067–1076.
[11] G. Chandrashekar and F. Sahin, A survey on feature selection methods, Comput. Electric. Engin. 40 (2014), no. 1, 16–28.
[12] A.K. Farahat, A. Ghodsi, and M.S. Kamel, Efficient greedy feature selection for unsupervised learning, Knowledge Inf. Syst. 35 (2013), no. 2, 285–310.
[13] I. Guyon and A.E. Elisseeff, An introduction to variable and feature selection, J. Machine Learn. Res. 3 (2003), 1157–1182.
[14] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Gene selection for cancer classification using support vector machines, Machine Learn. 46 (2002), no. 1, 389–422.
[15] E. Hancer, A new multi-objective differential evolution approach for simultaneous clustering and feature selection, Engin. Appl. Artific. Intell. 87 (2020), 103307.
[16] S.M. Hazrati Fard, A. Hamzeh, and S. Hashemi, Using reinforcement learning to find an optimal set of features, Comput. Math. Appl. 66 (2013), no. 10, 1892–1904.
[17] H. Liu and L. Yu, Toward integrating feature selection algorithms for classification and clustering, IEEE Trans. Knowledge Data Engin. 17 (2005), no. 4, 491–502.
[18] Y. Liu and Y.F. Zheng, FS-SFS: A novel feature selection method for support vector machines, Pattern Recog. 39 (2006), no. 7, 1333–1345.
[19] J. Kennedy and R. Eberhart, Particle swarm optimization, Proc. ICNN’95-Int. Conf. Neural Networks, IEEE, 1995, pp. 1942–1948.
[20] J. Kim, F.J. Kohout, N.H. Nie, C.H. Hull, J.G. Jenkins, K. Steinbrenner, and D.H. Bent, Statistical Package for the Social Sciences, McGraw Hill, New York NY, 1975.
[21] N. Maleki, Y. Zeinali, and S.T.A. Niaki, A k-NN method for lung cancer prognosis with the use of a genetic algorithm for feature selection, Expert Syst. Appl. 164 (2021), 113981.
[22] P. Nimbalkar and D. Kshirsagar, Feature selection for intrusion detection system in Internet-of-Things (IoT), ICT Express 7 (2021), no. 2, 177–181.
[23] M. Paniri, M.B. Dowlatshahi, and H. Nezamabadi-Pour, MLACO: A multi-label feature selection algorithm based on ant colony optimization, Knowledge-Based Syst. 192 (2020), 105285.
[24] R. Pascual-Marqui, D. Lehmann, K. Kochi, T. Kinoshita, and N. Yamada, A measure of association between vectors based on “similarity covariance”, arXiv:1301.4291 [stat.ME], 2013.
[25] Q. Gu, Z. Li, and J. Han, Generalized Fisher score for feature selection, Proc. Int. Conf. Uncertainty Artific. Intell., 2011.
[26] L.E. Raileanu and K. Stoffel, Theoretical comparison between the Gini index and information gain criteria, Ann. Math. Artif. Intell. 41 (2004), 77–93.
[27] M. Rostami, K. Berahmand, and S. Forouzandeh, A novel community detection based genetic algorithm for feature selection, J. Big Data 8 (2021), no. 1, 1–27.
[28] M. Rostami, K. Berahmand, E. Nasiri, and S. Forouzandeh, Review of swarm intelligence-based feature selection methods, Engin. Appl. Artific. Intell. 100 (2021), 104210.
[29] R. Ruiz, J.C. Riquelme, J.S. Aguilar-Ruiz, and M. García-Torres, Fast feature selection aimed at high-dimensional data via hybrid-sequential-ranked searches, Expert Syst. Appl. 39 (2012), 11094–11102.
[30] Y. Saeys, I. Inza, and P. Larranaga, A review of feature selection techniques in bioinformatics, Bioinformatics 23 (2007), no. 19, 2507–2517.
[31] M. Sharif, J. Amin, M. Raza, M. Yasmin, and S.C. Satapathy, An integrated design of particle swarm optimization (PSO) with fusion of features for detection of brain tumor, Pattern Recog. Lett. 129 (2020), 150–157.
[32] C. Shi, Z. Gu, C. Duan, and Q. Tian, Multi-view adaptive semi-supervised feature selection with the self-paced learning, Signal Process. 168 (2020), 107332.
[33] Q. Song, J. Ni, and G. Wang, A fast clustering-based feature subset selection algorithm for high-dimensional data, IEEE Trans. Knowledge Data Engin. 25 (2013), no. 1, 1–14.
[34] X. Sun, Y. Liu, J. Li, J. Zhu, H. Chen, and X. Liu, Feature evaluation and selection with cooperative game theory, Pattern Recog. 45 (2012), no. 8, 2992–3002.
[35] S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, Oxford, 2008.
[36] S. Theodoridis and K. Koutroumbas, Pattern Recognition, 4th Edn, Elsevier Inc, 2009.
[37] D. Wang, Z. Zhang, R. Bai, and Y. Mao, A hybrid system with filter approach and multiple population genetic algorithm for feature selection in credit scoring, J. Comput. Appl. Math. 329 (2018), 307–321.
[38] X. He, D. Cai, and P. Niyogi, Laplacian score for feature selection, Adv. Neural Inf. Process. Syst. 18 (2005), 507–514.
[39] Y. Yang, Z. Ma, A.G. Hauptmann, and N. Sebe, Feature selection for multimedia analysis by sharing information among multiple tasks, IEEE Trans. Multimedia 15 (2012), no. 3, 661–669.
[40] S. Yildirim, Y. Kaya, and F. Kılıc, A modified feature selection method based on metaheuristic algorithms for speech emotion recognition, Appl. Acoustics 173 (2021), 107721.
[41] Y. Zhang, D. Gong, X. Gao, T. Tian, and X. Sun, Binary differential evolution with self-learning for multi-objective feature selection, Inf. Sci. 507 (2020), 67–85.
[42] Z. Zhang and E.R. Hancock, Hypergraph based information-theoretic feature selection, Pattern Recog. Lett. 33 (2012), no. 15, 1991–1999.
[43] Y. Zhou, W. Zhang, J. Kang, X. Zhang, and X. Wang, A problem-specific non-dominated sorting genetic algorithm for supervised feature selection, Inf. Sci. 547 (2021), 841–859.
Volume 15, Issue 9
September 2024
Pages 271-287
  • Receive Date: 10 February 2023
  • Revise Date: 01 May 2023
  • Accept Date: 01 June 2023