Outlier detection in test samples and supervised training set selection

Document Type: Research Paper

Authors

1 Department of Computer Engineering, Babol Branch, Islamic Azad University, Babol, Iran

2 Department of Computer Engineering, Sari Branch, Islamic Azad University, Sari, Iran

Abstract

Outlier detection is a technique for recognizing samples that lie outside the main population of a data set. Outliers have a negative impact on classification, so detected outliers are generally deleted to improve classification power. This paper proposes a method for outlier detection in test samples together with supervised training set selection. Training set selection is based on the intersection of three well-known similarity measures: Jaccard, cosine, and Dice. Each test sample is evaluated against the selected training set for possible outlier detection. The selected training set is then used for a two-stage classification. The accuracy of the classifiers increases after outlier deletion. Majority voting is used to further improve the classifiers.
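As a rough illustration of the selection idea described in the abstract, the sketch below computes the Jaccard, cosine, and Dice similarities on sets of features and keeps only the training samples that all three measures rank among the most similar to a query. The abstract does not give the selection rule, so the top-k cut-off, the set-based encoding, and the function names (`top_k`, `select_by_intersection`, `majority_vote`) are assumptions made for this example, not the authors' procedure.

```python
from collections import Counter
from math import sqrt


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a: set, b: set) -> float:
    """2|A ∩ B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine(a: set, b: set) -> float:
    """|A ∩ B| / sqrt(|A| * |B|): cosine of two binary indicator vectors."""
    denom = sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


def top_k(query, train, measure, k):
    """Indices of the k training samples most similar to `query` (assumed rule)."""
    ranked = sorted(range(len(train)),
                    key=lambda i: measure(query, train[i]),
                    reverse=True)
    return set(ranked[:k])


def select_by_intersection(query, train, k):
    """Keep training indices that all three measures place in their top k."""
    return (top_k(query, train, jaccard, k)
            & top_k(query, train, cosine, k)
            & top_k(query, train, dice, k))


def majority_vote(labels):
    """Most frequent label; ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical toy data: each sample is a set of feature names.
train = [{"a", "b", "c"}, {"a", "b"}, {"x", "y"}, {"a", "c", "d"}]
query = {"a", "b", "c"}
selected = select_by_intersection(query, train, k=2)
```

On set-valued data, Dice is a monotone function of Jaccard (D = 2J / (1 + J)), so those two measures always produce the same ranking; in this simplified setting the intersection only narrows the selection where cosine disagrees with them. `majority_vote` shows how the outputs of several classifiers could be combined, for example `majority_vote(["spam", "ham", "spam"])`.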

Volume 12, Issue 1
May 2021
Pages 701-712
  • Receive Date: 26 July 2020
  • Revise Date: 08 December 2020
  • Accept Date: 11 January 2021