Using some methods to estimate the parameters of the Multivariate Skew Normal (MSN) distribution function with missing data

Document Type : Research Paper

Authors

University of Baghdad, College of Administration & Economics, Statistics Department, Iraq

Abstract

Neglecting missing values when estimating statistical parameters from multivariate data wastes information and, in turn, leads to inaccurate estimates. The missing values must therefore be estimated with a suitable statistical method before accurate parameter estimates can be obtained.
Missing values are among the most common and most important problems that researchers encounter. For the multivariate skew normal (MSN) distribution, leaving this problem untreated leads to weak and misleading conclusions, so it must be addressed to obtain efficient and convincing results. The aim of this paper is to estimate the missing values for the multivariate skew normal distribution using K-nearest neighbors (KNN) imputation. After the missing values are imputed, the parameters are estimated using a Genetic Algorithm (GA). A Bayesian approach is also used to estimate both the missing values and the parameters. In a simulation study with sample sizes of 400, 600, and 800, the two methods are compared by their Mean Squared Error (MSE). The GA applied to KNN-imputed data proved better and more efficient than the Bayesian approach in the simulation results.
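The KNN imputation step and an MSE check can be illustrated with a minimal sketch. This is not the paper's procedure: the skewness values in `delta`, the identity correlation structure, the 10% missing rate, `k=5`, and the helper names `simulate_msn` and `knn_impute` are all assumptions made for illustration. The MSE below is computed on the imputed cells only, whereas the paper reports the MSE of parameter estimates, and the GA and Bayesian steps are omitted.

```python
import numpy as np

def simulate_msn(n, delta, rng):
    """Draw n rows with skew-normal marginals using a shared half-normal factor.

    Assumption: the normal part has identity correlation; delta_j in (-1, 1).
    """
    delta = np.asarray(delta, dtype=float)
    u0 = np.abs(rng.standard_normal((n, 1)))      # shared half-normal factor
    u = rng.standard_normal((n, delta.size))      # independent normal part
    return delta * u0 + np.sqrt(1.0 - delta**2) * u

def knn_impute(x, k=5):
    """Fill NaNs in each row with the mean of its k nearest complete rows."""
    x = np.asarray(x, dtype=float)
    filled = x.copy()
    complete = x[~np.isnan(x).any(axis=1)]        # donor pool: fully observed rows
    for i in np.flatnonzero(np.isnan(x).any(axis=1)):
        miss = np.isnan(x[i])
        obs = ~miss
        if not obs.any():
            # Nothing observed in this row: fall back to donor column means.
            filled[i] = complete.mean(axis=0)
            continue
        # Euclidean distance over the coordinates observed in row i.
        dist = np.sqrt(((complete[:, obs] - x[i, obs]) ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:k]
        filled[i, miss] = complete[nearest][:, miss].mean(axis=0)
    return filled

rng = np.random.default_rng(0)
x_true = simulate_msn(400, delta=[0.8, 0.5, -0.6], rng=rng)  # n = 400 as in the study
mask = rng.random(x_true.shape) < 0.10            # assumed 10% of cells missing at random
x_miss = np.where(mask, np.nan, x_true)
x_imp = knn_impute(x_miss, k=5)

# MSE of the imputed cells against the simulated truth.
mse = np.mean((x_imp[mask] - x_true[mask]) ** 2)
print(f"imputation MSE over {mask.sum()} missing cells: {mse:.4f}")
```

Repeating the run for n = 600 and n = 800 would mirror the sample sizes in the abstract, although a comparison in the paper's sense would also need the parameter estimation step on the imputed data.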

Volume 13, Issue 1
March 2022
Pages 2333-2350
  • Received Date: 09 May 2021
  • Accepted Date: 27 October 2021