Using alignment-free methods as preprocessing stage to classification whole genomes

Document Type: Research Paper


1 Computer Department, Science College for Women, University of Babylon, Babylon, Iraq

2 College of Information Technology, University of Babylon, Babylon, Iraq


In bioinformatics, the study of genetics is an active research area. Many bioinformatics systems depend on measuring the similarity between biological data, typically DNA sequences or raw sequencing reads. Several methods are available at the preprocessing stage for measuring similarity between sequences. The most widely used are alignment-based and alignment-free methods, which are applied to determine the degree of functional matching between sequences of DNA nucleotides, ribosomal RNA, or proteins. Alignment-based methods pose a major challenge in terms of computational complexity and lengthen the search for a match, especially when the data are heterogeneous and very large; as a result, classification accuracy decreases in the post-processing stage. Alignment-free methods overcome these challenges when measuring the distance between sequences. The dataset used consists of 1000 genomes obtained from the National Center for Biotechnology Information (NCBI); after eliminating missing and irrelevant values, 860 genomes remain. These genomes are segmented into words by k-mer analysis, after which the frequency of each word is counted for each query. The size of a word depends on the value of k. In this paper we used k = 3, ..., 8, and in each iteration the frequency of every word is counted.
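
The k-mer counting step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `kmer_counts` and `kmer_vector`, the toy sequence, and the choice to skip words containing ambiguous bases (such as N) are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count overlapping words of length k (k-mers) in a DNA sequence."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        # Assumption of this sketch: skip words with ambiguous bases (e.g. N).
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def kmer_vector(sequence: str, k: int) -> list[int]:
    """Return a fixed-length count vector over all 4**k possible words,
    ordered lexicographically over the alphabet A, C, G, T."""
    counts = kmer_counts(sequence, k)
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


if __name__ == "__main__":
    toy_genome = "ATGCGATGCA"  # hypothetical toy sequence, not from the dataset
    for k in range(3, 9):  # k = 3, ..., 8 as stated in the abstract
        vec = kmer_vector(toy_genome, k)
        print(f"k={k}: vector length={len(vec)}, words counted={sum(vec)}")
```

Each genome then yields one count vector per value of k, which could serve as the feature input to a classifier. The vector length grows as 4^k (64 entries for k = 3, 65,536 for k = 8), so a sparse representation may be preferable for larger k.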


Volume 12, Issue 2
November 2021
Pages 1531-1539
  • Received Date: 20 March 2021
  • Revised Date: 14 April 2021
  • Accepted Date: 18 May 2021