Using alignment-free methods as preprocessing stage to classification whole genomes

Document Type : Research Paper


1 Computer Department, Science College for Women, University of Babylon, Babylon, Iraq

2 College of Information Technology, University of Babylon, Babylon, Iraq


In bioinformatics, the study of genetics is a popular research discipline. Bioinformatics systems depend on measuring the similarity between biological data, which consist of DNA sequences or raw sequencing reads. In the preprocessing stage, several methods are available for measuring the similarity between sequences. The most popular are alignment-based and alignment-free methods, which are applied to determine the degree of functional matching between sequences of DNA nucleotides, ribosomal RNA, or proteins. Alignment-based methods pose a great challenge in terms of computational complexity and also lengthen the time needed to search for a match, especially when the data are heterogeneous and very large; as a result, classification accuracy decreases in the post-processing stage. Alignment-free methods overcome these challenges when measuring the distance between sequences. The dataset consists of 1000 genomes obtained from the National Center for Biotechnology Information (NCBI). After eliminating missing and irrelevant values, 860 genomes remain, ready to be segmented into words by k-mer analysis, after which the frequency of each word is counted for each query. The size of a word depends on the value of k. In this paper we used k = 3, …, 8; in each iteration, the frequency of each word is counted.
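
The k-mer counting step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `kmer_counts` and `kmer_profiles` are hypothetical, and the sketch assumes overlapping words with a sliding step of one and that words containing symbols other than A, C, G, or T are skipped, neither of which is stated in the abstract.

```python
from collections import Counter

VALID_BASES = set("ACGT")

def kmer_counts(sequence, k):
    """Count the overlapping words of length k in one sequence.

    Assumption: words containing anything other than A, C, G, T
    (for example the ambiguity code N) are skipped.
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= VALID_BASES:
            counts[word] += 1
    return counts

def kmer_profiles(sequence, k_values=range(3, 9)):
    """Build one frequency table per word size, here k = 3 up to 8."""
    return {k: kmer_counts(sequence, k) for k in k_values}

# Toy sequence for illustration only; a real genome would be read from a file.
profiles = kmer_profiles("ACGTACGTAC")
print(profiles[3])
```

Each genome then yields one frequency table per value of k, and such tables could serve as feature vectors for a distance measure or a classifier.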