DNA barcoding using particle swarm optimization on Apache Spark SQL case study: DNA of COVID-19

Document Type : Research Paper


1 Department of Computer Science Education, Universitas Pendidikan Indonesia, Indonesia

2 Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA Cawangan Melaka Kampus Jasin, Melaka, Malaysia


The objective of this research is to design and implement a computational model that determines DNA barcodes using the Particle Swarm Optimization (PSO) algorithm on two Big Data platforms, Apache Hadoop and Apache Spark. The model has four steps: (i) loading DNA sequences into the Hadoop Distributed File System (HDFS) in Apache Hadoop, (ii) pre-processing the data, (iii) running PSO as a User Defined Function (UDF) in Apache Spark, and (iv) collecting the results and saving them to HDFS. The model was then evaluated in two simulation scenarios: the first uses 4 cores and several worker nodes, while the second uses a cluster with 2 worker nodes and several cores. In terms of computational time, the results show that the big data platform is significantly faster than the standalone implementation in both scenarios. The study shows that the computational model built on the big data platform adds features to, and runs faster than, the approach used in previous research.
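Step (iii) depends on the PSO update loop, which the following minimal sketch illustrates in plain Python with no Spark dependency. It is not the authors' implementation: the function names `pso` and `sphere`, the parameter values (`w`, `c1`, `c2`, swarm size, iteration count), and the toy objective are all assumptions chosen for illustration, and the fitness function the study uses for DNA barcodes is not given in this abstract.

```python
import random


def pso(objective, dim, bounds, n_particles=20, n_iter=100,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` with a basic global-best PSO (illustrative only)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    lo, hi = bounds

    # Random initial positions inside the bounds, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]

    # Each particle remembers its own best position; the swarm shares one global best.
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity = inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move the particle and clamp it to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Toy stand-in objective (assumption): sum of squares, minimum at the origin."""
    return sum(v * v for v in x)


best_pos, best_val = pso(sphere, dim=3, bounds=(-5.0, 5.0))
print(best_pos, best_val)
```

In a Spark setting, a function of this kind could be wrapped as a UDF so that each row or partition of sequence data is optimized independently; that wiring is omitted here because it depends on the cluster setup and data schema.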


Volume 12, Special Issue
December 2021
Pages 1561-1572
  • Receive Date: 16 August 2021
  • Revise Date: 21 September 2021
  • Accept Date: 03 November 2021