Predicting with the quantify intensities of transcription factor-target genes binding using random forest technique

Document Type: Research Paper


University of Babylon, Hilla, Iraq


Rapid advances in technology led to the emergence of microarray technology, which allows researchers to observe the expression levels of thousands of genes simultaneously in a single experiment. These advances also produced powerful tools for identifying interactions between target genes and regulatory factors. The main aim of this study is to build models that predict the relationship (interaction) between transcription factor (TF) proteins and target genes by selecting a subset of important (relevant) genes from the original dataset. The proposed methodology comprises three major stages: gene selection, dataset merging, and prediction. The computational space of the gene data is reduced with a proposed mutual information method that selects genes based on gene expression data. In the prediction stage, regression techniques are used to predict the binding rate between a single TF and a single target gene. The efficiency of two regression techniques is compared: Linear Regression and Random Forest Regression. Two available datasets are used to achieve the objectives of this study: the gene expression data of the Yeast Cell Cycle dataset and a Transcription Factors dataset. Prediction performance is evaluated with two measures, Root Mean Square Error (RMSE) and Mean Absolute Error (MAE), under 10-fold cross-validation.
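The evaluation protocol named in the abstract (RMSE and MAE under 10-fold cross-validation) can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration: the synthetic `expr` and `binding` values stand in for the real expression and TF binding data, the one-feature least-squares line is only a stand-in for the paper's Linear Regression and Random Forest models, and averaging the per-fold scores is one common convention that the abstract does not specify.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root Mean Square Error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean Absolute Error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: y is approximated by a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    return a, my - a * mx


def cross_validate(xs, ys, k=10, seed=0):
    """Return (mean RMSE, mean MAE) over k held-out folds."""
    rmses, maes = [], []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [a * xs[i] + b for i in fold]
        truth = [ys[i] for i in fold]
        rmses.append(rmse(truth, preds))
        maes.append(mae(truth, preds))
    return sum(rmses) / k, sum(maes) / k


# Hypothetical synthetic data: one expression feature and a noisy binding score.
rng = random.Random(42)
expr = [rng.uniform(-2.0, 2.0) for _ in range(200)]
binding = [0.8 * x + 0.1 + rng.gauss(0.0, 0.2) for x in expr]

mean_rmse, mean_mae = cross_validate(expr, binding, k=10)
print(f"10-fold RMSE: {mean_rmse:.3f}, MAE: {mean_mae:.3f}")
```

Because RMSE squares the residuals before averaging, it is never smaller than MAE on the same predictions; a wide gap between the two suggests that a few large errors dominate.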


Volume 12, Issue 2
November 2021
Pages 145-161
  • Receive Date: 04 August 2020
  • Revise Date: 21 September 2020
  • Accept Date: 29 September 2020