GA SVM: A genetic algorithm for improving gene regulatory activity prediction

Dong Do Duc, Tri-Thanh Le, Trung-Nghia Vu, Huy Q Dinh, Hoang Xuan Huan

∗ Institute of Information Technology, Vietnam National University, Hanoi, 144 Xuan Thuy, Hanoi, Vietnam
† Department of Information Technology, Vietnam Maritime University, Hai Phong, Vietnam
‡ Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
§ Center for Integrative Bioinformatics, Max F Perutz Laboratories, Vienna, Dr Bohrgasse 9, 1030 Vienna, Austria and Gregor Mendel Institute of Molecular Plant Biology, Vienna, Austrian Academy of Sciences, Dr Bohrgasse 3, 1030 Vienna, Austria
¶ University of Technology (UET), Vietnam National University, Hanoi, 144 Xuan Thuy, Hanoi
Email: {dongdoduc, huanhx}@vnu.edu.vn, thanhletri@vimaru.edu.vn, TrungNghia.Vu@ua.ac.be, huy.dinh@univie.ac.at

Abstract—Predicting gene regulatory activity is an important step toward understanding the factors that govern gene regulation. Recent sequencing technologies make it possible to address this task efficiently. Among the computational methods applied to such data, the Support Vector Machine (SVM) has reached more than 80% accuracy in predicting gene regulatory activity in Drosophila embryonic development. In this paper, we introduce a metaheuristic based on a genetic algorithm (GA) to select the best SVM parameters for regulatory activity prediction from transcriptional factor binding profiles. Our approach improves accuracy by more than 10% compared to the traditional grid search. The improvements are also supported by biological experimental data. Thus, the proposed method helps boost not only the prediction performance but also the potential biological insights.

I. INTRODUCTION & RELATED WORKS

Since its double helix structure was discovered in 1953, the DNA (deoxyribonucleic acid) sequence, consisting simply of four letters (Adenine, Cytosine, Guanine, Thymine), has been
considered the natural blueprint of organism development. The genome contains a variety of information encoded in a long sequence of DNA letters. One interesting example is the gene-regulatory information that shapes different gene expression patterns. An enhancer, or cis-regulatory module (CRM), is a DNA fragment that carries the information needed to regulate its associated genes. It contains binding sites for the specific transcriptional factor (TF) proteins corresponding to a certain regulatory activity. Understanding CRM activity and its requirements is therefore a fundamental problem in biology [1]. The authors of [2] proposed a simple model in which CRM activity depends on the respective TF bindings, i.e. either the elimination of a TF or the disruption of its binding changes the CRM function. This model has been supported by several small-scale pieces of evidence from ChIP (Chromatin Immunoprecipitation) experiments after Polymerase Chain Reaction amplification. Recently, one of the first genome-wide experiments [3] was carried out using microarray technologies in the model organism Drosophila melanogaster. This work used ChIP on tiling microarrays to obtain the first high-resolution atlas of mesodermal cis-regulatory modules. The data provided strong experimental support for the model mentioned above. In addition, the authors used transcriptional factor binding profiles measured by ChIP signals [4] to predict the expression patterns of the genes regulated by the respective enhancers. Interestingly, the prediction performed quite well and, more importantly, it identified some novel enhancers with highly accurate expression categories. Thus, learning the regulatory code that drives different expression patterns by computational methods is a very attractive branch of computational biology [1].

To predict the expression patterns of genes, the authors of [3] applied a traditional grid search for the parameter optimization of a radial kernel
Support Vector Machine (SVM, [5]) and obtained up to 82% accuracy under the leave-one-out cross validation (LOOCV) framework. The cost C and γ are the two parameters of the radial kernel SVM. The former determines the trade-off between minimizing the fitting error and maximizing the classification margin, whereas the latter affects the effectiveness of the kernel function, especially for high-dimensional data. Parameter optimization plays an important role in the prediction performance of SVM, especially with the radial kernel [6]. Metaheuristic approaches (e.g. Genetic Algorithm and Ant Colony Optimization) have been successfully applied to optimize SVM parameters ([6], [7]) in different problem contexts.

The grid search used by the authors of [3] is a quick method that approximates efficient parameters for SVM prediction. However, it explores only a sparse subset of the parameter space. As a consequence, three out of five test cases achieved only 70% accuracy on average and just one case reached more than 80%. Notably, those three cases were the ones in which the expression pattern corresponds uniquely to one enhancer activity. Thus, more intensive methods are needed to search for the best parameters, particularly for very strict datasets in which the available information might not be enough for standard prediction.

978-1-4673-0309-5/12/$31.00 ©2012 IEEE

We introduce a genetic algorithm approach to improve the performance of enhancer activity prediction. Using a GA, the method searches the parameter space more intensively than the traditional grid search in order to find better parameters for the prediction. Consequently, the proposed approach outperforms the previous method [3] and obtains more than 80% LOOCV accuracy on average for all the cases. More importantly, our results are significantly better when predicting the regulatory activity of novel enhancers with in vivo validated data. Our study shows the need for
parameter selection and optimization in SVM prediction for specific biological datasets.

II. BIOLOGICAL DATA AND PREDICTION PROBLEM

A. Transcriptional binding landscapes in embryonic Drosophila development

Drosophila is a model organism for embryonic development research because of the well-established time-course experiments for several important transcriptional factors such as Twist and Tinman [3]. It is also well known for the very early time points of cell development, at which only DNA information might exist. This allows us to investigate the importance of DNA information (e.g. DNA motifs) for the developmental regulation of the cell.

ChIP is a method to selectively enrich for DNA sequences bound by a particular protein. Recently, this technology has been used to identify active CRMs systematically at whole-genome scale, by either tiling microarray (ChIP-chip) or deep sequencing (ChIP-Seq). Using ChIP-chip, the authors of [3] used a tiling array to obtain transcriptional factor binding data for five important mesodermal factors (Twist, Tinman, Mef2, Bagpipe, and Biniou) at crucial time points during embryogenesis (Fig. 1). Each CRM is then assigned one expression category (mesoderm, somatic muscle, or visceral muscle; see Fig. 1; referred to as meso, sm, vm from here on) using a well-known database (e.g. the REDfly database [8]), yielding a dataset of 310 CRMs. In this dataset, a number of CRMs belong to ambiguous expression categories, i.e. their patterns are observed in both meso and sm (called meso sm), or in both vm and sm (called vm sm). In addition, the authors identified in vivo the expression category of 35 de novo CRMs that are not in the REDfly database, determining the expression pattern of these novel CRMs with transgenic reporter assay experiments. Importantly, this makes it possible to test a prediction approach by predicting the activities of those novel CRMs using the known REDfly CRMs in
the training process.

B. Spatio-temporal cis-regulatory activity prediction in machine learning context

Researchers in [3] applied a Support Vector Machine to establish a framework for predicting transcriptional regulatory activity, i.e. expression category, from the binding profiles of the corresponding transcriptional factors. The prediction helped to indicate the potential of determining the specific transcriptional factors, and the degrees to which they influence the expression patterns they regulate. In the machine learning context, each CRM was represented as a data object with at most 15 features, each of which is a combination of a transcriptional factor and a time point. The SVM method was applied to predict the expression pattern of each CRM. In detail, a binary SVM was used to predict the group of an enhancer from the binding of transcriptional factors at embryonic development time points. The groups were mesoderm, somatic muscle and visceral muscle (meso, SM, VM). The combinations Meso+SM and VM+SM were also considered because they are naturally observed in the expression data.

Fig. 1. Regulatory activity prediction based on the transcriptional binding measured by ChIP-chip heights. The peak height indicates the ChIP binding of the respective TF at a specific time point. In this figure, Twist (Twi) and Tin are measured at early time points (5-7h, 8-9h, 10-11h) and Bin at late time points (10-11h, 12-13h, 13-15h), whereas Bap is measured only at 10-11h and Mef2 at all time points. The binding profile is then used to predict the group of the enhancer activity. The three groups (mesoderm, somatic muscle, visceral muscle) are shown on the right side. A part of the figure is from [1].

III. METHODS

A. SVM prediction of regulatory activity based on transcriptional factor binding profiles

An SVM constructs a hyperplane in an N-dimensional space that optimally separates the data into two categories. Given an individual enhancer and its corresponding binding profiles from ChIP-chip data, the binary SVM prediction is used to predict
its transcriptional category. An SVM model is built to learn how to classify an enhancer x into one of two classes, e.g. mesodermal or not mesodermal, from a training set of m enhancers with known activities. The SVM classifier is based on the following decision function:

$$f(x) = \sum_{i=1}^{m} \lambda_i K(x_i, x)$$

where K is a kernel function and the $\lambda_i$ are coefficients learned during the training process. Usually, the linear kernel function is used for simple data and the radial kernel function for more complex cases. SVM is a parameter-sensitive machine learning classification method, particularly with the radial kernel function. Researchers in [3] used fine-grained grid searching to achieve the optimal result, in which C and γ were set as integer values ranging from $10^{-2}$ to $10^{5}$ and from $10^{-6}$ to $10^{2}$, respectively. This resulted in an average SVM accuracy of 78% under LOOCV. In this paper, we investigate the optimization of the two important parameters C and γ using a Genetic Algorithm. The GA searches the parameter space more finely, so better results are expected.

B. Genetic Algorithm

The GA works as follows (see the pseudocode in Algorithm 1). The t-th generation, called P(t), consists of N solutions, i.e. N sets of parameters (C, γ). Each solution is evaluated with a fitness function, here an AUC value. The next, (t + 1)-th, generation is created by selecting the best individuals via a lottery cycle procedure and applying GA operators including mutation and crossover. More details about GA can be found in [9]. The construction of the chromosome and of the fitness function for our problem is discussed in the next section.

Algorithm 1: GA algorithm to improve the prediction
Data: An enhancer set with known regulation activity
Output: The best solution
begin
    t ← 0 (generation index);
    Initialize the generation P(t);
    Evaluate P(t);
    while termination condition is not met do
        t ← t + 1 (next generation);
        Select new generation Q(t) from P(t − 1);
        Create P(t) from Q(t) by GA operators;
        Evaluate P(t) and select the best individuals;
    Output the best solution;
end

The standard implementation of the GA, with default parameters, is taken from the R package genalg (http://cran.r-project.org/web/packages/genalg/index.html).

C. Fitness function and representation of parameters in GA

The main issue in applying a GA is how to represent the problem as a chromosome. In our method, the two parameters C and γ are encoded in a chromosome as a binary vector. In detail, each chromosome is a 51-bit binary vector that represents the real values of the parameters. The first 24 bits are reserved for C and the remaining 27 bits represent the value of γ. Figure 2 gives an example of a chromosome and of the mutation and crossover operations. In the mutation, the zero bit in the dark cell of a chromosome is changed to a one bit in the resulting chromosome. In the crossover, two chromosomes are cut at the same position, and then their heads and tails are exchanged. At each step, the GA evolves the population in silico and selects the best individuals for the next generation according to the fitness function, which is defined as the Area Under the Curve (AUC) value computed with the ROCR package [10]. At the last stage, the best binary vectors are transformed back to real-valued parameters, normalized by a factor of $10^{2}$ (for C) and $10^{6}$ (for γ).

Fig. 2. The 51-bit binary representation consists of 24 bits for C and 27 bits for γ. After a generation, GA operators such as mutation and crossover are applied to generate a new representation.

IV. EXPERIMENTAL RESULTS

A. Data & Evaluation

We used two published datasets from the model organism Drosophila melanogaster: the first consisted of 310 CRMs with known regulatory activity, and the second was a selected collection of 35 novel enhancers, out of more than 8000 enhancers, whose expression category was tested in vivo [3]. The 310 enhancers are from the CRM Activity Database (CAD), with expression driven by published CRMs, using the REDfly database [8]. For the second set, we used the training set as
the first 310 known enhancers. The novel enhancers were selected and tested in vivo in [3]. It is worth noting that the majority of the datasets were imbalanced, i.e. the numbers of active and non-active enhancers were not equal. To evaluate this type of data, we used the Balanced Accuracy (BACC), defined as the average of the Sensitivity and Specificity of the prediction results. In addition, we used the traditional Area Under the Curve (AUC) to estimate the trade-off between the two measurements. All evaluations were computed under the unbiased leave-one-out cross validation (LOOCV) framework. The proposed method was run 20 times and the results were recorded. The GA parameters were left at the defaults of the genalg package. The run time is about an hour on a 3.3 GHz PC with 4 GB RAM, while the traditional grid search took a matter of minutes in our implementation because of its simple strategy. However, this is not a significant problem given increasingly powerful machines.

B. Comparative Study

1) Known enhancer dataset: The GA SVM outperforms the previous study on the Meso, VM, SM and VM SM datasets (Fig. 3). For Meso SM, the performances of the two methods are similar, both up to 82%. It is remarkable that the GA SVM improved the performance of SVM prediction by up to 10% on average for the three cases of unique regulatory activity (Meso, VM and SM). This large gap demonstrates the value of SVM parameter optimization for a particular type of data. In terms of AUC, the mean and deviation over the 20 runs were recorded (see Table I). The proposed method again has significantly higher performance than the grid search method in the cases of unique regulatory activities. The ROCR package [10] is used for the computation.

Fig. 3. The comparison of Balanced Accuracy (BACC) between the GA SVM method and the grid search (GS SVM) method [3] for five experimental categories. The GA SVM (for 20 runs) outperforms the other method in all cases.

Regulatory category   Meso        VM          SM          Meso SM     VM SM
GS SVM [3]            0.66        0.67        0.71        0.82        0.74
GA SVM                0.71±0.01   0.78±0.01   0.75±0.01   0.83±0.01   0.82±0.02

TABLE I. The comparison between the GA SVM method and the grid search method (GS SVM) [3] in terms of Area Under the Curve (AUC) for all experimental categories.

2) In vivo enhancer test: In [3], the authors carried out in vivo experiments for 35 of more than 8000 new enhancers and reported their specific regulatory activities. In this paper, we evaluate the performance of the two methods by predicting this dataset. We also counted so-called partially correct predictions, in which an enhancer is predicted as one of the expression categories observed. Both methods perform well, correctly predicting up to approximately 80% of novel CRM regulatory activities (see Fig. 4). Interestingly, the GA SVM significantly improves the number of partially correct CRM activity predictions. It also significantly decreases the number of false positive CRM activity predictions compared to the previous results [3]. This indicates that well-suited prediction parameters are necessary for learning rules from known CRM datasets to predict the activity of novel ones, where the training information might not fit the information to be predicted well.

Fig. 4. The comparison between the GA SVM method and the grid search method [3] for the novel enhancers. True Positive and False Positive indicate the CRMs with unique regulatory activities for which the prediction is correct or incorrect, respectively. Partial indicates the number of CRMs for which the predicted regulatory activity is one of the expression categories detected by in vivo experiments.

V. CONCLUSIONS

We proposed a new way to improve the prediction of gene regulatory activity based on transcriptional factor binding profiles. Our method improved accuracy by roughly more than 10% compared to the previous method. In particular, we obtained significantly better results in the case of unique expression categories, where the prediction information needs to be more precise. In addition, we also outperformed the previous method in predicting novel enhancers when using known enhancers as the training set. This indicates the importance of optimization in biological prediction. Biological data is now emerging rapidly, which leads to a need for effective computational optimization. Future work includes tackling a diversity of prediction problems in biology and then building an automatic system of evolutionary computation algorithms to learn the prediction parameters from the biological data itself.

ACKNOWLEDGMENT

This work is partially supported by Vietnam's National Foundation for Science and Technology Development (NAFOSTED).

REFERENCES

[1] A. Stark, “Learning the transcriptional regulatory code,” Mol. Syst. Biol., vol. 5, p. 329, 2009.
[2] M. I. Arnone and E. H. Davidson, “The hardwiring of development: organization and function of genomic regulatory systems,” Development, vol. 124, pp. 1851–1864, May 1997.
[3] R. P. Zinzen, C. Girardot, J. Gagneur, M. Braun, and E. E. Furlong, “Combinatorial binding predicts spatio-temporal cis-regulatory activity,” Nature, vol. 462, pp. 65–70, Nov. 2009.
[4] P. J. Park, “ChIP-seq: advantages and challenges of a maturing technology,” Nat. Rev. Genet., vol. 10, pp. 669–680, Oct. 2009.
[5] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, pp. 273–297, 1995, 10.1007/BF00994018. [Online]. Available: http://dx.doi.org/10.1007/BF00994018
[6] X. Zhang, X. Chen, and Z. He, “An ACO-based algorithm for parameter optimization of support vector machines,” Expert Syst. Appl., vol. 37, pp. 6618–6628, September 2010. [Online]. Available: http://dx.doi.org/10.1016/j.eswa.2010.03.067
[7] C.-L. Huang and C.-J. Wang, “A GA-based feature selection and parameters optimization for support vector machines,” Expert Systems with Applications, vol. 31, no. 2, pp. 231–240, 2006. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417405002083
[8] M. S. Halfon, S. M. Gallo, and C. M. Bergman, “REDfly 2.0: an integrated database of cis-regulatory modules and
transcription factor binding sites in Drosophila,” Nucleic Acids Res., vol. 36, pp. D594–598, Jan. 2008.
[9] C. Reeves, Genetic Algorithms and Combinatorial Optimisation: Applications of Modern Heuristic Techniques. UK: In V.J. Rayward-Smith (Eds), Alfred Waller Ltd, Henley-on-Thames, UK, 1995.
[10] T. Sing, O. Sander, N. Beerenwinkel, and T. Lengauer, “ROCR: visualizing classifier performance in R,” Bioinformatics, vol. 21, pp. 3940–3941, Oct. 2005.
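
As an illustration of the chromosome representation described in Section III-C, the following minimal Python sketch decodes a 51-bit chromosome into (C, γ) and applies the mutation and one-point crossover operators. It is not the authors' implementation, which relies on the R package genalg. The function names are hypothetical, and the decoding rule (each bit field read as an unsigned integer, most significant bit first, then divided by its normalization factor) is an assumption, since the paper does not spell out the exact mapping. Under this assumption the largest representable values are about 1.7 × 10^5 for C and about 134 for γ, which is consistent with the grid ranges quoted in Section III-A.

```python
C_BITS, GAMMA_BITS = 24, 27          # 51-bit layout from Section III-C
C_SCALE, GAMMA_SCALE = 10**2, 10**6  # normalization factors from Section III-C


def bits_to_int(bits):
    """Read a list of 0/1 values as an unsigned integer, most significant bit first."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def decode(chromosome):
    """Map a 51-bit chromosome to real-valued (C, gamma).

    Assumption: each field is an unsigned integer divided by its
    normalization factor; the paper does not give the exact rule.
    """
    if len(chromosome) != C_BITS + GAMMA_BITS:
        raise ValueError("expected a 51-bit chromosome")
    c = bits_to_int(chromosome[:C_BITS]) / C_SCALE
    gamma = bits_to_int(chromosome[C_BITS:]) / GAMMA_SCALE
    return c, gamma


def mutate(chromosome, position):
    """Return a copy of the chromosome with the bit at `position` flipped."""
    child = list(chromosome)
    child[position] ^= 1
    return child


def crossover(a, b, point):
    """One-point crossover: cut both parents at `point` and swap their tails."""
    return a[:point] + b[point:], b[:point] + a[point:]


if __name__ == "__main__":
    parent = [0] * 51
    parent[23] = 1  # lowest bit of the C field
    parent[50] = 1  # lowest bit of the gamma field
    print(decode(parent))  # (0.01, 1e-06)
```

Note that an all-zero field decodes to 0, which is not a valid value for C or γ in a radial kernel SVM, so a practical implementation would need to guard against that case (for example by assigning such chromosomes the worst fitness).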