An Efficient Ant Colony Algorithm for DNA Motif Finding

Xuan Huan Hoang1, The Hung Nguyen1, T. Thu Ha Doan2, and T. Anh Tuyet Duong1
1 University of Engineering and Technology, VNU, Hanoi, Vietnam, {huanhx, hungnt_55, tuyetdta_55}@vnu.edu.vn
2 Hanoi University of Agriculture, doanha86@gmail.com

Abstract. Finding motifs in gene sequences is one of the most important problems in bioinformatics and is NP-hard. This paper proposes a new ant colony optimization algorithm based on the consensus approach, in which a relax technique is applied to recognize the locations of the common motif. The efficiency of the algorithm is evaluated by comparing it with state-of-the-art algorithms.

1. Introduction

Gene regulatory elements are called DNA motifs (hereafter "motifs" for short) and carry important biological information [1,5,12,14,18]. Identifying DNA motifs is currently one of the most important problems in bioinformatics and is NP-hard (see [2,10,12,16,17,19]). There are two main approaches to searching for a motif: biological experiments and computational methods, i.e. bioinformatics. Because biological experiments are costly and time-consuming, computational methods are widely used to predict motifs. Researchers have given various definitions of a motif and several formulations of the motif finding problem, and have developed a number of algorithms for it [1,3,5,15]. One widely used approach is to apply an approximate algorithm that optimizes the consensus score or the information content [2,7,10,11,15,16,19].

Recently, methods based on ant colony optimization (ACO) have been applied effectively to this problem by several authors. For example, Bouamama et al. (2010) proposed the MFACO algorithm [2], which uses the consensus score to find motifs and the information content to locate their occurrences (binding sites) in each DNA sequence. Yang et al.
(2011) proposed an algorithm [19], referred to from now on as EMACO, that combines ACO with Expectation Maximization (EM) to find the starting positions of the motif in the sequences. Liu et al. (2013) proposed the ACRI algorithm [11], which uses the information content as the objective function for the same purpose as EMACO.

In this paper, we propose a new ACO algorithm called ACOMotif, which uses the total Hamming distance from the motif to the DNA sequences as its score function. ACOMotif uses the same structural graph as MFACO but different heuristic information, a different pheromone update rule, and a different local search technique. For each motif found, the algorithm subsequently applies a relax technique to locate the starting positions in the DNA sequences; this version is called R-ACOMotif. The runtime of ACOMotif is also compared with that of MotifSuite (2012) on SCPD, a very large dataset obtained from [21]. The efficiency of ACOMotif is demonstrated by experiments on the same datasets used in the three articles above and on SCPD.

The rest of this paper is organized as follows. Section 2 states the DNA motif finding problem, followed by a brief introduction to the ACO method and how it was applied in the MFACO, ACRI, and EMACO algorithms. Our new algorithm is introduced in Section 3. Section 4 describes the experiments comparing ACOMotif/R-ACOMotif with MFACO, EMACO, ACRI, and MotifSuite. Some conclusions are presented in the last section.

2. DNA motif finding problem and related works

2.1. DNA motif finding problem

From an optimization perspective, the DNA motif finding problem can be described as follows [2,19]. Consider a set of DNA sequences of the same length, S = {S1, S2, ..., SN}, in which the letter at each position j of each Si belongs to the letter set Σ = {A, C, G, T} for all i, j.
For a given natural number l, there are two approaches to discovering a motif:

1) Consensus approach: Find a string Sc of length l and a set of substrings M = {m1, ..., mN}, in which mi is a substring of length l of Si, such that they minimize the objective function

$$\sum_{i=1}^{N} d_H(S_c, m_i) \qquad (1)$$

where $d_H$ denotes the Hamming distance between two strings of equal length. Then Sc is called a motif of S, and each mi is called a motif instance (or instance for short) from Si. If we consider M as a matrix (called a consensus matrix) whose ith row is the string mi, and denote by C(u,j) the number of occurrences of nucleotide u in column j, the objective function CSc(M) is formulated as

$$CSc(M) = \sum_{j=1}^{l} \max_{u \in \Sigma} C(u, j) \qquad (2)$$

The motif Sc is then a string of length l whose letter at position j is the nucleotide that occurs most often in the jth column of M. Note that a matrix M can have several motifs, but we consider only one of them.

2) Positional approach: Find a set of substrings M = {m1, ..., mN} and a set of starting positions A = {a1, ..., aN}, in which each instance mi is the length-l substring of Si starting at position ai. In this approach, the objective function is the information content

$$IC(M) = \sum_{j=1}^{l} \sum_{u \in \Sigma} Q(u, j) \log \frac{Q(u, j)}{p_u} \qquad (3)$$

in which Q(u,j) is the frequency of nucleotide u in column j of the matrix M, and pu is the background frequency of u in the entire set S. In practice, the location of mi on Si is called the binding site of the DNA.

Remark: The optimal solutions of these objective functions are not guaranteed to be real motifs. Therefore, the more solutions an algorithm finds, and the closer the locations of their instances are to the binding sites of the real motif, the better the algorithm is.

2.2. Ant Colony Optimization method

Ant Colony Optimization (ACO), proposed by Dorigo [6,9], is a randomized metaheuristic method for solving hard combinatorial optimization problems. The algorithm has been improved in many ways in the literature and is widely applied. The memetic scheme, which uses a population-based search technique, was first proposed by Moscato [13] and applied to genetic algorithms. Today it is combined with other algorithms as well [3,8].
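The consensus score of Equation (2) and the information content of Equation (3) from Section 2.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, ties in the consensus string are broken in the order A, C, G, T, and the logarithm is taken in base 2, which the text does not specify.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def column_counts(instances):
    """C(u, j): one Counter per column j of the consensus matrix M."""
    length = len(instances[0])
    return [Counter(row[j] for row in instances) for j in range(length)]


def consensus_score(instances):
    """Sum over columns of the count of the most frequent nucleotide."""
    return sum(max(col.values()) for col in column_counts(instances))


def consensus_string(instances):
    """Most frequent nucleotide per column (ties: first in ALPHABET, an assumption)."""
    return "".join(
        max(ALPHABET, key=lambda u: col[u]) for col in column_counts(instances)
    )


def distance_to_consensus(instances):
    """Total Hamming distance from the consensus string to the instances."""
    motif = consensus_string(instances)
    return sum(sum(a != b for a, b in zip(motif, row)) for row in instances)


def information_content(instances, background):
    """Sum of Q(u, j) * log2(Q(u, j) / p_u); base 2 is an assumption."""
    n = len(instances)
    total = 0.0
    for col in column_counts(instances):
        for u in ALPHABET:
            q = col[u] / n
            if q > 0:  # terms with Q(u, j) = 0 contribute nothing
                total += q * math.log2(q / background[u])
    return total
```

For a matrix with N rows and l columns, `distance_to_consensus` equals N*l minus `consensus_score`, which is why minimizing the total Hamming distance and maximizing the consensus score select the same matrices.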
Memetic-ACO algorithms

In this article, we apply ACO with reinforcement search following a simple memetic scheme, as described in Algorithm 1. In such algorithms, the original problem is converted into the problem of finding solutions on a structural graph G = (V, E, Ω, η, T), where V is the vertex set, E is the edge set, Ω is the set of constraints, and η and T are the sets of heuristic information and pheromone trails, respectively, used in reinforcement learning; η and T can be placed on vertices or on edges. An acceptable solution is a path satisfying the condition Ω: it starts from a vertex in a subset C0 of V and is then extended by a randomized step to the next vertex, based on the heuristic information and the pheromone trail.

The ACO algorithm uses Nant artificial ants. In each iteration, each ant finds a solution by a randomized procedure on the structural graph. All the ant solutions are then assessed, and the best ones are chosen for an enhancing strategy or a local search technique. The resulting solutions are evaluated again, and the pheromone trail is updated as reinforcement learning information for the next iteration. Although many algorithms use the same graph G(V,E), they use different heuristic information, pheromone update rules, and local search techniques. From now on we call G(V,E) the structural graph.

Procedure of Memetic-ACO algorithms;
Begin
    Initialize; // initialize the pheromone trail matrix and the ants
    Repeat
        Construct solutions; // each ant constructs its own solution
        Choose a subset Ωil to evolve by enhanced (or local) search;
        For each individual in Ωil do
            Run enhanced (or local) search
        End for;
        Update trail;
    Until End condition;
    Choose the solutions
End;

Algorithm 1. Specification of a simple memetic-ACO algorithm

Recently, several ACO-based algorithms following this scheme have been applied effectively to DNA motif finding. MFACO (2010), proposed by Bouamama et al. [2], uses the consensus score to find motifs and the information content to determine their starting positions.
Experiments showed that this algorithm obtains better results than the best other techniques considered: GS, BP, and MEME. EMACO (2011), proposed by Yang et al. [19], combines ACO and Expectation Maximization (EM), in which EM is used to determine the binding sites. Experiments showed that this method is better than GAME and GALF. ACRI (2013), proposed by Liu et al. [11], uses the information content as the objective function to determine positions. Unlike the two methods above, this algorithm uses a local search at two adjacent positions instead of a random search. It is also shown experimentally to give better results than MEME, AlignACE, and Gibbs Sampler.

In ACO algorithms, four factors strongly affect performance: 1) the structural graph, 2) the heuristic information, 3) the pheromone update rule, and 4) the local search technique. The proposed algorithm uses the same structural graph as the MFACO algorithm.

3. The proposed algorithm

The algorithm, named ACOMotif, uses the total Hamming distance from the motif to the DNA sequences as the objective function. ACOMotif uses the structural graph of MFACO but different heuristic information, a different pheromone update rule, and a different local search technique. For each motif found, the algorithm subsequently applies a relax technique to locate the binding sites in the DNA sequences; in this case ACOMotif is called R-ACOMotif.

3.1. ACOMotif

ACOMotif follows the scheme described in Algorithm 1. The output is the set Q, which contains the motifs of length l and their corresponding instances on the DNA sequences, i.e. the substrings with the smallest Hamming distance to the motifs. The detailed description of ACOMotif is as follows.

Structural graph

The structural graph G(V, E) is the same as that of MFACO. To find a motif of length l, the graph has 4l vertices arranged in four rows and l columns. Each vertex at position (u, j) is labeled by the corresponding nucleotide u, as shown in Figure 1.
The labels of the vertices in each row are also used to refer to the rows. From left to right, edges connect the vertices of two consecutive columns, so each edge joins a vertex (u, j) to a vertex (v, j+1). Heuristic information and pheromone trails are placed at the vertices of the first column and on the edges.

Figure 1. Structural graph for finding a motif of length l

Heuristic information

The heuristic information is placed at the vertices of the first column and on the edges. At the vertices of the first column, the heuristic information is the frequency of the corresponding nucleotide in the entire dataset S. The heuristic information on an edge from u to v is the frequency of the nucleotide pair uv in S. There are only 16 such quantities, one for each (u, v) ∈ Σ × Σ.

Remark. In MFACO, the heuristic information on edges is computed by a high-order background model based on the frequency in S of the motif pattern from the first column to the current column. Since such patterns appear rarely in each DNA sequence Si, this kind of statistical information is of limited use.

Pheromone update rule

Our algorithm uses the SMMAS (Smooth Max-Min Ant System) pheromone update rule [9]. The pheromone trails on each vertex u of the first column and on the edges are first initialized to a predetermined value. After each loop, the pheromone trail at each first-column vertex u is updated by Equation (4), and the pheromone trail on each edge is updated by Equation (5); the coefficients in both rules are pre-determined parameters. The computational analysis and experiments in [9] show that this rule is better than the MMAS update rule used in MFACO.
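The heuristic information and the pheromone update described above can be sketched in Python as follows. The function names, the dictionary layout, and the symbols `rho`, `tau_min`, and `tau_max` are assumptions made for illustration. The update shown is the smoothed max-min form in which SMMAS is usually stated: trails on components of the best solution move toward `tau_max`, and all other trails move toward `tau_min`. It may differ in detail from Equations (4) and (5). Pair frequencies are assumed to be counts of adjacent nucleotide pairs divided by the total number of adjacent pairs in S.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def heuristic_information(sequences):
    """Nucleotide and adjacent-pair frequencies over the whole dataset S."""
    singles, pairs = Counter(), Counter()
    for seq in sequences:
        singles.update(seq)
        pairs.update(seq[k:k + 2] for k in range(len(seq) - 1))
    n_single = sum(singles.values())
    n_pair = sum(pairs.values())
    eta_vertex = {u: singles[u] / n_single for u in ALPHABET}
    eta_edge = {
        u + v: pairs[u + v] / n_pair for u, v in product(ALPHABET, repeat=2)
    }
    return eta_vertex, eta_edge


def solution_components(motif):
    """First-column vertex plus the edges (j, u, v) that spell out a motif."""
    components = {("vertex", motif[0])}
    components.update(
        (j, motif[j], motif[j + 1]) for j in range(len(motif) - 1)
    )
    return components


def smmas_update(trail, best_components, rho, tau_min, tau_max):
    """One SMMAS-style step: tau <- (1 - rho) * tau + rho * target.

    The target is tau_max for components of the best solution and
    tau_min for every other component (assumed form of the rule).
    """
    updated = {}
    for component, tau in trail.items():
        target = tau_max if component in best_components else tau_min
        updated[component] = (1.0 - rho) * tau + rho * target
    return updated
```

Because every trail is a convex combination of its old value and a target in [tau_min, tau_max], trails that start inside this interval stay inside it without explicit clamping.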
Randomized procedure to find a solution

In each iteration, each ant randomly selects a starting vertex u in the first column with the probability given by Equation (6). The ant then walks through all the columns in sequence, choosing the edge from vertex u of column j to vertex v of column j+1 with the probability given by Equation (7). The path of the ant from the starting vertex to the vertex in the last column identifies an acceptable solution for the motif.

The objective function and identification of binding sites

For each acceptable solution Sc, instead of the objective function in Equation (1), ACOMotif takes the total Hamming distance Hd(Sc) from Sc to the DNA sequences in S as the objective function:

$$Hd(S_c) = \sum_{i=1}^{N} d_H(S_c, m_i) \qquad (8)$$

where $d_H$ denotes the Hamming distance and

$$m_i = \arg\min \{\, d_H(S_c, s) : s \text{ is a substring of length } l \text{ of } S_i \,\} \qquad (9)$$

The minimizing string mi in (9) is an instance of Sc, and its position in Si is the binding site. Note that Sc can have multiple instances on Si with the same distance.

Local search

ACOMotif applies a hill-climbing technique for local search, as described below. After all ants finish their paths through the graph, the solutions are formed with their corresponding total Hamming scores; the local search is then applied to the solutions with the smallest scores. For each potential motif Sm, the set Q(Sm) holds the search results, and the iteration procedure is carried out as follows:

Step 1: Initialize Q(Sm) = {Sm};
Step 2: Repeat:
    For each i = 1, ..., l do:
        2.1. Replace the letter at position i of Sm by each of the three remaining letters of Σ in turn to get Sp;
        2.2. Compute Hd(Sp);
        2.3. If Hd(Sp) ≤ Hd(Sm) then Sm ← Sp and Q(Sm) ← Q(Sm) ∪ {Sp};
Until the objective function cannot be improved any further.

After the local search has been applied to the potential motifs in each iteration, the sets Q(Sm), which consist of candidates with the smallest or nearly smallest scores, are merged into the set Q containing all the best solutions found so far; among solutions with the same binding sites, only one motif is retained. Based on the set Q, the pheromone trail on the graph is updated according to Equations (4) and (5).
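The total Hamming distance objective and the hill-climbing local search described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, ties between equally close substrings are broken by the leftmost position, and the loop is assumed to stop after a full pass over all positions that brings no strict improvement.

```python
ALPHABET = "ACGT"


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_instance(motif, sequence):
    """Return (distance, start) of the closest length-l substring of sequence.

    Ties are broken by the leftmost start position (an assumption).
    """
    l = len(motif)
    return min(
        (hamming(motif, sequence[p:p + l]), p)
        for p in range(len(sequence) - l + 1)
    )


def total_hamming(motif, sequences):
    """Sum over all sequences of the distance to the closest substring."""
    return sum(best_instance(motif, s)[0] for s in sequences)


def hill_climb(motif, sequences):
    """Single-letter hill climbing; returns (motif, score, visited candidates)."""
    current, score = motif, total_hamming(motif, sequences)
    candidates = {current: score}
    improved = True
    while improved:
        improved = False
        for i in range(len(current)):
            for letter in ALPHABET:
                if letter == current[i]:
                    continue
                trial = current[:i] + letter + current[i + 1:]
                trial_score = total_hamming(trial, sequences)
                if trial_score <= score:  # ties are accepted, as in step 2.3
                    improved = improved or trial_score < score
                    current, score = trial, trial_score
                    candidates[current] = score
    return current, score, candidates
```

Since the score is a non-negative integer and another pass is made only after a strict decrease, the loop terminates. The start positions returned by `best_instance` play the role of the binding sites.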
The algorithm stops after a predefined number of loops. The binding sites associated with the motifs in Q allow us to identify the instances of each motif.

3.2. R-ACOMotif

Because the real positions of motifs in the DNA sequences are not necessarily the solution of the optimization problem, ACOMotif additionally uses a relax technique to locate the binding sites of each motif. When ACOMotif employs this technique, it is called R-ACOMotif. For each motif Sc in the solution set Q of ACOMotif and a given number, the relax technique finds a set of instances and a set of starting positions as follows:

Step 1. // Expand the instance sets and binding sites
    1.1. On each sequence Si, find the locations of the substrings of length l whose Hamming distance to Sc equals one of the two admitted values. This gives the sets Mi and Ai, containing these substrings and their locations, respectively;
    1.2. Compute the number of elements ni of each set Mi and the resulting size measure;
Step 2. // Filter to reduce the size of the sets Mi and Ai
    Repeat
        2.1. Sort the sets Mi in increasing order of ni // the following steps use this new order;
        2.2. Determine the smallest number k such that ni > 1 for every i ≥ k;
        2.3. For each i = k to N do
            2.3.1. Compute a score for each element of Mi;
            2.3.2. Compute gi, the minimum of these scores;
            2.3.3. Remove from Mi the elements that fail the filtering condition;
        2.4. // the size measure is reduced
    Until the size measure is smaller than half of its value before the loop;
Step 3. // Find the best solutions
    3.1. Build all consensus matrices from the reduced sets Mi and compute their consensus scores as in Equation (2);
    3.2. Sort the matrices from step 3.1, together with the corresponding locations of the instances, in decreasing order of consensus score;
Step 4. The solution is the first tuple in the list from step 3.2 // more tuples can be taken, depending on the computed priority.

4. Experimental results

The program was written in Perl and run on a desktop computer with an Intel Core i5 2.5 GHz CPU and 4 GB of RAM, under the Ubuntu 12.04 operating system.
Our experiments compare the efficiency of the new algorithm with that of MFACO [2], EMACO [19], and ACRI [11] on the same datasets, using the same numbers of loops and ants as in the corresponding evaluations. The number of ants is fixed to 8. Because the programs of these algorithms are not available to us, runtimes cannot be compared on the same machine configuration, and the results of the compared algorithms are taken directly from the published articles. The reported runtime of ACOMotif/R-ACOMotif is an average. The parameters were set as follows; the coefficient is chosen depending on the algorithm's number of loops:

Number of loops   10-100      100-300     300-600     > 600
Coefficient       0.03-0.05   0.02-0.03   0.01-0.02   0.005

To evaluate computation time, ACOMotif was compared with MotifSuite (2012) [20] on the SCPD dataset [21]. The efficiency of ACOMotif was assessed by experiments on the same published datasets as the three algorithms above and on SCPD.

4.1. Comparison with MFACO using consensus approach

The H.sapiens data used in [2] consists of three small sets with 6, 9, and 12 strings, respectively. Each string is 3,001 nucleotides long. The H.sapiens dataset has no biologically known actual motif. Therefore, our experiments compare only the values of the objective functions computed by Equation (1) and Equation (2). We write HSc for the total Hamming distance score and CSc for the consensus score. Note that

$$HSc = N \cdot l - CSc \qquad (10)$$

so a smaller HSc is equivalent to a greater CSc; therefore, we only need to consider HSc. The experiments were performed in the same way as in [2]: each set was run three times with 50 loops, the computation time is an average, and the results of MFACO are taken from [2].

The experimental results for H.sapiens dataset 1 are shown in Table I and Table II, with motif lengths l = 7 and l = 13, respectively.

Table I. Comparison between ACOMotif and MFACO on H.sapiens 1: coefficient = 0.03, l = 7, N = 6, runtime 41 s

Algorithm   Motif     CSc   HSc
ACOMotif    CCTCCCC   42    0
ACOMotif    AAAAAAA   42    0
ACOMotif    GCAGCGG   42    0
ACOMotif    GCCGGGG   42    0
ACOMotif    GCCGCCG   42    0
ACOMotif    AAAAAAG   42    0
ACOMotif    GCCTGTG   42    0
ACOMotif    TAAAAAT   42    0
ACOMotif    CGGCGCC   42    0
ACOMotif    GGGCCAG   42    0
ACOMotif    GGCCAGG   42    0
ACOMotif    GCGGGCG   42    0
ACOMotif    CCCGGGC   42    0
MFACO       AAAAAAA   42    0
MFACO       AGGAGGA   42    0
MFACO       AAAAAAG   42    0
MFACO       TAAAAAT   42    0

Table II. Comparison between ACOMotif and MFACO on H.sapiens 1: coefficient = 0.03, l = 13, N = 6, runtime 96 s

Algorithm   Motif           CSc   HSc
ACOMotif    AAAAAAAAAAAGA   76    2
ACOMotif    AAAAAAAAAAAAG   75    3
ACOMotif    GCTGAGGCAGGAG   72    6
ACOMotif    GCCGCCGCCGCCG   72    6
ACOMotif    CGCCGCCGCCGCC   72    6
ACOMotif    GAGGCTGAGGCAG   71    7
MFACO       AAAAAAAAAAAGA   76    2
MFACO       AAAAAAAAAAAAG   75    3
MFACO       AAAAAAAAAAAGT   75    3
MFACO       AAAAAAAAAAAAG   75    3

Remark: Table I shows that ACOMotif is considerably better (13 motifs found) than MFACO (only 4 motifs found) at the same HSc and CSc. Table II shows that both algorithms discovered the motif with the best score; MFACO found three motifs with HSc = 3, while ACOMotif found only one. However, the number of motifs discovered by ACOMotif is higher overall.

The experimental results for H.sapiens dataset 2 are given in Table III and Table IV, with motif lengths l = 7 and l = 13, respectively.

Table III. Comparison between ACOMotif and MFACO on H.sapiens 2: coefficient = 0.03, l = 7, N = 9, runtime 62 s

Algorithm   Motif     CSc   HSc
ACOMotif    CCCTCCT   63    0
ACOMotif    CTCCCTT   62    1
ACOMotif    GAGCAGG   62    1
ACOMotif    GGGTTGG   62    1
ACOMotif    GGGGCTG   62    1
ACOMotif    TGGGAGG   62    1
ACOMotif    GGCGGCC   62    1
ACOMotif    GGGGCTG   62    1
ACOMotif    CCCCTCC   62    1
ACOMotif    TTCCTGG   62    1
ACOMotif    CCCCTCC   62    1
ACOMotif    GGGCTGG   62    1
ACOMotif    CCTCCCT   62    1
MFACO       CCCTCCT   63    0
MFACO       CCCTCAG   62    1
MFACO       GGGTTGG   62    1
MFACO       GAGCAGG   62    1

Table IV. Comparison between ACOMotif and MFACO on H.sapiens 2: coefficient = 0.03, l = 13, N = 9, runtime 126 s

Algorithm   Motif           CSc   HSc
ACOMotif    GCCGGCGGGCGCC   102   15
ACOMotif    GGCCCCCGGGCGG   101   16
ACOMotif    GGGGGAGCAGGAG   101   16
ACOMotif    GGCCGGCGGGCGG   100   17
ACOMotif    GCAGGGGCTGGGG   100   17
ACOMotif    GGCCAGGCTCGGC   100   17
ACOMotif    CCCCGCCCCCGGC   100   17
MFACO       GCCGGCGGGCGCC   102   15
MFACO       GCGGGCGGGCGCC   101   16
MFACO       GCCGGCGGGCGGC   100   17
MFACO       GCCGGAGGGCGCC   100   17

Remark: Table III shows that ACOMotif is considerably better in terms of the number of motifs found with the minimum scores. ACOMotif found 12 motifs with HSc = 1, compared with only 3 for MFACO. Table IV shows that ACOMotif again compares favorably with MFACO: it also found the best-scoring motif, and in addition found two motifs with HSc = 16 and four motifs with HSc = 17, where MFACO found one and two, respectively.

The experimental results for H.sapiens dataset 3 are given in Table V and Table VI, with motif lengths l = 7 and l = 13, respectively.

Table V. Comparison between ACOMotif and MFACO on H.sapiens 3: coefficient = 0.03, l = 7, N = 9, runtime 114 s

Algorithm   Motif     CSc   HSc
ACOMotif    GGCGGGG   123   3
ACOMotif    CTGAGGC   123   3
ACOMotif    CCAGCTG   123   3
ACOMotif    GAGGCAG   123   3
ACOMotif    GGGGCGG   123   3
MFACO       GGGGCGG   123   3
MFACO       CCCAGCT   123   3
MFACO       CCAGCTG   123   3
MFACO       CTGAGGC   123   3

Table VI. Comparison between ACOMotif and MFACO on H.sapiens 3: coefficient = 0.03, l = 13, N = 9, runtime 231 s

Algorithm   Motif           CSc   HSc
ACOMotif    GGGAGGCTGAGGC   205   29
ACOMotif    CGGGAGGCGGAGG   204   30
ACOMotif    GCTGAGGCAGGAG   202   32
ACOMotif    GGAGGCTGAGGCA   202   32
ACOMotif    GGCTGAGGCAGGA   119   35
ACOMotif    GGGCGGGGCGGGG   119   35
MFACO       GGGAGGCTGAGGC   205   29
MFACO       CGGGAGGCGGAGG   204   30
MFACO       GGAGGCTGAGGCA   202   32
MFACO       CGGGAGGCGGGGG   201   33

Remark: Table V shows that ACOMotif discovered one more motif with the minimum score 3. In Table VI, ACOMotif again gave better results in terms of both the number of motifs found and the scores.

4.2. Comparison with MFACO and ACRI in terms of positional approach

The experiment was carried out on the E.coli dataset of CRP binding sites, used by both MFACO [2] and ACRI [11], to compare the discovered binding sites. The dataset contains eighteen strings of 105 nucleotides each. The length of the examined motif is 22, the same as in MFACO and ACRI. R-ACOMotif was run 20 times, each with 300 loops, 10 ants, and coefficient = 0.02. The experimental results are given in Table VII.

Table VII. Comparison between R-ACOMotif and the MFACO and ACRI algorithms

No.   Position   ACRI   Error   MFACO   Error   R-ACOMotif   Error
1     17; 61     63     2       61      0       61           0
2     17; 55     57     2       55      0       55           0
3     76         78     2       76      0       76           0
4     63         65     2       63      0       63           0
5     50         52     2       50      0       50           0
6     7; 60      9      2       7       0       7            0
7     42         44     2       42      0       42           0
8     39         41     2       39      0       39           0
9     9; 80      11     2       9       0       9            0
10    14         16     2       14      0       14           0
11    61         63     2       61      0       61           0
12    41         43     2       41      0       41           0
13    48         50     2       48      0       48           0
14    71         73     2       71      0       71           0
15    17         19     2       17      0       17           0
16    53         55     2       53      0       53           0
17    1; 84      95     4       84      0       84           0
18    78         78     0       78      0       78           0

Remark: The results show that R-ACOMotif and MFACO both discovered all the correct starting positions, whereas ACRI had a comparatively high error.

4.3. Comparison with EMACO in terms of positional approach

Experiments were carried out on two datasets, ERE and E2F, which were also used for EMACO [19]. Each of them contains 25 strings of 200 nucleotides each; the real motifs and their starting positions on each string are known in advance. The algorithms were run 20 times, using 20 ants and 100 loops, and their average values were compared. According to [19], a discovered position is correct if it is at most 3 positions away from the real location. To assess the results, the study [4] proposed three measures, precision, recall, and F-score:

$$\text{Precision} = \frac{n_c}{n_p}, \qquad \text{Recall} = \frac{n_c}{n_t}, \qquad \text{F-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (11)$$

where nc is the number of binding sites that were correctly predicted, np is the total number of predicted binding sites, and nt is the total number of actual binding sites. The F-score in particular is considered suitable for assessing the quality of an algorithm [11]. The experimental results compared with EMACO are presented in Table VIII.
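The measures of Equation (11), with the rule that a prediction counts as correct when it lies at most 3 positions from a real site, can be sketched in Python as follows. The function name and the input layout (one list of start positions per sequence) are assumptions for illustration. As a simplification, several predictions near the same real site would each be counted as correct.

```python
def binding_site_metrics(predicted, actual, tolerance=3):
    """Return (precision, recall, f_score) for predicted binding sites.

    predicted, actual: one list of start positions per sequence.
    """
    n_correct = n_predicted = n_actual = 0
    for pred_sites, true_sites in zip(predicted, actual):
        n_predicted += len(pred_sites)
        n_actual += len(true_sites)
        # A prediction is correct if some real site lies within the tolerance.
        n_correct += sum(
            any(abs(p - t) <= tolerance for t in true_sites)
            for p in pred_sites
        )
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_actual if n_actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

For example, with predictions `[[10], [50], [7, 90]]` and real sites `[[12], [60], [7]]`, two of the four predictions are correct, giving a precision of 0.5 and a recall of 2/3.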
Table VIII. Comparison between R-ACOMotif and EMACO on the ERE and E2F datasets

        R-ACOMotif                                    ACO combined with EM
Data    Precision      Recall         F-score         Precision     Recall        F-score
ERE     0.89 ± 0.005   0.83 ± 0.004   0.86 ± 0.005    0.81 ± 0.01   0.76 ± 0.01   0.85 ± 0.01
E2F     1 ± 0          1 ± 0          1 ± 0           0.91 ± 0.02   0.92 ± 0.01   0.91 ± 0.02

Remark: The table shows that, on both datasets, R-ACOMotif achieves better values than EMACO. In particular, the F-score of the former is higher than that of the latter. We therefore conclude that, on these datasets, R-ACOMotif is more effective than the ACO combined with EM algorithm of [19].

4.4. Comparison with MotifSuite

To compare computation time and precision, ACOMotif and MotifSuite [20] were run on two datasets, GCR1 and GCN4, taken from the SCPD data [21]. GCR1 contains six strings (DNA sequences) of length 9,050, and its real motif is CTTCC (l = 5). GCN4 contains nine strings, and its real motif is TGACTC (l = 6). Both programs were run 20 times with the same 20 loops to compute the average runtime and to check whether they can find the real motif. ACOMotif used eight ants. The experimental results show that ACOMotif found the real motif CTTCC on GCR1 with score HSc = 0 and TGACTC on GCN4 with score HSc = 1, whereas MotifSuite did not. The runtimes of the two algorithms, presented in Table IX, show that ACOMotif is about three times faster.

Table IX. Comparison of runtime between ACOMotif and MotifSuite

Dataset   ACOMotif (s)   MotifSuite (s)
GCR1      148.8          501.8
GCN4      314.9          891.9

5. Conclusion

Motif finding is one of the challenges in biomolecular research. Applying ant colony optimization to this problem has proved effective, but each algorithm has its own advantages and disadvantages. The experiments indicate that ACOMotif outperforms the compared algorithms on the tested datasets, and that the R-ACOMotif version allows us to find the binding sites of the real motif precisely.
This algorithm can be extended to other types of motif problems, and its search techniques can be improved to enhance solution quality. With parallel processing, the runtime is expected to be lower.

Acknowledgements

This work was done during the stay of the first author at the Vietnamese Institute for Advanced Study in Mathematics (VIASM).

REFERENCES

1. S. Bandyopadhyay, S. Sahni, S. Rajasekaran (2012) Pms6: A faster algorithm for motif discovery. In: Proceedings of the second IEEE Int. Conf. on Computational Advances in Bio and Medical Sciences (ICCABS 2012), 1-6.
2. S. Bouamama, A. Boukerram, and A.F. Al Badarneh: Motif finding using ant colony optimization, ANTS'10 Proc. of the 7th int. conf. on Swarm intelligence (2010), LNCS vol. 6234, 464-471.
3. X. S. Chen, Y. S. Ong, M. H. Lim. Research frontier: memetic computation - past, present & future, IEEE Computational Intelligence Magazine 5 (2), 2011, 24-36.
4. M. Claeys, V. Storms, H. Sun, T. Michoel, K. Marchal. MotifSuite: workflow for probabilistic motif detection and assessment. Bioinformatics, 28(14):1931-1932. doi: 10.1093/bioinformatics/bts293 (2012).
5. H. Dinh, S. Rajasekaran, J. Davila, qPMS7: A fast algorithm for finding the (l, d)-motif in DNA and protein sequences, PLoS one, Vol. 7, No. 7, 2012, e41425.
6. M. Dorigo, T. Stützle: Ant Colony Optimization. MIT Press, Cambridge, 2004.
7. E. Eskin and P. Pevzner, Finding composite regulatory patterns in DNA sequences, Bioinformatics S1, 2002, 354-363.
8. H. Hoang Xuan, D. Do Duc, N. Manh Ha: An efficient two-phase ant colony optimization algorithm for the closest string problem. SEAL2012, 188-197.
9. H. Hoang Xuan, T. Nguyen Linh, D. Do Duc, H. Huu Tue, Solving the traveling salesman problem with ant colony optimization: a revisit and new efficient algorithms, REV Journal on Electronics and Communications, Vol. 2, No. 3-4, July-December, 2012, 121-129.
10. M.K. Keith et al., "A simulated annealing algorithm for finding consensus sequences," J. Bioinformatics, 18 (2002), 1494-1499.
11. W. Liu, H. Chen, L. Chen: An ant colony optimization based algorithm for identifying gene regulatory elements. Comp. in Bio. and Med. 43(7), 2013, 922-932.
12. N. W. Lo, S. W. Changchien, Y. F. Chang and T. C. Lu, "Human promoter prediction based on sorted consensus sequence patterns by genetic algorithms," Proc. of the Int. Congress on Biological and Medical Engineering, D3I-1540: 111-112 (2002).
13. P. Moscato, On evolution, search, optimization, genetic algorithms and martial arts: towards memetic algorithms. Tech. Rep. Caltech Concurrent Computation Program, Report 826, California Institute of Technology, Pasadena, California, USA (1989).
14. A. Neuwald, J. Liu, and C. Lawrence, Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Science, 4:1618-1632 (1995).
15. N. Pisanti, A. Carvalho, L. Marsan, M.F. Sagot, Risotto: Fast extraction of motifs with mismatches. In: Proceedings of the 7th Latin American Theoretical Informatics Symposium, pp. 757-768 (2006).
16. G.D. Stormo, G.W. Hartzell, 3rd, Identifying protein-binding sites from unaligned DNA fragments, Proc. Natl. Acad. Sci. USA 86 (4) (1989) 1183-1187.
17. G. Thijs, M. Lescot, K. Marchal, S. Rombauts, B.D. Moor, P. Rouzé, Y. Moreau: A higher order background model improves the detection of regulatory elements by Gibbs sampling. Bioinformatics 17(12), 1113-1122 (2001).
18. W. Thompson, C.R. Eric and E.L. Lawrence, "Gibbs recursive sampler: finding transcription factor binding sites," J. Nucleic Acids Research, 31, pp. 3580-3585 (2003).
19. C.H. Yang, Y.T. Liu, L.Y. Chuang. DNA motif discovery based on ant colony optimization and expectation maximization. Proc. of IMECS 2011, 169-174.
20. http://bioinformatics.psb.ugent.be/webtools/MotifSuite/Index.htm#pub
21. http://rulai.cshl.edu/SCPD/