ACOHAP: An efficient ant colony optimization for the haplotype inference by pure parsimony problem

Swarm Intell (2013) 7:63–77
DOI 10.1007/s11721-013-0077-8

Dong Duc Do · Sy Vinh Le · Xuan Huan Hoang

Received: 20 March 2012 / Accepted: 12 February 2013 / Published online: 28 February 2013
© Springer Science+Business Media New York 2013

Abstract Haplotype information plays an important role in many genetic analyses. However, the identification of haplotypes based on sequencing methods is both expensive and time consuming. Current sequencing methods can efficiently determine only the conflated data of haplotypes, that is, genotypes. This raises the need to develop computational methods to infer haplotypes from genotypes. Haplotype inference by pure parsimony is an NP-hard problem and remains a challenging task in bioinformatics. In this paper, we propose an efficient ant colony optimization (ACO) heuristic method, named ACOHAP, to solve the problem. The main idea is based on the construction of a binary tree structure through which ants can travel and resolve conflated data of all haplotypes from site to site. Experiments with both small and large data sets show that ACOHAP outperforms other state-of-the-art heuristic methods. ACOHAP is as good as the currently best exact method, RPoly, on small data sets. However, it is much better than RPoly on large data sets. These results demonstrate the efficiency of the ACOHAP algorithm in solving the haplotype inference by pure parsimony problem for both small and large data sets.

Keywords ACOHAP · Ant colony optimization · Haplotype inference · Pure parsimony · Genotypes

D.D. Do
Information Technology Institute, VNU, 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam
e-mail: dongdoduc@vnu.edu.vn

S.V. Le
University of Engineering and Technology & Information Technology Institute, VNU, 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam
e-mail: vinhbio@gmail.com

X.H. Hoang
University of Engineering and Technology, VNU, 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam
e-mail: huanhx@vnu.edu.vn

1 Introduction

Single nucleotide polymorphisms (SNPs) are the most frequent form of genomic variation. The nucleotide variants at SNP sites are called alleles. Most SNPs are biallelic, that is, only two different nucleotides are observed in the population. A haplotype is a sequence of alleles on one chromosome. Haplotypes provide important information for many genetic analyses (The International Hapmap Consortium 2007; Graça et al. 2010, and references therein). However, the experimental determination of haplotypes is expensive and time consuming. Fortunately, current sequencing methods can efficiently determine their conflated data, that is, genotypes. This motivates researchers to develop computational methods to infer haplotypes from genotypes.

Haplotype inference is a challenging problem in bioinformatics (Istrail 2004; Graça et al. 2010). Haplotype inference by pure parsimony (HIPP), that is, finding the smallest number of haplotypes to explain a set of genotypes without recombination, is an NP-hard problem (Gusfield 2001, 2003). Many different computational methods have been proposed to solve this problem. These methods can be classified as either heuristic or exact approaches.

Clark was the first to propose an inference rule-based approach to solve the HIPP problem (Clark 1990). Tininini et al. generalized Clark's inference rules to construct the CollHaps algorithm (Tininini et al. 2010). This algorithm starts from a list of haplotypes and performs a sequence of collapse steps in order to minimize the number of distinct haplotypes. CollHaps, a heuristic method, is designed to conduct a number of complete attempts (driven by a randomized quasi-greedy strategy) for each problem instance to find the best solution. Another heuristic method, called the parsimonious tree-grow method (PTG), was proposed to solve this problem (Li et al. 2005). The main idea of PTG is to resolve genotypes from site to site by growing a maximum parsimony
tree. Although PTG is very efficient in terms of computational complexity, it is less accurate than other methods.

Exact methods for solving the HIPP problem include integer linear programming techniques (Gusfield 2001), Boolean satisfiability (SAT) formulation techniques (Lynce and Marques-Silva 2008), and Boolean constraint techniques (Graça et al. 2007, 2008). Recently, Graça et al. (2007, 2008) developed a pseudo-Boolean optimization method, called RPoly, for solving the HIPP problem. Experimental results show that RPoly outperforms other exact methods. Although exact methods are able to find optimal solutions for the HIPP problem, they are only applicable to small data sets due to the computational burden (Gusfield and Orzack 2005; Brown and Harrower 2006).

The ant colony optimization (ACO) approach has been widely used to tackle combinatorial optimization problems (Dorigo and Stützle 2004, and references therein). Benedettini and colleagues were the first to apply ACO to solve the HIPP problem (Benedettini et al. 2008). Their algorithm includes two levels: in the first level, it uses ACO to determine a good visiting order of genotypes; in the second level, it employs ACO to infer haplotypes from genotypes following the orders determined in the first level. Although heuristic information is estimated to determine a good visiting order of genotypes in this ACO system, it is not estimated to instruct ants to infer individual alleles.

In this paper, we propose an ACO-based method, named ACOHAP, to solve the HIPP problem for large data sets. The key idea is the construction of a binary tree structure which allows ants to travel and resolve conflated data of all haplotypes from site to site. Heuristic information for each individual allele can be effectively estimated to guide ants towards good solutions. This technique overcomes the limitation of the 2-level ACO method in estimating heuristic information. The rest of the
paper is organized as follows: The HIPP problem is presented in Sect. 2; the ACOHAP algorithm is described in Sect. 3; Sect. 4 analyzes the performance of different methods on both small and large data sets; conclusions are given in the last section.

2 The HIPP problem

We assume SNPs are biallelic and represent alleles by 0 or 1. A haplotype of $m$ sites is represented by a string $h = h_1 \ldots h_m$ of size $m$ where $h_i \in \{0, 1\}$. Consider an unordered haplotype pair $(h^a, h^b)$. Their corresponding genotype $g$ is represented by a string $g = g_1 \ldots g_m$ of size $m$ where $g_i \in \{0, 1, 2\}$ is the conflated data at site $i$. Hereby, $g_i$ is defined as follows:

\[
g_i =
\begin{cases}
h^a_i & \text{if } h^a_i = h^b_i \text{ (genotype } g \text{ is homozygous at site } i\text{)} \\
2 & \text{if } h^a_i \neq h^b_i \text{ (genotype } g \text{ is heterozygous at site } i\text{)}
\end{cases}
\tag{1}
\]

The haplotype pair $(h^a, h^b)$ is called a haplotype resolution of $g$, and we say the genotype $g$ is resolved by $(h^a, h^b)$. A given genotype $g$ has exactly $2^{k-1}$ different haplotype resolutions, where $k$ is the number of sites at which $g$ is heterozygous. For example, genotype $g = 022$ can be resolved by the two different unordered haplotype pairs (000, 011) and (001, 010).

Let $G = \{g^1, \ldots, g^n\}$ be a set of $n$ genotypes at the $m$ loci under consideration. It is said that haplotype set $H = \{h^1, \ldots, h^k\}$ is a solution of $G$ if each genotype of $G$ is resolved by a pair of haplotypes in $H$. Given a set of genotypes $G$, the HIPP problem is to find a solution $H$ with minimum number of haplotypes (Gusfield 2003). Consider an example of three genotypes $G = \{g^1, g^2, g^3\} = \{121, 002, 221\}$: $H = \{h^1, h^2, h^3, h^4\} = \{000, 001, 101, 111\}$ is an optimal solution of $G$ having four haplotypes, where $(h^3, h^4)$, $(h^1, h^2)$ and $(h^2, h^4)$ are haplotype resolutions of $g^1$, $g^2$, and $g^3$, respectively.

3 Methods

3.1 Graph construction

This section describes a binary tree structure to represent all possible haplotypes with $m$ sites (see Fig. 1). The tree structure has the following properties:

– It is a full binary tree with $m + 1$ levels. The root is at level 0 and leaves are at level $m$.
– We denote by
$v$, $v \in \{0, 1\}$, the label of a branch. A branch from an internal node $X$ to its left (right) child is labeled as 0 (1) and called 0-branch (1-branch).
– A node is labeled by the concatenated string of branch labels from the root to the node.
– The label of a node at level $i$ represents a haplotype with $i$ sites. The binary tree of $2^m$ leaves represents $2^m$ different possible haplotypes.

Fig. 1 The full binary tree of depth $m$ (the root is at level 0; leaves are at level $m$). Branches are labeled either 0 or 1. A node is labeled by the concatenated string of branch labels from the root to the node. The label of a node at level $i$ represents a haplotype with $i$ sites. Ants can travel from the root to the leaves of the tree to determine haplotypes. For example, two haplotypes $h^a = 001$ and $h^b = 101$ (bold paths) are a haplotype resolution of $g = 201$.

In the algorithm proposed below, we think of an ant traveling from the root to the leaves of the tree to resolve each genotype $g$ into two haplotypes $h^a$ and $h^b$. At level $i - 1$, the ant determines allele $h^a_i$ by following either the 0-branch ($h^a_i = 0$) to the left child or the 1-branch ($h^a_i = 1$) to the right child. Specifically, if genotype $g$ is homozygous at site $i$, we assign $h^a_i = g_i$ and the ant follows the $g_i$-branch. Otherwise, the ant decides whether to follow the 0-branch or the 1-branch based on the pheromone trail and heuristic information. It is worth noting that the complementary haplotype $h^b$ can be determined from $h^a$ and $g$, and vice versa. Specifically,

\[
h^b_i =
\begin{cases}
h^a_i & \text{if } g_i \text{ is 0 or 1 (homozygous)} \\
1 - h^a_i & \text{if } g_i \text{ is 2 (heterozygous)}
\end{cases}
\tag{2}
\]

For example, genotype $g = 201$ can be resolved into two haplotypes $h^a = 001$ and $h^b = 101$ (bold paths in the tree in Fig. 1).

3.2 The HAPIN algorithm

We propose an ant-walking algorithm, named HAPIN, to determine a solution $H$ of $G$. The algorithm starts from the root of the tree with $n$ initially undetermined haplotype pairs $(h^{1a}, h^{1b}), \ldots, (h^{na}, h^{nb})$. To determine these haplotype pairs, we think of them as ants
and allow them to follow branches of the tree from the root to the leaves in such a way that haplotype pair $(h^{sa}, h^{sb})$ forms a haplotype resolution of $g^s$ for each $s$, $s \in \{1, \ldots, n\}$. A leaf is active if it contains at least one haplotype. Labels of active leaves constitute a solution $H$ of $G$ (see Fig. 2).

Fig. 2 The full binary tree structure through which ants can travel and determine a solution $H$ of $G$. Consider an example of three genotypes $G = \{121, 002, 221\}$: haplotypes $h^{1a}, h^{1b}, h^{2a}, h^{2b}, h^{3a}, h^{3b}$ follow branches from the root to four leaves: 000, 001, 101, and 111. The set $H = \{000, 001, 101, 111\}$ including labels of these leaves is a solution of $G$.

The HAPIN algorithm performs $m$ iterations. In the $i$th iteration, alleles at site $i$ for all haplotypes are inferred (all ants move from nodes at level $i - 1$ to nodes at level $i$). Let $H_o$ ($H_e$) be the list of haplotypes whose genotypes are homozygous (heterozygous) at site $i$. In the $i$th iteration, the HAPIN algorithm performs the two following steps: the homozygous step and the heterozygous step. The homozygous step determines alleles at site $i$ for all haplotypes of $H_o$. It is straightforward to determine these alleles. The heterozygous step determines alleles at site $i$ for all haplotypes of $H_e$. The determination of each allele is guided by both the pheromone trail and heuristic information. The details of the HAPIN algorithm are described in Algorithm 1.

Figure 2 illustrates an example of three genotypes $G = \{g^1, g^2, g^3\} = \{121, 002, 221\}$. Haplotypes $h^{1a}, h^{1b}, h^{2a}, h^{2b}, h^{3a}, h^{3b}$ follow branches to four leaves: 000 ($h^{2a}$), 001 ($h^{2b}$, $h^{3a}$), 101 ($h^{1a}$), and 111 ($h^{1b}$, $h^{3b}$). Thus, the haplotype set $H = \{000, 001, 101, 111\}$ is a solution of $G$.

3.2.1 Pheromone trail update

Pheromone trail update is a crucial step in ant colony optimization as it guides ants to find good solutions. The heterozygous step uses pheromone trail $\tau^s_{iv}$ as a guide to choose between alleles $h^{sa}_i$ and $h^{sb}_i$. We use the smooth max-min ant
system method (Do et al. 2008), a refined version of the well-known pheromone boundary rule (Stützle and Hoos 2000), to update the pheromone trail. Let $H = \{h^1, \ldots, h^k\}$ be a solution which is used to update the pheromone trail. Consider genotype $g^s \in G$ resolved by haplotypes $h^{sa}, h^{sb} \in H$; the pheromone trail $\tau^s_{iv}$ ($i = 1, \ldots, m$; $v \in \{0, 1\}$) is updated as follows:

\[
\tau^s_{iv} = (1 - \rho)\,\tau^s_{iv} +
\begin{cases}
\rho\,\tau_{\min} & \text{if } h^{sa}_i \neq v \\
\rho\,\tau_{\max} & \text{if } h^{sa}_i = v
\end{cases}
\tag{4}
\]

in which $\rho \in (0, 1]$ is the evaporation parameter; $\tau_{\min}$ and $\tau_{\max}$ are the lower bound and upper bound of the pheromone trail, respectively. These parameters are set as in Table 1.

Algorithm 1: HAPIN Algorithm
Input: A set of $n$ genotypes with $m$ sites $G = \{g^1, \ldots, g^n\}$ where $g^s = g^s_1 \ldots g^s_m$
Output: A haplotype solution $H = \{h^1, \ldots, h^k\}$ of $G$
Begin
    Set a list of $n$ initially undetermined haplotype pairs $(h^{1a}, h^{1b}), \ldots, (h^{na}, h^{nb})$ to the root of the tree
    for $i = 1 \to m$   // In the $i$th iteration, alleles of all haplotypes at site $i$ are determined
        Homozygous step: Homozygous genotypes at site $i$ will be processed in this step. Consider a homozygous genotype $g^s$ at site $i$; alleles $h^{sa}_i$ and $h^{sb}_i$ are simply assigned equal to $g^s_i$. Specifically, if $g^s_i = 0$, we assign $h^{sa}_i = h^{sb}_i = 0$ (haplotypes $h^{sa}$ and $h^{sb}$ follow the 0-branches to the left nodes). Otherwise, we assign $h^{sa}_i = h^{sb}_i = 1$ (haplotypes $h^{sa}$ and $h^{sb}$ follow the 1-branches to the right nodes)
        Heterozygous step: Heterozygous genotypes at site $i$ will be processed in this step. Consider a heterozygous $g^s$ at site $i$; allele $h^{sa}_i$ will be assigned to $v$, $v \in \{0, 1\}$. The probability $P^s_i(v)$ that $h^{sa}_i$ is assigned $v$ ($h^{sa}$ follows the $v$-branch; $h^{sb}$ follows the $(1 - v)$-branch) is defined as follows:

        \[
        P^s_i(v) = \frac{(\tau^s_{iv})^{\alpha}\,(\eta^s_{iv})^{\beta}}{(\tau^s_{i0})^{\alpha}\,(\eta^s_{i0})^{\beta} + (\tau^s_{i1})^{\alpha}\,(\eta^s_{i1})^{\beta}}
        \tag{3}
        \]

        in which $\alpha$, $\beta$ are the relative influence coefficients of the pheromone trail and of the heuristic information, respectively. These parameters are set as in Table 1
    Collect labels of all active leaves to form a haplotype solution $H$
End

Table 1 The ParamILS program (Hutter et al. 2007) was used to determine good parameter values for ACOHAP. Value: the parameter values used in the tuning process (bold numbers are default values)

Parameter   Value                                            Tuned value   Description
N_ants      {10, 15, 20, 25, 30, 40, 50}                     20            Number of ants
f           {0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0}    3.0           τ_max / τ_min = f × m × n
α           {1.0, 1.5, 2.0}                                  1.0           Relative influence of pheromone trail
β           {1.0, 1.5, 2.0}                                  1.5           Relative influence of heuristic information
ρ           {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}    0.3           Pheromone evaporation factor

3.2.2 Heuristic information estimation

For a heterozygous genotype $g^s$ at site $i$, the heuristic information $\eta^s_{iv}$ is used as a guide to determine alleles $h^{sa}_i$ and $h^{sb}_i$. We denote by $H_i$ the list of haplotypes whose alleles at site $i$ are already determined. This haplotype list is used to estimate the heuristic information $\eta^s_{iv}$.

Two partially determined haplotypes $h$ and $h'$ are called compatible if they can follow branches to the same leaf. For example, the two haplotypes $h^{2b}$ and $h^{3a}$ in Fig. 2 are compatible because they follow branches to the same leaf 001. We use the heuristic that haplotypes $h^{sa}$, $h^{sb}$ should follow branches such that they will be compatible with as many haplotypes of $H_i$ as possible.

We denote by $c^{sa}_v$ the number of haplotypes of $H_i$ which are compatible with $h^{sa}$ when following the $v$-branch, $v \in \{0, 1\}$ ($c^{sb}_v$ is defined analogously for $h^{sb}$). If $c^{sa}_v = 0$, $h^{sa}$ cannot travel with any haplotype of $H_i$ to the same leaf, that is, it will result in a new active leaf. If $h^{sa}$ follows the $v$-branch ($h^{sb}$ follows the $(1 - v)$-branch), the minimum number of new active leaves $t^s_v$ resulting from $h^{sa}$ and $h^{sb}$ can be determined as follows:

\[
t^s_v =
\begin{cases}
2 & \text{if } c^{sa}_v = 0 \text{ and } c^{sb}_{1-v} = 0 \\
1 & \text{if exactly one of } c^{sa}_v,\ c^{sb}_{1-v} \text{ is } 0 \\
0 & \text{otherwise}
\end{cases}
\tag{5}
\]

Haplotypes $h^{sa}$ and $h^{sb}$ tend to follow branches which minimize the minimum number of new active leaves and maximize the number of compatible haplotypes. To this end, the heuristic information $\eta^s_{i0}$ and $\eta^s_{i1}$ are estimated as follows:

\[
(\eta^s_{i0}, \eta^s_{i1}) =
\begin{cases}
(nm,\ 1) & \text{if } t^s_0 < t^s_1 \\
(1,\ nm) & \text{if } t^s_0 > t^s_1 \\
\big((c^{sa}_0 + c^{sb}_1 + 1)m,\ (c^{sa}_1 + c^{sb}_0 + 1)m\big) & \text{if } t^s_0 = t^s_1
\end{cases}
\tag{6}
\]

3.3 The ACOHAP algorithm

We propose an ACO algorithm, named ACOHAP, to search for a good solution $H$ of $G$. The key component of ACOHAP is the HAPIN algorithm, which determines one solution at each run. Local searches are typically coupled with ACO algorithms to improve solutions. ACOHAP uses the so-called stochastic first improvement rule method (Di Gaspero and Roli 2008) to improve solutions generated by the HAPIN algorithm. This simple local search improves a solution by replacing two current haplotypes by a new one if possible. The improvement process is repeated until no further replacement is found.

The ACOHAP algorithm performs the HAPIN algorithm several times to find a good solution. Since the HIPP problem assumes that sites are independent, the order in which sites are inferred is randomly specified for each run of the HAPIN algorithm to enforce search space exploration. The ACOHAP algorithm is described in Algorithm 2.

The complexities of both the HAPIN algorithm and the stochastic first improvement rule algorithm are $O(n^2 m)$. Thus, the complexity of the inference step in the ACOHAP algorithm is $O(n^2 m)$ for each run. The overall complexity of the ACOHAP algorithm is $O(N_{loops} N_{ants} n^2 m)$.

Algorithm 2: ACOHAP Algorithm
Data: A list of $n$ genotypes with $m$ sites $G = \{g^1, \ldots, g^n\}$ where $g^s = g^s_1 \ldots g^s_m$
Result: The best found solution $H = \{h^1, \ldots, h^k\}$
begin
    // Initialization step
    – Initialize pheromone trail
    – Set the currently best solution $H_{best}$ = undefined
    – Initialize the number of loops $N_{loops}$
    repeat
        $H_{local}$ = undefined
        for $p \leftarrow 1$ to $N_{ants}$
            // Inference step
            – Specify a random inferring order of sites $S = (s_1, \ldots, s_m)$
            – Perform the HAPIN algorithm to determine solution $H_p$ using $S$, i.e. alleles of all haplotypes at site $s_i$ are determined in the $i$th iteration
            – Perform the stochastic first improvement rule algorithm to improve the obtained solution $H_p$
            – If $H_p$ is better than $H_{local}$ then update $H_{local} = H_p$
        // Updating step
        – Use $H_{local}$ to update the pheromone trail
        – If $H_{local}$ is better than the currently best solution $H_{best}$ then update $H_{best} = H_{local}$
        – Increase the number of loops $N_{loops}$ by one
        // Restarting step
        – Reset the pheromone trail if no improvement is found after 30 consecutive iterations
    until (the running time exceeds a given time limit)
    // Return the best solution found $H_{best}$
    Return $H_{best}$
end

4 Experimental results

We compared ACOHAP to the currently best methods, RPoly version 1.2.1 (Graça et al. 2008),¹ CollHaps (Tininini et al. 2010),² and PTG (Li et al. 2005)³ on both small and large data sets. All experiments were conducted on a PC cluster of 24 nodes (AMD 2.2 GHz, 48 GB RAM).

¹ http://sat.inesc-id.pt/~assg/rpoly/
² http://www.iasi.cnr.it/~liuzzi/BIOCOMP/SNP/
³ http://doc.aporc.org/wiki/PTG

RPoly is an exact method which uses pseudo-Boolean optimization techniques to find optimal solutions. The running time of RPoly was set to 100000 seconds (∼28 h) for each problem instance. RPoly returns an approximate solution if it cannot find an optimal solution for a given problem instance after 100000 seconds. CollHaps is a heuristic method which generalizes the well-known Clark's rule method to solve the HIPP problem. PTG is a very fast heuristic algorithm. Both CollHaps and ACOHAP require long running times to converge to optimal solutions. In our experiments, the running time limit for both ACOHAP and CollHaps was set to 1000 seconds. In addition to the time limit of 100000 seconds, we also tested the performance of RPoly with a time limit of 1000 seconds, called RPoly1000, for each problem instance.

We used the ParamILS program (Hutter et al. 2007) to determine good parameter settings for ACOHAP. Table 1 presents the parameters and their values as used in the tuning process. The parameter settings for ACOHAP are obtained using ParamILS
on artificially generated benchmarks (SU1, SU2, SU3, and SU-100kb) from the International HapMap Consortium (Marchini et al. 2006). These parameters are given in Table 1 and are used to test the performance of ACOHAP on all data sets. Note that ACOHAP was applied only once to each problem instance.

Unfortunately, the software for the two-level ACO method (ACO-HI+) (Benedettini et al. 2008) is no longer available for testing (as communicated by the first author of Benedettini et al. 2008). However, we can compare ACOHAP with ACO-HI+ on 100 problem instances of the SU2 data set (Marchini et al. 2006) for which results obtained with ACO-HI+ are available (Benedettini et al. 2008).

4.1 Small data sets

We tested the different methods on nine small data sets including four artificially generated benchmarks and five biological data sets.

4.1.1 Artificially generated benchmarks

We used artificially generated benchmarks (SU1, SU2, SU3, SU-100kb)⁴ from the International HapMap Consortium (Marchini et al. 2006) to assess the performance of the different methods. SU1 was generated using a constant recombination rate across the whole region, a constant population size, and random mating. It contains 100 problem instances of 90 genotypes. SU2 was generalized from SU1 in the sense that the recombination rate varies across the region. SU3 is the same as SU2 except that the demography model is consistent with the white Americans model. The small data set SU-100kb contains 29 problem instances each consisting of 90 genotypes. These four data sets together contain 329 different problem instances. They are summarized in Table 2.

Results from RPoly, ACOHAP and CollHaps for the SU data sets are presented in Table 3. The PTG method is not applicable to multiple problem instances. Therefore, we cannot assess its performance for these 329 artificial problem instances. The sum of the objective function values of the solutions generated by ACOHAP for all 329 problem instances is 42305, which is smaller than those from
RPoly (42823) and CollHaps (42852). ACOHAP found optimal solutions for 303 (92 %) out of 329 problem instances. RPoly found optimal solutions for these and 17 other problem instances (320 in total). However, optimal solutions from RPoly for these 17 problem instances are only slightly better than those from ACOHAP.

⁴ http://www.stats.ox.ac.uk/~marchini/phaseoff.html

Table 2 Artificially generated benchmarks (SU1, SU2, SU3, SU-100kb) from the International HapMap Consortium (Marchini et al. 2006). These four data sets contain 329 different problem instances each consisting of 90 genotypes

Data set    #Problem instances   #Genotypes (n)   Genotype length (m)
SU-100kb    29                   90               18
SU1         100                  90               179
SU2         100                  90               171
SU3         100                  90               187

Table 3 Results obtained with RPoly, RPoly1000, ACOHAP and CollHaps for the SU data sets; #Opts: The number of optimal solutions; #Haps: The number of haplotypes

Data set    RPoly            RPoly1000        ACOHAP           CollHaps
            #Opts   #Haps    #Opts   #Haps    #Opts   #Haps    #Opts   #Haps
SU-100kb    23      1346     21      1430     23      1035     23      1036
SU1         100     11961    98      12142    98      11963    24      12186
SU2         100     14794    100     14794    97      14797    56      14873
SU3         97      14722    91      15158    85      14510    36      14757
Summary     320     42823    310     43524    303     42305    139     42852

Table 4 Results obtained with ACOHAP and CollHaps on artificially generated benchmarks. ACOHAP < CollHaps: ACOHAP is better than CollHaps; ACOHAP = CollHaps: ACOHAP is as good as CollHaps; ACOHAP > CollHaps: ACOHAP is worse than CollHaps

Data set    #Problem instances    #Problem instances    #Problem instances
            ACOHAP < CollHaps     ACOHAP = CollHaps     ACOHAP > CollHaps
SU-100kb    2 (7 %)               25 (86 %)             2 (7 %)
SU1         76 (76 %)             24 (24 %)             0 (0 %)
SU2         44 (44 %)             54 (54 %)             2 (2 %)
SU3         66 (66 %)             34 (34 %)             0 (0 %)

ACOHAP is much better than RPoly on the nine problem instances where RPoly could not find optimal solutions after 100000 seconds. The approximate solutions from RPoly for these nine problem instances are even worse than those from CollHaps. RPoly with a time limit of 1000 seconds (RPoly1000) found optimal solutions for
310 problem instances. Thus, it could not find optimal solutions for 10 problem instances whose optimal solutions were found by RPoly with a time limit of 100000 seconds.

The sum of the objective function values of the solutions generated by ACOHAP (42305) is about 1.3 % smaller than that of CollHaps (42852). ACOHAP is as good as CollHaps for the small data set SU-100kb. However, it is superior to CollHaps for larger data sets, that is, SU1, SU2, and SU3 (see Table 4 for more details). For example, ACOHAP outperforms CollHaps on 76 out of 100 problem instances of the SU1 data set and is equal to CollHaps on the 24 remaining problem instances. CollHaps shows no better results than ACOHAP for the SU1 and SU3 data sets. It is only better than ACOHAP on two out of 100 problem instances of SU2.

Table 5 Five biological data sets from Caucasian European in Utah population (Rosa and Guimarães 2010) are used to assess the performance of different methods. These data sets contain 13 problem instances each consisting of 88 genotypes

Problem instance   #Genotypes (n)   Genotype length (m)
CEU-100-1          88               100
CEU-100-2          88               100
CEU-100-3          88               100
CEU-200-1          88               200
CEU-200-2          88               200
CEU-200-3          88               200
CEU-400-1          88               400
CEU-400-2          88               400
CEU-400-3          88               400
CEU-800-1          88               800
CEU-800-2          88               800
CEU-800-3          88               800
CEU-1600           88               1600

As we said, software for the two-level ACO method (ACO-HI+) (Benedettini et al. 2008) is no longer available. We can compare ACOHAP with ACO-HI+ only using published results on the SU2 data set including 100 problem instances.⁵ Overall, ACOHAP is better than ACO-HI+ on these 100 problem instances. The sum of the solution values of ACOHAP is 14797, which is about 2 % smaller than that of ACO-HI+ (15102). The sum of solution values of ACO-HI+ (15102) is even worse than that of CollHaps (14873).

4.1.2 Biological data

We assessed the performance of the methods on five biological data sets from the Caucasian European in Utah (CEU) populations (Rosa and Guimarães 2010). These
data sets contain 13 problem instances each consisting of 88 genotypes (see Table 5 for more details). All five methods are applicable to these biological data sets.

Results obtained with the five methods on biological data sets are presented in Table 6. ACOHAP is always better than or at least as good as CollHaps and PTG. ACOHAP outperforms CollHaps on eight out of 13 problem instances. It is as good as CollHaps on the five other problem instances. ACOHAP is better than PTG on nine out of 13 problem instances. For example, the solution generated by ACOHAP for the CEU-100-3 problem instance consists of only 135 haplotypes compared to 169 haplotypes for PTG.

RPoly found optimal solutions for 12 out of 13 problem instances. It could not find an optimal solution for CEU-100-2. The value of the approximate solution found for this problem instance after 100000 seconds is 138, which is much worse than that of ACOHAP (80), CollHaps (81) or PTG (88). ACOHAP found optimal solutions for 10 out of 13 problem instances. ACOHAP is therefore slightly worse than RPoly on three problem instances (the difference between two solutions is only one haplotype). RPoly with 1000 seconds could not find optimal solutions for CEU-100-2 and CEU-400-1. The approximate solutions for these problem instances after 1000 seconds are much worse than those from the other methods.

⁵ Even though the results table of Benedettini et al. (2008) presents results of ACO-HI+ on the four SU data sets, only results concerning SU2 can be used for our comparisons. This is because of two reasons. First, results of ACO-HI+ for the SU1 data set are incorrectly reported in Benedettini et al. (2008); in fact, the sum of optimal solution values for the SU1 data set is 11961, which is much higher than the 2453 reported in that table. Second, the table does not report the sum of the solution values of ACO-HI+ for all problem instances of the SU-100kb and SU3 data sets; this makes a comparison with our results impossible.

Table 6 Results obtained with the five considered methods on biological problem instances. Entries represent the number of haplotypes of the best solution found by each method (columns) for each problem instance (rows). Smaller numbers indicate better solutions. Bold numbers are the best solution values

Real data                    RPoly   RPoly1000   ACOHAP   CollHaps   PTG
CEU-100-1                    12      12          12       12         12
CEU-100-2                    138     138         80       81         88
CEU-100-3                    134     134         135      139        169
CEU-200-1                    38      38          38       39         57
CEU-200-2                    120     120         120      125        161
CEU-200-3                    140     140         141      145        169
CEU-400-1                    84      135         85       87         88
CEU-400-2                    143     143         143      146        172
CEU-400-3                    170     170         170      170        175
CEU-800-1                    176     176         176      176        176
CEU-800-2                    162     162         162      164        175
CEU-800-3                    175     175         175      175        175
CEU-1600                     176     176         176      176        176
Total number of haplotypes   1668    1719        1612     1635       1793

4.2 Large data sets

We also assessed the performance of these methods on large artificial data sets. Tininini et al. proposed a method (Tininini et al. 2010) to generate artificial data sets from 178 unique haplotypes with a length of 103 SNPs (Daly et al. 2001). We used their program and these 178 unique haplotypes to generate 100 large artificial problem instances of sizes (number of genotypes) ranging from 100 to 300 (see Table 7).

Table 7 We used the software described in Tininini et al. (2010) to generate 100 large artificial problem instances of sizes (number of genotypes) ranging from 100 to 300

Large artificial data sets   #Problem instances   #Genotypes (n)   Genotype length (m)
SIM-100                      20                   100              103
SIM-150                      20                   150              103
SIM-200                      20                   200              103
SIM-250                      20                   250              103
SIM-300                      20                   300              103

Table 8 shows that ACOHAP is superior to both RPoly and CollHaps on these large data sets (PTG is not applicable to multiple problem instances). RPoly could not find optimal solutions for any data set after 100000 seconds. The approximate solutions of RPoly after 100000 seconds are much worse than those of ACOHAP and CollHaps after 1000 seconds. Specifically, the average number of haplotypes per solution for RPoly is 351.43, about two times higher than that of ACOHAP (162.46). The poor solutions of RPoly on large data sets are expected, as reported in previous studies (Gusfield and Orzack 2005; Brown and Harrower 2006). RPoly with 1000 seconds performs as well as RPoly with 100000 seconds on large artificial data sets.

Table 8 Results from different methods on large artificial problem instances. #Best: The number of cases in which ACOHAP (CollHaps, RPoly, RPoly1000) found the best solutions; Avg: The average number of haplotypes per solution

Data set   ACOHAP           CollHaps         RPoly            RPoly1000
           #Best   Avg      #Best   Avg      #Best   Avg      #Best   Avg
SIM-100    20      124.75           126.95           186.45           186.45
SIM-150    20      160.20           174.05           270.70           270.70
SIM-200    20      172.95           203.05           353.05           353.05
SIM-250    20      176.65           205.00           434.15           434.15
SIM-300    20      177.75           193.80           512.80           512.80
Summary    100     162.46           180.57           351.43           351.43

As can be seen in Table 8, ACOHAP outperforms CollHaps on these 100 large artificial problem instances. The average number of haplotypes per solution for ACOHAP is about 10 % smaller than that of CollHaps. ACOHAP is better than CollHaps on 98 out of 100 problem instances. They are equally good on the two other problem instances.

Since we used 178 unique haplotypes to generate these problem instances, the upper bound on the size of an optimal solution for these problem instances is 178. The average number of haplotypes per solution from ACOHAP is 162.46, which is smaller than this upper bound. ACOHAP requires about 178 haplotypes to explain a problem instance from the two largest data sets SIM-250 (250 genotypes) and SIM-300 (300 genotypes). These results indicate that ACOHAP found solutions which might be close to optimal solutions.

4.3 The robustness of ACOHAP algorithm

We also made a preliminary assessment of the robustness of ACOHAP. To this end, we conducted 25 independent runs of ACOHAP for each problem instance of the five biological data sets. Table 9 presents
the number of haplotypes in the best and worst solutions, and the mean and standard deviation over 25 independent runs for each problem instance. All standard deviations are rather small. They are always smaller than 1.0. In 10 out of 13 problem instances, the standard deviations are zero. The small standard deviations suggest the robustness of ACOHAP.

Table 9 The robustness of ACOHAP algorithm on 13 problem instances of the five biological data sets. The mean and standard deviation for each problem were calculated from 25 independent runs of ACOHAP. 'Best' and 'Worst' represent the best and worst solutions found in 25 independent runs

Problem instance  Mean    Best  Worst  Standard deviation
CEU-100-1          12.0     12     12  0.0
CEU-100-2          80.0     80     80  0.0
CEU-100-3         134.56   134    135  0.51
CEU-200-1          38.00    38     38  0.0
CEU-200-2         120.20   120    121  0.41
CEU-200-3         140.88   140    141  0.33
CEU-400-1          85.00    85     85  0.0
CEU-400-2         143.0    143    143  0.0
CEU-400-3         170.0    170    170  0.0
CEU-800-1         176.00   176    176  0.0
CEU-800-2         162.00   162    162  0.0
CEU-800-3         175.00   175    175  0.0
CEU-1600          176.00   176    176  0.0

5 Conclusions

We proposed the ACO-based ACOHAP algorithm to efficiently infer haplotypes from genotypes under the pure parsimony criterion. The principal advantage of ACOHAP over the 2-level ACO algorithm from the literature (Benedettini et al. 2008) is that ACOHAP can efficiently estimate the heuristic information for each allele to guide ants towards good solutions. Its low computational complexity enables ACOHAP to infer haplotypes from large genotype data sets.

Experiments with both small and large data sets show that ACOHAP outperforms the currently best heuristic methods. RPoly, the currently best exact method, is only slightly better than ACOHAP on small problem instances for which it could find optimal solutions. However, it could not find optimal solutions for large problem instances. The approximate solutions from RPoly after 100000 seconds (∼28 h) are much worse than those of ACOHAP and CollHaps after
only 1000 seconds. The poor results of RPoly on large data sets are expected, as reported by previous studies (Gusfield and Orzack 2005; Brown and Harrower 2006). The good performance of ACOHAP in comparison to the other methods tested demonstrates the power of ACO to solve the haplotype inference by pure parsimony problem for both small and large data sets.

Acknowledgements We thank Gavin Band, Quang Le, and Khoi Le for comments and careful proofreading. We appreciate support from Graça and Tininini for providing us their programs. We also thank anonymous referees and editors for helpful comments and corrections. This work is partly supported by Vietnam National Science and Technology Fund (Nafosted: 102.01-2011.21).

References

Benedettini, S., Roli, A., & Di Gaspero, L. (2008). Two-level ACO for haplotype inference under pure parsimony. In M. Dorigo, M. Birattari, C. Blum, M. Clerc, T. Stützle, & A. Winfield (Eds.), Lecture notes in computer science: Vol. 5217. Ant colony optimization and swarm intelligence (pp. 179–190). The 6th international workshop, ANTS 2008. Berlin/Germany: Springer.
Brown, D. G., & Harrower, I. M. (2006). Integer programming approaches to haplotype inference by pure parsimony. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3(2), 141–154.
Clark, S. (1990). Inference of haplotypes from PCR-amplified samples of diploid populations. Molecular Biology and Evolution, 7, 111–122.
Daly, M. J., Rioux, J. D., Schaffner, S. F., Hudson, T. J., & Lander, E. S. (2001). High-resolution haplotype structure in the Human genome. Nature Genetics, 29(2), 229–232.
Di Gaspero, L., & Roli, A. (2008). Stochastic local search for large-scale instances of the haplotype inference problem by pure parsimony. Journal of Algorithms, 63(1–3), 55–69.
Do, D. D., Dinh, Q. H., & Hoang, X. H. (2008). On the pheromone update rules of ant colony optimization approaches for the job shop scheduling problem. In T. D. Bui, T. V. Ho, & Q. T. Ha (Eds.), Lecture notes in computer science:
Vol. 5357. The 11th pacific rim international conference on multi-agents: intelligent agents and multi-agent systems (pp. 153–160). Heidelberg: Springer.
Dorigo, M., & Stützle, T. (2004). Ant colony optimization. Cambridge: MIT Press.
Graça, A., Marques-silva, J., Lynce, I., & Oliveira, A. L. (2007). Efficient haplotype inference with pseudo-Boolean optimization. In H. Anai, K. Horimoto, & T. Kutsia (Eds.), Lecture notes in computer science: Vol. 4545. Algebraic biology 2007 (pp. 125–139). Heidelberg: Springer.
Graça, A., Marques-silva, J., Lynce, I., & Oliveira, A. L. (2008). Efficient haplotype inference with combined CP and OR techniques. In L. Perron & M. Trick (Eds.), Lecture notes in computer science: Vol. 5015. The 5th international conference on integration of AI and OR techniques in constraint programming for combinatorial optimization problems, CPAIOR 2008 (pp. 308–312). Heidelberg: Springer.
Graça, A., Lynce, I., Marques-Silva, J., & Oliveira, A. L. (2010). Haplotype inference by pure parsimony: a survey. Journal of Computational Biology, 17(8), 969–992.
Gusfield, D. (2001). Inference of haplotypes from samples of diploid populations: complexity and algorithms. Journal of Computational Biology, 8(3), 305–323.
Gusfield, D. (2003). Haplotype inference by pure parsimony. In R. Baeza-Yates, E. Chávez, & M. Crochemore (Eds.), Lecture notes in computer science: Vol. 2676. Combinatorial pattern matching (pp. 144–155). 14th Annual symposium on combinatorial pattern matching, CPM 2003. Heidelberg: Springer.
Gusfield, D., & Orzack, S. H. (2005). Haplotype inference. In S. Aluru (Ed.), Handbook of computational molecular biology (pp. 1–28). Boca Raton: CRC Press.
Hutter, F., Hoos, H., & Stützle, T. (2007). Automatic algorithm configuration based on local search. In H. Robert & A. Howe (Eds.), The twenty-second conference on artificial intelligence (AAAI ’07) (pp. 1152–1157). Menlo Park: AAAI Press.
Istrail, S. (2004). Computational methods for SNPs and haplotype inference. Berlin/Heidelberg: Springer.
Li, Z., Zhou, W., Zhang, X., & Chen,
L. (2005). A parsimonious tree-grow method for haplotype inference. Bioinformatics, 21(17), 3475–3481.
Lynce, I., & Marques-Silva, J. (2008). Haplotype inference with Boolean satisfiability. International Journal Artificial Intelligence Tools, 17, 355–387.
Marchini, J., Cutler, D., Patterson, N., Stephens, M., Eskin, E., Halperin, E., Lin, S., Qin, Z., Munro, H., Abecasis, G., & Donnelly, P. (2006). A comparison of phasing algorithms for trios and unrelated individuals. The American Journal of Human Genetics, 78(3), 437–450.
Rosa, S., & Guimarães, S. (2010). Insights on haplotype inference on large genotype datasets. In E. Ferreira, S. Miyano, & P. Stadler (Eds.), Lecture notes in computer science: Vol. 6268. Advances in bioinformatics and computational biology (pp. 47–58). 5th Brazilian conference on bioinformatics. Heidelberg: Springer.
Stützle, T., & Hoos, H. (2000). MAX-MIN ant system. Future Generation Computer Systems, 16(8), 889–914.
The International Hapmap Consortium (2007). A second generation human haplotype map of over 3.1 million SNPs. Nature, 449, 851–861.
Tininini, L., Bertolazzi, P., Godi, A., & Lancia, G. (2010). CollHaps: a heuristic approach to haplotype inference by parsimony. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(3), 511–523.
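The summary row of Table 8 and the two comparisons drawn from it in Sect. 4.2 (RPoly needing about twice as many haplotypes as ACOHAP, and ACOHAP needing about 10 % fewer than CollHaps) can be recomputed from the per-data-set averages. The Python sketch below is only an illustrative check and is not part of ACOHAP. The names `table8_avg` and `summary_average` are hypothetical, and the unweighted mean equals the overall mean only because every data set contains the same number of instances (20).

```python
# Per-data-set average number of haplotypes, copied from Table 8
# (rows SIM-100, SIM-150, SIM-200, SIM-250, SIM-300; 20 instances each).
table8_avg = {
    "ACOHAP": [124.75, 160.20, 172.95, 176.65, 177.75],
    "CollHaps": [126.95, 174.05, 203.05, 205.00, 193.80],
    "RPoly": [186.45, 270.70, 353.05, 434.15, 512.80],
}


def summary_average(per_set_averages):
    """Unweighted mean over data sets.

    Assumption: all data sets have the same number of instances,
    so this equals the mean over all 100 instances.
    """
    return sum(per_set_averages) / len(per_set_averages)


# Recompute the "Summary" row of Table 8.
summary = {method: summary_average(avgs) for method, avgs in table8_avg.items()}

# How many times more haplotypes RPoly needs than ACOHAP.
rpoly_to_acohap_ratio = summary["RPoly"] / summary["ACOHAP"]

# Relative reduction of ACOHAP with respect to CollHaps.
acohap_vs_collhaps_reduction = (
    summary["CollHaps"] - summary["ACOHAP"]
) / summary["CollHaps"]

if __name__ == "__main__":
    for method, value in summary.items():
        print(f"{method}: {value:.2f}")
    print(f"RPoly / ACOHAP ratio: {rpoly_to_acohap_ratio:.2f}")
    print(f"ACOHAP reduction vs CollHaps: {acohap_vs_collhaps_reduction:.1%}")
```

Computing the ratio and the reduction from the recomputed means, rather than from the rounded summary values, keeps rounding error out of the comparison.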
