Efficient algorithms for the gene tree species tree reconciliation problem


EFFICIENT ALGORITHMS FOR THE GENE TREE-SPECIES TREE RECONCILIATION PROBLEM

ZHENG YU
(B.Sc.(Hons.), Sun Yat-sen University)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF MATHEMATICS
NATIONAL UNIVERSITY OF SINGAPORE
2014

Declaration

I hereby declare that the thesis is my original work and it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in the thesis. This thesis has also not been submitted for any degree in any university previously.

Zheng Yu
15 Aug 2014

Acknowledgements

First and foremost, I would like to thank my advisor Prof. Zhang Louxin, for his guidance, patience, and continuous support in the past five years. Without him, none of this would have been possible. I would also like to thank Prof. Choi Kowk Pui for his encouragement and helpful discussions, and Dr. Wu Taoyang, with whom I worked closely in the first two years of my PhD study. I am indebted to all the members in our group, Dr. David Chew, Dr. Li Si, Dr. Ngoc Hieu Tran, Tian Dechao, and Luo Chang, for sharing ideas with me. Last but not least, I want to thank my parents for their endless love and support, and my wife for bringing happiness into my life while I was pursuing my PhD degree.

Contents

Declaration
Acknowledgements
Summary
1 Introduction
  1.1 The Contribution of The Thesis
  1.2 The Organization of The Thesis
2 Background
  2.1 Phylogenetic Trees
  2.2 The Gene Tree and Species Tree Reconciliation
  2.3 Reconciliation Measures
    2.3.1 The gene duplication cost
    2.3.2 The gene loss cost
    2.3.3 The mutation and affine costs
    2.3.4 The deep coalescence cost
  2.4 Properties of the LCA Reconciliation
    2.4.1 Duplication history
    2.4.2 The linear relationship among three reconciliation costs
  2.5 The Robinson-Foulds Distance
  2.6 The General Reconciliation Problem
    2.6.1 The species tree inference problem
    2.6.2 The general reconciliation problem
3 The Gene Tree Refinement Problem
  3.1 A Dynamic Programming Method
  3.2 Irreducible Duplication History
  3.3 Compression of Child-Image Subtrees
    3.3.1 Compressed child-image subtrees
    3.3.2 The compression algorithm
  3.4 Linear Time Algorithms for Different Reconciliation Costs
    3.4.1 Minimizing the gene loss and deep coalescence costs
    3.4.2 Minimizing the gene duplication cost
    3.4.3 Minimizing the mutation cost
  3.5 The Affine Cost
    3.5.1 Wagner parsimony problem
    3.5.2 The extended Csűrös algorithm
  3.6 Experiments
  3.7 Remarks
4 The Species Tree Refinement Problem
  4.1 The Restricted SPR Local Search Problem
    4.1.1 Node coloring
    4.1.2 The longest chain and an equivalence relation
    4.1.3 An algorithm for finding the longest chains
    4.1.4 Minimizing the duplication cost
  4.2 Refine Non-binary Species Tree
  4.3 Experiments
    4.3.1 Simulated datasets
    4.3.2 Contracting weakly supported branches
    4.3.3 The effect of missing taxa
    4.3.4 Running time
    4.3.5 Biological datasets
5 Conclusion and Future Work

[Figure 4.17: The species tree computed by TxT-SPR on the fruitfly gene tree dataset [41]. Leaf labels: dmoj, dvir, dgri, dwil, dper, dpse, dana, dmel, dsim, dsec, dere, dyak.]

These gene trees were unrooted with support values on branches. Since some of these gene trees were not uniquely labeled, PhyloNet-MDC could not be applied to this dataset. When TxT-SPR was applied to this dataset, the inferred species tree (Figure 4.17) had one missing branch compared to the fruitfly species tree in [41]. However, the inferred species tree had the same tree topology as the fruitfly species tree [41] in terms of unrooted trees. This may be caused by the fact that the input gene trees were unrooted.

Chapter 5
Conclusion and Future Work

In summary, this thesis is devoted to the general gene tree and species tree reconciliation problem, especially the non-binary gene tree and species tree refinement problems, under different reconciliation cost models. For the non-binary gene tree refinement
problem, we designed fast algorithms for different reconciliation cost functions, based on two innovations: the irreducible duplication history and subtree compression. Our algorithms improve the time complexities of the previous best algorithms by an order of magnitude. We developed linear-time algorithms, as well as structural information about the optimal solutions, for four popular reconciliation cost functions. These results generalize the standard LCA reconciliation for binary trees. For the gene tree refinement problem under the affine cost model, we extended the Csűrös algorithm, which is designed for the asymmetric Wagner parsimony problem [26]. The extended Csűrös algorithm solves the gene tree refinement problem under the affine cost model in quadratic time, and its running time is very close to those of our linear-time algorithms in practice.

The species tree refinement problem is much harder than the gene tree refinement problem. In fact, the species tree inference problem, as a special case of the species tree refinement problem, is NP-hard for many reconciliation cost functions [7, 55, 88]. Therefore, in Chapter 4, we designed a heuristic algorithm for inferring a species tree from a collection of non-binary gene trees using the SPR local search method under the gene duplication cost model. Our method is based on the method of DupTree [8, 81, 82], which cannot explicitly handle non-binary gene trees. Our method benefits from our subtree compression algorithm, as well as the structural information about the non-binary gene tree refinement with optimal duplication cost, both of which were studied in Chapter 3. One exciting result is that our method has the same time complexity as DupTree. Another interesting result is that, when used to infer the species tree, our method achieves higher accuracy by contracting weakly supported branches in gene trees. Our method is also applicable to the species tree refinement problem, and therefore enables us to solve the general reconciliation
problem which handles non-binary gene and species trees simultaneously.

There are several limitations of this work. First, we have not included horizontal gene transfer (HGT) in any of our studies. Even for binary gene and species trees, the reconciliation problem considering HGT is much more complicated [10, 20, 25, 43, 62], especially when time consistency is required [79]. Second, all the studies in this thesis are parsimony based. Parsimony based methods are generally more efficient and suitable for large datasets [5, 17]. Probability based methods, such as [4, 37] for the reconciliation problem, and [3, 53] for the species tree inference problem, are likely to provide more accurate results. There is a trade-off between the efficiency of parsimony based methods and the accuracy of probability based methods. Lastly, our studies focus on the comparison of gene and species trees. However, a complete phylogenetic analysis consists of three elements: the gene tree, the species tree, and the sequence data. Recently, unified models that combine the sequence evolution and the gene evolution have been proposed to study these three elements simultaneously [1, 16, 68, 69]. These are definitely interesting problems for future work.

Bibliography

[1] Örjan Åkerborg, Bengt Sennblad, Lars Arvestad, and Jens Lagergren. Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. Proceedings of the National Academy of Sciences of the United States of America, 106(14):5714–5719, 2009.
[2] Benjamin L. Allen and Mike Steel. Subtree transfer operations and their induced metrics on evolutionary trees. Annals of Combinatorics, 5(1):1–15, 2001.
[3] Cécile Ané, Bret Larget, David A. Baum, Stacey D. Smith, and Antonis Rokas. Bayesian estimation of concordance among gene trees. Molecular Biology and Evolution, 24(2):412–426, 2007.
[4] Lars Arvestad, Ann-Charlotte Berglund, Jens Lagergren, and Bengt Sennblad. Bayesian gene/species tree reconciliation and orthology analysis using MCMC.
Bioinformatics, 19(suppl 1):i7–i15, 2003.
[5] Mukul Bansal, J. Gordon Burleigh, and Oliver Eulenstein. Efficient genome-scale phylogenetic analysis under the duplication-loss and deep coalescence cost models. BMC Bioinformatics, 11(Suppl 1):S42, 2010.
[6] Mukul S. Bansal and Oliver Eulenstein. Algorithms for genome-scale phylogenetics using gene tree parsimony. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10(4):939–956, 2013.
[7] Mukul S. Bansal and Ron Shamir. A note on the fixed parameter tractability of the gene-duplication problem. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(3):848–850, May 2011.
[8] Mukul S. Bansal, J. Gordon Burleigh, Oliver Eulenstein, and André Wehe. Heuristics for the gene-duplication problem: a Θ(n) speed-up for the local search. In Proceedings of the 11th Annual International Conference on Research in Computational Molecular Biology (RECOMB), pages 238–252. Springer, 2007.
[9] Mukul S. Bansal, J. Gordon Burleigh, Oliver Eulenstein, and David Fernandez-Baca. Robinson-Foulds supertrees. Algorithms for Molecular Biology, 5:18, 2010.
[10] Mukul S. Bansal, Eric J. Alm, and Manolis Kellis. Reconciliation revisited: Handling multiple optima when reconciling with duplication, transfer, and loss. Journal of Computational Biology, 20(10):738–754, 2013.
[11] Md Shamsuzzoha Bayzid, Siavash Mirarab, and Tandy Warnow. Inferring optimal species trees under gene duplication and loss. In Proc. Pacific Symposium on Biocomputing, volume 18, pages 250–261. World Scientific, 2013.
[12] Michael A. Bender and Martin Farach-Colton. The LCA problem revisited. In LATIN 2000: Theoretical Informatics, pages 88–94. Springer, 2000.
[13] Ann-Charlotte Berglund-Sonnhammer, Pär Steffansson, Matthew J. Betts, and David A. Liberles. Optimal gene trees from sequences and species trees using a soft interpretation of parsimony. Journal of Molecular Evolution, 63(2):240–250, 2006.
[14] Magnus Bordewich and Charles Semple. On the computational complexity of the rooted
subtree prune and regraft distance. Annals of Combinatorics, 8(4):409–423, 2005.
[15] Luis Boto. Horizontal gene transfer in evolution: facts and challenges. Proceedings of the Royal Society B: Biological Sciences, 277(1683):819–827, 2010.
[16] Bastien Boussau, Gergely J. Szöllősi, Laurent Duret, Manolo Gouy, Eric Tannier, and Vincent Daubin. Genome-scale coestimation of species and gene trees. Genome Research, 23(2):323–330, 2013.
[17] J. Gordon Burleigh, Mukul S. Bansal, Oliver Eulenstein, Stefanie Hartmann, André Wehe, and Todd J. Vision. Genome-scale phylogenetics: inferring the plant tree of life from 18,896 gene trees. Systematic Biology, 60(2):117–125, 2011.
[18] Wen-Chieh Chang and Oliver Eulenstein. Reconciling gene trees with apparent polytomies. In Proceedings of the 12th Annual International Conference on Computing and Combinatorics (COCOON), pages 235–244, 2006.
[19] Wen-Chieh Chang, Paweł Górecki, and Oliver Eulenstein. Exact solutions for species tree inference from discordant gene trees. Journal of Bioinformatics and Computational Biology, 11(5), 2013.
[20] Michael A. Charleston. Jungles: a new solution to the host/parasite phylogeny reconciliation problem. Mathematical Biosciences, 149(2):191–223, 1998.
[21] Ruchi Chaudhary, Mukul Bansal, Andre Wehe, David Fernandez-Baca, and Oliver Eulenstein. iGTP: A software package for large-scale gene tree parsimony analysis. BMC Bioinformatics, 11(1):574, 2010.
[22] Ruchi Chaudhary, John Burleigh, and David Fernandez-Baca. Inferring species trees from incongruent multi-copy gene trees using the Robinson-Foulds distance. Algorithms for Molecular Biology, 8(1):28, 2013.
[23] Cedric Chauve and Nadia El-Mabrouk. New perspectives on gene family evolution: Losses in reconciliation and a link with supertrees. In Proceedings of the 13th Annual International Conference on Research in Computational Molecular Biology (RECOMB), pages 46–58, 2009.
[24] Kevin Chen, Dannie Durand, and Martin Farach-Colton. Notung: a program for dating gene
duplications and optimizing gene family trees. Journal of Computational Biology, 7(3-4):429–447, 2000.
[25] Zhi-Zhong Chen, Fei Deng, and Lusheng Wang. Simultaneous identification of duplications, losses, and lateral gene transfers. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9:1515–1528, 2012.
[26] Miklós Csűrös. Ancestral reconstruction by asymmetric Wagner parsimony over continuous characters and squared parsimony over distributions. In Proceedings of the International Workshop on Comparative Genomics (RECOMB-CG), pages 72–86, 2008.
[27] Sanjoy Dasgupta, Christos H. Papadimitriou, and Umesh Vazirani. Algorithms. McGraw-Hill, Inc., 2006.
[28] William H. E. Day. Optimal algorithms for comparing trees with labeled leaves. Journal of Classification, 2(1):7–28, 1985.
[29] James H. Degnan and Noah A. Rosenberg. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends in Ecology and Evolution, 24(6):332–340, 2009.
[30] Dannie Durand, Bjarni V. Halldórsson, and Benjamin Vernot. A hybrid micro-macroevolutionary approach to gene tree reconstruction. Journal of Computational Biology, 13:320–335, 2006.
[31] Scott V. Edwards, Liang Liu, and Dennis K. Pearl. High-resolution species trees without concatenation. Proceedings of the National Academy of Sciences of the United States of America, 104(14):5936–5941, 2007.
[32] Oliver Eulenstein, Boris Mirkin, and Martin Vingron. Duplication-based measures of difference between gene and species trees. Journal of Computational Biology, 5(1):135–148, 1998.
[33] James S. Farris. Methods for computing Wagner trees. Systematic Biology, 19:83–92, 1970.
[34] Joseph Felsenstein. Inferring Phylogenies. Sinauer Associates, Sunderland, 2004.
[35] Walter M. Fitch. Distinguishing homologous from analogous proteins. Systematic Biology, 19:99–113, 1970.
[36] Morris Goodman, John Czelusniak, G. William Moore, A. E. Romero-Herrera, and Genji Matsuda. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by
cladograms constructed from globin sequences. Systematic Biology, 28:132–163, 1979.
[37] Paweł Górecki and Oliver Eulenstein. DrML: Probabilistic modeling of gene duplications. Journal of Computational Biology, 21(1):89–98, 2014.
[38] Paweł Górecki and Jerzy Tiuryn. DLS-trees: a model of evolutionary scenarios. Theoretical Computer Science, 359:378–399, 2006.
[39] Roderic Guigo, Ilya Muchnik, and Temple F. Smith. Reconstruction of ancient molecular phylogeny. Molecular Phylogenetics and Evolution, 6(2):189–213, 1996.
[40] Stéphane Guindon, Jean-François Dufayard, Vincent Lefort, Maria Anisimova, Wim Hordijk, and Olivier Gascuel. New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0. Systematic Biology, 59(3):307–321, 2010.
[41] Matthew Hahn. Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution. Genome Biology, 8(7):R141, 2007.
[42] Matthew W. Hahn, Tijl De Bie, Jason E. Stajich, Chi Nguyen, and Nello Cristianini. Estimating the tempo and mode of gene family evolution from comparative genomic data. Genome Research, 15(8):1153–1160, 2005.
[43] Michael T. Hallett and Jens Lagergren. Efficient algorithms for lateral gene transfer problems. In Proceedings of the fifth annual international conference on Computational biology, pages 149–156. ACM, 2001.
[44] Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing, 13(2):338–355, 1984.
[45] John P. Huelsenbeck and Fredrik Ronquist. MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics, 17:754–755, 2001.
[46] Kazutaka Katoh and Daron M. Standley. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Molecular Biology and Evolution, 30(4):772–780, 2013.
[47] Laura S. Kubatko, Bryan C. Carstens, and L. Lacey Knowles. STEM: species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics, 25(7):971–973, 2009.
[48] Chih-Horng
Kuo, John P. Wares, and Jessica C. Kissinger. The apicomplexan whole-genome phylogeny: an analysis of incongruence among gene trees. Molecular Biology and Evolution, 25(12):2689–2698, 2008.
[49] Manuel Lafond, Krister M. Swenson, and Nadia El-Mabrouk. An optimal reconciliation algorithm for gene trees with polytomies. In Ben Raphael and Jijun Tang, editors, Algorithms in Bioinformatics, volume 7534 of Lecture Notes in Computer Science, pages 106–122. Springer Berlin Heidelberg, 2012.
[50] Bret R. Larget, Satish K. Kotha, Colin N. Dewey, and Cécile Ané. BUCKy: Gene tree/species tree reconciliation with Bayesian concordance analysis. Bioinformatics, 26(22):2910–2911, 2010.
[51] Emmanuelle Lerat, Vincent Daubin, Howard Ochman, and Nancy A. Moran. Evolutionary origins of genomic repertoires in bacteria. PLoS Biology, 3(5):e130, 2005.
[52] Liang Liu. BEST: Bayesian estimation of species trees under the coalescent model. Bioinformatics, 24(21):2542–2543, 2008.
[53] Liang Liu and Dennis K. Pearl. Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Systematic Biology, 56(3):504–514, 2007.
[54] Michael Lynch and John S. Conery. The evolutionary fate and consequences of duplicate genes. Science, 290(5494):1151–1155, 2000.
[55] Bin Ma, Ming Li, and Louxin Zhang. From gene trees to species trees. SIAM Journal on Computing, 30(3):729–752, 2000.
[56] Wayne P. Maddison. Gene trees in species trees. Systematic Biology, 46:523–536, 1997.
[57] Fred R. McMorris and Michael A. Steel. The complexity of the median procedure for binary trees. In New Approaches in Classification and Data Analysis, pages 136–140. Springer, 1994.
[58] Michael L. Metzker. Sequencing technologies - the next generation. Nature Reviews Genetics, 11(1):31–46, 2009.
[59] Boris Mirkin, Ilya Muchnik, and Temple F. Smith. A biologically consistent model for comparing molecular phylogenies. Journal of Computational Biology, 2(4):493–507, 1995.
[60] Leon Mirsky. A dual of
Dilworth's decomposition theorem. American Mathematical Monthly, pages 876–877, 1971.
[61] Luay Nakhleh. Computational approaches to species phylogeny inference and gene tree reconciliation. Trends in Ecology and Evolution, 28(12):719–728, 2013.
[62] Luay Nakhleh, Derek Ruths, and Li-San Wang. RIATA-HGT: a fast and accurate heuristic for reconstructing horizontal gene transfer. In Computing and Combinatorics, pages 84–93. Springer, 2005.
[63] Roderic D. M. Page. Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Systematic Biology, 43(1):58–77, 1994.
[64] Roderic D. M. Page. GeneTree: comparing gene and species phylogenies using reconciled trees. Bioinformatics, 14(9):819–820, 1998.
[65] Roderic D. M. Page and Michael A. Charleston. Reconciled trees and incongruent gene and species trees. Mathematical Hierarchies in Biology, 37:57–70, 1997.
[66] Daniel A. Pollard, Venky N. Iyer, Alan M. Moses, and Michael B. Eisen. Widespread discordance of gene trees with species tree in Drosophila: evidence for incomplete lineage sorting. PLoS Genetics, 2(10):e173, 2006.
[67] Morgan N. Price, Paramvir S. Dehal, and Adam P. Arkin. FastTree 2 - approximately maximum-likelihood trees for large alignments. PLoS ONE, 5(3):e9490, 2010.
[68] Matthew D. Rasmussen and Manolis Kellis. A Bayesian approach for fast and accurate gene tree reconstruction. Molecular Biology and Evolution, 28(1):273–290, 2011.
[69] Matthew D. Rasmussen and Manolis Kellis. Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Research, 22(4):755–765, 2012.
[70] D. F. Robinson and Leslie R. Foulds. Comparison of phylogenetic trees. Mathematical Biosciences, 53(1):131–147, 1981.
[71] Antonis Rokas, Barry L. Williams, Nicole King, and Sean B. Carroll. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature, 425(6960):798–804, 2003.
[72] Noah A. Rosenberg and Magnus Nordborg. Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nature
Reviews Genetics, 3(5):380–390, 2002.
[73] Leonidas Salichos and Antonis Rokas. Inferring ancient divergences requires genes with strong phylogenetic signals. Nature, 497(7449):327–331, 2013.
[74] Baruch Schieber and Uzi Vishkin. On finding lowest common ancestors: Simplification and parallelization. SIAM Journal on Computing, 17:1253–1262, 1988.
[75] Maureen Stolzer, Han Lai, Minli Xu, Deepa Sathaye, Benjamin Vernot, and Dannie Durand. Inferring duplications, losses, transfers and incomplete lineage sorting with nonbinary species trees. Bioinformatics, 28(18):i409–i415, 2012.
[76] David L. Swofford. PAUP*: phylogenetic analysis using parsimony (*and other methods), version. Sinauer Associates, Sunderland, Massachusetts, 2003.
[77] Cuong Than and Luay Nakhleh. Species tree inference by minimizing deep coalescences. PLoS Computational Biology, 5(9):e1000501, 2009.
[78] Cuong Than, Derek Ruths, and Luay Nakhleh. PhyloNet: a software package for analyzing and reconstructing reticulate evolutionary relationships. BMC Bioinformatics, 9(1):322, 2008.
[79] Ali Tofigh, Michael Hallett, and Jens Lagergren. Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(2):517–535, 2011.
[80] Benjamin Vernot, Maureen Stolzer, Aiton Goldman, and Dannie Durand. Reconciliation with non-binary species trees. Journal of Computational Biology, 15(8):981–1006, 2008.
[81] André Wehe and John Gordon Burleigh. Scaling the gene duplication problem towards the tree of life. In Hisham Al-Mubaid, editor, BICoB, pages 133–138. ISCA, 2010.
[82] André Wehe, Mukul S. Bansal, J. Gordon Burleigh, and Oliver Eulenstein. DupTree: a program for large-scale phylogenetic analyses using gene tree parsimony. Bioinformatics, 24(13):1540–1541, 2008.
[83] Taoyang Wu and Louxin Zhang. Structural properties of the reconciliation space and their applications in enumerating nearly-optimal reconciliations between a gene tree and a species tree. BMC
Bioinformatics, 12(Suppl 9):S7, 2011.
[84] Yufeng Wu. Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution, 66(3):763–775, 2012.
[85] Jimmy Yang and Tandy Warnow. Fast and accurate methods for phylogenomic analyses. BMC Bioinformatics, 12(Suppl 9):S4, 2011.
[86] Yun Yu, Tandy Warnow, and Luay Nakhleh. Algorithms for MDC-based multi-locus phylogeny inference: Beyond rooted binary gene trees on single alleles. Journal of Computational Biology, 18:1543–1559, 2011.
[87] Louxin Zhang. On a Mirkin-Muchnik-Smith conjecture for comparing molecular phylogenies. Journal of Computational Biology, 4:177–187, 1997.
[88] Louxin Zhang. From gene trees to species trees II: Species tree inference by minimizing deep coalescence events. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8:1685–1691, November 2011.
[89] Yu Zheng and Louxin Zhang. Reconciliation with non-binary gene trees revisited. In Roded Sharan, editor, Proceedings of the 18th Annual International Conference on Research in Computational Molecular Biology (RECOMB), pages 418–432, 2014.

... embedding of the gene tree into the species tree [36]. In such an embedding, the topology of the gene tree is kept; gene tree leaves are placed at the species tree leaves where they come from; and the...

2.6 The General Reconciliation Problem
2.6.1 The species tree inference problem
Parsimony-based inference of the species tree from a set of gene trees, also known as the gene tree parsimony problem...

Chapter 3
The Gene Tree Refinement Problem
In this chapter, we shall study the general reconciliation problem for arbitrary gene trees and binary species trees.
Problem 3.1 The Gene Tree Refinement Problem
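
The text refers several times to the standard LCA reconciliation for binary trees and to the gene duplication cost. As a rough illustration only, and not the thesis's own algorithm, the Python sketch below maps every gene tree node to the lowest common ancestor of its children's images in the species tree and counts a duplication whenever a node shares its image with one of its children. The nested-tuple tree encoding, the function names, and the convention that a gene label such as A_1 belongs to species A are all assumptions made for this example. The LCA is found by a naive scan over species clusters, so the sketch is far from linear time.

```python
from typing import Callable, FrozenSet, List, Tuple, Union

# Hypothetical encoding: a leaf is a string, an internal node is a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def clusters(tree: Tree) -> List[FrozenSet[str]]:
    """Return the leaf set (cluster) of every node of a binary tree."""
    found: List[FrozenSet[str]] = []

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            cluster = frozenset([node])
        else:
            left, right = node
            cluster = visit(left) | visit(right)
        found.append(cluster)
        return cluster

    visit(tree)
    return found


def lca_duplication_cost(
    gene_tree: Tree,
    species_tree: Tree,
    species_of: Callable[[str], str] = lambda gene: gene.split("_")[0],
) -> int:
    """Count duplications implied by the LCA mapping (naive, for small trees)."""
    species_clusters = clusters(species_tree)

    def lca(leaves: FrozenSet[str]) -> FrozenSet[str]:
        # Clusters containing a given leaf set are nested, so the smallest
        # one is the lowest common ancestor. Raises ValueError if none exists.
        return min((c for c in species_clusters if leaves <= c), key=len)

    duplications = 0

    def visit(node: Tree) -> FrozenSet[str]:
        nonlocal duplications
        if isinstance(node, str):
            return lca(frozenset([species_of(node)]))
        left, right = node
        image_left, image_right = visit(left), visit(right)
        image = lca(image_left | image_right)
        # A node mapped to the same species node as a child is a duplication.
        if image == image_left or image == image_right:
            duplications += 1
        return image

    visit(gene_tree)
    return duplications


if __name__ == "__main__":
    species = (("A", "B"), "C")
    print(lca_duplication_cost((("A_1", "B_1"), "C_1"), species))
    print(lca_duplication_cost((("A_1", "B_1"), ("A_2", "C_1")), species))
```

The first call uses a gene tree that agrees with the species tree, and the second one places a second copy of species A next to C, which forces the root of the gene tree onto the same species node as its right child. A practical implementation would replace the cluster scan with a constant-time LCA structure such as those in [12, 44, 74].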

Date posted: 09/09/2015, 08:12
