Three mathematical issues in reconstructing ancestral genome

THREE MATHEMATICAL ISSUES IN RECONSTRUCTING ANCESTRAL GENOME

YANG JIALIANG
(B. Sc., DUT; M. Sc., DUT)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF MATHEMATICS
NATIONAL UNIVERSITY OF SINGAPORE
2008

Acknowledgements

I would like to express my deepest gratitude to my advisor, Professor Zhang Louxin, for all his kindness, supervision, and invaluable advice throughout this research work. I am very grateful for all that he has done for me, especially during the last few months while I was applying for jobs. This dissertation would not have been possible without his guidance and help.

I am indebted to my family. Without their support, I would not have had the courage to face the problems of research and life.

Thanks to my friends NG Yen Kaow, Ning Kang, Liu Yongjin and Francis Ng Hoong Kee. They shared many pleasant times with me and gave me much advice on research and life.

Contents

Acknowledgements
Summary
List of Tables
List of Figures

1 Introduction to the Reconstruction of Ancestral Genomes
1.1 DNA and Genome
1.2 Genome Evolution
1.2.1 Mutation
1.2.2 Selection
1.2.3 Homology
1.3 Ancestral Genomes Reconstruction
1.3.1 Sequence Alignment
1.3.2 Reconstructing Evolutionary History
1.3.3 Inferring Tandem Duplication Events
1.4 Contribution and Organization
1.4.1 Issue 1: How to Optimize Seeds for Homology Search?
1.4.2 Issue 2: Analysis of the Accuracy of the Fitch Method on Complete Trees
1.4.3 Issue 3: How to Count the Tandem Duplication Models?

2 Sensitivity Analysis of Spaced Seed in Homology Search
2.1 Sequence Alignment
2.1.1 Global Alignment
2.1.2 Scoring Schemes
2.1.3 Local Alignment
2.1.4 Local Alignment Programs
2.2 Seed Types
2.2.1 Consecutive Seed
2.2.2 Basic Spaced Seed
2.2.3 Transition Seed
2.2.4 Seed Sensitivity and Specificity
2.3 High-order Seed Patterns
2.4 Hit Probability $q_n$
2.4.1 A Recurrence System for Computing $q_n$
2.4.2 An Inequality on $q_n$
2.4.3 Asymptotic Analysis for Hit Probability
2.5 Average Distance between Non-overlapping Hits
2.5.1 A Formula for $\mu_Q$
2.5.2 Bounding $\mu_Q$
2.5.3 Using $\mu_Q$ to Bound $\lambda_Q$
2.6 Transition Seed Selection
2.6.1 Selection Methods
2.6.2 Good Transition Seeds

3 Reconstruction Accuracy for the Fitch Method on Complete Trees
3.1 Phylogenetic Tree
3.2 The Jukes-Cantor Model
3.3 The Fitch Method
3.4 Reconstruction Accuracy
3.5 Accuracy Analysis of the Fitch Method on Complete Trees
3.5.1 Definition
3.5.2 A Recurrence System for Reconstruction Accuracy
3.5.3 Asymptotic Analysis on Reconstruction Accuracy

4 Count Tandem Duplication Models
4.1 Introduction to Tandem Duplication
4.2 Tandem Duplication Model
4.2.1 Rooted Duplication Trees
4.2.2 Unrooted Duplication Trees
4.3 Counting Tandem Duplication Trees
4.3.1 Number of Rooted Trees
4.3.2 Number of Unrooted Trees
4.3.3 Relation between the Number of Rooted and Unrooted Trees

Bibliography
Index

Summary

With the advances in comparative genomics methods in the past decade, bioinformatics has become a feasible approach to reconstructing ancestral genomes. It reconstructs an ancestral genome by aligning extant genomes and inferring different types of evolutionary events in the evolutionary history, two important kinds of which are substitution events and duplication events. In this thesis, we study three mathematical issues arising from this approach.

The first issue is seed optimization for homology search and sequence alignment. It is known that the performance of a seed-based alignment program depends largely on the quality of the seed used in the program. However, seed optimization is a difficult task. No polynomial-time algorithm is known at present.
Aiming for fast algorithms for identifying good seeds, we first formulate a high-order seed pattern to model the different types of seeds used in seeded programs. Then we study theoretically two probabilistic parameters that are related to the performance of a seed: the hit probability and the average distance between successive non-overlapping hits. We establish a recurrence formula for computing the hit probability of high-order seeds and analyze the hit probability asymptotically. We also present a matrix-based formula and a tight upper bound for the average distance between successive non-overlapping hits. Based on our theoretical results, we design an algorithm for identifying good transition seeds. This algorithm can also be adapted to identify multiple seeds. Our algorithm outperforms existing deterministic methods in running time and random algorithms in seed quality.

The second issue arises from the reconstruction of ancestral sequences, where the evolutionary history is usually represented by an evolutionary tree. Given a rooted evolutionary tree and a group of states on its leaves, the Fitch method is used to reconstruct the ancestral states at interior nodes. The reconstruction accuracy of the Fitch method is the probability that it correctly reconstructs the true state at the root. We assume that the conservation probability of each state on every branch is equal to a common value $p$, and let $P_{\mathrm{accuracy}}(T_n)$ be the reconstruction accuracy of the Fitch method on a rooted complete tree $T_n$ with two states, say 0 and 1. Steel (1989) observed that

\[
\lim_{n \to \infty} P_{\mathrm{accuracy}}(T_n) =
\begin{cases}
\dfrac{1}{3}, & \text{if } \dfrac{1}{2} \le p \le \dfrac{7}{8}; \\[2ex]
\dfrac{4p-3}{2(2p-1)} + \dfrac{\sqrt{(8p-7)(4p-3)}}{2(1-2p)^2}, & \text{if } p > \dfrac{7}{8}.
\end{cases}
\]

We give a rigorous proof of this observation and study the convergence for $p < \frac{1}{2}$.

The third issue arises from reconstructing duplication events, among which tandem duplication events are common.
A tandem duplication history resulting in n repeated segments is modeled by a (rooted or unrooted) tandem duplication tree with n ordered leaves. We first present a simple recurrence formula for the number of rooted duplication trees,

\[
r_n =
\begin{cases}
1, & \text{if } n = 2; \\[1ex]
\displaystyle\sum_{k=1}^{\lfloor (n+1)/3 \rfloor} (-1)^{k+1} \binom{n+1-2k}{k}\, r_{n-k}, & \text{if } n \ge 3,
\end{cases}
\]

and then give a non-counting proof that the number of rooted duplication trees for n segments is twice the number of unrooted duplication trees for n segments.

List of Tables

1.1 Genetic code.
1.2 Aligned genomic sequences.
2.1 A score matrix.
2.2 Good transition seeds in Bernoulli model M({1, 2, 3}, 0.6, 0.1, 0.3).
2.3 Good transition seeds in Bernoulli model M({1, 2, 3}, 0.6, 0.2, 0.2).
2.4 Good transition seeds in Bernoulli model M({1, 2, 3}, 0.6, 0.3, 0.1).
2.5 Good transition seeds in Bernoulli model M({1, 2, 3}, 0.7, 0.15, 0.15).
2.6 Good transition seeds in Bernoulli model M({1, 2, 3}, 0.8, 0.1, 0.1).
2.7 Comparing the running time with Hedera.
2.8 Comparing the hit probability and running time with Mandala.
4.1 The correspondence between the unrooted duplication trees and rooted duplication trees with ordered leaves.

List of Figures

1.1 Microbial genome growth – from “http://www.ncbi.nlm.nih.gov/”.
1.2 Four nucleotides: Adenine, Cytosine, Guanine and Thymine – from “http://www.genome.gov/”.
1.3 Structure of base pair – from “http://academic.brooklyn.cuny.edu/”.
1.4 Jukes-Cantor one-parameter model.
1.5 Kimura’s two-parameter model.
1.6 Unequal crossover.
1.7 Different types of mutations.
1.8 A phylogenetic tree that we want to evaluate using parsimony.
1.9 Reconstruction of site on the tree in Figure 1.8.
1.10 Alternative reconstructions of site on the tree in Figure 1.8.
1.11 (a) Reconstruction of site on the tree in Figure 1.8; (b) Reconstruction of site on the tree in Figure 1.8.
1.12 A possible reconstruction of ancestral states.

4.3 Counting Tandem Duplication Trees

Now, by Equation (4.1),

\[
\begin{aligned}
r_n &= \sum_{j=2}^{n} p(n, n-j) \\
    &= \sum_{j=2}^{n} \sum_{i=1}^{\lfloor (j+1)/3 \rfloor} (-1)^{i+1} \binom{j-2i}{i-1}\, r_{n-i} \\
    &= \sum_{k=1}^{\lfloor (n+1)/3 \rfloor} (-1)^{k+1} \Biggl[ \sum_{j=3k-1}^{n} \binom{j-2k}{k-1} \Biggr] r_{n-k} \\
    &= \sum_{k=1}^{\lfloor (n+1)/3 \rfloor} (-1)^{k+1} \binom{n+1-2k}{k}\, r_{n-k},
\end{aligned}
\]

where we use the formula $\sum_{i=a}^{b} \binom{i}{a} = \binom{b+1}{a+1}$ for two integers $0 \le a \le b$. This finishes the proof.

The recurrence formula in Theorem 4.3.1 allows us to find a closed formula for computing $r_n$. Let $X = (r_n, r_{n-1}, \ldots, r_3, r_2)^T$. Then, by the recurrence formula, $AX = (0, 0, \ldots, 0, 1)^T$, where $A = (a_{ij})_{(n-1) \times (n-1)}$ is defined as

\[
a_{ij} =
\begin{cases}
(-1)^{j-i} \dbinom{n+2+i-2j}{j-i}, & \text{if } i \le j \le (n+2+2i)/3; \\[1ex]
0, & \text{otherwise.}
\end{cases}
\]

Since A is an upper triangular matrix with 1s along the diagonal, its determinant is 1. Hence, the fact that only the last entry ‘1’ of the right-hand vector is non-zero implies that $r_n$ is the cofactor of row $n-1$ and column 1 of A. For example, we have

\[
r_6 =
\begin{vmatrix}
-5 & 3 & 0 & 0 \\
1 & -4 & 1 & 0 \\
0 & 1 & -3 & 0 \\
0 & 0 & 1 & -2
\end{vmatrix}
= 92.
\]

4.3.2 Number of Unrooted Trees

Let a(n, k) be the number of rooted trees RT satisfying one of the following conditions: (1) the root of RT is the direct ancestor of segment n; (2) the root of RT is the ancestor of segment n and involved in a multiple duplication. Gascuel et al. (2003) proved the following recursive formula for the number of unrooted trees $d_n$.

Lemma 4.3.2. (Gascuel et al.
2003) For any $n \ge 4$,

\[ d_n = \sum_{i=0}^{n-2} a(n, i). \tag{4.7} \]

For $1 \le i \le n-3$, $a(n, i)$ satisfies:

\[ a(n, 0) = a(n-1, 0) + a(n-1, 1), \tag{4.8} \]
\[ a(n, i) = a(n-1, i+1) + a(n, i-1), \tag{4.9} \]
\[ a(n, n-2) = a(n, n-3) = a(n, n-4) = d_{n-1}, \tag{4.10} \]

with initial values a(2, 0) = 1, a(3, 0) = 0 and a(3, 1) = 1.

4.3.3 Relation between the Number of Rooted and Unrooted Trees

One can create rooted tandem duplication trees by rooting unrooted trees at some edge. The number of potential root placements on an unrooted tree is two on average (Gascuel et al., 2003; Yang and Zhang, 2004). Thus, we have

Theorem 4.3.2 (Gascuel et al., 2003). For $n \ge 3$,

\[ r_n = 2 d_n, \tag{4.11} \]

i.e., the number of rooted duplication trees for n segments is twice the number of unrooted duplication trees for n segments.

Gascuel et al. (2003) proved the theorem by using the recursive formulas in Lemma 4.3.1 and Lemma 4.3.2. The theorem is true for n = 3 and n = 4. For n > 4, they proved that $p(n, i) = 2a(n, i)$, $0 \le i \le n-2$, by induction on n and thus concluded $r_n = 2d_n$, since $r_n = \sum_{i=0}^{n-2} p(n, i)$ and $d_n = \sum_{i=0}^{n-2} a(n, i)$. In the following we give an alternative non-counting proof of this result.

Proof. Let U be an unrooted duplication tree for n segments {1, 2, . . . , n}. By definition, at least one rooted duplication tree can be obtained from U by rooting it at some edge in the path from 1 to n. Recall that U can be rooted at an edge e if a rooted duplication tree can be formed by rooting it at e. Let $S_i$ denote the set of unrooted duplication trees that can be rooted at exactly i edges, $i \ge 1$. Then, the following facts are true:

(1) If the unrooted duplication tree U can be rooted at two edges e and e′, then it can be rooted at any edge between e and e′ in the path from segment 1 to segment n.

(2) Let $U \in S_k$ for some $k \ge 3$. Assume the edges in the path from 1 to n in U are $e_1, e_2, \ldots, e_m$, $m \le n$, where $e_i = (u_{i-1}, u_i)$, $u_0 = 1$, and $u_m = n$.
If U can be rooted only at the edges $e_j$ ($i \le j \le i'$), where $i' - i + 1 = k$, then $u_{i-1}$ must be contained in a multi-duplication block, and so must $u_{i'}$.

Assume $U \in S_k$, $k \ge 3$, and let i and i′ be as in (2). Let $T_j$ denote the subtree of U rooted at a child of $u_j$ that is off the path from 1 to n. We also let $U_{j(j+1)}$ denote the unrooted duplication tree obtained from U by interchanging $T_j$ and $T_{j+1}$, as illustrated in Figure 4.6. Then, we obtain $i' - i - 1$ unrooted trees $U_{i(i+1)}, U_{(i+1)(i+2)}, \ldots, U_{(i'-2)(i'-1)}$ from U. It is easy to see that, for each j from $i$ to $i' - 2$, $U_{j(j+1)}$ can be rooted uniquely at the edge $e_{j+1}$.

Conversely, if U can be rooted uniquely at some edge $e_{i+1} = (u_i, u_{i+1})$ in the path from 1 to n, then $u_i$ and $u_{i+1}$ must be contained in a double duplication block (and hence the right subtree of $u_i$ and the left subtree of $u_{i+1}$ are swapped). Thus, $U_{i(i+1)}$, obtained by interchanging $T_i$ and $T_{i+1}$, can be rooted at the three edges $e_i$, $e_{i+1}$, $e_{i+2}$. This implies that $U_{i(i+1)}$ is an unrooted duplication tree that can be rooted in at least three ways. Therefore, the mapping from U to $C_{1U} = \{U_{j(j+1)} \mid i \le j \le i' - 2\}$ is one-to-one from unrooted duplication trees $U \in S_k$ ($k \ge 3$) to the $(k-2)$-subsets of $S_1$.

Figure 4.6: (a) An unrooted duplication tree U. It can be rooted at edges e3, e4, e5, e6, e7. The rooted duplication tree derived from U by rooting it at e5 is given in Figure 4.2. (b) An unrooted duplication tree U45 obtained from U by interchanging subtrees T4 and T5. U45 can only be rooted at e5.

Recall that $r_n$ and $d_n$ denote the numbers of rooted and unrooted duplication trees for n segments, respectively.
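Before the final counting step, a quick numerical sanity check of Theorem 4.3.2 may be useful. The sketch below evaluates the recurrence of Theorem 4.3.1, that is, $r_2 = 1$ and $r_n = \sum_{k=1}^{\lfloor (n+1)/3 \rfloor} (-1)^{k+1} \binom{n+1-2k}{k} r_{n-k}$ for $n \ge 3$, and then derives $d_n$ as $r_n / 2$. It is only an illustration in Python and not part of the thesis: the function names `rooted_count` and `unrooted_count` are hypothetical, and `unrooted_count` assumes the identity $r_n = 2d_n$ rather than proving it.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def rooted_count(n: int) -> int:
    """Number r_n of rooted duplication trees for n segments (n >= 2)."""
    if n < 2:
        raise ValueError("r_n is defined here only for n >= 2")
    if n == 2:
        return 1
    # r_n = sum_{k=1}^{floor((n+1)/3)} (-1)^(k+1) * C(n+1-2k, k) * r_{n-k}
    return sum(
        (-1) ** (k + 1) * comb(n + 1 - 2 * k, k) * rooted_count(n - k)
        for k in range(1, (n + 1) // 3 + 1)
    )


def unrooted_count(n: int) -> int:
    """Number d_n of unrooted duplication trees, ASSUMING r_n = 2 * d_n (n >= 3)."""
    if n < 3:
        raise ValueError("the identity r_n = 2 d_n is stated for n >= 3")
    r = rooted_count(n)
    if r % 2:
        raise ArithmeticError("r_n is odd, which would contradict r_n = 2 d_n")
    return r // 2


if __name__ == "__main__":
    # Print n, r_n and d_n for a few small n.
    for n in range(3, 9):
        print(n, rooted_count(n), unrooted_count(n))
```

For example, `rooted_count(6)` returns 92 and `unrooted_count(6)` returns 46. Memoizing with `lru_cache` keeps the evaluation linear in the number of distinct arguments, which is convenient when tabulating many values.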
We obtain

\[
\begin{aligned}
r_n &= \sum_{j=1}^{n-2} j\,|S_j| \\
    &= \sum_{j=3}^{n-2} j\,|S_j| + |S_1| + 2|S_2| \\
    &= \sum_{j=3}^{n-2} \sum_{U \in S_j} j + \sum_{j=3}^{n-2} \sum_{U \in S_j} |C_{1U}| + 2|S_2| \\
    &= \sum_{j=3}^{n-2} \sum_{U \in S_j} \bigl(j + |C_{1U}|\bigr) + 2|S_2| \\
    &= \sum_{j=3}^{n-2} \sum_{U \in S_j} (j + j - 2) + 2|S_2| \\
    &= 2 \sum_{j=3}^{n-2} \sum_{U \in S_j} \bigl(1 + |C_{1U}|\bigr) + 2|S_2| \\
    &= 2 \sum_{j=1}^{n-2} |S_j| = 2 d_n,
\end{aligned}
\]

where we use the fact $|C_{1U}| = j - 2$ for $U \in S_j$, $j \ge 3$.

Bibliography

S.F. Altschul, W. Gish, W. Miller, E.W. Myers and D.J. Lipman, Basic local alignment search tool. Journal of Molecular Biology 215 (3) (1990), pp. 403-410.
S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller and D.J. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25 (17) (1997), pp. 3389-3402.
N. Balakrishnan and M.V. Koutras, Runs and Scans with Applications. John Wiley & Sons, U.S.A. (2002).
D. Baltimore, Our genome unveiled. Nature 409 (2001), pp. 814-816.
S. Batzoglou, L. Pachter, J.P. Mesirov, B. Berger and E.S. Lander, Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Research 10 (2000), pp. 950-958.
G. Benson and L. Dong, Reconstructing the Duplication History of a Tandem Repeat. Proceedings of ISMB’99 (1999), pp. 44-53.
B. Bertrand and O. Gascuel, Topological Rearrangements and Local Search Method for Tandem Duplication Trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics (1) (2005), pp. 15-28.
B. Brejová, D. Brown, and T. Vinař, Optimal spaced seeds for homologous coding regions. Journal of Bioinformatics and Computational Biology (2004), pp. 595-610.
B. Brejová, D. Brown, and T. Vinař, Vector seeds: an extension to spaced seeds allows substantial improvements in sensitivity and specificity. Journal of Computer and System Sciences 70 (3) (2005), pp. 364-380.
M. Brudno, M.A. Chapman, B. Gottgens, S. Batzoglou and B. Morgenstern, Fast and sensitive multiple alignment of large genomic sequences. BMC Bioinformatics (2003), pp. 4-66.
J. Buhler, U. Keich and Y.
Sun, Designing seeds for similarity search in genomic DNA. Proceedings of RECOMB’03 (2003), pp. 67-75.
A. Califano and I. Rigoutsos, FLASH: fast look-up algorithm for string homology. Proceedings of ISMB’93 (1993), pp. 56-64.
K.P. Choi, F. Zeng, and L. Zhang, Good spaced seeds for homology search. Bioinformatics 20 (7) (2004), pp. 1053-1059.
K.P. Choi, and L. Zhang, Sensitivity analysis and efficient method for identifying optimal spaced seeds. Journal of Computer and System Sciences 68 (1) (2004), pp. 22-40.
M. Csűrös and B. Ma, Rapid homology search with neighbor seeds. Algorithmica 48 (2) (2007), pp. 187-202.
A. Darling, T. Treangen, L. Zhang, C. Kuiken, X. Messeguer and N. Perna, Procrastination leads to efficient filtration for local multiple alignment. Proceedings of WABI’06 (2006), pp. 126-137.
R. V. Eck and M. O. Dayhoff, Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Silver Spring, MD.
E. E. Eichler, Recent duplication, domain accretion and the dynamic mutation of the human genome. Trends in Genetics 17 (1) (2001), pp. 661-669.
O. Elemento and O. Gascuel, An exact and polynomial distance-based algorithm to reconstruct single copy tandem duplication trees. Proceedings of CPM’03 (2003), pp. 96-108.
O. Elemento, O. Gascuel and M. P. Lefranc, Reconstructing the duplication history of tandemly repeated genes. Molecular Biology and Evolution 19 (3) (2002), pp. 278-88.
I. Elias and T. Tuller, Reconstruction of ancestral genomic sequences using likelihood. Journal of Computational Biology 14 (2) (2007), pp. 216-237.
J. Farris, Methods for computing Wagner trees. Systematic Zoology 19 (1) (1970), pp. 83-92.
W. Feller, An Introduction to Probability Theory and its Applications. vol. 1. 3rd edition. John Wiley and Sons, New York (1968).
J. Felsenstein, Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17 (6) (1981), pp. 368-376.
J. Felsenstein, Phylogenies from molecular sequences: Inference and reliability. Annual Review of Genetics 22 (1988), pp. 521-565.
J. Felsenstein, Inferring Phylogenies. Sinauer Associates (2003).
W. Fitch, Towards defining the course of evolution: Minimum change for a specific tree topology. Systematic Zoology 20 (1971), pp. 406-416.
W. Fitch, Phylogenies constrained by cross-over process as illustrated by human hemoglobins in a thirteen cycle, eleven amino-acid repeat in human apolipoprotein A-I. Genetics 86 (3) (1977), pp. 623-644.
W. Fitch, On the problem of discovering the most parsimonious tree. The American Naturalist 111 (978) (1977), pp. 223-257.
W. Fitch, A non-sequential method for constructing trees and hierarchical classifications. Journal of Molecular Evolution 18 (1) (1981), pp. 30-37.
O. Gascuel, M. D. Hendy, A. Jean-Marie and R. McLachlan, The combinatorics of tandem duplication trees. Systematic Biology 52 (1) (2003), pp. 110-118.
W. Gish, WU-BLAST 2.0, Website: http://blast.wustl.edu, 2001.
L. J. Guibas and A. M. Odlyzko, String overlaps, pattern matching, and nontransitive games. Journal of Combinatorial Theory (Series A) 30 (2) (1981), pp. 183-208.
D. M. Hillis, J. P. Huelsenbeck and C. W. Cunningham, Application and accuracy of molecular phylogenies. Science 264 (1994), pp. 671-677.
X. Huang and M. Miller, A time-efficient, linear-space local similarity algorithm. Advances in Applied Mathematics 12 (3) (1991), pp. 337-357.
P. Jacquet and W. Szpankowski, Analytic approach to pattern matching. Applied Combinatorics on Words (Editor: M. Lothaire), Cambridge Press (2005), pp. 329-395.
A. Jeffreys and S. Harris, Processes of gene duplication. Nature 296 (1982), pp. 9-10.
T. H. Jukes and C. R. Cantor, Evolution of protein molecules. Mammalian Protein Metabolism, Academic Press, New York (1969), pp. 21-132.
U. Keich, M. Li, B. Ma, and J. Tromp, On spaced seeds for similarity search. Discrete Applied Mathematics (2004), pp. 253-263.
M. Kimura, A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16 (2) (1980), pp. 111-120.
W.J. Kent, BLAT-the BLAST-like alignment tool. Genome Research 12 (4) (2002), pp. 656-664.
Y. Kong, Generalized correlation functions and their applications in selection of optimal multiple spaced seeds for homology search. Journal of Computational Biology 14 (2) (2007), pp. 238-254.
G. Kucherov, L. Noe and M. Roytberg, Multiseed lossless filtration. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2005), pp. 51-61.
G. Kucherov, L. Noe and M. Roytberg, A unifying framework for seed sensitivity and its application to subset seeds. INRIA Technical Report No. 5374 (2004).
S. Leem, J. London and J. Kim, The human telomerase gene: complete genomic sequence and analysis of tandem repeat polymorphisms in intronic regions. Oncogene 21 (5) (2002), pp. 769-777.
V. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10 (8) (1966), pp. 707-710.
M. Li and B. Ma, On the complexity of computing the sensitivity of spaced seeds. Journal of Computer and System Sciences 73 (7) (2007), pp. 1024-1034.
M. Li, B. Ma, D. Kisman and J. Tromp, PatternHunter II: highly sensitive and fast homology search. Journal of Bioinformatics and Computational Biology (3) (2004), pp. 417-439.
M. Li, B. Ma and L. Zhang, Superiority and complexity of the spaced seeds. Proceedings of SODA’06 (2006), pp. 444-453.
G. Li, M. Steel and L. Zhang, More taxa are not necessarily better for the reconstruction of ancestral character states. Systematic Biology 57 (2008), pp. 647-653.
P. Lio and N. Goldman, Models of Molecular Evolution and Phylogeny. Genome Research (12) (1998), pp. 1233-1244.
D.J. Lipman and W.R. Pearson, Rapid and sensitive protein similarity searches. Science 227 (1985), pp. 1435-1441.
B. Ma, J. Tromp and M. Li, PatternHunter: faster and more sensitive homology search. Bioinformatics 18 (3) (2002), pp. 440-445.
B. Ma and H. Yao, Seed optimization is no easier than optimal Golomb ruler design. Proceedings of APBC’08 (2008), pp. 133-143.
P. M. Maddison, Calculating the probability distributions of ancestral states reconstructed by parsimony on phylogenetic trees. Systematic Biology 44 (4) (1995), pp. 474-481.
D. Mak, Y. Gelfand and G. Benson, Indel seeds for homology search. Bioinformatics 22 (14) (2006), pp. e341-349.
J. Masek and M. Paterson, A faster algorithm computing string edit distances. Journal of Computer and System Sciences 20 (1) (1980), pp. 18-31.
F. Nicolas and E. Rivals, Hardness of optimal spaced seed design. Journal of Computer and System Sciences 74 (5) (2008), pp. 831-849.
L. Noe and G. Kucherov, YASS: enhancing the sensitivity of DNA similarity search. Nucleic Acids Research 33 (2005), pp. 540-543.
G. Nuttall, Blood Immunity and Blood Relationship. Cambridge University Press, Cambridge.
S. Ohno, Evolution by Gene Duplication. Springer-Verlag, Berlin (1970).
A. Philippou and A. Muwafi, Waiting for the kth consecutive success and the Fibonacci sequence of order k. Fibonacci Quarterly 20 (1982), pp. 28-32.
F.P. Preparata, L. Zhang and K.P. Choi, Quick, practical selection of effective seeds for homology search. Journal of Computational Biology 12 (9) (2005), pp. 1137-1152.
N. Saitou and M. Nei, The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution (4) (1987), pp. 406-425.
D. Sankoff and P. Rousseau, Locating the vertices of a Steiner tree in an arbitrary metric space. Mathematical Programming (1975), pp. 240-246.
S.J. Schwager, Run probabilities in sequences of Markov-dependent trials. Journal of the American Statistical Association 78 (381) (1983), pp. 168-175.
S. Schwartz, J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. Hardison, D. Haussler and W. Miller, Human-mouse alignments with BLASTZ. Genome Research 13 (1) (2003), pp. 103-107.
G.P. Smith, Evolution of repeated DNA sequences by unequal crossover. Science 191 (1976), pp. 528-535.
T.F. Smith, M.S. Waterman, Identification of Common Molecular Subsequences. Journal of Molecular Biology 147 (1) (1981), pp. 195-197.
R.R. Sokal and C. D. Michener, A statistical method for evaluating systematic relationships. The University of Kansas Science Bulletin 28 (1958), pp. 1409-1438.
A.D. Solov’ev, A combinatorial identity and its application to the problem concerning the first occurrences of a rare event. Theory of Probability and Its Applications 11 (2) (1966), pp. 276-282.
M. Steel, Distribution in bicolored evolutionary trees. Ph. D thesis, Massey University, New Zealand (1989).
M. Steel, Distribution on bicoloured binary trees arising from the principle of parsimony. Discrete Applied Mathematics 41 (1993), pp. 245-261.
M. Steel and M. Charleston, Five surprising properties of parsimoniously colored trees. Bulletin of Mathematical Biology 57 (2) (1995), pp. 367-375.
Y. Sun and J. Buhler, Designing multiple simultaneous seeds for DNA similarity search. Journal of Computational Biology 12 (6) (2005), pp. 847-861.
Y. Sun and J. Buhler, Choosing the best heuristic for seeded alignment of DNA sequences. BMC Bioinformatics (133) (2006). Proceedings of RECOMB’04 (2004), pp. 76-85.
M. Tang, M. Waterman and S. Yooseph, Zinc finger gene clusters and tandem gene duplication. Journal of Computational Biology (2) (2002), pp. 429-46.
J.B. Xu, D.G. Brown, M. Li and B. Ma, Optimizing multiple spaced seeds for homology search. Journal of Computational Biology 13 (7) (2006), pp. 1355-1368.
I.H. Yang, S.H. Wang, Y.H. Chen, P.H. Huang, L. Ye, X.Q. Huang and K.M. Chao, Efficient methods for generating optimal single and multiple spaced seeds. Proceedings of BIBE’04 (2004), pp. 411-418.
J. Yang, L. Zhang, On counting tandem duplication trees. Molecular Biology and Evolution 21 (2004), pp. 1160-1163.
J. Yang, L. Zhang, Run probabilities of seed-like patterns and identifying good transition seeds. Journal of Computational Biology 15 (10) (2008), pp. 1295-1313. The extended abstract also appears in the proceedings of APBC2008.
Z. Yang, S. Kumar and M. Nei, A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141 (4) (1995), pp. 1641-1650.
L. Zhang, Superiority of spaced seeds for homology search. IEEE/ACM Transactions on Computational Biology and Bioinformatics (3) (2007), pp. 496-505.
L. Zhang, B. Ma, L. Wang and Y. Xu, Greedy method for inferring tandem duplication history. Bioinformatics 19 (12) (2003), pp. 1497-1504.
L. Zhang, J. Shen, J. Yang and G. Li, Recurrence formulas for analyzing the accuracy of the Fitch method for reconstructing ancestral states. Submitted.
L. Zhou and L. Florea, Designing sensitive and specific spaced seeds for cross-species mRNA-to-genome alignment. Journal of Computational Biology 14 (2) (2007), pp. 113-130.

Index

affine gap penalty, alignment score, ambiguous reconstruction, ambiguous reconstruction accuracy, amino acid, ancestral genome, base pairing, bifurcating tree, BLASTN, block, branch, chromosome, codon, comparative genomics, complement, consecutive seed, deletion, DNA, double helix, duplication, duplication model, exon, gap, gene, genetic code, genome evolution, global alignment, homology, homology search, HTU, insertion, intron, Jukes-Cantor model, Kimura model, large-scale mutation, linear gap penalty, local alignment, match, maximum likelihood, maximum parsimony, MEGABLAST, mismatch, molecular clock, mRNA, multifurcating tree, multiple alignment, mutation, neighbor-joining, nucleotide, optimal seed, ortholog, OTU, outgroup, pairwise alignment, paralog, phylogenetic tree, point mutation, protein, purine, pyrimidine, RNA, rooted duplication tree, seed, seed hit, selection, sensitivity, sequence alignment, similarity function, specificity, substitution, tandem duplication, taxa, the central dogma, the Fitch algorithm, transcription, transition, transition seed, translation, transversion, unambiguous reconstruction, unambiguous reconstruction accuracy, unequal crossover, uniformly spaced seed, unrooted duplication tree, UPGMA, weight

[...]

Chapter 1 Introduction to the Reconstruction of Ancestral Genomes

In the past decades, advances in molecular biology have led to a rapid increase in genomic sequence data. More and more genomes have...

... Entrez Genome Project” database, since the first complete genome, Haemophilus influenzae Rd KW20, was sequenced in 1995, 626 bacteria and 52 archaea genomes have been sequenced, and these numbers are still increasing (see Figure 1.1). Facing this deluge of information, scientists begin to take note of the importance and contributions of comparative genomics, a field that investigates the relationships between genomes...

... species using computational approaches. Many comparative methods are developed to compare different genomes and infer different types of evolutionary events. The development of genomic databases and advances in comparative genomics methods have made bioinformatics a feasible approach to reconstructing ancestral genomes. It reconstructs an ancestral genome by aligning extant genomes and then inferring different types of evolutionary events like substitution events and duplication events. Reconstruction of ancestral genomes contributes to inferring the functions of human genes and thus suggests drug targets for hereditary diseases. In this thesis, we study three mathematical issues arising from the bioinformatics approach to ancestral...

1.1 DNA and Genome

... function. Cell is the ‘building block’ of life. It mainly performs functions for maintaining daily life and passing the genetic instructions to the next generation. The former function is mainly facilitated by proteins whereas the latter function is mainly achieved through deoxyribonucleic acids (DNA). DNA is a polymer that contains the genetic instructions needed by the cell to perform daily...

... units of DNA are nucleotides. Each nucleotide in a DNA has 3 parts: a pentose sugar (deoxyribose), a phosphate and a base. Nucleotides can be classified into 4 types corresponding to their distinct bases: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T). A and G are called purines, having a two-ring structure. C and T are called pyrimidines, having a one-ring structure (see Figure 1.2). For simplicity,...

... building proteins from RNAs is called translation. Protein is a large organic compound made from 20 different amino acids. The translation from the four-letter alphabet of RNAs to the twenty-letter alphabet of proteins starts with reading out the messenger RNA in groups of three nucleotides at a time. Each such triplet of consecutive nucleotides, called a codon, specifies a single amino acid in the...

... stored in all chromosomes is referred to as a genome. Genome size can be extremely large. As an example, the human genome has around 3 billion base pairs, and is organized into 23 pairs of chromosomes.

1.2 Genome Evolution

Genomes evolve over time. There are tens of millions of genomes today. This tremendous diversity is attributed to genome evolution. Genome evolution is the process of evolving various genomes...

... by copying errors in DNA or RNA during cell division. Though mutations happen rarely, they play a very important role in shaping genomes. According to the length of the DNA sequence involved, mutations can be classified into point mutations, which only affect a single nucleotide, and large-scale mutations, which affect large regions in a genome. Mutations can also be classified by the types of change into substitution,...

... calculate the similarity of sequences or structures to infer homologous DNAs or proteins.

1.3 Ancestral Genomes Reconstruction

In a bioinformatic approach, ancestral genomes are reconstructed by first aligning extant genomes and finding homologies between them, and then inferring different types of evolutionary events in the evolutionary history. Among these events, substitution...
