Algorithms in Computational Molecular Biology: Techniques, Approaches and Applications (Elloumi, Zomaya), 2011-02-02. Data Structures and Algorithms

P1: OTA/XYZ P2: ABC fm JWBS046-Elloumi November 18, 2010 8:32 Printer Name: Sheridan

ALGORITHMS IN COMPUTATIONAL MOLECULAR BIOLOGY
Techniques, Approaches and Applications

Wiley Series on Bioinformatics: Computational Techniques and Engineering
A complete list of the titles in this series appears at the end of this volume.

Edited by
Mourad Elloumi
Unit of Technologies of Information and Communication and University of Tunis-El Manar, Tunisia
Albert Y. Zomaya
The University of Sydney, Australia

A JOHN WILEY & SONS, INC., PUBLICATION

Copyright © 2011 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty:
While the publisher and the author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor the author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information about our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data is available.
ISBN: 978-0-470-50519-9
Printed in the United States of America

To our families, for their patience and support

CONTENTS

PREFACE
CONTRIBUTORS

I STRINGS PROCESSING AND APPLICATION TO BIOLOGICAL SEQUENCES

1 STRING DATA STRUCTURES FOR COMPUTATIONAL MOLECULAR BIOLOGY
Christos Makris and Evangelos Theodoridis
  1.1 Introduction
  1.2 Main String Indexing Data Structures
    1.2.1 Suffix Trees
    1.2.2 Suffix Arrays
  1.3 Index Structures for Weighted Strings
  1.4 Index Structures for Indeterminate Strings
  1.5 String Data Structures in Memory Hierarchies
  1.6 Conclusions
  References

2 EFFICIENT RESTRICTED-CASE ALGORITHMS FOR PROBLEMS IN COMPUTATIONAL BIOLOGY
Patricia A. Evans and H. Todd Wareham
  2.1 The Need for Special Cases
  2.2 Assessing Efficient Solvability Options for General Problems and Special Cases
  2.3 String and Sequence Problems
  2.4 Shortest Common Superstring
    2.4.1 Solving the General Problem
    2.4.2 Special Case: SCSt for Short Strings Over Small Alphabets
    2.4.3 Discussion
  2.5 Longest Common Subsequence
    2.5.1 Solving the General Problem
    2.5.2 Special Case: LCS of Similar Sequences
    2.5.3 Special Case: LCS Under Symbol-Occurrence Restrictions
    2.5.4 Discussion
  2.6 Common Approximate Substring
    2.6.1 Solving the General Problem
    2.6.2 Special Case: Common Approximate String
    2.6.3 Discussion
  2.7 Conclusion
  References

3 FINITE AUTOMATA IN PATTERN MATCHING
Jan Holub
  3.1 Introduction
    3.1.1 Preliminaries
  3.2 Direct Use of DFA in Stringology
    3.2.1 Forward Automata
    3.2.2 Degenerate Strings
    3.2.3 Indexing Automata
    3.2.4 Filtering Automata
    3.2.5 Backward Automata
    3.2.6 Automata with Fail Function
  3.3 NFA Simulation
    3.3.1 Basic Simulation Method
    3.3.2 Bit Parallelism
    3.3.3 Dynamic Programming
    3.3.4 Basic Simulation Method with Deterministic State Cache
  3.4 Finite Automaton as Model of Computation
  3.5 Finite Automata Composition
  3.6 Summary
  References

4 NEW DEVELOPMENTS IN PROCESSING OF DEGENERATE SEQUENCES
Pavlos Antoniou and Costas S. Iliopoulos
  4.1 Introduction
    4.1.1 Degenerate Primer Design Problem
  4.2 Background
  4.3 Basic Definitions
P1: OSO ind JWBS046-Elloumi December 2, 2010 9:56 Printer Name: Sheridan

INDEX

  as unsupervised learning problem, 959
  validation, 969–973
Network topology, reverse engineering
  benchmarking metrics, 944–945
  components of, 942–943
Neural networks (NNs), 372–373, 450, 460, 471, 473–474, 502, 506–507
Neurodegenerative disorders, 207
New Chemical Entities database, 371
NEWICK format, 566
Newton-Raphson method, 554
Next generation sequencing/sequencers, 36, 92, 299–300, 426–428, 444
NGILA, 243, 246
NIMBUS filters, 304–307, 309
Nodes
  child, in rooted tree, 198
  data, 224–227, 232, 235–236
  index, 224, 226, 229, 232, 235–236
  initial, 203
  parent, in rooted tree, 198
  routing, 224
  storage, 225
  utilization, 232–233, 235
Noise distribution, 646–647. See also Gaussian noise
Noncoding region, 802, 815
Non-cognate aa-tRNA, 929, 932
Nonconserved gene pair, 455–456
Nondeterministic automaton, 59. See also Nondeterministic finite automata (NFA)
Nondeterministic finite automata (NFA)
  active state of, 52
  approximate string matching, 56
  depth of state, 52
  exact gene matching, 55
  exact string matching, 54–55
  level of state, 52
  simulation, 60–66
Nondeterministic polynomial (NP)
  -complete problems, 172–173, 255, 537, 731, 733–734, 741
  defined, 130
  -hard problem, 440–441, 450, 672–673, 681, 751, 753, 757–758, 764, 847, 850, 854, 959
Nonmaximal words, composition vectors and, 340
Nonorthologous genes, 726
Normal distribution, 341–342, 704, 871
Normalization, 630–631, 967
Northern hybridization, 452
Nuclear magnetic resonance (NMR), 537
Nucleic sequences, 313
Nucleosomes, 398
Nucleotide(s), see Single nucleotide polymorphisms (SNPs)
  barcoding problems, 139
  functions of, 130, 171, 302, 311–312, 401, 403, 439, 443, 543, 551–553, 580–581, 593, 610, 616, 626, 802, 804–805, 808, 813, 834, 838
  polymorphisms, 429
  sequences, 114
Oblivious model of computation, 19
Occurrence heuristics, 94
Odd cycles, 760
Offset indexing, 311
Ohno’s law, 750
Oligonucleotides, 92, 133,
403–404, 416, 439, 441–443, 461, 626, 644–645
Online Mendelian Inheritance in Man (OMIM) database, 682
Open reading frames (ORFs), 454, 461–462, 473
Operon
  characterized, 449–450
  databases, 450
  prediction, see Operon prediction
Operon Database (ODB), 450–452
Operon Finding Software (OFS), 469
Operon prediction
  computational data, 454–459
  datasets, 451–454, 474
  machine learning methods, 460–474
  preprocess methods, 459–460
Oppossum2 database, 399
Optimization
  entropy, 601–613
  genomic distances, 785
  phylogenetic reconstruction, 593–594
  problems, see Optimization problems
Optimization problems
  maximum likelihood, 551
  protein function prediction, 488–489
  solution of, 605–609
Order array, 80
Order of magnitude, 92, 315, 873
Order preserving submatrix (OPSM) algorithm, 658
Oreganno database, 399, 406
ORIS seed-based indexing, 314
Ortholog assignment
  characterized, 727–729
  phylogeny-based method, 731–732
  rearrangement-based method, 732–734
  sequence similarity-based method, 729–731
Orthologous sequences, 322
Orthologs, defined, 727–728, 765. See also Ortholog assignment
Oscillation, 639–640
Osprey software, 876
Out-degree
  distribution, 994
  in graphs, 195–196
Overexpression, 974
Overfitting, 966
Overlap graphs, 33–35
Overlapping, 534–535
Oversampling, 483
OXBENCH, 253–254
PAGE, 669
Pairwise alignment, 36–37, 143, 242–246, 255, 284–285, 413–414, 482
Pajek software, 876
PAML, 551
Pancake flipping problem, 761–762
Papillomaviruses, 552
Paradigms, alignment-free distances, 322
Parallel coordinate (PC), 659
Parallelism, 314, 554
Parallelization, 106, 549, 558, 573
Parallel sequencing, 92
Parallelograms, in filtering, 304–305, 308
Paralogs, 728, 730, 733–734, 765
Parameterized matching, 13
Pareto analysis, 371
Parsimonious tree-grow (PTG) method, 848
Parsimony
  co-expressed networks, 960
  large problem, 581, 594
  maximum, see Maximum parsimony
  small problem, 580–581
  types of, 580,
729
Parsing, 333–334
Partition function, 530–533
Partition into exact search (PEX) filter, 104
Partition-ligation-expectation-maximization algorithm, 852–853, 855
Partitioning, 17, 83, 211–212, 237, 374, 488, 508, 556, 733, 787–788, 818, 904, 971
PASS, 432–433
Pastry algorithm, 225
Paternal genome, 845
PathBlast software, 879
Path label,
Pathogen detection, 129–130
Paths, in graphs, 194–195
Patricia trie, 18
Pattern-based algorithms
  defined, 386
  for planted-motif problem (PMP), 387–388
PatternBranching, 390, 395
Pattern-discovery algorithm, 411–412
Pattern-driven algorithms, 406–408
PATTERNHUNTER, 245
Pattern matching
  approximate, see Approximate pattern matching
  automaton, 53
  basic classification of algorithms, 53
  features of, 7–10, 13–15, 349
  multiple, 97–103
  single, 93–96
  weighted DNA sequences, 152–160
PAUP*, 551
Pazar database, 399
Pearson’s correlation coefficient, 348, 469, 655, 960, 995
Pedigree, significance of, 844
Pedigree tree, 855
Peer-to-peer (P2P) environment, 237
Peptides, operon prediction, 464–465
Peptidyl transfer, 927–928
Percent Accepted Mutations (PAM), 242
Perfect phylogeny haplotype (PPH) problem, 849–851
Performance guarantee, 752–753
Period, in strings, 77, 146
Periodograms, 646
Permutations
  common intervals in, 778–782
  conserved intervals in, 782–783
  defined, 774
  gene regulatory networks, 970
  implications of, 752–754
  identity, 760, 777, 779
  interval-based computations, 774–782
  set, 175–178
  signed, 757–759, 763, 774, 778, 782
  test methods, 670
  unsigned, 759, 774
Permutation tree, 761
Pfizer rule, 370
Pharmacogenomics, 376
Pharmacophoric keys, in 3-D descriptors, 366–367
PHASE software, 852–853, 858
Phosphorylation, 503, 883
Photolithographic synthesis, 626
PHYLIP, 613
PhyloBayes, 551
Phylogenetic analyses, 514, 556, 882
Phylogenetic footprinting, 408–409
Phylogenetic likelihood function (PLF)
  acceleration by algorithmic means, 555–558
  characterized, 549–550
  defined, 549
Phylogenetic profiles, 450, 469
Phylogenetic reconstruction, see Phylogeny reconstruction
  evolutionary metaheuristics, 589–590
  features of, 331
  memetic methods, 590, 592
  neighborhoods, 584–586, 588
  problem-specific improvements, 592–594
Phylogenetic search algorithms, 549–573. See also Maximum-likelihood (ML)
Phylogenetic trees
  characteristics of, 208, 242, 327, 329, 332, 338, 345–346, 352–354, 551, 615
  construction of, 613, 731
  inferences, 582
Phylogenomic alignments, 551
Phylogeny
  characterized, 551–552
  compositional methods, 346–347
  information theoretic and combinatorial methods, 344–346
  haplotype inference, 849–851
  mitochondrial genome, 334
  reconstruction, see Phylogeny reconstruction
Phylogeny reconstruction
  heuristic methods with maximum parsimony, 579–595
  maximum entropy method for composition vector method, 599–619
  phylogenetic search algorithms for maximum likelihood, 549–573
Phyloinformatics, 550
PHYML, 551, 554, 559, 562, 564
PhyNav program, 573
Picard-Queyranne algorithm, 212
PicTar, 994–995, 997–1000
PID (primary immunodeficiency), 886
Piecewise constant function, 819–820
Pigeon-hole principle, 431
Pipmaker, 399
PLAGE, 669, 671
Plain (PLAIN) arc structure, 115, 119–121, 123, 125
Planted (l, d)-motif problem (PMP), 386–390, 395
PLASMA, 251
Plasmodium falciparum, 643, 645–646, 647–648
PLAST seed-based indexing, 314
POCSIMPUTE, 632–633, 648
Point accepted mutation (PAM), 208–209, 283, 322
Pointer meshes, 557–558
Point mutations, 725, 728
Point query, 226, 230–232, 236
Poisson distribution, 341
Poisson score, 348
Poly(A) tail, 537
Polymerase chain reaction (PCR), 31, 73–74, 133, 136, 171, 416, 426, 998
Polynomial-time
  approximation, 120, 122, 125, 130
  efficiency, 37, 42
  problems, 28–30, 32, 40
Polypeptides, 514
Population
  diversity, 589
  haplotyping problem, 844
Porphyra, 618
Position frequency matrix (PFMs), 401–402
Position-specific
scoring matrices (PSSMs), 401, 508–510
Position weight matrices (PWMs), 12, 401–404, 406, 412
Positive irreducible intervals, 782–783
Post-transcriptional
  networks, 993
  regulations, 901
Power law
  distribution, 870–871
  function, 801, 815–816, 994
Power spectrum
  characterized, 813–817, 834, 836
  density (PSD), 646
PQ-tree, 210
PRALINE, 247, 251
Precedence relation, 114
Prediction suffix tree, 13
PREFAB, 253
Prefix
  creation phase, 17
  defined, 3, 146, 386
  reversals, 761–762
  transpositions, 762, 766
Preprocessing step, ExMP, 393
Prim’s algorithm, 201–202, 211
Primates, phylogenetic trees, 617
Primers, 74, 133
Principal component analysis (PCA), 633
Principal eigen vector, 709
Prism model checker, 877, 929–933, 935–936
Probabilistic approaches, applications, generally, 893–894. See also Probabilistic models
Probabilistic Boolean networks (PBNs)
  Boolean functions, 896–897
  dependency graph, 895–896
  examples of, 898–900
  inferring from experiments, 901–902
  natural extensions, 900–901, 911
  representations, 896–898
  simulations, 906–911
  update strategies, impact on analysis of, 905–906
Probabilistic graphical models
  Bayesian networks, 961–963
  characterized, 961–962
  MINREG algorithm, 963
  module networks, 963
Probabilistic model(s)
  Boolean networks (PBNs), 895–901, 905
  checking, see Probabilistic model checking
  graphical, see Probabilistic graphical models
  inference from experiments, 901–902
  interpretation and quantitative analysis of, 902–911
Probabilistic model checking
  background, 926–927
  defined, 925
  insertion errors, 933–934
  Prism, 929–933, 935–936
  motivation, 928–929
Probabilistic suffix tree, 13
Probability
  defined, 12
  distributions, 324–330, 337–338, 343, 462, 530–531, 838
  dot plot, 534
  matrices, 553
  theory, 858
PROBCONS, 248, 251, 253
Probe(s)
  microarray technology, 187, 626–627, 958
  sequences, 187
ProbModel approach, evolutionary trees, 209
Pro(CKSI), 354
PRODO, 502
Profile(s)
  alignments, 284, 288
  in multiple alignment, 247
Profile-based algorithms
  ChIP-Seq data analysis, 443
  defined, 386
  for planted-motif problem (PMP), 389–390
ProfileBranching, 389–390, 395
Profiling techniques, 503–510
Progressive alignment, 281, 285–288
Progressive neighborhood (PN), 586
Projection onto convex sets (POCS), 632–633
Prokaryotic genomes, 459, 474, 617
PROMALS, 247, 251
Promela model checker, 917–918, 921–922, 938–940
Promoters, 398–399, 413, 437, 443, 957
Property matching, 13
PROSITE database, 101, 144
Prostate cancer, 995
Protein(s), generally
  alignment, 322
  β-sheets, 513
  CATH classification of, 350
  domains, see Protein domains
  function prediction with data-mining techniques, 479–493
  gene expression, 485–486
  G-proteins, 350–351
  GH2 family, 351
  misclassification of, 483
  phylogenetic profile, 455
  phylogeny experiments, 346
  predicting network interactions by completing defective cliques, 215–216
  RAMPs family, 351
  scaffold, 1001
  sequence classification, 481–483
  sequences, generally, 337
  structural classification of (SCOP), 273
  structure, atomic coordinates in, 269, 272
  subcellular localization prediction, 483–484
  superfamilies, 273, 275, 502–503
  thermally stable, 212–214
  thermophilic, 213
Protein-coding
  gene networks, 994
  regions, 403, 602–603
Protein Data Bank (PDB) database, 92, 212–213, 261, 266–267, 270, 272–274, 279, 350, 484
Protein-DNA
  binding, 956
  complexes, 416, 426
  interaction, 671–672
Protein domain boundary prediction
  importance of, 501–503
  machine learning models, 510–515
  profiling technique, 503–510
Protein domains
  boundary prediction, see Protein domain boundary prediction
  decomposition, 212, 216
  identification of, 210–211
Protein interactome map, protein function prediction
  global topology, 488–489
  local topology structure, 486–488
Protein-protein identification, 671–672
Protein-protein interaction (PPI), 398, 480, 486, 488–489, 491–492, 665–666, 675–676, 679–682, 878–880, 884, 973–974
Proteobacteria, 347,
794–795
Proteomes, phylogenetic studies, 337, 346
Proteomics, 332
Proteomic sequences, 322, 344–345, 355
PRRP, 248
PRRP/PRRN, 251
Pruning, 786, 961
PSALIGN, 247, 251
Pseudo-cognate aa-tRNA, 929–931
Pseudocodes, 593
Pseudocounts, 402
Pseudoknots, in RNA
  biological relevance of, 530, 536–537, 544
  characterized, 524, 534–536
  detection of, 542
  dynamic programming, 538–542
  heuristic approaches, 541–543
  prediction of, 537–538, 542, 544
Pseudovertices, 180–181
PSI-BLAST, 479–480, 503, 508, 510
Psilotum, 618
PSIST algorithm, 263, 273–274
PubChem database, 367–368
PubMed database, 367, 453
Pure parsimony model, 5, 848–849
Pyrococcus furiosus, 472
Quadratic discriminant (QDA), 372
Quadratic time, 36
Quantitative structure activity relationship (QSAR), 370
Quantitative trait locus (QTL), 856
Quantum match, 15
QUASAR filters, 303–305, 307
Quasiperiodic string, 77
Queries
  data management system, 226–227
  query arrival rate, 229, 233–235
  query/data-fetch reports, 226–227
  query resolving process, 226–227
  query sequence, 210
  range, 222, 229–230, 233–236
  response time to, 231–232, 234–235
QuEST, 436
Rabin-Karp algorithm, 103
Random access memory (RAM), 106, 139, 432–433
Randomization, 669–670
Random walk, 584, 683, 814–815, 838, 907
Range query, 222, 226, 229–230, 233–236
Rapid amplification, 133
Rapid computation, 130, 133
RASCAL algorithm, 250
Ratcheting technique, 594–595
Rate heterogeneity, 554, 556
RaxML, 550–551, 554, 556–559, 561–565, 567, 569, 573
Read(s)
  mapping, exact set pattern matching, 103–107
  next-generation sequencing, 428
  operation, Adleman-Lipton model, 175
  whole-genome sequencing, 429–434
Rearrangement
  analysis, 726–727, 732–734
  implications of, 766
  multibreak, 765–766
ReBlosum matrix, 313
Recall-Precision plot, 945
Receiver operating characteristic (ROC)
  analysis, 329
  curve, 683, 971
  plot, 945
  reverse engineering, 945, 949–950
Receptor activity-modifying proteins (RAMPs) family,
351
Reciprocal best hits (RBHs), 729–730
Recombination, 657, 845, 851–852
Reconstruction sequences, 30
Recurrent neural networks, 373
Recursion, RNA structures, 527–529, 538–540
Redundancy, 19, 875
Redundant distinguishability, 133
REFINER algorithm, 250
RefSeq, 399, 480, 993
Regression
  linear, 738
  models, 676, 683
  studies, 738
Regular expressions, 144–145
Regular graph, 195
Regulatory information, sources of, 405–406
Regulatory Program (RP), 968
Regulatory regions
  annotating sequences using predictive models, 406–408
  combining motifs and alignments, 412–414
  comparative genomics characterized, 408–410
  dependencies in sequences, detection of, 403–405
  genome regulatory landscape, 397–399
  qualitative models of regulatory signals, 400–401
  quantitative models of regulatory signals, 401–403
  repositories of regulatory information, 405–406
  sequence comparisons, 410–412
  validation, experimental, 414–417, 428
Regulatory signals
  qualitative models, 400–401
  quantitative models, 401–403
RegulonDB database, 450, 452–453, 464, 472
Relative entropy, 322–329, 390, 442, 466
Relevance networks, 959–960
Repeats
  defined, 299, 301
  filtering, 302, 304–309
  multiple, 301, 305–307
Repetitions
  fixed-length simple, 161–162
  fixed-length strict, 163
  simple, 160–163
  strict, 160, 163
  in strings, 146
  weighted sequences, 160–163, 166–167
Replace, pattern matching, 62–64
Replica management, 223, 228–232, 236
Reptiles, phylogenetic trees, 615
Resampling of the estimated log likelihood (RELL) method, 565
Response time, queries, 231–232, 234–235
Restricted-case algorithms
  assessing efficient solvability options for general problems and, 28–30
  Common Approximate Substring (CAS), 28, 41–46
  Longest Common Subsequence, 36–41
  need for special cases, 27–28
  Shortest Common Superstring (SCS), 28, 31–36
  string and sequence problems, 30–31
  types of restrictions, 29, 46
Restricted neighborhood search clustering
(RNSC), 488
Restricted recursive pseudoknots, 541–542
Reversal distance, 732
Reversals, genome rearrangements, 753–759, 766
Reverse engineering
  applications, generally, 941–942
  of biological networks, 942–945
  combinatorial algorithms, 946–951
  gold standard, 944
  miRNA inference modules using top-down approaches, 988–982
  miRNA-mediated post-transcriptional networks, 993
  miRNA modules, 1001
  performance evaluation, 944–945
Reverse transcriptase PCR (RT-PCR), 416
Revrev, 763
RF algorithm, 250
Rhodophyte, 618
Ribosomal RNA (rRNA), 132, 345–346, 614–615
Rice genome, 737
Rightmost shift array, 15
RMAP software, 104–106, 434
RNA
  alignment algorithms, 254
  alignment programs, 293
  characterized, 521–522
  pathogen-specific, 130
  pseudoknots, 534–543
  secondary structure, 115, 125, 522–534
  sequences, see RNA sequences
  structure comparison, 113
RNAhybrid, 994
RNA polymerase II, 397–398
RNA sequences
  Common Approximate Substring (CAS) problem, 42–43
  features of, 601
  Shortest Common Superstring (SCS) problem, 31
  MPSCAN applications, 103–104
ROBIN software, 767
Robinson Foulds (RF) metric, 345, 550, 556–571
Robust computation, 130, 133
Roche-454 next-generation sequencing system, 426
Rodents, phylogenetic trees, 617
Rooted trees, 198, 582
Rose hydrophobicity scale, 504
RSA, 407
RSAtools, 399
R-tree/R*-tree, 222–223, 226, 234
Rule of five (Ro5), 370–371
Run, in strings, 146
rVISTA database, 399
SABMARK, 253
Saccharomyces cerevisiae, 212, 216, 479, 640, 660, 874, 877–878, 884, 886, 956, 963
  Genome Database, 972
SAGA, 251
SAGE expression analysis, 967
SAM-GS, 669
SAMO database, 485
Sampling
  ChIP-Seq data analysis, 441
  intensity, 210
SANDY algorithm, 882
SARAH1 scale, 503–505, 509–510, 515
SARSCoV (severe acute respiratory syndrome coronavirus), 346
SATCHMO, 251
SBNDM4, 95–96
Scaled matching, 13
Scale-free networks, 871, 958, 973
Scaling factor, 291
Scaling law, 801
Scan phase,
391
Scoring
  conservation, 290–291
  entropy, 390
  local dissimilarity, 336
  log-likelihood, 459–460, 473, 569
  normalization, 473
  Poisson, 348
  profile-to-profile, 482
  similarity, 279–280
  sum-of-pairs, 283
Search tree-based algorithms, 35, 37
Sea Urchin, 884
Secondary structure alignment, 513–514
Sectorial search (SS), 592–593, 595
Seed/seeding, generally
  defined, 245
  in degenerate strings, 77, 84–85, 87–89
  filters, 309, 315
  pairwise local alignment, 245
Segment polarity genes network, 948–949
Selectivity, in filtering, 308–309
Self-organizing maps (SOMs), 372
Semantic analysis, 482
Semimetrics, 334
Sensitivity
  co-expressed networks, 961
  gene clustering, 741–742
  in filtering, 311, 315
  operon prediction, 453, 465, 467
SEQMAP software, 105–106, 432
Sequence(s), generally
  alignment, 242–254, 322, 354
  analysis, 323
  comparison, 321, 599
  homologies, 315
  logos, 145
  motifs, 92
  set, 176
  similarity, 91, 280, 347, 410, 729
  splitting, 814
Sequence-driven algorithms, 409–410
Sequence-sequence alignments, 284
Sequencing by hybridization (SBH), 34, 214–215
Serine/arginine-rich (SR) proteins, 143–144
Set backward oracle matching (SBOM) algorithm, 98–100
Set problems
  cover, 134, 139, 172, 947
  packing, 172
Seven bridges of Königsberg, 193, 196–197
Severe acute respiratory syndrome (SARS), 537
Sexually transmitted disease (STD), 886
Shannon entropy, 325
Shannon information theory, 323
Sheet structures, 263
Shift-or technique, 15
Shift-Or algorithm, 95, 101
Shift-or with q grams (SOG), 101
Shortest Common Superstring (SCS), 28, 31–36
Shortest path, generally
  algorithms, 38
  network problem, 213
  problem, 203–205, 213
Short oligonucleotide alignment program (SOAP), 104, 106, 432–433
Short swap, 763
Short tandem repeats (STRs), 349
Shotgun sequencing method, 843
Side-chain conformations, 267
Side-walk descent, 584
Sigmoid function, 470
Signaling transduction networks, 679
Signal-to-noise ratio, 603, 678
Signed common string partition (SMCSP), 790–791, 795
Signed reversal
distance with duplicates (SRDD), 732–733
Significance analysis function and expression (SAFE), 669–670
Significance analysis of microarrays (SAM), 667, 670, 694, 698, 710
Simian immunodeficiency virus (SIV), 336, 346
Similarity, generally
  coefficients, 365–366
  functional, 347
  index, information-based, 327
  of sequences, 323–325, 334
  scores, 279–280, 730
  searches, 91–92, 210, 300, 315, 329, 362
Simple count methods
  characterized, 370
  enhanced, using structural features, 371
  example studies using, 370–371
Simple graph, 194
Simple motif problem (SMP), 386, 393–395
Simple motif search (SMS) algorithm, 394–395
Simple recursive pseudoknots, 538–542
Simple repetitions, 160–163
Simplified Molecular Input Line Entry Specification (SMILES), 363–364, 369–370
Simulated annealing (SA), 134, 584, 587, 678, 856
Simulations
  nondeterministic finite automata (NFA), 51–52
  probabilistic Boolean networks (PBNs), 906–911
Sine function, 807–808
Single factor differential expression
  characterized, 695
  multilevel, 696–697
  two-level, 695–696
Single genotype resolution (SGR), 848
Single-input motifs (SIMs), 874–875
Single instruction multiple data (SIMD), 593
Single linkage clustering, 462–463
Single nucleotide polymorphisms (SNPs), 5, 429, 433, 680, 734, 843, 848–849, 857
Single-scale networks, 871
Single streaming extension (SSE), 593
Singular spectrum analysis (SSA)
  autoregressive (AR)-based spectral estimation, (SSA-AR), 641, 643–644
  characterized, 641–642
  filtering, 644
Singular value decomposition (SVD), 643, 709–710
Sinusoids, 635
SISSRS, 436
Skewed data, 232, 237
Skip loop, 94–95
Small parsimony problem, 580–581
Smith-Waterman algorithm, 92, 143, 245, 314, 348
SnapDRAGON, 502
SNAP25, 1001
SOAR algorithm, 732
Solexa sequencing technology, 432
Solid symbol, 77
Sorting
  comparison-based, 19
  genome rearrangements, 753–765
Sorting Permutation by Reversals and block-INterchanGES (SPRING)
software, 767
Source-sink networks, 212–213
Space algorithms, 36
Spanning trees, 198–199
Sparseness, 957
Spatial query processing, 222
Spearman correlation, 342–343
Spearman’s rank correlation, 656
Speciation trees, 728
Specificity
  gene clustering, 741–742
  in operon prediction, 453, 465, 467
Spectral component correlation, 634–638
Spectral estimation, microarray analysis
  signal reconstruction, 644–646
  SSA-AR, 643–644
Speller algorithm, 393, 395
Spellman dataset, 971–972
SPEM, 247, 251
SPINE, 681
SPIN model checker, 917–918
Splicing graphs, 207–208
SPLITSTREE, 613–614
Spotted microarray, 626
Square, 77
SSABS, 95
Sspro, 513
Static graphs, 942
Statistical-Algorithmic Model for Bicluster Analysis (SAMBA), 490
Statistical analysis
  classical, 965
  miRNA, 1000
  population-based, 844
Statistical clustering techniques, 966
Statistical dependency
  mutual information and, 329–331
  significance of, 323–325, 349
Statistical process theory, 858
Statistical validation, network inference
  cross-validation, 971
  model selection, 970
  prediction performance, 971–972
Stem, RNA structure, 523, 525, 527, 541
Stem (STEM) arc structure, 115, 118–119, 124
Stem cells, 995
Stem-loops, 125
STEP, 585
Stepwise addition, 583
Stickers model, 172, 174–175, 188
Stochastic biclustering algorithms, 657
Stochastics, protein function prediction, 490
Stochastic search algorithms, 657
STOP search algorithm, 570–571
Stopping, phylogenetic search algorithms, 569–572
Strict repetitions, 160, 163
String, generally
  alignment, 143
  barcoding problem, 131–132
  B-trees, 5–6, 17–19
  characterized, 301
  database, 880
  defined, 3, 76, 146, 386
  graphs, 33–34
  indexing, see String indexing
  k-string composition, 337–338
  length of, 76
String data structures, for computational molecular biology
  classic algorithmic problems, 4–5
  EM-based algorithms, indeterminate/degenerate string, 5, 56–57
  index structures, 12–16, 21
  main string indexing data
structures, 6–12
  in memory hierarchies, 17–20
  overview of, 3–4, 20
  terminology, 3–4
String indexing
  compression, 16
  data structures, 6–12
  indeterminate strings, 14–16
  weighted strings, 12–14
Stringology, 51
Strings processing, and applications to biological sequences
  arc-annotated sequences, algorithmic aspects of, 113–125
  degenerate sequences, new developments in processing of, 73–89
  DNA barcoding problems, algorithmic issues in, 129–141
  DNA computing for subgraph isomorphism problem and related problems, 171–188
  efficient restricted-case algorithms for problems in computational biology, 27–46
  exact search algorithms, 91–108
  finite automata in pattern matching, 51–69
  string data structures for computational molecular biology, 3–20
  weighted DNA sequences, recent advances in, 143–167
Strips, approximation algorithms, 754–755, 758
Structural alignment, see Global structural alignment; Local structural alignment
  based on center of gravity (SACG), 266–271, 273, 275
  problem, 261–262
Structural classification of proteins (SCOP), 253, 273, 481, 484, 514
Structural motifs
  functional, 273
  searching, 270, 272–274
Structural restrictions, 29, 46
Styczynski et al.’s algorithm, 391–392, 395
Subgraph isomorphism problem
  defined, 172–174
  DNA computing, 179–183
  example of, 173
Subgraphs, 197–198, 367
Subsampling with network induction (SSML), 468
Subset sum problem, 172
Substitution operations, 116–117
Substitutions, 304
Substrings
  defined, 146, 386
  functions of, generally, 13, 788
  length of, 28
Subtree pruning and regrafting (SPR), 561–564, 569–570, 585–586, 588–589, 594
Subtrees, 14, 198
Suffix, generally
  arrays, 8–12
  automata, 57–59
  defined, 52, 146, 386
  link recovery phase, 17
  trees, see Suffix tree
  trie, 58
Suffix tree
  functions of, generally, 5–8, 14, 17, 43, 58, 243–244, 335–336, 339–340, 790
  pattern matching using, 152–153
  property, 151–152
  weighted, 148–153
Sum-of-pairs (SP)
  alignment, 36
  score, 283
Supercomputers, 550, 572
Superprimitive string, 77
Superstring, 77
Supervised classification (SC) methods, 372
Supervised learning, 959, 971
Supply link, 99
Support vector machines (SVMs)
  characterized, 372–373, 375, 460, 468–470
  protein domain boundary prediction, 502
  protein function prediction, 481, 483, 485, 487
Susceptible-infective-removed (SIR) model, 886
Susceptible-infective-susceptible model (SIS model), 886
SVDIMPUTE, 631
Swapped matching, 13
Swaps, pattern matching with, 156–157
SWIFT filters, 303–305
Swissprot, 279
Switch error, 856
Symbol-occurrence restrictions, 39–40
Synchronization loss, in microarray analysis, 632–633
Synchronous dynamical graph, 902–903, 905
Synonymous codon usage biases (SCUB), 458
Synteny
  block, 752, 757
  defined, 735–737
  See also Synteny detection
  detection, 734–739
SynTReN, 972
Systematic biclustering algorithms, 656–657
Systematic evolution of ligands by exponential enrichment (SELEX), 416
Systems biology, 216, 666, 942–943, 955
Tabu search (TS), 587
Tags
  ChIP-SEQ data analysis, 429–437
  data models for, 237
  next-generation sequencing, 428
Tandem affinity purification (TAP) tagging, 215
Tandem repeat
  defined, 146
  fixed-length, 163
TargetScan, 994
TargetScanS, 997, 999–1000
Taxonomy tree, 346
TCA, 879, 881
T-COFFEE, 247, 251, 253, 731
Teiresias algorithm, 394–395
Temporal gene expression profile analysis, 634–640, 648
Temporal properties, probabilistic models
  asynchronous update, 903–905, 908
  mixed update with priorities, 904–905, 909
  synchronous update, 902–903
Ten-fold cross-validation, 971
Term Finder, GO website, 660
Testing, degenerate strings, 80
Test statistics, 994
TF-MAP alignment, 399
Thermodynamics, 188
Thermotogae, 347
Third-generation sequencing, 444
Three-dimensional matching problem, 172
Thymine (T), 3, 31, 36, 171, 599, 799–800, 802, 816, 818
TIGRFAMS database, 481
Time and space analysis, 529
Time complexity, 790
Time-series data, gene expression, 962
  profiles, 644
  whole-genome expression data, 634
TM-Align algorithm, 263
Top-down clustering, 374
TOPNET, 681
Topological distance, 586–587, 589–592
Topological indices, 364
Top scoring pairs (TSP), 711, 713–714
TOPS model, 350
Torovirus, 346
Toxicity, 377, 680
T-PROFILER, 669
Tractability of problem, 27–30
Trails, in graphs, 195
Training population, 971
Transcriptome data, 966–967
Transcriptional networks, 881–883, 972
Transcription factor binding sites, 347
Transcription factors (TFs), 398, 400, 403, 405, 407, 412–414, 425–426, 428–429, 436–438, 869, 882–883, 887, 893, 899, 955, 957, 993–1001
Transcription regulatory networks (TRNs), 665, 884, 981, 993
Transcription start sites (TSS), 398–399
Transcriptomes, 430
Transcriptomics, 105
TRANSFAC database, 399, 405–407, 880, 995–996
Transfer RNA (tRNA), 929, 935
Translocation, 751
Transpositions, genome rearrangements, 751, 759–761, 766
Transreversal, 763, 766
Traveling salesman problem (TSP), 172, 200–201, 208
Tree(s), see specific types of trees
  alignment, 36
  bisection reconnection (TBR), 561, 585–586, 594
  in graphs, 198–199
  kernel, 455
  length, 580
TREE-FINDER, 551
TRELLS, 17
Treponema pallidum, 877
Trie, defined, 965
Trie-based algorithms, 97–100
Truncated generalized factor automaton (TGFA), 16
Truncated scale-free network, 871
Truth tables, 896, 898
t-statistic, 676
t-test, 667, 669, 694–696
TSUKUBA-BB, 249
TUIUIU filters, 304, 308–309
Tumors
  characterized, 660, 676
  differential expression, see Tumors, differential expression in compendium
  estrogen receptor (ER) progression, 691–692
  heterogeneity of, 701
Tumors, differential expression in compendium
  Gaussian mixture model (GMM) for finite levels of expression, 701–702
  kurtosis excess, 704–705
  locally adaptive statistical procedure (LAP), 710–711
  local singular value decomposition, 709–710
  outlier detection strategy, 703–704
TVSBS, 95
Two-channel microarray experiment,
626
Two-regulating-one, 634–635
TYNA software, 876
Type diabetes, 679
UCSC Genome Browser, 399, 410
Unbalanced genomes, 796
UNBAL-FMB, 787, 791–793, 795
Undersampling, 483
Undirected graphs, 194–195, 197, 869
UniProt protein database, 369, 481
Universal Similarity Metric (USM), 354
Unlimited (UNLIM) arc structure, 114, 117–118, 120–124
Unmatched substring, 80, 83
Unrooted binary tree
  function of, 568, 582
  topology, 552–553
Unsigned minimum common string partition (UMCSP), 787–791
Unsupervised classification (UC) methods, 374
Unsupervised learning problem, 959
Until operator, 933–934
Untranslated regions
  (3′ UTR), 995
  (5′ UTR), 349, 461–462, 536
Unweighted pair group, 613
Unweighted pair grouping method with arithmetic means (UPGMA), 280, 285, 293, 589, 592
Upstream problems, 725
Up-weighting, 288, 291
Uracil (U), 3, 31, 171, 522
Validation
  biological, 972–973
  experimental, 414–417, 428, 472
  gene regulatory network inference, 969–973
  network inference, 970–973
  statistical, 970–972
Value sequence, 187
van Emde Boas trees, 165
Vanishing ingredient, 502, 506
Vanted software, 876
Variable length alignment fragment pairs (VLAFPs)
  algorithm, 263–266
  based on center of gravity (CG), 274–275
  contiguous sequence of, 264
  defined, 262
  protein classification, 274
Variable Neighborhood Search (VNS), 586, 588
VCOST function, 264–266, 270
Vector(s)
  bipartition, 566–568
  in computing genomic distances, 779
  dissimilarity rank, 342
  Euclidean, 611
  frequency, 600–601
  in microarray analysis, 633
  probability, 554–557
Vectorizing, 593
Verification, 100
Vertebrates, 346
Vertex
  cover problem, 172
  isolated, 180–181
  level of, 198
  -weighted graph, 195
Vienna RNA Package, 534
Virtual Institute for Microbial Stress and Survival (VIMSS), Genomes to Life (GTL) project, 451
Virtual suffix tree, 11
Viruses, 331, 345–346, 536–537, 552, 619. See also specific viruses
Visant software, 876
VISTA, 399, 410–411
VITERBI algorithm, 461, 901
Walks
  in graphs, 194–195
  random, 584, 683, 814–815, 838, 907
  Shortest Common Superstring (SCS), 34–35
Watson-Crick complements, 132
Wavelet analysis
  discrete Haar wavelet transform, 821–823
  Haar wavelet basis, 819
  Haar series, 819–821
Wavelet coefficient, 820–821, 828–839. See also Wavelet coefficient clusters
Wavelet coefficient clusters
  characterized, 828–830
  of complex DNA representation, 830–834, 839
  of DNA walks, 834–838
Wavelet transform, 460
Wavelet variance scanning (WAVES), 708–709
Weblogo, 399
Weighted directed graph, 195
Weighted DNA sequences
  characterized, 147–148
  defined, 145
  indexing, 148–152, 166
  strings, defined, 146
Weighted graph, 195–196
Weighted matching, 13
Weighted Sequence Entropy (WSE), 326–327
Weighted strings, index structures for, 12–14
Weighted suffix tree (WST), 14, 161, 165–167
Weighted TSP (WTSP), 714
Weighted undirected graph, 195
White noise, 816–817
Whole genome sequences, 136, 600, 619
Whole human genome, 407
Whole molecule descriptors, 363
Widrow-Hoff learning rule, 470
Wiener index, 364
Wilcoxon rank-sum test, 669–670
Wildcard, 393, 431
Wiring diagrams, 942
Word(s)
  defined, exact matches, 340–344
  size, 342–343
WormBase WS130 and 940, 994
Wu-Manber algorithm, 95, 103
XMOTIF algorithm, 659
X-ray crystallography, 263, 269, 267
YASS, 245–246
Yeast, see Saccharomyces cerevisiae
  cell cycle, 660, 675
  characteristics of, 397, 491, 956
  interaction networks, 684
  microarray analysis, 632, 635, 638, 640
  network inference, 962
  TF-mediated networks, 994
  transcriptional network, 882
  two-hybrid, 215, 878
YEASTRACT database, 972
Yed software, 876
Zero (full) Matching Breakpoint Distance (ZMBD), 795
Zero recombinant haplotype configuration (ZRHC) problem, 854–855
Zhu-Takaoka algorithm, 95
Zipfian/zipf distribution, 231
ZM, 432
zmSRp32 gene, 349
ZOOM, 103–106, 108, 314, 433
Zoom-in/zoom-out
techniques, 573
Z-score, 343, 348, 669, 998

Wiley Series on Bioinformatics: Computational Techniques and Engineering

Bioinformatics and computational biology involve the comprehensive application of mathematics, statistics, and computer science to the understanding of living systems. Research and development in these areas require cooperation among specialists from the fields of biology, computer science, mathematics, statistics, physics, and related sciences.

The objective of this book series is to provide timely treatments of the different aspects of bioinformatics spanning theory, new and established techniques, technologies and tools, and application domains. This series emphasizes algorithmic, mathematical, statistical, and computational methods that are central in bioinformatics and computational biology.

Series Editors: Professor Yi Pan and Professor Albert Y. Zomaya
pan@cs.gsu.edu
albert.zomaya@sydney.edu.au

Knowledge Discovery in Bioinformatics: Techniques, Methods, and Applications
Xiaohua Hu and Yi Pan

Grid Computing for Bioinformatics and Computational Biology
Edited by El-Ghazali Talbi and Albert Y. Zomaya

Bioinformatics Algorithms: Techniques and Applications
Ion Mandoiu and Alexander Zelikovsky

Analysis of Biological Networks
Edited by Björn H. Junker and Falk Schreiber

Computational Intelligence and Pattern Analysis in Biological Informatics
Edited by Ujjwal Maulik, Sanghamitra Bandyopadhyay, and Jason T. L. Wang

Mathematics of Bioinformatics: Theory, Practice, and Applications
Matthew He and Sergey Petoukhov

Algorithms in Computational Molecular Biology: Techniques, Approaches and Applications
Edited by Mourad Elloumi and Albert Y. Zomaya

... the main indexing structures for weighted and indeterminate strings, and (iii) we will present the main external memory string indexing structures

1.2 MAIN STRING INDEXING DATA STRUCTURES

In this...
replacing thymine),

Algorithms in Computational Molecular Biology, Edited by Mourad Elloumi and Albert Y. Zomaya
Copyright © 2011 John Wiley & Sons, Inc.

... China
Ling-Yun Wu, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
Xiao Yang, Department of Electrical and Computer Engineering, Bioinformatics and Computational

Format
Pages	1,085
File size	8.57 MB