Bioinformatics Algorithms: Techniques and Applications, Măndoiu and Zelikovsky, 2008-02-25. Data structures and algorithms.

BIOINFORMATICS ALGORITHMS
Techniques and Applications

Edited by Ion I. Măndoiu and Alexander Zelikovsky

A JOHN WILEY & SONS, INC., PUBLICATION

Copyright © 2008 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400, fax 978-646-8600, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201)-748-6011, fax (201)-748-6008.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services please contact our Customer Care Department
within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however, may not be available in electronic format.

Library of Congress Cataloging-in-Publication Data:

Bioinformatics algorithms : techniques and applications / edited by Ion I. Mandoiu and Alexander Zelikovsky.
p. cm.
ISBN 978-0-470-09773-1 (cloth)
Bioinformatics. Algorithms. I. Mandoiu, Ion. II. Zelikovsky, Alexander.
QH324.2B5472 2008
572.80285–dc22
2007034307

Printed in the United States of America.

CONTENTS

Preface
Contributors

1 Educating Biologists in the 21st Century: Bioinformatics Scientists versus Bioinformatics Technicians
   Pavel Pevzner

PART I TECHNIQUES

2 Dynamic Programming Algorithms for Biological Sequence and Structure Comparison
   Yuzhen Ye and Haixu Tang
3 Graph Theoretical Approaches to Delineate Dynamics of Biological Processes
   Teresa M. Przytycka and Elena Zotenko
4 Advances in Hidden Markov Models for Sequence Annotation
   Broňa Brejová, Daniel G. Brown, and Tomáš Vinař
5 Sorting- and FFT-Based Techniques in the Discovery of Biopatterns
   Sudha Balla, Sanguthevar Rajasekaran, and Jaime Davila
6 A Survey of Seeding for Sequence Alignment
   Daniel G. Brown
7 The Comparison of Phylogenetic Networks: Algorithms and Complexity
   Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi, and Giancarlo Mauri

PART II GENOME AND SEQUENCE ANALYSIS

8 Formal Models of Gene Clusters
   Anne Bergeron, Cedric Chauve, and Yannick Gingras
9 Integer Linear Programming Techniques for Discovering Approximate Gene Clusters
   Sven Rahmann and Gunnar W. Klau
10 Efficient Combinatorial Algorithms for DNA Sequence Processing
   Bhaskar DasGupta and Ming-Yang Kao
11 Algorithms for Multiplex PCR Primer Set Selection with Amplification Length Constraints
   K. M. Konwar, I. I. Măndoiu, A. C. Russell, and A. A. Shvartsman
12 Recent Developments in
Alignment and Motif Finding for Sequences and Networks
   Sing-Hoi Sze

PART III MICROARRAY DESIGN AND DATA ANALYSIS

13 Algorithms for Oligonucleotide Microarray Layout
   Sérgio A. De Carvalho Jr. and Sven Rahmann
14 Classification Accuracy Based Microarray Missing Value Imputation
   Yi Shi, Zhipeng Cai, and Guohui Lin
15 Meta-Analysis of Microarray Data
   Saumyadipta Pyne, Steve Skiena, and Bruce Futcher

PART IV GENETIC VARIATION ANALYSIS

16 Phasing Genotypes Using a Hidden Markov Model
   P. Rastas, M. Koivisto, H. Mannila, and E. Ukkonen
17 Analytical and Algorithmic Methods for Haplotype Frequency Inference: What Do They Tell Us?
   Steven Hecht Orzack, Daniel Gusfield, Lakshman Subrahmanyan, Laurent Essioux, and Sebastien Lissarrague
18 Optimization Methods for Genotype Data Analysis in Epidemiological Studies
   Dumitru Brinza, Jingwu He, and Alexander Zelikovsky

PART V STRUCTURAL AND SYSTEMS BIOLOGY

19 Topological Indices in Combinatorial Chemistry
   Sergey Bereg
20 Efficient Algorithms for Structural Recall in Databases
   Hao Wang, Patra Volarath, and Robert W. Harrison
21 Computational Approaches to Predict Protein–Protein and Domain–Domain Interactions
   Raja Jothi and Teresa M. Przytycka

Index

PREFACE

Bioinformatics, broadly defined as the interface between biological and computational sciences, is a rapidly evolving field, driven by advances in high-throughput technologies that result in an ever increasing variety and volume of experimental data to be managed, integrated, and analyzed. At the core of many of the recent developments in the field are novel algorithmic techniques that promise to provide the answers to key challenges in postgenomic biomedical sciences, from understanding mechanisms of genome evolution and uncovering the structure of regulatory and protein-interaction networks to determining the genetic basis of disease susceptibility and elucidating historical
patterns of population migration.

This book aims to provide an in-depth survey of the most important developments in bioinformatics algorithms in the postgenomic era. It is neither intended as an introductory text in bioinformatics algorithms nor as a comprehensive review of the many active areas of bioinformatics research; to readers interested in these we recommend the excellent textbook An Introduction to Bioinformatics Algorithms by Jones and Pevzner and the Handbook of Computational Molecular Biology edited by Srinivas Aluru. Rather, our intention is to make a carefully selected set of advanced algorithmic techniques accessible to a broad readership, including graduate students in bioinformatics and related areas and biomedical professionals who want to expand their repertoire of algorithmic techniques. We hope that our emphasis on both in-depth presentation of theoretical underpinnings and applications to current biomedical problems will best prepare the readers for developing their own extensions to these techniques and for successfully applying them in new contexts.

The book features 21 chapters authored by renowned bioinformatics experts who are active contributors to the respective subjects. The chapters are intended to be largely independent, so that readers do not have to read every chapter, nor read them in a particular order. The opening chapter is a thought-provoking discussion of the role that algorithms should play in 21st century bioinformatics education. The remaining 20 chapters are grouped into the following five parts:

Part I focuses on algorithmic techniques that find applications to a wide range of bioinformatics problems, including chapters on dynamic programming, graph-theoretical methods, hidden Markov models, sorting and the fast Fourier transform, seeding, and approximation algorithms for phylogenetic network comparison.

Part II is devoted to algorithms and tools for genome and sequence analysis. It includes chapters
on formal and approximate models for gene clusters, and on advanced algorithms for multiple and non-overlapping local alignments and genome tilings, multiplex PCR primer set selection, and sequence and network motif finding.

Part III concentrates on algorithms for microarray design and data analysis. The first chapter is devoted to algorithms for microarray layout, with the next two chapters describing methods for missing value imputation and meta-analysis of gene expression data.

Part IV explores algorithmic issues arising in the analysis of genetic variation across human populations. Two chapters are devoted to computational inference of haplotypes from commonly available genotype data, with a third chapter describing optimization techniques for disease association search in epidemiologic case/control genotype data studies.

Part V gives an overview of algorithmic approaches in structural and systems biology. The first two chapters give a formal introduction to topological and structural classification in biochemistry, while the third chapter surveys protein–protein and domain–domain interaction prediction.

We are grateful to all the authors for their excellent contributions, without which this book would not have been possible. We hope that their deep insights and fresh enthusiasm will help attract new generations of researchers to this dynamic field. We would also like to thank series editors Yi Pan and Albert Y. Zomaya for nurturing this project since its inception, and the editorial staff at Wiley Interscience for their patience and assistance throughout the project. Finally, we wish to thank our friends and families for their continuous support.

Ion I. Măndoiu and Alexander Zelikovsky

CONTRIBUTORS

Sudha Balla, Department of Computer Science and Engineering, University of Connecticut, Storrs, Connecticut, USA
Sergey Bereg, Department of Computer Science, University of Texas at Dallas, Dallas, TX, USA
Anne Bergeron, Comparative Genomics Laboratory, Université du
Québec à Montréal, Canada
Paola Bonizzoni, Dipartimento di Informatica, Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Milano, Italy
Broňa Brejová, Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY, USA
Dumitru Brinza, Department of Computer Science, Georgia State University, Atlanta, GA, USA
Daniel G. Brown, Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada
Zhipeng Cai, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada
Cedric Chauve, Department of Mathematics, Simon Fraser University, Vancouver, Canada
Bhaskar DasGupta, Department of Computer Science, University of Illinois at Chicago, Chicago, IL, USA
Sérgio A. de Carvalho Jr., Technische Fakultät, Bielefeld University, D-33594 Bielefeld, Germany

COMPUTATIONAL APPROACHES TO PREDICT PROTEIN–PROTEIN

[Figure 21.11: Comparison of sensitivity in the mediating domain pair prediction experiment for the Association, EM, DPEA, and PE methods. Vertical axis: average estimated sensitivity (TP/(TP+FN)). Horizontal axis: number of potential domain interactions in protein pairs (number of protein pairs in the corresponding class).]

FIGURE 21.11 Domain–domain contact prediction results. The results are broken down according to the potential number of domain–domain contacts in the given interacting pair of proteins. Pairs of interacting proteins are selected so that each pair contains an iPfam domain pair which is assumed to be in contact. Figure adapted from [26].

selection is small, although the improvement increases with the number of possible domain pair contacts.

21.3.2.5 Most Parsimonious Explanation (PE)

Recently, Guimaraes et al. [26] introduced a new domain interaction prediction method called the most parsimonious explanation. Their method relies on the hypothesis that
interactions between proteins evolved in a parsimonious way and that the set of correct domain–domain interactions is well approximated by the minimal set of domain interactions necessary to justify a given protein–protein interaction network. The prediction problem is formulated as a linear programming optimization problem, where each potential domain–domain contact is a variable that can receive a value ranging between 0 and 1 (called the LP-score), and each edge of the protein–protein interaction network corresponds to one linear constraint. That is, for each (unordered) domain pair $D_{ij}$ that belongs to some interacting protein pair, there is a variable $x_{ij}$. The values of $x_{ij}$ are computed using the linear program (LP):

$$
\text{minimize} \sum_{D_{ij}} x_{ij}
\quad \text{subject to} \quad
\sum_{D_{ij} \in P_{mn}} x_{ij} \ge 1, \quad \text{where } P_{mn} \in I
\tag{21.12}
$$

To account for the noise in the experimental data, a set of linear programs is constructed in a probabilistic fashion, where the probability of including an LP constraint in Equation 21.12 equals the probability with which the corresponding protein–protein interaction is assumed to be correct. The LP-score for a domain pair $D_{ij}$ is then averaged over all linear programs. An additional randomization experiment is used to compute p-values and prevent overprediction of interactions between frequently occurring domain pairs. Guimaraes et al. [26] demonstrated that the PE method outperforms the EM and DPEA methods (Fig. 21.11).

GLOSSARY

Coevolution: Coordinated evolution. It is generally agreed that proteins that interact with each other or have similar function undergo coordinated evolution.
Gene fusion: A pair of genes in one genome is fused together into a single gene in another genome.
HMMer: HMMer is a freely distributable implementation of profile HMM (hidden Markov model) software for protein sequence analysis. It uses profile HMMs to do sensitive database searching using statistical descriptions of a sequence family’s consensus.
iPfam: iPfam is a resource that describes
domain–domain interactions that are observed in PDB crystal structures.
Ortholog: Two genes from two different species are said to be orthologs if they evolved directly from a single gene in the last common ancestor.
PDB: The protein data bank (PDB) is a central repository for 3D structural data of proteins and nucleic acids. The data, typically obtained by X-ray crystallography or NMR spectroscopy, are submitted by biologists and biochemists from around the world, released into the public domain, and can be accessed for free.
Pfam: Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families.
Phylogenetic profile: A phylogenetic profile for a protein is a vector of 1s and 0s representing the presence or absence of that protein in a reference set of organisms.
Distance matrix: A matrix containing the evolutionary distances of organisms or proteins in a family.

ACKNOWLEDGMENTS

This work was funded by the intramural research program of the National Library of Medicine, National Institutes of Health.

REFERENCES

1. HMMer. http://hmmer.wustl.edu
2. RPS-BLAST. http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi
3. Altschuh D, Lesk AM, Bloomer AC, Klug A. Correlation of coordinated amino acid substitutions with function in viruses related to tobacco mosaic virus. J Mol Biol 1987;193(4):683–707.
4. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990;215(3):403–410.
5. Apic G, Gough J, Teichmann SA. Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J Mol Biol 2001;310(2):311–325.
6. Atwell S, Ultsch M, De Vos AM, Wells JA. Structural plasticity in a remodeled protein–protein interface. Science 1997;278(5340):1125–1128.
7. Berger JM, Gamblin SJ, Harrison SC, Wang JC. Structure and mechanism of DNA topoisomerase II. Nature 1996;379(6562):225–232.
8. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H,
Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res 2000;28(1):235–242.
9. Butland G, Peregrin-Alvarez JM, Li J, Yang W, Yang X, Canadien V, Starostine A, Richards D, Beattie B, Krogan N, Davey M, Parkinson J, Greenblatt J, Emili A. Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature 2005;433(7025):531–537.
10. Chothia C, Gough J, Vogel C, Teichmann SA. Evolution of the protein repertoire. Science 2003;300(5626):1701–1703.
11. Dandekar T, Snel B, Huynen M, Bork P. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci 1998;23(9):324–328.
12. Date SV, Marcotte EM. Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat Biotechnol 2003;21(9):1055–1062.
13. Deng M, Mehta S, Sun F, Chen T. Inferring domain–domain interactions from protein–protein interactions. Genome Res 2002;12(10):1540–1548.
14. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004;32(5):1792–1797.
15. Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA. Protein interaction maps for complete genomes based on gene fusion events. Nature 1999;402(6757):86–90.
16. Finn RD, Marshall M, Bateman A. iPfam: visualization of protein–protein interactions in PDB at domain and amino acid resolutions. Bioinformatics 2005;21(3):410–412.
17. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hallich V, Lassmann T, Moxon S, Marshal M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A. Pfam: clans, web tools and services. Nucleic Acids Res 2006;34(Database issue):D247–D251.
18. Gaasterland T, Ragan MA. Microbial genescapes: phyletic and functional patterns of ORF distribution among prokaryotes. Microb Comp Genomics 1998;3(4):199–217.
19. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A,
Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, Superti-Furga G. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002;415(6868):141–147.
20. Gertz J, Elfond G, Shustrova A, Weisinger M, Pellegrini M, Cokus S, Rothschild B. Inferring protein interactions from phylogenetic distance matrices. Bioinformatics 2003;19(16):2039–2045.
21. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitals E, Vijayadamodar G, Pochart P, Machineni H, Welsch M, Kong Y, Zerhusen B, Malcalm R, Varrone Z, Callis A, Minto M, Burgess S, McDaniel L, Stimpson E, Spriggs E, Williams J, Neurath K, Ioime N, Agee M, Voss E, Furtak K, Renzulli R, Aanensen N, Carrolla S, Bickelhaupt E, Lazovatsky Y, DaSilva A, Zhong J, Stanyon CA, Knight Jr J, Shimkets RA, McKenna MP, Chant J, Rothberg JM. A protein interaction map of Drosophila melanogaster. Science 2003;302(5651):1727–1736.
22. Glazko GV, Mushegian AR. Detection of evolutionarily stable fragments of cellular pathways by hierarchical clustering of phyletic patterns. Genome Biol 2004;5(5):R32.
23. Gobel U, Sander C, Schneider R, Valencia A. Correlated mutations and residue contacts in proteins. Proteins 1994;18(4):309–317.
24. Goh CS, Bogan AA, Joachimiak M, Walther D, Cohen FE. Co-evolution of proteins with their interaction partners. J Mol Biol 2000;299(2):283–293.
25. Goh CS, Cohen FE. Co-evolutionary analysis reveals insights into protein–protein interactions. J Mol Biol 2002;324(1):177–192.
26. Guimaraes K, Jothi R, Zotenko E, Przytycka TM. Predicting domain–domain interactions using a parsimony approach. Genome Biol 2006;7(11):R104.
27. Henrick K, Thornton JM. PQS: a protein quaternary structure file server. Trends Biochem Sci 1998;23(9):358–361.
28. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett
K, Boutilier K, Yang L, Walting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart J, Goudreault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova, Willims AR, Sassi H, Nielson PA, Rasmussen KJ, Andersen JR, Johansen LE, Hansen LH, Jespersen H, Podtelejnikov A, Nielsep E, Crawford J, Poulsen V, Sorensen BD, Mathhiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CW, Figeys D, Tyers M. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002;415(6868):180–183.
29. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 2001;98(8):4569–4574.
30. Jespers L, Lijnen HR, Vanwetswinkel S, Van Hoef B, Brepoels K, Collen D, De Maeyer M. Guiding a docking mode by phage display: selection of correlated mutations at the staphylokinase-plasmin interface. J Mol Biol 1999;290(2):471–479.
31. Jothi R, Cherukuri PF, Tasneem A, Przytycka TM. Co-evolutionary analysis of domains in interacting proteins reveals insights into domain–domain interactions mediating protein–protein interactions. J Mol Biol 2006;362(4):861–875.
32. Jothi R, Kann MG, Przytycka TM. Predicting protein–protein interaction by searching evolutionary tree automorphism space. Bioinformatics 2005;21(Suppl 1):i241–i250.
33. Jothi R, Przytycka TM, Aravind L. Discovering functional linkages and cellular pathways using phylogenetic profile comparisons: a comprehensive assessment. BMC Bioinformatics 2007;8:173.
34. Kann MG, Jothi R, Cherukuri PF, Przytycka TM. Predicting protein domain interactions from co-evolution of conserved regions. Proteins 2007;67(4):811–820.
35. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis P, Punna T, Peregrin-Alvaraz JM, Shales M, Zhang X, Davey M, Robinson MD, Paccanaro A, Bray JE, Sheung A, Beattie B, Richards DP, Canadien
V, Lalev A, Mena F, Wong P, Starostine A, Canete MM, Vlasblom J, Wu S, Orsi C, Collins SR, Chandran S, Haw R, Rilstone JJ, Gandi K, Thompson NJ, Musso G, St-Onge P, Ghanny S, Lam MH, Butland G, Altaf-Ul AM, Kanaya S, Shilatifard A, O’Shea E, Weissman JS, Ingles CJ, Hughes TR, Parkinson J, Gerstein M, Wodak SJ, Emili A, Greenblatt JF. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 2006;440(7084):637–643.
36. Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, Goldberg DS, Li N, Martinez M, Rual JF, Lamesch P, Xu L, Tewari M, Wong SL, Zhang LV, Berriz GF, Jacotot L, Vaglio P, Reboul J, Hirozane-Kishikawa T, Li Q, Gabel HW, Elewa A, Baumgartner B, Rose DJ, Yu H, Bosak S, Sequerra R, Fraser A, Mango SE, Saxton WM, Strome S, Van Den Heuvel S, Piano F, Vandenhaute J, Sardet C, Gerstein M, Doucette-Stamm L, Gunsalus KC, Harper JW, Cusick ME, Roth FP, Hill DE, Vidal M. A map of the interactome network of the metazoan C. elegans. Science 2004;303(5657):540–543.
37. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D. Detecting protein function and protein–protein interactions from genome sequences. Science 1999;285(5428):751–753.
38. Metropolis N, Rosenbluth AW, Teller A, Teller EJ. Simulated annealing. J Chem Phys 1955;21:1087–1092.
39. Mirkin BG, Fenner TI, Galperin MY, Koonin EV. Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol Biol 2003;3:2.
40. Moyle WR, Campbell RK, Myers RV, Bernard MP, Han Y, Wang X. Co-evolution of ligand-receptor pairs. Nature 1994;368(6468):251–255.
41. Neher E. How frequent are correlated changes in families of protein sequences?
Proc Natl Acad Sci USA 1994;91(1):98–102.
42. Ng SK, Zhang Z, Tan SH. Integrative approach for computationally inferring protein domain interactions. Bioinformatics 2003;19(8):923–929.
43. Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000;302(1):205–217.
44. Nye TM, Berzuini C, Gilks WR, Babu MM, Teichmann SA. Statistical analysis of domains in interacting protein pairs. Bioinformatics 2005;21(7):993–1001.
45. Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N. Use of contiguity on the chromosome to predict functional coupling. In Silico Biol 1999;1(2):93–108.
46. Pazos F, Helmer-Citterich M, Ausiello G, Valencia A. Correlated mutations contain information about protein–protein interaction. J Mol Biol 1997;271(4):511–523.
47. Pazos F, Ranea JA, Juan D, Sternberg MJ. Assessing protein co-evolution in the context of the tree of life assists in the prediction of the interactome. J Mol Biol 2005;352(4):1002–1015.
48. Pazos F, Valencia A. Similarity of phylogenetic trees as indicator of protein–protein interaction. Protein Eng 2001;14(9):609–614.
49. Pazos F, Valencia A. In silico two-hybrid system for the selection of physically interacting protein pairs. Proteins 2002;47(2):219–227.
50. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 1999;96(8):4285–4288.
51. Ramani AK, Marcotte EM. Exploiting the co-evolution of interacting proteins to discover interaction specificity. J Mol Biol 2003;327(1):273–284.
52. Riley R, Lee C, Sabatti C, Eisenberg D. Inferring protein domain interactions from databases of interacting proteins. Genome Biol 2005;6(10):R89.
53. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987;4(4):406–425.
54. Sato T, Yamanishi Y, Kanehisa M, Toh H. The inference of protein–protein interactions by
co-evolutionary analysis is improved by excluding the information about the phylogenetic relationships Bioinformatics 2005;21(17):3482–3489 55 Shindyalov IN, Kolchanov NA, Sander C Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Protein Eng 1994;7(3): 349–358 56 Sprinzak E, Margalit H Correlated sequence-signatures as markers of protein–protein interaction J Mol Biol 2001;311(4):681–692 57 Thompson JD, Higgins DG, Gibson TJ CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice Nucl Acid Res 1994;22(22):4673–4680 58 Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM A comprehensive analysis of protein–protein interactions in saccharomyces cerevisiae Nature 2000;403(6770):623– 627 59 Valencia A, Pazos F Computational methods for the prediction of protein interactions Curr Opin Struct Biol 2002;12(3):368–373 CuuDuongThanCong.com INDEX 2SNP computer program 383, 386, 390 5-fold cross validation 311 active direction 153 adenocarcinoma 345 adjacency 443 Affymetrix 331, 335 agreement homeomorphic subtree 151 agreement homeomorphic supertree 158 agreement isomorphic subtree 151 agreement isomorphic supertree 158 aligned fragment pair (AFP) 24 alignment graph 12 alignment sensitivity 123–125 allele 355 ambiguous genotype 374 AMMP 441 analytical solution of maximum likelihood estimate of haplotype frequency 375–377, 389 ancestor 143 ancestral sequences 356 anchor 13 approximate gene cluster discovery problem (AGCDP) 204, 206 arc contraction 148 ArrayExpress repository 344 ARRAY-RPACK problem 233 association method 480–481 atomic descriptor 446 augmented connectivity molecular formula (ACMF) 452 average conflict index (ACI) 286 Batched 
Greedy re-embedding 292 Baum-Welch algorithm 82–83 Bayesian 336–339 Bayesian inference 382, 389 Bayesian meta-analysis, see Meta-analysis Bioinformatics Algorithms: Techniques and Applications, Edited by Ion I Mˇandoiu and Alexander Zelikovsky Copyright © 2008 John Wiley & Sons, Inc 493 CuuDuongThanCong.com 494 INDEX Bayesian network 74 best-hit 469 beta distribution 341 biological network 260, 267–270 biopatterns 93, 94, 95 BLAST computer program 3–5, 13, 15, 109, 121–122, 466, 468–469 BLASTZ computer program 136 bond-electron matrix 443 bond partition 444 border conflict 282 border length 282–300 BPCAimpute 303, 307, 308 breakpoint 185 Bron-Kerbosch clique detection 454 BUGS 338 BUILD algorithm 160–161 candidate ligand identification program (CLIP) 454 CANGEN 448 canonical order 448 case/control study 401 CaTScan1 105, 106 CaTScan2 105, 106 CDKN3 347–348 cell cycle 345–346 cell signaling 46–50 centroid-based quadrisection 293, 296–297 CGS-Ftest 307, 310 CGS-Ttest 307, 310 chaining problem 225 character compatibility 39–43 character stability 42–44 chemical abstracts service registry system 452 chemical databases 439, 450 chemoinformatics 439 chessboard re-embedding 292 chIP-chip 332, 337 chordal graph 31–36, 40–44, 48–49 chromosomes 179 clade 147 classification accuracy 303 CuuDuongThanCong.com clustalW 468, 470 cluster 147, 149, 163, 169 cluster system 147, 169 CMVE 304 coalescent 382 CODEHOP computer program 102, 104 coevolution 464, 468–473, 478–479 cograph 32–37,44, 48 Cohen’s d 334 COLimpute 304, 308 column-swapping algorithm 474–477 combinatorial search 403 combining effect sizes 333 combining probabilities 339–343 common ancestor 146 common interval 187, 203, 205, 214 compatibility graph 153 compatible subtree 151 compatible supertree 157 compatible tree 145, 150 complimentary greedy search 404 composition 170 conditional maximum likelihood 84 conflict index 282, 284–300 confusion table 405 connectivity table 445 CONSENSUS computer program 96 consensus 
method 391 consensus subtree 150, 156 consensus supertree 158 consensus tree 144 consensus tree methods 145, 156 conserved gene clusters 178 conserved pathway 267–268 conserved segment 185 consistency-based pairwise alignment 261 convolution coefficient 109, 110, 111 convolution problem 109 coverage efficiency 103 cross-validation 364, 405 database search 262–263, 269–270 DEFOG computer program 102 INDEX degeneracy 101–104, 112 degenerate PCR primers 101–104, 112, 114–115, 243, 256 deposition sequence 282 descendancy graph 161 difference of means 332 diploid genotype 374 directory of natural products 454 disease-association 39 disease susceptibility prediction problem 405 distance matrix 472, 474–477, 488 divide and conquer 13 DMS computer program 100 DNA polymorphism 224 DNA sequence variation 374 domain interaction 465, 477–488 domain pair exclusion analysis 483 double dynamic programming 23 double-stranded RNA (dsRNA) 104, 105 DPDP computer program 101–103 DPPH computer program 383, 386 DPS 103 duplication 146, 162, 164, 168 duplication cost 164, 168, 171 duplications 182 dynamic programming 9–28, 290, 298 edit distance 95, 100, 108 edit distance-based motifs 100 EM (expectation-maximization) computer program 381, 384, 385, 387 empirical Bayes 338 ENT (entropy minimization) computer program 383, 387 Epitaxial placement 288 estimator, biased 334 estimator, method of moments 334 estimator, unbiased approximate 334 e-value 466, 468–469, 483–484, evidence combination 77–80 exhaustive search 402 exon chaining 16 CuuDuongThanCong.com 495 expectation-maximization algorithm 359, 374, 381, 389–390 extended connectivity 448 extension 184 facet-defining inequality 209 false discovery rate (FDR) 332, 343 fan 149 FASTA file format 13 fast Fourier transform (FFT) 93, 94, 108, 109, 110, 111 fingerprint 194 fingerprint methods 451 Fisher inverse chi-square 340 Fisher product 340 fissions 182 fixed effects modeling 333, 339 flexible structure alignment 24 Forbes coefficient 454 
force field 441 founder haplotypes 356 FOXM1 347 F-test 307, 310 functional interaction 466–467 functional module 268–269 fusions 182 GenBank 332 gene cluster 203 gene content 205, 211 gene duplication 144, 147, 163 Gene expression microarray 303 gene fusion 466–467, 487 gene loss 146, 164 gene pool 205 generalized HMM 67–72 generator 190 gene selection 310 gene silencing 93, 107 gene team 192, 205, 214 genetic polymorphism 374 gene transfer 144 gene tree 146, 147, 162–171 gene universe 205 genome tiling 230 496 INDEX genotype 356, 396 GEO repository 344 geometric tail 69–71 Gibbs sampling 338, 374, 381 global sequence alignment 11–14 global similarity 109, 110 GMCimpute 305 GO ontology 336 graph alignment 268 Gray code 287 Greedy+ algorithm 298–299 greedy re-embedding 292 GTile problem 232 Hamming distance 95, 96, 98, 99, 100, 104 HAP computer program 383, 386, 390 HaploFreq computer program 383, 386, 390 haplotype 355, 374–375, 396 haplotype frequency direct estimate 384 haplotype frequency indirect estimate 387 haplotype inference 62, 364, 373, 390 Haplotyper computer program 381, 387, 389 haplotyping problem 356 HapMap 365 Hardy-Weinberg equilibrium 377, 380–381, 389 hash function 452 heatmap 346–347 heuristic local alignment 122 hidden Markov model (HMM) 357 hierarchical model 336–337 HIT computer program 357 hitting set 154 HIV 440 HMM decoding 58–59, 63–67 HMMer computer program 478, 488 HMM topology 57 HMM training 82–85 homeomorphism 148 homogeneity, test of 333, 335 horizontal gene transfer 146, 162, 164 CuuDuongThanCong.com Hosoya polynomial 425 hybridization node 147 HYDEN computer program 102 ILLSimpute 303, 307, 308 imputation 303, 308 incidence 443 indexing 396 infinite-sites model 383 informative SNP 398 integer linear program 203–221, 243–247, 250–252, 256 integration-driven discovery rate (IDR) 343–344 intersection graph 31–41, interval graph 30–32, 34–36, 45–46, inversions 181 iPfam 479, 483, 488 IR problem 226 irreducible common interval 190 
isomorphism 148
iterative refinement 261
JAGS 338
KNN classifier 307, 310
KNNimpute 303, 305, 308
lateral gene transfer 144, 146, 164–166
(l,d)-motif 264–265
least common ancestor 143, 147, 163
Levenshtein distance 100
likelihood 376, 378–379, 383
likelihood function 337
LinCmb 304
linear interval 205
line notation 447
linkage disequilibrium 374–375
local alignment 14–15
local ratio and multi-phase techniques 227
local structure segments 18
longest increasing subsequence problem 14
LOWESS normalization 344
lowest p-value method 484–486
max-gap cluster 195, 203, 205, 214
maximal exact matches (MEMs) 13
maximal segment pair (MSP) 109
maximal unique matches (MUMs) 14
maximum agreement homeomorphic network 155, 156
maximum agreement homeomorphic subtree (MAST) 150–153
maximum agreement homeomorphic supertree (MASP) 158, 161
maximum agreement isomorphic subtree (MIT) 151
maximum agreement isomorphic supertree (MISP) 158
maximum clique 453
maximum common subgraph 453
maximum compatible subtree (MCT) 150–151
maximum compatible supertree (MCSP) 158, 161
maximum consensus subtree 150–156
maximum consensus supertree 158
maximum control-free cluster problem 403
Maximum Coverage Degenerate Primer Design Problem (MC-DPDP) 102
maximum likelihood 335, 377, 379, 380
maximum likelihood estimation 479–482
maximum likelihood principle 359
MC-DPDP 102
MCMC 338
MD-DPDP 103, 104
MDL 444
melting temperature 241, 249–250
MEME computer program 96
meta-analysis 329–349
microarray 329–332, 344–349
microarray layout 279–301
microarray layout problem (MLP) 282–300
microarray production 280–281
MINCUT algorithm 160–162
MinDPS computer program 104
minimum compatible regular supernetwork 159
Minimum Degeneracy Degenerate Primer Design with Errors Problem (MD-DPDEP) 103, 104
minimum multicolored subgraph 243, 245, 257
MIPS computer program 102, 103
MIPS ontology 336
mirror-tree 468–472, 478–479
missing rate 307
MITRA computer program 97, 99
mixed graph 165
model-fitting 406
modular decomposition 36–37
molecular graphs 441–443
molecular tree 442
Monte Carlo search 473, 475–476
Morgan algorithm 448
MORPH algorithm 475, 477
most compatible NestedSupertree problem 159, 162
most compatible supertree problem 158, 162
most parsimonious explanation method 486–487
motif discovery 94, 95, 111
motif finding 260, 264–267, 269
MTA1 349
multiple sequence alignment 15, 259–264, 269, 468–472, 478, 488
multiple spaced seeds 131–133
multiplex PCR (MP-PCR) 101, 242
multiplex PCR primer set selection 241–257
multiplicative Wiener index 421
MULTIPROFILER computer program 96
MUSCLE computer program 468
mutation cost 164, 169
mutual information content 21
neighbor-joining algorithm 470, 491
NESTEDSUPERTREE algorithm 159, 161–162
network matching 18
network motif 268
nonoverlapping local alignments 224
normalized border length (NBL) 286
NRMSE 306
off-target gene silencing 93, 107
Oncomine 330, 344–349
one-dimensional partitioning 293–294
online interval maximum problem 235
optimal primer cover problem 242–243
optimum single probe embedding (OSPE) 290–292, 298–299
ordered maximum homeomorphic subtree 156
orthologs 468–472, 478, 488
overlap 187
pair HMM 80–82
partial order alignment 19, 24
partial order graph 19
partitioning algorithm 286, 293–300
PatternHunter 129
Pearson’s correlation coefficient 470
peptoid 433
perfect phylogeny 39–40, 383
performance coefficient 96, 97
permutation 180, 203
permutation testing 339
Phase computer program 382, 385–387, 389
phasing problem 356
pheromone signaling pathway 46–50
photolithographic mask 280–281
phylogenetic footprinting 266–267, 269
phylogenetic HMM 75–77
phylogenetic network 146, 147
phylogenetic profile 466–467, 488
phylogenetic tree 38–40, 468–470, 475, 479
physical interaction 44, 468, 482, 485
pivot partitioning 293, 297–298
placement algorithm 286–290, 298–300
planted (l,d)-motif problem 95
planted motif 96, 97, 99, 100, 112, 113, 114
PL-EM (partial-ligation EM) computer program 382, 384, 385, 387, 389
PMS1 computer program 98, 99
PMS2 computer program 99
PMS computer program 97, 98, 100
PMSi computer program 100
PMSP computer program 100
PMSprune computer program 100
polymerase chain reaction (PCR) 231, 241–242, 374
pooled standard deviation 334
positional homologs 195
posterior distribution 337–339
posterior decoding 63–64
potential function 244, 247
potential-greedy primer selection algorithm 243–244, 247–257
PQ-trees 188
Primer3 computer program 242, 249, 258
primer selection 95, 101, 113, 114, 241–258
prior distribution 336–338
probe design problem 107
probe embedding 282–300
probe embedding, asynchronous 283
probe embedding, left-most 283
probe embedding, synchronous 283
profile 263–265
profile HMM 62
progressive alignment 15, 261
PROJECTION computer program 96
prostate cancer 345–348
protein alignments 136–137
protein complex 43–50
protein data bank (PDB) 479, 486, 488
protein interaction 29–30, 44–50, 465–488
protein interaction network 477–488
protein interaction specificity (PRINS) 471, 473, 475, 477
pseudoknots 20
p-value 339–343, 347
quadratic assignment problem (QAP) 286
quantitative structure-activity relationship (QSAR) 419
quorum 212
radix sort 94, 101, 105–106, 108
random effects modeling 333–336
RB (rule-based) algorithm 374, 382
RB (rule-based) computer program 382, 384, 385
rearrangements 177
recombination 62
reconciled tree 167, 170, 171
re-embedding algorithm 286, 290–292, 298–300
refinement 145, 148, 149, 157, 158
RepeatMasker computer program 231
ribosomal complex assembly 44–45
ring structure 442
RMA normalization 344
RNA interference (RNAi) 95, 104, 112, 113
RNA secondary structure 11, 19
RNA secondary structure prediction 20
rooted network 146
rooted triple 149
root mean squared distance (RMSD) 22
row-epitaxial placement 289
RPS-BLAST computer program 478
Russell-Rao coefficient 454
r-window 205, 214
scenario 166, 167, 171
Schultz index 421
segment alignment (SEA) 18
segment chaining 13, 24
separation problem 209
sequence alignment 3, 117–140, 466
sequence-structure motifs 18
sequential re-embedding 292
shared substructure 440
shortest path 10
short-interference RNA (siRNA) 93, 95, 105, 107, 112, 113, 115
shrinkage 339
Sidak’s correction 341
similarity coding 453
similarity measurement 94, 108, 111
similar subset 440
Simpson coefficient 454
single nucleotide polymorphism (SNP) 241–257, 355–372, 395–415
siRNA patterns 107
sliding-window matching algorithm 288
SMARTS 447
SMD repository 344
SMILES 447
SMIRKS 448
soft conflict 150
SOS algorithm 108
spaced seeds 126–133
speciation 163, 166
species tree 145, 146, 147, 162–171
specific selection problem 107, 108
spliced alignment 16
SP-STAR algorithm 96
standardized mean difference 334
Stereochemically Extended Morgan algorithm (SEMA) 449
Stouffer’s sum of Z’s 340
strong common interval 188
structural flexibility 24
structure-based sequence alignment 23
structure search 451
subdivision operation 165
suboptimal structures 21
suffix array 94
suffix tree 94, 100, 101, 105, 107, 115
sum of logits 341
sum of logs 340
supernetwork 144, 156, 157, 159, 160, 161
switch distance 365
Szeged index 421
tagging 396
Tanimoto coefficient 453
T-Coffee computer program 468
threading 287, 288
three-dimensional molecular representation 441
thresholded Fisher product 341–342
Tippett’s Minimum p 341
TOP2A 348
topological index 419
topological restriction 148
topoisomerase II 348
total agreement supertree problem 157
TRANSFAC database 336
transfer arc 165
translocations 181
transmembrane proteins 61
traveling salesman problem (TSP) 287
tree isomorphism 475–477
treelike cluster system 147, 159, 169
tree topology 475, 477
triple repeat identification problem (TRIP) 106
t-statistic 347
two-dimensional molecular representation 441
two-dimensional partitioning 293, 295–296
Ullmann algorithm 452
uniformly ordered maximum homeomorphic subtree 156
union graph 165
unphased genotype 356
variance 339
vector seeds 134–135
vertex contraction 148
Viterbi algorithm 58–59
Watson-Crick complement 241, 244
weighted sum of Z’s 341
Wiener index 420
WinBUGS computer program 338
Winer’s sum of t’s 341
WINNOWER algorithm 96
Wiswesser line notation (WLN) 447
wrapping interval 211

Wiley Series on Bioinformatics: Computational Techniques and Engineering

Bioinformatics and computational biology involve the comprehensive application of mathematics, statistics, science, and computer science to the understanding of living systems. Research and development in these areas require cooperation among specialists from the fields of biology, computer science, mathematics, statistics, physics, and related sciences. The objective of this book series is to provide timely treatments of the different aspects of bioinformatics spanning theory, new and established techniques, technologies and tools, and application domains. This series emphasizes algorithmic, mathematical, statistical, and computational methods that are central in bioinformatics and computational biology.

Series Editors: Professor Yi Pan and Professor Albert Y. Zomaya
pan@cs.gsu.edu zomaya@it.usyd.edu.au

Knowledge Discovery in Bioinformatics: Techniques, Methods, and Applications
Xiaohua Hu and Yi Pan

Grid Computing for Bioinformatics and Computational Biology
Edited by El-Ghazali Talbi and Albert Y. Zomaya

Bioinformatics Algorithms: Techniques and Applications
Ion Măndoiu and Alexander Zelikovsky

Analysis of Biological Networks
Edited by Björn H. Junker and Falk Schreiber
… bioinformatics and it has been broadly applied since then [61], dynamic programming has become an …
Bioinformatics Algorithms: Techniques and Applications, Edited by Ion I. Măndoiu and Alexander Zelikovsky.
Reprinted from Bioinformatics 20:2159–2161 (2004) with the permission of Oxford University Press.

Format
Pages	498
File size	7.38 MB