Bioinformatics and Functional Genomics
Second Edition

Jonathan Pevsner
Department of Neurology, Kennedy Krieger Institute and Department of Neuroscience and Division of Health Sciences Informatics, The Johns Hopkins School of Medicine, Baltimore, Maryland

Copyright © 2009 by John Wiley & Sons, Inc. All rights reserved.

Wiley-Blackwell is an imprint of John Wiley & Sons, formed by the merger of Wiley’s global Scientific, Technical, and Medical business with Blackwell Publishing.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750–8400, fax (978) 750–4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748–6011, fax (201) 748–6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762–2974, outside the United States at (317) 572–3993 or fax (317) 572–4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic format. For more information about Wiley products, visit our web site at www.wiley.com.

Cover illustration includes detail from Leonardo da Vinci (1452–1519), dated c. 1506–1507, courtesy of the Schlossmuseum (Weimar).

ISBN: 978-0-470-08585-1

Library of Congress Cataloging-in-Publication Data is available.

Printed in the United States of America.

For Barbara, Ava and Lillian with all my love

Contents in Brief

PART I ANALYZING DNA, RNA, AND PROTEIN SEQUENCES IN DATABASES
1 Introduction
2 Access to Sequence Data and Literature Information, 13
3 Pairwise Sequence Alignment, 47
4 Basic Local Alignment Search Tool (BLAST), 101
5 Advanced Database Searching, 141
6 Multiple Sequence Alignment, 179
7 Molecular Phylogeny and Evolution, 215

PART II GENOMEWIDE ANALYSIS OF RNA AND PROTEIN
8 Bioinformatic Approaches to Ribonucleic Acid (RNA), 279
9 Gene Expression: Microarray Data Analysis, 331
10 Protein Analysis and Proteomics, 379
11 Protein Structure, 421
12 Functional Genomics, 461

PART III GENOME ANALYSIS
13 Completed Genomes, 517
14 Completed Genomes: Viruses, 567
15 Completed Genomes: Bacteria and Archaea, 597
16 The Eukaryotic Chromosome, 639
17 Eukaryotic Genomes: Fungi, 697
18 Eukaryotic Genomes: From Parasites to Primates, 729
19 Human Genome, 791
20 Human Disease, 839

Glossary, 891
Answers to Self-Test Quizzes, 909
Author Index, 911
Subject Index, 913

Contents

Preface to the Second Edition, xxi
Preface to the First Edition, xxiii
Foreword, xxvii

PART I ANALYZING DNA, RNA, AND PROTEIN SEQUENCES IN DATABASES

1 Introduction
  Organization of The Book
  Bioinformatics: The Big Picture
  A Consistent Example: Hemoglobin
  Organization of The Chapters
  A Textbook for Courses on Bioinformatics and Genomics
  Key Bioinformatics Websites, 10
  Suggested Reading, 11
  References, 11

2 Access to Sequence Data and Literature Information, 13
  Introduction to Biological Databases, 13
  GenBank: Database of Most Known Nucleotide and Protein Sequences, 14
  Amount of Sequence Data, 15
  Organisms in GenBank, 16
  Types of Data in GenBank, 18
  Genomic DNA Databases, 19
  cDNA Databases Corresponding to Expressed Genes, 19
  Expressed Sequence Tags (ESTs), 19
  ESTs and UniGene, 20
  Sequence-Tagged Sites (STSs), 22
  Genome Survey Sequences (GSSs), 22
  High Throughput Genomic Sequence (HTGS), 23
  Protein Databases, 23
  National Center for Biotechnology Information, 23
  Introduction to NCBI: Home Page, 23
  PubMed, 23
  Entrez, 24
  BLAST, 25
  OMIM, 25
  Books, 25
  Taxonomy, 25
  Structure, 25
  The European Bioinformatics Institute (EBI), 25
  Access to Information: Accession Numbers to Label and Identify Sequences, 26
  The Reference Sequence (RefSeq) Project, 27
  The Consensus Coding Sequence (CCDS) Project, 29
  Access to Information via Entrez Gene at NCBI, 29
  Relationship of Entrez Gene, Entrez Nucleotide, and Entrez Protein, 32
  Comparison of Entrez Gene and UniGene, 32
  Entrez Gene and HomoloGene, 33
  Access to Information: Protein Databases, 33
  UniProt, 33
  The Sequence Retrieval System at ExPASy, 34
  Access to Information: The Three Main Genome Browsers, 35
  The Map Viewer at NCBI, 35
  The University of California, Santa Cruz (UCSC) Genome Browser, 35
  The Ensembl Genome Browser, 35
  Examples of How to Access Sequence Data, 36
  HIV pol, 36
  Histones, 38
  Access to Biomedical Literature, 38
  PubMed Central and Movement toward Free Journal Access, 39
  Example of PubMed Search: RBP, 40
  Perspective, 42
  Pitfalls, 42
  Web Resources, 42
  Discussion Questions, 42
  Problems, 42
  Self-Test Quiz, 43
  Suggested Reading, 44
  References, 44

3 Pairwise Sequence Alignment, 47
  Introduction, 47
  Protein Alignment: Often More Informative Than DNA Alignment, 47
  Definitions: Homology, Similarity, Identity, 48
  Gaps, 55
  Pairwise Alignment, Homology, and Evolution of Life, 55
  Scoring Matrices, 57
  Dayhoff Model: Accepted Point Mutations, 58
  PAM1 Matrix, 63
  PAM250 and Other PAM Matrices, 65
  From a Mutation Probability Matrix to a Log-Odds Scoring Matrix, 69
  Practical Usefulness of PAM Matrices in Pairwise Alignment, 70
  Important Alternative to PAM: BLOSUM Scoring Matrices, 70
  Pairwise Alignment and Limits of Detection: The “Twilight Zone”, 74
  Alignment Algorithms: Global and Local, 75
  Global Sequence Alignment: Algorithm of Needleman and Wunsch, 76
  Step 1: Setting Up a Matrix, 76
  Step 2: Scoring the Matrix, 77
  Step 3: Identifying the Optimal Alignment, 79
  Local Sequence Alignment: Smith and Waterman Algorithm, 82
  Rapid, Heuristic Versions of Smith–Waterman: FASTA and BLAST, 84
  Pairwise Alignment with Dot Plots, 85
  The Statistical Significance of Pairwise Alignments, 86
  Statistical Significance of Global Alignments, 87
  Statistical Significance of Local Alignments, 89
  Percent Identity and Relative Entropy, 90
  Perspective, 91
  Pitfalls, 94
  Web Resources, 94
  Discussion Questions, 94
  Problems/Computer Lab, 95
  Self-Test Quiz, 95
  Suggested Reading, 96
  References, 97

4 Basic Local Alignment Search Tool (BLAST), 101
  Introduction, 101
  BLAST Search Steps, 103
  Step 1: Specifying Sequence of Interest, 103
  Step 2: Selecting BLAST Program, 104
  Step 3: Selecting a Database, 106
  Step 4a: Selecting Optional Search Parameters, 106
  Query, 107
  Limit by Entrez Query, 107
  Short Queries, 107
  Expect Threshold, 107
  Word Size, 108
  Matrix, 110
  Gap Penalties, 110
  Composition-Based Statistics, 110
  Filtering and Masking, 111
  Step 4b: Selecting Formatting Parameters, 112
  BLAST Algorithm Uses Local Alignment Search Strategy, 115
  BLAST Algorithm Parts: List, Scan, Extend, 115
  BLAST Algorithm: Local Alignment Search Statistics and E Value, 118
  Making Sense of Raw Scores with Bit Scores, 121
  BLAST Algorithm: Relation between E and p Values, 121
  Parameters of a BLAST Search, 123
  BLAST Search Strategies, 123
  General Concepts, 123
  Principles of BLAST Searching, 123
  How to Evaluate Significance of Your Results, 123
  How to Handle Too Many Results, 128
  How to Handle Too Few Results, 128
  BLAST Searching With Multidomain Protein: HIV-1 pol, 129
  Perspective, 134
  Pitfalls, 134
  Web Resources, 135
  Discussion Questions, 135
  Computer Lab/Problems, 135
  Self-Test Quiz, 136
  Suggested Reading, 137
  References, 137

5 Advanced Database Searching, 141
  Introduction, 141
  Specialized BLAST Sites, 142
  Organism-Specific BLAST Sites, 142
  Ensembl BLAST, 142
  Wellcome Trust Sanger Institute, 143
  Specialized BLAST-Related Algorithms, 143
  WU BLAST 2.0, 144
  European Bioinformatics Institute (EBI), 144
  Specialized NCBI BLAST Sites, 144
  Finding Distantly Related Proteins: Position-Specific Iterated BLAST (PSI-BLAST), 145
  Assessing Performance of PSI-BLAST, 150
  PSI-BLAST Errors: The Problem of Corruption, 152
  Reverse Position-Specific BLAST, 152
  Pattern-Hit Initiated BLAST (PHI-BLAST), 153
  Profile Searches: Hidden Markov Models, 155
  BLAST-Like Alignment Tools to Search Genomic DNA Rapidly, 161
  Benchmarking to Assess Genomic Alignment Performance, 162
  PatternHunter, 162
  BLASTZ, 163
  MegaBLAST and Discontiguous MegaBLAST, 164
  BLAT, 166
  LAGAN, 168
  SSAHA, 168
  SIM4, 169
  Using BLAST for Gene Discovery, 169
  Perspective, 173
  Pitfalls, 173
  Web Resources, 174
  Discussion Questions, 174
  Problems/Computer Lab, 174
  Self-Test Quiz, 175
  Suggested Reading, 176
  References, 176

6 Multiple Sequence Alignment, 179
  Introduction, 179
  Definition of Multiple Sequence Alignment, 180
  Typical Uses and Practical Strategies of Multiple Sequence Alignment, 181
  Benchmarking: Assessment of Multiple Sequence Alignment Algorithms, 182
  Five Main Approaches to Multiple Sequence Alignment, 184
  Exact Approaches to Multiple Sequence Alignment, 184
  Progressive Sequence Alignment, 185
  Iterative Approaches, 190
  Consistency-Based Approaches, 192
  Structure-Based Methods, 194
  Conclusions from Benchmarking Studies, 196
  Databases of Multiple Sequence Alignments, 197
  Pfam: Protein Family Database of Profile HMMs, 197
  Smart, 199
  Conserved Domain Database, 199
  Prints, 201
  Integrated Multiple Sequence Alignment Resources: InterPro and iProClass, 201
  PopSet, 202
  Multiple Sequence Alignment Database Curation: Manual versus Automated, 202
  Multiple Sequence Alignments of Genomic Regions, 203
  Perspective, 206
  Pitfalls, 207
  Web Resources, 207
  Discussion Questions, 207
  Problems/Computer Lab, 208
  Self-Test Quiz, 208
  Suggested Reading, 209
  References, 210

7 Molecular Phylogeny and Evolution, 215
  Introduction to Molecular Evolution, 215
  Goals of Molecular Phylogeny, 216
  Historical Background, 217
  Molecular Clock Hypothesis, 221
  Positive and Negative Selection, 227
  Neutral Theory of Molecular Evolution, 230
  Molecular Phylogeny: Properties of Trees, 231
  Tree Roots, 233
  Enumerating Trees and Selecting Search Strategies, 234
  Type of Trees, 238
  Species Trees versus Gene/Protein Trees, 238
  DNA, RNA, or Protein-Based Trees, 240
  Five Stages of Phylogenetic Analysis, 243
  Stage 1: Sequence Acquisition, 243
  Stage 2: Multiple Sequence Alignment, 244
  Stage 3: Models of DNA and Amino Acid Substitution, 246
  Stage 4: Tree-Building Methods, 254
  Phylogenetic Methods, 255
  Distance, 255
  The UPGMA Distance-Based Method, 256
  Making Trees by Distance-Based Methods: Neighbor Joining, 259
  Phylogenetic Inference: Maximum Parsimony, 260
  Model-Based Phylogenetic Inference: Maximum Likelihood, 262
  Tree Inference: Bayesian Methods, 264
  Stage 5: Evaluating Trees, 266
  Perspective, 268
  Pitfalls, 268
  Web Resources, 269
  Discussion Questions, 269
  Problems/Computer Lab, 269
  Self-Test Quiz, 271
  Suggested Reading, 272
  References, 272

PART II GENOMEWIDE ANALYSIS OF RNA AND PROTEIN

8 Bioinformatic Approaches to Ribonucleic Acid (RNA), 279
  Introduction to RNA, 279
  Noncoding RNA, 282
  Noncoding RNAs in the Rfam Database, 283
  Transfer RNA, 283
  Ribosomal RNA, 288
  Small Nuclear RNA, 291
  Small Nucleolar RNA, 292
  MicroRNA, 293
  Short Interfering RNA, 294
  Noncoding RNAs in the UCSC Genome and Table Browser, 294
  Introduction to Messenger RNA, 296
  mRNA: Subject of Gene Expression Studies, 300
  Analysis of Gene Expression in cDNA Libraries, 302
  Pitfalls in Interpreting Expression Data from cDNA Libraries, 308

SUBJECT INDEX

  BLAST Sequences, 51–52, 85–86, 94
  books, 25
  chicken genome resources, 771
  Clusters of Orthologous Groups of Proteins (COG) database, 622–625, 704
  Comprehensive Microbial Resource (CMR), 525, 599, 615–616, 620
  Contig Assembly, 804
  dbEST, 386
  discontiguous MegaBLAST, 166–167
  Ensembl, see Ensembl
  Entrez, see Entrez
  Fisher’s exact test, 307
  Gene Expression Omnibus (GEO), 320, 334
  Genes and Disease, 883
  genome browser, 806–808
  Genomes page, 525, 543, 571–572, 581, 603, 649, 767
  GLIMMER, see GLIMMER
  help section, 120
  HMMer searches, 160
  home page, 23, 37, 289, 586
  HomoloGene, 184, 243
  human genome, 791, 794, 825
  Links page, 739
  malaria genetics, 740
  Map Viewer, 35, 142, 646–647, 704, 794–795, 860
  metagenomics projects, 543
  Netblast, 117
  OMIM, 25, 27, 30, 32, 34, 38, 403, 454, 464, 507, 658
  Open Mass Spectrometry Search Algorithm (OMSSA) software, 387
  PDBeast, 25
  PubMed, 23–24, 30, 39
  RefSeq, see RefSeq
  RefSNPs, 793
  SAGE database, 309–312
  Short Read Archive, 552
  single nucleotide polymorphisms (SNPs), 685
  SKY/M-FISH & CGH database, 868
  structure, 25, 438
  Taxonomy Browser, 25, 37, 65, 107, 129, 289, 543, 585, 761
  Tax Plot, 626–628, 704, 718
  Trace Archive, 16, 145, 552–553, 651
  UniGene, see UniGene
  VCOG page, 580
  website, 24, 65
  WGS sequences, 15, 172–173, 651
  yeast genome browser, 707, 715
National Center for Health Statistics, 844
National Human Genome Research Institute (NHGRI):
  budget, 541
  eukaryotic genomes, 647, 742, 773, 775, 779
  finishing process, 552
  Functional Analysis Program,
    465
  Genome Technology Program, 540
  human diseases, 883
  Human Genome Project, 800, 825–826
  plant genomes, 757
  projects, overview of, 280, 465, 539
National Institute for Health Intramural Sequencing Center (NISC), 673
National Institute of Genetics in Mishima, 14
National Institutes of Health (NIH):
  Cancer Genome Anatomy Project (CGAP), 404
  Center for Information Technology, 246
  functional genomics, 465
  Image software, 319
  Knockout Mouse Project (KOMP), 472, 478–479
  MeSH terms, 845–846
  National Cancer Institute (NCI), 869
  National Human Genome Research Institute, see National Human Genome Research Institute (NHGRI)
  National Institute of Allergy and Infectious Disease (NIAID), 569
  Office of Rare Diseases, 845
  projects, 13, 41, 330
  proposals/white papers, 539
National Library of Medicine (NLM):
  functions of, 15, 23, 39–41, 849
  Genotype database, 868
  Phenotype database, 868
National Microbial Pathogen Data Resource (NMPDR), 611
National Organization for Rare Disorders (NORD), 863, 883
National Research Council, 800
Native proteins, 499
Natural killer cell receptors, 680
Natural selection, 58, 63, 230–231, 842
Nature, 15–16
NCKU Bioinformatics Center, 654
NDB, 447
Neanderthal genome, 537, 543, 547
Needleman–Wunsch:
  algorithm, 76–79, 81–82, 84, 93, 101, 159, 169, 195
  dynamic programming, 194
  method, 237
Needleman–Wunsch–Sellers algorithm, 76
Negative selection, 224, 227–230, 478
Neighbor-joining (NJ) algorithm:
  eukaryotic genomes, 764
  phylogenetic analysis, 252–253
  phylogenetic trees, 255–256, 258–260
  sequence analysis, 51–52, 172
Neighbor-joining trees, 255–256, 258–260, 749, 779
Neisseria meningitidis, 534–535, 600–601, 608, 611, 625
Nematoda, 21
Nematodes, 518, 661, 678, 749, 758–761, 767, 792
NEN:
  Life Sciences, 315
  Perkin–Elmer, 319
neo genes, 477
Neomorphs, 492
Neoplasms, 844–845
Nephritis, 844
Nephrosis, 844
Nephrotic syndrome, 844
Nerve growth factor, 59, 223, 397
Nervous system diseases, 845
Netblast, 115, 117
NetGlycate, 413
N-ethyl-N-nitrosourea (ENU), 491, 876
NetOGlyc, 413
NetPhos 2.0 Prediction Server, 399, 401, 413
Neural cell adhesion molecule L1 (L1CAM), 878
Neural cells, 397
Neural networks, 368, 401, 413, 429
Neuraminidase (NA), 575
Neurodegenerative disorders, 506–507
Neurofibromatosis, 847, 854
Neuroglobins, multiple sequence alignment, 188, 196
Neurological diseases, 872, 875
Neuromuscular disease, 872
Neurophysin 2, 223
Neuropsychiatric diseases, 130, 872
Neurospora spp.:
  characteristics of, 279
  crassa, 715, 719–720
Neurotransmitters, 400, 504, 768
Neurotrophic metallothionein, 311
Newick format, 187
Next-generation sequencing technology, 15–16
N50 sequence, 755, 850
Nicotiana tabacum, 17, 23, 528, 530
Night blindness, 872
NNPREDICT, 429
Nodes, phylogenetic trees, 231–233, 238–239, 269
Noise, hypothesis testing, 348
Non-Hodgkin’s lymphoma, 877
Nonallelic homologous recombination (NAHR), 681
Noncoding DNA, 528, 650–652
Noncoding genes, in human genome, 813
Noncoding regions, 535
Noncoding RNA, 251, 282–283, 287, 292, 294–295, 320, 322–323, 663, 688, 763, 781
Noncoding sequences, plant genomes, 757
Noncommunicable disease, 844
Nonconserved coding, 673–674
Nondisjunction, 676
Nonhierarchical clustering. See k-means clustering
Nonparametric bootstrapping, 266
Nonparametric tests, 344, 349
Nonredundant (nr) database, 103, 106–107, 127, 171–172
Nonsense mutations, 301
Nonvertebrates, disease in, 870, 874–875. See also Invertebrates
Nop56, 293
Normal distribution, 88, 100, 118–119, 344, 349
Normalization, microarray data analysis, 343–344, 355
North American Conditional Mouse Mutagenesis Project (NorCOMM), 472, 479
Northern blot, 293, 306, 310
Nostoc sp., 600
NPRES, 441
N-sec1. See Syntaxin binding protein (stxb1)
Nuclear cap-binding complexes, 500
Nuclear dimorphism, 742
Nuclear genome, 528
Nuclear magnetic resonance spectroscopy (NMR), 430–431, 433, 435, 448
Nuclear proteins, 413
Nuclear spliceosomal RNAs, 292
Nucleases, 542
Nucleic acid(s):
  metagenomics, 573
  microarray analysis, 313, 331
  molecular evolution, 221–222
  molecular structure, 280
  phylogenetic analysis, 219, 244
  protein structure, 435
  sequence analysis, 102, 544
  viral genome, 568, 570, 577, 591
Nucleins, 544
Nucleomorph genomes, 535–536, 745–746
Nucleotidase, 397
Nucleotide(s):
  eukaryotic DNA, 674
  eukaryotic genomes, 737
  human genome, 805, 811
  molecular evolution, 225–226, 229
  multiple sequence alignment of, 205
  phylogenetic analysis, 252, 263, 265
  polymorphisms, 309
  in prokaryotic genomes, 615–620
  pyrosequencing, 545
  ribonucleic acid composition, 281
  sequences:
    accession numbers, 26–27
    advanced database searches, 144, 162–163
    analysis, 48, 109, 114, 600
    BLASTZ applications, 166
    database, 105
    GenBank database, 14–15
    phylogenetic analysis, 250–251
    phylogenetic trees, 242
  transcription, 297
Nucleotide Sequence Database Library, 33
NUCmer, 629
Nuisance variables, 350
Null hypothesis, 88, 122, 227, 347–348, 350–351, 683
Null mutants, 711
Obesity, 851, 876
OCA, 447
Odds ratio, 59, 62, 69
Odontella sinensis, 745
Odorant-binding proteins (OBP), 49, 90, 125, 150–152, 196, 409, 501, 777
Oikopleura dioica, 767–768
Olfactory receptors, 680
Oligo(dT), 310
Oligonucleotides:
  DNA sequencing technologies, 544, 552
  eukaryotic chromosomes, 687
  messenger RNA (mRNA), 297, 299
  microarray analysis, 312–314, 317, 333, 336–337
  resequencing, 542
  robust multiarray analysis, 345
Oikopleura dioica, 766
Oncogenes, 128
“Once a gap, always a gap” rule, 189
OncoLink, 884
One-sample t-test, 353
1000 Genome Project, 15–16
One-way ANOVA, 353
Online Mendelian Inheritance in Man (OMIM):
  contents of, 25, 27, 30, 32, 34
  eukaryotic chromosomes, 658
  eukaryotic genome, 796
  functional genomics, 464, 507
  human disease, 870, 877, 882–883
  human genome, 841–842, 859–863
  protein analysis, 403, 454
Online Mendelian Inheritance in Animals (OMIA), 863
O-notation, 84
On the Origin of the Species by Means of Natural Selection (Darwin), 215
Ontogenesis, 216
Open Mass Spectrometry Search Algorithm (OMSSA) software, 387
Open reading frames (ORFs), 292, 474, 480–482, 487–489, 494, 528, 556–558, 569, 581–582, 617, 619–620, 630, 662, 701, 703–704, 706–707, 709, 714, 719
Open REGulatory ANNOtation database (OregAnno), 669–670
Operational taxonomic unit (OTU):
  hierarchical clustering, 358–359
  phylogenetic trees, 231–235, 238, 258–260
OperonDB, 620
Operons, 620, 737
Ophthalmological disease, 872
Opossum genome, 772–773
Optimal alignment, 80–81
Order and orientation (ONO), 549
ORF Finder, 558
Organellar genomes, 5, 527–528
Origin of Species, The (Darwin), 50
Ornithorhynchus anatinus, 203, 554
Orphan sequences, 196
Orthologs:
  bacterial and archaeal genomes, 622
  eukaryotic genomes, 707, 747, 763, 765, 775–776, 778, 780
  functional genomics, 508
  human disease, 870, 874–875
  lateral gene transfer, 620
  molecular evolution, 217, 251, 267
  multiple sequence alignments, 189, 204
  protein analysis, 410
  sequence analysis, 7, 49–51, 102, 108, 130
Orthology, 678
Orthomyxoviridae, influenza viruses, 575
Oryctolagus cuniculus, 57
Oryza spp.:
  rufipogon, 754
  sativa, 17, 20, 57, 403, 530, 556, 652, 675, 751–755
Oryzias latipes, 471, 768, 770
Osteoporosis, 851, 876
Ostreococcus spp.:
  characteristics of, 750
  tauri, 750, 756
Oswaldo Cruz Institute, 736
Ovalbumin, 378
Overlapping genes, 486, 549
Ovis aries, 51, 57
OxBench, 184
Oxidoreductases, 409
Oxytocin, 221
Oxytricha spp.:
  nova, 661
  trifallax, 745
Pair hidden Markov model (Pair-HMM), 158
Paircoil, 412
Paired box (PAX6), 878
Paired t-test, 353
Pairwise alignment(s), see Pairwise sequence alignment
  bacterial and archaeal genomes, 626
  BLAST searches, 109
  genomic DNA sequences, 168
  human disease, 864
  mRNA sequences, 301
  multiple, see Multiple pairwise alignment
  profile HMMs, 159
  progressive, 188
  protein structure, 442, 445
  RNA expression, 309
  sequence, see Pairwise sequence alignment
Pairwise sequence alignment:
  advanced database searches, 141, 143, 145, 149, 162
  algorithms, 55, 75–85, 92–94
  BLAST search:
    advanced, 149–150
    characteristics of, 93, 113–114
  BLASTZ applications, 164
  characteristics of, 46, 92
  detection limits, 74–75
  with dot plots, 86–87
  errors in, 92
  evolution and, 55–57
  gaps, 50, 55, 58, 91–92
  historical perspectives, 46
  homology, 47, 50, 53, 55–56
  molecular evolution, 217
  Needleman–Wunsch method, 237
  protein alignment, 47–54
  protein structure, 422
  purpose of, 53
  scoring matrices, 57–75, 92
  significance of, 92
  similarity, 51
  software packages, 55, 92, 95
  statistical significance of, 87–92
  types of, 51
  websites, as information resources, 95
Paleontology, 221
Palmitoylation, 397
PAM matrices:
  advanced searches, 147–148, 151, 156
  alternatives to, see BLOSUM matrices
  BLOSUM matrices, relationship with, 73
  conversion to log-odds/relatedness-odds matrix, 69
  derivation of, 67
  function of, 69, 92, 129–130, 147–148, 156, 194, 879
  globin phylogeny, 218, 222
  historical perspectives, 524
  intervals, 65
  pairwise alignment applications, 70
  PAM40, 73, 110
  PAM1, 58–59, 63, 65–69, 72, 91
  PAM1 to PAM20, 72
  PAM100, 67
  PAM70, 50, 69, 110
  PAM10, 70–71, 92, 146
  PAM30, 50, 69, 73, 110
  PAM250, 50, 67–70, 73, 75, 81, 91–92, 110, 146, 422, 879
  PAM2000, 68
  protein sequence analysis, 105
  relative entropy, 91
  Select PAM250, 50
Pan spp.:
  paniscus, 227, 779
  troglodytes, 17, 22, 51, 57, 70, 130, 229, 289, 530, 554, 778
Pancreatitis, 877
Pancrustacea, 761
Pandemics, 575, 577
Pangaea, 523–524
PANTHER, 201, 390
Parabasala, 733
Parallel substitutions, 241
Paralogs, 8, 49–50, 52, 102, 238, 658, 712, 737, 819
ParameciumDB genome browser, 743
Paramecium spp.:
  characteristics of, 661, 752
  genome, 462
  tetraurelia, 675, 742
Parametric bootstrapping, 266
Parametric tests, 344, 349
Parascaris univalens, 678
Parasites/parasitic disease, 610, 845
Parathyrin, 223
Parkinson disease, 844, 861, 877
Parsimonious trees, 237
Parsimony analysis, 217, 261. See also Maximum parsimony
Parsing, 167, 701
Partek®:
  applications, 334, 363, 367
  Genomics Suite, 334
  Partek Pro, 349
Partitioning, 354
Parvalbumin, 223
Parvovirus, 570
PASC, 572
“Passenger” mutations, 869
Pasteurella multocida, 535–536, 608
Patau syndrome, 852–853
PathGuide, 504
Pathogens, unicellular, 735–736
Pathology, defined, 846
Pattern-hit initiated BLAST. See PHI-BLAST (pattern-hit initiated BLAST)
PatternHunter, 162–164
Patterns, protein, 390
PAUP (Phylogenetic Analysis Using Parsimony) software:
  applications, generally, 60, 255
  phylogenetic tree analysis, 172, 235, 237, 239, 244, 248, 252, 254–256, 262, 264
PB1/PB2 gene, 576–577
PBIL (Pôle Bio-Informatique Lyonnais), 412, 429–430
Pcoord (Principal Coordinate Analysis), 587
PDB90, 446
PDBD40-J database, 142
PDBSum database, 447
Pdha2 gene, 653
p-distance, 247–249
PDZBase, 502
PDZ domain, 397
Pearson correlation coefficient r, 335, 357, 361
Penetrance, complex disorders, 852
PEO (progressive external ophthalmoplegia), 858
Pepsin, 427
PeptideMass, 412
Peptides:
  chloroplast transit (cTP), 414
  mass fingerprinting, 387
  phylogenetic analysis, 219–221
  protein analysis, 407
  protein complexes, 500
  in protein identification, 382, 385–386
  protein structure, 426
Peptoglycans, 227
Percent identity:
  implications of, 74, 81, 615
  pairwise alignment, 90–92
  threshold, 165
Percent similarity, 53
Perlegen, 777
Permutation test, 350, 353
Peromyscus maniculatus, 202
Peroxidases, 680
Peroxisomes, 406, 409, 734
Pertussis, 611
Pfam:
  characteristics of, 14, 34, 39, 56, 129, 174, 179, 197–199, 201, 206
  JalView tool, 217
  measles virus, 590
  microarray data analysis, 368, 390
  multidomain proteins, 394
  phylogenetic analysis, 244
  phylogenetic tree construction, 234
  protein analysis, 390
  protein networks, 507
  protein patterns, 396
Pfam-A sequence family, 451
Pfam5000, 432
Phage lambda, 701
Phanerochaete chrysosporium, 715, 717, 720–721
Pharmacological interventions, 731, 737, 761
PharmGKB, 863
PHD program, 429
PhenCode project, 865–866
Phenotypes, in functional genomics, 463–465, 486
Phenylalanine:
  characteristics of, 54, 61, 188, 428, 453, 866
  hydroxylase (PAH), 878
Phenylketonuria (PKU), 847–848, 866
Phenylthiohydantoin (PTH), 382
phi (φ), 426–427
PHI-BLAST (pattern-hit initiated BLAST), 111–112, 145, 153–156, 174
Phobius, 414
Phosphates, 281
Phospholipase A2, 223
Phosphopeptides, 401
Phosphorimaging, 318, 337
Phosphorylation, 391, 397–399, 413, 825
Phosphotransferase system, 398
Photolithography, 317
Photosynthesis, 739, 746–749
PHRAP software, 551, 805
PHYLIP (PHYLogeny Inference Package), 172, 248, 254–255
Phylogenesis, 216
Phylogenetic analysis:
  historical perspectives, 214, 217–218
  lateral gene transfer, 620–621
  plants, 749
  protein alignment, 60
  stages of:
    multiple sequence alignment, 244–246
    sequence acquisition, 243–244
    substitution models, DNA and amino acids, 246–254
  systematics and, 520
  tree-building methods, 254–255
  web resources, 269
Phylogenetic fingerprinting, 553–554
Phylogenetic inference:
  implications of, 184
  maximum likelihood, 262–264
  maximum parsimony, 260–262
Phylogenetic reconstruction, 678
Phylogenetic shadowing, 553–554
Phylogenetic trees:
  building/construction methods, 229, 254–255
  distance-based, 254–263, 268
  eukaryotic, 729–730
  evaluation methods, 266–268
  gene families, 678
  HIV, 573
  inference, 264–266
  mammalian genomes, 772
  model-based phylogenetic inference, 262–264
  pitfalls of, 268–269
  properties of, 51, 55, 59–60, 172, 198, 231–238, 268
  true tree, 216
  types of, 216, 238–243
  viral genome, 585
  visual impact of, 268
Phylogeny:
  algorithms, 182
  defined, 7, 216
  longitudinal studies, 230
Phylogram, 233
Phylum, 216
PHYRE, 450
Physcomitrella patens, 751, 756
Physeter catodon, 51
Phytophthora spp.:
  characteristics of, 746
  ramorum, 747
  sojae, 747
Phytophthora Functional Genomics Database, 747
PI/Mw server, 399
Pichia angusta, 716
PicTar, 293, 295
PIK3CA, 869
PileUp, 202
PipMaker, 674
PIRSF, 201, 390
Pixels, microarray analysis, 318
PKA, 397
Plant genomes/genomics, 530, 534–535, 748–756
Plant globins, 7, 188
Plant kingdom, 697
PlantProm, 670
Plants, invading viruses, 572
Plasmids, 802
PlasmoDB, 739, 754
Plasmodium spp.:
  berghei, 739–740
  chabaudi, 739–740
  characteristics of, 732
  falciparum, 110, 289, 530, 536, 552, 556, 661, 666, 714, 738–742, 754, 757, 764, 784, 848
  malariae, 738, 764
  ovale, 738, 764
  vivax, 738, 740, 764
  yoelii yoelii, 536, 661, 739–740
Plastid genomes, 739, 745–746
Plastocyanin, 223
Platyhelminthes, 21
Plexin, 817
Plink, 866
Ploidy, significance of, 675
Pneumocystis carinii, 716
Pneumonia, 626, 844
Poaceae, 753
Point:
  accepted, 58–63, 217
  mutations, 840
  substitutions, 162
Poisoning, 845
Poisson:
  correction, 52, 247–249, 252–253, 258
  distribution, 247
POLE, 424
Poliomyelitis, 571
Poliovirus, 572
pol protein, 36, 102, 123, 129–134, 153, 170
POL II promoters, 670
poly(A), 300, 316, 321, 484
Polyacrylamide gel electrophoresis (PAGE), 382–383
Polyadenylated RNA, 300–301
Polycystic kidney disease, 854
Polygalacturonases, 680
Polymerase chain reaction (PCR), 22, 310, 321, 477, 480, 486–487, 543, 672
Polymerases, viral, 571
Polymorphisms, see Single nucleotide polymorphisms (SNPs)
  advanced database searches, 142
  eukaryotic chromosomes, 677
  human disease, 855, 862
  human genome, 821
  inversion, 678
Polypeptides, 391, 394, 425
Polyploid organisms, 539, 602, 675, 753
Polyvinylidene fluoride (PVDF) membrane, 381
P1-derived artificial clones (PACs), 804
Pongo pygmaeus, 51, 166, 778–779
Poplar, 755
PopSet (Population Data Study Sets), 201–202
Population biology studies, 755
Population shadowing, 553–554
Populus spp.:
  characteristics of, 753
  trichocarpa, 755
Porphyra purpurea, 530
Porphyria variegata, 877
Porphyromonas gingivalis, 600
Position-based scoring matrices (PSSM):
  advanced database searches, 146–154, 157, 174
  database searches, 111, 129, 174
  multiple sequence alignment, 195, 199
Positive selection, 224, 227–230, 478, 737, 777, 780
Position-specific iterated BLAST (PSI-BLAST). See PSI-BLAST
Postgenomic era, 559
Posttranslational modification, 390, 397–399, 401, 410, 439
Potato virus X, 566
PowerAtlas, 348
Power law distribution, 504
Pox viruses, 567, 592
PP1 g, 224
PP2 b, 224
ppsearch, 412
Prader-Willi syndrome, 853, 856
PRALINE, 184, 190–191
PRATT, 396, 412
Precision, in microarray data analysis, 344–345
Prediction of Apicoplast Targeted Sequences (PATS), 740
Prediction software, 667, 688
Predictive accuracy, 367–368
PredictProtein server, 414, 429, 450
PREFAB, 184, 195
Pregnancy, research studies, 845, 855
Prenyl group, 398
Preservation of Favoured Races in the Struggle for Life (Darwin), 215
Primate(s):
  genomes, 778–781
  research studies, 107, 227–228, 231, 583, 878
Primer extension strategy, 544
Primers, functions of, 310, 322, 547, 552
Principal components analysis (PCA), 198–199, 332–333, 355, 361, 364–367, 740
PRINTS database, 34, 200–201, 390
Prion proteins, 397, 454, 568
Pristionchus pacificus, 761
Probabilistic consistency transformation, 194–195
Probability applications:
  advanced database searches, 163
  genomic sequencing, 551
  hidden Markov models, 156–158
  matrices, 65
  mutations, 63, 65
  odds ratio, 59, 62, 69
  phylogenetic analysis, 246–247, 264–266
Probability distribution, 158–159, 265
ProbCons, 184, 192, 194–195
Probes:
  eukaryotic chromosome analysis, 645
  microarray analysis, 316–318, 337, 345–346
PROCHECK, 450
Prochlorococcus marinus, 154
Procrustes, 558
ProDom, 201, 203, 390, 394–395
Profile, protein families, 391
Profile HMMs, 157–160, 197, 199
ProfileScan Server, 412
Profile searches, using advanced databases, 144, 146
Progenitors, 748
Progenote, 518, 521
Progeny, 572
Prokaryotes:
  classification of, 599
  epicellular, 609
  gene-finding programs, 556–557
  genome analysis, see Prokaryotic genomes
  genomic annotation, 556–557
  large-scale, culture-independent, 610
Prokaryotic genes, 664
Prokaryotic genomes:
  analysis:
    functional annotations, 622–625
    lateral gene transfer, 620–622
    nucleotide composition, 615–620
  characteristics of, 56, 398, 402, 518, 520, 524–525, 539, 567
  comparison of:
    MUMmer, 628–629
    significance of, 625–626
    TaxPlot, 626–628
  pitfalls of, 630
Prolactin, 223
Proline, 54, 61, 63, 152, 229, 427–428
Promoter 2.0 Prediction Server, 670
Promoters, functions of, 294, 478, 670, 672
Propagation, 297
PROSCAN (PROSITE SCAN), 412
PROSITE, 34, 181, 201–202, 394, 396, 507, 618
Proteases, molecular evolution, 223
Proteasomes, 409
Protein(s):
  annotations, 33
  discovery of, 12
  eukaryotic, 170
  families, 388–390
  folding, 424
  functional classifications, 410
  functional genomics, 493
  histones, see Histone proteins
  homologous, 151, 180
  kinases, 398, 703
  lipocalins, see Lipocalins
  measles virus, 589–590
  microarrays, 493
  molecular evolution, 224
  networks, 501–508
  replication, 680
  in RNA analysis, 320–321
  sequences:
    accession numbers, 26–27
    advanced database searches, 113–114, 125, 145
    BLAST search, 102–103, 106
    direct, 381–382
    ExPASy, 34–35, 39
    GenBank database, 14–15
    molecular evolution, 225–226
    percent similarity, 53
    phylogenetic analysis, 217–218
    phylogenetic trees, 235–236
    repetitive, 111
  structure:
    disease and, 453–454
    domain structures, 443–446
    hierarchy of, 423–424
    historical perspectives, 420
    intrinsically disordered proteins, 453
    overview of, 8, 421–423, 454
    pitfalls of, 455
    prediction, 184, 429–430, 447–453
    primary, 423–425
    principles of, 423–434
    resources, 446–447
    secondary, 423–430, 437
    structural genomics, 432–433
    taxonomic system, 441–443
    tertiary, 423–424, 430–431
    three-dimensional, 125–126, 432–433
  superfamilies, 59
  synthesis, 813
Protein analysis:
  characteristics of, 379–380, 411
  Gene Ontology (GO) Consortium, 388–389, 402–403
  historical perspectives, 380
  modular nature of proteins, 389–394
  multidomain proteins, 394–395
  perspectives on proteins, 388–389
  physical properties of proteins, 397–402
  pitfalls of, 411–412
  protein alignment, 47–54
  protein function, 407–411, 493
  protein identification techniques, 381–388
  protein localization, 406–407, 493
  protein patterns, 394–397
  web resources, 412–414
Protein-based trees, 240–243 Protein-coding genes, 557, 559, 605, 620, 662–664, 668–669, 673, 688, 701, 704, 706, 720–721, 737, 743, 755, 763, 775, 781–782, 793, 796, 814 Protein complexes, 499–500 Protein Data Bank (PDB): accessing entries at NCBI website, 437–441 annotations, 454–455 contents of, 18, 23, 25, 27, 33, 106–107, 195, 422, 434 protein folds, 441–446 protein structure, 432 viral structures, 592 Protein databases, 23, 380–381 Protein Domain Parser, 446–447 Protein disulfide isomerase (PDI), 388, 401 Protein Family (Pfam) See Pfam Protein Information Resource (PIR), 18, 27, 33, 35, 93, 106, 201 PROtein MUMmer (PROmer), 629 ProteinPilot, 288–387 Protein–protein interactions, 103, 399, 493–496, 498–508 Protein Research Foundation (PRF), 27, 33, 106 Protein Structure Initiative (PSI), 433–434 Proteobacterium, 150, 528, 600, 610, 612, 622, 732 Proteomes, 5, 224, 687, 763, 814–816 Proteomics: applications, 493–508, 701, 741 overview of, 379–380 research standards, 381 Proteomics Standards Initiative (PSI), 381, 499 Proteotyping, 577 Proto-oncogenes, 251 Protobacterium, 749 Protoctists, 697 Protozoans, 403, 541, 731–735, 855 Protozoology, 596 PRRP, 196 PRSS, 89, 93 Pseudocoelomata, 758 Pseudocounts, 148 Pseudogenes, 162, 226, 242, 292, 653–657, 659, 662, 708, 710, 736, 796 Pseudomonas aeruginosa, 534–535, 608 p17, 39 p7, 39 psi (ψ), 426–427 ψ-BLAST See PSI-BLAST PSI-BLAST: advanced searches, 110 algorithms, 152 characteristics of, 126, 135, 145–149, 174, 434 errors, types of, 152–153, 174 eukaryotic genomes, 736, 740 human genome, 819 measles virus, 590–591 multiple sequence alignment, 200 performance assessment of, 151–152 profile searches, 149 progressive alignment, 191 protein structure, 449, 451 PSSM, 148–151 target frequencies, 148 PSIPRED, 429 PSORT, 414 PubChem database, 281 PubMed: accessing sequencing data, example of, 38–39 access to, 40 Central, 40 contents of, 14, 23–24, 39–41, 368 HIV heading, 587 links to, 30
Medical Subject Headings (MeSH), 41–42 sample search, 41–42 tutorials, 39 Puccinia graminis, 717 Pufferfish, 765, 768–770, 792 Pulmonary disease, 872 Pulsed-field gel electrophoresis, 738 Purines, 281, 617, 761 p value, 121–122, 349–351 Pycnogonida, 761 Pygmy introns, 745 Pylaiella littoralis, 530 Pyrimidines: functions of, 250, 281 nucleotides, 242 Pyrococcus spp.: abyssi, 536, 601 horikoshii OT3, 533 Pyrophosphates (PPi), 545–546 Pyrosequencing, 541, 545–547, 573 Q3, 430 Quantile normalization, 344, 346 Quantitative trait locus (QTL), 851 Query sequences: advanced database searches, 155, 162 genomic DNA searches, 162 Query sequences, BLAST searches: advanced, 143, 145–148 characteristics of, 100, 111–120, 135 q value, 352 Rabies virus, 570 RAD, 754 Radial trees, phylogenetic analysis, 267 Radioactivity, 247, 316–318 Ramachandran plots, 427–428 RAN, 224 Random effects, microarray data analysis, 354 Random insertional mutagenesis, 483–485 Ras gene, 702, 817, 874 RasMol, 437–438 Rat genome, 774–778, 877 See also Rattus spp.; Rodent studies Rat Genome Database (RGD), 403, 754 Rat Genome Sequencing Project Consortium, 778 Ratios, in microarray data analysis, 340 Rattus spp.: genome analysis, 403 norvegicus, 17, 22, 30, 51, 57, 105, 554, 775 rbcL gene, 752 RCN1 gene, 707 Reactome database, 403, 502, 504 ReadSeq servers, 246, 265 Rearrangements: genome-wide, 745 implications of, human disease, 854 Reassortment viruses, 577 Receiver operating characteristic (ROC), 111, 683 Reclinomonas americana, 530 Recombinant Identification Program (RIP), 587 Recombination: human genome, 808, 827, 830 human disease, 854 Red algae, 535, 747 Reference pools, microarray analysis, 315–316 Reference Sequence (RefSeq) Project, 27–29, 37–39, 52 See also RefSeq Reformatting, BLAST search, 114 RefSeq: accession numbers, 282, 294, 526, 707, 735, 776 advanced database searches, 146, 154 annotations, 368, 666 bacterial and archaeal genomes, 622 coding sequence, 814 eukaryotic
chromosomes, 646 eukaryotic genome, 776 Genes, 302 genome sequencing, 165, 531 human disease, 870 human genome, 794, 796, 818, 830 multidomain proteins, 394 noncoding RNAs, 296 nucleotide sequences, 107 protein searches, 106, 112, 129–130, 160, 380, 386, 501 repetitive DNA, 656 retransposition, 653 RNA sequences, 289, 299, 301–302 transcription, 322 viral genome, 583, 586 RefSNPs, 685, 793 Regulatory Sequence Analysis Tools (RSAT), 670 RegulonDB database, 466 Relatedness-odds matrix, 69 Relative entropy, 91–92 Renal disease, 873, 875 Renal tubular acidosis, 875 REP, 412 Repeat-induced point (RIP) mutations, 720 Repeated-measures ANOVA, 353 RepeatFinder, 654 RepeatMasker, 653–657, 768, 807 Repeats: interspersed, 164, 652–653, 810 protein families, 300, 303, 392, 394 simple sequences, 811 Replication, human genome, 830 Resequencing, 538, 542, 869 Resourcerer, 368 Respiratory diseases, 844–845 Restriction enzymes, 548 RET gene, 650, 673 Retinol, Retinol-binding protein (RBP/RBP4), –8, 32–33, 49, 89, 94, 105, 125–128, 146, 149–150, 154–155, 189, 304, 407–408, 422, 442–443, 501, 552, 796 Retransposons, 652, 809–810 Retropseudogene, 652 Retrotransposons, 704 Retroviral gene expression, 130 Retroviral sequences, BLAST searches, 134 Retroviruses, 297, 572, 585 Rett syndrome (RTT), 62, 319, 464, 807, 846, 848–850, 861, 882 Reverse genetics, 473–491, 508 Reverse position-specific BLAST See RPS-BLAST (reverse position-specific BLAST), 153 Reverse proteomics, 494–495 Reverse transcriptase, 36, 38, 130, 152, 297, 394, 656 Reverse transcriptase-polymerase chain reaction (RT-PCR), 297, 323, 666 Rfam: characteristics of, 282, 292 database, noncoding RNAs, 283–285 eukaryotic chromosomes, 663 microRNA and, 293 phylogenetic analysis, 244 R groups, protein structure, 425–426 Rhesus Macaque Genome Sequencing and Analysis Consortium, 780, 878 Rhesus monkey studies See Macaca mulatta Rhesus rhadinovirus (RRV), 579 Rheumatoid arthritis, 867–868 Rhinovirus,
572 Rhizophydium spp., 530 Rhizopus spp.: nigricans, 698 oryzae, 717 Rhodopsins, 127, 397, 759 RibAlign software, 612 Riboflavin synthesis, 761 Ribonuclease, pancreatic, 223 Ribonucleic acid (RNA): amplification, 315 analysis interpretation, 320–322 ancient, 543 ancient viruses, 571 -based trees, 240–243 characteristics of, 279, 323 circular molecules, 568 complementary DNA (cDNA), relationship with, 302–309, 322–323 composition of, 281 -dependent DNA polymerase, 297 double-stranded, 570, 572–573, 579 functional genomics, 493, 662 gene expression studies, 323 historical perspectives, 521–522 hybridization, 337 interference (RNAi), 294, 489–491, 760 messenger, see Messenger RNA (mRNA) micro (miRNA), 293–295, 313, 322, 622 microarray analysis, 316–317, 322–323, 335–337 -multiprotein complex, 406 nuclear ribosomal, 680 overview of, 279–282 polymerases, 224, 613 RefSeq identifiers, 28 ribosomal (rRNA), see Ribosomal RNA (rRNA) self-replicating, 574 single-stranded, 568, 570, 575, 579 small interfering (siRNA), 294 small nuclear (snRNA), 291–292, 300, 812 small nucleolar (snoRNA), 292–293, 812 splicing, 816 structure of, 280, 294 surveillance system, 300 synthesis, 590, 607 transcription, 3–5, 130, 301, 321–322, 336, 462, 538, 558 transfer (tRNA), see Transfer RNA (tRNA) web resources, 323 Ribonucleoproteins (RNPs), 290, 590, 813 Ribosomal Database, 234, 244 Ribosomal DNA (rDNA), 289 Ribosomal RNA (rRNA), 282, 288–291, 611–612, 622, 662, 698, 704, 731, 812–813, 825 Ribosome Data Project, 291 Rice, 729, 753–755 Rickettsia spp.: conorii, 609 prowazekii, 533, 600, 604, 609 rif genes, 740–741 RIKEN: Genomic Sciences Center, 803 Mouse Gene Encyclopedia Project, 876 R language, 251–252, 340–342, 367 RMSD-APDB, 195–196 RNAdb, 282 RNA-induced silencing complex (RISC), 294, 489 RNAmmer, 291 RNA World website, 323 Robertsonian translocation, 675–676 Robust Multiarray Analysis (RMA), 340, 342, 344–346 Rodentia, 107, 732 Rodent studies: disease in, 876–878
functional genomics, 472–473 genomic sequencing, 536, 552, 774–778 malaria parasites and, 739 molecular evolution, 231 phylogenetic trees, 239 protein complexes, 499 protein structure, 422 viruses, 577 Root mean square deviation (RMSD), 441, 451 Rooted phylogenetic trees, 233–235, 257 ROSE software, 162, 183–184 Rosetta Stone, 451–452, 500–501 Rotavirus, 570–571 Retroelements, 652 Roundworm genomes, 758–761 RPB1/RPB2, 699 RPS-BLAST (reverse position-specific BLAST), 153, 199–200 R software, 334 R-statistical package, 358 r-test, 331 Rubella virus, 569–571 Rubisco protein, 749, 752 S-PLUS, 334, 339, 349, 357, 361, 363, 367 SABmark, 184, 195 Saccharomyces spp.: bayanus, 714 castellii, 712–714 cerevisiae, characteristics of, 529–530, 532, 556, 622, 675 chromosome exploration, 704–708 common domains, 701, 703 eukaryotic genomes, 719, 721–722, 763 features of, 697, 700–707, 722 gene duplications/genome duplication, 708–711 gene nomenclature, 707 hemiascomycetes, comparative analysis, 712–715 human disease studies, 842, 870 human proteins compared with, 814 molecular evolution, 224, 242 proteome comparisons, 818 protein analysis, 394, 401, 403, 462, 466, 469, 481, 493, 496–498, 504 protein families, most common, 702 RNA analysis, 289, 321 sequence analysis, 57, 81, 86, 105, 171, 203, 394, 701 Saccharomyces Genome Database (SGD), 144, 402–403, 466–468, 473, 493, 498, 504–506, 645, 705–707, 709, 722, 754 Saccharum spp., 753 SAGA, 196 Salmonella spp.: enterica, 608, 611 typhi, 611 Salmon salar, 305 SAM-T02, 429 SAM-T98, 152 Sampling, microarray data analysis, 337 Sanger Centre, 803 Sanger Institute, 471, 646 Sanger sequencing, 544–545, 701, 804 SAPS, 412 Sarcopterygii, 769 SARS, 571–572 SAS, 349 Satellite DNA, 652, 661 Scalable Vector Graphics (SVG), 744 Scaled phylogenetic tree, 232, 253 Scan, BLAST algorithms, 115 Scanner, microarray data analysis, 337 ScanProsite, 396, 412 ScanPS (Scan Protein Sequence), 144 Scatter plots, 314, 331, 335,
337–342, 370 Scavenger decapping, 500 Schistosoma japonicum, 388 Schizaphis graminum, 609 Schizophrenia, 130, 847, 852 Schizosaccharomyces pombe, 57, 530, 535–536, 715–716, 719–721, 738, 754, 870, 874–875 Scaffold, 549 Scoring matrices: advanced database searches, 145 BLAST searches, 109–110, 121, 129 BLOSUM, see BLOSUM matrices Dayhoff model, accepted point mutations, 58–63, 217 detection limits, 74–75 development of, 57 log-odds, 69–70, 72 PAM, see PAM matrices significance of, 57, 92 SDS-PAGE, 383–384, 499 Sea spiders, 761 Sea urchin, 766–767 Sea Urchin Genome Sequencing Consortium, 758, 767 Sea urchins See also Strongylocentrotus purpuratus Search Tool for the Retrieval of Interacting Genes/Proteins (STRING), 502, 504 Seattle Biomedical Research Institute (SBRI), 737 SEC1 gene, 475 Sedimentation coefficient, 280, 397 SeedGenes, 753 Seed models, PatternHunter, 163 SEG, 152 Segmental duplications, 676, 681 Segmented genomes, 567 Selenomonas sputigena, 596 Self-organizing maps, 332, 361, 363–364, 740 Selfish DNA, 651 Semaphorin, 817 SeneSpring, 349 Sensitivity: advanced database searches, 148, 151, 154, 162–163 alignment algorithm, 87 bacterial and archaeal genomes, 629 BLAST algorithms, 117–118 eukaryotic chromosomes, 667, 671–672 multiple sequence alignments, 204 phosphorylation, 401 significance of, 38, 70, 92 transcription, 321–322 types of, 181 Sensory system, birth-and-death evolution, 680 Septicemia, 844 Sequence Alignment and Modeling Software System (SAM), 160–161, 174 Sequence Retrieval System (SRS), 14, 34, 38, 394–395 Sequence reversal, 628 Sequence similarity, BLAST searches, 124 Sequence-tagged sites (STSs): BLAST searches, 106–107 characteristics of, 20, 22 organisms obtained, 22 Sequencher, 551 Sequencing technology, in human genome, 801 Serial Analysis of Gene Expression (SAGE), 298–299, 309–312, 323, 493, 538 Serial homology, 50 Serine, 51, 54–55, 59, 61–65, 401, 428 Serum albumin, 59, 223, 378 Seven-transmembrane-domain (7TM),
759 Severe combined immunodeficiency disease, 877 Sex Chromosomes and Sex-Linked Genes (Ohno), 729 Sex pheromones, 680 Sexually transmitted diseases, 601 SGP, 668 Shine–Dalgarno sequence, 617–618 Short-chain dehydrogenase/reductase (SDR), 817 Short interspersed nuclear elements (SINEs), 203, 652–655, 755, 771–772, 780, 792, 809–810 Shotgun sequencing, 548, 550, 802–803 See also Whole genome shotgun (WGS) Shotgun single-pass, 302 “Shuffled” genes, 216 Sickle cell anemia, 423, 454, 477, 847–850, 852, 859 Sickle cell disease, 861 Signal detection, 156 SignalP, 402, 414 Signaling molecules, 758 Signal-to-noise ratio, 348, 354 Signal transduction, 758 Signature: advanced database searches, 154, 174 microarray data analysis, 332 protein families, 389–390 sequences, 613–614 Significance analysis of microarrays (SAM), 351–353 Silenced genes, 294, 486, 489–491, 675, 678 Silkworm, 765 Silurian period, 748 SILVA database, 291 SIM, 93 Sim4, 167, 169 Simian(s), retrotransposition, 653 See also Primates Simian immunodeficiency virus (SIV), 574, 583, 586 Simian T-cell lymphotropic virus type (STLV), 586 Simian virus 40 (SV40), 527, 572 Similarity: homologous sequences, 48–49 matrix, 357, 361 scores, multiple sequence alignments, 185 search: advanced databases, 144, 163 multiple sequence alignment, 189 Simple Molecular Architecture Research Tool (SMART) database, 129, 156, 199 Simulations, 120, 683 Single-gene diseases/disorders, 843, 847, 849, 851, 859, 866, 881 Single linkage clustering, 358–361 Single nucleotide polymorphisms (SNPs): advanced database searches, 145, 162 bacteria and archaea, 628 characteristics of, 26 eukaryotic genomes, 646, 755, 777, 780 human disease databases, 863, 867, 869 human genome, 793, 796, 826–831 microarray analyses, 683–687, 867, 869 molecular evolution, 230–231 multiple sequence alignment, 181 nonsynonymous, 827 synonymous, 827 Singletons, 303 Sinorhizobium meliloti, 536 Sister chromatids, 681 Size-fractionation
RNA, 293 Size of genome, significance of, 539–540, 602–604 Skeletal disease, 873 Skew/skewness, implications of: BLAST statistics, 118–119 microarray data analysis, 344, 370 phylogenetic analysis, 251 Skin disease, 845 SKY/M-FISH & CGH Database, 646, 868 Sleeping sickness, 735 Slime genomes, 756–757 Small-insert libraries, 548 Small interfering RNA (siRNA), 294 Small nuclear ribonucleoproteins (snRNPs), 292 Small nuclear RNA (snRNA), 291–292, 300, 812 Small nucleolar RNA (snoRNA), 292–293 Smallpox, 569, 571 Small subunit rRNA (SSU rRNA), 520–521, 524 SMART database, 129, 390–391, 396, 412 Smith–Magenis syndrome, 853, 856 Smith–Waterman algorithm: advanced database searches, 144–145, 152, 159 applications, generally, 101–102, 653, 709, 743 BLAST searches, 122 components of, 81–84 rapid, 84–85 Smith–Waterman alignments, 122 SNAP-25 protein, 398, 400, 469, 496 SNC1/SNC2 genes, 711 Snc1p, 711 snoRNABASE, 295 Sodalis glossinidius, 608, 735 Sodium dodecyl sulfate (SDS), 383 See also SDS-PAGE Solexa, 299, 547, 672, 826 Solibacter usitatus, 604 Somatic cells, 464, 675 Somatic mutations, 538, 869 Somatotropin, 223 Sorangium cellulosum, 604 SOSUI, 414, 429 Sotos syndrome, 856 Soudan Mine Red Sample project, 544 Southern blotting, 477 SP-TrEMBL, 199–200 Speciation, 49–50, 219, 782 Species: distribution, 198 trees, 238–240 Specificity: advanced database searches, 154, 162 alignment algorithm, 87 defined, 38 eukaryotic chromosomes, 671 functional genomics, 469 pairwise alignment, 92 phosphorylation, 401 significance of, 38 Spectroscopy See Mass spectrometry Speech detection, 156 Spiders, 761 Spinal muscular atrophy, 854 Spinocerebellar ataxia, 875 Spirochaetales, 600 Spliceosomal RNAs, 292 Spliceosomes, 300, 813 Splicing, 102, 291, 301, 312, 398, 765, 840 Sporozoites, 738, 740 Spotfire, 334, 349 Spreadsheet applications, 251, 331, 334, 339, 346 SPSS, 349 Spurious matches, 111 Spurious sequences, 174 Sputnik, 752 SRP46, 653 SSAHA (Sequence Search and Alignment
by Hashing Algorithm), 167, 169 SSAP algorithm, 444–445 SSEARCH, 93 SSO1 gene, 711, 713 Standard deviation, 348 Standardization and Normalization of Microarray Data (SNOMAD), 344 Stanford Genome Technology Center, 803 Stanford HIV Drug Resistance Database, 592 Stanford HIV RT and Protease Sequence Database, 591 Stanford Human Genome Center, 803 Stanford Online Universal Resource for Clones and ESTs (SOURCE), 368 Stanford University, genome projects, 738 Staphylococcus aureus, 601, 625 STATA, 334 Statistical analysis, 298 Statistical significance, 87–92, 108, 111, 121, 146, 355, 683 Statistics applications, 332 See also Descriptive statistics; Inferential statistics Statistics of Extremes (Gumbel), 100 Statistics software packages, 334 Step matrices, 243 Sterkiella histriomuscorum, 742, 745 Stochastic context-free grammars (SCFG), 287, 291, 294 Stoichiometry, 500 Stokes radius, 389, 397 Stramenopila, 729, 746–747 Strasbourg Bioinformatics Platform, 81 Streptavidin beads, 310–311 Streptococcus spp.: agalactiae, 625 pyogenes, 601 pneumoniae, 536, 600, 611, 625 pyogenes, 536, 625 Streptomyces spp.: avermitilis, 81–82 coelicolor, 539, 556, 604 Streptophyta, 21 Stress-induced protein (SRP1/TIP1), 702 Stretcher, 92 Stroke, 844 Strongylocentrotus purpuratus, 17, 758, 766–767 Structural classification of proteins (SCOP) database, 152, 422, 435, 441–444, 446 Structural genomics, 422, 432–434, 448, 591 Structural Genomics Consortium, 432 Structure Prediction Meta Server, 450 Substitution(s): amino acids in protein sequences, 60, 62–63 eukaryotic genomes, 737, 776 evolutionary, 879 human disease, 864, 870 human genome, 816 in molecular evolution, 222–225, 231 matrices, 103, 118, 121, 148, 158, 197 pairwise alignment, 94 phylogenetic analysis, 247–251, 732 phylogenetic trees, 240, 252 significance of, 51, 55, 94 Subtrees, 237–238 Suicide, 844 Sulfation, 399 Sulfinator, 413 Sulfolobus spp.: solfataricus, 536 tokodaii, 536 Sum-of-pairs score (SPS), multiple sequence
alignment algorithms, 182–183, 190 Superfamilies, 128, 141, 189, 201, 392, 702–703, 817 SUPERFAMILY, 201, 390 Superoxide dismutase (SOD), 388, 507–508 Support vector machines, 368 Supt4h2, 653 Surface plasmon resonance, 495 Surface proteins, 736 Surfactants, gene expression, 312 Survival rates, 227 Sus scrofa, 17, 20, 51, 57 S values, BLAST algorithms, 120, 122 Swiss Institute of Bioinformatics (SIB), 33–34, 39, 86 Swiss-Model, 423, 450 SwissPDB viewer, 437 Swiss-Prot: database, 18, 23, 25, 27, 33–34, 38–39, 106, 199–200, 368, 380, 386, 394, 453, 662, 796 website, 62 Symmetric matrix, 67 Synapsin, 397 Synaptobrevin proteins, 506, 711 Syndrome, defined, 840 Synechocystis spp., 532, 600 Synonymous Non-synonymous Analysis Program (SNAP), 230, 587 Syntaxin binding protein (stxb1), 474 Syntaxin proteins, 496, 506, 711 Synteny, 673, 737, 740, 776–777, 796 Synthetic genetic array (SGA) analysis, 482 Systematics, historical perspectives, 518, 520 TAA, 32 Tachyzoites, 742 TAG, 32 Tagged proteins, 499 Tags, SAGE database, 311–312 Tajima’s relative rate test, 226–228 Takifugu spp.: characteristics of, 768–769 rubripes, 471, 664, 768 Tandem affinity purification coupled to mass spectrometry (TAP-MS), 499–500 Tandem mass spectrometry (MS/MS), 386–387 Tandemly repeated sequences, 661–662 Tandem repeats, 628, 809 TargetDB, 431 Target frequencies, 59, 110 TargetP, 414 TargetScan, 293 TargetScanS miRNA Regulatory Sites, 295 target2k program, 161 Taste receptors, 680 TATA box, 663, 668 Taxonomy, 524–525 Taxonomy identifier (txid), 107, 128 TaxPlot, 626–628, 718 Tay-Sachs disease, 857 tblastn, 103–106, 122, 133, 170–172 tblastx, 103–106 T cells: functions of, 230 receptors, 680, 796 TCF7l2, 867 T-Coffee, 159, 183–184, 195–196, 202 TEIRESIAS, 412 Telomeric repeats, 660–661 Template-free modeling, 450–453 Templates, homology modeling, 448–449 Termite Gut Metagenome project, 544 Terpenes, 756 TESS (Transcription Element Search System), 670
Tetanus, 611 Tetrahymena spp.: characteristics of, 661, 743 thermophila, 282, 742–743 Tetraodon nigroviridis, 471, 554, 768, 770 Tetraploidy, 709 Tetratricopeptide (TPR-1), 817 TFAM, 288 TGA, 32, 105 Thalassemias, 423, 477, 854, 848–849, 859, 864 Thalassiosira pseudonana, 746–747 The Arabidopsis Information Resource (TAIR), 470, 707, 753–754 The Institute for Genomic Research (TIGR): BLAST, 144 completed genomes, 532 eukaryotic genomes, 754 Gene Indices, 144 eukaryotic chromosomes, 666 information resources, 525 Rice Genome Project, 755 The Membrane Protein Data Bank, 402 Theileria spp.: annulata, 739, 741 parva, 739, 741 Thermophiles, 520 Thermoplasma spp.: acidophilum, 534–535 volcanium GSS1, 533, 601 Thermotogales, 600 Thermotoga maritima, 533, 600, 608 Thogotovirus, 575 Thomsen disease, 875 Threaded Blockset Aligner (TBA) program, 204, 206 Threading, 450 3dee, database, 447 3D-JIGSAW, 450 3D-PSSM, 450 3-Finger venom toxins, 680 Threonine, 51, 54–55, 59, 61, 63–64, 92, 401, 428 Threshold: BLAST search parameters: algorithms, 115, 117–118 E values, 122 implications of, 123 scores, 115 Thrombospondin, 817 thy-1 gene, 397 Thymidine: characterized, 281 kinase, 477 Thymine, 110, 242 Thyrotropin beta chain, 223 Ticks, 741, 761 TIGR Database, Trypanosome genomics, 736 TIGRFAMs, 201, 390 Tiling microarrays, 321, 672 TIM barrel, 444 Time of divergence, 225 Tissue microarrays, 494 T-lymphocytes, 679, 741 TM4 suite, 334 TMHMM, 402 Tmpred, 414 Tobacco plant, 566 Toll-like receptors, 767 Tomato bushy stunt virus, 572 “Top down” clustering, 355 Topoisomerase, 501 Topology: CATH database, 443–444 guide tree, 187 phylogenetic trees, 231, 236, 240, 252, 263 TopPred2, 414 Total RNA, 538 ToxoDB, 742 Toxoplasma spp.: characterized, 732 gondii, 529–530, 739, 741–742 Toxoplasmosis, 741 TPTE gene, 824 Trace-back procedure, 83 Trachoma, 626 Trans-NIH Mouse Initiatives, 472 Transcription: components of, 20, 27, 164 eukaryotic chromosomes, 650 human diseases and, 881 mRNA
expression, 300 nature of, 321–322 phylogenetic trees, 241–242 protein analysis, 389 RNA analysis, 297, 321–322, 330 Transcriptional profiling, 321, 367 Transcriptional regulation, 671 Transcriptional Regulatory Element Database (TRED), 670 Transcription factor(s): databases, 659–600, 669–672 DNA-binding, 671 Transcriptome, 538 Transcripts: characteristics of, 551, 740 full-length, 538 human genome, 796 microarray data analysis, 338–339 TRANSFAC, 670 Transferases, 409 Transfer RNA (tRNA), 14, 282–289, 544, 625, 662, 664, 812–813, 825, 858 Transition probability, hidden Markov models, 158–159 Transitions, phylogenetic trees, 242, 250 Transitive catastrophe, 503 Translocations, 161, 205, 609, 675–676 Transmembrane domain, in proteins, 397, 401–402 Transport proteins, 170, 408, 737 Transposition, 628 Transposon: -derived repeats, 652–653, 809–810 evolution, 810 functions of, 704, 706 -tagged proteins, 407 Transversions, phylogenetic trees, 242, 250 Traumatic injury, 846–847 Tree bisection reconnection (TBR), 237 Tree of life: fungi, 697 history of life on earth, 521–523 illustration of, 516 molecular sequences as basis of, 523–524 nature of, 5, 7, 56, 130, 216, 599 reconstruction of, 520 unrooted, 613 viruses, 591 Web Project, 525 Tree rooting, 172 TREE-PUZZLE, 254–255, 263–264 TreeView software, 267, 361–363 TrEMBL, 25, 33 Treponema pallidum, 533, 608 Tribolium castaneum, 762 Trichomonas spp.: characteristics of, 732–733 vaginalis, 732–733 Triose phosphate isomerase, 444 TRIPLES database, 487–488 Triple X syndrome, 853 Tripsacum dactyloides, 530 Trisomies: impact of, 338–339, 346, 355, 460, 462–463, 465, 473, 644–645, 676, 686, 710, 824, 847, 852–853 Triticum aestivum, 305, 752 tRNAscan, 618 tRNAscan-SE, 285–287, 662 Trophozoites, 740 Troponin C, 223 True tree, defined, 216 Truncation, 40 Trypanosoma brucei Genome Project, 736 Trypanosoma cruzi Genome Initiative Information Service, 736 Trypanosoma spp.: brucei, 735, 737, 754 cruzi, 735,
737 Trypanosomes, 731, 735–736 Trypsin, 59, 223, 388, 401 Tryptophan, 54, 58, 61, 63, 65, 68–70, 148, 428 Tsetse flies, 735 See also Malaria t statistic (test statistic), 348–351, 354 t-test, 333, 348–350, 353–354 Tuberculosis, 609, 611 Tuberous sclerosis: characteristics of, 847, 874, 877 gene (TSC2), 879 Tubulin, 224, 305 Tumor suppressor genes, 869 Tumorigenesis, 869 Tumors, 362 Tupaia glis, 51 Turkey rhinotracheitis virus, 591 Turku, 752 Turner syndrome, 853 Twilight zone, pairwise alignment, 74–75 TWINSCAN, 668 Twin studies, 852 Two-dimensional bacterial genomic display (2DBGD), 616 Two-dimensional gel electrophoresis, 383–386, 411, 493 Two-dimensional SDS-PAGE, 384 Two Sequence Alignment Tool, 93 Two-way clustering, 361 Type II errors, 348 Typhoid, 611 Tyrosine: functions of, 54, 61, 401, 428 kinase, 398 UBC4, 224 Ubiquitins, 223–224, 305, 680 Ultraconserved coding, 672–673 UNAIDS, 583 Uncultured Human Fecal Virus Metagenome project, 544 Unfinished sequence, in genome sequencing, 550–551 UniGene: accessing sequence data, example of, 38 blastx search, 105 clusters, 22, 304–305 components of, 20, 22, 303–304, 368 Entrez Gene compared with, 32–33 express data in cDNA libraries, 308 human genome, 794 human diseases, 862 links to, 30 mRNA sequences, 299 organisms represented in, 21 pairwise alignment, 113 UniParc, 34 Uniparental disomy, 676 UniProt: access to, 34 Archive (UniParc), 34 components of, 23, 380, 399 development of, 33 Knowledgebase (UniProt KB), 33–34, 39, 380 organization of, 33–34 Reference Clusters (UniRef), 34 repetitive DNA, 656 UniProtKB/Swiss-Prot, 201, 395 UniProtKB/TrEMBL, 201 UniRef, 34 Unité de Recherche Génomique Info (URGI), 752 U.S. Department of Energy: functions of, 720 Joint Genome Institute (DOE JGI), 548, 747, 767–768, 803 U.S. Food and Drug Administration, 733 U.S. Naval Medical Research Center (NMRC), 738 Universal Mutation Database, 863–864 Universal Protein Resource (UniProt) See UniProt University of California,
Santa Cruz (UCSC): characteristics of, 289, 862 ENCODE project, 793 GENES, 656 Genome Bioinformatics website, 168, 646, 798 Genome Browser, 11, 14, 35–36, 162, 164–165, 204–206, 294–296, 301–302, 476–478, 485, 525, 645–649, 653–654, 656–659, 669–672, 706, 715, 798, 805–808, 822, 828, 830, 865–866 Saccharomyces Genome Database (SGD), 705–707, 709, 722, 754 Table Browser, 35, 294–296, 525, 646, 653–654, 706, 807, 822 University of Oklahoma, Advanced Center for Genome Technology, 803 University of Texas Southwestern Medical Center, 803 University of Washington Genome Center, 654, 803 UNIX platform, 159, 291, 619 Unpaired t-test, 353 Unrooted phylogenetic trees, 233–235, 259 Unscaled phylogenetic tree, 232 Untranslated regions (UTRs), mRNA sequences, 302–303, 309 Unweighted pair group method of arithmetic averages (UPGMA): distance-based phylogenetic trees, 255–258 implications of, 187, 191 microarray data analysis, 358 UPTAG, 480 Uracil, 64, 281, 286 Ureaplasma urealyticum, 534, 556, 604, 608 Uromodulin, 397 Uridylation, 398 Usher syndrome, 877 Ustilago maydis, 717 Vaccines/vaccinations, 569, 570–571, 574, 587–588, 731, 737 Vaccinia, 570 Valine, 54, 61, 63, 428 Value u, BLAST search, 119 var genes, 740–741 Variant surface glycoprotein (VSG) genes, 736–737 Varicella, 571 Varicellovirus, 578 Variegate porphyria, 875 Vasopressin, 221 VAST, 438 Vector Alignment Search Tool (VAST), 25, 438–441 Vector NTI Suite 7, 93 Venn diagrams, 41 v-erb-B, 128 VERIFY3D, 450 Vertebrata, 107 Vertebrate Genome Annotation (VEGA) project, 144, 380, 471 Vertebrate genomes, 35 Vesicle-associated membrane protein (VAMP), 400, 469, 711 ves1 gene, 741 V genes, 145 Vibrio cholerae, 534, 601, 608 Vienna RNA, 287 Viral Genome Database (VGDB), 591 Viral Genome Organizer, 592 Viral genomes, 5, 527, 539, 568 Viral (MHV3) hepatitis, 876 Virginia Bioinformatics Institute, 81 Viridiplantae, 16, 107, 748–750 Viroids, 568 Virology, problems in, 574 Viruses, see Viral
genomes BLAST searches, 134 characteristics of, 518, 520, 525, 548, 567–568, 591 classification of, 568–571 diversity, 571–573 evolution of, 571–573, 591 metagenomics, 573 microarray analysis, 381 molecular evolution, 216 mosaic, 566 RNA, 294 sequence analysis, 7, 16, 25–26 types of, 130, 574–591 web resources, 591–592 VIrus Particle ExplorER (VIPER), 592 VISTA (Visual Tools for Alignments), 674 Visual Molecular Dynamics (VMD), 437 Vitamin A See Retinol Vitis vinifera, 17, 755–756 VizX Labs, 334 VRML, 437 WAK-like kinase, 680 Ward’s method, 358 Washington University Genome Sequencing Center, 548, 803 WebGene, 666 WebMol software, PDB, 422, 427, 434–437 Weizmann Institute, 81 Wellcome Trust Case Control Consortium, 867–868 Wellcome Trust Sanger Institute (WTSI), 15–16, 36, 41, 142–143, 201, 295, 403, 464, 540, 548, 654, 721, 736–379, 768, 800, 869, 883 Wernicke–Korsakoff syndrome, 875 West Nile virus disease, 764 WGKV, 50 WHATIF, 450 White papers, 539 Whitehead Institute for Biomedical Research, MIT, 803 Whole-genome duplication (WGD), 675–676, 709, 712–714, 743–744, 752 Whole-genome sequencing, 538, 551 Whole-genome shotgun (WGS) sequence, 15, 172–173, 380, 539, 548–549, 553, 651, 658, 701, 733, 735–736, 738, 761, 805 Whole Mouse Catalog, 876 Whooping cough, 616 Wigglesworthia glossinidia, 608, 735 Wilcoxon test, 349 Wild-type alleles, 707 Williams–Beuren syndrome, 823, 856 Williams syndrome, 463 Wilson disease, 875 Within-subject design, 350 Wolbachia spp., 608, 761 WoLF PSORT, 406–407, 414 Word pairs, BLAST algorithms, 115–116 Word size: advanced database searches, 144 BLAST search, 107–110 MegaBLAST applications, 166–167 pairwise alignment, 50 World Health Organization, 583, 732–733, 736, 844–845 WormBase, 403, 470–471, 707, 754, 760 Worms, 731, 759–761 See also Silkworms WPL228W gene, 713 WU BLAST 2.0, 144 X chromosome inactivation center (XCI), 678 X chromosomes, 685, 729, 759, 761, 773, 776–777, 780, 797, 799, 807, 813, 823, 825, 830, 849
–851 Xenopus spp., 17, 20, 305, 679 Xeroderma pigmentosum, 875 X inactivation center (XIC), 772–773 XIST gene, 772–773 X-linked retinoschisis gene (RS1), 878 Xpound, 666 X-ray: crystallography, 6, 8, 48, 145, 152, 182, 185, 217, 430, 433, 566 diffraction, 420, 435 irradiation, 471 χ²: analysis, 879 distribution, 253 test statistic, 227 Xylella fastidiosa, 534 XYY syndrome, 853 Yaks, 202 Yarrowia lipolytica, 712, 716 Y chromosome, 708, 729, 776, 797, 799, 814, 823–825, 850 Yeast, see Saccharomyces spp. functional genomics, 466–470, 473, 487 genome, 86 nature of, 292, 363, 368, 381, 407, 535, 544, 666 reverse genetics, 480–483 two-hybrid system, 411, 462, 496–500, 503 unicellular, 719 Yeast artificial chromosomes (YAC), 20, 22, 794 Yeast Gene Order Browser, 713 Yellow fever virus, 570, 764 Yersinia pestis, 608 YinOYang 1.2, 413 YKL159c gene, 707 YLR106c gene, 707 YPL230W gene, 713 Zalophus californianus, 173 Zea mays, 17, 20, 23, 530, 554, 652, 673, 752 Zebrafish, nature of, 768 See also Danio rerio Zebrafish Information Network (ZFIN), 403, 471 Zellweger syndrome, 875 Zinc, 221 Zoophytes, 790 Z scores, 88–89, 119, 446 Zygotes, 473