A user's guide to the human genome

DOCUMENT INFORMATION

Basic information

Format
Pages	82
File size	16.61 MB

Content

Contents

Supplement to Nature Genetics, September 2002
Cover art by Darryl Leja
© 2002 Nature Publishing Group http://www.nature.com/naturegenetics

Editorial
Spreading the word
Alan Packer

Foreword
Power to the people
Andreas D Baxevanis & Francis S Collins

Perspective
Genomic empowerment: the importance of public databases
Harold Varmus

User’s guide
A user’s guide to the human genome
Tyra G Wolfsberg, Kris A Wetterstrand, Mark S Guyer, Francis S Collins & Andreas D Baxevanis

Introduction: putting it together

Question 1
How does one find a gene of interest and determine that gene’s structure? Once the gene has been located on the map, how does one easily examine other genes in that same region?

Question 2
How can sequence-tagged sites within a DNA sequence be identified?

Question 3
During a positional cloning project aimed at finding a human disease gene, linkage data have been obtained suggesting that the gene of interest lies between two sequence-tagged site markers. How can all the known and predicted candidate genes in this interval be identified? What BAC clones cover that particular region?

Question 4
A user wishes to find all the single nucleotide polymorphisms that lie between two sequence-tagged sites. Do any of these single nucleotide polymorphisms fall within the coding region of a gene? Where can any additional information about the function of these genes be found?

Question 5
Given a fragment of mRNA sequence, how would one find where that piece of DNA mapped in the human genome? Once its position has been determined, how would one find alternatively spliced transcripts?

Question 6
How would one retrieve the sequence of a gene, along with all annotated exons and introns, as well as a certain number of flanking bases for use in primer design?

Question 7
How would an investigator easily find compiled information describing the structure of a gene of interest? Is it possible to obtain the sequence of any putative promoter regions?

Question 8
How can one find all the members of a human gene family?

Question 9
Are there ways to customize displays and designate preferences? Can tracks or features be added to displays by users on the basis of their own research?

Question 10
For a given protein, how can one determine whether it contains any functional domains of interest? What other proteins contain the same functional domains as this protein? How can one determine whether there is a similarity to other proteins, not only at the sequence level, but also at the structural level?

Question 11
An investigator has identified and cloned a human gene, but no corresponding mouse ortholog has yet been identified. How can a mouse genomic sequence with similarity to the human gene sequence be retrieved?

Question 12
How does a user find characterized mouse mutants corresponding to human genes?

Question 13
A user has identified an interesting phenotype in a mouse model and has been able to narrow down the critical region for the responsible gene to approximately 0.5 cM. How does one find the mouse genes in this region?

Commentary: keeping biology in mind
Acknowledgments
References
Web resources: Internet resources featured in this guide

Editorial

There was a time, not too long ago, when the wisdom of genome-sequencing projects was up for discussion. Would they be too expensive, draining funds from other areas of the life sciences? Would they be worth the trouble? Not much more than 15 years have passed since those early debates, and the importance of sequenced genomes to biology and medicine has now gained wide acceptance.
This is in part owing to the relatively rapid fall in the cost of sequencing, followed by the undeniably important insights gained from the annotation of several bacterial genomes, and those of a few of our favorite eukaryotes. The news has been so relentlessly upbeat that one might even have expected some ‘genome fatigue’ to set in, especially given the saturation coverage of the publication of the drafts of the human genome sequence 18 months ago. Not so, however; witness the recent jockeying by different groups for inclusion of ‘their’ model organism in the next round of sequencing projects. The honeymoon goes on.

And yet there are important issues to be addressed. One is the concern surrounding any bestseller: that it will have far fewer actual readers than one might expect. At first glance, this would seem not to apply to the human genome. After all, one is hard pressed these days to pick up a copy of Nature Genetics, or any genetics journal, and not find evidence that sequenced genomes inform many of the most important advances. A survey published last year by the Wellcome Trust, however, found that only half of the researchers who were using sequence data were fully conversant with the services provided by the freely accessible databases.

There is also the concern that genome sequencers might be victims of their own success. As computational biologist David Roos recently put it, “We are swimming in a rapidly rising sea of data… how do we keep from drowning?” And if geneticists and bioinformaticians are struggling to stay afloat, what of the non-geneticists who are eager to exploit the sequences but are relative newcomers to the tools needed to navigate all of this information?

It is with these questions in mind that we present A User’s Guide to the Human Genome.
Written by Tyra Wolfsberg, Kris Wetterstrand, Mark Guyer, Francis Collins and Andreas Baxevanis of the National Human Genome Research Institute (NHGRI), this peer-reviewed how-to manual guides the reader through some of the basic tasks facing anyone whose work might be facilitated by an improved understanding of the online resources that make sense of annotated genomes. The directors of these online resources (Ewan Birney of Ensembl, David Haussler of the University of California, Santa Cruz and David Lipman of the National Center for Biotechnology Information) have served as advisors during the development of this guide, ensuring a balanced and accurate treatment of their respective web portals. The online version of the guide will also evolve, with an initial update scheduled for April 2003.

As noted by Harold Varmus in his eloquent perspective on A User’s Guide and the public databases it examines, one of the important legacies of the Human Genome Project is its ethos of open access to the data. In this spirit, and with the generous sponsorship of the NHGRI and the Wellcome Trust, the online version of this supplement will be freely available on the Nature Genetics website.

Alan Packer
Nature Genetics
Spreading the word, doi:10.1038/ng961

Foreword

Power to the people
doi:10.1038/ng962

The National Human Genome Research Institute of the National Institutes of Health is delighted to sponsor this special supplement of Nature Genetics. The primary aim of this supplement is to provide the reader with an elementary, hands-on guide for browsing and analyzing data produced by the International Human Genome Sequencing Consortium, as well as data found in other publicly available genome databases.
The majority of this supplement is devoted to a series of worked examples, providing an overview of the types of data available and highlighting the most common types of questions that can be asked by searching and analyzing genomic databases. These examples, which have been set in a variety of biological contexts, provide step-by-step instructions and strategies for using many of the most commonly used tools for sequence-based discovery. It is hoped that readers will grow in confidence and capability by working through the examples, understanding the underlying concepts, and applying the strategies used in the examples to advance their own research interests.

One of the motivating factors behind the development of this User’s Guide comes from the general sense that the most commonly used tools for genomic analysis still are terra incognita for the majority of biologists. Despite the large amount of publicity surrounding the Human Genome Project, a recent survey conducted on behalf of the Wellcome Trust indicated that only half of biomedical researchers using genome databases are familiar with the tools that can be used to actually access the data. The inherent potential underlying all of this sequence-based data is tremendous, so the importance of all biologists having the ability to navigate through and cull important information from these databases cannot be overstated.

The study of biology and medicine has truly undergone a major transition over the last year, with the public availability of advanced draft sequences of the genomes of Homo sapiens and Mus musculus, rapidly growing sequence data on other organisms, and ready access to a host of other databases on nucleic acids, proteins and their properties. Yet for the full benefits of this dramatic revolution to be felt, all scientists on the planet must be empowered to use these powerful databases to unravel longstanding scientific mysteries.
As pointed out by Harold Varmus in the Perspective, free accessibility of all of this basic information, without restrictions, subscription fees or other obstacles, is the most critical component of realizing this potential. It is our modest hope that this User’s Guide will provide another useful contribution.

Andreas D. Baxevanis and Francis S. Collins
National Human Genome Research Institute

Perspective

Genomic empowerment: the importance of public databases
doi:10.1038/ng963

Over the past twenty-five years, a mere sliver of recorded time, the world of biology, and indeed the world in general, has been transformed by the technical tools of a field now known as genomics. These new methods have had at least two kinds of effects. First, they have allowed scientists to generate extraordinarily useful information, including the nucleotide-by-nucleotide description of the genetic blueprint of many of the organisms we care about most: many infectious pathogens; useful experimental organisms such as mice, the roundworm, the fruit fly, and two kinds of yeast; and human beings. Second, they have changed the way science is done: the amount of factual knowledge has expanded so precipitously that all modern biologists using genomic methods have become dependent on computer science to store, organize, search, manipulate and retrieve the new information.

Thus biology has been revolutionized by genomic information and by the methods that permit useful access to it. Equally importantly, these revolutionary changes have been disseminated throughout the scientific community, and spread to other interested parties, because many of those who practice genomics have made a concerted effort to ensure that access is simplified for all, including those who have not been deeply schooled in the information sciences.
The goal of providing genomic information widely has also inevitably attracted the interests of those in the commercial sector, and privately developed versions of various genomes are also now available, albeit for a licensing fee.

The operative principle most prominently involved in transmitting the fruits of genomics, the one that has captured the imagination of the public and served as a standard for the sharing of results and methods more generally in modern biology, has been open access. Funding by public and philanthropic organizations, such as the U.S. National Institutes of Health, the U.S. Department of Energy, the Wellcome Trust in Britain, and many other organizations, has made this altruistic behavior possible and has fostered the idea that genomic information about biological species should be available to all. (Such information about individual human beings is, of course, an entirely different matter and should be protected by privacy rules.) The attitude of open access to new biological knowledge has also been embodied in the databases of the International Nucleotide Sequence Database Collaboration, comprising the DNA Data Bank of Japan, the European Molecular Biology Laboratory, and GenBank at the U.S. National Library of Medicine (NLM). The same focus on open access is exemplified by PubMed (operated by the NLM), other gateways to the scientific literature, and the assemblies of genomic sequence now found at the several Web portals described in this guide.

The Human Genome Project (HGP), which has supported the public genome sequencing effort, has been the mainstay of the effort to make genomes accessible to the entire community of scientists and all citizens. This effort has, in fact, been quite naturally extended to instruct the public about many themes in modern biological science.
This has occurred in part because the human genome itself has been such an exciting concept for the public; in part because genomes are natural entry points for teaching many of the principles of biological design, including evolution, gene organization and expression, organismal development, and disease; and in part because those who work on genomes have been tireless in attempts to explain the meaning of genes to an eager public. Endless metaphors, artistic creations, lively journalism, monographs about social and ethical implications, televised lectures from the White House, and many other cultural happenings have been among the manifestations of this fascination. In this way, the HGP has had a strong hand in raising the public’s awareness of new ideas in biology and of the powerful implications of genomics in medicine, law and other societal institutions.

Some of these cultural effects come as much from the behavioral aspects of the HGP as from the genomic sequences themselves. The sharing of new information, even before its assembly into publishable form, has spurred efforts to share other kinds of research tools and has encouraged the notion of making the scientific literature freely accessible through the Internet. The contribution of scientists in many countries to the sequencing of many genomes, including the human genome, has inspired efforts to develop gene-based sciences (from basic genomics to biotechnology) throughout the world, including the poorest developing nations. Indeed, the World Health Organization, the United Nations, and the World Bank have all contributed recently to the growth of the ideas that science is both possible and valuable in all economies and that science can be a means to help unify the world’s population under a banner of enlightenment, demonstrating a virtue of globalization.
From this perspective, the availability of the sequences of many genomes through the Internet is a liberating notion, making extraordinary amounts of essential information freely accessible to anyone with a desktop computer and a link to the World Wide Web. But the information itself is not enough to allow efficient use. Interested people who reside outside the centers for studying genomes need to be told where best to view the information in a form suitable for their purposes and how to take advantage of the software that has been provided for retrieval and analysis.

The manual before us now offers such help to those who might otherwise have had trouble in attempting to use the products of genomics. Furthermore, the advice is offered in that spirit of altruism that has come to characterize the public world of genomics. The information is provided in a highly inviting and understandable format by casting it in the form of answers to the questions most commonly posed when approaching big genomes. The information, made freely available on the World Wide Web, has been assembled by some of the best minds in the HGP, who have generously given their time and intellect to encourage widespread use of the great bounty that has been created over the past two decades. In other words, the guide to use of genomes provided here is simply another indication that the HGP should take great pride in much more than the sequencing of genomes.

Harold Varmus
Memorial Sloan-Kettering Cancer Center

User’s guide

A user’s guide to the human genome
doi:10.1038/ng964

The primary aim of A User’s Guide to the Human Genome is to provide the reader with an elementary hands-on guide for browsing and analyzing data produced by the International Human Genome Sequencing Consortium and other systematic sequencing efforts.
The majority of this supplement is devoted to a series of worked examples, providing an overview of the types of data available, details on how these data can be browsed, and step-by-step instructions for using many of the most commonly used tools for sequence-based discovery. The major web portals featured throughout include the National Center for Biotechnology Information Map Viewer, the University of California, Santa Cruz Genome Browser, and the European Bioinformatics Institute’s Ensembl system, along with many others that are discussed in the individual examples. It is hoped that readers will become more familiar with these resources, allowing them to apply the strategies used in the examples to advance their own research programs.

Authors
Tyra G. Wolfsberg, Kris A. Wetterstrand, Mark S. Guyer, Francis S. Collins, Andreas D. Baxevanis
National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA. e-mail: andy@nhgri.nih.gov

Introduction: putting it together
doi:10.1038/ng965

In its short history, the Human Genome Project (HGP) has provided significant advances in the understanding of gene structure and organization, genetic variation, comparative genomics and appreciation of the ethical, legal and social issues surrounding the availability of human sequence data. One of the most significant milestones in the history of this project was met in February 2001 with the announcement and publication of the draft version of the human genome sequence¹. The significance of this milestone cannot be overstated, as it firmly marks the entrance of modern biology into the genome era (and not the post-genome era, as many have stated).
The potential usefulness of this rich databank of information should not be lost on any biologist: it provides the basis for ‘sequence-based biology’, whereby sequence data can be used more effectively to design and interpret experiments at the bench. The intelligent use of sequence data from humans and model organisms, along with recent technological innovation fostered by the HGP, will lead to important advances in the understanding of diseases and disorders having a genetic basis and, more importantly, in how health care is delivered from this point forward².

Although this flood of data has enormous potential, many investigators whose research programs stand to benefit in a tangible way from the availability of this information have not been able to capitalize on its potential. Some have found the data difficult to use, particularly with respect to incomplete human genome draft sequence information. Others are simply not sufficiently conversant with the seeming myriad of databases and analytical tools that have arisen over the last several years. To assist investigators and students in navigating this rapidly expanding information space, numerous World Wide Web sites, courses and textbooks have become available; many individuals, of course, also turn to their friends and colleagues for guidance.

We have prepared this Guide in that same spirit, as an additional resource for our fellow scientists who wish to make use (or better use) of both sequence data and the major tools that can be used to view these data. The Guide has been written in a practical, question-and-answer format, with step-by-step instructions on how to approach a representative set of problems using publicly available resources. The reader is encouraged to work through the examples, as this is the best way to truly learn how to navigate the resources covered and become comfortable using them on a regular basis.
We suggest that readers keep copies of the Guide next to their computers as an easy-to-use reference.

Before embarking on this new adventure, it is important to review a number of basic concepts regarding the generation of human genome sequence data. This review does not discuss the chronological development of the HGP or provide an in-depth treatment of its implications; the reader is referred to Nature’s Genome Gateway (http://www.nature.com/genomics/human/) for more information on these topics.

Current status of human genome sequencing

Sequencing of the human genome is nearing completion. The target date for making the complete, high-accuracy sequence available is April 2003, the 50th anniversary of the discovery of the double helix³. As we go to press, however, the work is still a mosaic of finished and draft sequence. A sequence becomes finished when it has been determined at an accuracy of at least 99.99% and has no gaps. Sequence data that fall short of that benchmark but can be positioned along the physical map of the chromosomes are termed ‘draft’. Currently, 87% of the euchromatic fraction of the genome is finished and less than 13% is at the draft stage.

Even in this incomplete state, the available data are extremely useful. This usefulness was apparent early on, leading the International Human Genome Sequencing Consortium (IHGSC) to pursue a staged approach in sequencing the human genome. The first stage generated draft sequence across the entire genome¹. The project is now well advanced into its second stage, with draft sequence being improved to ‘finished quality’ across the entire genome, a necessarily localized process. As a result, and as it has been presented to date, the human genome sequence is an evolving mix of both finished and unfinished regions, with the unfinished regions varying in data quality.
As the data are initially made available in raw form, with subsequent refinement and improvement, and because data of different quality are found in different places in the genome, users must understand the kinds of data presented by the various tools available.

Determining the human sequence: a brief overview

As with all systematic sequencing projects, the basic experimental problem in sequencing lies in the fact that the output of a single reaction (a ‘read’) yields about 500–800 bp¹,⁴. To determine the sequence of a DNA molecule that is millions of bases long, it must first be fragmented into pieces that are within an order of magnitude of the read size. The sequence at one or both ends of many such fragments is determined, and the pieces are then ‘assembled’ back into the long linear string from which they were originally derived. A number of approaches for doing this have been suggested and tested; the most commonly used is shotgun sequencing⁴. The application of shotgun sequencing to the multimegabase- or gigabase-sized genomes of metazoans is still evolving. A small number of strategies are currently being evaluated, for example, hierarchical or map-based shotgun sequencing, whole-genome shotgun sequencing and hybrid approaches. These approaches are described in detail elsewhere⁴.

The IHGSC’s human sequencing effort began as a purely map-based strategy and evolved into a hybrid strategy¹. The ‘pipeline’ that the IHGSC used to generate the human sequence data involved the following steps.

1. Bacterial artificial chromosome (BAC) clones were selected, and a random subclone library was constructed for each one in either an M13- or a plasmid-based vector.

2. A small number of members of the subclone library (usually 96 or 192) were sequenced to produce very-low-coverage, single-pass or ‘phase 0’ data.
These data were used for quality control and can be found in the Genome Survey Sequence division of the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL) and GenBank (of the National Center for Biotechnology Information; NCBI).

3. If a BAC clone met the requisite standard, subclones were derived and sufficient sequence data generated from these to provide four- to fivefold coverage (that is, enough data to represent an average base in the BAC clone between four and five times). This is known as ‘draft-level’ coverage, and permits the assembly of sequence using computer programs that can detect overlaps between the random reads from the subclones, yielding longer ‘sequence contigs’. At this stage, the sequence of a BAC clone could typically exist on between four and ten different contigs, only some of which were ordered and oriented with respect to one another. The BAC ‘projects’ were submitted, within 24 hours of having been assembled, to the High-Throughput Genomic Sequences (HTGS) division of DDBJ/EMBL/GenBank⁵, where each was given a unique accession number and identified with the keyword ‘htgs_draft’. (The DDBJ, EMBL and GenBank are members of the International Nucleotide Sequence Database Collaboration, whose members exchange data nightly and ensure that the sequence data generated by all public sequencing efforts are made available to all interested parties freely and in a timely fashion.) Less-complete high-throughput genomic (HTG) records are also known as ‘phase 1’ records. As the sequence is refined, it is designated ‘phase 2’. In the context of a BLAST search at the NCBI, these sequences would be available in the HTGS database.

4. In late 2000, the draft sequence of the entire human genome was assembled from the sequence of 30,445 clones (BAC clones and a relatively small number of other large-insert clones). This assembled draft human genome sequence was published in February 2001 and made publicly available through three primary portals: the University of California, Santa Cruz (UCSC), Ensembl (of the European Bioinformatics Institute; EBI) and the NCBI. The use of all three of these sites to obtain annotated information on the human genome sequence is the primary subject of this guide.

5. Subsequent to the generation and publication of the draft human genome sequence, work has continued towards finishing the sequencing. The final stage initially targeted draft-quality BAC clones. For each of these clones, enough additional shotgun sequence data are obtained to bring the coverage to eight- to tenfold, a stage referred to as ‘fully topped-up’. The data from each fully topped-up BAC are reassembled, typically resulting in a smaller number of contigs (often just a single contig) than at the draft level. The new assembly is again submitted to the HTGS division as an update of the existing BAC clone, now identified with the keyword ‘htgs_fulltop’. The accession number of the clone stays the same, and the version number increases by one (AC108475.2, for example, becoming AC108475.3).

6. At this stage, there are, even for clones comprising a single contig, typically some regions that are of insufficient quality for the clone to be considered finished. If this is the case, the fully topped-up sequence is analyzed by a sequence finisher (an actual person) who collects, in a directed manner, the additional data that are needed to close the few remaining gaps and to bring any regions of low quality up to the finished sequence standard. While the clone is worked on by the finisher, the HTGS entry in GenBank is identified by the keyword ‘htgs_activefin’.
Once work on the clone has been completed, the keyword of the HTG record is changed to ‘htgs_phase3’, the version number is once again increased, and the record is moved from the HTGS division to the primate division of DDBJ/EMBL/GenBank. In the context of a BLAST search at NCBI, these finished BAC sequences would now be available in the nr (“non-redundant”) database.

7. The finished clone sequences are then put together into a finished chromosome sequence. As with the initial draft assemblies, this process involves a number of steps that use map-based and sequence-based information in calculating the maps. The final assembly process involves identifying overlaps between the clones and then anchoring the finished sequence contigs to the map of the genome; details of the process can be found on the NCBI web site (http://www.ncbi.nlm.nih.gov/genome/guide/build.html).

Initially, both the UCSC and NCBI groups generated complete assemblies of the human genome, albeit using different approaches. As noted on the UCSC web site, the NCBI assembly tended to have slightly better local order and orientation, whereas the UCSC assembly tended to track the chromosome-level maps somewhat better. Rather than having different assemblies based on the same data, IHGSC, UCSC, Ensembl and NCBI decided that it would be more productive (and obviously less confusing) to focus their efforts on a single, definitive assembly. To this end, and by agreement, the NCBI assembly will be taken as the reference human genome sequence. It is this NCBI assembly that is displayed at the three major portals covered in this guide.

Box: NCBI reference sequences

The data release and distribution practices adopted by the HGP participants have led not only to very early, pre-publication access to this treasure trove of information, but also to a potentially confusing variety of formats and sources for the sequence data. To address this and other issues, the NCBI initiated the RefSeq project (http://www.ncbi.nlm.nih.gov/locuslink/refseq.html). The goal of the RefSeq effort is to provide a single reference sequence for each molecule of the central dogma: DNA, the mRNA transcript, and the protein. The RefSeq project helps to simplify the redundant information in GenBank by providing, for example, a single reference for human glyceraldehyde-3-phosphate dehydrogenase mRNA and protein, out of the 14 or so full-length sequences in GenBank. Each alternatively spliced transcript is represented by its own reference mRNA and protein. The RefSeq project also includes sequences of complete genomes and whole chromosomes, and genomic sequence contigs. The human genomic contigs that NCBI assembles, which form the basis of the presentations in the different genome browsers, are part of the RefSeq project.

Most RefSeq entries are considered provisional and are derived by an automated process from existing GenBank records. Reviewed RefSeq entries are manually curated and list additional publications, gene function summaries and sometimes sequence corrections or extensions. Reference sequences are available through NCBI resources, including Entrez, BLAST and LocusLink. They can be easily recognized by the distinctive style of their accession numbers. NM_###### is used to designate mRNAs, NP_###### to designate proteins and NT_###### to designate genomic contigs. The NCBI and UCSC use alignments of the mRNA RefSeqs with the genome to annotate the positions of known genes. Ensembl aligns mRNA RefSeqs to the genome.

The NCBI also provides model mRNA RefSeqs produced from genome annotation. These are derived by aligning the NM_ mRNAs and other GenBank mRNAs to the assembled genome and then extracting the genomic sequence corresponding to the transcripts. The resulting model mRNA and model protein sequences have accession numbers of the form XM_###### and XP_######. As the XM_ and XP_ records are derived from genomic sequence, they may differ from the original NM_ or GenBank mRNAs because of real sequence polymorphisms, errors in the genomic or mRNA sequences or problems in the mRNA/genomic sequence alignment. A complete list of types of RefSeqs, along with details on how they are produced, is available from http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html.

Annotating the assemblies

Once the assemblies have been constructed, the DNA sequence undergoes a process known as annotation, in which useful sequence features and other relevant experimental data are coupled to the assembly. The most obvious annotation is that of known genes. In the case of NCBI, known genes are identified by simply aligning Reference Sequence (RefSeq) mRNAs (see box), GenBank mRNAs, or both to the assembly. If the RefSeq or GenBank mRNA aligns to more than one location, the best alignment is selected. If, however, the alignments are of the same quality, both are marked on the contig, subject to certain rules (specifically, the transcript alignment must be at least 95% identical, with the aligned region covering 50% or more of the length, or at least 1,000 bases). Transcript models are used to refine the alignments. Ensembl identifies ‘best in genome’ positions for known genes by performing alignments between all known human proteins in the SPTREMBL database (ref. 6) and the assembly using a fast protein-to-DNA sequence matcher (ref. 7). UCSC predicts the location of known genes and human mRNAs by aligning RefSeq and other GenBank mRNAs to the genome using the BLAST-like alignment tool (BLAT) program (ref. 8). In addition to identifying and placing known genes onto the assemblies, all of the major genome browser sites provide ab initio gene predictions, using a variety of prediction programs and approaches.
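The NCBI placement rule described above (keep the best alignment, and when several alignments tie, keep each one that is at least 95% identical and covers 50% or more of the transcript or at least 1,000 bases) can be illustrated with a minimal Python sketch. This is not NCBI's implementation: the `Alignment` record, the function names and the use of the identical-base count as the measure of "quality" are all assumptions made for illustration, because the text does not say how alignment quality is scored.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one mRNA-to-contig alignment."""
    contig: str
    identical_bases: int    # aligned transcript bases that match the genome
    aligned_bases: int      # transcript bases inside the aligned region
    transcript_length: int  # full length of the mRNA


def passes_rules(a: Alignment) -> bool:
    """Thresholds quoted in the text: >=95% identity, and the aligned
    region covers >=50% of the transcript or spans >=1,000 bases."""
    identity = a.identical_bases / a.aligned_bases
    coverage = a.aligned_bases / a.transcript_length
    return identity >= 0.95 and (coverage >= 0.50 or a.aligned_bases >= 1000)


def quality(a: Alignment) -> int:
    """Assumed quality score (not from the source): identical bases."""
    return a.identical_bases


def place_transcript(alignments: List[Alignment]) -> List[Alignment]:
    """Return the alignment(s) that would be marked on the contigs."""
    if not alignments:
        return []
    best = max(quality(a) for a in alignments)
    top = [a for a in alignments if quality(a) == best]
    if len(top) == 1:
        return top  # a single best alignment is simply selected
    # Equal-quality alignments are all kept, subject to the rules.
    return [a for a in top if passes_rules(a)]


if __name__ == "__main__":
    hits = [
        Alignment("contig_A", 1980, 2000, 2000),
        Alignment("contig_B", 1455, 1500, 2000),
        Alignment("contig_C", 1980, 2000, 2000),
    ]
    for hit in place_transcript(hits):
        print(hit.contig)
```

In this sketch the two tied alignments on `contig_A` and `contig_C` are both retained, whereas the weaker alignment on `contig_B` is dropped. A different quality score would change which alignments count as tied, so the scoring function is the part most in need of replacement before any real use.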
Genome annotation goes well beyond noting where known and predicted genes are. Features found in the Ensembl, NCBI and UCSC assemblies include, for example, the location and placement of single-nucleotide polymorphisms, sequence-tagged sites, expressed sequence tags, repetitive elements and clones. Full details on the types of annotation available and the methods underlying sequence annotation for each of these different types of sequence feature can be found by accessing the URLs listed under Genome Annotation in the Web Resources section of this guide. At UCSC, many of the annotations are provided by outside groups, and there may be a significant delay between the release of the genome assembly and the annotation of certain features. Furthermore, some tracks are generated for only a limited number of assemblies. For an in-depth discussion of genome annotation, the reader is referred to an excellent review by Stein (ref. 9) and the references cited therein. This review, along with the Commentary in this guide, also provides cautions on the possible overinterpretation of genome annotation data.

The data, and sometimes the tools, change every day

The steps outlined in the previous section should emphasize that the state of the human genome sequence will continue to be in flux, as it will be updated daily until it has actually been declared ‘finished’. (Finished sequence is properly defined as the “complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps” (ref. 2). A more practical definition is that of “essentially finished sequence,” meaning the complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps, except those that cannot be closed by any current method.) The reader should be mindful of this, not just when reading this guide, but also when referring back to it over time.
Similarly, the tools used to search, visualize and analyze these sequence data also undergo constant evolution, capitalizing on new knowledge and new technology in increasing the usefulness of these data to the user.

Over the next year, sequence producers will continue to add finished sequence to the nucleotide sequence databases, and the NCBI will continue to update the human sequence assembly until its ultimate completion. The human genome sequence will, however, continue to improve even after April 2003, as new cloning, mapping and sequencing technologies lead to the closure of the few gaps that will remain in the euchromatic regions. It is hoped that such technological advances will also allow for the sequencing of heterochromatic regions, regions that cannot be cloned or sequenced using currently available methods.

The sequence-based and functional annotations presented at the three major genome portals will certainly continue to evolve long after April 2003. Computational annotation is a highly active area of research, yielding better methods for identifying coding regions, noncoding transcribed regions and noncoding, non-transcribed functional elements contained within the human sequence.

Accessing human genome sequence data

Although each of the three portals through which users access genome data has its own distinctive features, coordination among the three ensures that the most recent version and annotations of the human genome sequence are available.

Ensembl (http://www.ensembl.org) is the product of a collaborative effort between the Wellcome Trust Sanger Institute and EMBL’s European Bioinformatics Institute and provides a bioinformatics framework to organize biology around the sequences of large genomes (ref. 7). It contains comprehensive human genome annotation through ab initio gene prediction, as well as information on putative gene function and expression.
The web site provides numerous different views of the data, which can be either map-, gene- or protein-centric. Ensembl is actively building comparative genome sequence views, and presents data from human, mouse, mosquito and zebrafish. In addition, numerous sequence-based search tools are available, and the Ensembl system itself can be downloaded for use with individual sequencing projects.

The UCSC Genome Browser (http://genome.ucsc.edu) was originally developed by a relatively small academic research group that was responsible for the first human genome assemblies. The genome can be viewed at any scale and is based on the intuitive idea of overlaying ‘tracks’ onto the human genome sequence; these annotation tracks include, for example, known genes, predicted genes and possible patterns of alternative splicing. There is also an emphasis on comparative genomics, with mouse genomic alignments being available. The browser also provides access to an interactive version of the BLAT algorithm (ref. 8), which UCSC uses for RNA and comparative genomic alignments.

Given its Congressional mandate to store and analyze biological data and to facilitate the use of databases by the research community, the NCBI (http://www.ncbi.nlm.nih.gov) serves as a central hub for genome-related resources. NCBI maintains GenBank, which stores sequence data, including that generated by the HGP and other systematic sequencing projects. NCBI’s Map Viewer provides a tool through which information such as experimentally verified genes, predicted genes, genomic markers, physical maps, genetic maps and sequence variation data can be visualized. The Map Viewer is linked to other NCBI tools, for example Entrez, the integrated information retrieval system that provides access to numerous component databases.

Although we have chosen to illustrate each example using resources available at a single site, almost all the questions in this guide can be answered using any of the three browsers.
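The idea of overlaying ‘tracks’ onto the genome sequence, described above for the UCSC Genome Browser, amounts to storing each kind of annotation as a list of intervals on a shared coordinate system and asking which intervals overlap the window being viewed. The Python sketch below illustrates only that idea; the track names, feature names and coordinates are invented for the example, and nothing here reflects how any of the three browsers actually stores or queries its data.

```python
from typing import Dict, List, NamedTuple


class Feature(NamedTuple):
    """One annotated interval; coordinates are 1-based and inclusive."""
    name: str
    start: int
    end: int


# A set of tracks maps a track name to the features on one chromosome.
Tracks = Dict[str, List[Feature]]


def features_in_window(tracks: Tracks, start: int, end: int) -> Tracks:
    """Return, for every track, the features overlapping [start, end]."""
    return {
        track: [f for f in feats if f.start <= end and f.end >= start]
        for track, feats in tracks.items()
    }


if __name__ == "__main__":
    # Hypothetical annotations on a single chromosome.
    tracks = {
        "known_genes": [Feature("geneA", 100, 500), Feature("geneB", 900, 1200)],
        "snps": [Feature("snp1", 450, 450), Feature("snp2", 2000, 2000)],
    }
    visible = features_in_window(tracks, 400, 950)
    for track, feats in visible.items():
        print(track, [f.name for f in feats])
```

Because every track shares the same coordinates, zooming or scrolling only changes the window passed to the query, and adding a new kind of annotation only adds a new key. A linear scan is used here for clarity; a real browser serving a whole genome would need an indexed lookup.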
The informational sidebars that follow some of the questions provide pointers on how to format the search at other sites. Furthermore, the three sites link to each other wherever possible.

Examples presented in this Guide rely on the data and genome browser interfaces that were available in June 2002. As new versions of the genome assembly and viewing tools will come online every few months, the specifics of some of the examples may change over time. Regardless, the basic strategies behind answering the questions in the examples will remain the same. This underscores the importance of readers working through the examples at their own computers so that they may understand and be able to navigate these public databases. Readers are encouraged to explore the alternative methods for answering the questions.

Browser problems?

In following the question-and-answer portion of this guide, some readers may find that their web browsers are not able to render the web pages properly. If this occurs, do one or more of the following:

1. Install the most recent version of either Netscape Navigator or Internet Explorer.
2. Increase the amount of memory available to the web browser.
3. Try a different web browser.

In general, Macintosh users who seek to gain access to these three genome portals will see better performance with Internet Explorer.

[...]

... search box, and hit Find. At the top of the resulting page (Fig. 3.7), two red tick marks on the chromosome cartoon indicate that the markers map close to each other on chromosome 10. The search results at the bottom of the page show the alternative names for the two markers (AFMA232YH9 and AFMA230VA9) as well as the maps on which they have been placed. To view both markers at the same time, click on the...
... highlighting all features that have been mapped to this region of the human genome. The navigator buttons between the Overview and the Detailed View move the display to the left and right and zoom in and out. The features to be displayed can be changed by selecting the Features pull-down menu and then checking which features to view. The Features shown in Fig. 1.14 are the defaults. The DNA (contigs) map separates...

... the Genome Browser, select the appropriate organism from the pull-down menu at the top of the blue sidebar (Human, in this case) and then click the link labeled Browser. On the resulting page, select the version of the human assembly to view. The genome browser from August 2001 is based on an assembly of the human genome done by UCSC using sequence data available on that date. The Dec 2001...

... database nucleotide sequences are shown in the Human mRNAs from GenBank, spliced EST, UniGene and Nonhuman mRNAs from GenBank tracks. Translated alignments of mouse and Tetraodon genomic sequence are in the mouse and fish BLAT tracks. Tracks displaying single-nucleotide polymorphisms (SNPs), repetitive elements and microarray data are shown at the bottom. Additional details about each track are available...

... interest are called by their alternate names (AFMA232YH9 and AFMA230VA9 in this view) and are at the top and bottom of the interval, respectively (Fig. 3.2, arrows). The full list of known genes in this display is shown in the Known Genes track (Fig. 3.1). These protein-coding genes are taken from the RefSeq mRNA sequences compiled at the NCBI (ref. 10) and aligned to the genome assembly using the BLAT program (ref. 8). To export...
... so that the reader can gain an appreciation of the subtle differences in information presented at each of these sites.

National Center for Biotechnology Information Map Viewer

The NCBI Human Map Viewer can be accessed from the NCBI’s home page, at http://www.ncbi.nlm.nih.gov. Follow the hyperlink in the right-hand column labeled Human map viewer to go to the Map Viewer home page. The notation at the top...

... and Contig maps by selecting them from the Available Maps box and selecting ADD>>. Make the STS map the master by highlighting it, then selecting Make Master/Move to Bottom. To limit the view such that only the STSs between D10S1676 and D10S1675 are shown, type the marker names in the Region Shown boxes. Hit Apply to see the aligned maps. In some cases, it may be useful to select a page size larger than...

... gene, the Genes_Seq map shows all the exons that have been mapped to the genome. Exons for individual known mRNAs are shown on the RNA (Transcript) map. Unless a gene is alternatively spliced, the Genes_Seq and RNA maps will be the same. The GScan (GenomeScan) map (ref. 22) shows the NCBI’s gene predictions. Any of these genes, known or predicted, are candidates for the disease gene. The NCBI’s assembled contigs, also...

... If the C (for coding) appears in orange, part or all of the marker position overlaps with a coding region. The next column, labeled Het, indicates the average heterozygosity observed for this marker, on a scale of 0–100%. A reading of zero means that no information is available for that particular marker, whereas the pink bars show a 95% confidence interval for the marker. The Validation column indicates...
... NCBI Map Viewer, Ensembl map and UCSC genome assembly in the section labeled Integrated Maps. The sections labeled Variation Summary and Validation Summary (not shown) give the raw data on this particular SNP. Answering the final part of this question requires jumping from dbSNP to LocusLink (ref. 10). To do so, click on the ADAM2 link in the line marked LocusLink at the top of the page (Fig. 4.3). This brings the...

... elements and microarray data are shown at the bottom. Additional details about each track are available by selecting the track name in the Track Controls at the bottom. To view the genomic...

... conversant with the seeming myriad of databases and analytical tools that have arisen over the last several years. To assist investigators and students in navigating this rapidly expanding information...

Date posted: 10/04/2014, 10:58
