Preface

Volume 183 of Methods in Enzymology dealing with the computer analysis of protein and nucleic acid sequences has proved very popular with molecular biologists and biochemists. Computers and computer programs evolve rapidly, however, and can become outmoded very quickly. As a result, there was pressure to issue an updated volume that covers much the same general subject areas.

Like the earlier volume, this one is divided into several sections, the first of which deals with databases and some aspects related to their holdings. Also, there have been some relocations of major databases. GenBank is now centered at the National Center for Biotechnology Information (NCBI) at the National Library of Medicine in Bethesda, Maryland, and the EMBL Database has relocated to the European Bioinformatics Institute (EBI) at a site just outside Cambridge, England. More than ever, of course, geographic location is becoming moot, thanks to the World Wide Web (WWW) and extended hyperlink access.

There is some new vocabulary in this volume that did not appear in Volume 183. The use of neural nets, for example, is discussed in several places, including chapters dealing with the classification of sequences, on the one hand, and with predicting secondary structure, on the other. The kinds of databases are also changing. For instance, it has been found that the fragmentary data known as Expressed Sequence Tags (EST) are extremely useful.

Searching newly determined sequences remains the first order of business. More often than not, a simple search of a new sequence provides both functional and structural information. New pattern searching programs have greatly extended the power of this approach so that very distant relatives of well-characterized families can be identified.

The multiple alignment of protein sequences continues to have a prominent role in protein characterization.
Whether the sequences are of the "same" protein from different organisms or are paralogs that have resulted from gene duplications, the alignment problems are the same. Interestingly, the most popular algorithms have not changed much, but the amino acid substitution tables that support them have. This is chiefly the result of there being so much comparative data in the current databases that empirical measures of relationships can be obtained by simply tallying the occurrences of the amino acids in blocks of obviously aligned sequences. As discussed in Chapter [6] by Henikoff and Henikoff, these BLOSUM tables have been remarkably effective.

Among their many uses, multiple alignments are used to construct profiles for more sensitive searching than is possible by single-searching. They are also used in the consensus mode for better predictions of secondary structure and for three-dimensional searches. And, of course, they are used in the construction of phylogenetic trees.

Recent advances have led to some changes in emphasis in some of the sections. Most of the chapters focus on protein sequences, even though the vast majority of those are determined by DNA sequencing. Accordingly, a section on RNA folding that appeared in the earlier volume has been dropped, and instead a number of chapters that relate to the secondary structure and three-dimensional aspects of proteins have been added.

Indeed, three-dimensional searching is following the course of sequence searching a decade ago. As a new protein structure is characterized, the first matter of general interest is to determine whether the fold resembles that of any that were reported previously. The remarkable thing is that not only are most new structures falling into well-defined families, but often there is no hint in advance on the basis of either structure or function.
The problems associated with structure searching are similar to those experienced by sequence searchers in the past: a burgeoning data bank (PDB is the Protein Data Bank), choices of search programs, and, finally, the problem of judgment on how significant a resemblance may be. Many of these problems are addressed in Section V of this volume.

As with Volume 183, authors were encouraged to make their programs or databases available to readers. Many chapters make reference to a WWW home page or an Internet email address from which additional information can be extracted.

Finally, I thank all the authors who wrote such interesting and informative chapters under a very strict and compressed timetable. Academic Press, and especially our editor, Shirley Light, outdid themselves in getting the manuscripts through the publication process in record time. As in the case of the previous volume dealing with this topic, I must also acknowledge that the task could not have been accomplished without the help of my assistant, Karen Anderson. Her relentless but always gentle prodding of authors to produce manuscripts and her remarkable organizational skills that kept the courier traffic flowing in the right direction were indispensable.

RUSSELL F. DOOLITTLE

Contributors to Volume 266

Article numbers are in parentheses following the names of contributors. Affiliations listed are current.

STEPHEN F. ALTSCHUL (27), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
PATRICK ARGOS (8), European Molecular Biology Laboratory, 69117 Heidelberg, Germany
MARCELLA ATTIMONELLI (17), Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, 70125 Bari, Italy
WINONA C. BARKER (3, 4), National Biomedical Research Foundation, Washington, District of Columbia 20007
GEOFFREY J.
BARTON (29), Laboratory of Molecular Biophysics, University of Oxford, Oxford OX1 3QU, United Kingdom
PEER BORK (11), European Molecular Biology Laboratory, D-69012 Heidelberg, Germany; and Max-Delbrück-Center for Molecular Medicine, Department of Bioinformatics, D-13122 Berlin-Buch, Germany
JAMES U. BOWIE (35), Department of Chemistry and Biochemistry and DOE Laboratory of Structural Biology and Molecular Medicine, University of California, Los Angeles, Los Angeles, California 90095
STEVEN E. BRENNER (37), Medical Research Council Centre Laboratories of Molecular Biology, Cambridge CB2 2QH, United Kingdom
GRAHAM N. CAMERON (1), European Molecular Biology Laboratory Outstation, the European Bioinformatics Institute, Hinxton, Cambridge CB10 1RQ, United Kingdom
CYRUS CHOTHIA (37), Medical Research Council Centre Laboratories of Molecular Biology and Cambridge Centre for Protein Engineering, Cambridge CB2 2QH, United Kingdom
JEAN-MICHEL CLAVERIE (14), Laboratory of Structural and Genetic Information, E.P. 91 Centre National de la Recherche Scientifique, 13402 Marseille, France
MARC DELARUE (40), Immunologie Structurale Institut Pasteur, 75015 Paris, France
RUSSELL F. DOOLITTLE (21), Center for Molecular Genetics, University of California, San Diego, La Jolla, California 92093
DAVID EISENBERG (35), Department of Chemistry and Biochemistry and DOE Laboratory of Structural Biology and Molecular Medicine, University of California, Los Angeles, Los Angeles, California 90024
JONATHAN A.
EPSTEIN (10), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
THURE ETZOLD (8), European Molecular Biology Laboratory, 69117 Heidelberg, Germany
SCOTT FEDERHEN (33), National Center for Biotechnology Information, National Library of Science, National Institutes of Health, Bethesda, Maryland 20894
JOSEPH FELSENSTEIN (24), Department of Genetics, University of Washington, Seattle, Washington 98195
DA-FEI FENG (20, Center for Molecular Genetics, University of California, San Diego, La Jolla, California 92093
JEAN GARNIER (32), Unité de Bioinformatique Biotechnologies, INRA, 78352 Jouy-en-Josas, Paris, France
DAVID G. GEORGE (3, 4), National Biomedical Research Foundation, Washington, District of Columbia 20007
JEAN-FRANÇOIS GIBRAT (32), Unité de Bioinformatique Biotechnologies, INRA, 78352 Jouy-en-Josas, Paris, France
TOBY J. GIBSON (11, 22), European Molecular Biology Laboratory, 69012 Heidelberg, Germany
WARREN GISH (27), Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63108
MICHAEL GRIBSKOV (13), San Diego Supercomputer Center, La Jolla, California 92093
XUN GU (26), Human Genetics Center, Sph, University of Texas, Houston, Texas 77225
DANIEL GUSFIELD (28), Computer Science Department, University of California, Davis, Davis, California 95616
ROBERT A. L. HARPER (1), European Molecular Biology Laboratory Outstation, the European Bioinformatics Institute, Hinxton, Cambridge CB10 1RQ, United Kingdom
JOTUN HEIN (23), Department of Ecology and Genetics, Institute of Biological Sciences, Aarhus University, DK-8000 Aarhus, Denmark
JORJA G. HENIKOFF (6), Fred Hutchinson Cancer Research Center, Seattle, Washington 98104
STEVEN HENIKOFF (6), Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, Seattle, Washington 98104
DESMOND G.
HIGGINS (22), European Molecular Biology Laboratory Outstation, the European Bioinformatics Institute, Hinxton, Cambridge CB10 1RQ, United Kingdom
LIISA HOLM (39), European Molecular Biology Laboratory Outstation, the European Bioinformatics Institute, Hinxton, Cambridge CB10 1RQ, United Kingdom
TIMOTHY J. P. HUBBARD (37), Medical Research Council Centre Laboratories of Molecular Biology and Cambridge Centre for Protein Engineering, Cambridge CB2 2QH, United Kingdom
LOIS T. HUNT (3), National Biomedical Research Foundation, Washington, District of Columbia 20007
MARK S. JOHNSON (34), Molecular Modelling and Biocomputing Group, Turku Center for Biotechnology, University of Turku, FIN-20521 Turku, Finland
JONATHAN A. KANS (10), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
ANTHONY R. KERLAVAGE (2), The Institute for Genomic Research, Gaithersburg, Maryland 20850
EUGENE V. KOONIN (18), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
ERIC S. LANDER (19), Whitehead Institute for Biomedical Research and Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02142
WEN-HSIUNG LI (26), Human Genetics Center, Sph, Health Science Center, University of Texas, Houston, Texas 77225
CRAIG D. LIVINGSTONE (29), Genomics Support Group, SmithKline Beecham Pharmaceuticals, New Frontiers Science Park, Harlow, Essex CM19 5AW, United Kingdom
ANDREI LUPAS (30), Abteilung Molekulare Strukturbiologie, Max-Planck-Institut für Biochemie, D-82152 Martinsried, Germany
THOMAS L. MADDEN (9), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
ALEX C. W. MAY (34), Department of Crystallography, Birkbeck College, University of London, London WC1E 7HX, United Kingdom
RICHARD J.
MURAL (16), Biology Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831
ALEXEY G. MURZIN (37), Medical Research Council Centre Laboratories of Molecular Biology and Cambridge Centre for Protein Engineering, Cambridge CB2 2QH, United Kingdom
HITOMI OHKAWA (10), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
CHRISTINE A. ORENGO (36), Department of Biochemistry and Molecular Biology, University College, London WC1E 6BT, England
JOHN P. OVERINGTON (34), Computational Chemistry, Pfizer Central Research, Sandwich, Kent CT13 9NJ, United Kingdom
LASZLO PATTHY (12), Institute of Enzymology, Biological Research Center, Hungarian Academy of Sciences, Budapest H-1113, Hungary
WILLIAM R. PEARSON (15), Department of Biochemistry, University of Virginia, Charlottesville, Virginia 22908
GRAZIANO PESOLE (17), Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, 70125 Bari, Italy
FRIEDHELM PFEIFFER (4), Martinsried Institute for Protein Sequences, Max Planck Institute for Biochemistry, Martinsried 82152, Germany
OLIVIER POCH (40), UPR 9002 du Centre National de la Recherche Scientifique, I.B.M.C. du Centre National de la Recherche Scientifique, 67084 Strasbourg, France
BARRY ROBSON (32), Dirac Foundation, Bioinformatics Laboratory, Royal Veterinary College, University of London, London NW1 0TU, United Kingdom
MICHAEL A. RODIONOV (34), Molecular Modelling and Biocomputing Group, Turku Centre for Biotechnology, University of Turku, FIN-20521 Turku, Finland; and Institute of Bioorganic Chemistry, Belarus Academy of Sciences, Minsk-141, Republic of Belarus 220141
BURKHARD ROST (31), Protein Design Group, European Molecular Biology Laboratory, 69012 Heidelberg, Germany
KENNETH E.
RUDD (18), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
CECILIA SACCONE (17), Dipartimento di Biochimica e Biologia Molecolare, Università di Bari and Centro di Studio sui Mitocondri e Metabolismo Energetico, CNR, 70125 Bari, Italy
NARUYA SAITOU (25), Laboratory of Evolutionary Genetics, National Institute of Genetics, Mishima-shi, Shizuoka-ken, 411, Japan
CHRIS SANDER (39), European Molecular Biology Laboratory Outstation, the European Bioinformatics Institute, Hinxton, Cambridge CB10 1RQ, United Kingdom
GREGORY D. SCHULER (10), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
BENNY SHOMER (1), European Molecular Biology Laboratory Outstation, the European Bioinformatics Institute, Hinxton, Cambridge CB10 1RQ, United Kingdom
RODGER STADEN (7), Medical Research Council Centre Laboratories of Molecular Biology, Cambridge CB2 2QH, United Kingdom
P. STELLING (28), Computer Science Department, University of California, Davis, Davis, California 95616
JENS STØVLBÆK (23), Department of Ecology and Genetics, Institute of Biological Sciences, Aarhus University, DK-8000 Aarhus, Denmark
MARK BASIL SWINDELLS (38), Department of Molecular Design, Institute for Drug Discovery Research, Yamanouchi Pharmaceutical Company, Ltd., Tsukuba 305, Japan
ROMAN L. TATUSOV (9, 18), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
WILLIAM R. TAYLOR (20, 36), Division of Mathematical Biology, National Institute for Medical Research, London NW7 1AA, United Kingdom
JULIE D. THOMPSON (22), European Molecular Biology Laboratory, 69012 Heidelberg, Germany
EDWARD C.
UBERBACHER (16), Computer Sciences and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831
ANATOLY ULYANOV (8), European Molecular Biology Laboratory, 69117 Heidelberg, Germany
STELLA VERETNIK (13), San Diego Supercomputer Center, La Jolla, California 92093
OWEN WHITE (2), The Institute for Genomic Research, Gaithersburg, Maryland 20850
MATTHIAS WILMANNS (35), European Molecular Biology Laboratory, 69001 Heidelberg, Germany
JOHN C. WOOTTON (33), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
CATHY H. WU (5), Departments of Epidemiology and Biomathematics, University of Texas Health Center at Tyler, Tyler, Texas 75710
YING XU (16), Computer Sciences and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831
TAU-MU YI (19), Whitehead Institute for Biomedical Research and Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02142
JINGHUI ZHANG (9), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20892
KAM ZHANG (35), Division of Basic Sciences, Fred Hutchinson Cancer Center, Seattle, Washington 98104

[1] Information Services of the European Bioinformatics Institute

By BENNY SHOMER, ROBERT A. L. HARPER, and GRAHAM N. CAMERON

Introduction

The European Bioinformatics Institute (EBI) was established in September 1994 as a new outstation of the European Molecular Biology Laboratory (EMBL). The new outstation is located at Hinxton Hall, Cambridgeshire, United Kingdom.
Its main tasks are management of databases for molecular biology, bioinformatics services, and research and development in these fields.1

The move of the bioinformatics services from the EMBL headquarters in Heidelberg, Germany, to the EBI had various implications, including considerable expansion in the computer power and the number of staff. The computers are used for management of the principal databases and for providing network servers. The outstation provides excellent communications channels to the scientific and research community throughout Europe, and a specialized user support group ensures that all the services are properly maintained and functional.

Various new services, reviewed in this chapter, have been established thanks to the increase in both computational power and staff at the EBI. The inspiration for these new services has come from the research and development (R&D) teams now operating at the EBI, who work on managing sequence databases and on studying the interrelationships between various kinds of data. The main thrust of this work is to provide novel ways to access the data and to provide interfaces that are intuitive and easy to use for the EBI user community.

This chapter is divided into two sections. The first section describes the current and future databases and resources that are being developed in-house, and the second section describes the interfaces and network connections that EBI provides for the scientific community globally. A glossary at the end of this chapter gives a brief description of common terms.

1 D. B. Emmert, P. J. Stoehr, G. Stoesser, and G. N. Cameron, Nucleic Acids Res. 22, 3445 (1994).

METHODS IN ENZYMOLOGY, VOL. 266
Copyright © 1996 by Academic Press, Inc. All rights of reproduction in any form reserved.
EBI Databases and Resources

EMBL Nucleotide Sequence Database

The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences either collected from the scientific literature and patent applications or submitted directly from researchers and sequencing groups.2 The database is produced in a collaboration between the EMBL, GenBank (Washington DC, USA), and the DNA Data Bank of Japan (DDBJ, Mishima, Japan). Each entry that is created at any of these databases is automatically exchanged with the other two databases. This allows almost complete synchronization between the databases. Currently, the nucleotide sequence database is growing at a rate of 75% per year. The total number of entries and bases for different taxonomic divisions can be seen in Table I. With further technological advancements, the rate of growth of the databases will increase even more.

The nucleotide database is maintained in the relational database management system (RDBMS) ORACLE, running on a DEC Alpha VMS cluster. Each entry in the database is assigned an accession number, which is a permanent unique identifier. The entry is represented externally as an ASCII "flat file." The flat file (see Fig. 1) is composed of lines beginning with a two-character tag and followed by an associated text. The header information ("annotation") is followed by the sequence itself. The sequence entry ends with the unique identifier "//". Table II summarizes the meaning of the two-character line tags.

The EBI maintains a very high level of quality assurance of the sequence data in the EMBL database. Each new entry is carefully reviewed by a team of annotators, and, when necessary, direct communication with the submitting author is initiated to clarify ambiguities. Rapid data turnaround is essential; we guarantee to process well-formed submissions within 1 week, although in practice entries are created within 2-3 days after receipt.
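The flat-file layout described above (a two-character tag followed by text, annotation before sequence, and a closing "//") lends itself to a very small reader. The following is a minimal sketch rather than an EBI tool: the function name `parse_flat_file`, the sample entry, and the accession number X00001 are all hypothetical, and the sketch assumes that "XX" lines are spacers and that the untagged lines after the SQ line carry the sequence.

```python
from collections import defaultdict


def parse_flat_file(text):
    """Split EMBL-style flat-file text into a list of entries.

    Each entry is a dict mapping a two-character line tag to the list of
    text fields found on lines with that tag. Untagged lines following the
    SQ line are stored under the key "sequence" (an illustrative choice).
    """
    entries = []
    current = defaultdict(list)
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # entry terminator
            if current:
                entries.append(dict(current))
            current = defaultdict(list)
            in_sequence = False
            continue
        if not line.strip():
            continue
        if in_sequence:
            # Keep only the letters; spacing and position counts are dropped.
            current["sequence"].append("".join(ch for ch in line if ch.isalpha()))
            continue
        tag, body = line[:2], line[2:].strip()
        if tag == "XX":  # assumed to be a spacer line with no content
            continue
        current[tag].append(body)
        if tag == "SQ":
            in_sequence = True
    return entries


# Hypothetical entry, invented for illustration only.
SAMPLE = """\
ID   EXAMPLE1   standard; DNA; PRO; 12 BP.
XX
AC   X00001;
XX
DE   Hypothetical example entry.
XX
SQ   Sequence 12 BP; 3 A; 3 C; 3 G; 3 T; 0 other;
     aacgtcgatc gt                                                           12
//
"""

entries = parse_flat_file(SAMPLE)
entry = entries[0]
sequence = "".join(entry["sequence"])
print(entry["AC"], len(sequence))
```

Keeping every tag as a list of strings preserves multi-line fields (for example, several consecutive lines carrying the same tag) without committing to an interpretation of any one tag; a fuller reader would consult the tag definitions summarized in Table II.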
Development of the next generation of the sequence database is one of the R&D group activities. This group concentrates on various means of ensuring database integrity and developing state-of-the-art implementations of the data. The latest release (Release 45, December 1995) contains 622,566 entries, comprising 427,620,278 nucleotides.

2 C. M. Rice, R. Fuchs, D. G. Higgins, P. J. Stoehr, and G. N. Cameron, Nucleic Acids Res. 21, 2967 (1993).
3 A. Bairoch and B. Boeckmann, Nucleic Acids Res. 22, 3578 (1994).

TABLE I
NUMBERS OF ENTRIES AND BASES IN EMBL NUCLEOTIDE SEQUENCE DATABASE a ACCORDING TO TAXONOMIC DIVISION

Division b              Entries    Nucleotides
Bacteriophage              1066      1,493,417
EST                     123,526     39,332,522
Fungi                      8420     19,940,449
Invertebrates            13,831     27,610,495
Organelles                 8195      9,364,254
Other mammals              6272      6,976,315
Other vertebrates          7041      8,144,622
Plants                   11,105     14,145,431
Primates                 35,290     36,665,648
Prokaryotes              21,427     37,074,154
Rodents                  23,626     26,850,022
STS                        7232      2,288,477
Synthetic                  8597      4,295,284
Unclassified               6082      3,577,630
Viruses                  21,496     24,801,066
Subtotal                303,206    262,559,786
Other patents              6686      2,507,063
Total                   309,892    265,066,849

a Data are total numbers of entries and bases in the EMBL nucleotide database at the time of freezing the database for building Release 42.
b EST, Expressed sequence tags; STS, sequence tagged sites.

SWISS-PROT Protein Sequence Database

The SWISS-PROT Protein Sequence Database is a database of protein sequences.3 This database is produced and maintained in a collaboration between Dr. Amos Bairoch from the University of Geneva and the EBI. The data in SWISS-PROT arise from several sources; they are derived from translations of sequences from the EMBL Nucleotide Sequence Database, adapted from the Protein Identification Resource (PIR) collection, extracted from the literature, and directly submitted by researchers.
The database contains high-quality annotation, is nonredundant, and is cross-referenced to several other databases, notably the EMBL Nucleotide Sequence Database, PROSITE pattern database, and Protein Data Bank (PDB). The latest release (Release 32, November 1995) contained 49,340 sequence entries comprising 17,385,503 amino acids abstracted from 43,056 references.

As in the nucleotide sequence database, SWISS-PROT entries are represented externally as an ASCII flat file. The main difference between the two flat files is in the feature table, which in SWISS-PROT describes the ...

[FIG. 1. EMBL flat file entry for the Clostridium symbiosum gdh gene encoding glutamate dehydrogenase (accession numbers Z11747 and S35943; 1636 BP; cross-referenced to SWISS-PROT P24295). Annotation lines tagged ID, AC, DT, DE, KW, OS, OC, RN, RP, RX, RA, RT, RL, DR, FH, and FT precede the SQ line and the sequence, and the entry ends with "//".]

...

[Table residue: a "Library group" column with the values Sequence, Protein structure, Sequence related, and Literature.]

... the EBI

Expressed sequence tags and sequence tagged sites

The two specialist sequence libraries dbEST (database of expressed sequence tags) and dbSTS (sequence tagged sites, Ref. 12), developed by the National Center for Biotechnology Information (NCBI), are mirrored by EBI. dbEST is a database of sequence and mapping data on expressed sequence tags, which are partial, "single pass" cDNA sequences, whereas ...

... using forms enables EBI to use it as an optimal mechanism for providing and collecting information. The EBI WWW home page can be logically divided into several major topics as follows.

MAIN DATABASES

EMBL Nucleotide Sequence Database Area. The home page introduces the user to the EMBL Nucleotide Sequence Database. It provides the user with the updated database release information, information for ...
... merging related protein structures and sequences
Alu sequence database
RNA databank of 5S rRNA and 5S rRNA gene sequences
Catalog of molecular biology programs
Sequence blocks database
Tables of codon frequencies, calculated for different organisms
Human CpG-island database
Tables of codon frequencies in a tabulated format
EST (expressed sequence tags) database
STS (sequence tagged sites) database
Escherichia ...

... UK). The objectives for the IMGT database are to contain information about immunoglobulins and T-cell receptors from all species, specifically, to contain all sequences and alignments, allele information, sequence tagged sites (STS) and polymorphism, genomic maps, molecular modeling information, and information about the relations with diseases and hybridomas. Software will be developed for facilitating the annotation process, for classification of sequences, and for molecular modeling. The aims include developing a user-friendly graphical interface, stabilizing keywords used in immunogenetics, and incorporating results of sequence alignments and translation of sequences to amino acid sequences. The database will provide a detailed morphological and functional analysis of immunoglobulins and ...

Submission Systems

Submission of Sequence Data

There are three main ways to submit sequence data to the EBI sequence databases. The first two refer to the nucleotide sequence and SWISS-PROT databases, while the third one (WWW submissions) refers only to nucleotide sequences.

MANUAL EDITING OF ELECTRONIC SUBMISSION FORM

A text (ASCII) submission form can be filled using any ...
... of entering information already available. Finally, the user may freeze a submission session for a very long time. The system breaks the complicated task of sequence submission into a set of interactive forms which check the user's input and present the following forms according to the input. The system is compatible with the various WWW browsers currently available, on all platforms. An effort was made ...

... use the service. The user can ask for a sequence, either by accession number or entry name. In response, the sequence will be sent to the user's E-mail address in EMBL flat file format. Computer programs and other binary files will be sent in a UUencoded format (see Glossary), which calls for extra steps on the part of the user.

TABLE V
ADDRESSES FOR COMMUNICATING WITH EBI

Mail address: ... by E-mail

Protein sequence homology searches: BLITZ database searches

The WWW server enables submission of sequences for a BLITZ search. BLITZ uses the MPsearch program of Shane Sturrock and John Collins.48 MPsearch allows sensitive and extremely fast comparisons of protein sequences against the SWISS-PROT protein sequence database using the Smith ...