Protein Family Databases

Steven Henikoff, Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
Jorja G Henikoff, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA

The rapid expansion of biological sequence databanks and the use of protein sequence homologies to draw functional inferences have led to a proliferation of databases aimed at organizing protein homology information. Databases differ in how families are defined and in how family information is depicted.

Introduction

Improvements in the efficiency of large-scale DNA sequencing are resulting in rapid increases in the number of protein sequences that lack genetic or biochemical annotation. One traditional way to deduce the function of a protein of interest is to compare it with other sequences of known function to find a possible homologue. Methods for homology detection formerly relied on pairwise comparisons of protein sequences. However, the accumulation of sequence data has motivated and facilitated the creation of families of related proteins. Whereas the number of protein sequences increases at an exponential rate, the number of new protein families has begun to level off. As these families become populated with more and more sequences, the utility of the classification increases, allowing for better detection of family members, for identification of conserved residues, for distinguishing orthologues (which are related by descent) from paralogues (which derive from gene duplication) and for structure modelling. The increasing utility of protein family databases has led to their proliferation: the first efforts to create a database of protein families began in 1988 (Bairoch, 1992), and the Nucleic Acids Research database issue for 2000 lists more than a dozen. This article surveys these databases and describes their use in inferring protein function.

What Is a Protein Family?
Each database uses a somewhat different operational definition of a family. These differences reflect the difficulty in defining just what constitutes a protein family. Some of this ambiguity in demarcating relationships among proteins that share sequence similarity is reflected in the use of the imprecise but useful terms ‘superfamily’ and ‘subfamily’. Whereas orthologous enzymes in different organisms are clearly members of the same family or subfamily, distinctions between groupings of paralogues, especially between those that are not detectably similar in pairwise comparisons, suggest a broader superfamily relationship. Furthermore, in modular and multidomain proteins, relationships are typically limited to only parts of the protein’s sequence. As detection of relationships improves with more samples and better methodology, families and superfamilies can become more populated. At present, structural relationships provide the highest level of classification, and structure-based databases classify proteins with similar ‘folds’. These classifications reveal that whenever structures are known for two proteins that are considered members of the same family or superfamily, the structures are similar, whereas the converse is often not true. Therefore, significant sequence similarity can be used to infer common structure (and common ancestry); however, similar structures that lack detectable sequence similarity may have resulted from either divergence beyond detection or convergence to a similar fold. Because divergence from a common ancestor can occur with retention of function, family, subfamily and superfamily relationships are valuable for drawing functional inferences, whereas similarities in fold but not sequence are less likely to reveal common function. Two excellent protein structure databases, SCOP (Lo Conte et al., 2000) and CATH (Pearl et al., 2000), provide hierarchical structural classifications of proteins above the superfamily level.
Problems in defining what a protein family is make it difficult to estimate how many families exist. It has been estimated that there are about 1000 protein folds (Chothia, 1992), and so there must be more than 1000 families. Currently, the InterPro database lists almost 3000 families classified by manual curation; however, databases that use automated methods to cluster proteins into families may list an order of magnitude more (e.g. Corpet et al., 2000), with ‘singleton’ sequences potentially representing tens of thousands of protein families yet to be catalogued. It may be that the large number of potential families reflects the greater divergence of proteins in the very diverse bacterial and archaeal genomes, where sequence divergence over eons has obliterated sequence similarity. Alternatively, these unclassified proteins may constitute distinct families that have not yet entered curated databases. Despite this complication, it is evident that at least half of all proteins in most eukaryotic organisms have been classified into families, and so for most organisms of interest to experimentalists working on model organisms and to human biologists, protein family databases constitute fairly comprehensive resources.

ENCYCLOPEDIA OF LIFE SCIENCES / © 2001 Macmillan Publishers Ltd, Nature Publishing Group / www.els.net

Classes of Protein Family Databases

Protein family databases obtain sequences from one of the large protein sequence databases, most commonly SWISS-PROT with TrEMBL (Bairoch and Apweiler, 2000) but also PIR (Barker et al., 2000). They then apply a procedure, either manual or automatic, to group the sequences into families.
Each family is represented in one or more ways to facilitate both inspection by humans and comparison by computer programs. The most common representation is a multiple alignment of the family’s sequences, either with insertion and deletion (gap) characters or without. Sometimes the multiple alignment is summarized as a pattern or consensus sequence. For comparison of a user’s query sequence with the protein family database, the multiple alignment is commonly converted to a position-specific scoring matrix (PSSM), also called a profile or hidden Markov model (HMM). Patterns can be compared directly with a query sequence (Bairoch and Apweiler, 2000), and consensus sequences with the use of a general-purpose amino acid substitution scoring matrix (Henikoff and Henikoff, 2000).

In addition to primary protein family databases, there are databases derived from them. Some of these derivative databases use the primary databases’ family definitions, but represent them differently for display and comparison purposes, for example Blocks+ (JG Henikoff et al., 2000) and ProClass (Huang et al., 2000). Others combine and cross-reference the primary databases without providing different representations, for example InterPro (Apweiler et al., 2000) and MetaFam (Silverstein et al., 2000). The InterPro project is cross-referencing most of the European protein family databases and provides a single entry point into them.

Ideally, in a protein family database each family’s function would be fully documented with appropriate references. In practice, only two of the curated databases (PROSITE and PRINTS-S) provide this level of information because it requires laborious effort. Fortunately, there is a high level of cooperation and cross-referencing between the protein family databases, and links are usually provided to one of the curated collections if possible.
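The conversion of an ungapped multiple alignment into a PSSM, and its comparison with a query sequence, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the function names `build_pssm` and `best_hit`, the uniform background frequency, the flat pseudocount and the toy three-column alignment are not taken from any of the databases described. Production methods typically add sequence weighting and substitution-matrix-based pseudocounts.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Turn an ungapped alignment into per-column log-odds scores.

    Assumptions (illustrative only): uniform background frequencies
    and a flat pseudocount added to every residue in every column.
    """
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(width):
        counts = Counter(seq[col] for seq in alignment)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def best_hit(pssm, query):
    """Slide the PSSM along the query and return (best score, start offset)."""
    width = len(pssm)
    best_score, best_start = float("-inf"), -1
    for start in range(len(query) - width + 1):
        # Residues outside the 20 standard amino acids score 0.0.
        score = sum(pssm[i].get(query[start + i], 0.0) for i in range(width))
        if score > best_score:
            best_score, best_start = score, start
    return best_score, best_start


# Toy block of four hypothetical sequences, three columns wide.
toy_block = ["GKS", "GKT", "GKS", "AKT"]
score, start = best_hit(build_pssm(toy_block), "MAAGKSLL")
```

In this toy case the highest-scoring window of the query is the one that matches the conserved columns of the block. A real classification server would also need a score threshold calibrated against unrelated sequences, which this sketch omits.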
All of the protein family databases described here provide access via the World Wide Web (WWW), and all allow entry into the database by some sort of keyword search. Except as noted below, family databases also provide a searching tool to compare a user’s sequence with the database for classification. When a sequence is classified in this way, users have immediate access to what is known about the family and can apply it to their own sequence. A few sites offer additional services such as graphical displays, phylogenetic trees and structural information.

Curated Protein Family Databases

Protein families in curated databases are delineated by a human overseer, usually on the basis of personal knowledge or from the published literature. Usually a protofamily is aligned manually or semiautomatically and then sequences are added to the family from the protein sequence databases on the basis of sequence similarity followed by careful validation. Curated databases have the best documentation, but are the most difficult to maintain.

PROSITE [http://www.expasy.ch/prosite/]

The PROSITE database (Hofmann et al., 1999) is the original and best-documented protein family database; unfortunately, few new families have been added to it for several years. Protein sequences are obtained from SWISS-PROT and grouped based on documented common function. Each family is represented by a simple pattern and sequences can belong to more than one family. A few families are also represented by profiles. The WWW site provides keyword searching and classification of protein sequences by pattern searching. PROSITE is part of the InterPro project.

PRINTS-S [http://bioinf.man.ac.uk/dbbrowser/PRINTS/]

PRINTS (Attwood et al., 2000) obtains protein sequences from SWISS-PROT and TrEMBL. Related sequences are aligned manually, conserved motifs are excised and searched iteratively through the databases to add sequences.
The results are manually validated after each iteration. Each family is represented by a fingerprint, which is a series of ungapped multiple alignments corresponding to the conserved motifs. PRINTS makes a particular effort to provide subfamily-specific entries. The documentation is extensive and the collection is updated regularly. The WWW site is well maintained and provides keyword searching and classification of protein sequences by PSSM searching. PRINTS is part of the InterPro project.

SMART [http://smart.embl-heidelberg.de]

SMART (Schultz et al., 2000) is a carefully curated database of signalling, extracellular and chromatin-associated protein domains, which are represented as gapped multiple alignments. A manual alignment based on known tertiary structures is converted to an HMM for searching against the protein databases to find more sequences, which are validated before being added to the family. The WWW site provides keyword searching and classification of protein sequences by PSSM (HMM) searching.

Pfam-A [http://pfam.wustl.edu]

Pfam (Bateman et al., 2000) uses structural criteria, in addition to sequence similarity and shared function, in defining a protein family. For example, eukaryotic proteins containing the histone structural fold, including all four of the distinct proteins found in the nucleosome cores, are aligned, even though no significant sequence similarity is detected between histone H3 and the other three histone families. Pfam starts with manual or semiautomatic multiple alignments of sequences with similar sequence, function and/or structure obtained from the literature or from other protein family databases, such as SMART. Pfam-A constructs an HMM from a manually validated seed alignment and searches the SWISS-PROT and TrEMBL databases to collect more sequences.
The resulting full multiple alignment is apparently not manually validated and can be extensively gapped. Annotation is minimal but does include references for families taken from the literature. If families are defined on the basis of sequence similarity alone, they are often just documented as ‘domain of unknown function’. Pfam-A is closely coupled with the automatically clustered databases Pfam-B and ProDom, and is part of the InterPro project.

PIR Superfamilies [http://pir.georgetown.edu/pirwww/dbinfo/]

The Protein Information Resource (Barker et al., 2000) is a collection of tools, among which is a set of protein superfamilies. PIR is unique in providing an explicit definition of a superfamily as sequences with the same function in various organisms. These sequences are identified as being at least 50% identical and globally alignable. Unfortunately, this strict definition results in many related entries with only a few sequences each. The sequences are taken from the PIR protein sequence database and each superfamily is represented by a single typical sequence. Annotation does not extend beyond that for the individual sequences. The WWW site allows query by keyword and classification of protein sequences by a gapped BLAST (basic local alignment search tool; Altschul et al., 1997) search versus the representative sequences, but it is apparently not possible to view the full multiple alignment.

Clustered Protein Family Databases

Several efforts have been made since 1990 to reduce the effort required to maintain curated protein family databases by automatically clustering the protein sequence databases using sequence similarity. The general approach is to compute all possible pairwise comparisons, and then cluster them in some fashion, shifting the effort from humans to computers. This otherwise computationally very demanding process has benefited from the introduction of the rapid PSI-BLAST system (Altschul et al., 1997).
PSI-BLAST starts searching with a single sequence but then makes a multiple alignment PSSM from the hits after one iteration, then searches with it, and so forth. Problems with any clustering method include deciding how to delineate clusters (usually on the basis of some sort of cutoff score from the searches) and how to handle multidomain sequences. Users of these compendiums must be aware that they are largely unvalidated by humans and may not always correctly group sequences with the same function.

ProDom [http://www.toulouse.inra.fr/prodom.html]

ProDom (Corpet et al., 2000) is one of the earliest clustered protein family databases and continually updates its methods and services. Currently, it coordinates some of its larger entries with Pfam-A and uses PSI-BLAST to cluster the remaining sequences in SWISS-PROT and TrEMBL. While only large entries have been scrutinized manually, the consistency of all families is assessed by computing a series of numerical measurements. The resulting families are represented as consensus sequences and gapped multiple alignments. Phylogenetic trees are computed from these alignments and used to display a family in overlapped subfamilies based on distances in the tree. Documentation consists of links to the protein sequence databases and to other protein family databases (PROSITE, Pfam). The WWW site has graphic displays that link related families through their shared sequences. Keyword searches are supported, and protein sequences are classified by pairwise comparison with every sequence in each family. ProDom has recently been used by Pfam in place of Pfam-B, which is based on the older Domainer algorithm (Sonnhammer and Kahn, 1994).

DOMO [http://www.infobiogen.fr/  gracy/domo/home.htm]

DOMO is similar in concept to ProDom and Pfam-B, and uses SWISS-PROT as its source database.
DOMO uses a different algorithm, however, which is intended to avoid inclusion of overlapping subsets derived from the same family (Gracy and Argos, 1998). DOMO has not been updated since its initial release in 1998.

ProtoMap [http://www.protomap.cs.huji.ac.il]

ProtoMap (Yona et al., 2000) automatically clusters SWISS-PROT using three different pairwise alignment algorithms. It scores alignments with multiple substitution matrices, resulting in a hierarchical organization, stored as a graph where the nodes are sequences and edges are a measure of their similarity. Uniquely among protein family databases, the representation is a similarity-based dendrogram. New proteins are classified by adding them to the existing graph. Documentation consists of links to the sequence databases. The WWW site also supports keyword queries.

SYSTERS [http://www.dkfz-heidelberg.de/tbi/services/cluster/systersform]

SYSTERS (Krause et al., 2000) automatically clusters the SWISS-PROT and PIR sequence databases. It uses pairwise alignment algorithms with conservative cutoffs. Each family is represented by a gapped multiple alignment with links to PROSITE and Pfam for documentation. Keyword searching and protein classification by searching against the multiple alignments or consensus sequences are supported at the WWW site.

Clustered Databases from Genomes

As more complete genomes are sequenced, special databases are being created to facilitate their comparison. The two described here cluster whole genomes instead of protein sequence databases.

COG [http://www.ncbi.nlm.nih.gov/COG/]

COG (Tatusov et al., 2000) defines a family as an ancient conserved region (Green et al., 1993).
It clustered 21 complete genomes representing 17 phylogenetic lineages, and each cluster of orthologous groups (COG) of proteins consists of individual proteins or groups of paralogues from three or more lineages. Each COG is represented as a gapped multiple alignment with minimal documentation. Proteins can be classified at the WWW site by searching against the individual proteins, which are then linked to their COGs.

ProDom-CG [http://www.toulouse.inra.fr/prodomCG.html]

ProDom-CG (Corpet et al., 2000) applies the ProDom method to 20 complete genomes instead of to the protein sequence databases.

Derivative Protein Family Databases

Protein family databases derived from primary collections provide additional or alternative perspectives. Where they are derived from more than one database, they can facilitate comparison and validation of classifications.

Blocks+ [http://blocks.fhcrc.org]

Blocks+ (JG Henikoff et al., 2000) provides a nonredundant collection of protein families drawn from PROSITE, PRINTS, Pfam, ProDom and DOMO. Starting with the sequences documented in each PROSITE entry, Blocks+ runs the BlockMaker motif-finding algorithm to find conserved regions, which are represented as a series of ungapped multiple alignments called blocks. PRINTS entries are then converted to PSSMs and compared with the resulting blocks from PROSITE using the LAMA (Local Alignment of Multiple Alignments) algorithm (Pietrokovski, 1996). New PRINTS entries are then added to Blocks+. Next, blocks are made from Pfam-A entries and searched with LAMA against the PROSITE- and PRINTS-derived blocks, and new entries are added. Then ProDom and DOMO entries are processed successively. Note that Blocks+ uses only the sequences documented in each family of the primary protein family databases and not their representations, and thus provides an alternative representation and classification tool.
The WWW site provides access by keyword and comparison of protein or DNA sequence with the blocks represented as PSSMs. Phylogenetic tree, sequence logo and 3D structural displays are also provided. Documentation consists of links to source protein family databases.

ProClass [http://pir.georgetown.edu/gfserver/]

ProClass (Huang et al., 2000) entries cross-reference PROSITE and PIR superfamilies. ProClass computes a neural network for each entry and uses it to add more sequences from SWISS-PROT and PIR. Documentation consists of links to source protein family databases.

InterPro [http://www.ebi.ac.uk/interpro/]

Unlike Blocks+ and ProClass, which provide alternative representations, InterPro is a curated cross-reference of several protein family databases. Currently, PROSITE, PRINTS and Pfam-A are included. Each InterPro family entry includes documentation drawn from the participating databases. Classification of new protein sequences is not yet available.

MetaFam [http://metafam.ahc.umn.edu]

Whereas InterPro cross-references protein family databases using a manual procedure based on documentation, MetaFam cross-references them automatically based on shared sequence segments. Families that correspond between pairs of databases are identified by maximizing the sequence membership overlaps. Then pairwise correspondences are grouped into supersets by transitive closure. Currently PROSITE, PRINTS, Pfam-A, PIR Superfamilies, ProDom, SYSTERS, ProtoMap, DOMO, SBASE-A (Murvai et al., 2000) and Blocks+ are cross-referenced. Documentation consists of links to these databases, as well as to the protein sequence databases. The WWW site shows interrelationships between the various family classifications graphically. Classification of new protein sequences is available.
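The two steps described for MetaFam, pairing families by maximal membership overlap and then grouping the pairs by transitive closure, can be sketched as follows. This is a simplified illustration under stated assumptions and not the MetaFam implementation: it compares whole-sequence membership sets rather than sequence segments, it uses a union-find structure as one convenient way to compute the transitive closure, and the function names, family labels and sequence identifiers are all hypothetical.

```python
def best_matches(db_a, db_b):
    """Pair each family in db_a with the db_b family sharing the most sequences.

    Each database is a dict mapping a family label to a set of sequence IDs.
    Families with no shared sequences are left unpaired.
    """
    pairs = []
    for fam_a, members_a in db_a.items():
        best_fam, best_overlap = None, 0
        for fam_b, members_b in db_b.items():
            overlap = len(members_a & members_b)
            if overlap > best_overlap:
                best_fam, best_overlap = fam_b, overlap
        if best_fam is not None:
            pairs.append((fam_a, best_fam))
    return pairs


def supersets(pairs):
    """Group pairwise correspondences into supersets by transitive closure."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# Hypothetical toy databases; labels and sequence IDs are invented.
prosite = {"PROSITE:kin": {"s1", "s2", "s3"}, "PROSITE:other": {"s9"}}
prints = {"PRINTS:kin": {"s2", "s3", "s4"}}
pfam = {"Pfam:kin": {"s3", "s4", "s5"}, "Pfam:other": {"s8", "s9"}}

pairs = (best_matches(prosite, prints)
         + best_matches(prints, pfam)
         + best_matches(prosite, pfam))
result = supersets(pairs)
```

Because the closure is transitive, a single weak correspondence is enough to merge two otherwise unrelated groups into one superset, so in practice a minimum-overlap threshold may be worth adding to `best_matches`.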
An Example: Kinesins

We choose the kinesins to illustrate similarities and differences between protein family databases. Kinesin and its relatives are motor proteins that utilize ATP hydrolysis to move along microtubules in eukaryotic cells. The motor portion of a kinesin is structurally very similar to that of the myosin motor, which moves along actin filaments, although no sequence similarity is evident between them. This is an example of likely divergence from an ancestral fold that is beyond current sequence-based comparison methods to detect. Kinesin subfamilies based on sequence similarities between motor domains are strongly predictive of cellular function, indicating divergence from an ancestral kinesin-like motor. Kinesins are multidomain proteins, with a coiled-coil stalk attached to the motor domain and typically a domain that interacts with other protein subunits or directly with cargo, such as vesicles and chromosomes. A curated web site describes kinesins in detail and provides subfamily and functional information [http://www.proweb.org/kinesin]. This web site lists 238 sequences divided into 10 different subfamilies.

The InterPro entry (IPR001752) points to corresponding entries in the curated family databases PROSITE (PS00411, PS50067), PRINTS-S (PR00380) and Pfam-A (PF00225), which include a total of 208 different sequences. These correspondences are made on the basis of documentation in the source databases. All sequences in the source databases can be displayed graphically, with the kinesin region defined by each database highlighted. It can be seen from this display that the databases agree well on its location. The PROSITE entries provide a pattern and a profile representation. PRINTS and Pfam-A provide two different multiple alignments. The four ungapped PRINTS blocks correspond to four conserved regions within the semimanually constructed gapped Pfam seed alignment, which was made from 12 sequences.
Automated addition of 208 sequences to the Pfam full alignment introduced numerous additional gaps, and split the domain in some of the sequences.

The MetaFam tabular entry for kinesins (Table 1) provides links to kinesin entries in most of the other databases described in this article, noncurated as well as curated, for a total of 587 sequences. The curated databases PROSITE, PRINTS-S and Pfam-A are represented by the same entries as for InterPro. For PIR, MetaFam lists seven different entries, of which two are myosins. These entries were brought in by a very poor SYSTERS family connection, which can be examined at the MetaFam site. This could be due to the fact that myosins and kinesins have long coiled-coil stalks attached to their dissimilar motor domains, and these heptad amphipathic repeats detect one another in standard database searches. A visit to the PIR site reveals 14 different superfamilies for kinesins, ranging from three to 183 sequences each. The presence of multiple kinesin entries reflects the conservative similarity criteria used to distinguish superfamilies from one another: for example, three sequences including the KLPA protein, a member of the C-terminal subfamily listed in the kinesin web site, are considered to form a superfamily of their own by PIR.

Among the clustered databases, ProDom lists four separate entries for kinesins. These entries represent nonoverlapping conserved regions found in 135 to 139 proteins. Such fragmentation of a family into multiple entries occurs frequently in ProDom. The four SYSTERS entries include two with one sequence each, one with 16 sequences, including KLPA, and one with 660 sequences subdivided into 30 subfamilies, including myosin heavy chain. DOMO lists only a single kinesin motor domain entry, which indicates that the DOMO clustering algorithm has succeeded in avoiding fragmentation of the kinesins. ProtoMap also has a single entry for the kinesin motor domains.
Although ProtoMap does not provide a multiple alignment representation, the ProtoMap kinesin classification dendrogram separates subsets of sequences below distinct nodes: these approximately correspond to manually curated subfamilies listed on the kinesin web site.

Blocks+, a derivative database, has a single kinesin entry (derived from PROSITE), with eight blocks corresponding to the eight conserved regions identified as such in the kinesin web site. Blocks+ also provides a phylogenetic tree that separates sequences approximately corresponding to kinesin web site subfamilies.

Conclusions

Evolutionary processes, including mutation, transposition and chromosome rearrangement followed by selection and drift, have resulted in diversification of protein families and modules from their ancient common ancestors (Henikoff et al., 1997). The resulting protein machines have been responsible for most of the molecular processes in life on earth. Evolutionary ‘tinkering’ has resulted in complex relationships between protein family members and in multidomain proteins that complicate any simple classification scheme. However, the relationships themselves are of extraordinary value because much of modern biology relies upon inferences drawn from comparing and aligning related protein sequences. Therefore, the effort to classify proteins into families continues despite the complexity, and different classification models have resulted in an abundance of protein family databases. All users should be able to find a classification that meets their needs.

Protein family databases change constantly. In addition to the large protein family databases such as those described here, there are numerous small databases and WWW sites devoted to single protein families, usually maintained by individual researchers.
Up-to-date information may be obtained from the annual database issue of Nucleic Acids Research and the ProWeb WWW site [http://www.proweb.org] listed in the references.

Table 1 MetaFam SuperSet 312: kinesin motor domain. Adapted from <http://metafam.ahc.umn.edu>

Database   Protein Family   Description
Blocks+    BL00411          KINESIN_MOTOR_DOMAIN, Kinesin motor domain proteins
DOMO       DM00198          KINESIN MOTOR DOMAIN
Pfam       PF00225          Kinesin motor domain
PIR-D      DA1175           kinesin motor domain homology
PIR-F      FA1228           1134.0: kinesin heavy chain 1.0
PIR-F      FA2471           1143.5: myosin heavy chain 1.0
PIR-S      1134.0           kinesin heavy chain
PIR-S      1141.5           kinesin-related protein KLPA
PIR-S      1143.5           myosin heavy chain
PIR-S      2580.0           unassigned kinesin-related proteins
PRINTS     PR00380          KINESINHEAVY: Kinesin heavy chain signature
ProDom     PD000454         PROTEIN MOTOR ATP-BINDING COILED COIL MICROTUBULES KINESIN-LIKE KINESIN HEAVY CHAIN
ProDom     PD000458         PROTEIN MOTOR ATP-BINDING COILED COIL MICROTUBULES KINESIN-LIKE KINESIN HEAVY CHAIN
ProDom     PD000470         PROTEIN MOTOR ATP-BINDING COILED COIL MICROTUBULES KINESIN-LIKE KINESIN HEAVY CHAIN
PROSITE    PS00411          KINESIN_MOTOR_DOMAIN1: Kinesin motor domain signature
PROSITE    PS50067          KINESIN_MOTOR_DOMAIN2: Kinesin motor domain profile
ProtoMap   183              protomap 183
SBASE      SB00795          KINESIN MOTOR DOMAIN
SYSTERS    N1722            systers N1722
SYSTERS    O1099            systers O1099
SYSTERS    S42943           systers S42943
SYSTERS    S43289           systers S43289

References

Altschul SF, Madden TL, Schaffer AA et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25: 3389–3402.
Apweiler R, Attwood TK, Bairoch A et al. (2000) The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Research 29: 37–40. [http://www.ebi.ac.uk/interpro]
Attwood TK, Croning MDR, Flower DR et al. (2000) PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Research 28: 225–227. [http://bioinf.man.ac.uk/dbbrowser/PRINTS/]
Bairoch A (1992) PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Research 20: 2013–2018. [http://www.expasy.ch/sprot/]
Bairoch A and Apweiler R (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Research 28: 45–48. [http://www.expasy.ch/sprot/]
Barker WC, Garavelli JS, Huang H et al. (2000) The Protein Information Resource (PIR). Nucleic Acids Research 28: 263–266. [http://pir.georgetown.edu/pirwww/dbinfo/]
Bateman A, Birney E, Durbin R et al. (2000) The Pfam protein families database. Nucleic Acids Research 28: 263–266. [http://pfam.wustl.edu]
Chothia C (1992) One thousand families for the molecular biologist. Nature 357: 543–544.
Corpet F, Servant F, Gouzy J and Kahn D (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Research 28: 267–269. [http://www.toulouse.inra.fr/prodom.html] [http://www.toulouse.inra.fr/prodomCG.html]
Gracy J and Argos P (1998) Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search, and multiple sequence alignment. Bioinformatics 14: 164–173. [http://www.infobiogen.fr/  gracy/domo/home.htm]
Green P, Lipman D, Hillier L et al. (1993) Ancient conserved regions in new gene sequences and the protein databases. Science 259: 1711–1716.
Henikoff JG, Greene EA, Pietrokovski S and Henikoff S (2000) Increased coverage of protein families with the Blocks Database servers. Nucleic Acids Research 28: 228–230. [http://blocks.fhcrc.org]
Henikoff S and Henikoff JG (2000) Amino acid substitution matrices. Advances in Protein Chemistry 54: 73–97.
Henikoff S, Greene EA, Pietrokovski S et al. (1997) Gene families: the taxonomy of protein paralogs and chimeras. Science 278: 609–614. [http://proweb.org]
Hofmann K, Bucher P, Falquet L and Bairoch A (1999) The PROSITE database, its status in 1999. Nucleic Acids Research 27: 215–219. [http://www.expasy.ch/prosite/]
Huang H, Xiao C and Wu CH (2000) ProClass protein family database. Nucleic Acids Research 28: 270–272. [http://pir.georgetown.edu/gfserver/]
Krause A, Stoye J and Vingron M (2000) The SYSTERS protein sequence cluster set. Nucleic Acids Research 28: 270–272. [http://www.dkfz-heidelberg.de/tbi/services/cluster/systersform]
Lo Conte L, Ailey B, Hubbard TJP et al. (2000) SCOP: a structural classification of proteins database. Nucleic Acids Research 28: 257–259. [http://scop.mrc-lmb.cam.ac.uk/scop/]
Murvai J, Vlahovicek K, Barta E, Cataletto B and Pongor S (2000) The SBASE protein domain library, release 7.0: a collection of annotated protein sequence segments. [http://www3.icgeb.trieste.it/  sbasesrv/]
Pearl FM, Lee D, Bray JE et al. (2000) Assigning genomic sequences to CATH. Nucleic Acids Research 28: 277–282. [http://www.biochem.ucl.ac.uk/bsm/cath/]
Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Research 24: 3836–3845.
ProWeb [http://blocks.fhcrc.org]
Schultz J, Copley RR, Doerks T, Ponting CP and Bork P (2000) SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Research 28: 231–234. [http://smart.embl-heidelberg.de]
Silverstein KA, Shoop E, Johnson JE et al. (2000) The MetaFam server: a comprehensive protein family resource. Nucleic Acids Research 29: 49–51.
Sonnhammer ELL and Kahn D (1994) Modular arrangement of proteins as inferred from analysis of homology. Protein Science 3: 482–492.
Tatusov RL, Galperin MY, Natale DA and Koonin EV (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Research 28: 33–36. [http://www.ncbi.nlm.nih.gov/COG/]
Yona G, Linial N and Linial M (2000) ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Research 28: 49–55. [http://www.protomap.cs.huji.ac.il]

Further Reading

Baxevanis AD (2000) The Molecular Biology Database Collection: an online compilation of relevant database resources. Nucleic Acids Research 28: 1–7.

Date posted: 23/03/2014, 12:20
