Classification of the short-chain dehydrogenase/reductase superfamily using hidden Markov models

Yvonne Kallberg1,2, Udo Oppermann3 and Bengt Persson1,2,4

1 IFM Bioinformatics, Linköping University, Sweden
2 Department of Cell and Molecular Biology (CMB), Karolinska Institutet, Stockholm, Sweden
3 Structural Genomics Consortium, The Botnar Research Centre, NIHR Biomedical Research Unit, University of Oxford, UK
4 National Supercomputer Centre (NSC) and Swedish E-science Research Centre (SERC), Linköping University, Sweden

Keywords
bioinformatics; classification; genomes; hidden Markov model; short-chain dehydrogenases/reductase

Correspondence
B. Persson, IFM Bioinformatics, Linköping University, S-581 83 Linköping, Sweden
Fax: +46 13 137 568
Tel: +46 13 282 983
E-mail: bpn@ifm.liu.se

(Received 23 August 2009, revised 12 February 2010, accepted 16 March 2010)

doi:10.1111/j.1742-4658.2010.07656.x

The short-chain dehydrogenase/reductase (SDR) superfamily now has over 47 000 members, most of which are distantly related, with typically 20–30% residue identity in pairwise comparisons, making it difficult to obtain an overview of this superfamily. We have therefore developed a family classification system, based upon hidden Markov models (HMMs). To this end, we have identified 314 SDR families, encompassing about 31 900 members. In addition, about 9700 SDR forms belong to families with too few members at present to establish valid HMMs. In the human genome, we find 47 SDR families, corresponding to 82 genes. Thirteen families are present in all three domains (Eukaryota, Bacteria, and Archaea), and are hence expected to catalyze fundamental metabolic processes. The majority of these enzymes are of the ‘extended’ type, in agreement with earlier findings. About half of the SDR families are only found among bacteria, where the ‘classical’ SDR type is most prominent. The HMM-based classification is used as a basis for a sustainable and expandable nomenclature system.

Introduction
The short-chain dehydrogenase/reductase (SDR) superfamily, recently reviewed in [1], consists of NAD(P)(H)-dependent oxidoreductases that are distinct from the medium-chain dehydrogenase and aldo–keto reductase (AKR) superfamilies. The term SDR was coined in 1991 [2], and the enzyme family has been shown to be present in all domains of life, from primitive bacteria to higher eukaryotes. Interestingly, about 25% of all identified dehydrogenases belong to the SDR superfamily [3]. Furthermore, in the ocean sequence sampling by Venter et al. [4], this superfamily was found to be the largest, with over 60 000 nonredundant sequences (over 30 000 of the ‘classical’ type and close to 30 000 of the ‘extended’ type).

The SDR superfamily currently has more than 47 000 primary structures available in sequence databases and over 300 crystal structures deposited in the Protein Data Bank. They show early divergence, the majority of family members having only low pairwise sequence identity (typically 20–30%), but have several properties in common, described in [1,2]. The three-dimensional structures are clearly homologous, with a single-domain globular Rossmann-related fold consisting of a β-sheet sandwiched between three α-helices on each side. The active site is formed by a triad/tetrad with highly conserved Tyr, Lys, Ser (and Asn) residues [1,5]. Substrate binding occurs in a cleft close to the coenzyme-binding site. This cleft shows considerable variation between the individual SDR forms, explaining the wide substrate spectrum of this enzyme superfamily.

In humans, there are 202 SDR forms, corresponding to at least 82 SDR genes; they have important functions in steroid hormone, prostaglandin and retinoid metabolism, and hence signaling. They also play crucial roles in the metabolism of xenobiotics, including drugs and carcinogens. A growing number of single-nucleotide polymorphisms have been assigned to SDR genes. Of the 77 human SDRs that are listed in the well-annotated database Swiss-Prot, 24 enzymes are associated with diseases in the OMIM database (Online Mendelian Inheritance in Man). Thus, many (or even all) of these enzymes are medically important. However, the functions of about half of the human SDR enzymes are unknown.

The SDR superfamily has grown by several orders of magnitude, from 20-odd members in 1991 [2] to over 47 000 today, and thanks to the fast progress in genome and environmental sequencing projects, the number of known SDR forms can be expected to increase even more in the future. This substantiates the need for a subdivision into families to achieve a systematic overview and allow for annotations and for functional conclusions. In this article, we apply hidden Markov models (HMMs) to obtain a sequence-based subdivision of the SDR superfamily that allows for automatic classification of novel sequence data and provides the basis for a nomenclature system.

Abbreviations
AKR, aldo–keto reductase; HMM, hidden Markov model; SDR, short-chain dehydrogenase/reductase; 17β-HSD, 17β-hydroxysteroid dehydrogenase

FEBS Journal 277 (2010) 2375–2386 © 2010 The Authors Journal compilation © 2010 FEBS

Results and Discussion
The large SDR superfamily is of ancient origin, with most members being equidistantly related at the 20–30% residue identity level. Consequently, there are no natural hierarchical relationships to rely on for the functional assignments.

HMMs have successfully been used in protein family characterization [6], and their use is now standard when new sequences are being annotated. They are used in our functional categorization of all SDR members, where each SDR family corresponds to an HMM, and this set of resulting HMMs forms the basis for a sustainable nomenclature scheme for the whole SDR superfamily. So far, 314 families have been defined, covering about 31 900 of 47 000 retrieved SDRs. Approximately 9700 SDRs form clusters that are too small (fewer than 20 members with at most 80% sequence identity) for them to be reliably identified with an HMM, but these will hopefully be extended as new genomes become sequenced. The remaining SDRs (approximately 6800) will be investigated henceforth.

Fig. 1. SDR family sizes. The bars represent the number of SDR families of defined family sizes (x-axis: Members; y-axis: Families). The most common family size is between 20 and 39 members.

Family size
The numbers of members in the different families identified vary considerably, but the majority of the families are quite small. Over half of the SDR families have fewer than 60 members, and 77% of the families have fewer than 100 members (Fig. 1). Large families are rare; there are only 16 families with 400 or more members (Table 1). They are primarily of the ‘extended’ SDR type (nine families), and several of these members metabolize carbohydrate derivatives, a basic function common to most life forms. Two such families are the GDP-mannose-4,6-dehydratases (SDR3E in the nomenclature system) and the GDP-L-fucose synthetases (SDR4E), which are involved in the two-step conversion of GDP-mannose to GDP-L-fucose. The latter is a substrate for several fucosyltransferases, which in turn are involved in the expression of many glycoconjugates [7,8]. Another example is provided by the UDP-glucose-4-epimerases (SDR1E), which catalyze the third and final step in the Leloir pathway of galactose metabolism, interconverting UDP-galactose and UDP-glucose. Impairment of this enzyme reaction leads to epimerase-deficiency galactosemia [9], which can lead to, for example, mental retardation or cataract in humans.

In 16th place among the largest families are the insect alcohol dehydrogenases (SDR109I). These form a very specialized group of enzymes, also called the ‘intermediate’ SDR type [10]. This type seems to be unique to insects, and the size of this family is due to the very well-studied genomes of fruit flies. The 78 different Drosophila genomes alone account for more than half of the members (228 of 404).

Table 1. The 16 largest SDR families. The largest families, with currently more than 400 members, are listed. In the domain columns, the letters E, B and A denote eukaryotic, bacterial and archaeal genomes, respectively.

Designation  Size  Family name
SDR152C      1444  Acetoacetyl-CoA reductase
SDR1E        1273  UDP-glucose-4-epimerase
SDR2E        1109  dTDP-D-glucose-4,6-dehydratase
SDR54D        691  Enoyl-(acyl-carrier-protein) reductase
SDR134E       656  dTDP-4-rhamnose reductase
SDR3E         642  GDP-mannose-4,6-dehydratase
SDR55E        558  Capsular polysaccharide biosynthesis protein
SDR56X        557  Modular polyketide synthase, KR domain
SDR49C        509  Gluconate-5-dehydrogenase
SDR57C        492  Glucose and ribitol dehydrogenase
SDR4E         469  GDP-L-fucose synthetase
SDR108E       456  Dihydroflavonol-4-reductase
SDR6E         424  UDP-glucuronic acid decarboxylase
SDR50E        423  UDP-glucuronate-4-epimerase
SDR5C         411  3-Hydroxyacyl-CoA dehydrogenase type II
SDR109I       404  Drosophila alcohol dehydrogenase

[The per-family marks in the domain columns (E, B, A) could not be recovered from the extracted text.]

Human SDR members
Of the 314 families identified, 37 have human members. Ten additional human SDR families have been identified, but these have too few members (< 20 with < 80% pairwise residue identity) to be suitable for HMM analysis at present. In total, the 47 families represent 82 SDR genes (Table 2). In Table 3, the numbers of genes for each SDR family in human, mouse and rat are compared. For most families, the numbers are identical for the three species. Of the 13 families with two or more genes in humans, nine have at least two genes also in mouse and/or rat.

Regarding family size in relation to number of human members, one would imagine a linear correlation: the larger the size, the more human members. However, typically, the SDR families with more than 400 members have only a single human representative.
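The size criterion quoted above for HMM construction (at least 20 members once sequences sharing more than 80% pairwise identity are set aside) can be illustrated with a short check. This is a minimal sketch under stated assumptions, not the authors' procedure: the greedy redundancy filter, the function names `nonredundant_subset` and `large_enough_for_hmm`, and the toy identity function are all hypothetical, and the reading of the criterion itself is an interpretation of the text.

```python
def nonredundant_subset(members, identity, max_identity=80.0):
    """Greedily keep members whose identity to every kept member is <= max_identity.

    `identity(a, b)` is assumed to return percent pairwise residue identity.
    The result depends on input order, as with any greedy filter.
    """
    kept = []
    for member in members:
        if all(identity(member, other) <= max_identity for other in kept):
            kept.append(member)
    return kept


def large_enough_for_hmm(members, identity, min_members=20, max_identity=80.0):
    """Return True if the cluster keeps at least min_members after filtering."""
    return len(nonredundant_subset(members, identity, max_identity)) >= min_members


def toy_identity(a, b):
    """Hypothetical identity: names look like 'g<group>_<index>'.

    Sequences in the same group are treated as 95% identical, others as 30%.
    """
    same_group = a.split("_")[0] == b.split("_")[0]
    return 95.0 if same_group else 30.0


# 25 mutually distant sequences: all survive the filter.
diverse = [f"g{i}_0" for i in range(25)]
# 25 sequences in 5 tight groups: only one per group survives.
redundant = [f"g{i}_{j}" for i in range(5) for j in range(5)]

print(large_enough_for_hmm(diverse, toy_identity))
print(large_enough_for_hmm(redundant, toy_identity))
```

In practice the identity values would come from real pairwise comparisons; the point of the sketch is only that a cluster with many near-identical sequences can be large in raw count yet too small by this criterion.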
In total, as many as 34 of the 47 families have only one human member. There are four families, with retinol and steroid dehydrogenases, that stand out: SDR7C, SDR11E and SDR16C, with six human members each, and SDR9C, with as many as eight human members. This observation emphasizes the critical importance of enzymatic control of retinoid and steroid metabolism in development as well as metabolic and homeostatic signaling [11]. In this context, control of ligand access to nuclear hormone receptors such as retinoid or steroid receptors, by oxidoreductases such as the above-mentioned SDRs, appears to constitute an important determinant of steroid, retinoid and lipid signaling, and seems to necessitate the existence of such diversified enzyme forms to maintain and execute proper functions in multicellular eukaryotes and mammals.

Species distribution
A closer look at the distribution among the classified SDRs in the domains Eukaryota, Bacteria and Archaea (Fig. 2) reveals that more than half of the families are unique to bacteria (178 of 314). The two largest of these, SDR56X with 557 members and SDR61X with 245 members, have multidomain enzymes in the form of polyketide synthases. They typically have two NADP(H)-binding domains, and the two SDR families cover one domain each. One-fifth (63) of the families have members among both bacteria and eukaryotes but not among archaea. Archaeal SDRs are quite rare, only 32 families having any such member (Fig. 2). There are three families with much higher proportions of archaeal members than are generally found. Two of them comprise extended SDRs (SDR136E UDP-glucose-4-epimerase and SDR144E UDP-glucose homolog), and the third family contains a classical SDR (SDR146C 3-oxoacyl reductase). There is no single SDR family unique to archaea.

Table 2. Human SDR members. The 47 SDR families with human representatives are listed, together with data about family size, domain occurrence, human entries in UniProt, and EC number. The domain designations are: A, archaeal; B, bacterial; E, eukaryotic. The eukaryotic subdivisions are: M, mammal; I, insect; P, plant; O, other. Accession number, identifier and description are extracted from UniProtKB.

Designation  Size  Human entries: identifier (accession number)
SDR1E        1273  GALE_HUMAN (Q14376)
SDR2E        1109  TGDS_HUMAN (O95455)
SDR3E         642  GMDS_HUMAN (O60547)
SDR4E         469  FCL_HUMAN (Q13630)
SDR5C         411  HCD2_HUMAN (Q99714)
SDR6E         424  UXS1_HUMAN (Q8NBZ7)
SDR7C         262  RDH11_HUMAN (Q8TC12), RDH12_HUMAN (Q96NR8), RDH13_HUMAN (Q8NBN7), RDH14_HUMAN (Q9HBH5), DHR13_HUMAN (Q6UX07), DHRSX_HUMAN (Q8N5I4)
SDR8C         360  DHB4_HUMAN (P51659)
SDR9C         158  BDH_HUMAN (Q02338), DHB2_HUMAN (P37059), DHI2_HUMAN (P80365), DHRS9_HUMAN (Q9BPW9), RDH1_HUMAN (Q92781), H17B6_HUMAN (O14756), DR9C7_HUMAN (Q8NEX9), RDH16_HUMAN (O75452)
SDR10E        257  FACR1_HUMAN (Q8WVX9), FACR2_HUMAN (Q96K12)
SDR11E        129  3BHS1_HUMAN (P14060), 3BHS2_HUMAN (P26439), 3BHS7_HUMAN (Q9H2F3), Q9UD07_HUMAN (Q9UD07), Q6I955_HUMAN (Q6I955), Q9UDK8_HUMAN (Q9UDK8)
SDR12C        164  DHB12_HUMAN (Q53GQ0), DHB3_HUMAN (P37058), HSDL1_HUMAN (Q3SXM5)
SDR13C        157  HSDL2_HUMAN (Q6YN16)
SDR14E        135  TDH_HUMAN (Q8IZJ6)
SDR15C        143  BDH2_HUMAN (Q9BUT1)
SDR16C        114  DHRS3_HUMAN (O75911), DHB11_HUMAN (Q8NBQ5), DHB13_HUMAN (Q7Z5P4), RDH10_HUMAN (Q8IZV5), RDHE2_HUMAN (Q8N3Y7), XP_498284
SDR17C        114  DECR2_HUMAN (Q9NUI1)
SDR18C         95  DECR_HUMAN (Q16698)
SDR19C         63  DHRS1_HUMAN (Q96LJ7)
SDR20C         82  DCXR_HUMAN (Q7Z4W1)
SDR21C         52  CBR1_HUMAN (P16152), CBR3_HUMAN (O75828)
SDR22E         84  NDUA9_HUMAN (Q16795)
SDR23E         45  MAT2B_HUMAN (Q9NZL9)
SDR24C        149  DHR11_HUMAN (Q6UWP2)
SDR25C        187  DHRS2_HUMAN (Q13268), DHRS4_HUMAN (Q9BTZ2), DR4L2_HUMAN (Q6PKH6), NP_001075957
SDR26C         41  DHI1_HUMAN (P28845), DHI1L_HUMAN (Q7Z5J1)
SDR27X         74  FAS_HUMAN (P49327)
SDR28C         47  DHB1_HUMAN (P14061), RDH8_HUMAN (Q9NYR8), Q13034_HUMAN (Q13034)
SDR29C         66  PECR_HUMAN (Q9BY49)
SDR30C         47  DHB8_HUMAN (Q92506)
SDR31E         32  NSDHL_HUMAN (Q15738)
SDR32C         33  DRS7B_HUMAN (Q6IAN0), DRS7C_HUMAN (A6NNS2)
SDR33C         43  DHPR_HUMAN (P09417)
SDR34C         42  DHRS7_HUMAN (Q9Y394)
SDR35C         45  KDSR_HUMAN (Q06136)
SDR36C(a)      23  PGDH_HUMAN (P15428)
SDR37C         24  DHB7_HUMAN (P56937), A6NH47_HUMAN (A6NH47)
SDR38C(a)      13  SPRE_HUMAN (P35270)
SDR39U         23  D39U1_HUMAN (Q9NRG7)
SDR40C(a)       -  DHR12_HUMAN (A0PJE2)
SDR41C(a)       -  WWOX_HUMAN (Q9NZC7)
SDR42E(a)       -  D42E1_HUMAN (Q8WUS8), D42E2_HUMAN (A6NKP2)
SDR43U(a)       -  BLVRB_HUMAN (P30043)
SDR44U(a)       -  HTAI2_HUMAN (Q9BUP3)
SDR45C(a)       -  CBR4_HUMAN (Q8N4T8)
SDR47C(a)       -  DHB14_HUMAN (Q9BPX1)
SDR48A(a)       -  NMRL1_HUMAN (Q9HBL8)

(a) This family has too few members (fewer than 20 with maximally 80% pairwise sequence identity) to allow for calculation of an HMM.

[Row assignments are reconstructed from the column order of the extracted text. The domain, description and EC number columns, and the family sizes marked "-", could not be recovered reliably.]

There are few families with representatives from all three domains, only 14 of 314 (Table 4). Seven of them have mammal (and human) representatives. Nine families are of the extended type, which is more than expected by chance, as the classical type is most common (approximately 69%). Therefore, it seems that families of the extended SDR type are represented in more species than the classical type, in agreement with early genome investigations [12]. Interestingly, there is one family (SDR53C, related to glucose dehydrogenases) with only 38 members that still has members from all domains, in spite of its small size. Typically, the bacterial members form the vast majority (80% or more) in most SDR families identified, in agreement with the fact that 79% of the SDRs are from that domain. However, in two of these 14 families, the eukaryotic members are in the majority (SDR51C, L-xylulose reductases; and SDR53C, glucose-1-dehydrogenase-related proteins).

There are 41 families with only eukaryotic members (Table 5). Around half of them are unique to one group of species; insect alcohol dehydrogenases constitute one such family, there are seven families unique to plants, and as many as 15 families are unique to fungi. Sixteen of the remaining families, with members from multiple groups of species, have mammalian (and human) representatives. These families include several of the steroid dehydrogenases/reductases and carbonyl and fatty acyl reductases.

SDR types
SDRs
are divided into the types ‘classical’ and ‘extended’ [10], and it was previously noted that classical SDRs are more common; however, among SDRs that are present in all eukaryotes, the extended type is equally common [13]. Now we are able to make a large-scale comparison, including not only eukaryotes but also prokaryotes. Of the 314 families identified, there are 218 families judged to be classical and 52 extended (68% and 17%, respectively). In total, these cover about 27 900 proteins, and surprisingly, given that the majority of the families are classical, 36% of the proteins are of the extended type. This means that many of the largest families are of the extended type.

Classical SDRs are in the vast majority in families with members from only one domain (Eukaryota or Bacteria) and also in families with both eukaryotic and bacterial members. When archaeal members are involved, however, the pattern changes considerably. Among the 14 families with members from all three domains, only five are classical; that is, the extended type is in the majority. One reason for this could be that the extended SDRs are typically involved in basic metabolic functions, and thus have a lower tendency to vary. Classical SDRs, on the other hand, are involved in many different types of enzyme reactions, and are thus more diverse [1].

Table 3. SDR members in human, mouse, and rat (genes per family).

Family   Family name
SDR1E    UDP-glucose-4-epimerase
SDR2E    dTDP-D-glucose-4,6-dehydratase
SDR3E    GDP-mannose-4,6-dehydratase
SDR4E    GDP-L-fucose synthetase
SDR5C    3-Hydroxyacyl-CoA dehydrogenase type II
SDR6E    UDP-glucuronic acid decarboxylase
SDR7C    NADPH-dependent retinal reductase
SDR8C    Peroxisomal multifunctional enzyme
SDR9C    Multisubstrate SDR9C with preference for NAD(H)
SDR10E   Fatty acyl-CoA reductase
SDR11E   3β-Hydroxysteroid dehydrogenase
SDR12C   NADP(H)-dependent 17β-hydroxysteroid dehydrogenase (SDR12C)
SDR13C   SDR13C with unknown substrate specificity
SDR14E   L-Threonine dehydrogenase
SDR15C   3-Hydroxybutyrate dehydrogenase
SDR16C   Multisubstrate NADP(H)-dependent SDR16C
SDR17C   Peroxisomal 2,4-dienoyl-CoA reductase
SDR18C   Microsomal 2,4-dienoyl-CoA reductase
SDR19C   SDR19C with unknown substrate specificity
SDR20C   L-Xylulose reductase
SDR21C   NADPH-dependent carbonyl reductases and
SDR22E   NADH dehydrogenase (ubiquinone) 1α subcomplex
SDR23E   Methionine adenosyltransferase subunit β
SDR24C   SDR24C with unknown substrate specificity
SDR25C   SDR25C with unknown substrate specificity
SDR26C   11β-Hydroxysteroid dehydrogenases and
SDR27X   Fatty acid synthase
SDR28C   Multisubstrate SDR28C
SDR29C   Peroxisomal trans-2-enoyl-CoA reductase
SDR30C   NAD(H)-dependent 17β-hydroxysteroid dehydrogenase (SDR30C)
SDR31E   Sterol-4-α-carboxylate-3-dehydrogenase
SDR32C   SDR32C with unknown substrate specificity
SDR33C   Dihydropteridine reductase
SDR34C   SDR34C with unknown substrate specificity
SDR35C   3-Ketodihydrosphingosine reductase
SDR36C   15-Hydroxyprostaglandin dehydrogenase
SDR37C   17β-Hydroxysteroid dehydrogenase (SDR37C)
SDR38C   Sepiapterin reductase
SDR39U   SDR39U with unknown substrate specificity
SDR40C   SDR40C with unknown substrate specificity
SDR41C   SDR41C with unknown substrate specificity
SDR42E   SDR42E with unknown substrate specificity
SDR43U   Flavin reductase
SDR44U   HIV-1 TAT-interactive protein
SDR45C   NADH-dependent carbonyl reductase
SDR47C   NAD(H)-dependent 17β-hydroxysteroid dehydrogenase (SDR47C)
SDR48A   SDR48A with unknown function (NmrA-like)

[The per-species gene counts (Human, Mouse, Rat columns) could not be recovered from the extracted text.]

Fig. 2. Number of SDR families with members representing one, two or three of the domains of life. The numbers represent the numbers of families with members in different combinations of the three domains Eukaryota (E), Bacteria (B) and Archaea (A). [Values shown in the diagram: 178, 18, 14, 63, 41.]

Family correlation with functional annotations
Among the identified families, only one-third of the proteins have informative annotations; the other two-thirds have terms such as putative, hypothetical, or possible/probable, or are only identified as SDR proteins. One advantage of the present family grouping is that functional annotations can be inferred from other family members, as many families have at least a few members with annotations describing their functions.

In order to investigate whether family functions could be derived, annotations for each member were compared within the families. In families where the described functions are quite general, we find no inconsistencies; that is, the annotations (if present) are the same for every member in a family, thus supporting the present classification. We also find a good correlation between the present family classification and the function(s) among families for which a more detailed functional role is known, predominantly families with human and mammalian members.

In families with a single human member (34 families), there are no contradictory functions annotated; that is, for all members with known function, the annotation is the same. There are some members that might have another function, according to the protein description, but the function seems to be derived rather than actually determined. For instance, in family SDR6E, there are two members described as GDP-mannose-4,6-dehydratase and 3β-hydroxysteroid dehydrogenase/isomerase (accession numbers Q00VJ3 and A0ZGH3), respectively, suggesting that they belong to families SDR3E and SDR11E instead, but there are no experimental data, and pairwise sequence comparisons clearly identify the human representative for SDR6E (UXS1_HUMAN) as the closest human ortholog.

The 13 families with multiple human members typically contain one or several 17β-hydroxysteroid
dehydrogenases (17β-HSDs) (see, for example, [14,15] for overviews of functions). These types of enzyme have mixed origins; one of them (type 5) is not even an SDR, but belongs to the AKR family, and phylogenetic studies have shown that 17β-HSD activity has evolved from different ancestors, e.g. in types 1, 2 and 3 (corresponding to families SDR28C, SDR9C and SDR12C, respectively; see [16] and references therein). These studies also provide support for the inclusion of retinol dehydrogenases, an 11β-hydroxysteroid dehydrogenase and 17β-HSDs in the SDR9C family, as they most probably have a common ancestor among the invertebrates. Thus, the family classification seems to be valid also for these families.

Table 4. The 14 SDR families present in all domains. The average ratio column shows the average number of members per species. The letters E, B and A denote eukaryotic, bacterial and archaeal genomes, respectively; the values in these columns are the percentages of members from each domain.

Designation  Size  Species  Ratio  E (%)  B (%)  A (%)  Family name
SDR1E        1273      705    1.8   13.8   85.9    0.2  UDP-glucose-4-epimerase
SDR2E        1109      652    1.7    4.1   92.8    2.9  dTDP-D-glucose-4,6-dehydratase
SDR3E         642      395    1.6   11.2   86.0    1.9  GDP-mannose-4,6-dehydratase
SDR4E         469      299    1.6   14.5   83.2    1.1  GDP-L-fucose synthetase
SDR6E         424      261    1.6   28.3   69.3    2.4  UDP-glucuronic acid decarboxylase
SDR14E        135      103    1.3   25.9   71.9    2.2  L-Threonine dehydrogenase
SDR18C         95       66    1.4   22.1   75.8    2.1  Microsomal 2,4-dienoyl-CoA reductase
SDR49C        509      250    2.0    5.5   93.7    0.6  Gluconate-5-dehydrogenase
SDR50E        423      281    1.5   10.9   87.9    1.2  UDP-glucuronate-4-epimerase
SDR51C        184       99    1.9   53.3   46.2    0.5  L-Xylulose reductase
SDR52E        105       89    1.2   10.5   80.0    9.5  Sulfolipid biosynthesis protein
SDR53C         38       33    1.2   57.9   36.8    5.3  Glucose-1-dehydrogenase-related protein
SDR55E        558      372    1.5    0.2   99.6    0.2  Capsular polysaccharide biosynthesis
SDR152C      1444      747    1.9    2.2   97.2    0.6  Acetoacetyl-CoA reductase

Table 5. SDR families unique to eukaryotes. The average ratio column shows the average number of members per species. M, Fi, I, P, Fu and O denote mammals, fish, insects, plants, fungi and other eukaryotes, respectively.

Designation  Size  Species  Ratio  Family name
SDR108E       456      120    3.8  Dihydroflavonol-4-reductase
SDR109I       404       91    4.4  Drosophila alcohol dehydrogenase
SDR10E        257       31    8.3  Fatty acyl-CoA reductase
SDR110C       243       47    5.2  Sex determination protein tasselseed-2
SDR12C        164       52    3.2  NADP(H)-dependent 17β-hydroxysteroid dehydrogenase (SDR12C)
SDR111C       162       29    5.6  Aflatoxin biosynthesis, versicolorin reductase
SDR9C         158       29    5.4  Multisubstrate SDR9C with preference for NAD(H)
SDR24C        149       23    6.5  SDR24C with unknown substrate specificity
SDR11E        129       49    2.6  3β-Hydroxysteroid dehydrogenase
SDR112C       128       27    4.7  Hypothetical protein SDR112C
SDR113E       112       37    3.0  NADPH-dependent methylglyoxal reductase GRE2
SDR22E         84       70    1.2  NADH dehydrogenase (ubiquinone) 1α subcomplex
SDR27X         74       29    2.6  Fatty acid synthase
SDR114C        70       14    5.0  Menthol dehydrogenase
SDR116U        69       44    1.6  Fatty acid synthase α subunit FasA, 3-oxoacyl-(acyl-carrier protein) domain
SDR115E        66        -    7.3  NADPH-dependent HC-toxin reductase
SDR21C         52       24    2.2  NADPH-dependent carbonyl reductases and
SDR118C        51       19    2.7  Hypothetical protein SDR118C
SDR117E        49       14    3.5  Male sterility 2-like protein
SDR28C         47       22    2.1  Multisubstrate SDR28C
SDR30C         47       32    1.5  NAD(H)-dependent 17β-hydroxysteroid dehydrogenase (SDR30C)
SDR120C        44       18    2.4  Short-chain dehydrogenase/reductase SDR120C, putative
SDR33C         43       32    1.3  Dihydropteridine reductase
SDR121E        43       38    1.1  Aminoadipate-semialdehyde dehydrogenase
SDR122U        43       41    1.0  NAD-dependent epimerase/dehydratase
SDR119C        42       12    3.5  11β-Hydroxysteroid dehydrogenase-like protein
SDR123C        42       28    1.5  Hypothetical protein SDR123C
SDR26C         41       22    1.9  11β-Hydroxysteroid dehydrogenases and
SDR124C        37       36    1.0  Hypothetical protein SDR124C
SDR125C        34       18    1.9  Short-chain dehydrogenase/reductase family protein
SDR32C         33       24    1.4  SDR32C with unknown substrate specificity
SDR128C        33       31    1.1  SDR128C oxidoreductase
SDR31E         32       19    1.7  Sterol-4α-carboxylate-3-dehydrogenase
SDR126C        31       27    1.1  D-Arabinitol-2-dehydrogenase
SDR127C        30       19    1.6  Hypothetical protein SDR127C
SDR129C        27       17    1.6  3-Oxoacyl-(acyl-carrier protein) reductase
SDR130C        25       17    1.5  Short-chain dehydrogenase/reductase family SDR130C
SDR37C         24       11    2.2  17β-Hydroxysteroid dehydrogenase (SDR37C)
SDR39U         23       20    1.2  SDR39U with unknown substrate specificity
SDR132C        23        -    2.9  Putative short-chain type alcohol dehydrogenase SDR132C
SDR133E        22       17    1.3  C-3 sterol dehydrogenase/C-4 decarboxylase family

[The per-family species-group marks (M, Fi, I, P, Fu, O) and the two species counts shown as "-" could not be recovered from the extracted text.]

For every member in the first 47 families, we also made sequence comparisons with all identified human SDRs. In every family except two, all of the members have the largest number of identities with the human representatives in their own family. The first exception is the retinol dehydrogenase family SDR7C, where we find a total of 41 members (of 262) scoring higher towards human SDR41C1 (WWOX_HUMAN) than any of its own human representatives. During HMM training, some overlaps were found between these clusters, but as we were unable to create a single HMM that captured every member in the two clusters, it was decided to keep them separate for now. It is possible that these two families have the same ancient origin, and hence should be one family instead. The other exception is SDR8C, comprising peroxisomal multifunctional enzymes, where 25 members
(of 360) preferred SDR30C1 (DHB8_HUMAN). This is in spite of the fact that HMM iteration with these proteins as seed led to the inclusion of human SDR8C1 and not SDR30C1. The SDR8C family is primarily involved in fatty acid metabolism, and has a multidomain structure, with an N-terminal SDR domain followed by a hydratase domain, and finally a sterol carrier protein domain. Members of the SDR30C family consist of a single SDR domain; the exact function has yet to be discovered, but both fatty acid and steroid metabolism have been suggested. Thus, with the knowledge available today, it is not possible to evaluate these families further.

Other SDR classifications

As mentioned in Experimental procedures, Pfam [6] identifies SDRs through the three profiles Adh_short (PF00106), Epimerase (PF01370) and 3Beta_HSD (PF01073), thus classifying these proteins at a much more general level, which does not allow the more fine-grained analysis possible with the presently identified families. Identifying members of the SDR superfamily is, of course, a necessary step in order to be able to cluster and divide them further, but does not give us insights into the function of a specific protein, owing to the large variation in functionality among SDRs. Also, the general HMMs might not correctly identify sequence fragments, which more specialized HMMs can. About 1600 SDRs were identified in this way, i.e. not found by the general SDR HMMs but by the family HMMs only.

Another approach uses evolutionary trees [17] to achieve subfamily classification. In comparison with the method presented herein, this approach arrives at much more fine-grained families; for example, our 3β-hydroxysteroid dehydrogenase family is split into eight subfamilies, and our family with retinol dehydrogenases is split into as many as 19 subfamilies. A classification system that is too specific would be impractical, as it would not provide a correct overview of the divergent SDR superfamily. Furthermore, functional conclusions
drawn from family members would be of less practical value with smaller families, owing to limitations in annotations.

Our HMM system as basis for nomenclature

The presently characterized SDR families form a natural foundation for a nomenclature system. We have therefore, together with a number of researchers in the SDR field, created such a nomenclature system [18], which is already in use [19]. This nomenclature will help us to keep track of the different SDR families, and facilitate collection of knowledge on the structural and functional properties of one of the largest protein families known to date.

Experimental procedures

A number of HMMs were developed in order to arrive at a subclassification of the SDR superfamily. There are already HMMs (three Pfam HMMs and an HMM previously developed by us) for the identification of new SDR members in general. The purpose of the HMMs now developed is to divide the SDRs into more manageable subfamilies whose members share a more specialized function than general dehydrogenase/reductase activity. The new HMMs were developed using an iterative approach to arrive at stable HMMs that correctly identify their own members and disregard members of other SDR families (see below).

SDR proteins were extracted from the UniProt database [20], human RefSeq [21] and human Ensembl [22] as of October 2008, using a previously developed HMM [10] and the Pfam [6] profiles PF00106, PF01073 and PF01370. This dataset consisted of 47 011 proteins (7905 only by our own method, and 6254 only by Pfam). In addition, 1581 proteins have so far been identified by the HMMs now developed.

In order to identify clusters of SDR families, each of the candidate sequences was compared with all of the other candidates using FASTA [23]. We tested clustering at various levels, and found that an initial clustering at the 40% level and an opt-score better than 700 were most appropriate, as judged from test cases with SDR enzymes of known function. Furthermore, the 40% level
has also been shown to be suitable for other classification (nomenclature) systems, such as those of AKRs [6] and cytochrome P450 [7]. However, for the SDR classification, we strive to avoid a strict percentage residue identity threshold, as this might cause strange effects for members with residue identities close to the threshold, and would not correctly reflect the enzymatic properties of the family. Instead, an iterative 'fishing' approach is taken, where the initial FASTA clusters constitute a starting point only.

The members of each SDR cluster were aligned using ClustalW [24]. The alignments were made nonredundant, so that no pair of sequences had more than 80% sequence identity, in order to avoid bias. After this redundancy reduction, the alignment was transformed into an HMM using HMMER [25,26]. Subsequently, a database search against UniProt was performed, hits with a score better than 400 were added to the cluster, and the HMM was retrained. This process of aligning, HMM training and database search was repeated until no further cluster members could be added.

If a cluster had too few members, its HMM could be overtrained and only able to identify the training sequences. Hence, minimum thresholds were set at three points in the clustering procedure, and clusters failing the threshold were set aside while the detection of further members was awaited. The first point was before calculation of the initial alignment, if the FASTA result list contained fewer than five hits with 40–80% residue identity. The second threshold point was after the first HMM iteration, if fewer than 10 members were identified in the database search. The third and last point was set after the iteration process had ended, if the cluster contained fewer than 20 members.

Subsequently, all HMMs were tested to ensure that each SDR member was detected by only one HMM, in order to achieve the desired
specificity. If two clusters were equal, one of the HMMs was selected, and if one cluster was a subset of another cluster, the HMM of the larger cluster was selected. If there was only a partial overlap between the clusters, i.e. if both sets had members that were not in the other set, neither HMM was selected. Instead, an attempt was made to resolve the overlap by excluding the overlapping members from both clusters and retraining the HMMs. If this approach still resulted in overlapping clusters, the two sets were combined into one cluster, and an attempt was made to achieve a stable HMM for the whole set. If this was still unsuccessful, no HMM was created for either of the two initial clusters.

Some SDRs form part of multidomain proteins, e.g. FOX2_CANTR, a multifunctional β-oxidation protein with two SDR domains, and FAS_HUMAN, the cytosolic 2511 amino acid residue fatty acid synthase, which contains one SDR domain. In order to cluster these multidomain proteins, the programs and scripts developed were adjusted so that only the SDR parts of the protein sequences were included in the HMM building procedure.

A quality test was performed on the resulting collection of unique HMMs, using jack-knifing. Each HMM was retrained with one of its members removed from the training set. This was repeated until each member had been removed once, and the ability of the retrained HMMs to correctly identify the member left out was tested. The test consisted of two parts: the member left out needed to have a score above the threshold (400), and all nonmembers (i.e. members of other clusters) needed to have a score below the threshold.

The iterative clustering process was automated using a series of shell scripts and programs developed in C. Typically, each SDR cluster needed eight iterations, and each iteration took approximately h on a Linux workstation equipped with an Intel 2.5 GHz processor. Hence, going through the whole dataset would have been a tedious and
time-consuming process. Parts of the large-scale runs were therefore carried out on the 805-node Hewlett-Packard DL140 cluster Neolith at the National Supercomputer Centre (Linköping, Sweden).

Acknowledgements

Financial support from Linköping University and the Karolinska Institutet is gratefully acknowledged. The Structural Genomics Consortium is a registered charity (number 1097737) that receives funds from the Canadian Institutes for Health Research, the Canadian Foundation for Innovation, Genome Canada, through the Ontario Genomics Institute, GlaxoSmithKline, the Karolinska Institutet, the Knut and Alice Wallenberg Foundation, the Ontario Innovation Trust, the Ontario Ministry for Research and Innovation, Merck & Co., Inc., the Novartis Research Foundation, the Swedish Agency for Innovation Systems, the Swedish Foundation for Strategic Research, and the Wellcome Trust. U. Oppermann is supported by the NIHR Oxford Biomedical Research Unit. Computational resources were provided via the allocation committee of the Swedish National Infrastructure for Computing (SNIC). We also thank J.-O. Jarrhed and the National Supercomputer Centre (NSC), Linköping, Sweden, for computer support.

References

1 Kavanagh KL, Jörnvall H, Persson B & Oppermann U (2008) Functional and structural diversity within the short-chain dehydrogenase/reductase (SDR) superfamily. Cell Mol Life Sci 65, 3895–3906.
2 Persson B, Krook M & Jörnvall H (1991) Characteristics of short-chain alcohol dehydrogenases and related enzymes. Eur J Biochem 200, 537–543.
3 Kallberg Y & Persson B (2006) Prediction of coenzyme specificity in dehydrogenases/reductases: a hidden Markov model-based method and its application on complete genomes. FEBS J 273, 1177–1184.
4 Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, Remington K, Eisen JA, Heidelberg KB, Manning G, Li W et al. (2007) The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol 5, e16.
5 Filling C, Berndt KD, Benach J, Knapp S, Prozorovski T, Nordling E, Ladenstein R, Jörnvall H & Oppermann U (2002) Critical residues for structure and catalysis in short-chain dehydrogenases/reductases. J Biol Chem 277, 25677–25684.
6 Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL et al. (2004) The Pfam protein families database. Nucleic Acids Res 32 (Database issue), D138–D141.
7 Sullivan FX, Kumar R, Kriz R, Stahl M, Xu GY, Rouse J, Chang XJ, Boodhoo A, Potvin B & Cumming DA (1998) Molecular cloning of human GDP-mannose 4,6-dehydratase and reconstitution of GDP-fucose biosynthesis in vitro. J Biol Chem 273, 8193–8202.
8 Tonetti M, Sturla L, Bisso A, Benatti U & De Flora A (1996) Synthesis of GDP-l-fucose by the human FX protein. J Biol Chem 271, 27274–27279.
9 Thoden JB, Wohlers TM, Fridovich-Keil JL & Holden HM (2001) Molecular basis for severe epimerase deficiency galactosemia. X-ray structure of the human V94m-substituted UDP-galactose 4-epimerase. J Biol Chem 276, 20617–20623.
10 Kallberg Y, Oppermann U, Jörnvall H & Persson B (2002) Short-chain dehydrogenases/reductases (SDRs): coenzyme-based functional assignments in completed genomes. Eur J Biochem 269, 4409–4417.
11 Nobel S, Abrahmsen L & Oppermann U (2001) Metabolic conversion as a pre-receptor control mechanism for lipophilic hormones. Eur J Biochem 268, 4113–4125.
12 Jörnvall H, Höög J-O & Persson B (1999) SDR and MDR: completed genome sequences show these protein families to be large, of old origin, and of complex nature. FEBS Lett 445, 261–264.
13 Kallberg Y, Oppermann U, Jörnvall H & Persson B (2002) Short-chain dehydrogenase/reductase (SDR) relationships: a large family with eight clusters common to human, animal, and plant genomes. Protein Sci 11, 636–641.
14 Lukacik P, Kavanagh KL & Oppermann U (2006) Structure and function of human 17beta-hydroxysteroid dehydrogenases. Mol Cell Endocrinol 248, 61–71.
15 Moeller G & Adamski J (2006) Multifunctionality of human 17beta-hydroxysteroid dehydrogenases. Mol Cell Endocrinol 248, 47–55.
16 Baker ME (2001) Evolution of 17beta-hydroxysteroid dehydrogenases and their role in androgen, estrogen and retinoid action. Mol Cell Endocrinol 171, 211–215.
17 Krishnamurthy N & Sjölander K (2005) Phylogenomic inference of protein molecular function. Curr Protoc Bioinformatics Chapter 6, Unit 6.9.
18 Persson B, Kallberg Y, Bray JE, Bruford E, Dellaporta SL, Favia AD, Duarte RG, Jörnvall H, Kavanagh KL, Kedishvili N et al. (2009) The SDR (short-chain dehydrogenase/reductase and related enzymes) nomenclature initiative. Chem Biol Interact 178, 94–98.
19 Kowalik D, Haller F, Adamski J & Moeller G (2009) In search for function of two human orphan SDR enzymes: hydroxysteroid dehydrogenase like (HSDL2) and short-chain dehydrogenase/reductase-orphan (SDR-O). J Steroid Biochem Mol Biol 117, 117–124.
20 Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M et al. (2005) The universal protein resource (UniProt). Nucleic Acids Res 33, D154–D159.
21 Pruitt KD, Tatusova T & Maglott DR (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35, D61–D65.
22 Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L et al. (2009) Ensembl 2009. Nucleic Acids Res 37, D690–D697.
23 Pearson WR & Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85, 2444–2448.
24 Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG & Thompson JD (2003) Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res 31, 3497–3500.
25 Durbin R, Eddy S, Krogh A & Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge.
26 Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14, 755–763.
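
As an illustration of two decision rules described under Experimental procedures (the three minimum-size checkpoints and the comparison of overlapping clusters), a minimal Python sketch follows. It is not the authors' implementation, which used shell scripts and C programs; all function, class and constant names are hypothetical, and clusters are assumed to be represented simply as sets of sequence identifiers.

```python
from enum import Enum

# Thresholds quoted in the text; the constant names are hypothetical.
MIN_FASTA_HITS = 5         # hits with 40-80% residue identity before the initial alignment
MIN_FIRST_ITERATION = 10   # members found by the database search after the first HMM iteration
MIN_FINAL_MEMBERS = 20     # members in the cluster once iteration has ended


class Overlap(Enum):
    EQUAL = "equal"        # identical clusters: one HMM is selected
    SUBSET = "subset"      # one cluster contains the other: the larger cluster's HMM is selected
    PARTIAL = "partial"    # each cluster has members the other lacks: neither HMM is selected
    DISJOINT = "disjoint"  # no shared members: nothing to resolve


def passes_size_checks(fasta_hits: int, first_iteration_members: int,
                       final_members: int) -> bool:
    """Return True if a cluster clears all three minimum-size checkpoints."""
    return (fasta_hits >= MIN_FASTA_HITS
            and first_iteration_members >= MIN_FIRST_ITERATION
            and final_members >= MIN_FINAL_MEMBERS)


def classify_overlap(a: set, b: set) -> Overlap:
    """Classify how two clusters (sets of sequence identifiers) overlap."""
    if a == b:
        return Overlap.EQUAL
    if a <= b or b <= a:
        return Overlap.SUBSET
    if a & b:
        return Overlap.PARTIAL
    return Overlap.DISJOINT


if __name__ == "__main__":
    # Toy identifiers, for illustration only.
    print(classify_overlap({"P1", "P2"}, {"P2", "P3"}))
    print(passes_size_checks(fasta_hits=6, first_iteration_members=12, final_members=19))
```

In the partial-overlap case the text describes further steps (excluding the shared members and retraining, then merging the clusters if overlap persists); those depend on HMM training and are deliberately left out of this sketch.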