Short-chain dehydrogenases/reductases (SDRs)
Coenzyme-based functional assignments in completed genomes
Yvonne Kallberg¹,², Udo Oppermann¹, Hans Jörnvall¹ and Bengt Persson¹,²
¹Department of Medical Biochemistry and Biophysics and ²Stockholm Bioinformatics Centre, Karolinska Institutet, Sweden
Short-chain dehydrogenases/reductases (SDRs) are enzymes
of great functional diversity. Even at sequence identities of
typically only 15–30%, specific sequence motifs are detect-
able, reflecting common folding patterns. We have devel-
oped a functional assignment scheme based on these motifs
and we find five families. Two of these families were known
previously and are called ‘classical’ and ‘extended’ families,
but they are now distinguished at a further level based on
coenzyme specificities. This analysis gives seven subfamilies
of classical SDRs and three subfamilies of extended SDRs.
We find that NADP(H) is the preferred coenzyme among
most classical SDRs, while NAD(H) is that preferred among
most extended SDRs. Three families are novel entities,
denoted ‘intermediate’, ‘divergent’ and ‘complex’, encompassing
short-chain alcohol dehydrogenases, enoyl reductases
and multifunctional enzymes, respectively. The
assignment scheme was applied to the genomes of human,
mouse, Drosophila melanogaster, Caenorhabditis elegans,
Arabidopsis thaliana and Saccharomyces cerevisiae. In the
animal genomes, the extended SDRs amount to around one
quarter or less of the total number of SDRs, while in the
A. thaliana and S. cerevisiae genomes, the extended mem-
bers constitute about 40% of the SDR forms. The numbers
of NAD(H)-dependent and NADP(H)-dependent SDRs
are similar in human, mouse and plant, while the propor-
tions of NAD(H)-dependent enzymes are much lower in
fruit fly, worm and yeast. We show that, in spite of the great
diversity of the SDR superfamily, the primary structure
alone can be used for functional assignments and for
predictions of coenzyme preference.
Keywords: short-chain dehydrogenases/reductases; genome;
coenzyme; sequence patterns; bioinformatics.
Short-chain dehydrogenases/reductases (SDRs) are
enzymes with subunits of about 250 residues, catalysing
NAD(P)(H)-dependent oxidation/reduction reactions. The concept of
SDRs was established in 1981 [1], at a time when the only
members known were a prokaryotic ribitol dehydrogenase
and an insect alcohol dehydrogenase. Since then, the SDR
family has grown enormously, both in the number of
known members and the diversity of their functions.
Already some years ago, over 1000 forms were ascribed to
the SDR superfamily [2], and currently at least 3000
members, including species variants, are known with a
substrate spectrum ranging from alcohols, sugars, steroids
and aromatic compounds to xenobiotics. The N-terminal
region binds the coenzymes NAD(H) or NADP(H), while
the C-terminal region constitutes the substrate binding part.
Although the residue identity is as low as 15–30%, the 3D
folds are quite similar, except for the C-terminal regions.
The SDRs have been divided into two large families,
‘classical’ and ‘extended’, with different Gly-motifs in the
coenzyme-binding regions, and different chain lengths;
around 250 residues in classical SDRs and 350 in extended
SDRs [3]. Few residues are completely conserved, but
several sequence motifs are distinguishable within the
families.
It is desirable to define distinct characteristics of these
families for functional assignments of new sequences added
to the SDR superfamily. We have now defined characteristic
differences for all SDR types and distinguish five SDR
families. Furthermore, seven subfamilies are delineated
within the classical SDRs and three subfamilies within the
extended SDRs. These characteristics can be used for
functional predictions of further, novel structures, and the
assignment system developed is now applied to the genomes
of human, mouse, Drosophila melanogaster, Arabidopsis thaliana,
Caenorhabditis elegans and Saccharomyces cerevisiae.
MATERIALS AND METHODS
We trained a Hidden Markov model [4] on a set of 95 SDR
sequences extracted from SWISSPROT with less than 70%
identity in pairwise comparisons, using a manually curated
alignment based on human SDRs as seed sequences. The
resulting Hidden Markov model was subsequently used to
search the databases SWISSPROT [5] and KIND [6], selecting
every sequence that had an expect value below 10⁻¹⁵ as a
candidate SDR. When these candidate sequences were
aligned, they separated into five clusters (Fig. 1), two of
which were the classical and extended families [3] and three
were the specific families of insect alcohol dehydrogenase,
enoyl reductase and multifunctional enzymes. These three
novel families were named ‘intermediate’, ‘divergent’ and
‘complex’, respectively. The first level of assignments would
then be to sort sequences into these five families using a
motif-based approach.
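The candidate selection step can be pictured as a simple threshold filter over search hits. The sketch below is illustrative only: the `Hit` record, the identifiers and the expect values are invented, and no real HMM software is called. Only the 10⁻¹⁵ cut-off is taken from the text.

```python
from dataclasses import dataclass

# Cut-off stated in the text: expect value below 1e-15.
E_VALUE_CUTOFF = 1e-15


@dataclass(frozen=True)
class Hit:
    """Hypothetical search hit: a sequence identifier and its expect value."""
    seq_id: str
    e_value: float


def select_candidates(hits, cutoff=E_VALUE_CUTOFF):
    """Return the identifiers of hits whose expect value is strictly below the cut-off."""
    return [hit.seq_id for hit in hits if hit.e_value < cutoff]


# Invented example hits (not real database entries).
example_hits = [
    Hit("seqA", 3e-40),
    Hit("seqB", 1e-15),  # equal to the cut-off, so not "below" it
    Hit("seqC", 2e-7),
    Hit("seqD", 5e-16),
]
candidates = select_candidates(example_hits)
```

A strict comparison is used because the text speaks of values "below" the cut-off; whether the boundary value itself was accepted is not stated.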
Correspondence to B. Persson, Department of Medical Biochemistry
and Biophysics, Karolinska Institutet, S-171 77 Stockholm,
Sweden. Fax: + 46 8 337 462, Tel.: + 46 8 728 7730,
E-mail: bengt.persson@mbb.ki.se
Abbreviation: SDR, short-chain dehydrogenase/reductase.
(Received 25 April 2002, revised 16 July 2002,
accepted 24 July 2002)
Eur. J. Biochem. 269, 4409–4417 (2002) © FEBS 2002 doi:10.1046/j.1432-1033.2002.03130.x
Based on a nonredundant set (<80% identity; 100
classical, 80 extended, 7 intermediate, 12 divergent and 12
complex) of known SDR members in SWISSPROT, we
developed sequence motifs covering the most conserved
parts of the sequences. Three sequence motifs were developed
for each family (Fig. 2) to optimize specificity and
sensitivity. Within each family, 40 of the most preserved
amino acid residues in the alignment were selected. The
amino acid types ‘accepted’ at a position were those
observed together with those with similar amino acid
properties, e.g. if Ile and Val are observed, then Leu is also
accepted at that position.
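The way observed residue types are widened to an accepted set can be sketched as follows. The property groups below are assumptions made for illustration (the groups actually used are not listed in the text), as is the rule that a group is added once at least two of its members are observed, which is modelled on the Ile/Val/Leu example.

```python
# Hypothetical property groups; the paper does not list the groups it used.
PROPERTY_GROUPS = [
    frozenset("ILV"),  # aliphatic (assumed grouping, matches the Ile/Val/Leu example)
    frozenset("FWY"),  # aromatic (assumed)
    frozenset("DE"),   # acidic (assumed)
    frozenset("KR"),   # basic (assumed)
]


def accepted_residues(observed, groups=PROPERTY_GROUPS, min_shared=2):
    """Widen the observed residue types with similar ones.

    A whole property group is accepted when at least `min_shared` of its
    members were observed at the position (an assumed rule).
    """
    observed = set(observed)
    accepted = set(observed)
    for group in groups:
        if len(observed & group) >= min_shared:
            accepted |= group
    return accepted


# Example from the text: Ile and Val observed, so Leu is accepted as well.
ile_val = accepted_residues("IV")
```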
During an iterative process, an automated sorting
procedure was developed. The aligned sequences were
scored against the sequence motifs in the following manner.
The presence of an accepted amino acid residue type at a
motif position increases the sequence score by one point.
If instead a gap is found at that position, the score is
decreased by one point. A large part of the motifs covers
the coenzyme-binding region. Other enzyme families that
also bind NAD(P)(H) might be detected with this profile,
and introduce false positives in our set. Thus, in order to
separate SDRs from other enzymes, key positions in the
classical and extended motifs were deduced from multiple
sequence alignments. These key positions (bold in Fig. 2)
render a score of +3 if present and a score of −3 if absent.
Thus, each sequence is associated with five different scores,
one for each family. Incomplete sequences can pose a
problem when using sequence-based methods, because such
sequences might render a low score and thus be classified
incorrectly. In this report, sequences with more than 20%
gap positions in the alignment were removed from the data
set and not subjected to the scoring process.
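The scoring rules just described can be written down compactly. The following is a minimal sketch, not the authors' implementation: the `MotifPosition` type and the toy motif are invented, a non-accepted residue at an ordinary position is assumed to score zero (the text is silent on this case), a key position is treated as "absent" for both gaps and non-accepted residues, and the gap fraction is assumed to be counted over the whole aligned sequence.

```python
from typing import FrozenSet, NamedTuple, Optional, Sequence

GAP = "-"


class MotifPosition(NamedTuple):
    """One motif position: alignment column, accepted residues, key-position flag."""
    column: int
    accepted: FrozenSet[str]
    is_key: bool = False


def gap_fraction(aligned: str) -> float:
    """Fraction of gap characters in an aligned sequence."""
    if not aligned:
        raise ValueError("empty aligned sequence")
    return aligned.count(GAP) / len(aligned)


def motif_score(aligned: str, motif: Sequence[MotifPosition]) -> int:
    """Score an aligned sequence against one family motif."""
    score = 0
    for pos in motif:
        residue = aligned[pos.column]
        hit = residue in pos.accepted
        if pos.is_key:
            score += 3 if hit else -3   # key positions: +3 if present, -3 if absent
        elif hit:
            score += 1                  # accepted residue type: +1
        elif residue == GAP:
            score -= 1                  # gap at a motif position: -1
        # any other residue: 0 (assumption, not stated in the text)
    return score


def score_if_complete(aligned: str, motif: Sequence[MotifPosition],
                      max_gap_fraction: float = 0.20) -> Optional[int]:
    """Return the motif score, or None for sequences with more than 20% gaps."""
    if gap_fraction(aligned) > max_gap_fraction:
        return None
    return motif_score(aligned, motif)


# Invented toy motif over a six-column alignment.
toy_motif = [
    MotifPosition(0, frozenset("G"), is_key=True),
    MotifPosition(2, frozenset("ILV")),
    MotifPosition(4, frozenset("KR")),
    MotifPosition(5, frozenset("DE")),
]
```

In the procedure described above, each sequence would be scored in this way against all five family motifs, giving five scores per sequence.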
The sorting procedure, with the groups and thresholds, is
shown in Fig. 3. The thresholds were obtained through a
systematic iterative procedure. The scores were used to sort
the sequences into one of the five families. There are
members of the SDR superfamily that do not meet any of
the family requirements, i.e. the scores are below the
thresholds. Rather than lowering the thresholds or extending
the motifs, we sort such sequences into an artificial group
called ‘unclassified’ SDR. Another artificial group, ‘potential’
SDR, is also used. It consists of sequences that are
not SDR members as far as can be judged today, but have
some properties in common with the SDR family.
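As a rough illustration of threshold-based sorting, the sketch below assigns a sequence to the best-scoring family that reaches its threshold and otherwise to the ‘unclassified’ group. The threshold values and the best-score rule are placeholders chosen for the example; the actual conditions are those given in Fig. 3 and are not reproduced here, and the ‘potential’ group is left out because its criteria are not stated in the text.

```python
FAMILIES = ("classical", "extended", "intermediate", "divergent", "complex")

# Placeholder thresholds, NOT the values from Fig. 3.
HYPOTHETICAL_THRESHOLDS = {family: 20 for family in FAMILIES}


def assign_family(scores, thresholds=HYPOTHETICAL_THRESHOLDS):
    """Pick the best-scoring family among those that reach their threshold.

    `scores` maps each family name to the motif score of one sequence.
    Sequences that reach no threshold are returned as 'unclassified'.
    """
    passing = {fam: s for fam, s in scores.items() if s >= thresholds[fam]}
    if not passing:
        return "unclassified"
    return max(passing, key=passing.get)


# Invented score sets for two sequences.
example_scores = {"classical": 35, "extended": 22, "intermediate": 10,
                  "divergent": -4, "complex": 0}
low_scores = {family: 5 for family in FAMILIES}
```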
For the structural comparisons, the 3D structures of
members within the SDR superfamily were superimposed
using ICM (version 2.7, Molsoft LLC, San Diego, CA, USA) [7].
RESULTS
Five SDR families
In order to get functional assignments for the members of
the SDR superfamily, we developed an assignment system
to distinguish families with specific characteristics. The SDR
superfamily divides into five families (Fig. 1), of which two
are the previously established classical and extended, and
three are novel entities, denoted intermediate, divergent and
complex.
The classical family encompasses oxidoreductases
(EC 1), such as steroid dehydrogenases and carbonyl
reductases. The extended family consists of isomerases
(EC 5), e.g. galactose epimerases, and lyases
(EC 4), such as glucose dehydratases, but several
oxidoreductases are also found within this family, e.g. in
Fig. 2. Conserved sequence motifs in the SDR families as derived from a multiple sequence alignment. For each of the five SDR families, specific
sequence patterns exist. Three motif segments, with a total of 40 preserved positions that cover the coenzyme-binding and active site regions, have
been chosen for each family. Multiple amino acid occurrences at a position are written within brackets. ‘x’ denotes any amino acid residue or gap,
and when present the subsequent number indicates the number of x residues/gaps. Amino acid residues written in bold indicate positions of special
importance in the classical and extended motifs. Because the motifs are based upon the sequence majority, insertions in single sequences do not
affect the patterns.
Fig. 1. The two levels of classification within the SDR superfamily. At
the first level, the members of the SDR superfamily are separated into
five families. At the second level, members of the classical and extended
families are separated into seven and three subfamilies, respectively,
based upon coenzyme-binding residue patterns.
multifunctional enzymes such as the 3β-hydroxysteroid
dehydrogenase/Δ4,5 isomerase cluster.
The intermediate family exhibits an atypical Gly-motif
(G/AxxGxxG/A) that resembles patterns of extended
SDRs, except that Ala is highly represented instead of
Gly. However, the remaining parts of the sequences are
more closely related to the classical SDRs, e.g. with an
NGAG motif (corresponding to the NNAG motif in β4,
Table 4), and with a subunit size (about 250 residues) like that of the
classical SDRs. In this family, thus denoted intermediate, we
find fruit fly alcohol dehydrogenases, constituting a set of
SDRs that divides into three lines with around 35%
sequence identity, in pair-wise comparisons, between them.
The divergent family with enoyl reductases from bacteria
and plants constitutes a set of NADH-dependent enzymes
with three patterns that deviate from those typical of most
SDRs. First, the Gly-motif is differently spaced with five
residues instead of three between the first two glycine
residues. Second, in bacteria the second and third glycines
have been replaced with serine and alanine, i.e. the motif is
GxxxxxSxA. Third, there is a methionine instead of a
tyrosine in the active site motif, while the tyrosine is found
three positions upchain, i.e. YxxMxxxK instead of YxxxK.
The 3D structures of FabI from Escherichia coli (PDB code
1qsg) and Mycobacterium tuberculosis (1bvr) reveal that the
tyrosine and lysine residues are close in space. They are
located within an α-helix and the spacing between the two
residues makes them face the same side with a similar
distance between Tyr-Oη and Lys-Nζ as for the classical
SDRs, i.e. with a 1.3-Å difference compared to the
3α,20β-hydroxysteroid dehydrogenase, and with spatial freedom
for the lysine residue to move closer to the tyrosine residue.
Thus, they can function the same way as when the residues
are only three positions apart [8,9].
The complex family is named after its members, which
are parts of multifunctional enzyme complexes present in all
forms of life, e.g. fatty acid synthase. They are NADP(H)-
binding proteins with the SDR region having a beta-
ketoacyl reductive function. This group has the unique
motif of YxxxN at the active site rather than the typical
YxxxK.
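The family-specific patterns named above translate directly into regular expressions, with ‘x’ becoming the wildcard `.`. The sketch below is only an illustration: the dictionary names, the helper function and the toy sequences are invented, and a plain pattern search on an unaligned sequence is a simplification, since short patterns such as YxxxK can also occur by chance outside the active site.

```python
import re

# Gly-motif variants described in the text ('x' = any residue).
GLY_MOTIFS = {
    "intermediate": re.compile(r"[GA]..G..[GA]"),      # G/AxxGxxG/A
    "divergent (bacterial)": re.compile(r"G.....S.A"),  # GxxxxxSxA
}

# Active site patterns described in the text.
ACTIVE_SITE_MOTIFS = {
    "typical": re.compile(r"Y...K"),       # YxxxK
    "divergent": re.compile(r"Y..M...K"),  # YxxMxxxK
    "complex": re.compile(r"Y...N"),       # YxxxN
}


def motifs_present(sequence, motifs):
    """Return the sorted names of all patterns found anywhere in the sequence."""
    return sorted(name for name, pattern in motifs.items()
                  if pattern.search(sequence))


# Invented toy sequences, constructed only to exercise the patterns.
toy_intermediate = "AGAAGAAAATTYTTTKT"
toy_divergent = "TGAAAAASAAYAAMAAAKT"
toy_complex = "AAYAAANAA"
```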
Using a Hidden Markov model, candidate SDR
sequences were extracted from SWISSPROT and KIND. These
sequences were aligned and sorted into the five families,
classical, extended, intermediate, divergent and complex,
using a motif-based approach (for details please see the
Materials and methods section). The two databases show
the same ratios for the different families (Table 1). The
family of classical SDRs is the largest, capturing half of the
sequences, while the family of extended SDRs is second in
size with a quarter of the sequences.
Even when the most divergent sequences have been
assigned to families, there is still large sequence variation
among the members of the classical and the extended SDRs.
The sequence identity is as low as 8% (classical) and 10%
(extended) in pair-wise comparisons (Table 1). Thus, these
two families are subject to a further assignment procedure,
at a second level, based upon coenzyme-specificity.
Table 1. Number of SDR family members in the SWISSPROT and KIND databases.

                SWISSPROT                        KIND
Family          Group size   Residue identity    Group size    Residue identity
Classical       253 (50%)    8–99%               1512 (47%)    8–99%
Extended        125 (25%)    10–99%              856 (27%)     6–99%
Intermediate    62 (12%)     27–99%              158 (5%)      25–99%
Divergent       16 (3%)      28–98%              53 (2%)       24–99%
Complex         12 (2%)      20–74%              133 (4%)      15–99%
unclassified    16 (3%)      –                   97 (3%)       –
potential       17 (3%)      –                   128 (4%)      –
partial         12 (2%)      –                   267 (8%)      –
Total           513          –                   3204          –
Fig. 3. Flow chart of the family assignment procedure. Each sequence is
scored against the five different family motifs. Depending on these
scores, the sequences are sorted into seven groups – the five families
and two ‘artificial’ groups. The conditions for each selection are given
within boxes.
Coenzyme-based subfamily assignments
The coenzyme-binding residues were used in the subfamily
assignments. A βαβ-fold, part of the Rossmann fold [10],
has been found to be common to enzymes that bind
NAD(H), NADP(H) or FAD [11]. An acidic residue is
often present at the C-terminal end of the second β-strand in
enzymes that are NAD(H)-binding [12]. This residue forms
hydrogen bonds to the 2′- and 3′-hydroxyl groups of the
adenine ribose moiety. NADP(H)-preferring enzymes have
instead two basic residues (Arg or Lys) that bind to the
2′-phosphate [cf. 13]. The first of these basic residues is found
in the Gly-motif, immediately preceding the second glycine.
The second basic residue is positioned directly after the
crucial acidic residue of NAD(H)-preferring enzymes, i.e. at
the first loop position after the second β-strand. The pattern
of charged residues was used to distinguish subfamilies
within the classical and extended SDR families.
Subfamilies within the classical SDR family
We superimposed experimentally solved 3D structures of
classical SDRs, and compared residues within 4 Å of the
coenzyme. NAD(H)-preferring enzymes (3α,20β-hydroxysteroid
dehydrogenase, 7α-hydroxysteroid dehydrogenase,
2,3-dihydroxybiphenyl dehydrogenase, 2,3-butanediol
dehydrogenase, 3-hydroxyacyl-CoA dehydrogenase type 2 and
dihydropteridine reductase; PDB codes 2hsd, 1ahh, 1bdb,
1geg, 1e3w and 1dhr) have an acidic residue present at the
end of the second β-strand (key position 36 in Table 2).
Presence of the Asp residue at this position alone seems to
determine the preference of NAD(H) over NADP(H), as
neither a basic residue adjacent to this acidic residue (1bdb),
nor a basic residue in the Gly-motif (2hsd) alters the
coenzyme preference. NADP(H)-binding enzymes seem to
be less strict in their requirement for two basic residues.
Three structures (carbonyl reductase, tropinone reductase II
and sepiapterin reductase; PDB codes 1cyd, 2ae2 and 1oaa)
have both these residues (key positions 15 and 37 in
Table 2), while trihydroxynaphthalene reductase (1ybv) and
3-oxoacyl reductase (1edo) have only the first, and
17β-hydroxysteroid dehydrogenase type 1 (1fdu) has only
the second basic residue.
Because only a few structures are experimentally solved, we
created an alignment including all classical SDRs with
coenzyme specificity annotated in SWISSPROT. The sequences
were aligned using a Hidden Markov model trained on
sequences from the classical family only, to avoid artefacts
due to the great diversity of the SDR superfamily. We found
that the correlations between patterns of charged residues
and coenzyme specificity are generally applicable. Sequence
motifs based upon the patterns of charged residues were
developed and used to sort the classical SDRs into four
subfamilies of NAD(H)-binding proteins (Fig. 1). These
subfamilies were denoted cD1d, cD1e, cD2 and cD3.
Sequences that bind NAD(H) and have a negatively
charged amino acid residue present at the end of the second
β-strand (key position 36, Table 2) are sorted into subfamily
cD1d if this charged residue is aspartic acid or subfamily
cD1e if it is glutamic acid. Sequences that instead have a
negatively charged residue at the first or second position
after the second β-strand (key positions 37 or 38, Table 2)
are sorted into subfamily cD2 or cD3, respectively.
The NADP(H)-binding proteins are sorted into three
subfamilies. Sequences with a basic residue in the Gly-motif
(key position 15, Table 2) are sorted into subfamily cP1,
while those with a basic residue at the first position after the
second β-strand (key position 37, Table 2) are sorted into
subfamily cP2. The cP3 subfamily is formed from sequences
that have basic residues at both these positions.
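The subfamily rules for classical SDRs can be summarised as a small decision function over the residues at key positions 15, 36, 37 and 38. This is a sketch under stated assumptions, not the published procedure: the function name is invented, and the order in which the rules are tried is assumed. Only the precedence of an acidic residue at position 36 over basic residues at positions 15 and 37 follows from the structural observations reported above.

```python
ACIDIC = frozenset("DE")
BASIC = frozenset("KR")


def classical_subfamily(res15, res36, res37, res38="X"):
    """Assign a classical SDR to a coenzyme-binding subfamily.

    Arguments are one-letter residues at key positions 15, 36, 37 and 38
    (numbering of 3alpha,20beta-hydroxysteroid dehydrogenase, PDB 2hsd).
    'X' stands for an unknown residue. The rule order is an assumption.
    """
    # Acidic residue at the end of the second beta-strand: NAD(H) binding.
    if res36 == "D":
        return "cD1d"
    if res36 == "E":
        return "cD1e"
    # Basic residues in the Gly-motif and/or after the strand: NADP(H) binding.
    basic15 = res15 in BASIC
    basic37 = res37 in BASIC
    if basic15 and basic37:
        return "cP3"
    if basic15:
        return "cP1"
    if basic37:
        return "cP2"
    # Acidic residue one or two positions after the strand: NAD(H) binding.
    if res37 in ACIDIC:
        return "cD2"
    if res38 in ACIDIC:
        return "cD3"
    return "unclassified"


def predicted_coenzyme(subfamily):
    """Map a subfamily label to the predicted coenzyme preference."""
    if subfamily.startswith(("cD", "eD")):
        return "NAD(H)"
    if subfamily.startswith(("cP", "eP")):
        return "NADP(H)"
    return "unknown"
```

For instance, a sequence with a non-basic residue at position 15, a non-acidic residue at position 36 and Arg at position 37 would be sorted into cP2 by this function, which corresponds to the case of an enzyme carrying only the second basic residue.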
The new sorting process was applied to every classical
SDR sequence in SWISSPROT and KIND, giving the distribution
of subfamilies shown in Table 2. NADP(H)-binding is
twice as frequent as NAD(H)-binding (about 60% vs. 30%),
indicating that there are more forms catalysing the reductive
reactions than the oxidative reactions. Only about 10% of
the sequences do not have any of the typical patterns and
thus cannot be classified.
For all but six of the 218 assigned classical SDRs, the
coenzyme specificity is correctly predicted, as judged by
agreement with the annotations in the SWISSPROT database
entries. Scrutinizing the six deviating cases, we find that in
four (Dhb1_Human, Dhb7_Mouse, Dhpr_Rat and
Idno_Ecoli) there are experimental studies [14–17] that
support our predictions. The remaining two cases are
sequences involved in fatty acid biosynthesis (Fabg_Thema
and Fag2_Syny3). They are annotated as NADPH-binding
in SWISSPROT, and other proteins of the same functional type
indeed use NADPH as coenzyme. However, in contrast to
them, these two sequences have an aspartic acid at the last
Table 2. Number of classical SDRs within the SWISSPROT and KIND databases, divided into different coenzyme-binding subfamilies. Key position
numbers refer to 3α,20β-hydroxysteroid dehydrogenase (PDB code 2hsd).

                Key positions
Subfamily       15     36     37     38     SWISSPROT    KIND
cD1d            –      D      –      –      64 (25%)     389 (26%)
cD1e            –      E      –      –      2 (1%)       16 (1%)
cD2             –      –      D/E    –      2 (1%)       6 (<1%)
cD3             –      –      –      D/E    8 (3%)       28 (2%)
cP1             K/R    –      –      –      24 (10%)     120 (8%)
cP2             –      –      K/R    –      41 (16%)     280 (19%)
cP3             K/R    –      K/R    –      77 (30%)     530 (35%)
Unclassified    –      –      –      –      35 (14%)     143 (9%)
Total                                       253          1512
position of the second β-strand and are thus predicted to be
NAD(H)-binding by our method (subfamily cD1d). It is still
not experimentally verified if these two sequences bind
NADH or if they bind NADPH in an atypical manner.
Subfamilies within the extended SDR family
The number of experimentally solved 3D structures for the
extended family is lower than for the classical family. At
present, there are two known structures for NAD(H)-
preferring enzymes (UDP-galactose 4-epimerase and
dTDP-glucose 4,6-dehydratase; PDB codes 1ek6 and
1bxk). As for the NAD(H)-preferring enzymes of the
classical type, those of the extended family also present the
acidic residue (at key position 33, Table 3), and it is
concluded to be the exclusive determinant of an
NAD(H)-preferring enzyme. There are two structures of
NADP(H)-preferring enzymes (GDP-fucose synthetase and
ADP-L-glycero-D-mannoheptose 6-epimerase; PDB codes
1bsv and 1eq2). However, when superimposing these
structures the root mean square deviation is 10 Å, and
one of the main differences between the structures is in the
coenzyme-binding region. The second structure (1eq2) is
atypical of the family [18,19], as it prefers NADP(H) but still
has the aspartic acid at the end of the second β-strand
typical of NAD(H)-binding. Thus, the assignment of
NADP(H)-preferring enzymes of the extended type is based
only on the alignment of known annotated members of this
type. In the alignment, we find that the basic residue present
in the Gly-motif among the classical SDRs does not have a
counterpart among the extended SDRs. The second basic
residue, in the loop after the second β-strand, is conserved
among extended SDRs as well (key position 34, Table 3).
For the extended SDRs, two NAD(H)-binding subfamilies
(eD1 and eD2) and one NADP(H)-binding
subfamily (eP1) were defined based on the alignment.
NAD(H)-binding sequences with an acidic residue at the
end of the second β-strand (key position 33, Table 3) are
sorted into the eD1 subfamily and those that have an acidic
residue two positions downchain are sorted into the eD2
subfamily. The eP1 subfamily consists of NADP(H)-binding
sequences that have a basic residue at the first loop
position after the second β-strand (key position 34, Table 3).
Table 3 displays the results when this classification system is
applied to the SWISSPROT and KIND databases. In contrast to
the results for the classical SDRs, a majority of the extended
SDRs are predicted to be NAD(H)-binding rather than
NADP(H)-binding. The NAD(H)-binding enzymes are
twice as many as the NADP(H)-binding ones, indicating
that there are more dehydrogenases than reductases in the
extended SDR family. Around 10% of the sequences lack
charged residues at the deterministic positions.
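The three extended subfamilies can be expressed in the same way as a decision function over key positions 33, 34 and 35. As before, this is only a sketch: the function name is invented, position 35 is taken to be the position "two positions downchain" of position 33 (as in the header of Table 3), and the order of the last two rules is an assumption. Testing position 33 first reflects the observation that sequences with an acidic residue there are predicted to be NAD(H)-binding.

```python
ACIDIC = frozenset("DE")
BASIC = frozenset("KR")


def extended_subfamily(res33, res34, res35):
    """Assign an extended SDR to a coenzyme-binding subfamily.

    Arguments are one-letter residues at key positions 33, 34 and 35
    (numbering of UDP-galactose 4-epimerase, PDB 1ek6).
    """
    if res33 in ACIDIC:   # acidic residue at the end of the second beta-strand
        return "eD1"
    if res34 in BASIC:    # basic residue at the first loop position
        return "eP1"
    if res35 in ACIDIC:   # acidic residue two positions downchain of 33
        return "eD2"
    return "unclassified"
```

With this rule order, a sequence that has Asp at position 33 is returned as eD1 whatever the neighbouring residues are, so an enzyme that prefers NADP(H) despite such an Asp would be predicted as NAD(H)-binding.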
For all but eight of the 118 assigned extended SDRs, the
predicted coenzyme specificities agree with those annotated
in SWISSPROT. There are three ADP-L-glycero-D-mannoheptose
6-epimerases that are predicted to be NAD(H)-binding.
The sequences harbour an aspartic acid residue at the
NAD(H)-deterministic position, but these enzymes prefer
NADP(H) rather than NAD(H). The structure of the
E. coli enzyme (1eq2) shows that the Asp residue is in a
more open conformation in contrast to other NAD(H)-
preferring enzymes, and that therefore NADP(H) can be
accommodated [18,19]. There are five other sequences
where the predicted coenzyme preferences are in disagree-
ment with the annotated preferences. One enzyme (galac-
tose epimerase, Gale_Vibch) is predicted to prefer
NADP(H), but as the galactose epimerases normally prefer
NAD(H), the prediction is probably misled by a misalignment
due to a deletion of nine residues. Another
NADP(H)-predicted sequence (Noel_Rhifr) is annotated as
NAD(H)-preferring, but also as a mannose dehydratase,
an enzyme type that in general prefers NADP(H) to NAD(H). There are
no experimental data to support either alternative. The last
three sequences are dTDP-4-dehydrorhamnose reductases
(Rbd1_Ecoli, Rbd2_Ecoli and Rfbd_Salty) with around
80% pair-wise residue identity. They are predicted to be
NAD(H)-preferring but are annotated to be NADP(H)-
preferring. However, the enzyme from S. enterica
(Rfbd_Salty) has been shown to have dual coenzyme
specificity, with a slight preference for NADH [20].
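The agreement between predictions and database annotations reported in the two preceding sections can be restated as fractions. The short calculation below uses only the counts given in the text; the function name is introduced here for illustration.

```python
def agreement(assigned, deviating):
    """Fraction of assigned sequences whose prediction matches the annotation."""
    return (assigned - deviating) / assigned


# Classical SDRs: 218 assigned, 6 deviating from the annotation.
classical = agreement(218, 6)            # 212/218

# Four of the six classical deviations are supported by experimental
# studies, leaving two cases that are not experimentally resolved.
classical_supported = agreement(218, 2)  # 216/218

# Extended SDRs: 118 assigned, 8 deviating from the annotation.
extended = agreement(118, 8)             # 110/118

for label, value in [("classical", classical),
                     ("classical, experimentally supported", classical_supported),
                     ("extended", extended)]:
    print(f"{label}: {value:.1%}")
```

This gives about 97% agreement for classical SDRs (about 99% when the four experimentally supported predictions are counted as correct) and about 93% for extended SDRs.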
Application to genome data
We also applied our method to six of the genome databases
available, i.e. human [21], mouse (July 2001; Celera
Genomics, Rockville, MD), C. elegans [22], D. melanogaster
[23], A. thaliana [24] and S. cerevisiae [25]. In Fig. 4, results
of the assignments are displayed. The numbers of SDRs
found are similar when comparing the human and mouse
genomes. These genomes were released recently and cannot
be considered to be complete. Thus, the number of SDRs in
these genomes can be expected to increase [26].
For the human and mouse genomes, the distribution
between classical (gray) and extended (white) families is
similar to that in the general protein databases, where the
extended members amount to around 25% or less of the
total SDR number. However, in the S. cerevisiae and
A. thaliana genomes about 40% of the SDR forms are
Table 3. Number of extended SDRs within the SWISSPROT and KIND databases, assigned to different coenzyme-binding subfamilies. Key position
numbers refer to UDP-galactose 4-epimerase (PDB code 1ek6).

                Key positions
Subfamily       33     34     35     SWISSPROT    KIND
eD1             D/E    –      –      79 (63%)     469 (55%)
eD2             –      –      D/E    3 (2%)       9 (1%)
eP1             –      K/R    –      36 (29%)     277 (32%)
Unclassified    –      –      –      7 (6%)       101 (12%)
Total                                125          856
extended. Yeast has a much smaller genome than the others
with only 19 SDRs in total, and the seven extended SDRs
might reflect a critical minimum of extended SDRs [2]. In
the plant (A. thaliana) genome the extended members are
close to half of the total SDR forms, reflecting the different
metabolic requirements in plants involving several carbo-
hydrate rearrangements. The total number of SDR forms is
greater in A. thaliana than in other species, compatible with
the large number of gene duplications in plants [27].
However, the ratio between extended and classical forms
is still the same when reducing the data set for homology at
the 60% and 80% levels.
The absolute numbers of extended SDRs are similar in
the animal species (10–18). The number of classical SDRs is
between 39 and 48 in human, mouse and fruit fly, while the
worm has 72 classical SDRs. The worm shows a consid-
erable gene duplication tendency [28], which if affecting
classical and extended SDRs differently could explain this
difference.
Also shown in Fig. 4 are the results of the subfamily
assignments within the classical and extended SDRs. The pie
charts show the relative number of NAD(H)-preferring
sequences (lined pattern) vs. NADP(H)-preferring sequences
(solid) in each genome. The number of NAD(H)-dependent
SDRs is close to the number of NADP(H)-dependent SDRs
in human, mouse and A. thaliana. In contrast, the NAD(H)-
dependent enzymes amount to only one quarter in fruit fly
and one eighth in worm and yeast.
The observation that classical SDRs most frequently
utilize NADP(H) is remarkable. In the worm genome, 60
sequences are sorted into the NADP(H) classes, while only
eight are sorted into NAD(H) classes. For extended SDRs,
the observation that most of them in general are NAD(H)-
dependent is not valid for fruit fly and yeast, where most
extended SDRs instead bind NADP(H), and A. thaliana,
where the numbers of NAD(H)- and NADP(H)-dependent
forms are close to equal (34 vs. 27).
DISCUSSION
Database quality considerations
Our method for functional assignments was applied to
completed eukaryotic genomes, revealing that the SDR
subfamily patterns vary considerably between different
species. However, the genome databases are often prelimi-
nary and contain errors. Exons might be missing resulting in
partial sequences. Falsely ascribed exon borders will result
in sequences with erroneous deletions and/or insertions. A
motif-based method that depends on a correct
alignment is of course sensitive to these types of error.
Still, bearing in mind that several genome sequences are
preliminary, this type of classification is valuable to deduce
early functional assignments.
Automated annotation methods are developed to assign
functions to newly sequenced proteins. A drawback with
automated annotation is that errors might be introduced
[29]. Manual annotation should be of higher quality but is
very time-consuming, which leads to difficulties in keeping
up the pace with the genome sequencing projects. In this
study, we detected some errors in annotation of coenzyme
specificity in SWISSPROT, a database that is manually
annotated and thereby believed to be reliable. There were
three different types of error between the keywords and the
references in these database entries. First, the quoted
publications reported different coenzyme specificities, but
the keywords only mentioned one of them. Second, there
were entries where the quoted publications stated one type
of coenzyme while the keyword stated a different type.
Third, there were entries where the keywords reported a
coenzyme specificity without any verifying reference, and
the keywords did not say ‘probable’ or ‘by similarity’, or any
other word to inform about the uncertainty. Thus, it is still
necessary to perform database assignment checks, and the
present method is useful for this purpose, in addition to its
value in primary assignments.
Classical SDRs vs. extended SDRs
The multiple sequence alignments of classical and extended
SDRs (Fig. 5) show that even though these families are
highly divergent, there are conserved regions that can serve
as fingerprints in the identification of novel SDR members
(Fig. 6). In these regions, used to identify classical
and extended SDR family members (see Materials and
methods), some motifs are of special interest. These are
listed in Table 4. In the N-terminal region, we find the
pattern of three glycine residues that is characteristic of
NAD(P)(H)-binding enzymes. These residues are spaced
differently in classical and extended SDRs (Table 4).
Fig. 4. Classical and extended SDRs and their
coenzyme preference shown for the genomes
investigated. The pie charts display the pro-
portions between classical (gray) and extended
(white) SDRs with specificity for NAD(H)
(lined pattern) and NADP(H) (solid), for each
of the six genomes studied. The number of
SDR enzymes with their coenzyme-specificity
assigned is given within parentheses.
In both families there is a conserved aspartic acid residue,
in the loop between β3 and α3, required for stabilization of
the adenine-binding pocket [13,30]. In the extended family
this residue is often followed by another charged residue two
positions downchain.
The motif positioned in and adjacent to β4 (Table 4) is
less conserved among extended SDRs compared to classical
SDRs. Typically, extended SDRs prefer a histidine residue
rather than an asparagine residue at the end of this β-strand.
In classical SDRs, the NNAG motif has a role to stabilize
the β-strands within the central β-sheet and to position this
central β-sheet [30].
There is a motif in α4 that is especially well conserved
among the extended SDRs. The α4 motif is also conserved
among the classical SDRs. Here, the asparagine residue is
involved in building the active site geometry by positioning
the lysine residue and being part of a postulated proton
relay [30].
The active site residues in β5 and α5 (serine, tyrosine and
lysine) are found in both classical and extended SDRs. The
extended SDRs have a conserved proline residue preceding
the tyrosine residue, and also a conserved negatively
charged residue four residues downchain of the lysine
residue. Neither of these two residues is conserved in the
Fig. 6. 3D structure of a classical SDR enzyme
with motifs indicated. The spheres show the
coenzyme-deterministic positions for
NAD(H) in red and NADP(H) in blue.
Regions used to identify SDR members (cf.
Figure 2) are shown by blue ribbons. The
coenzyme is coloured magenta. The structure
is 3α,20β-hydroxysteroid dehydrogenase
(PDB code 2hsd). The figure was made using
the programme ICM.
Fig. 5. Multiple sequence alignments of classical and extended SDRs. The first three columns give the SWISSPROT sequence identifier, PDB identifier
and subfamily membership. The secondary structure elements of 3α,20β-hydroxysteroid dehydrogenase (PDB code 2hsd) are shown above the
classical SDR alignment, while the secondary structure elements of UDP-galactose 4-epimerase (PDB code 1ek6) are shown below the extended SDR
alignment. Boxed residues denote key positions in coenzyme binding. Coloured residues represent conservation of 60%, as calculated for a larger
data set (red = acidic, green = polar, light blue = hydrophobic, dark blue = basic, purple = Gly or Pro). Arrows 1, 2 and 3 above the alignment
show the key positions 15, 36 and 37 (cf. Table 2). Arrows 1, 2 and 3 below the alignment show the key positions 33, 34 and 35 (cf. Table 3).
Ó FEBS 2002 Coenzyme-basedfunctionalassignments of SDRs (Eur. J. Biochem. 269) 4415
classical family, instead, they have a conserved aspartic acid
residue about 13 positions downchain from the lysine
residue.
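The motif notation of Table 4 maps naturally onto regular expressions, which may help when scanning a sequence for these motifs. The following is only a minimal sketch, not the procedure used in this work: the residue sets assigned to the symbols a, c, h and p, the function names and the toy sequences are all assumptions made for illustration.

```python
import re

# Residue sets for the lower-case symbols used in Table 4.
# ASSUMPTION: the memberships below are illustrative choices for this
# sketch; the text does not list which amino acids belong to each class.
RESIDUE_CLASSES = {
    "a": "FWY",        # aromatic
    "c": "DEKR",       # charged
    "h": "ACFILMVWY",  # hydrophobic
    "p": "HNQST",      # polar
}


def motif_to_regex(motif: str) -> str:
    """Translate a Table 4 style motif into a regular expression.

    Upper-case letters are literal amino acids, 'x' is any residue,
    a/c/h/p are residue classes, and [..] lists alternatives.
    """
    parts = []
    i = 0
    while i < len(motif):
        ch = motif[i]
        if ch == "[":
            j = motif.index("]", i)  # raises ValueError if unterminated
            members = "".join(RESIDUE_CLASSES.get(g, g) for g in motif[i + 1:j])
            parts.append("[" + members + "]")
            i = j + 1
            continue
        if ch == "x":
            parts.append(".")
        elif ch in RESIDUE_CLASSES:
            parts.append("[" + RESIDUE_CLASSES[ch] + "]")
        else:
            parts.append(ch)
        i += 1
    return "".join(parts)


def find_motif(sequence: str, motif: str) -> list:
    """Return 0-based start positions of all (possibly overlapping) matches."""
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence)]


# Toy sequence (made up for this example), scanned with the classical
# alpha5 active site motif from Table 4.
print(find_motif("MKTYSASKAA", "Yx[AS][ST]K"))  # -> [3]
```

A plain pattern scan of this kind is sensitive to the chosen residue sets, so in practice it would presumably be combined with a more tolerant method such as the hidden Markov models mentioned in the summary.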
Coenzyme specificity as classification basis
The two-level classification system divides members of the
SDR superfamily into families and subfamilies, using a
motif-based approach. For the five families detected at the
first level – classical, extended, intermediate, divergent and
complex – specific sequence patterns were extracted
(Table 2). The patterns for families with few and/or closely
related members (i.e. the intermediate, divergent and
complex families) might need to be updated when
further members are added, to avoid a bias towards the
presently known sequences.
At the second level, the sequences belonging to the
classical and extended families were further divided into
seven and three subfamilies, respectively. These subfamilies
were defined based on coenzyme specificity and patterns of
charged residues in the coenzyme-binding region. For example,
the human 17β-hydroxysteroid dehydrogenase type 1 is an
NADP(H)-preferring enzyme with a serine residue (Ser12)
at the position before the second glycine residue of the
glycine motif and an arginine residue (Arg37) at the
first position after the second β-strand. Site-directed
mutagenesis experiments show that an exchange of Ser12 for
lysine increased the specificity for NADP(H), while a
substitution of Leu36 by aspartic acid changed the
preference from NADP(H) to NAD(H) [34], supporting the
crystallographic analysis and our motif-based assignments.
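The reasoning in the 17β-hydroxysteroid dehydrogenase type 1 example can be written down as a small decision rule. The sketch below is a deliberately simplified reading of that single example, not the subfamily patterns of Table 2: the function name, the argument names, the residue sets and the rule that an acidic residue at the end of the second β-strand takes precedence are assumptions, the last one chosen because the Leu36 to aspartic acid variant is reported to prefer NAD(H) although Arg37 is still present.

```python
ACIDIC = set("DE")
BASIC = set("KR")


def predict_coenzyme(gly_motif_res: str, beta2_end_res: str, after_beta2_res: str) -> str:
    """Guess the coenzyme preference of a classical SDR from three residues.

    gly_motif_res   residue before the second glycine of the glycine motif
                    (Ser12 in human 17beta-HSD type 1)
    beta2_end_res   residue at the end of the second beta-strand (Leu36)
    after_beta2_res residue at the first position after that strand (Arg37)

    ASSUMPTION: a simplified rule for illustration only.
    """
    if beta2_end_res in ACIDIC:
        # An acidic residue here is taken to favour NAD(H).
        return "NAD(H)"
    if gly_motif_res in BASIC or after_beta2_res in BASIC:
        # A basic residue at either flanking position is taken to favour NADP(H).
        return "NADP(H)"
    # No charged residue at the key positions: no call is made.
    return "undetermined"


print(predict_coenzyme("S", "L", "R"))  # wild type: NADP(H)
print(predict_coenzyme("S", "D", "R"))  # Leu36 -> Asp variant: NAD(H)
```

Enzymes without charged residues at these positions fall into the "undetermined" branch, so a rule of this kind makes no prediction for them.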
The specificity might also depend on factors other than
the sequence patterns defined thus far. Some enzymes show
dual coenzyme specificity and might bind alternative
coenzymes in different tissues and in different cellular
compartments. Molecular modelling using docking
calculations might be helpful in the prediction of coenzyme
preference [35].
There are members of the classical type for which no motifs for
coenzyme specificity could be established, as no charged residues
are found at the key positions otherwise identified as crucial
for this task (Table 2). This is the situation for
11β-hydroxysteroid dehydrogenases type 2 and human
17β-hydroxysteroid dehydrogenase type 2. However, charged residues are
found further downchain, and their roles might be clarified
when the 3D structures become known. The retinol
dehydrogenases (RDH) constitute a group where experiments
show that bovine RDH is NAD+-dependent [36], while the
rat RDH is NADP+-dependent [37]. These two sequences
are very similar in the Gly-region and identical at the
positions used to distinguish between NAD(H) and
NADP(H) enzymes. Based on homology modelling of rat
and bovine RDH [38], a basic residue further downchain
(Lys64) in rat RDH is believed to enable NADP+ to bind.
The corresponding residue in bovine RDH is polar (Thr61). Only
when their respective 3D structures have been experimentally
determined will it be possible to establish which residues
are responsible for distinguishing between NAD(H) and
NADP(H) specificity in these enzymes.
In summary, we have shown that functional assignments
can be made and coenzyme preferences can be predicted
from the amino acid sequence alone for SDR enzymes. For
this divergent superfamily, we could distinguish families and
subfamilies, which will help future assignments. The present
approach using hidden Markov models and sequence
patterns is general and can be extended to further enzyme
families.
ACKNOWLEDGEMENTS
Financial support from the Swedish Research Council, the Swedish
Foundation for Strategic Research, the Swedish Society for Medical
Research, the Swedish Society of Medicine, the Novo Nordisk
Foundation and Karolinska Institutet is gratefully acknowledged.
REFERENCES
1. Jörnvall, H., Persson, M. & Jeffery, J. (1981) Alcohol and
polyol dehydrogenases are both divided into two protein
types, and structural properties cross-relate the different enzyme
activities within each type. Proc. Natl Acad. Sci. USA 78,
4226–4230.
2. Jörnvall, H., Höög, J.-O. & Persson, B. (1999) SDR and MDR:
completed genome sequences show these protein families to be
large, of old origin, and of complex nature. FEBS Lett. 445,
261–264.
3. Jörnvall, H., Persson, B., Krook, M., Atrian, S.,
Gonzalez-Duarte, R., Jeffery, J. & Ghosh, D. (1995) Short-chain
dehydrogenases/reductases (SDR). Biochemistry 34, 6003–6013.
4. Karplus, K., Barrett, C. & Hughey, R. (1998) Hidden Markov
models for detecting remote protein homologies. Bioinformatics
14, 846–856.
5. Bairoch, A. & Apweiler, R. (2000) The SWISS-PROT protein
sequence database and its supplement TrEMBL in 2000. Nucleic
Acids Res. 28, 45–48.
Table 4. Conserved sequence motifs in the classical and the extended SDR families. In the motifs, 'a' denotes an aromatic residue, 'c' a charged
residue, 'h' a hydrophobic residue, 'p' a polar residue and 'x' any residue. Alternative amino acids at a motif position are given within brackets.
Secondary structure element  Classical SDR motif   Extended SDR motif   Suggested function                              Reference
β1+α1                        TGxxxGhG              TGxxGhaG             Structural role in coenzyme binding region      [1,2,31]
β3+α3                        Dhx[cp]               DhxD                 Adenine ring binding of coenzyme                [30]
β4                           GxhDhhhNNAGh          [DE]xhhHxAA          Structural role in stabilizing central β-sheet  [30]
α4                           hNhxG                 hNhhGTxxhhc          Part of active site                             [30]
β5                           GxhhxhSSh             hhhxSSxxhaG          Part of active site                             [2,31]
α5                           Yx[AS][ST]K           PYxx[AS]Kxxh[DE]     Part of active site                             [2,31]
β6                           h[KR]h[NS]xhxPGxxxT   h[KR]xxNGP           Structural role, reaction direction             [32,33]
6. Kallberg, Y. & Persson, B. (1999) KIND – a nonredundant pro-
tein database. Bioinformatics 15, 260–261.
7. Abagyan, R. & Totrov, M. (1994) Biased probability Monte Carlo
conformational searches and electrostatic calculations for peptides
and proteins. J. Mol. Biol. 235, 983–1002.
8. Stewart, M.J., Parikh, S., Xiao, G., Tonge, P.J. & Kisker, C.
(1999) Structural basis and mechanism of enoyl reductase inhibi-
tion by triclosan. J. Mol. Biol. 290, 859–865.
9. Rozwarski, D.A., Vilcheze, C., Sugantino, M., Bittman, R. &
Sacchettini, J.C. (1999) Crystal structure of the Mycobacterium
tuberculosis enoyl-ACP reductase, InhA, in complex with NAD+
and a C16 fatty acyl substrate. J. Biol. Chem. 274, 15582–15589.
10. Rossmann, M.G., Liljas, A., Brändén, C.-I. & Banaszak, L.J.
(1975) The Enzymes, 3rd edn. (Boyer, P.D., eds), Vol. 11,
pp. 61–102. Academic Press, New York.
11. Wierenga, R.K., de Maeyer, M.C. & Hol, W.G. (1985) Interaction
of pyrophosphate moieties with a-helices in dinucleotide binding
proteins. Biochemistry 24, 1346–1357.
12. Wierenga, R.K., Terpstra, P. & Hol, W.G. (1986) Prediction of the
occurrence of the ADP-binding beta alpha beta-fold in proteins,
using an amino acid sequence fingerprint. J. Mol. Biol. 187, 101–
107.
13. Tanaka, N., Nonaka, T., Nakanishi, M., Deyashiki, Y., Hara, A.
& Mitsui, Y. (1996) Crystal structure of the ternary complex of
mouse lung carbonyl reductase at 1.8 Å resolution: the structural
origin of coenzyme specificity in the short-chain dehydrogenase/
reductase family. Structure 4, 33–45.
14. Breton, R., Housset, D., Mazza, C. & Fontecilla-Camps, J.C.
(1996) The structure of a complex of human 17beta-hydroxysteroid
dehydrogenase with estradiol and NADP+ identifies two
principal targets for the design of inhibitors. Structure 4, 905–915.
15. Nokelainen, P., Peltoketo, H., Vihko, R. & Vihko, P. (1998)
Expression cloning of a novel estrogenic mouse 17 beta-
hydroxysteroid dehydrogenase/17-ketosteroid reductase
(m17HSD7), previously described as a prolactin receptor-associ-
ated protein (PRAP) in rat. Mol. Endocrinol. 12, 1048–1059.
16. Varughese, K.I., Skinner, M.M., Whiteley, J.M., Matthews, D.A.
& Xuong, N.H. (1992) Crystal structure of rat liver dihydropter-
idine reductase. Proc. Natl Acad. Sci. USA. 89, 6080–6084.
17. Bausch, C., Peekhaus, N., Utz, C., Blais, T., Murray, E.,
Lowary, T. & Conway, T. (1998) Sequence analysis of the GntII
(subsidiary) system for gluconate metabolism reveals a novel
pathway for L-idonic acid catabolism in Escherichia coli.
J. Bacteriol. 180, 3704–3710.
18. Deacon, A.M., Ni, Y.S., Coleman, W.G. Jr & Ealick, S.E.
(2000) The crystal structure of ADP-L-glycero-D-mannoheptose
6-epimerase: catalysis with a twist. Structure Fold. Des. 8, 453–462.
19. Ni, Y., McPhie, P., Deacon, A., Ealick, S. & Coleman, W.G. Jr
(2001) Evidence that NADP+ is the physiological cofactor of
ADP-L-glycero-D-mannoheptose 6-epimerase. J. Biol. Chem. 276,
27329–27334.
20. Graninger, M., Nidetzky, B., Heinrichs, D.E., Whitfield, C. &
Messner, P. (1999) Characterization of dTDP-4-dehydrorhamnose
3,5-epimerase and dTDP-4-dehydrorhamnose reductase,
required for dTDP-L-rhamnose biosynthesis in Salmonella enterica
serovar Typhimurium LT2. J. Biol. Chem. 274, 25069–25077.
21. Venter, J.C. et al. (2001) The sequence of the human genome.
Science 291, 1304–1351.
22. Wilson, R.K. (1999) How the worm was won. The C. elegans
genome sequencing project. Trends Genet. 15, 51–58.
23. Adams, M.D. et al. (2000) The genome sequence of Drosophila
melanogaster. Science 287, 2185–2195.
24. Huala, E. et al. (2001) The Arabidopsis Information Resource
(TAIR): a comprehensive database and web-based information
retrieval, analysis, and visualization system for a model plant.
Nucleic Acids Res. 29, 102–105.
25. Mewes, H.W. et al. (1997) Overview of the yeast genome. Nature
387, 7–65.
26. Kallberg, Y., Oppermann, U., Jörnvall, H. & Persson, B. (2002)
Short-chain dehydrogenase/reductase (SDR) relationships: a large
family with eight clusters common to human, animal, and plant
genomes. Protein Sci. 11, 636–641.
27. Bancroft, I. (2000) Insights into the structural and functional
evolution of plant genomes afforded by the nucleotide sequences
of chromosomes 2 and 4 of Arabidopsis thaliana. Yeast 17, 1–5.
28. Semple, C. & Wolfe, K.H. (1999) Gene duplication and gene
conversion in the Caenorhabditis elegans genome. J. Mol. Evol. 48,
555–564.
29. Devos, D. & Valencia, A. (2001) Intrinsic errors in genome
annotation. Trends Genet. 17, 429–431.
30. Filling, C., Berndt, K.D., Benach, J., Knapp, S., Prozorovski, T.,
Nordling, E., Ladenstein, R., Jörnvall, H. & Oppermann, U. (2002)
Critical residues for structure and catalysis in short-chain
dehydrogenases/reductases (SDR). J. Biol. Chem. 277, 25677–25684.
31. Oppermann, U.C., Filling, C., Berndt, K.D., Persson, B.,
Benach, J., Ladenstein, R. & Jörnvall, H. (1997) Active site
directed mutagenesis of 3 beta/17 beta-hydroxysteroid
dehydrogenase establishes differential effects on short-chain
dehydrogenase/reductase reactions. Biochemistry 36, 34–40.
32. Filling, C., Nordling, E., Benach, J., Berndt, K.D., Ladenstein, R.,
Jörnvall, H. & Oppermann, U. (2001) Structural role of conserved
Asn179 in the short-chain dehydrogenase/reductase scaffold.
Biochem. Biophys. Res. Commun. 289, 712–717.
33. Ghosh, D. & Vihko, P. (2001) Molecular mechanisms of
estrogen recognition and 17-keto reduction by human 17beta-
hydroxysteroid dehydrogenase 1. Chem. Biol. Interact. 130–132,
637–650.
34. Huang, Y.W., Pineau, I., Chang, H.J., Azzi, A., Bellemare, V.,
Laberge, S. & Lin, S.X. (2001) Critical residues for the specifi-
city of cofactors and substrates in human estrogenic 17beta-
hydroxysteroid dehydrogenase 1: variants designed from the
three-dimensional structure of the enzyme. Mol. Endocrinol. 11,
2010–2020.
35. Peralba, J.M., Cederlund, E., Crosas, B., Moreno, A., Julià, P.,
Martínez, S.E., Persson, B., Farrés, J., Parés, X. & Jörnvall, H.
(1999) An NADP(H)-dependent stomach alcohol dehydrogenase.
Structural and enzymatic properties of a gastric NADP(H)-
dependent and retinal-active alcohol dehydrogenase. J. Biol.
Chem. 274, 26021–26026.
36. Simon, A., Hellman, U., Wernstedt, C. & Eriksson, U. (1995) The
retinal pigment epithelial-specific 11-cis retinol dehydrogenase
belongs to the family of short chain alcohol dehydrogenases.
J. Biol. Chem. 270, 1107–1112.
37. Chai, X., Boerman, M.H., Zhai, Y. & Napoli, J.L. (1995) Cloning
of a cDNA for liver microsomal retinol dehydrogenase. A tissue-
specific, short-chain alcohol dehydrogenase. J. Biol. Chem. 270,
3900–3904.
38. Tsigelny, I. & Baker, M.E. (1996) Structures important in
NAD(P)(H) specificity for mammalian retinol and 11-cis-retinol
dehydrogenases. Biochem. Biophys. Res. Commun. 226, 118–127.