Prediction of coenzyme specificity in dehydrogenases/reductases
A hidden Markov model-based method and its application on complete genomes
Yvonne Kallberg 1,2 and Bengt Persson 1,2
1 IFM Bioinformatics, Linköping University, Sweden
2 Centre for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, Sweden
Dehydrogenases and reductases are enzymes of fundamental metabolic importance that often adopt a specific structure known as the Rossmann fold. This fold, consisting of a six-stranded β-sheet surrounded by α-helices, is responsible for coenzyme binding. We have developed a method to identify Rossmann folds and predict their coenzyme specificity (NAD, NADP or FAD) using only the amino acid sequence as input. The method is based upon hidden Markov models and sequence pattern analysis. The prediction sensitivity is 79% and the selectivity close to 100%. The method was applied on a set of 68 genomes, representing the three kingdoms archaea, bacteria and eukaryota. In prokaryotes, 3% of the genes were found to code for Rossmann-fold proteins, while the corresponding ratio in eukaryotes is only around 1%. In all genomes, NAD is the most preferred cofactor (41–49%), followed by NADP with 30–38%, while FAD is the least preferred cofactor (21%). However, the NAD preponderance over NADP is most pronounced in archaea, and least in eukaryotes. In all three kingdoms, only 3–8% of the Rossmann proteins are predicted to have more than one membrane-spanning segment, which is much lower than the frequency of membrane proteins in general. Analysis of the major protein types in eukaryotes reveals that the most common type (26%) of the Rossmann proteins are short-chain dehydrogenases/reductases. In addition, the identified Rossmann proteins were analyzed with respect to further protein types, enzyme classes and redundancy. The described method is available at http://www.ifm.liu.se/bioinfo, where the preferred coenzyme and its binding region are predicted given an amino acid sequence as input.

Keywords
bioinformatics; coenzyme specificity; hidden Markov model; prediction; Rossmann fold

Correspondence
B. Persson, IFM Bioinformatics, Linköping University, S-581 83 Linköping, Sweden
Fax: +46 13 137568
Tel: +46 13 282983
E-mail: bpn@ifm.liu.se

(Received 13 December 2005, revised 17 January 2006, accepted 23 January 2006)
doi:10.1111/j.1742-4658.2006.05153.x

Abbreviations
ORF, open reading frame; SDR, short-chain dehydrogenase/reductase; TM, transmembrane.

FEBS Journal 273 (2006) 1177–1184 © 2006 The Authors Journal compilation © 2006 FEBS

Dehydrogenases and reductases are enzymes of fundamental metabolic importance that utilize coenzymes for electron transport (NAD(H), NADP(H) or FAD(H2), herein denoted NAD, NADP and FAD). The enzymes bind the coenzyme through a double βαβαβ fold, resulting in a six-stranded β-sheet surrounded by α-helices, known as the Rossmann fold [1]. This domain is often found in combination with other domains of different folding types either on the N-terminal side, C-terminal side, or interrupting the Rossmann fold [2]. For example, glutathione reductases have two domains of the Rossmann-fold type, one FAD-binding domain that is interrupted by an NAD(P)-binding domain (PDB code 3grs [3]). 6-Phosphogluconate dehydrogenases have an NADP-binding domain of the Rossmann-fold type followed by a C-terminal catalytic domain consisting of α-helices only (PDB code 2pgd [4]).
In the first part of the Rossmann fold (β1α1β2), there are three glycine residues surrounded by hydrophobic residues, with the first glycine at the end of the β1 strand and the other two at the beginning of the α1 helix (Fig. 3, top right, Experimental procedures). The first two glycine residues are involved in dinucleotide binding, while the third is involved in the close packing of the β-strands and the α-helix [5]. Most of the early characterized dehydrogenases/reductases showed a spacing of these glycine residues in a GxGxxG pattern, where ‘x’ denotes any residue [5,6]. However, as new members of this fold have been recognized, the general pattern is now described as Gx(x)Gx(x)G [7], i.e. the spacing between the glycine residues can be one or two residues. The members of the extended short-chain dehydrogenase/reductase (SDR) family have this GxxGxxG pattern, whereas the classical SDRs still do not fit into the description, since they instead have a GxxxGxG pattern ([8] and references therein).
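The glycine patterns above translate directly into regular expressions. The following is a minimal sketch, not the method described in this paper: the function name, the dictionary and the two test sequences are hypothetical, and a real search would also have to consider the hydrophobic residues surrounding the glycines.

```python
import re

# 'x' (any residue) in the motif notation becomes '.' in the regular expression.
MOTIFS = {
    "Gx(x)Gx(x)G": r"G.{1,2}G.{1,2}G",  # general pattern [7]
    "GxGxxG": r"G.G..G",                # early characterized enzymes [5,6]
    "GxxxGxG": r"G...G.G",              # classical SDRs [8]
}


def find_motifs(sequence):
    """Return (motif name, 0-based start, matched text) for each hit."""
    hits = []
    for name, pattern in MOTIFS.items():
        # The zero-width lookahead reports hits that overlap each other;
        # at most one hit per start position is returned for each motif.
        for match in re.finditer(f"(?=({pattern}))", sequence):
            hits.append((name, match.start(), match.group(1)))
    return hits


# Hypothetical sequence fragments, invented for illustration only.
print(find_motifs("AVIGAGTIGLE"))  # GxGxxG-type spacing
print(find_motifs("TGASSGIGLA"))   # classical-SDR-type spacing
```

In this sketch the second fragment is found only by the GxxxGxG expression, which mirrors the remark that the classical SDRs fall outside the general Gx(x)Gx(x)G description.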
The residues at the end of the β2 strand normally guide identification of the nature of the coenzyme, i.e. if an enzyme binds FAD, NAD or NADP. In general, the presence of a negatively charged residue indicates that FAD or NAD is the preferred cofactor [5], since such a residue sterically hinders accommodation of the additional 2′-phosphate found in NADP. NADP-preferring enzymes typically have a basic residue one position down-chain instead [5]. Among the classical SDRs, a basic residue at the position preceding the second glycine residue in the Gly-pattern also indicates that the enzyme prefers NADP over NAD [8].
A more difficult task is to distinguish between the coenzyme types FAD and NAD. Most NAD-preferring enzymes have an aspartic acid residue at the end of the β2-strand, while FAD-preferring enzymes instead have a glutamic acid residue at this position. However, there are exceptions in both cases that prevent this feature from being used to differentiate between the two types.
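The residue rules of thumb from the two preceding paragraphs can be written down as a small lookup. This is only an illustrative sketch and not the classification step of the method presented here: the function name and return strings are hypothetical, treating only Lys and Arg as basic is an assumption, and, as stated above, the Asp/Glu distinction has exceptions in both directions.

```python
ACIDIC = {"D", "E"}
BASIC = {"K", "R"}  # assumption: histidine is not counted as basic here


def rule_of_thumb(beta2_end, next_residue):
    """Guess the coenzyme from the residue at the end of the beta-2 strand
    and the residue one position down-chain (one-letter codes)."""
    if beta2_end in ACIDIC:
        # Asp leans towards NAD and Glu towards FAD, with known exceptions.
        return "NAD (or FAD)" if beta2_end == "D" else "FAD (or NAD)"
    if next_residue in BASIC:
        return "NADP"
    return "undetermined"


print(rule_of_thumb("D", "I"))
print(rule_of_thumb("S", "R"))
```

The hedged return values are deliberate: the text makes clear that a single position cannot separate FAD from NAD reliably.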
We have now developed a method that from the amino acid sequence alone identifies a protein with coenzyme binding of the Rossmann type, and predicts the coenzyme specificity. The method is applied to all eukaryotic and archaeal genomes and a representative set of bacterial genomes.
Results and discussion
We have developed a method for prediction of coenzyme specificity, based upon hidden Markov models (HMMs) and sequence motifs (see Experimental procedures). To the best of our knowledge there is no prediction method available with the same applicability as
the one presented here. A search in InterPro [9] using
key words such as ‘Rossmann’, ‘NAD’, ‘NADP’ and
‘FAD’ reveals many entries but there is no single entry
which can be used to identify the motifs of interest.
While most entries are on protein family level, there
are some on domain level as well, e.g. ‘NAD_BS’
(identifier IPR000205) which identifies NAD binding
sites. However, this motif only identifies 29 gene prod-
ucts in the human Ensembl [10] database, a number
far below what could be expected.
Rossmann fold in completed genomes
The new method was applied to a selection of 68 com-
pleted genomes, representing archaea, bacteria and
eukaryota. In total, around 9200 Rossmann proteins
were identified in these genomes. The median numbers
of Rossmann proteins in each organism within eukary-
otes, bacteria and archaea are 196, 67 and 59, respect-
ively, corresponding to 1% of the eukaryotic proteins
and 3% of the prokaryotic proteins. As expected, the
number of predicted coenzyme binding proteins within
a genome increases with its size (Fig. 1). The number
of Rossmann folds has a steep increase for genomes
with up to 10 000 open reading frames (ORFs), while
it levels out for larger genomes. Among eukaryotes,
Oryza sativa is at the top with 655 predicted Ross-
mann proteins, and Trypanosoma brucei is at the bot-
tom with only three Rossmann proteins. In bacteria,
the corresponding extremes are Mycobacterium tuber-
culosis (185 proteins) and Chlamydophila caviae (13
proteins), while in archaea the top and bottom are represented by Haloarcula marismortui (146 proteins) and Nanoarchaeum equitans (five proteins). The genomes of Oryza sativa and Xenopus tropicalis have many more coenzyme binding proteins than the others (655 and 646, respectively), but given the size of their genomes (~61 000 and ~53 000) the proportions are still within the same range as for other eukaryotes. There are four eukaryotic parasites (Plasmodium falciparum, Plasmodium yoelii, Leishmania major and Entamoeba histolytica) for which the ratio of coenzyme binding proteins is much lower than expected, possibly due to their ability to rely on the dehydrogenase/reductase systems of the host organism.

Fig. 1. Number of coenzyme binding proteins in each genome plotted versus number of open reading frames. The number of Rossmann folds increases steeply for genomes with up to 10 000 ORFs, while it levels out for larger genomes. [Scatter plot: horizontal axis, open reading frames (ORFs), 0–70 000; vertical axis, Rossmann folds, 0–800; series: Archaea, Bacteria, Eukaryota.]
Redundancy
Prokaryotic species, with a typical maximum genome
size of 5000 ORFs, have a moderate sequence redund-
ancy among their coenzyme binding proteins. Using a
threshold of maximum 60% pair-wise sequence iden-
tity, 0–10% of the sequences are redundant. Most of
the small eukaryotic genomes have a comparable level
of redundancy. In general, the redundancy of Ross-
mann proteins is similar to that of other proteins in
the genomes. However, there are five genomes which
do not follow this pattern. In Thermoplasma volcanium,
Pyrococcus horikoshii, Thermococcus kodakaraensis,
Candida glabrata and Yarrowia lipolytica, the Ross-
mann proteins are two to three times more redundant
than proteins in general. The redundancy among eukaryotes increases with genome size and is 30–40% for genome sizes around 30 000 ORFs. There are some outliers, e.g. Apis mellifera, with a very high redundancy level of 54% in spite of a rather small genome (~17 000 ORFs), but the redundancy in general in this genome is 46%. Comparing the two plant genomes,
Arabidopsis thaliana and Oryza sativa, we find different
redundancy in general (33% vs. 46%), while the num-
bers are much closer considering Rossmann proteins
only (40% versus 37%).
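The text does not state how the redundant fraction at the 60% identity threshold was computed. One common approach, shown here purely as an assumed sketch, is a greedy filter that keeps a sequence only if it is at most 60% identical to every sequence already kept. The function name, the `identity` callback and the toy identity values are hypothetical.

```python
def redundant_fraction(names, identity, threshold=0.60):
    """Greedy redundancy estimate.

    `identity(a, b)` must return the pairwise sequence identity (0..1).
    A sequence is kept only if it is at most `threshold` identical to
    every sequence kept so far; the discarded fraction is returned.
    """
    kept = []
    for name in names:
        if all(identity(name, other) <= threshold for other in kept):
            kept.append(name)
    return 1 - len(kept) / len(names) if names else 0.0


# Toy example: only proteins 'a' and 'b' are closely related.
pairwise = {frozenset(("a", "b")): 0.90}


def toy_identity(x, y):
    return pairwise.get(frozenset((x, y)), 0.20)


print(redundant_fraction(["a", "b", "c", "d"], toy_identity))  # one of four is redundant
```

The result of a greedy filter depends on the order of the input, so a different ordering or a clustering-based definition could give slightly different percentages.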
Prediction of coenzyme specificity
In general, for all kingdoms, NAD is the specificity
most preferred, while FAD is the least (Table 1). Irres-
pective of kingdom, FAD preference constitutes 21%
on average, while the NAD and NADP ratios vary
somewhat. For nearly all prokaryotic organisms, the
NAD-preferring Rossmann folds are more numerous
than the NADP-preferring (Fig. 2). The only excep-
tions are Lactobacillus acidophilus, Staphylococcus
aureus, Aeropyrum pernix, Pyrobaculum aerophilum,
Sulfolobus tokodaii and Thermococcus kodakaraensis.
However, among eukaryotes it can be seen that for
most species the NAD- and NADP-preferring enzymes
are close to equal in numbers. In plant, worm and
insect, there is a majority of NADP-preferring enzymes while mammals and chicken have a majority of NAD-preferring enzymes. In a previous study of short-chain dehydrogenases/reductases (SDRs) it was found that NADP is more frequent than NAD in human, mouse, fruit fly, worm, plant and yeast [8]. As mentioned
above, this is still valid when including all Rossmann-
fold proteins for the lower organisms, but in human
and mouse the balance is shifted and NAD is the most
frequent coenzyme.
Dual coenzyme sites
Some proteins have two Rossmann binding sites; for example, the flavin monooxygenases with both an FAD and an NAD binding site. Out of the ~9200 proteins predicted to have a Rossmann fold, almost 700 have more than one such fold. For all kingdoms, the fraction of Rossmann proteins with dual sites amounts to 0–10%, with some exceptions. Among the eukaryotes Entamoeba histolytica, Plasmodium falciparum, and Plasmodium yoelii the proportion is 15, 18 and 15%, respectively. The bacterial genome of Chlamydophila caviae also shows a dual-site proportion of 15%, while the archaeal genomes of Thermococcus kodakaraensis and Nanoarchaeum equitans show 17 and 20%, respectively. These high ratios are partly caused by the low number of Rossmann-fold proteins.
Protein families
Among the annotated human Rossmann proteins,
most proteins have EC numbers within main group 1
(oxidoreductases). However, there are several SDRs
and multifunctional enzymes also within groups 3
(hydrolases), 4 (lyases), and 5 (isomerases), reflecting
the versatility of the Rossmann fold.
Among the eukaryotic genomes annotated by Ensembl, 60% of the Rossmann-fold proteins are found to belong to 10 major groups. The SDR superfamily contributes with 26%, and is by far the largest group (Table 2). The three next largest groups are various flavin-binding oxidoreductases with proportions each of around 6%. Closely related species show approximately the same number of proteins within each family, but there are a few notable exceptions. Rat aldehyde dehydrogenases, for instance, are almost twice as frequent as mouse aldehyde dehydrogenases, and FAD-dependent pyridine nucleotide-disulphide oxidoreductases are also more numerous in rat compared to mouse. Another species which deviates from the general pattern is yeast. In this species, the fifth major group, zinc-containing alcohol dehydrogenases, has almost as many members as the SDRs (Table 2).

Table 1. Average coenzyme preference among archaean, bacterial, and eukaryotic genomes.
Kingdom     FAD    NAD    NADP
Archaea     0.21   0.49   0.30
Bacteria    0.21   0.46   0.33
Eukaryota   0.21   0.41   0.38

Fig. 2. Coenzyme preferences in all investigated genomes from eukaryota, bacteria and archaea. The left axis shows numbers of coenzyme binding proteins, and the right axis shows numbers of ORFs. Species names are given on the horizontal axis.

Table 2. The 10 most common types of Rossmann-fold proteins in eukaryotic genomes. The types are listed according to annotation of Pfam families as given in the Ensembl entries. The fish genome is represented by Danio rerio, the fly by Drosophila melanogaster, the worm by Caenorhabditis elegans, and the yeast by Saccharomyces cerevisiae. The total column gives the percentage of all proteins of all types and all species included in the study. The species columns give the number of proteins of each type.
Type                                                          Total proportion (%)  Human  Chimp  Mouse  Rat  Fish  Fly  Worm  Yeast  Sum
Short-chain dehydrogenases/reductases                         26                    71     62     68     67   79    57   75    13     492
FAD-dependent pyridine nucleotide-disulphide oxidoreductases  7                     17     13     16     23   11    11   8     6      105
Flavin-containing amine oxidases                              5                     18     17     12     17   5     8    5     0      82
FAD-dependent oxidoreductases                                 5                     15     14     12     10   5     9    8     1      74
Zinc-containing alcohol dehydrogenases                        4                     12     12     8      11   7     5    6     11     72
Lactate/malate dehydrogenases                                 3                     7      7      8      13   8     3    1     2      49
UBA/THIF-type NAD/FAD binding fold                            3                     10     8      8      8    1     4    3     3      45
Flavin-containing monooxygenases                              3                     7      7      10     10   3     2    5     1      45
D-isomer specific 2-hydroxyacid dehydrogenases                2                     6      5      4      5    7     6    1     6      40
Aldehyde dehydrogenases                                       2                     2      2      6      11   1     2    3     3      30
Transmembrane regions
A number of dehydrogenases and reductases are membrane-attached. The transmembrane (TM) helix can be
found in either the N-terminal part of the protein, as in
11-beta hydroxysteroid dehydrogenase type 1 [11], or
in the C-terminal, as in monoamine oxidase B [12].
There can also be multiple TM helices as, e.g. in the
proton pumping nicotinamide nucleotide transhydroge-
nase, a three domain protein with the first and third
domain binding NAD and NADP, respectively, and
the second domain consisting of 13–14 TM helices [13].
For all Rossmann proteins found in the genomes,
transmembrane regions were predicted (see Experimen-
tal procedures). Rossmann-fold regions are sometimes
falsely predicted as TM regions, due to the hydropho-
bic nature of the fold. In this study, over half (57%)
of the predicted membrane-bound proteins were found
to have at least one TM region predicted in the Ross-
mann fold. These predicted TM segments were there-
fore excluded in this analysis. As the TM prediction
ambiguities are considerable, Rossmann-fold predic-
tions could be used to increase the reliability of TM
predictions.
While the average proportion of membrane proteins
with two transmembrane segments or more is about
15–30% in all kingdoms [14,15], the proportion of
membrane-bound Rossmann-fold proteins only
amounts to 3–8% (Table 3). The proportion of mem-
brane bound proteins with Rossmann fold is about
twice as high in eukaryotes as in prokaryotes. It was
also noticed that the organisms, even closely related
ones, showed considerable variations in how many
Rossmann proteins had TM regions. There are three
parasites with a very high proportion of Rossmann
membrane proteins, Plasmodium falciparum and Plasmodium yoelii with one-third each, and Encephalitozoon cuniculi with as many as five of its six predicted Rossmann proteins also being predicted as membrane proteins.
The majority of proteins were found to harbor one or two TM segments (~800 proteins vs. ~350 proteins with more than two TM helices), with one TM segment being the most usual (~600 proteins). Positioning of the TM segments C-terminally of the coenzyme binding site was twice as common as N-terminal positioning. Looking at differences in TM attachment between the various coenzyme specificities, it was found that NADP-preferring enzymes are the most common type to be membrane bound. Around 44% (~500 proteins) of the Rossmann membrane proteins are NADP-preferring, which is a larger proportion than Rossmann NADP-preferring proteins in general (~36%, Table 4). Inversely, NAD-preferring membrane proteins amount to 33% (~400 proteins), which is lower than the frequency in general (~43%, Table 4). Finally, FAD preference is 15% (close to 200 proteins), also below the general occurrence (~21%). Thus, NADP preference is overrepresented, while NAD and FAD preferences are underrepresented. Protein sequences predicted to have two or more coenzyme binding sites were the least common to be membrane bound, with only ~100 sequences out of ~670 predicted to have TM helices.
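The percentages in this paragraph follow from the counts in Table 4. As a worked check, the short sketch below recomputes the share of each coenzyme class among the Rossmann membrane proteins and the C-terminal versus N-terminal totals. The variable and function names are hypothetical; the numbers are copied from Table 4.

```python
# Counts from Table 4 (Rossmann-fold TM proteins in all 68 genomes).
TABLE_4 = {
    "FAD":  {"1N": 25,  "2N": 7,  "1C": 87,  "2C": 19, ">2": 41},
    "NAD":  {"1N": 71,  "2N": 27, "1C": 115, "2C": 28, ">2": 135},
    "NADP": {"1N": 100, "2N": 17, "1C": 142, "2C": 60, ">2": 186},
    "Dual": {"1N": 17,  "2N": 6,  "1C": 56,  "2C": 9,  ">2": 8},
}


def membrane_shares(table):
    """Fraction of all Rossmann membrane proteins in each coenzyme class."""
    totals = {name: sum(row.values()) for name, row in table.items()}
    grand_total = sum(totals.values())
    return {name: total / grand_total for name, total in totals.items()}


shares = membrane_shares(TABLE_4)
for name, share in shares.items():
    print(f"{name:5s} {share:.1%}")

# Segments placed C-terminally vs. N-terminally of the binding site.
c_terminal = sum(row["1C"] + row["2C"] for row in TABLE_4.values())
n_terminal = sum(row["1N"] + row["2N"] for row in TABLE_4.values())
print(c_terminal, n_terminal)
```

The C-terminal and N-terminal totals come out as 516 and 270, in line with the statement that C-terminal positioning is about twice as common.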
In the human genome, there are 45 Rossmann
proteins with predicted TM regions. The three main
families found among them are the SDRs (27%),
flavin-containing monooxygenases (13%) and F420-
dependent oxidoreductases (11%).
Proteins of the Rossmann-fold type constitute a considerable group with many members. These proteins display great versatility in terms of functions and sequence compositions. In spite of these differences, our study demonstrates the power of sequence-based predictions. It is our hope and belief that the presented prediction tool will be a welcome addition to the arsenal of analysis methods available for large scale protein function exploration. The prediction tool is available via http://www.ifm.liu.se/bioinfo, where a web form allows the user to enter one or several amino acid sequence(s) and in return get the Rossmann-fold prediction with estimated coenzyme preference and position.

Table 3. Proportion of Rossmann-fold membrane proteins, with more than one predicted transmembrane region, compared to membrane proteins in general.
                         Archaea  Bacteria  Eukaryota
Rossmann-fold proteins   0.04     0.03      0.08
All proteins [16]        0.14     0.15      0.14

Table 4. Distribution of various types of Rossmann-fold transmembrane proteins with different coenzyme specificities. 1N and 2N indicate 1 and 2 transmembrane segments N-terminally of the coenzyme binding site. Similarly, 1C and 2C denote 1 and 2 transmembrane segments C-terminally of the coenzyme binding site. >2 TM indicates more than two transmembrane segments, irrespective of the coenzyme binding site location. The numbers include all 68 investigated genomes.
Coenzyme  1N TM  2N TM  1C TM  2C TM  >2 TM  All
FAD       25     7      87     19     41     179
NAD       71     27     115    28     135    376
NADP      100    17     142    60     186    505
Dual      17     6      56     9      8      96
Total     213    57     400    116    370    1156

Fig. 3. Overview of the novel prediction method. Sample sequences of the Rossmann-fold motif are shown (top right). α and β denote secondary structure elements. Arrows indicate positions of critical importance for coenzyme specificity prediction. In the flow chart, the boxes describe the different steps of the method.
Experimental procedures
We have developed a method which identifies coenzyme binding regions in proteins, and also predicts if the specificity is FAD, NAD or NADP. The method is based upon a combination of HMMs and sequence motif matching as outlined in Fig. 3. The HMMs are used to extract a number of potential hits which subsequently are subjected to a filtering process followed by prediction of coenzyme specificity. During the development phase, different combinations of HMMs were tried: one for each type of specificity, one for all, and one for FAD-binding combined with one for NAD(P)-binding proteins. The latter was found to be the best solution in terms of specificity and selectivity. All HMMs were developed using the hmmbuild command in HMMer [17], with the parameters -F and --fast, followed by the hmmcalibrate command.
The ASTRAL database [18], version 1.65 with maximum 30% sequence identity, was used to obtain a trustworthy test set. The selected proteins belong to the folds ‘NAD(P)-binding Rossmann-fold domain’, ‘FAD/NAD(P)-binding domain’ and ‘Nucleotide-binding domain’. The dataset was scrutinized and only proteins utilizing FAD or NAD(P) in a typical manner were used, i.e. only selecting sequences with Gly, Ser or Ala in the key positions g1, g2 and g3 in Fig. 1. A total of 16 proteins were removed, of which five do not bind the coenzymes of interest and the others deviate in their coenzyme-binding manner. The resulting data set, with 120 members, was manually aligned based upon their three-dimensional structures, and divided into six groups with an even distribution of the three coenzyme specificities in each group (Supplement Tables 1–3). These groups were then included in a six-fold jack-knife test, iteratively training the two HMMs, one with FAD-binding sequences and one with NAD(P)-binding sequences, using sequences from five of the groups and testing against the remaining group and a false data set. The false data sets were created by dividing the remaining sequences in the ASTRAL data set (4701 sequences) into six equally sized groups.
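The six-fold jack-knife scheme can be sketched as follows. This is an assumed illustration of the data handling only, with hypothetical function names and placeholder sequence identifiers; the actual grouping is given in the supplementary tables, and HMM training itself is not shown.

```python
from collections import defaultdict


def stratified_groups(labelled, n_groups=6):
    """Deal (sequence_id, coenzyme) pairs round-robin per coenzyme so that
    every group receives a near-even share of each specificity."""
    by_label = defaultdict(list)
    for seq_id, coenzyme in labelled:
        by_label[coenzyme].append(seq_id)
    groups = [[] for _ in range(n_groups)]
    for coenzyme in sorted(by_label):
        for i, seq_id in enumerate(by_label[coenzyme]):
            groups[i % n_groups].append((seq_id, coenzyme))
    return groups


def jackknife_rounds(true_groups, false_groups):
    """Yield (training set, test positives, test negatives) per round:
    train on all groups but one, test on the held-out group plus the
    false data set with the same index."""
    for k in range(len(true_groups)):
        training = [item for j, group in enumerate(true_groups)
                    if j != k for item in group]
        yield training, true_groups[k], false_groups[k]


# Placeholder data with the class sizes implied by Table 5 (TP + FN).
labelled = ([(f"fad{i}", "FAD") for i in range(28)]
            + [(f"nad{i}", "NAD") for i in range(48)]
            + [(f"nadp{i}", "NADP") for i in range(44)])
false_ids = list(range(4701))
false_groups = [false_ids[i::6] for i in range(6)]

for training, positives, negatives in jackknife_rounds(
        stratified_groups(labelled), false_groups):
    print(len(training), len(positives), len(negatives))
```

Because 4701 is not divisible by six, the false data sets in this sketch differ in size by one sequence.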
As the method is divided into two steps, true coenzyme
binding proteins can be lost either during the database
search or during the classification. Only two FAD-binding
proteins are lost (false negatives): one is classified as
NADP-binding and the other is classified as false, i.e. non-
Rossmann fold. Among the NAD-binding proteins a total
of 10 are false negatives: four are lost during the database
search, five are classified as false, and one is classified as
NADP-binding. The group with most failures is NADP-
binding proteins, with a total of 13 false negatives: eight
are lost during database search, three are classified as false,
and two are falsely predicted to be NAD-binding.
False positives, i.e. protein sequences falsely predicted to
have certain coenzyme specificities, can be of two types:
either they do not bind the coenzymes of interest or they
do but the coenzyme preference is not correctly predicted.
Initially, during the database search, 62 proteins were
picked up which do not bind any of the coenzymes of inter-
est. However, only three of them remain as false positives
after the classification step: molybdenum cofactor biosynthesis protein (1jw9, MoeB), glycinamide ribonucleotide transformylase (1kjq, PurT), and a cell division protein (1ofu, FtsZ). Common to all three is a Rossmann-fold-like structure at the predicted coenzyme binding site. MoeB and PurT are ATP-binding proteins, but while the predicted coenzyme binding region in MoeB is in contact with ATP, in PurT it is the substrate (glycinamide ribonucleotide) which is in contact with the corresponding region. FtsZ is a GTPase and its coenzyme is in contact with the region falsely predicted to be NADP-bound. In addition to these
three there are four Rossmann-fold proteins where the
wrong coenzyme is predicted, rendering a total of seven
false positives.
Table 5. Prediction sensitivity and specificity of the novel prediction method as judged towards the ASTRAL database. TP = true positives, FP = false positives, FN = false negatives, TN = true negatives. The sensitivity was calculated as $\frac{TP}{TP + FN}$, the specificity as $1 - \frac{FP}{FP + TN}$, and Matthews correlation coefficient as $\frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$.
Coenzyme  TP  FP  FN  TN    Sensitivity  Specificity  Database size  Matthews correlation coefficient
FAD       26  0   2   4793  0.929        1.000        4821           0.96
NAD       38  4   10  4769  0.792        0.999        4821           0.84
NADP      31  3   13  4774  0.705        0.999        4821           0.80
Total     95  7   25  4694  0.792        0.999        4821           0.86
All in all, for 95 of 120 sequences the correct coenzyme specificity was predicted and only seven of 4701 sequences were false positives, yielding an overall prediction sensitivity of 79.2%, a specificity of 99.9% and a Matthews correlation coefficient of 0.86 (Table 5).
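The three performance measures can be recomputed from the counts in Table 5 with the formulas given in its caption. The sketch below is a worked check with hypothetical names; the counts are copied from the table.

```python
from math import sqrt

# (TP, FP, FN, TN) per coenzyme, copied from Table 5.
TABLE_5 = {
    "FAD":   (26, 0, 2, 4793),
    "NAD":   (38, 4, 10, 4769),
    "NADP":  (31, 3, 13, 4774),
    "Total": (95, 7, 25, 4694),
}


def scores(tp, fp, fn, tn):
    """Return (sensitivity, specificity, Matthews correlation coefficient)."""
    sensitivity = tp / (tp + fn)
    specificity = 1 - fp / (fp + tn)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sensitivity, specificity, mcc


for name, counts in TABLE_5.items():
    sens, spec, mcc = scores(*counts)
    print(f"{name:5s} sensitivity={sens:.3f} specificity={spec:.3f} MCC={mcc:.2f}")
```

Each row sums to the database size of 4821 (120 true sequences plus 4701 false ones), and the printed values should agree with Table 5 after rounding.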
The method, using HMMs trained on all six groups, was applied on 68 genomes: all available among eukaryotes (30) and archaea (18), and a representative selection of 20 bacterial genomes. Genome sequences were downloaded from ENSEMBL (ftp://ftp.ensembl.org/pub/release-30/), NCBI (ftp://ftp.ncbi.nih.gov/genomes/) and TIGR (ftp://ftp.tigr.org/pub/data/).
TM regions were predicted using Phobius [19], a tool based on HMMs, with the ability to differentiate between signal sequences and true transmembrane sequences. The TM regions were subsequently scrutinized, and in those cases where they overlapped with a predicted Rossmann-fold region (coenzyme binding site plus 65 residues), the transmembrane prediction was ignored.
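The overlap filter can be illustrated with a few lines of interval arithmetic. This is a sketch under a stated assumption: the Rossmann-fold region is taken to run from the start of the predicted binding site to 65 residues past its end, which is one reading of "coenzyme binding site plus 65 residues". The function names and the example coordinates are hypothetical.

```python
ROSSMANN_EXTENSION = 65  # residues added to the binding site, per the text


def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed residue intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def filter_tm_segments(tm_segments, binding_sites, extension=ROSSMANN_EXTENSION):
    """Drop predicted TM segments that overlap any Rossmann-fold region.

    Assumption: each region spans the binding site plus `extension`
    residues on its C-terminal side.
    """
    regions = [(start, end + extension) for start, end in binding_sites]
    return [
        (tm_start, tm_end)
        for tm_start, tm_end in tm_segments
        if not any(overlaps(tm_start, tm_end, r_start, r_end)
                   for r_start, r_end in regions)
    ]


# Hypothetical protein: one binding site at residues 30-36, three TM predictions.
print(filter_tm_segments([(5, 25), (40, 60), (200, 220)], [(30, 36)]))
```

In this example the segment at residues 40–60 falls inside the assumed Rossmann-fold region and is discarded, while the other two are retained.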
References
1 Rossmann MG, Liljas A, Brändén C-I & Banaszak LJ (1975) In The Enzymes (Boyer PD, ed.), Vol. 11, 3rd edn, pp. 61–102. Academic Press, New York.
2 Brenner SE, Chothia C, Hubbard TJP & Murzin AG
(1996) Understanding protein structure: using scop for
fold interpretation. Methods Enzymol 266, 635–643.
3 Schulz GE, Schirmer RH, Sachsenheimer W & Pai EF
(1978) The structure of the flavoenzyme glutathione
reductase. Nature 273, 120–124.
4 Adams MJ, Ellis GH, Gover S, Naylor CE & Phillips C
(1994) Crystallographic study of coenzyme, coenzyme
analogue and substrate binding in 6-phosphogluconate
dehydrogenase: implications for NADP specificity and
the enzyme mechanism. Structure 2, 651–668.
5 Wierenga RK, De Maeyer MCH & Hol GJ (1985) Interaction of pyrophosphate moieties with α-helixes in dinucleotide binding proteins. Biochemistry 24, 1346–1357.
6 Wierenga RK, Terpstra P & Hol WGJ (1986) Prediction
of the occurrence of the ADP-binding beta alpha beta-
fold in proteins, using an amino acid sequence finger-
print. J Mol Biol 187, 101–107.
7 Carugo O & Argos P (1997) NADP-dependent enzymes.
I: Conserved stereochemistry of cofactor binding. Pro-
teins 28, 10–28.
8 Kallberg Y, Oppermann U, Jörnvall H & Persson B (2002) Short-chain dehydrogenases/reductases (SDRs). Eur J Biochem 269, 4409–4417.
9 Mulder NJ, Apweiler R, Attwood TK, et al. (2005)
InterPro, progress and status in 2005. Nucleic Acids Res
33, D201–205.
10 Hubbard T, Andrews D, Caccamo M, et al. (2005)
Ensembl 2005. Nucleic Acids Res 33, D447–453.
11 Odermatt A, Arnold P, Stauffer A, Frey BM & Frey FJ
(1999) The N-terminal anchor sequences of 11beta-
hydroxysteroid dehydrogenases determine their orienta-
tion in the endoplasmic reticulum membrane. J Biol
Chem 274, 28762–28770.
12 Binda C, Hubalek F, Li M, Edmondson DE & Mattevi
A (2004) Crystal structure of human monoamine oxi-
dase B, a drug target enzyme monotopically inserted
into the mitochondrial outer membrane. FEBS Lett 564,
225–228.
13 Jackson JB, Peake SJ & White SA (1999) Structure and
mechanism of proton-translocating transhydrogenase.
FEBS Lett 464, 1–8.
14 Liu J & Rost B (2001) Comparing function and struc-
ture between entire proteomes. Protein Sci 10, 1970–
1979.
15 Krogh A, Larsson B, von Heijne G & Sonnhammer EL (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305, 567–580.
16 Nilsson J, Persson B & von Heijne G (2005) Compara-
tive analysis of amino acid distributions in integral
membrane proteins from 107 genomes. Proteins 60,
606–616.
17 Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14, 755–763 (http://hmmer.wustl.edu).
18 Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl
P, Levitt M & Brenner SE (2004) The ASTRAL Com-
pendium in 2004. Nucleic Acids Res 32, 189–192.
19 Käll L, Krogh A & Sonnhammer EL (2004) A combined transmembrane topology and signal peptide prediction method. J Mol Biol 338, 1027–1036.
Supplementary material
The following supplementary material is available
online:
Table S1. All enzymes used in the development of the
prediction method.
Table S2. Alignment of NAD- and NADP-preferring
enzymes used in the development of the prediction
method.
Table S3. Alignment of FAD-preferring enzymes used
in the development of the prediction method.
This material is available as part of the online article
from http://www.blackwell-synergy.com