Analysis of ancient sequence motifs in the H+-PPase family
Joel Hedlund 1*, Roberto Cantoni 2,3,4*, Margareta Baltscheffsky 2, Herrick Baltscheffsky 2 and Bengt Persson 1,4
1 IFM Bioinformatics, Linköping University, Sweden
2 Department of Biochemistry and Biophysics, Arrhenius Laboratories, Stockholm University, Sweden
3 Department of Physical Sciences, ‘Federico II’ University of Naples, Italy
4 Department of Cell and Molecular Biology (CMB), Programme for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, Sweden
Keywords
bioinformatics; hidden Markov models; molecular evolution; proteinaceous amino acids; pyrophosphatase
Correspondence
B. Persson, IFM Bioinformatics, Linköping University, S-581 83 Linköping, Sweden
Fax: +46 13 137 568
Tel: +46 13 282 983
E-mail: bpn@ifm.liu.se
*These authors contributed equally to this work
(Received 9 August 2006, revised 26 September 2006, accepted 27 September 2006)
doi:10.1111/j.1742-4658.2006.05514.x
The unique family of membrane-bound proton-pumping inorganic pyrophosphatases, involving pyrophosphate as the alternative to ATP, was investigated by characterizing 166 members of the UniProtKB/Swiss-Prot + UniProtKB/TrEMBL databases and available completed genomes, using sequence comparisons and a hidden Markov model based upon a conserved 57-residue region in the loop between transmembrane segments 5 and 6. The hidden Markov model was also used to search the approximately one million sequences recently reported from a large-scale sequencing project of organisms in the Sargasso Sea, resulting in an additional 164 partial pyrophosphatase sequences. The strongly conserved 57-residue region was found to contain two nonapeptidyl sequences, mainly consisting of the four ‘very early’ proteinaceous amino acid residues Gly, Ala, Val and Asp, compatible with an ancient origin of the inorganic pyrophosphatases. The nonapeptide patterns have charged amino acid residues at positions 1, 5 and 9, are apparent binding sites for the substrate and parts of the active site, and were shown to be so specific for these enzymes that they can be used for functional assignments of unannotated genomes.
Abbreviations
H+-PPase/H+-PPi synthase, proton pumping inorganic pyrophosphatase/pyrophosphate synthase; HMM, hidden Markov model; Pi, inorganic phosphate; PPi, pyrophosphate.
FEBS Journal 273 (2006) 5183–5193 © 2006 The Authors Journal compilation © 2006 FEBS
Membrane-bound inorganic pyrophosphatase/pyrophosphate synthase (H+-PPase/H+-PPi synthase) [1,2] activities were first described in chromatophores from the purple nonsulphur photosynthetic bacterium, Rhodospirillum rubrum, where the enzyme functions as a proton pump [3]. The gene for H+-PPase/H+-PPi synthase, which is the enzyme involved in the photosynthetic formation of pyrophosphate (PPi) [2], was cloned in 1998 [4] and the primary structure and further properties have been deduced [5]. Moreover, 3D models of parts of the putative active site loop between transmembrane segments 5 and 6 of R. rubrum have been presented [6]. This 57 amino acid residue loop contains three sequence motifs (GGG, DVGADLVGK and DNVGDNVGD) in the following sequence: LGGGIFTKCADVGADLVGKVEAGIPEDDPRNPAVIADNVGDNVGDCAGMAADLFETY.
In what apparently is a pyrophosphate-binding region and an essential part of the active site for the phosphorylation/phosphatase reaction [7], three different ‘primitive’ tetrapeptide motifs (DVGA, DLVG and DNVG) [5,6] have been located in two nonapeptide (enneapeptide) sequences (DVGADLVGK and DNVGDNVGD) which, with their charged amino acids at positions 1, 5 and 9, seem to be involved in binding the metal-phosphate substrates [7]. These sequence motifs are denoted ‘primitive’ because they have a high content of the four ‘very early’ proteinaceous amino acids G (glycine), A (alanine), D (aspartic acid) and V (valine). In 1978, Eigen & Schuster [8] proposed a detailed model for the evolution of RNA codons, which suggests that the RNA triplets GGC and GCC (coding for glycine and alanine, respectively) were the earliest codons. Transition mutations at the second base would then have given GAC and GUC, which code for aspartic acid and valine, respectively. The evidence supporting the role of these four proteinaceous amino acids as ‘very early’ [5] has been further confirmed by Trifonov [9]. Notably, experiments by Miller [10], leading to the synthesis of amino acids under certain ‘simulated prebiotic’ conditions, gave these four amino acids in the highest yield, and they all were among the amino acids found in the Murchison meteorite [11]. Both the high content of the ‘very early’ amino acids in these sequence motifs, and the fact that the motifs have remained essentially unchanged during billions of years of biological evolution, from Archaea and Bacteria to Eukaryotes, provide a background for our fascination with various aspects of their evolution and function.
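
The codon relationship described above can be checked mechanically. The following is a minimal Python sketch, not part of the original analysis: it hard-codes only the four codon assignments named in the text (which agree with the standard genetic code) and applies a transition (purine to purine, or pyrimidine to pyrimidine) at the second base. The names `CODON_TO_AMINO_ACID`, `TRANSITION` and `second_base_transition` are illustrative assumptions.

```python
# Codon assignments quoted in the text (standard genetic code, RNA alphabet).
CODON_TO_AMINO_ACID = {
    "GGC": "Gly",
    "GCC": "Ala",
    "GAC": "Asp",
    "GUC": "Val",
}

# A transition swaps purine for purine (A<->G) or pyrimidine for pyrimidine (C<->U).
TRANSITION = {"A": "G", "G": "A", "C": "U", "U": "C"}


def second_base_transition(codon: str) -> str:
    """Return the codon obtained by a transition mutation at the second base."""
    return codon[0] + TRANSITION[codon[1]] + codon[2]


for early in ("GGC", "GCC"):
    derived = second_base_transition(early)
    print(f"{early} ({CODON_TO_AMINO_ACID[early]}) -> "
          f"{derived} ({CODON_TO_AMINO_ACID[derived]})")
```

Under these assumptions, the two proposed earliest codons (Gly, Ala) map onto the codons for Asp and Val, which is the step the text attributes to the Eigen & Schuster model.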
As pointed out previously [5], the GGG motif of the R. rubrum H+-PPase may be of special functional significance in a possible conformational change mechanism for the physiological coupling between the light-induced pumping of protons and the photophosphorylation of inorganic phosphate (Pi) to PPi [2] or its reversal, the dark hydrolysis of PPi to Pi [1]. All three glycines are strictly conserved, except in very rare cases, where one, usually the first G, is substituted with an A. With this conservative substitution, and the additional fact that, just as glycine, alanine can serve as a versatile link in conformational changes, the substitution of one G with one A should not cause any drastic change in the suggested function of the GGG motif. G and A are also seen elsewhere to have been interchanged, as will be exemplified in our discussion of the DVGADLVGK motif.
Two features of the H+-PPase family are notable at present. First, the unique simplicity of these homologous, integrally membrane-bound enzymes resides in both their substrates (Pi and PPi) and their single-subunit dimerized structure [12]. Second, their extreme hydrophobicity, with approximately half of the residues in the 16 ± 1 transmembrane segments, has made all attempts to obtain high-resolution 3D information of the H+-PPase unsuccessful, to date. Bioinformatics may be used to provide new perspectives on the possible evolution of this ancient, widely spread and unusually highly conserved energy-transferring enzyme family.
The possibility has been considered that PPi could have been a predecessor to ATP, and that H+-PPases could have been direct or indirect evolutionary ancestors of ATPases [5,13]. From numerous genome projects, a large number of both prokaryotic and eukaryotic H+-PPase sequences are now known. The high content of strongly conserved, very early amino acids in the three putative active site motifs shown above, in the loop between transmembrane segments 5 and 6, deserves a closer look for existing sequence similarities in other known polypeptides. We thus also evaluate the possibility that the putative active site may have evolved from an original enzyme structure involved in an emerging metabolism of energy-rich phosphate [6]. A recent, additional indication in this direction was given by the discovery that shifting the growth conditions of R. rubrum from aerobic/dark to anaerobic/light resulted in concerted transcriptional activation of both H+-PPase and photosynthetic genes, by the same anaerobic regulatory factor [14], pointing towards PPi as the energy-rich phosphate product of early bacterial photophosphorylation. Furthermore, the unique property of the H+-PPase family, to use PPi rather than ATP for energy coupling and utilization in biological membranes, provides an alternative angle to the study of the central, but notoriously elusive, coupling mechanism between energy-rich phosphates and proton pumps. Finally, the major problems encountered by several laboratories in obtaining H+-PPase samples for 3D studies have motivated detailed searches with various bioinformatic techniques in order to extend the characterization of the members of the H+-PPase family.
Results and Discussion
Characterization of H+-PPase members
In order to detect members of the membrane-bound, proton pumping H+-PPase family, all completed genomes and the UniProtKB protein sequence database [15] were searched using FASTA [16], resulting in characterization of over 100 different members. The sequences were multiply aligned, revealing a number of well-conserved regions, especially the highly conserved 57-residue segment in the loop between transmembrane segments 5 and 6. This segment was used to create a hidden Markov model (HMM), which subsequently was used to search the databases of UniProtKB and the currently available genomes, in order to identify further family members. Using this HMM, 166 sequences were found (supplementary Table S1). Remarkably, only other H+-PPases were found using this strategy, indicating high specificity of the model. The poorest scoring H+-PPase sequence had an E-value of 8.3e-25 (i.e. under the circumstances of the search, only 8.3e-25 unrelated sequences could be expected to attain the same match quality by chance alone) [17]. Thus, even the weakest H+-PPase match is highly statistically significant. Furthermore, the best scoring non-H+-PPase protein was detected at an E-value of 15, which is far from statistical significance. Thus, the model is well suited for detection of H+-PPases. The 57-residue segment seems to be unique for the H+-PPase family, which might be seen as an index of their early separation from other families, making H+-PPases a low-positioned branch in the genealogical tree of protein families. Consequently, this segment can be used as a fingerprint in the search for further members of this family from new sequence data.
Analyses of the various primary structures of H+-PPases (PPi synthases) raise the possibility of exploring, in molecular detail, various properties of this ‘primitive’ alternative to ubiquitously occurring ATP synthases. Looking at the species distribution, H+-PPases are present in several archaeal and bacterial species [5,18]. In eukaryotes, these enzymes are found in plants and in a few blood parasites, such as Tetrahymena and Plasmodium.
The fact that four ‘very early’ amino acid residues have been retained indicates very early optimization. This may be a rather unusual situation compared with early motifs in other proteins, where stepwise evolution of the motifs may have provided further optimization, through introduction of later amino acids.
Different H+-PPase subfamilies
In order to characterize the interfamily relationships, a dendrogram was calculated (Fig. 1) based upon the multiple sequence alignment. The tree shows that H+-PPases form two subfamilies, corresponding to type 1 and type 2 [19,20]. Several species present multiple forms of H+-PPases, which in many cases are distantly related (37–39% pairwise residue identity) and belong to the separate types. For cress (Arabidopsis thaliana), there are five type 1 and six type 2 enzymes, whereas for rice (Oryza sativa) there exist 18 enzymes of type 1 and two enzymes of type 2. Among plants, the multiplicity has so far only been seen in organisms for which the complete genomes are available, and this distantly related multiplicity can thus be expected to occur also in further plants. The blood parasites Tetrahymena, Plasmodium falciparum and Plasmodium yoelii show a similar multiplicity. Moreover, some archaeal species (e.g. Methanosarcina mazei and Methanosarcina acetivorans) have multiple H+-PPases, whereas most bacterial species do not show any multiplicity, except for the bacterium Rhodopseudomonas palustris, which has two different pyrophosphatases, found on separate branches within type 2 (Fig. 1).
This multiplicity contrasts with the situation for the family I soluble PPases, where two variants of different sizes are known, but they are strictly divided between different kingdoms: one in prokaryotes and the other in eukaryotes [21].
Surprisingly, two eukaryotic sequences from the frog Xenopus tropicalis are found (see Fig. 1) among the bacterial type 2 members. However, because this genome project is not complete, it cannot yet be claimed that these sequences are correct.
In order to characterize the differences between sequences of types 1 and 2, we compared the positions strictly conserved within each type, but differing between the types, as indicated in Fig. 2. In total, seven such differences were found (four in the transmembrane segments and three on the cytosolic side), whereas none was found on the noncytosolic side.
Two of the differences were found between residues with different physico-chemical properties, implying possible functional impact (Fig. 2, residues indicated by bullet symbols). Functional impact has already been confirmed for one of these differences (position 507 in HPPA_STRCO), as an Ala/Lys mutation introduced at the corresponding position in Carboxydothermus hydrogenoformans type 1 H+-PPase has been shown to confer the potassium independence of type 2 H+-PPases to the enzyme [22]. At position 253 in the loop between transmembrane segments 5 and 6, close to one of the conserved nonapeptides, the type 1 enzymes have predominantly hydrophobic Ile or Val, while type 2 enzymes have polar Cys or Ser.
Furthermore, there are two exchanges within weakly conserved CLUSTALW groups of residues [23] (Fig. 2, open ring symbols). At position 266 in transmembrane segment 6, the type 1 residues are Glu or Gly, while the type 2 residues are Val or Ala. At position 510, in loop 11–12 close to the membrane, type 1 enzymes prefer Gly, while type 2 enzymes prefer Thr or Ala. The remaining three residue exchanges are within CLUSTALW groups, and these are all located in transmembrane segments (Fig. 2).
Fig. 1. Dendrogram of H+-PPases. The dendrogram is based upon the multiple sequence alignment of all PPase sequences found in UniProtKB/Swiss-Prot, UniProtKB/TrEMBL and GenomeLKPG databases, with removal of sequences showing 90% or more identity to any of the other sequences in the alignment. Two sequences (Q420R8_DESHA and Q4CED6_CLOTM) show unclear relationships and were excluded without affecting the general tree topology. Red marks at branch points indicate bootstrap values below 888/1000, the bootstrap value at the branch point separating Type 1 (K+-dependent) from Type 2 (K+-independent) enzymes. The horizontal bar shows the branch length corresponding to 5% residue differences. Each branch end-point is designated with identifiers from UniProtKB/Swiss-Prot, UniProtKB/TrEMBL or the respective genome project, prefixed by indicators of kingdom (A, archaeal; B, bacterial; E, eukaryotic), species group (A, Alveolata; E, Euglenozoa; P, Plant; V, Vertebrates), and PPase type (1 or 2). Archaea are labelled yellow, bacteria blue, and plants green. The red-labelled sequences originate from primitive eukaryotes (alveolates and euglenozoans). Two sequences from the unfinished Xenopus tropicalis genome project are shown in purple. Accession numbers are given within parentheses after the UniProtKB/Swiss-Prot ID. For UniProtKB/TrEMBL IDs, the accession number forms the first part of the ID. All type 1 PPases are found in the upper part of the dendrogram, while type 2 PPases are in the lower part. Several plants and protozoan organisms have both type 1 and type 2 PPases, whereas most of the bacterial forms only present one of the types.
Fig. 2. Schematic view of the H+-PPase sequence with residue differences between type 1 and type 2 indicated. The black line symbolizes the amino acid chain, and the horizontal green lines denote membrane borders with the cytosolic side downwards and the noncytosolic side upwards. The transmembrane topology is from a previous publication [26]. The transmembrane segments are numbered from 1 to 17 and the loops are numbered 1–2 to 16–17. The letters N and C denote the N- and C-terminus, respectively. Positional numbers refer to the reference sequence from Streptomyces coelicolor (UniProtKB/Swiss-Prot ID HPPA_STRCO, AC Q9X913). The green regions represent the five conserved regions in Fig. 3. The red boxes indicate the four nonapeptides (shown in red in Fig. 3). Thick lines denote positions where the two most common residue types together make up more than 95% of the residue type content. Positions with differential conservation are indicated by labels of the type ‘AB/CD’, which denotes that more than 90% of the Type 1 enzymes have residue A or B at this position, while 90% of the Type 2 enzymes have C or D, and that A and C are the most common ones. Boldface residue letters indicate that more than 85% of the sequences in the corresponding subtype have this residue. Conserved substitutions within a CLUSTALW ‘strongly conserved group of residues’ are shown by dash symbols on the backbone, and open ring symbols are similarly used for CLUSTALW ‘weakly conserved groups of residues’. Bullet symbols on the backbone denote conserved substitutions between two residues that do not occur together in any of the CLUSTALW groups. This latter form of substitution implies functional impact. Also, in order to count as a conserved substitution, none of the residue types A or B can be identical to the residue types C or D.
Conserved regions within H+-PPases
From the multiple sequence alignment of all H+-PPases, a limited number of strongly conserved regions are clearly visible. A conservation profile was calculated, showing the degree of conservation averaged over windows of 11 residues along the protein chain (Fig. 3). From the resulting plot it is clear that most of the conserved residues are located on the cytosolic side of the membrane, while only few residues on the noncytosolic side are conserved. Furthermore, there are five segments that are seen as distinct peaks, indicating strong conservation. These sequences are listed in Fig. 3. The locations of these segments are schematically indicated in the topology plot in Fig. 2, shown as red-coloured segments. The first region corresponds to the previously mentioned 57-residue conserved region in the loop between transmembrane segments 5 and 6, while the second and third regions are found in the loop between segments 11 and 12 and the fourth and fifth regions in the loop between segments 15 and 16.
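
The windowed conservation profile described above can be illustrated with a short Python sketch. It assumes that a per-column conservation score is already available for each alignment position (the paper reports CLUSTALW column scores in Fig. 3); the function name `windowed_mean`, the toy scores and the way the 55% cut-off is applied are illustrative assumptions rather than the authors' implementation.

```python
def windowed_mean(scores, window=11):
    """Average per-column scores over every full window of `window` positions.

    Returns one value per window start, i.e. len(scores) - window + 1 values
    (an empty list if the input is shorter than the window).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    n = len(scores)
    if n < window:
        return []
    total = sum(scores[:window])
    means = [total / window]
    for i in range(window, n):
        # Slide the window by one column: add the new score, drop the oldest.
        total += scores[i] - scores[i - window]
        means.append(total / window)
    return means


# Hypothetical column scores (percent): a conserved stretch between variable flanks.
toy_scores = [10] * 8 + [90] * 15 + [10] * 8
profile = windowed_mean(toy_scores, window=11)

# Window starts whose average exceeds a 55% cut-off (the dotted line in Fig. 3).
peaks = [start for start, value in enumerate(profile) if value > 55]
print(peaks)
```

Averaging over a window smooths out single variable columns, so a peak in the profile reflects a contiguous conserved segment rather than an isolated invariant residue.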
We developed HMMs for the regions 2+3 and 4+5, which were used to search UniProtKB/Swiss-Prot and UniProtKB/TrEMBL databases for potentially homologous proteins. However, also for these regions, we only found proteins of the H+-PPase family, thus emphasizing the uniqueness of this family.
Fig. 3. Conservation plot for H+-PPases and conserved sequence motifs. From the multiple sequence alignment of H+-PPases, CLUSTALW column scores were averaged for ungapped 11-residue windows of the reference sequence from Streptomyces coelicolor, HPPA_STRCO, and plotted in green. The predicted membrane topology is shown in blue, where high values indicate the noncytosolic side and low values the cytosolic side, whereas medium values correspond to transmembrane regions. The plot shows that the conserved regions coincide with cytoplasmic and transmembrane regions. There are five regions with distinct peaks above 55% column score (dotted line), indicating strong conservation. The sequences for these regions in S. coelicolor are shown below the plot. In region 1, the well-conserved patterns, including the two nonapeptides, are highlighted in red. Distantly related nonapeptides in regions 4 and 5 are also highlighted in red, together with a GGS pattern, possibly corresponding to the GGG pattern in region 1 (cf. text).
The 57-residue conserved region in the loop between transmembrane segments 5 and 6
From the multiple sequence alignment of all 145 H+-PPases, the consensus sequence for the well-conserved 57-residue region was calculated (Fig. 4A). The region contains the three conserved sequence motifs: GGG, DVGADLVGK and DNVGDNVGD, which have already received particular attention from the evolutionary viewpoint [5,6] and appear to form functionally significant parts of the active site of the enzyme [7]. Both the second and the third motif contain nine amino acid residues, of which the first, fifth and ninth are charged. The number of amino acid residues separating the three motifs is remarkably constant in all the H+-PPases.
The nonapeptide DVGADLVGK was seen to have a partial counterpart in the P-loop [24] of the active β-subunit of ATP synthase. In an alignment of PPi synthase from R. rubrum with the P-loop in animal mitochondrial ATP synthase, four of eight amino acid residues were found to be identical [5].
In order to investigate further the evolutionary variation of this 57-residue region, we applied the HMM to search the approximately one million sequences recently reported from a large-scale sequencing project of organisms in the Sargasso Sea [25]. With the model, we were able to extract an additional 164 pyrophosphatase sequences (supplementary Table S2), not overlapping with the initial set of sequences. The consensus sequence of the Sargasso sequences is shown in Fig. 4B. Compared with all sequences detected in UniProtKB/Swiss-Prot, UniProtKB/TrEMBL and GenomeLKPG databases (Fig. 4A), the variability is smaller among the Sargasso sequences, but many of the residue variations are identical. Furthermore, the variable regions are generally located at the same sites as for the sequences in Fig. 4A.
Fig. 4. Consensus sequences of the conserved 57 amino acid residue region. (A) The consensus sequence derived from all sequences found in UniProtKB/Swiss-Prot, UniProtKB/TrEMBL and available genomes using the HMM search is shown. The sequence is shown using sequence logos [34], clearly showing the high conservation of this region. Residues are coloured according to chemical properties: green represents polar residues (G, S, T, Y, C, Q, N), blue basic (K, R, H), red acidic (D, E) and black hydrophobic residues (A, V, L, I, P, W, F, M). The amino acid residues at each position are also shown in plain text below the x-axis, where the top row represents the most common residue type with alternative residues ordered in decreasing frequency. (B) We also used the HMM to search the approximately one million sequences recently reported from the Sargasso Sea [25]. The sequence logos and positional residue variants are shown as in (A).
For a number of plant H+-PPases, the HMM finds distant similarity also to a second region of the enzymes, possibly being visible traces of an ancient gene duplication. This second region is located at residues 738–785 (numbering according to the A. thaliana sequence with accession number Q9FWR2 (UniProtKB Q9FWR2_ARATH)). This region forms loop 15–16, located at the cytosolic side according to experimental investigations [26] (cf. Fig. 3). The patterns of this second region are also seen in further species variants, but are most clearly distinguishable in the plant sequences. According to the three well-conserved sequence segments of H+-PPases, the second nonapeptide motif is well conserved, with all three aspartic acid residues unchanged (Fig. 3, marked residues in regions 1 and 5). For the first nonapeptide motif, charges are found at positions 1, 5 and 9 in the order Asp, Asp, Lys in region 1, whereas the order is Asp, Lys, Asp in region 4. Notably, the positional spacing between the two nonapeptides in region 1 and region 4+5 is identical (26 residues). Furthermore, the GGG motif preceding the nonapeptides in region 1 could correspond to a GGA or GGS motif, preceding the nonapeptide in region 4 (Fig. 3, marked residues in peptides 1 and 4). Thus, these observations, taken together, might reflect an ancient gene duplication event, as previously suggested [5].
Occurrence of the typical H+-PPase nonapeptides in other proteins
In order to investigate the general occurrence of the two nonapeptides, characterized as typical of H+-PPases, they were compared with all sequences in the UniProtKB/Swiss-Prot and UniProtKB/TrEMBL databases. The sequence patterns used in the searches were based upon all positional variants occurring in any of the H+-PPases, excluding those with only a single observation, resulting in the patterns: D-[VIMT]-[GA]-[AGS]-D-[LI]-[VSMA]-G-K and D-[NCFL]-[VITA]-G-D-N-[VA]-G-D. In Table 1, these patterns are shown as DpppDppGK and DppGDNpGD, respectively. Remarkably, only 11 and 7, respectively, of the 192 and 32 possible combinations of different nonapeptides were found to have matches in UniProtKB/Swiss-Prot or UniProtKB/TrEMBL databases (Table 1). All but one of the sequences found by the first nonapeptide pattern are annotated as H+-PPases. For the second nonapeptide pattern, all but two are H+-PPases. The exceptions are a putative DNA damage-inducible protein from Erythrobacter litoralis (Q4TQ38_9SPHN), presumably not related to the H+-PPases, and two hypothetical proteins from Neurospora crassa (Q871A9_NEUCR and Q7RZ15_NEUCR). Furthermore, as seen in Table 1, four H+-PPases are not detected by the first nonapeptide pattern, because those proteins have one atypical amino acid residue in this pattern. Thus, the nonapeptide patterns are, with these few exceptions, specific for the H+-PPases. As seen in Table 1, the number of non-H+-PPase hits increases dramatically when the patterns are extended to DXXXDXXGK and DXXGDNXGD, respectively, where X represents any amino acid residue.
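
The pattern notation used above maps directly onto regular expressions, and the counts of 192 and 32 possible nonapeptides follow from multiplying the number of alternatives at each position. The Python sketch below illustrates this under stated assumptions: the helper names `to_regex` and `expand` are hypothetical, and the only sequence searched is the 57-residue R. rubrum loop quoted in the introduction, not a protein database.

```python
import re
from itertools import product

FIRST = "D-[VIMT]-[GA]-[AGS]-D-[LI]-[VSMA]-G-K"
SECOND = "D-[NCFL]-[VITA]-G-D-N-[VA]-G-D"


def to_regex(pattern: str) -> str:
    """Convert the dash-separated notation to a regex; 'X' means any residue."""
    return "".join("." if element == "X" else element
                   for element in pattern.split("-"))


def expand(pattern: str):
    """List every concrete peptide allowed by a pattern without 'X' positions."""
    choices = [element.strip("[]") for element in pattern.split("-")]
    return ["".join(residues) for residues in product(*choices)]


first_variants = expand(FIRST)    # 4 * 2 * 3 * 2 * 4 alternatives
second_variants = expand(SECOND)  # 4 * 4 * 2 alternatives
print(len(first_variants), len(second_variants))

# The 57-residue loop of the R. rubrum enzyme, as given in the text.
loop = "LGGGIFTKCADVGADLVGKVEAGIPEDDPRNPAVIADNVGDNVGDCAGMAADLFETY"
first_hit = re.search(to_regex(FIRST), loop).group()
second_hit = re.search(to_regex(SECOND), loop).group()
print(first_hit, second_hit)
```

Replacing every bracketed position with X (for example D-X-X-X-D-X-X-G-K) relaxes the regex to single-character wildcards, which is consistent with the sharp rise in non-H+-PPase hits reported for the extended patterns.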
Table 1. Number of proteins and occurrences of PPi-related sequence motifs in the UniProtKB/Swiss-Prot and UniProtKB/TrEMBL databases. The peptides DpppDppGK and DppGDNpGD denote patterns based upon all positional variants occurring in any of the H+-PPases, excluding those with only a single observation, corresponding to the patterns: D-[VIMT]-[GA]-[AGS]-D-[LI]-[VSMA]-G-K and D-[NCFL]-[VITA]-G-D-N-[VA]-G-D. In the peptides DXXXDXXGK and DXXGDNXGD, X represents any amino acid residue.

                    UniProtKB/Swiss-Prot         UniProtKB/TrEMBL
Sequence motif      Proteins  Hits  PPases     Proteins  Hits  PPases
First nonapeptide
DVGADLVGK                 20    20      20           81    81      81
DVGGDLVGK                  8     8       8            4     4       4
DMAADLVGK                  1     1       1            7     7       7
DVGADLSGK                  0     0       0            7     7       7
DIGADLVGK                  0     0       0            6     6       6
DVGADLAGK                  0     0       0            4     4       4
DTGADIVGK                  0     0       0            2     2       2
DMGADLVGK                  0     0       0            2     2       2
DVAADLVGK                  0     0       0            1     1       1
DVGADIAGK                  0     0       0            1     1       1
DTGADLAGK                  0     0       0            1     1       0
DpppDppGK                 29    29      29          116   116     115
DXXXDXXGK                809   810      30         7332  7366     118
Second nonapeptide
DNVGDNVGD                 29    29      29           93    93      93
DLVGDNVGD                  1     1       1           15    15      15
DCTGDNAGD                  1     1       1            5     5       5
DCIGDNVGD                  0     0       0            7     7       7
DFVGDNVGD                  0     0       0            2     2       2
DCAGDNAGD                  0     0       0            2     2       2
DLTGDNAGD                  0     0       0            2     2       0
DppGDNpGD                 31    31      31          126   126     124
DXXGDNXGD                 31    31      31          160   175     125

We extended the pattern search to screen the UniProtKB/Swiss-Prot database for very simple motifs of possible ancestral significance, with an alternation of Asp and one of the other ‘very early’ amino acids (e.g. DADADADAD) (Table 2). It can be seen that the pattern VDVDV is under-represented compared with the patterns ADADA and GDGDG, even when considering the general frequencies of the residues (V, 6.7%; A, 7.9%; G, 7.0%). Similarly, the DGDGD pattern is over-represented compared with the patterns DADAD and DVDVD. This over-representation is still present after homology reduction (at the 80% and 60% levels) and might well reflect structural properties.
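
Table 2 distinguishes the number of proteins containing a motif from the number of occurrences. A minimal Python sketch of such a count is given below. It assumes that overlapping matches are counted separately, which the paper does not state explicitly; the function name `count_motif` and the toy sequences are hypothetical.

```python
import re


def count_motif(sequences, motif):
    """Return (sequences containing the motif, total overlapping occurrences)."""
    # A zero-width lookahead lets matches overlap: ADADADA contains ADADA twice.
    finder = re.compile(f"(?={re.escape(motif)})")
    proteins = 0
    occurrences = 0
    for sequence in sequences:
        hits = len(finder.findall(sequence))
        if hits:
            proteins += 1
            occurrences += hits
    return proteins, occurrences


# Hypothetical toy sequences, for illustration only.
toy_sequences = ["MKADADADAGG", "MVDVDVK", "MGGADADAK"]
print(count_motif(toy_sequences, "ADADA"))
print(count_motif(toy_sequences, "DVDVD"))
```

Counting both quantities makes it possible to spot repetitive proteins, where a single sequence contributes several occurrences of the same alternating motif.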
We also searched the complete UniProtKB (Swiss-Prot + TrEMBL) database for patterns with alternations of any two ‘very early’ amino acid residues. The largest number of proteins was found for the sequences AGAGA (6185) and GAGAG (6203), in agreement with the assumption that G and A are both very early, flexible and frequent amino acids. Close to 6000 results were also reported for the pattern AVAVA (5865), while much smaller numbers were found for GVGVG (3386), ADADA (2383), DGDGD (2100) and DVDVD (852). The small difference in frequencies between V and D (6.7% and 5.3%, respectively) in known proteins does not fully explain the discrepancy between AVAVA and ADADA.
Three of the sequences found had long segments consisting of repetitious patterns containing two of the four early amino acid residues A, G, D and V. The residues A and D, present in an alternation pattern (ADADAD), were found in the surface protein, SdrI, from Staphylococcus saprophyticus and two putative peptidoglycan-bound proteins from Listeria innocua and Listeria monocytogenes. These proteins are supposedly attached to the cell wall peptidoglycan by amide bonds. Such patterns are believed to be evolutionary relics of early sequence pattern formation by mutations and duplications, from early homo-oligomers (GGGGG and AAAAA) [27].
In the conserved nonapeptides DVGADLVGK and DNVGDNVGD, 14 out of 18 residues belong to the four ‘primitive’ residue types G, A, V and D. In order to make a general assessment of frequency of ‘primitive’ and repetitive patterns including the ancient amino acid residues, we searched the protein databases for sequences including these four residues in various combinations. Thus, we scanned UniProtKB/Swiss-Prot for the sequence motif D-[A/G/V]3-D-[A/G/V]3-D, and a similar motif, where alanine is replaced with asparagine, given the presence of asparagines in one of the two putative active site motifs of R. rubrum H+-PPase. The only patterns found in the proteins were DNVGDNVGD, unique to the H+-PPase family, and DNNNDNNND, in the spindle assembly checkpoint component MAD1 from Saccharomyces cerevisiae (Mitotic arrest deficient protein 1; UniProtKB/Swiss-Prot ID MAD1_YEAST). We thus concluded that the charged residues (1, 5 and 9) of the two nonapeptides form a unique and unaltered pattern, presumably with critical function and characteristics of the H+-PPase family.
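
The residue count and the two scan motifs in this paragraph can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the regular expressions are one possible rendering of D-[A/G/V]3-D-[A/G/V]3-D and of its asparagine variant, and only the three peptides named in the text are tested, not a database.

```python
import re

PRIMITIVE = set("GAVD")
NONAPEPTIDES = ("DVGADLVGK", "DNVGDNVGD")

# Count residues of the four 'primitive' types across both nonapeptides.
primitive_count = sum(residue in PRIMITIVE
                      for peptide in NONAPEPTIDES
                      for residue in peptide)
total_residues = sum(len(peptide) for peptide in NONAPEPTIDES)
print(primitive_count, total_residues)

# Assumed regex forms of the two scan motifs described in the text.
AGV_MOTIF = re.compile(r"D[AGV]{3}D[AGV]{3}D")
NGV_MOTIF = re.compile(r"D[NGV]{3}D[NGV]{3}D")  # alanine replaced by asparagine

matches = {
    peptide: (bool(AGV_MOTIF.fullmatch(peptide)),
              bool(NGV_MOTIF.fullmatch(peptide)))
    for peptide in ("DVGADLVGK", "DNVGDNVGD", "DNNNDNNND")
}
print(matches)
```

In this sketch, DNVGDNVGD and DNNNDNNND satisfy only the asparagine variant, in line with the two patterns reported, while DVGADLVGK satisfies neither because of its Leu and Lys residues.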
Putative metal-binding patterns
Asp residues are strictly conserved in the H+-PPase nonapeptide motifs DVGADLVGK and DNVGDNVGD. The residues aspartic acid (Asp) and glutamic acid (Glu) can act as metal ligands in various proteins [28,29]. UniProtKB/Swiss-Prot was screened for patterns of nine amino acid residues with either Asp or Glu at every fourth position (1, 5 and 9) and allowing any residue at the remaining positions. If the sequence formed an α-helix, with one turn every 3.6 residues, the charged residues would be facing the same side, to facilitate metal-binding properties at the active site.
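
The geometric argument can be made concrete: with 3.6 residues per turn, each residue advances about 100 degrees around the helix axis. The Python sketch below assumes an ideal, regular α-helix (an idealization, not a structural result from the paper) and computes the angular positions of residues 1, 5 and 9 relative to residue 1; the name `helical_angle` is illustrative.

```python
DEGREES_PER_RESIDUE = 360 / 3.6  # ideal alpha-helix: 3.6 residues per turn


def helical_angle(position: int) -> float:
    """Angle (degrees, 0-360) around the helix axis, relative to position 1."""
    return ((position - 1) * DEGREES_PER_RESIDUE) % 360


# Angular positions of the three charged residues of a nonapeptide.
angles = {position: round(helical_angle(position), 1) for position in (1, 5, 9)}
spread = max(angles.values()) - min(angles.values())
print(angles, spread)
```

Under this idealization the three charged positions fall within an arc of roughly 80 degrees, i.e. on the same face of the helix, which is the arrangement the text proposes for metal binding.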
In UniProtKB/Swiss-Prot, 11 648 proteins were found with the motif D-X₃-D-X₃-D, while more than twice as many (26 389 proteins) were found with the motif E-X₃-E-X₃-E. We investigated protein family relationships based upon the Pfam annotations in the UniProtKB/Swiss-Prot entries. For the ‘E-motif’, the most frequent domain was elongation factor Tu (472 proteins), followed by the protein kinase domain (320 proteins). For the ‘D-motif’, the most frequent domain was the EF hand Ca²⁺ ion-binding motif (297 occurrences), followed by S-adenosylmethionine synthases (187 proteins) and kinases (178 occurrences).
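The difference between counting proteins and counting occurrences can be sketched as follows. This is not the ps_scan procedure used for the reported numbers; the function name `count_motif`, the toy sequences, and the choice to count overlapping hits separately are all assumptions made for illustration.

```python
import re

D_MOTIF = r"D.{3}D.{3}D"   # D-X3-D-X3-D
E_MOTIF = r"E.{3}E.{3}E"   # E-X3-E-X3-E


def count_motif(pattern, proteins):
    """Return (proteins with at least one hit, total hits).

    `proteins` maps an identifier to its sequence. Overlapping hits
    are counted individually (an assumption of this sketch).
    """
    lookahead = re.compile(f"(?=(?:{pattern}))")
    n_proteins = 0
    n_occurrences = 0
    for sequence in proteins.values():
        hits = sum(1 for _ in lookahead.finditer(sequence))
        if hits:
            n_proteins += 1
            n_occurrences += hits
    return n_proteins, n_occurrences


# Invented toy sequences, not database entries.
toy = {
    "toy1": "MDAAADKKKDAAADLLL",  # two overlapping D-motif hits
    "toy2": "MEKKKEAAAEGG",       # one E-motif hit
    "toy3": "MKKKK",              # no hit
}

print(count_motif(D_MOTIF, toy))
print(count_motif(E_MOTIF, toy))
```

A protein with an extended periodic run of Asp residues, such as `toy1`, contributes one protein but several occurrences, which is why the two counts can differ.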
Table 2. Sequence patterns containing the four ‘primitive’ amino acid residues searched in the UniProtKB/Swiss-Prot database.

Pattern              Number of proteins   Number of occurrences
A-D-A-D-A            162                  173
D-A-D-A-D            104                  112
A-D-A-D-A-D-A-D-A    2                    2
D-A-D-A-D-A-D-A-D    2                    2
D-A-A-A-D-A-A-A-D    0                    0
D-V-D-V-D            74                   79
V-D-V-D-V            55                   59
D-V-D-V-D-V-D-V-D    1                    2
V-D-V-D-V-D-V-D-V    1                    2
D-V-V-V-D-V-V-V-D    0                    0
D-G-D-G-D            140                  145
G-D-G-D-G            101                  108
D-G-D-G-D-G-D-G-D    0                    0
G-D-G-D-G-D-G-D-G    0                    0
D-G-G-G-D-G-G-G-D    0                    0

Conclusions
The rapidly expanding number of available amino acid sequences has provided novel possibilities to explore early biological evolution, in the direction from ‘very early’ polypeptide synthesis to various known or putative active sites of enzymes, especially those of the H⁺-PPase family. Our analyses with bioinformatic methods have shown that the H⁺-PPases are unique in their sequence properties, with no close relatives detectable using the presently available sensitive methods. The analyses have also shown that the membrane-bound H⁺-PPases form a large family, divided into two subclasses (types 1 and 2), where, notably, both types are found in plants, protozoa, bacteria and archaea. No occurrences exist in vertebrates, with the possible exception of a reported sequence from the X. tropicalis early genome project. In soluble family I PPases, two structural types are known to be strictly divided: one in prokaryotes and the other in eukaryotes [21].
The well-conserved nonapeptides in the loop between transmembrane segments 5 and 6 show specific patterns that can be used for functional assignments of unannotated genomes. The distance between the two nonapeptides is unchanged in all known sequences. We believe that our novel explorations of peptide motifs, both as such and as formed in apparent closeness to situations plausibly existing at the time of the origin and very early evolution of life on Earth, may be usefully extended when even more sequence data and, especially, the first 3D structure of an H⁺-PPase become available. Based on the 3D structure and the results presented here, rational selections of site-specific mutants may be expected to illuminate further both the evolutionary and the dynamic aspects of H⁺-PPase function.
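The constant spacing between the two nonapeptides suggests a simple check for candidate sequences. The sketch below uses the exact R. rubrum nonapeptides and the loop sequence quoted in the introduction. The function name `nonapeptide_gap` is hypothetical, and a real screen would presumably need looser patterns to cover sequence variants across the family.

```python
FIRST = "DVGADLVGK"    # first nonapeptide in the R. rubrum loop
SECOND = "DNVGDNVGD"   # second nonapeptide in the R. rubrum loop

# R. rubrum loop between transmembrane segments 5 and 6.
LOOP = "LGGGIFTKCADVGADLVGKVEAGIPEDDPRNPAVIADNVGDNVGDCAGMAADLFETY"


def nonapeptide_gap(sequence, first=FIRST, second=SECOND):
    """Residues between the end of `first` and the start of `second`.

    Returns None if either nonapeptide is missing or they occur in
    the wrong order.
    """
    i = sequence.find(first)
    if i == -1:
        return None
    end_of_first = i + len(first)
    j = sequence.find(second, end_of_first)
    if j == -1:
        return None
    return j - end_of_first


print(nonapeptide_gap(LOOP))
```

For the quoted R. rubrum loop, the function reports 17 residues between the two nonapeptides.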
Experimental procedures
Pyrophosphatase sequences were searched using BLAST [30] against UniProtKB, version 6.3 (October 2005, http://www.uniprot.org) [15], and an in-house database of all genomes in the public domain (ftp.ensemble.org; ftp.ncbi.nih.gov; ftp.tigr.org), denoted GenomeLKPG (A. Bresell and J. Hedlund, Linköping University, Sweden, personal communication). The searches were complemented by HMM-based screenings based upon the ‘H_PPase’ model from Pfam [31]. We also built and calibrated our own HMM, based upon 86 sequences. For the creation of the HMM and for the screenings, the HMMER software (http://hmmer.wustl.edu) [17] was used with default parameters for building and calibrating (commands ‘hmmbuild’ and ‘hmmcalibrate’).
General sequence comparisons were made using the program FASTA [16] and pattern searches using the ps_scan utility from the Prosite database [32].
In the multiple sequence alignments, sequences annotated as fragments, and those shorter than 300 residues, were removed to improve the alignment quality. In the phylogenetic trees, sequences with pairwise residue identity of more than 90% to any other sequence were excluded. Multiple sequence alignments were calculated using DIALIGN [33], and dendrograms were generated using the neighbour-joining method, as implemented in ClustalX [23].
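The two filtering steps can be sketched as follows. This is a minimal illustration, not the procedure actually used: the function names are hypothetical, identity is computed over alignment columns where both sequences have a residue (one of several possible definitions), and redundancy is removed greedily, keeping the first member of each group of near-identical sequences.

```python
def drop_short(records, min_length=300):
    """Keep (name, sequence) for records that are not fragments and have
    at least `min_length` residues. `records` holds
    (name, sequence, is_fragment) tuples."""
    return [(name, seq) for name, seq, is_fragment in records
            if not is_fragment and len(seq) >= min_length]


def pairwise_identity(a, b):
    """Fraction of identical residues over columns where both aligned
    rows have a residue ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        identical += (x == y)
    return identical / compared if compared else 0.0


def drop_redundant(aligned, max_identity=0.90):
    """Greedy filter: keep a row only if its identity to every row
    already kept does not exceed `max_identity`."""
    kept = []
    for name, row in aligned:
        if all(pairwise_identity(row, other) <= max_identity
               for _, other in kept):
            kept.append((name, row))
    return kept


# Invented toy alignment, for illustration only.
toy_alignment = [
    ("s1", "ACDEFGHIKL"),
    ("s2", "ACDEFGHIKL"),  # identical to s1, removed
    ("s3", "ACDEFGHIKV"),  # exactly 90% identical to s1, kept
    ("s4", "MNPQRSTVWY"),
]
print([name for name, _ in drop_redundant(toy_alignment)])
```

The greedy strategy depends on input order, so sorting the sequences first (for example by length or annotation quality) would determine which representative of each group is retained.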
The plots in Figs 2 and 3 were generated using in-house produced software to calculate residue conservation and intergroup differences, and to map information on substitutions, conserved regions and sequence motifs onto the membrane topology.
Acknowledgements
We thank Anders Bresell for early access to the GenomeLKPG database and Jan-Ove Järrhed for computer support. Financial support from Carl Tryggers Stiftelse för Vetenskaplig Forskning, Magnus Bergvalls Stiftelse, Stiftelsen Wenner-Grenska Samfundet, Karolinska Institutets Stiftelser and Linköping University is gratefully acknowledged.
References
1 Baltscheffsky M (1964) Some characteristics of the
pyrophosphatase reaction in energy-generating systems.
Abstracts 1st FEBS Meeting, p. 67. London.
2 Baltscheffsky H, von Stedingk L-V, Heldt HW &
Klingenberg M (1966) Inorganic pyrophosphate: forma-
tion in bacterial photophosphorylation. Science 153,
1120–1122.
3 Moyle J, Mitchell R & Mitchell P (1972) Proton-trans-
locating pyrophosphatase of Rhodospirillum rubrum.
FEBS Lett 23, 233–236.
4 Baltscheffsky M, Nadanaciva S & Schultz A (1998) A
pyrophosphate synthase gene: molecular cloning and
sequencing of the cDNA encoding the inorganic
pyrophosphate synthase from Rhodospirillum rubrum.
Biochim Biophys Acta 1364, 301–306.
5 Baltscheffsky M, Schultz A & Baltscheffsky H (1999)
H⁺-PPases: a tightly membrane-bound family. FEBS
Lett 457, 527–533.
6 Baltscheffsky H, Schultz A, Persson B & Baltscheffsky M
(2001) Tetra- and nonapeptidyl motifs in the origin and
evolution of photosynthetic bioenergy conversion. In
First Steps in the Origin of Life in the Universe
(Chela-Flores J, Owen T & Raulin F, eds), pp. 173–178.
Kluwer, Dordrecht.
7 Nakanishi Y, Saijo T, Wada Y & Maeshima M (2001)
Mutagenic analysis of functional residues in putative
substrate-binding site and acidic domains of vacuolar
H⁺-pyrophosphatase. J Biol Chem 276, 7654–7660.
8 Eigen M & Schuster P (1978) The hypercycle. Naturwis-
senschaften 65, 341–369.
9 Trifonov EN (2000) Consensus temporal order of
amino acids and evolution of the triplet code. Gene 261,
139–151.
10 Miller SL (1953) A production of amino acids under
possible primitive earth conditions. Science 117, 528–
529.
11 Oro J, Gibert J, Lichtenstein H, Wikstrom S & Flory DA (1971) Amino-acids, aliphatic and aromatic hydrocarbons in the Murchison Meteorite. Nature 230, 105–106.
12 Maeshima M (2000) Vacuolar H⁺-pyrophosphatase. Biochim Biophys Acta 1465, 37–51.
13 Baltscheffsky H & Baltscheffsky M (1995) Energy-rich phosphate compounds and the origin of life. In Evolutionary Biochemistry and Related Areas of Physicochemical Biology (Poglazov B, Kurganov BI, Kritsky MS & Gladilin KL, eds), pp. 191–199. Bach Institute of Biochemistry and ANKO, Moscow.
14 Lopez-Marques RL, Perez-Castineira JR, Losada M & Serrano A (2004) Differential regulation of soluble and membrane-bound inorganic pyrophosphatases in the photosynthetic bacterium Rhodospirillum...
[...]
... (1998) Profile hidden Markov models. Bioinformatics 14, 755–763.
18 Belogurov G (2004) Pyrophosphate-energized proton pumps: identification of the residues determining K⁺ requirements and discovery of a Na⁺-dependent enzyme. Dissertation, University of Turku, Finland.
19 Drozdowicz YM, Kissinger JC & Rea PA (2000) AVP2, a sequence-divergent, K⁺-insensitive H⁺-translocating inorganic pyrophosphatase from Arabidopsis... 353–362.
20 Belogurov GA, Turkina MV, Penttinen A, Huopalahti S, Baykov AA & Lahti R (2002) H⁺-pyrophosphatase of Rhodospirillum rubrum. High yield expression in Escherichia coli and identification of the Cys residues responsible for inactivation by mersalyl. J Biol Chem 277, 22209–22214.
21 Tammenkoski M, Benini S, Magretova NN, Baykov AA & Lahti R (2005) An unusual, His-dependent family I pyrophosphatase...
... (2002) A lysine substitute for K⁺. A460K mutation eliminates K⁺ dependence in H⁺-pyrophosphatase of Carboxydothermus hydrogenoformans. J Biol Chem 277, 49651–49654.
23 Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG & Thompson JD (2003) Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res 31, 3497–3500.
24 Saraste M, Sibbald PR & Wittinghofer A (1990) The P-loop... motif in ATP- and GTP-binding proteins. Trends Biochem Sci 15, 430–434.
25 Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304, 66–74.
26 Mimura H, Nakanishi Y, Hirono M & Maeshima M (2004) Membrane topology of the H⁺-pyrophosphatase of Streptomyces coelicolor determined... cysteine-scanning mutagenesis. J Biol Chem 279, 35106–35112.
27 Baltscheffsky H, Schultz A & Baltscheffsky M (2002) Fundamental characteristics of life and of the molecular origin and evolution of biological energy conversion. In Fundamentals of Life (Palyi G, Zucchi C & Cagliotti L, eds), pp. 87–94. Elsevier SAS, Paris.
28 Marsden BJ, Shaw GS & Sykes BD (1990) Calcium binding proteins. Elucidating the contributions... an analysis of species variants and peptide fragments. Biochem Cell Biol 68, 587–601.
29 Auld DS (2001) Zinc coordination sphere in biochemical zinc sites. Biometals 14, 271–313.
30 Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W & Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389–3402.
31 Bateman A, Coin L, Durbin...
[...]
... improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15, 211–218.
34 Crooks GE, Hon G, Chandonia JM & Brenner SE (2004) WebLogo: a sequence logo generator. Genome Res 14, 1188–1190.

Supplementary material
The following supplementary material is available online:
Table S1. List of the 166 proton-pumping inorganic pyrophosphatase (H⁺-PPase) sequences found in UniProtKB... Swiss-Prot, UniProtKB/TrEMBL, and GenomeLKPG databases.
Table S2. List of the 164 proton-pumping inorganic pyrophosphatase (H⁺-PPase) partial sequences from the Sargasso Sea sequencing project.
This material is available as part of the online article from http://www.blackwell-synergy.com

FEBS Journal 273 (2006) 5183–5193 © 2006 The Authors. Journal compilation © 2006 FEBS