Modules, multidomain proteins and organismic complexity
Hedvig Tordai, Alinda Nagy, Krisztina Farkas, László Bányai and László Patthy
Institute of Enzymology, Biological Research Center, Hungarian Academy of Sciences, Budapest, Hungary
The average size of a protein domain of known crystal
structure is about 175 residues; proteins that are larger
than 200–300 residues usually consist of multiple pro-
tein folds [1]. The individual structural domains of
such multidomain proteins are defined as compact
folds that are relatively independent inasmuch as the
interactions within one domain are more significant
than with other domains. The individual domains of
multidomain proteins usually fold independently of the
other domains.
Some multidomain proteins contain multiple copies
of a single type of structural domain, indicating that
internal duplication of a gene segment encoding a
domain has given rise to such proteins. Many multi-
domain proteins contain different types of domains
(i.e. domains that are not homologous to each other).
The genes of such multidomain proteins were created
by joining two or more gene segments that encode dif-
ferent protein domains. Such multidomain proteins,
consisting of multiple domains of independent evolu-
tionary origin, are frequently referred to as mosaic
proteins.
Multidomain proteins have some unique features
that endow them with major evolutionary significance.
In multidomain proteins a large number of functions
(different binding activities, catalytic activities) may
coexist, making such proteins indispensable constituents
of regulatory or structural networks where multiple
interactions (protein–protein, protein–ligand, protein–
DNA, etc., interactions) are essential. For example,
the domains that constitute multidomain proteins of
the intracellular and extracellular signaling pathways
mediate multiple interactions with other components
of the signaling pathways. Similarly, the coexistence of
different domains with different binding specificities is
also essential for the biological function of multi-
domain proteins of the extracellular matrix: the mul-
tiple, specific interactions among matrix constituents
Keywords
domain; exon-shuffling; module;
multidomain protein; organismic complexity
Correspondence
L. Patthy, Institute of Enzymology,
Biological Research Center, Hungarian
Academy of Sciences, Budapest, PO Box 7,
H-1518, Hungary
Fax: +361 4665465
Tel: +361 2093537
E-mail: patthy@enzim.hu
(Received 9 May 2005, revised 9 August
2005, accepted 12 August 2005)
doi:10.1111/j.1742-4658.2005.04917.x
Originally the term ‘protein module’ was coined to distinguish mobile
domains that frequently occur as building blocks of diverse multidomain
proteins from ‘static’ domains that usually exist only as stand-alone units
of single-domain proteins. Despite the widespread use of the term ‘mobile
domain’, the distinction between static and mobile domains is rather vague
as it is not easy to quantify the mobility of domains. In the present work
we show that the most appropriate measure of the mobility of domains is
the number of types of local environments in which a given domain is pre-
sent. Ranking of domains with respect to this parameter in different evo-
lutionary lineages highlighted marked differences in the propensity of
domains to form multidomain proteins. Our analyses have also shown that
there is a correlation between domain size and domain mobility: smaller
domains are more likely to be used in the construction of multidomain pro-
teins, whereas larger domains are more likely to be static, stand-alone
domains. It is also shown that shuffling of a limited set of modules was
facilitated by intronic recombination in the metazoan lineage and this has
contributed significantly to the emergence of novel complex multidomain
proteins, novel functions and increased organismic complexity of metazoa.
Abbreviations
TSP1, thrombospondin type I.
5064 FEBS Journal 272 (2005) 5064–5078 © 2005 FEBS
are indispensable for the proper architecture of the
extracellular matrix. As a corollary of their involve-
ment in multiple interactions, formation of novel
multidomain proteins is likely to contribute signifi-
cantly to the evolution of increased organismic com-
plexity since the latter reflects the complexity of
interactions among genes, proteins, cells, tissues and
organs [2].
Despite such valuable properties of large, complex
multidomain proteins the vast majority of proteins con-
tain only one domain [3–5]. Furthermore, recent studies
have revealed that the majority of multidomain proteins
tend to have very few domains. Wolf et al. [4] counted
the number of different folds in each protein of the
proteomes of archaea, bacteria and eukarya and calculated
the average fraction of proteins with each given number
of domains. These analyses showed that in archaea,
bacteria and eukarya alike each successive class (e.g.
two-domain proteins vs. single-domain proteins,
three-domain proteins vs. two-domain proteins, etc.)
contains significantly fewer entries than the previous
one. More
recent mathematical analyses of the distribution of
multidomain proteins according to the number of dif-
ferent constituent domains have revealed that their
distribution follows a power law, i.e. single-domain
proteins are the most abundant, whereas proteins con-
taining larger numbers of domain-types are increasingly
less frequent. This type of distribution is consistent
with a random recombination (joining and breaking)
model of evolution of multidomain architectures [6].
The observation of Wolf et al. [4] that the size
distribution of multidomain proteins was very similar in
eukaryotes and prokaryotes apparently contradicted
the notion that evolution of complex eukaryotes favored
(and benefited from) the formation of more and
larger multidomain proteins as they contributed to
their increased organismic complexity.
Recent analyses, however, provided evidence that
there may be a connection between the propensity of
protein domains to form multidomain architectures
and organismic complexity. For example, Koonin et al.
[6] have shown that – although in all proteomes the
domain distribution is compatible with a random
recombination model of the evolution of multidomain
architectures – the likelihood of domain joining
appears to increase in the order Archaea < Bacteria <
Eukaryotes, and there is a significant excess of larger
multidomain proteins in Eukaryotes. Similarly, Wuchty
[5] has shown that higher organisms tend to have more
complex multidomain proteins. Using graph theory-
based tools to survey and compare protein domain
organizations of different organisms Ye and Godzik [7]
have shown that the number of domains, the number
of domain combinations, and the size of the largest
connected component of domain-combination net-
works (measured by the number of domains it consists
of) of each organism increase with the complexity of
the organisms.
The propensity of different domain types to form
multidomain proteins shows great variation, ranging
from ‘static’ domains that rarely or never occur in
multidomain proteins, to ‘mobile’ domains (usually
referred to as modules) that frequently participate in
gene-rearrangements to build multidomain proteins.
Various analyses of the number of multidomain archi-
tectures in which different domain-types are involved
have shown that their distribution also follows a power
law: a minority of domain-types (the ‘mobile’ modules)
occur in numerous multidomain proteins, whereas the
majority of domains belong to categories that are
rarely used in multidomain proteins [5,6,8]. Such a
power law distribution indicates that the chance that a
domain is used in the construction of novel multidomain
proteins is proportional to the number of
times it has already been used.
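
The proportionality described above can be illustrated with a toy simulation. The sketch below is only an illustration under stated assumptions and is not the model used in refs [5,6,8]: each use of a domain either introduces a new domain-type (with a hypothetical probability `p_new`) or reuses an existing type with a probability proportional to the number of times that type has already been used. The function name, parameters and default values are invented for this example.

```python
import random
from collections import Counter


def simulate_domain_reuse(n_events, p_new=0.1, seed=0):
    """Toy preferential-attachment model of domain reuse (illustrative only).

    Returns a Counter mapping a domain-type id to its number of uses.
    """
    rng = random.Random(seed)
    uses = [0]        # one entry per use event; the value is a domain-type id
    n_types = 1
    for _ in range(n_events - 1):
        if rng.random() < p_new:
            uses.append(n_types)      # a new domain-type is used for the first time
            n_types += 1
        else:
            # Drawing a past use uniformly at random selects a domain-type with
            # probability proportional to how often it has already been used.
            uses.append(rng.choice(uses))
    return Counter(uses)


counts = simulate_domain_reuse(5000)
# Histogram: number of uses -> number of domain-types with that many uses.
usage_histogram = Counter(counts.values())
```

In runs of this kind a few domain-types typically accumulate many uses while most are used only once or twice, which is the qualitative pattern expected of a power-law distribution.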
As for any other type of genetic change, the fre-
quency of joining a given domain-type to other
domains to create novel multidomain architectures
reflects the probability of such a genetic change and
the probability of its fixation. In other words, the
propensity of a domain to form multidomain proteins is a
function of the frequency of genetic events that can
lead to such gene-fusions and the selective value of the
resulting chimeric proteins. Accordingly, it is likely
that the most mobile modules have acquired this status
as a result of a combination of special structural, func-
tional and genomic features [9].
First, certain structural features of domains may
facilitate their preferential proliferation in multidomain
proteins. For example, the stability and folding autonomy
of domains in multidomain proteins may be of
utmost importance for their mobility as this minimizes
the influence of neighboring domains [9]. Folding
autonomy can ensure that folding of the domain is not
disrupted when it is inserted into a novel protein
environment. It thus seems very likely that the most widely
used domains have been selected according to the rate,
robustness and autonomy of folding [10]. It is noteworthy
in this respect that multidomain proteins are
under-represented in Archaea compared with the other
two kingdoms of life and this fact is thought to be
related to the lower stability of multidomain proteins
in the hyperthermophilic environments where most
archaeal species live [6].
Second, functional aspects may also contribute to
the proliferation of certain domains. For example, in
complex cellular signaling pathways there is a greater
demand for domains that mediate interaction with
other constituents of the pathways (e.g. the protein kinase
domain); selection may thus have favored the spread of
these modules to other multidomain proteins. Finally,
special genomic features of certain genes (gene-
segments) may have significantly facilitated their com-
bination with other domains.
To gain further insight into the factors that influence
the mobility of domains and control the creation of
multidomain proteins, in the present work we have
compared the propensity of different domains to form
multidomain proteins in several major groups of
organisms (Bacteria, Archaea, Protozoa, Plants, Fungi,
Metazoa) as well as in individual proteomes of some
representative species.
The specific questions we have addressed were: (a)
What is the most appropriate parameter that reflects
the evolutionary mobility of protein domains? (b) Are
there significant differences in the propensity to form
multidomain proteins in different evolutionary line-
ages? (c) How do structural and functional properties
of domains influence their mobility? (d) Is there reli-
able evidence for the notion that intronic recombina-
tion has significantly contributed to the remarkable
mobility of some domain-types in metazoa?
Results and discussion
Differences in the propensity to form multidomain
proteins in different evolutionary lineages
As shown in Table 1, different evolutionary groups
show significant differences in the propensity to form
multidomain proteins: the proportion of multidomain
proteins decreases in the order metazoa >
plants > fungi ≈ protozoa > bacteria > archaea. At
one extreme we find archaea where only 23% of the
entries contain more than one Pfam-A domain, while
metazoa represent the other extreme where 39% of the
entries correspond to multidomain proteins. It is also
clear from Table 1 that in metazoa a larger proportion
of Pfam-A domains participates in the construction of
multidomain proteins than in archaea. Furthermore,
the multidomain proteins of metazoa tend to be larger
than those in archaea: multidomain proteins with
more than 10 Pfam-A domains are nine times more
frequent in metazoa than in archaea (Table 2). This
observation is in harmony with earlier conclusions that
the average protein length is considerably greater in
eukaryotes than in prokaryotes [11]. These differences
between different evolutionary lineages are unlikely to
be due to differences in annotation coverage. As
shown recently by Ekman et al. [12], the Pfam-A
domain coverage is similar for archaea, bacteria and
eukarya: in each group about 70% of the proteins
have at least one Pfam-A domain. In agreement with
this conclusion, our analyses have also shown that
Pfam-A coverage is similar for bacteria, archaea, pro-
tozoa, plants, fungi and metazoa (Table 3).
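
The proportions quoted above follow directly from the counts in Table 1 and the entries in Table 2. The short sketch below merely reproduces this arithmetic; the variable and function names are introduced here for illustration and are not part of the original analysis.

```python
# Counts copied from Table 1: (proteins with at least one Pfam-A domain,
# proteins with at least two Pfam-A domains).
TABLE_1 = {
    "Bacteria": (273859, 73076),
    "Archaea": (23728, 5529),
    "Protozoa": (16756, 5298),
    "Plants": (57620, 20359),
    "Fungi": (20371, 6434),
    "Metazoa": (129881, 51085),
}


def multidomain_percentage(group):
    """Percentage of Pfam-A-containing proteins that are multidomain."""
    total, multidomain = TABLE_1[group]
    return 100.0 * multidomain / total


# Percentages of proteins with more than 10 Pfam-A domains, from Table 2.
MORE_THAN_10 = {"Archaea": 0.19, "Metazoa": 1.74}
fold_difference = MORE_THAN_10["Metazoa"] / MORE_THAN_10["Archaea"]
```

The metazoan and archaeal values evaluate to about 39.3% and 23.3%, respectively, and the ratio of the two Table 2 entries is about 9.2, in line with the 'nine times' statement.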
To gain a deeper insight into the factors controlling
the frequency and size of multidomain proteins in
different groups of organisms we have plotted the number
of multidomain proteins vs. the number of
constituent domains. Earlier studies have pointed out
that such distributions usually fit the power law
$P(i) \cong c\,i^{-\gamma}$, where P(i) is the number of multidomain
Table 1. Domains and multidomain proteins in different groups of organisms. (a) Proteins containing at least one Pfam-A domain; (b) Pfam-A domains; (c) proteins containing at least two Pfam-A domains; (d) domains occurring in at least one multidomain protein; (e) domains occurring only as stand-alone domains in single-domain proteins.
Group | Proteins (a) | Domains (b) | Multidomain proteins (c) (% of proteins) | Mobile domains (d) (% of domains) | Static domains (e) (% of domains)
Bacteria | 273 859 | 4079 | 73 076 (27%) | 1974 (48%) | 2105 (52%)
Archaea | 23 728 | 1725 | 5529 (23%) | 776 (45%) | 949 (55%)
Protozoa | 16 756 | 1967 | 5298 (32%) | 932 (47%) | 1035 (53%)
Plants | 57 620 | 2562 | 20 359 (35%) | 1305 (51%) | 1257 (49%)
Fungi | 20 371 | 2249 | 6434 (32%) | 1102 (49%) | 1147 (51%)
Metazoa | 129 881 | 3272 | 51 085 (39%) | 1748 (53%) | 1524 (47%)
Table 2. Percentage of multidomain proteins containing more than N Pfam-A domains in different groups of organisms.
N | Bacteria | Archaea | Protozoa | Plants | Fungi | Metazoa
1 | 26.67 | 23.30 | 31.62 | 35.33 | 31.58 | 39.33
2 | 8.88 | 7.01 | 14.95 | 14.28 | 12.98 | 17.97
3 | 3.94 | 3.40 | 8.84 | 8.00 | 6.88 | 11.11
4 | 1.96 | 1.66 | 5.72 | 5.33 | 4.15 | 7.66
5 | 1.24 | 1.13 | 3.94 | 3.84 | 2.72 | 5.62
10 | 0.27 | 0.19 | 1.14 | 1.22 | 0.39 | 1.74
proteins containing exactly i domains, c is a normalization
constant and γ is a parameter, which typically
assumes values between 1 and 3 [13]. In double-logarithmic
plots, the plot of P(i) as a function of i is a
straight line with slope −γ. As shown in
Fig. 1, in the case of each evolutionary group the data
closely follow straight lines in double-logarithmic plots,
consistent with power-law dependence. The distribution
of values of metazoan multidomain proteins
was found to be significantly different from those of
multidomain proteins of plants (P = 0.0002), bacteria
(P < 0.0001), fungi (P < 0.0001) or archaea
(P < 0.0001). The fact that the slopes of the curves in
Fig. 1 are increasingly steeper in the order
metazoa → plants → bacteria → fungi → archaea
(Table 4) indicates that the likelihood of domain joining
is greater in metazoa than in prokaryotes, plants
and fungi. Surprisingly, the slope in protozoa is similar
to that observed for metazoa.
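
The exponent of such a power law can be estimated from the slope of the double-logarithmic plot. The following is a minimal sketch that assumes an ordinary least-squares fit on log10-transformed values; the fitting procedure and software actually used for Table 4 are not described in this section, and the function name and the synthetic data are hypothetical. The exponent is written `gamma` in the code.

```python
import math


def fit_power_law(counts):
    """Least-squares fit of log10 P(i) = log10 c - gamma * log10 i.

    `counts` maps i (number of domains) to P(i) (number of proteins).
    Entries with P(i) == 0 are skipped because log10(0) is undefined.
    Returns (gamma, c, r), where r is the signed Pearson correlation
    coefficient of the log-transformed values.
    """
    pts = [(math.log10(i), math.log10(p)) for i, p in counts.items() if p > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return -slope, 10 ** intercept, r


# Synthetic check (not data from the paper): P(i) = 50000 * i**-2.7
synthetic = {i: 50000 * i ** -2.7 for i in range(2, 41)}
gamma, c, r = fit_power_law(synthetic)
```

On the synthetic input the fit recovers an exponent of 2.7. Because the fitted line descends, the signed correlation coefficient returned here is negative; Table 4 lists positive R values, which presumably report the magnitude.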
A possible explanation for the unusual abundance
of larger multidomain proteins in protozoa is that
parasitic protists have acquired metazoan-like multido-
main proteins through lateral gene transfer. Recently it
has been shown that different lineages of apicomplexan
protozoa (e.g. Plasmodium, Cryptosporidium) have
acquired distinct but overlapping sets of multidomain
surface proteins constructed from adhesion domains
typical of animal proteins, although in no case do they
share multidomain architectures identical to those of
animals [14,15]. Some of these proteins contain con-
served adhesion domains such as the epidermal growth
factor-like domain (EGF domain), thrombospondin
type I (TSP1) domain, the von Willebrand factor A
(vWA) domain and the PAN/APPLE domain that are
typically abundant in animal surface proteins but are
absent or rarely present in surface adhesion molecules
Table 3. Percentage of positions in domain-triplet types occupied by Pfam-A domains vs. Nterm, Cterm and Unknown regions in multidomain proteins of different groups of organisms. For definition of domain-triplet type, Nterm, Cterm and Unknown regions in domain-triplets see Methods.
Occupied by | Bacteria | Archaea | Protozoa | Plants | Fungi | Metazoa
Pfam-A | 71 | 70 | 68 | 68 | 67 | 71
Nterm | 8 | 9 | 6 | 6 | 7 | 6
Cterm | 8 | 8 | 7 | 7 | 7 | 6
Unknown | 13 | 13 | 19 | 18 | 19 | 16
Fig. 1. Distribution of multidomain proteins with respect to the number of constituent domains. The figure shows the number of constituent domains (x axis, log10 scale) compared with the number of multidomain proteins that have that number of domains (y axis, log10 scale) in bacteria (A), archaea (B), protozoa (C), plants (D), fungi (E) and metazoa (F). Parameters of the plots are compiled in Table 4.
Table 4. Parameters of the linear fit of the double logarithmic plots for $P(i) = c\,i^{-\gamma}$, where P(i) is the number of multidomain proteins and i is the number of constituent domains.
Parameter | Bacteria | Archaea | Protozoa | Plants | Fungi | Metazoa
γ | 2.9343 | 3.1744 | 2.7101 | 2.8356 | 3.0635 | 2.7457
R | 0.9597 | 0.9822 | 0.9785 | 0.9737 | 0.9755 | 0.9865
N | 65 | 21 | 34 | 36 | 23 | 40
P | < 0.0001 | < 0.0001 | < 0.0001 | < 0.0001 | < 0.0001 | < 0.0001
in other eukaryotic lineages. A systematic analysis of
the C. parvum proteome has identified 32 widely conserved
surface domains distributed in 51 proteins,
including 24 noncatalytic protein- or carbohydrate-interacting
domains and seven catalytic domains. Most
strikingly, 10 of these domains, namely TSP1, sushi/CCP,
Notch/Lin (NL1), NEC (neurexin-collagen
domain), fibronectin type 2 (FN2), pentraxin, the MAM
domain (a domain present in meprin, A5 and receptor
protein tyrosine phosphatase mu), the ephrin-receptor
EGF-like domain, the animal signaling protein
hedgehog-type HINT domain and the scavenger domain,
have thus far been found, apart from apicomplexans,
only in the surface proteins of animals. The remaining
domains, such as the EGF, LCCL (a domain
first found in Limulus factor C, Coch-5b2 and Lgl1),
Kringle and SCP domains, are seen in some other
eukaryotes but are found predominantly in animals. In
phylogenetic analyses specific affinities between
apicomplexan and animal versions were recovered [16],
making horizontal gene transfer from animals, followed
by selective retention of functionally relevant
proteins involved in adhesion, the most parsimonious
explanation for these observations.
It thus appears that metazoa favor the formation of
larger multidomain proteins than archaea, bacteria,
fungi and plants. To test whether this is related to the fact
that the world of extracellular (and some transmembrane)
multidomain proteins has significantly expanded
in metazoa [2,9], we have analyzed the size distribution
of extracellular, transmembrane and intracellular
multidomain proteins of metazoa separately.
Differences in the propensity to form extracellular,
intracellular and transmembrane
multidomain proteins in metazoa
Double-logarithmic plots of the number of extracellular,
intracellular and transmembrane multidomain proteins
vs. the number of constituent domains have
revealed that in each case the data follow straight lines,
consistent with power-law dependence. The distribution
of values for extracellular multidomain proteins,
however, differed significantly from those of intracellular
multidomain proteins (P < 0.0001), of transmembrane
multidomain proteins (P = 0.0010) or of total
metazoan multidomain proteins (P < 0.0001). The
slope for extracellular multidomain proteins is shallower
than the value for intracellular multidomain proteins,
for transmembrane proteins or for total
metazoan multidomain proteins (Table 5).
These observations indicate that the ratio of domain
joining/breaking is greater for extracellular than for
intracellular multidomain proteins of metazoa. To test
whether this reflects the fact that exon-shuffling of
class 1–1 modules contributed primarily to the creation
of extracellular (and extracellular parts of some
transmembrane) multidomain proteins of metazoa [2,9], we
have analyzed the size distribution of multidomain
proteins assembled from class 1–1 modules (irrespective
of their subcellular localization).
Power law distribution of metazoan multidomain
proteins assembled by exon-shuffling from class
1–1 modules
Analysis of the double-logarithmic plot of the number
of multidomain proteins assembled from class 1–1 modules
vs. the number of constituent domains has revealed
that the distribution of values differs significantly from
those for total metazoan multidomain proteins
(P < 0.0001) or for intracellular metazoan multidomain
proteins (P < 0.0001). The slope for multidomain
proteins assembled from class 1–1 modules is shallower
than the values for intracellular or total metazoan
multidomain proteins (Table 5). This observation is
consistent with the notion that exon-shuffling of class
1–1 modules has favored the creation of larger (primarily
extracellular) multidomain proteins of metazoa.
Domain size and propensity to form multidomain
proteins
By plotting the size of multidomain proteins as a function
of the number of constituent Pfam-A domains we
obtained a linear relationship (Y = A + B*X), where
X is the number of domains, B is the average size (in
amino acid residues) of Pfam-A domains actually used
to build multidomain proteins and Y is the size of the
multidomain proteins (Fig. 2). The value of B was found
to be 80 amino acid residues, much smaller than the
average size of Pfam-A domains (178 residues) present
in the Pfam-A database. This observation suggests that
smaller domains are more likely to be used in the
construction of multidomain proteins. As the value of A is
Table 5. Parameters of the linear fit of the double logarithmic plots for $P(i) = c\,i^{-\gamma}$, where P(i) is the number of extracellular, transmembrane, intracellular or class 1–1 multidomain proteins and i is the number of constituent domains.
Parameter | Total | Extracellular | Transmembrane | Intracellular | Class 1–1
γ | 2.7457 | 2.1107 | 2.7479 | 2.8071 | 2.4031
R | 0.9865 | 0.9362 | 0.9684 | 0.9781 | 0.9714
N | 40 | 35 | 34 | 37 | 39
P | < 0.0001 | < 0.0001 | < 0.0001 | < 0.0001 | < 0.0001
302 amino acid residues, this also suggests that Pfam-A
domains larger than average are more likely to be static,
stand-alone domains. Figure 2 thus suggests that larger
Pfam-A domains predominate in single- and oligodomain
proteins, whereas larger multidomain proteins are
constructed from smaller Pfam-A domains. It is noteworthy
in this respect that the most versatile mobile
modules (e.g. the EGF, ig, fn1, TSP_1, Sushi, Ldl_recept_a,
SH3–1, SH2 modules, kringles) are shorter than 100
amino acid residues. A possible explanation for this
phenomenon is that smaller, compact domains are more
likely to satisfy the folding autonomy criterion that is
crucial for their structural integrity in multidomain
proteins. This explanation is supported by the fact that the
rate of protein folding of single-domain proteins is
inversely proportional to protein length [17,18].
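
The linear relationship can be made concrete with the fitted values reported in the legend of Fig. 2 (A = 302.3, B = 80.1). The sketch below only evaluates the fitted line; the function name is introduced for illustration, and given the reported r² of 0.2987 the predictions describe an average trend rather than individual proteins.

```python
# Fitted parameters reported in the legend of Fig. 2 (amino acid residues).
A = 302.3   # intercept of the linear fit
B = 80.1    # slope: residues added per additional Pfam-A domain


def expected_protein_size(n_domains):
    """Protein size predicted by the linear fit Y = A + B * X."""
    return A + B * n_domains


# Predicted average sizes for proteins with 1, 2, 5 and 10 Pfam-A domains.
sizes = {n: expected_protein_size(n) for n in (1, 2, 5, 10)}
```

For a single-domain protein the fit gives about 382 residues, whereas each further domain adds only about 80 residues, which is the contrast with the 178-residue database average discussed above.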
Measuring the evolutionary mobility of protein
domains
It has long been known that the propensity of individ-
ual domains to form multidomain architectures shows
significant differences: whereas the majority of
domains are rarely observed in multidomain proteins,
some domains are extremely widely used [18]. Never-
theless, the distinction between static and mobile
domains is rather vague since it is not simple to meas-
ure domain mobility.
The frequent reuse (‘mobility’) of a protein domain
increases several parameters, such as (a) the
number of proteins in the proteome(s) in which it is
present; (b) the number of copies of the domain in the
proteome(s); (c) the number of other domain-types with which
the given domain co-occurs to form multidomain
proteins; and (d) the number of multidomain protein
architectures (linear sequences of domains, domain
organizations) in which the given domain occurs.
Parameters (a) and (b) are rarely used to illustrate
differences in the mobility of protein domains, as it is
clear that these parameters may also be affected by
and may have more to do with gene duplications or
domain duplications than with domain mobility.
In recent years the mobility of a domain was most
frequently measured by the number of other domain-
types with which the given domain co-occurs (to which
it is ‘connected’) in multidomain proteins [5,7]. An
obvious problem with this ‘co-occurrence’ or ‘connec-
tivity’ approach is that a domain may co-occur with a
large number of other domains in large families of
multidomain proteins in which the given domain is
always in the same local context, i.e. it shows no sign
of mobility (Fig. 3). We face a similar problem if we
wish to use the number of multidomain protein archi-
tectures to measure mobility of domains: a domain
may occur in a large number of different architectures
in which the given domain is always in the same local
context (Fig. 3). As illustrated in this figure, during
evolution of multidomain protein families, domain
insertions distant from the given domain may lead to
marked changes in the number of architectures in
which a given domain is present and in
the number of co-occurring domains, even though the
given domain remains in the same local environment.
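
The problem can be reproduced on a toy data set. The sketch below is an illustration under stated assumptions: each protein is represented as a tuple of domain names in N- to C-terminal order, and the domain names and helper functions are hypothetical.

```python
def cooccurrence_count(domain, architectures):
    """Number of other domain-types found in any protein containing `domain`."""
    partners = set()
    for arch in architectures:
        if domain in arch:
            partners.update(d for d in arch if d != domain)
    return len(partners)


def architecture_count(domain, architectures):
    """Number of distinct linear domain orders that contain `domain`."""
    return len({tuple(arch) for arch in architectures if domain in arch})


# Hypothetical protein family: domain "X" always sits between "L" and "R";
# all variation comes from insertions distant from X.
toy = [
    ("A", "L", "X", "R"),
    ("B", "L", "X", "R"),
    ("A", "B", "L", "X", "R"),
    ("C", "D", "L", "X", "R", "E"),
]

n_partners = cooccurrence_count("X", toy)        # 7 co-occurring domain-types
n_architectures = architecture_count("X", toy)   # 4 architectures
```

Both counts are high for "X" in this toy family although its immediate neighbors never change.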
To assess the significance of these problems, in the
present work we have introduced the number of local
architecture-types in which the given domain occurs as
a measure of its mobility. Local architecture (local
context) is defined as the ‘triplet’ consisting of the clo-
sest upstream (if any) and downstream (if any) domain
neighbors of the given domain.
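
A minimal sketch of this triplet definition is given below. It assumes that each protein is available as a tuple of domain names in N- to C-terminal order and records a missing neighbor as `None`; the Nterm, Cterm and Unknown labels defined in the Methods are not reproduced here, and the function name and toy data are hypothetical.

```python
def triplet_types(domain, architectures):
    """Set of (upstream, domain, downstream) local contexts of `domain`.

    A neighbor that does not exist (protein end) is recorded as None.
    """
    triplets = set()
    for arch in architectures:
        for pos, name in enumerate(arch):
            if name != domain:
                continue
            upstream = arch[pos - 1] if pos > 0 else None
            downstream = arch[pos + 1] if pos + 1 < len(arch) else None
            triplets.add((upstream, name, downstream))
    return triplets


# Hypothetical architectures: "X" keeps the same neighbors everywhere,
# whereas "M" is found in three different local contexts.
toy = [
    ("A", "L", "X", "R"),
    ("B", "L", "X", "R"),
    ("M", "A"),
    ("A", "M", "B"),
    ("L", "X", "R", "M"),
]

x_triplets = triplet_types("X", toy)   # {("L", "X", "R")}
m_triplets = triplet_types("M", toy)   # three distinct triplet types
```

By this measure "X" is static (one triplet type) and "M" is mobile (three triplet types), regardless of how many architectures or co-occurring domains each has.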
As illustrated in Table 6, ranking domains with
respect to the number of types of domains co-occurring
with the domain in metazoan multidomain proteins
(CO-OCCURRENCE), the number of types of
metazoan multidomain protein architectures in which
the domain is present (ARCHITECTURE) and the number
of local architectures (TRIPLETS) in which the
domain is present gives very different results.
Fig. 2. Pfam-A domain number and protein size of metazoan proteins. The line shows a linear fit according to the equation Y = A + B*X, where Y is the size of proteins with a given number of Pfam-A domains and X is the number of constituent Pfam-A domains (N = 129881; B = 80.1 ± 0.3425; A = 302.3 ± 1.243; r² = 0.2987, P < 0.0001, 95% confidence interval). The figure shows only the data for proteins containing fewer than 25 domains; the squares represent the average size of proteins with a given number of Pfam-A domains.
The similarities and differences of the information
content of ‘TRIPLET’ vs. ‘ARCHITECTURE’ and
‘CO-OCCURRENCE’ are illustrated in Fig. 4. As
shown in Fig. 4(A), there is a clear linear relationship
(Y = B*X; R = 0.9156, P < 0.0001) between the
number of architecture-types (X) and the number of
Fig. 3. Domain organization of representative multidomain proteins containing the GPS domain (G-protein-coupled receptor proteolytic site domain). The rectangles highlight the two types of local environments in which the GPS domain occurs. The multidomain proteins shown represent 10 distinct architectures and contain 18 types of co-occurring domains. Note that GPS-containing multidomain proteins have diverse architectures owing to the relatively high number of co-occurring domain-types, although the local environment of the GPS domain is mostly unchanged: it is present in only two triplet types.
Proteins shown (name, accession number):
Probable G protein-coupled receptor 97 precursor, Q8R0T6
CD97 antigen precursor, P48960
Brain-specific angiogenesis inhibitor 1, Q8CGM0
Latrophilin-like protein LAT-2, AAQ84879
Probable G protein-coupled receptor 126 precursor, Q86SQ4
Probable G protein-coupled receptor 125 precursor, Q8IWK6
Probable G protein-coupled receptor 116 precursor, Q8IZF2
Latrophilin-1, O97830
Receptor for egg jelly 3 protein, Q95V80
Polycystic kidney disease and receptor for egg jelly related protein precursor, Q9Z0T6
Table 6. Ranking of Pfam-A domains in metazoa with respect to parameters reflecting their evolutionary mobility. Only the top-ranking 20 are shown; the domains are listed in the order of decreasing mobility. The domain names correspond to those used by the Pfam database (http://www.sanger.ac.uk/Software/Pfam/) [22]. Class 1–1 modules are highlighted in bold.
Rank | Number of types of co-occurring domains | Number of types of architectures | Number of types of ‘triplets’
1 | Pkinase | EGF | EGF
2 | EGF | I-set | Pkinase
3 | Ank | fn1 | ig
4 | PH | ig | PH
5 | zf-C3HC4 | LRR | fn3
6 | zf-C2H2 | Pkinase | EGF_CA
7 | fn1 | EGF_CA | I-set
8 | EGF_CA | Ank | SH3–1
9 | ig | zf-C2H2 | CUB
10 | SH3–1 | PH | Ldl_recept_a
11 | I-set | SH3–1 | zf-C2H2
12 | efhand | Ldl_recept_a | TSP_1
13 | PDZ | Laminin_G_2 | Ank
14 | LRR | Collagen | Sushi
15 | WD40 | Sushi | zf-C3HC4
16 | Lectin_C | efhand | efhand
17 | Ldl_recept_a | PDZ | PDZ
18 | IQ | IQ | zf-CCHC
19 | TSP_1 | CUB | C1–1
20 | Helicase_C | zf-C3HC4 | SH2
Fig. 4. Comparison of the number of triplet types, number of architecture-types and number of co-occurring domain-types in metazoan multidomain proteins. (A) The figure shows a linear fit according to the equation Y = A + B*X, where Y is the number of triplet types and X is the number of architecture-types containing a given domain (N = 1748; B = 0.3673; R = 0.9156, P < 0.0001). (B) The figure shows a linear fit according to the equation Y = A + B*X, where Y is the number of triplet types containing a given domain and X is the number of domain-types co-occurring with that domain (N = 1748; B = 1.082; R = 0.9144, P < 0.0001). Class 1–1 modules showing greatest mobility (present in more than 15 triplet types) are highlighted in red.
triplet types (i.e. local architecture-types, Y) in which
a given domain is present, but the slope of the line
(B = 0.3673) indicates that a given local architecture-type
may be found in several different global architectures,
i.e. there is a uniform tendency that changes in
architecture occur at distant regions. Furthermore,
examination of the data reveals that domains (repeats)
more prone to duplication than to shuffling (LRR,
Ank, etc.) are the ones that deviate from this linear
relationship most significantly.
Similarly, there is a linear relationship (R = 0.9144,
P < 0.0001) between the number of domain types with
which a given domain co-occurs (connectivity) and the
number of triplet types in which it is present (Fig. 4B).
Nevertheless, examination of the data reveals that the
majority of mobile class 1–1 modules known to have
been shuffled by exon-shuffling [20] deviate from this
linear relationship most significantly: they
have higher triplet numbers than expected from the linear
relationship, i.e. they lie above the line calculated by linear
regression analysis (Fig. 4B). This is also reflected in
the fact that in Table 6, class 1–1 modules (e.g. CUB,
TSP_1, Ldl_recept_a) occupy more prominent positions
in the TRIPLET column than in the ARCHITECTURE
or CO-OCCURRENCE columns.
On the other hand, domains [e.g. GPS (G-protein-
coupled receptor proteolytic site domain), Fig. 3] that
are present in almost invariable local environments of
a vast variety of multidomain protein architectures are
present in much lower number of triplet types than
expected if we assume a perfect linear relation. As
illustrated in Fig. 3, the high number of domains
co-occurring with the GPS domain and the high number
of architecture-types in which it is present reflect
domain-shuffling events distant from the GPS domain
and have little to do with mobility of the GPS domain.
It thus appears that the number of local architec-
ture-types (‘triplets’) in which the given domain is pre-
sent is a more relevant parameter to reflect the
‘shuffling’ or ‘insertion’ of a mobile domain into differ-
ent environments. Ranking of domains according to
this parameter has revealed that the best known
mobile modules (EGF, PH, ig, I-set, SH3–1, fn1,
EGF_CA, CUB, TSP_1, Ldl_recept_a, sushi, etc.)
occupy most of the top 20 positions in the TRIPLET
column of Table 6.
Domain size and domain mobility
We have used the number of triplet types in which a
Pfam-A domain occurs as a measure of its mobility to
investigate whether mobility correlates with domain
size. As shown in Fig. 5 there is a significant inverse
correlation between domain size and domain mobility.
This observation is in harmony with the data shown in
Fig. 2 that also suggest that smaller domains are more
likely to be used in the construction of multidomain
proteins, whereas larger domains are more likely to be
static, stand-alone domains. There are a few notewor-
thy exceptions to the generalization that the domains
showing greatest mobility are small. One of these
exceptions is the protein kinase domain that – with an
average size of 228 amino acids – shows the second
greatest mobility in metazoan multidomain proteins
(Fig. 5 and Table 6). It seems likely that its mobility
primarily reflects the great demand for this domain in
signaling networks.
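The size/mobility comparison described above amounts to a Pearson correlation between the average size of each domain family and the number of triplet types in which it occurs. The following is a minimal sketch of that calculation, not the authors' code: the function name `pearson_r` and the five data points are hypothetical and serve only to illustrate an inverse relationship.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data (NOT taken from the paper):
# average family size in residues, and number of triplet types per family.
sizes = [40, 60, 90, 150, 300]
triplet_types = [120, 80, 50, 20, 5]

r = pearson_r(sizes, triplet_types)  # negative: larger domains, fewer contexts
```

A negative `r` corresponds to the inverse correlation reported in Fig. 5; with real data a significance test would accompany the coefficient.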
Power law distribution of domain mobility
It is evident from Fig. 5 that the majority of domains
occur in a relatively small number of local architec-
ture-types, whereas a small minority of domains serves
as versatile building blocks of multidomain proteins.
This is in agreement with recent observations that
power laws describe the distribution of domains with
respect to the number of multidomain architectures in
which they occur [5,6,8].
Fig. 5. Domain size and domain mobility. The number of domain-
triplet types in which a Pfam-A domain occurs in metazoan multi-
domain proteins is plotted as a function of the average size of the
given Pfam-A domain family (in amino acid residues). Note that
there is an inverse correlation between domain size and domain
mobility (number of pairs = 1748, Pearson r = −0.1507,
P < 0.0001).
H. Tordai et al. Mobile domains
To analyze the factors that influence domain mobil-
ity in different groups of organisms we have plotted
the number of domain-types as a function of their
‘mobility’, mobility being expressed either as the num-
ber of domain-types co-occurring with the given
domain or the number of triplet types in which a
domain occurs.
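The two mobility measures can be illustrated with a short sketch. It assumes, as a simplification, that each protein is given as an ordered list of domain names and that the ends of the chain are marked with the placeholders `Nterm` and `Cterm`; the function name `mobility_measures` and the toy architectures are hypothetical and do not come from the paper's data set.

```python
from collections import defaultdict


def mobility_measures(architectures):
    """Return two dicts keyed by domain name.

    partners[d]: set of other domain types co-occurring with d (global measure).
    triplets[d]: set of (upstream, downstream) neighbour pairs of d (local measure).
    """
    partners = defaultdict(set)
    triplets = defaultdict(set)
    for arch in architectures:
        # Assumption: chain ends are represented by placeholder labels.
        padded = ["Nterm"] + list(arch) + ["Cterm"]
        distinct = set(arch)
        for i in range(1, len(padded) - 1):
            domain = padded[i]
            partners[domain] |= distinct - {domain}
            triplets[domain].add((padded[i - 1], padded[i + 1]))
    return partners, triplets


# Invented toy architectures, used only to show how the two counts differ.
archs = [
    ["GPS", "7tm_2"],
    ["EGF", "EGF", "GPS", "7tm_2"],
    ["CUB", "EGF", "CUB"],
    ["Ldl_recept_a", "EGF", "GPS", "7tm_2"],
]

partners, triplets = mobility_measures(archs)
gps_cooccurrence = len(partners["GPS"])   # number of co-occurring domain types
gps_triplet_types = len(triplets["GPS"])  # number of distinct local contexts
```

In this toy set the GPS domain co-occurs with three domain types but sits in only two local contexts, which mirrors the point made about GPS in Fig. 3: changes elsewhere in the architecture inflate the global measure without changing the local one.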
In the case of the co-occurrence approach the data
follow straight lines in double-logarithmic plots, consistent
with power-law dependence. The distribution of
values of metazoa was found to be significantly different
from those of bacteria (P < 0.0001), archaea (P =
0.0002), plants (P < 0.0001) and fungi (P < 0.0001).
The slopes of the curves increase in the order metazoa
< bacteria < plants < archaea < fungi (Table 7A).
To test whether this is related to the fact that shuffling
of (class 1–1) modules and creation of extracellular and
transmembrane multidomain proteins was significantly
facilitated by intronic recombination in metazoa [9,21],
we have analyzed the domain co-occurrence plots for
extracellular, transmembrane, intracellular and class
1–1 multidomain proteins of metazoa separately.
The distribution of values for extracellular multidomain
proteins differed significantly from that of
intracellular multidomain proteins (P < 0.0051). The
slope for extracellular multidomain proteins is shallower
than the value for intracellular multidomain proteins
(Table 8A).
Furthermore, the values for multidomain proteins
assembled from class 1–1 modules differed significantly
from those for total metazoan multidomain proteins
(P < 0.0001). The slope for class 1–1 multidomain
proteins is shallower than the value for total metazoan
multidomain proteins (Table 8A). These observations
are consistent with the notion that intronic recombination
greatly increased the mobility of class 1–1 modules
in metazoa, and this facilitated the creation of
novel extracellular multidomain proteins of animals.
Analysis of domain mobility of metazoan proteins
with the triplet approach has also revealed that the
data follow straight lines in double-logarithmic plots.
In the case of the triplet plots the distribution of values
in metazoa is also significantly different from those
of bacteria (P < 0.0001), archaea (P < 0.0001), plants
(P < 0.0001), protozoa (P < 0.001) and fungi
(P < 0.0001). Comparison of the slopes of co-occurrence
vs. triplet plots in different groups of organisms
(Tables 7A and B) has revealed that in each case the
slopes of the triplet plots are steeper than those of
co-occurrence plots (power-law exponent γ, co-occurrence
vs. triplet: metazoa, 1.6170 vs. 2.0125;
bacteria, 1.8207 vs. 2.2851; archaea, 2.0278 vs.
2.4554; protozoa, 1.9690 vs. 2.7118; plants,
1.8616 vs. 2.5508; fungi, 2.2128 vs. 2.8692). It
seems likely that this is due to the difference between the two
approaches: the ‘global’ co-occurrence approach tends
to overestimate the mobility of domains as opposed to
the ‘local’ triplet approach (Fig. 3). Nevertheless, the
results of the two analyses are similar inasmuch as
metazoan domains display the shallowest slopes.
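The slopes compared above come from straight-line fits in double-logarithmic plots. A minimal sketch of such a fit is shown below, assuming an ordinary least-squares line through the points (log10 i, log10 P(i)); the text does not state the exact fitting procedure or any binning, so the function `fit_power_law` and its synthetic input are illustrative assumptions only.

```python
import math


def fit_power_law(ks, counts):
    """Fit counts = c * k**(-gamma) by least squares on log10-log10 values.

    Returns (gamma, c). Points with non-positive k or count are skipped
    because their logarithm is undefined.
    """
    pts = [(math.log10(k), math.log10(n))
           for k, n in zip(ks, counts) if k > 0 and n > 0]
    if len(pts) < 2:
        raise ValueError("need at least two positive data points")
    m = len(pts)
    mean_x = sum(x for x, _ in pts) / m
    mean_y = sum(y for _, y in pts) / m
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, 10 ** intercept


# Synthetic, noise-free data generated from a known exponent (not paper data).
ks = [1, 2, 4, 8, 16]
counts = [1000 * k ** -2.0 for k in ks]
gamma, c = fit_power_law(ks, counts)
```

A larger fitted exponent means a steeper line, i.e. relatively fewer highly mobile domains; the shallow metazoan slopes therefore indicate a heavier tail of versatile domains.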
It is interesting to point out that the mobility distribution
of domains of protozoa is very similar to
those of plants and fungi (Table 7B), whereas the
size-frequency distribution of protozoan multidomain
proteins is more similar to that of metazoa (Table 4).
A possible explanation for this apparent contradiction
is that lateral gene transfer of multidomain proteins
from animal hosts affects the size distribution of
the multidomain protein pool of parasitic protozoa,
but the domains thus acquired have lost their mobility
in the intron-poor genomes of protists.
In the case of the triplet plots the distribution of
values for multidomain proteins assembled from class
1–1 modules is significantly different from those of
extracellular proteins (P < 0.0001), transmembrane
multidomain proteins (P < 0.0001), intracellular multidomain
proteins (P < 0.0001) or total metazoan multidomain
proteins (P < 0.0001). The slope of the triplet
plot for multidomain proteins assembled from class 1–1
modules is shallower than that for extracellular proteins,
for transmembrane proteins, for intracellular multidomain
proteins or for total metazoan multidomain
proteins (Table 8). Comparison of co-occurrence plots
(Table 8A) and triplet plots (Table 8B) for extracellular,
intracellular, transmembrane and class 1–1 multidomain
proteins has also revealed that the order of the slopes is
similar in the two approaches: class 1–1 < extracellular <
transmembrane < intracellular multidomain proteins.
Table 7. Parameters of the linear fit of the double-logarithmic plots for $P(i) = c\,i^{-\gamma}$, where P(i) is the number of domains.
      Bacteria   Archaea    Protozoa   Plants     Fungi      Metazoa
(A) i is the number of types of domains with which they co-occur in multidomain proteins
γ     1.8207     2.0278     1.9690     1.8616     2.2128     1.6170
R     0.9755     0.9316     0.9596     0.9498     0.9688     0.9588
N     43         16         21         31         22         50
P     < 0.0001   < 0.0001   < 0.0001   < 0.0001   < 0.0001   < 0.0001
(B) i is the number of domain-triplet types (local architectures) in which they occur in multidomain proteins
γ     2.2851     2.4554     2.7118     2.5508     2.8692     2.0125
R     0.9726     0.9357     0.9852     0.9624     0.9734     0.9568
N     35         16         11         25         16         42
P     ≤ 0.0001   ≤ 0.0001   ≤ 0.0001   ≤ 0.0001   ≤ 0.0001   ≤ 0.0001
Global and local domain co-occurrence networks
Power law distributions are intimately related to the
so-called scale-free networks: networks in which the
frequency distribution of node degrees (i.e. the number
of other nodes to which a given node is connected) fol-
lows a power law. Accordingly, power law distribu-
tions are frequently analyzed and visualized through
scale-free networks.
The basis of the scale-free behavior of network evo-
lution (and power law distributions) is that the prob-
ability of a node acquiring a new connection is
proportional to the number of links that node already
has: there is a greater likelihood of nodes being added
to pre-existing hubs. For example, the fact that the
casting of actors in movies and the distribution of peo-
ple according to their wealth follow a power law is a
manifestation of ‘the rich get richer’ principle [5,6]. By
analogy, the fact that the distribution of domains
according to the frequency with which they are used to build
multidomain proteins follows a power law indicates
that the chance of a domain being used is proportional
to the number of times it has already been used.
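The ‘rich get richer’ principle can be made concrete with a small simulation. The sketch below is a generic preferential-attachment model, not a model taken from the paper: each new node links to one existing node chosen with probability proportional to that node's current number of links. The function name `preferential_attachment`, the network size and the seed are arbitrary illustrative choices.

```python
import random


def preferential_attachment(n_nodes, seed=0):
    """Grow a network one node at a time; return {node: degree}.

    Each new node attaches to a single existing node picked with
    probability proportional to that node's current degree.
    """
    if n_nodes < 2:
        raise ValueError("need at least two nodes")
    rng = random.Random(seed)
    degree = {0: 1, 1: 1}      # start from two nodes joined by one link
    endpoints = [0, 1]         # every node appears once per link it has
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)  # degree-proportional choice
        degree[new] = 1
        degree[target] += 1
        endpoints.extend([new, target])
    return degree


deg = preferential_attachment(2000, seed=1)
n_leaves = sum(1 for d in deg.values() if d == 1)
largest_hub = max(deg.values())
```

In such a run most nodes keep a single link while a few early nodes become hubs, which is the qualitative pattern the text ascribes to domains: a few versatile building blocks and a majority of rarely reused ones.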
In the present work domain co-occurrence networks
and triplet networks were used to illustrate and quantify
the mobility of domains and the complexity of multidomain
protein networks. The number of vertices, connectivities
(edges) and the size of the largest connected
component were used to characterize the complexity of
the domain networks of different groups of organisms
(Table 9). The size (the number of vertices) of the largest
connected component increases linearly (with a slope of
1.0529) with the number of total vertices (as we proceed
from prokaryotes to higher eukaryotes), with a ‘lag’ of
about 500 vertices (Fig. 6A). A possible explanation for
this phenomenon is that some ancient domains formed
ancient multidomain proteins but apparently no
longer participate in novel domain combinations,
instead remaining ‘islands’, separated from the largest
connected component of the domain network. An illustrative
example of this group is the ancient multidomain
protein RNA polymerase Rpb1, constructed from
domains RNA_pol_Rpb1–1, RNA_pol_Rpb1–2,
RNA_pol_Rpb1–3, RNA_pol_Rpb1–4 and
RNA_pol_Rpb1–5, which combine only with each other.
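The network quantities used here (vertices, and the size of the largest connected component) can be computed from a list of architectures as sketched below. This is an illustrative reconstruction under stated assumptions: two domains are linked whenever they occur in the same protein, and the function name `largest_component_size` and the single-letter domain labels are hypothetical.

```python
from collections import defaultdict, deque


def largest_component_size(architectures):
    """Return (number of vertices, size of largest connected component)
    of the domain co-occurrence network built from the architectures."""
    adjacency = defaultdict(set)
    for arch in architectures:
        domains = set(arch)
        for d in domains:
            adjacency[d] |= domains - {d}  # also registers isolated domains
    seen = set()
    best = 0
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search over one component
            node = queue.popleft()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        best = max(best, size)
    return len(adjacency), best


# Toy input: A-D form one component; X and Y combine only with each other,
# like the 'island' formed by the Rpb1 domains described in the text.
toy = [["A", "B"], ["B", "C"], ["C", "D", "A"], ["X", "Y"]]
n_vertices, lcc = largest_component_size(toy)
```

The difference `n_vertices - lcc` counts the vertices left outside the largest component, the quantity that the text relates to the ‘lag’ seen in Fig. 6A.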
The number of architecture types also increases with
the number of total vertices, and the correlation is best
described by a semilogarithmic plot (Fig. 6B), consistent
with a model in which domains combine at random.
It is noteworthy, however, that in the linear fit
to the equation Y = A + B*X the value of A suggests
that there are frozen ancient multidomain architectures,
the constituent domains of which do not participate
in the construction of novel multidomain
proteins. It appears that this is another manifestation
of what we said above in connection with the set of
vertices excluded from the largest connected component
of domain networks: some ancient domains form
ancient multidomain proteins with permanent domain
partners (and conserved architectures) but they are no
longer used in the construction of novel multidomain
architectures. A comparison of the list of domains
excluded from the largest connected components in all
organisms with the list of domains in conserved multidomain
architectures shared by all organisms has
revealed significant similarities. For example, ancient
domains/multidomain proteins (fulfilling basic functions)
such as Enolase_C and Enolase_N of enolase,
Table 8. Parameters of the linear fit of the double-logarithmic plots for $P(i) = c\,i^{-\gamma}$, where P(i) is the number of domains.
      Total      Extracellular   Transmembrane   Intracellular   Class 1–1
(A) i is the number of types of domains with which they co-occur in extracellular, transmembrane, intracellular and class 1–1 multidomain proteins of metazoa
γ     1.6170     1.3193          1.4542          1.7839          1.0984
R     0.9588     0.8879          0.9337          0.9425          0.8951
N     50         17              19              25              37
P     < 0.0001   < 0.0001        < 0.0001        < 0.0001        < 0.0001
(B) i is the number of domain-triplet types (local architectures) in which they occur in extracellular, transmembrane, intracellular and class 1–1 multidomain proteins of metazoa
γ     2.0125     1.6241          1.8397          2.3389          0.9233
R     0.9568     0.8668          0.9265          0.9638          0.9101
N     42         13              17              19              28
P     ≤ 0.0001   ≤ 0.0001        ≤ 0.0001        ≤ 0.0001        ≤ 0.0001
[...] ... Ribonuc_red_lgN and Ribonuc_red_lgC of ribonucleotide reductases are present in both groups.
Domain networks and organismic complexity
As illustrated in Figs 6, 7 and 8 and summarized in Table 9, the total number of vertices and edges, the size of the largest connected component of domain networks, and the number of architecture types all increase parallel with the evolution of higher organisms of greater organismic ...
Identification of extracellular, intracellular and transmembrane multidomain proteins of metazoa
Extracellular, transmembrane and intracellular multidomain proteins of metazoa were identified on the basis of the subcellular location information of database entries. Extracellular proteins were identified as those annotated as extracellular, secreted or plasma proteins, intracellular proteins were identified as those ...
... through multidomain transmembrane proteins such as receptor kinases, G-protein coupled receptors, etc. Comparison of domain-networks of different eukaryotes thus confirms that the evolution of increased organismic complexity in metazoa is intimately associated with the generation of novel extracellular and transmembrane multidomain proteins that mediate the interactions among their cells, tissues and organs ...
... (extracellular, intracellular and transmembrane multidomain proteins) has revealed that extracellular domains used in the construction of extracellular proteins (and extracellular parts of transmembrane proteins) of metazoa are particularly enriched in domains of greater mobility. Among the extracellular domains the so-called class 1–1 modules, i.e. domains which have been shuffled by ...
... containing extracellular domains but lacking intracellular domains and transmembrane domains. Intracellular proteins were identified as those containing intracellular domains but lacking extracellular domains and transmembrane domains. Transmembrane multidomain proteins were identified as those containing intracellular and/or extracellular domains and transmembrane domains. PfamA domains were assigned a subcellular ...
... 1–1 domains in metazoa, thereby facilitating the construction of extracellular and transmembrane multidomain proteins unique for metazoa [2,9].
Methods
Databases of multidomain proteins
Fig. 6. Correlation of the number of total vertices of domain networks with the number of vertices in LCC, the largest connected component, and with the number of architecture types. (A) The figure shows the linear fit according ...
... PfamA domain) and there was no PfamA domain within this region, then the upstream region was defined as Nterm, the downstream region was defined as Cterm. To assess the number of contexts in which a given domain (domain Di) can occur in multidomain proteins we have listed all domain triplets Du-Di-Dd, where Di is the domain analyzed and Du and Dd are the domains flanking domain Di at its N- and C-terminal ...
... increase parallel with the evolution of higher organisms of greater organismic complexity. At one extreme we find Archaea with the lowest values for the parameters reflecting the complexity of the world of multidomain proteins. Conversely, metazoa, particularly Chordates, have the highest values in all these parameters. Figures 7 and 8 also show that significant changes occurred in the structural organization ...
Gp_dh_C and Gp_dh_N of glyceraldehyde 3-phosphate dehydrogenase, Ldh_1_C and Ldh_1_N of lactate/malate dehydrogenase, FGGY_N and FGG_C of the FGGY family of carbohydrate kinases, THF_DHG_CYH and THF_DHG_CYH_C of tetrahydrofolate dehydrogenase/cyclohydrolases, RNA_pol_Rpb1–1, RNA_pol_Rpb1–2, RNA_pol_Rpb1–3, RNA_pol_Rpb1–4, RNA_pol_Rpb1–5 of RNA polymerase Rpb1, DNA_photolyase and FAD_binding_7 ...
... nuclear, mitochondrial, cytoskeletal proteins, transmembrane proteins were identified as those annotated as membrane proteins. The correct assignment of proteins to these categories was also checked by the presence or absence of annotated transmembrane domains, presence or absence of extracellular or intracellular (cytoplasmic, nuclear) PfamA domains. Extracellular proteins were identified as those containing ...
Domain size and propensity to form multidomain
proteins
By plotting the size of multidomain proteins as a