A hybrid clustering of protein binding sites
Gábor Iván 1,2, Zoltán Szabadka 1,2 and Vince Grolmusz 1,2
1 Protein Information Technology Group, Department of Computer Science, Eötvös University, Budapest, Hungary
2 Uratim Ltd., Budapest, Hungary
Introduction
In recent years, the exploration of the human gen-
ome has received wide publicity. Although somewhat
less emphasized, another important bioinformatics
resource is the exponentially growing, publicly
available Protein Data Bank (PDB) [1], containing
more than 55 000 biological structures at the present
time.
The three-dimensional structures of small molecules,
e.g. drug molecules, can usually be calculated from
their chemical composition. Several databases exist
that contain millions of ligands. An example of this is
the freely available ZINC database [2] created from
catalogues of compound manufacturers. Contrary to
ligands, the three-dimensional structures of proteins
cannot be calculated easily; therefore, the importance of the
rapid growth of the PDB cannot be overestimated.
Most antimicrobial drug molecules act as enzyme
inhibitors. Inhibitors need to bind more strongly to the
enzyme than to the substrate of the enzyme; conse-
quently, the chemical and geometrical properties of the
binding sites are of utmost importance in drug discov-
ery and design.
The PDB contains the three-dimensional structures
of more than 55 000 entries. In a separate study [3],
we collected, verified and cleaned the list of approximately
27 000 binding sites found in the PDB.
During the process of the identification of these
binding sites, we filtered out crystallization artifacts
and covalently bound small molecules, and also con-
sidered broken peptide chains, modified amino acids
and incorrectly labeled HET groups. The resulting
cleaned, strictly structured RS-PDB database [3]
can serve as an input for different data mining
algorithms. One such technique of classification is
clustering. By the clustering of binding sites it is
possible to create binding site similarity classes.
These classes can be useful for the classification of
protein–ligand interaction.
Keywords
binding sites; clustering; distance; OPTICS;
PDB; sequence
Correspondence
V. Grolmusz, Protein Information Technology Group, Department of Computer Science, Eötvös University, Pázmány Péter stny. 1/C, H-1117 Budapest, Hungary and Uratim Ltd., H-1118 Budapest, Hungary
Fax: +36 1 381 2231
Tel: +36 1 381 2226
E-mail: grolmusz@cs.elte.hu
(Received 6 August 2009, revised 7 January
2010, accepted 12 January 2010)
doi:10.1111/j.1742-4658.2010.07578.x
The Protein Data Bank contains the description of approximately 27 000 protein–ligand binding sites. Most of the ligands at these sites are biologically active small molecules, affecting the biological function of the protein. The classification of their binding sites may lead to relevant results in drug discovery and design. Clusters of similar binding sites were created here by a hybrid, sequence and spatial structure-based approach, using the OPTICS clustering algorithm. A dissimilarity measure was defined: a distance function on the amino acid sequences of the binding sites. All the binding sites were clustered in the Protein Data Bank according to this distance function, and it was found that the clusters characterized well the Enzyme Commission numbers of the entries. The results, carefully color coded by the Enzyme Commission numbers of the proteins, containing the 20 967 binding sites clustered, are available as html files in three parts at http://pitgroup.org/seqclust/.
Abbreviations
EC, Enzyme Commission; gp, gap penalty; OPTICS, Ordering Points to Identify the Clustering Structure; PDB, Protein Data Bank.
1494 FEBS Journal 277 (2010) 1494–1502 © 2010 The Authors Journal compilation © 2010 FEBS
In this article, we present a fast, sequence-based
method for binding site clustering that takes into
account amino acid sequences in the close neighborhood
of binding sites. Our method is a hybrid, in the
sense that it uses the sequence information together
with steric data from the PDB in a clearly structured
manner.
Previous work
There is a very rich literature describing the identifi-
cation techniques for biological functions from struc-
tural protein information by the application of highly
nontrivial mathematical tools [4,5]. Some of these
tools have been applied to determine or analyze
protein–protein interaction network topology [6–10]
or binding sites [6,11]. A considerable amount of
work has also been performed to devise polypeptide
sequence-order independent structural properties
[12–14]. Unlike other binding site clustering solutions in the literature [15–18], we used a hybrid of order-independent methods that analyze the three-dimensional structure of the binding site together with an order-analysis method; one of its main features is that our order-analysis method is capable of handling multiple polypeptide chains in the same binding site (Fig. 1).
Results and Discussion
Our main result was the OPTICS (Ordering Points
to Identify the Clustering Structure)-based clustering
of the 20 967 binding sites found. In order to verify
the capabilities of the clustering method, we need to
compare the clusters found with verified biological
functions.
Verification of results: biological relevance
Ideally, proteins with the same or closely related functions ought to be assigned to the same cluster. We considered the Enzyme Commission (EC) number classification of enzymes [19], and color coded the EC numbers such that closely related functions were given similar colors, as provided in http://pitgroup.org/seqclust/bsites_AAcodes/EC_colour.html.
The color-coded clusters, together with the ordinal number of the binding site, the PDB ID, the cluster ID and the EC number can be found in three large html files (Page1, Page2, Page3) under http://pitgroup.org/seqclust/. The clusters correspond to concave regions in the figure.
The deviations of the EC numbers in all the clusters were also computed, and are given in the online table http://pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt. In most of the clusters, the deviation is zero; the average deviation is 1.71%.
We believe that the validation of the enzymatic func-
tions through EC numbers shows that our clustering
method is an adequate solution for binding site cluster-
ing and classification.
Parameter settings and examples
We present here, as examples, four binding sites from the largest cluster (element count: 448) (see Fig. 2). All four proteins are blood clotting factors. The whole cluster is given in the online figure http://pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html. It should be noted that the whole cluster is colored blue, and all the members of the cluster (between line numbers 702 and 1149; cluster ID: 28) have EC numbers of the form 3.4.21.X (serine proteases).
From the second largest cluster (element count: 188), three binding sites were visualized (Fig. 3). The whole cluster is given in the online figure http://pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html. It should be noted that the whole cluster is colored deep violet, and almost all members of the cluster (between line numbers 1224 and 1411) have EC numbers 3.4.23.16 (HIV-1 retropepsins). More detailed analysis of the homogeneity of the clusters is given in http://pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt.
Fig. 1. A binding site with four protein chains (PDB ID: 1CT8). Each chain is colored differently.
Fig. 2. Four binding sites (PDB IDs: 1ZPB, 1RXP, 1C5Z, 2BZ6) from the same cluster. The whole cluster is given in the online figure http://pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html. Note that the whole cluster is colored blue, and all the members of the cluster (between line numbers 702 and 1149; cluster ID: 28) have EC numbers of the form 3.4.21.X (serine proteases). More analysis on the homogeneity of the clusters is given in http://pitgroup.org/seqclust/EC_deviation.txt.
Fig. 3. Three binding sites from the same cluster (one site from PDB ID 1BDL and two sites from PDB ID 1W5V); these are HIV-1 proteases. The whole cluster is given in the online figure http://www.pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html. Note that the whole cluster is colored deep violet, and almost all the members of the cluster (between line numbers 1210 and 1435) have EC numbers of the form 3.4.23.16 (HIV-1 retropepsins). More analysis on the homogeneity of the clusters is given in http://www.pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt.
Clustering quality measurement
The quality of clustering depends on several parameters. These include the distance function used to determine the similarity or distance of the objects and parameters of the clustering algorithm. In order to obtain appropriate feedback about the quality of clustering with a given parameter setting, quality metrics need to be defined. For this purpose, we used the ‘silhouette coefficient’ [20]. The advantage of the silhouette coefficient is that it is completely independent of the type of data being clustered; it uses only object distances and cluster membership assignments for its determination. Basically, the silhouette coefficient measures how distinct the clusters are: the ‘silhouette value’ of a cluster is the smallest possible distance between an element of this cluster and an element of the neighboring clusters. The silhouette coefficient of the overall clustering is the average of the silhouette values for the individual clusters. More exactly, the silhouette coefficient is defined as the average of the silhouettes taken for all the objects; for example, the silhouette of object $i$ is defined as $(b_i - a_i)/\max(a_i, b_i)$, where $a_i$ is the average distance of object $i$ to the points of its cluster, and $b_i$ is the minimum of the average distances of object $i$ to other clusters. It should be noted that, typically, $a_i < b_i$, and so the silhouette is equal to $1 - (a_i/b_i)$. Clearly, for good clustering, the typical $a_i$ value is much less than $b_i$; therefore, the silhouettes of the objects and the silhouette coefficient are close to unity.
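The per-object definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function names, the list-of-lists distance matrix and the convention of assigning a silhouette of 0 to an object that is alone in its cluster are all assumptions.

```python
from statistics import mean


def silhouette_values(dist, labels):
    """Silhouette (b_i - a_i) / max(a_i, b_i) of every object.

    dist   -- symmetric matrix (list of lists) of pairwise distances
    labels -- cluster ID of each object, in the same order as dist
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    values = []
    for i, own in enumerate(labels):
        same = [j for j in clusters[own] if j != i]
        if not same or len(clusters) < 2:
            # Assumed convention: a_i or b_i is undefined here, so use 0.
            values.append(0.0)
            continue
        # a_i: average distance to the other members of the object's cluster.
        a_i = mean(dist[i][j] for j in same)
        # b_i: smallest average distance to the members of any other cluster.
        b_i = min(mean(dist[i][j] for j in members)
                  for label, members in clusters.items() if label != own)
        denom = max(a_i, b_i)
        values.append((b_i - a_i) / denom if denom > 0 else 0.0)
    return values


def silhouette_coefficient(dist, labels):
    """Average of the silhouettes taken over all objects."""
    return mean(silhouette_values(dist, labels))


# Toy data: four points on a line forming two tight, well-separated groups.
points = [0.0, 1.0, 10.0, 11.0]
toy_dist = [[abs(p - q) for q in points] for p in points]
toy_coefficient = silhouette_coefficient(toy_dist, [0, 0, 1, 1])
```

For the toy data every object has $a_i < b_i$, so each silhouette reduces to $1 - (a_i/b_i)$ and the coefficient is close to unity.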
The data contained in Table 1 are based on empiri-
cal measurements. The values of the silhouette coeffi-
cient are strongly dependent on the applied distance
function. Therefore, it is questionable whether clusters
can be classified into rigid quality categories on the
basis of the silhouette coefficient value. However, it is
undoubtedly useful for comparing the quality of the
clusters.
The silhouette coefficient requires the clustering
algorithm to assign each binding site to a cluster by
definition. Thus, the silhouette coefficient value also
shows the amount of noise contained in the database.
The clustering algorithm used in this study is the OPTICS algorithm (see later). This algorithm allows some binding sites to be marked as ‘noise’ (thus not assigning them to any cluster). It does not seem reasonable for binding sites that are ‘noise’ to be taken into account twice (once, as the OPTICS algorithm marks them, and once during the calculation of the silhouette coefficient). Therefore, binding sites marked as ‘noise’ were not taken into account when calculating the silhouette coefficient. Nevertheless, for completeness, we show (Fig. 4) how the value of the silhouette coefficient would change if binding sites marked as ‘noise’ were taken into consideration with a silhouette = 0 value.
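The two ways of treating ‘noise’ can be contrasted with a short sketch. It assumes that per-object silhouettes have already been computed and that unclustered binding sites carry the label -1; both the label value and the function names are hypothetical.

```python
NOISE = -1  # hypothetical label for binding sites left unclustered


def coefficient_excluding_noise(silhouettes, labels):
    """Average silhouette over clustered objects only."""
    kept = [s for s, lab in zip(silhouettes, labels) if lab != NOISE]
    return sum(kept) / len(kept) if kept else 0.0


def coefficient_noise_as_zero(silhouettes, labels):
    """Average silhouette when every noise object contributes a silhouette of 0."""
    adjusted = [0.0 if lab == NOISE else s
                for s, lab in zip(silhouettes, labels)]
    return sum(adjusted) / len(adjusted) if adjusted else 0.0
```

Whatever value is stored for a noise object is ignored by both functions, so the second variant can only lower the coefficient relative to the first when the clustered silhouettes are positive.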
Effects of parameters on the quality of clustering
and cluster size distribution
Within our binding site model, the distance function and clustering algorithm, three main parameters affected the properties of clustering: OPTICS MINPTS, OPTICS cut-off level and gap penalty (gp) of the distance function. We examined how these parameters affected the quality of clustering measured by the silhouette coefficient. The results are given in Figs 4 and 5.
• Effect of gp. Increasing gp improved slightly the quality of clustering. This is understandable if we consider that the introduction of a less strict gp function automatically decreases the average distance between the clusters.
• Effect of MINPTS. On increasing MINPTS, two main effects were observed. An increase in MINPTS yields better quality clustering. However, it also yields a lot more binding sites classified as ‘noise’. The main cause of the latter effect is that the clusters that exist in the database, but contain fewer points than MINPTS, are not recognized; they are marked as ‘noise’. On the basis of this observation, it can be stated that our binding site database contains numerous small clusters.
• Effect of OPTICS cut-off level. Increasing the cut-off level decreases the quality of clustering, and also the number of binding sites marked as ‘noise’. The application of an extremely high cut-off level places almost all binding sites into the same cluster; the quality of such clustering can by no means be considered as high.
In conclusion, low MINPTS and low cut-off levels yield the best clustering quality (whilst covering 70–80% of the binding sites found in the PDB). In Figs 4 and 5, we represent the dependence of clustering quality on these parameters.
Methods
Binding site representation
As a first step, an exact definition of a binding site must be
provided. For easy algorithmic handling, we stored the
binding sites found in the PDB in a compact data structure.
The definition of binding sites
A binding site is defined as a set of atom pairs; the first
atom of the pair belongs to the protein, and the second
atom to the bound ligand, such that their distance is equal
to the sum of the van der Waals’ radii, calculated differ-
ently for different atom types. That is, only pairs within
noncovalent binding distances are included in the list. Bind-
ing sites containing covalently bound ligands are not con-
sidered in this work, as our main motivation was to review
pharmacologically significant binding sites.
A ‘binding amino acid (or residue)’ is an amino acid with at least one of its atoms in the binding atom pair. A ‘binding amino acid sequence’ is an amino acid sequence that contains at least one binding amino acid. Basically, binding sites are represented by storing all the binding amino acid sequences of all the protein chains that are present at the particular binding site.
Table 1. Cluster quality descriptions based on silhouette coefficient values in [20].
Silhouette coefficient    Clustering quality
0.00–0.25                 Clusters cannot be adequately identified; cluster borders are not obvious
0.25–0.50                 Clusters can be identified, but there are numerous unclassifiable points (‘noise’)
0.50–0.70                 Most of the data/points can be classified
0.70–1.00                 Excellent distinguishable clusters
Binding sites were extracted from the RS-PDB database described in [21] and [3]. By using this definition for binding sites, all amino acids from a given amino acid sequence that have at least one atom contained in an atom pair set (describing a binding site) can be identified.
Residue sequence representation
An amino acid sequence refers to sequences consisting of
amino acids connected by peptide bonds that are of maxi-
mal length (i.e. they cannot be continued with further
amino acids on either end).
It should be noted that multiple amino acid sequences might occur in the immediate vicinity of a single binding site, making binding site distance/similarity determination fairly complicated. An example of a binding site with four neighboring polypeptide chains can be seen in Fig. 1.
Binding amino acid sequences were first extracted from the binding sites of the RS-PDB database [3,21] and then simplified as follows.
A string was assigned to each amino acid sequence in a binding site. In this string, residues participating in the bond were indicated by their one-character code; nonbinding amino acids were indicated by ‘-’. As our purpose was to deal with only the binding sections, the pre- and postfixes consisting of purely nonbinding amino acids (or, in our notation, ‘-’) were deleted. Hence, all the strings constructed in this way start and end with a binding amino acid.
A binding amino acid sequence constructed and trans-
formed in this way (from PDB entry 2BZ6) is as follows:
H TT–D P DSCK S VSWGQGC G.
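The string construction described above can be sketched as follows. The function name and the input format (a one-letter residue string plus a parallel list of binding flags) are assumptions made for illustration; the study's own data structures are not given in the text.

```python
def binding_string(residues, is_binding):
    """Encode one amino acid sequence of a binding site.

    residues   -- one-letter codes of the whole chain, e.g. "MKHAATTG"
    is_binding -- parallel booleans, True where the residue binds the ligand
    """
    # Keep the one-letter code of binding residues, mark the rest with '-'.
    marked = "".join(code if flag else "-"
                     for code, flag in zip(residues, is_binding))
    # Delete the purely nonbinding prefix and postfix.
    return marked.strip("-")


# Hypothetical eight-residue chain with three binding residues.
example = binding_string(
    "MKHAATTG",
    [False, False, True, False, False, True, True, False],
)
```

By construction the result either is empty (no binding residue in the chain) or starts and ends with a binding amino acid.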
Distance function
In order to use a clustering algorithm, we need to define a distance function. The binding sites are represented by all amino acid sequences that participate in the bond with the ligand. Consequently, we need to define the distance of the sequence sets situated in the binding sites. This is accomplished first by defining the distance of two sequences (described in the next section), and then by defining the distance of the sequence sets. The reason for this complexity is the fact that more than one binding sequence can be present in a binding site (see Fig. 1).
Sequence comparison algorithm
To measure the distances of the binding sections of amino
acid sequences constructed in this way, we used a modified
version of the algorithm employed to calculate the Levensh-
tein distance (denoted as L). The modifications involved the
assignment of different costs to gaps depending on where
they were inserted, whereas amino acid mismatches were
simply penalized by the value unity.
Fig. 4. Silhouette coefficient dependence on parameter MINPTS when unclustered binding sites are also taken into account at silhouette coefficient determination (gp = 1/10). The color coding is given in Table 2.
Fig. 5. Number of binding sites contained in clusters as a function of the number of clusters allowed to be used (gp = 1/10). The color coding is given in Table 2.
The costs of aligned binding and nonbinding amino acids were as follows:
• The cost of two aligned, different amino acids is unity.
• The cost of aligned, matching amino acids is zero.
Gaps were penalized as follows:
• The insertion of a gap with a length of one unit (one amino acid) costs gp if the gap is aligned with a nonbinding amino acid in the other sequence. If a gap is aligned with a binding amino acid, its cost is unity.
• The insertion of gaps at the end of sequences is only penalized if they are aligned with binding amino acids. Gaps inserted at either end of a sequence have a zero cost if they are aligned with nonbinding amino acids.
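One way to realize the cost rules listed above is a standard dynamic-programming alignment in which the gap cost depends on the skipped character and on where the gap falls in the other sequence. The sketch below is an interpretation of those rules, not the authors' implementation; in particular, treating a gap as ‘at the end’ when it lies before the first or after the last character of the other string is an assumption.

```python
def modified_levenshtein(s, t, gp=0.1):
    """Distance of two binding strings over one-letter codes and '-'."""
    n, m = len(s), len(t)

    def gap_cost(ch, other_pos, other_len):
        # A gap aligned with a binding amino acid always costs 1.
        if ch != "-":
            return 1.0
        # A gap aligned with a nonbinding amino acid is free at either end
        # of the other sequence and costs gp in its interior.
        return 0.0 if other_pos in (0, other_len) else gp

    # prev[j] holds the cost of aligning s[:i-1] with t[:j].
    prev = [0.0] * (m + 1)
    for j in range(1, m + 1):
        prev[j] = prev[j - 1] + gap_cost(t[j - 1], 0, n)

    for i in range(1, n + 1):
        cur = [prev[0] + gap_cost(s[i - 1], 0, m)] + [0.0] * m
        for j in range(1, m + 1):
            # Aligned characters: 0 if they match, 1 otherwise.
            substitute = prev[j - 1] + (0.0 if s[i - 1] == t[j - 1] else 1.0)
            # s[i-1] aligned with a gap placed after t[:j].
            skip_s = prev[j] + gap_cost(s[i - 1], j, m)
            # t[j-1] aligned with a gap placed after s[:i].
            skip_t = cur[j - 1] + gap_cost(t[j - 1], i, n)
            cur[j] = min(substitute, skip_s, skip_t)
        prev = cur
    return prev[m]
```

Under this reading, the distance of any binding string to the empty string equals its number of binding amino acids, because every ‘-’ then counts as an end gap.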
It can be shown that the Levenshtein distance (and
also our modified version) fulfills the required properties
for being a metric. Non-negativity and symmetry can be
seen directly from the definition (assuming non-negative
costs). It is also obvious that a zero distance can only be achieved by comparing the same objects: $L(x,y) = 0$ if, and only if, $x = y$ (assuming that every compared sequence starts and ends with a binding amino acid). What is left to prove is the triangle inequality: for every s, t, r strings (binding amino acid sequences), $L(s,t) \le L(s,r) + L(r,t)$.
In other words, the triangle inequality asserts that
changing s to t via r cannot cost less than changing s to
t directly. As the Levenshtein distance (by definition) is
the minimum possible total cost of operations transform-
ing s into t, and the sequence of operations that trans-
form s into r and then r into t is also an allowed
sequence of operations, it cannot have a lower total cost
than L(s,t), as this would contradict the optimality of
L(s,t). (What we may need to prove at this point is that
the algorithm used indeed calculates the defined distance
– L.) This reasoning is also applicable to our modified
version of the Levenshtein distance; the only difference is
that we have a somewhat more sophisticated set of costs
for the insertion, deletion and changing of the characters.
We assume that the costs are non-negative, and any
binding amino acid sequence compared with our distance
function starts and ends with a binding amino acid. We
can now reformulate the above defined costs to be used
with ‘insert’, ‘delete’, ‘change’ operations.
Costs for insertion
• Insertion of ‘-’ to the end of the sequence: 0.
• Insertion of ‘-’ between the first and last binding amino acids of the sequence: gp.
• Insertion of a one-letter code of a binding amino acid: 1.
Costs for deletion
• Deletion of ‘-’ from the end of the sequence: 0.
• Deletion of ‘-’ between the first and last unchanged binding amino acids of the sequence: gp.
• Deletion of a one-letter code of a binding amino acid: 1.
Costs for character change
• For matching characters: 0.
• For nonmatching characters: 1.
If we want to transform a binding amino acid sequence s
into t using the above operations, we cannot expect to
obtain a lower total cost by first transforming s to an arbi-
trary r and then r to t (compared with the direct transfor-
mation of s to t). This means that the triangle inequality
holds.
Binding site comparison method
The input of the distance function described above is two
strings that represent amino acid sequences extracted
from binding sites. However, our aim is to measure the
distance of the binding sites, not just single amino acid
sequences. We have seen in Fig. 1 (section ‘Previous work’) that multiple amino acid sequences might occur in the immediate vicinity of a binding site. Therefore, we also need to define the distance of the sequence sets representing binding sites.
For this purpose, a complete bipartite graph is defined. This is a graph in which the set of vertices can be divided into two disjoint sets, A and B, such that no edge has both of its endpoints in the same set, |A| = |B| and the number of edges is always |A|·|B|.
• Points of the vertex sets A and B correspond to the amino acid sequences of the first and second binding sites, respectively. If the numbers of amino acid sequences are not equal in the two binding sites, amino acid sequences with zero length are added to the smaller set.
• Weights are assigned to all edges of this graph that correspond to the distance of the two amino acid sequences connected by the edge. By ‘distance’, we mean the distance defined in the previous section.
The distance of the sequence sets A and B is then defined
as the minimum weight perfect matching [22] in the graph
defined above.
It should be noted that, by the definition of the previous
section, the distance of an arbitrary residue sequence A to a
zero-length sequence B is the binding amino acid count of
sequence A.
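The matching step can be illustrated with a small sketch. Because a binding site usually involves only a handful of chains, the example simply tries every pairing; this exhaustive search is a stand-in chosen for brevity (a dedicated assignment-problem solver would be the natural replacement for larger sets), and `toy_dist` is a deliberately crude placeholder for the sequence distance of the previous section.

```python
from collections import Counter
from itertools import permutations


def site_distance(site_a, site_b, seq_dist):
    """Minimum total weight of a perfect matching between two sequence sets."""
    a, b = list(site_a), list(site_b)
    size = max(len(a), len(b))
    # Pad the smaller set with zero-length sequences.
    a += [""] * (size - len(a))
    b += [""] * (size - len(b))
    # Exhaustive search over all pairings (factorial time, small sets only).
    return min(sum(seq_dist(x, y) for x, y in zip(a, perm))
               for perm in permutations(b))


def toy_dist(x, y):
    """Placeholder distance: binding letters not shared by the two strings."""
    cx, cy = Counter(x.replace("-", "")), Counter(y.replace("-", ""))
    return sum(((cx - cy) + (cy - cx)).values())


# Two hypothetical sites with two and three binding sequences.
d = site_distance(["HT", "DS"], ["DS", "HT", "G"], toy_dist)
```

Here the two shared sequences pair up at zero cost and the extra sequence is matched with a zero-length one, so only its single binding residue contributes to the distance.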
Binding site distance normalization
The expected distance of two randomly generated binding
sites will be proportional to the sum of the binding amino
acids occurring at the binding sites. The maximum achiev-
able distance is always less than the sum of the binding
amino acids.
The distance of two binding sites calculated using the function described in the previous section does not describe the binding site dissimilarity alone. If the distance of two binding sites is three, it may be that they have three binding amino acids each, and hence they may be completely different. However, a distance of three between two binding sites with 30 binding residues each is approximately a 10% difference, and so these binding sites might be almost the same.
Therefore, it is necessary to ‘normalize’ the distances. We did this by dividing all distances by the sum of the binding amino acids of the two binding sites being compared. The result of this operation yields a value between zero and unity that can also be interpreted as a percentage of the absolute maximum possible distance of the two binding sites.
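The normalization rule stated above amounts to a single division. In the sketch below, a site is assumed to be a list of binding strings in which ‘-’ marks nonbinding residues; the function names are hypothetical.

```python
def binding_count(site):
    """Number of binding amino acids in a site (list of binding strings)."""
    return sum(1 for seq in site for ch in seq if ch != "-")


def normalized_distance(raw_distance, site_a, site_b):
    """Raw site distance divided by the summed binding amino acid counts."""
    total = binding_count(site_a) + binding_count(site_b)
    return raw_distance / total if total else 0.0


# The same raw distance of three on small and on large sites.
small = normalized_distance(3, ["HTT"], ["DSC"])
large = normalized_distance(3, ["H" * 30], ["H" * 30])
```

With this rule, a raw distance of three weighs ten times more for two sites of three binding residues each than for two sites of 30 binding residues each.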
Clustering algorithm
For data clustering, we wanted to use an algorithm that
was not biased towards even-sized and regular-shaped
clusters.
One algorithm with these properties is DBSCAN [23], which is a density-based algorithm. The density of objects is defined with a radius-like ε parameter and an object-count lower limit (MINPTS): a neighborhood of a certain object ‘o’ is considered to be dense if there exist at least MINPTS objects within a distance of less than ε. Therefore, MINPTS and ε are input parameters of the algorithm.
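The density condition can be written as a one-line test. The sketch assumes that the object itself counts towards its own neighborhood when it is part of the database, which is a common convention but is not stated in the text; `eps` stands for the radius-like parameter and all names are illustrative.

```python
def is_dense(o, objects, dist, eps, minpts):
    """True if at least minpts objects lie within a distance of less than eps of o."""
    # o is counted as well when it is contained in objects (assumption).
    neighbours = sum(1 for p in objects if dist(o, p) < eps)
    return neighbours >= minpts


# Toy one-dimensional database with an absolute-difference distance.
database = [0.0, 0.1, 0.15, 5.0]


def abs_diff(p, q):
    return abs(p - q)
```

In the toy database, the object at 0.0 has a dense neighborhood for eps = 0.2 and MINPTS = 3, whereas the isolated object at 5.0 does not even reach MINPTS = 2.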
Unfortunately, the clustering structure of many real data-
sets cannot be characterized by global density parameters,
as quite different local densities may exist in different areas
of the data space. The OPTICS algorithm [24] overcomes
these difficulties by ordering the objects contained in the
database, creating a so-called ‘reachability plot’. The reachability plot is a very clever visualization of high-dimensional clusters. It is basically generated by assigning a value, called the ‘reachability distance’, to all the objects of the database, whilst going through the database points in a specific order. The reachability distance is given on the y axis, and the objects (i.e. binding site representations) are numbered on the x axis. Clusters correspond to concave regions in the plot. After the creation of the reachability plot, cluster membership assignments can be created by cutting the reachability plot with a horizontal line referred to as the ‘cut-off level’.
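Cutting a reachability plot with a horizontal line can be sketched as follows. This is a simplified reading of the procedure: it uses only the reachability distances in OPTICS order, ignores core distances, and labels unclustered objects with -1; none of these details are taken from the study.

```python
import math

NOISE = -1  # hypothetical label for objects left outside every cluster


def cut_reachability(reachability, cutoff):
    """Assign cluster IDs by cutting a reachability plot at a fixed level.

    reachability -- reachability distances listed in the OPTICS ordering
    cutoff       -- height of the horizontal cut-off line
    """
    labels = [NOISE] * len(reachability)
    current = -1
    for i, r in enumerate(reachability):
        if i > 0 and r < cutoff:
            if labels[i - 1] == NOISE:
                # The preceding object opens a new valley and thus a cluster.
                current += 1
                labels[i - 1] = current
            labels[i] = labels[i - 1]
    return labels


# Two valleys separated by a peak, followed by two isolated objects.
plot = [math.inf, 0.1, 0.1, 0.5, 0.1, 0.1, 0.1, 0.6, 0.7]
labels = cut_reachability(plot, cutoff=0.2)
```

Raising the cut-off merges neighboring valleys into larger clusters and leaves fewer objects unassigned, which matches the qualitative behavior described for the cut-off level.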
The reachability plot of a small database consisting of
binding sites that contain NAD as the ligand is shown in
Fig. 6.
Database parameters and further
settings used in the OPTICS algorithm
The parameters used for clustering were as follows:
OPTICS MINPTS, 2; OPTICS cut-off level, 20%; gp, 1 ⁄ 10.
Table 2. Colors assigned to different OPTICS cut-off levels.
Color      Cut-off level (%)
Red        20
Green      30
Blue       40
Cyan       50
Magenta    60
Yellow     70
Fig. 6. OPTICS reachability plot of a database consisting of 800 binding sites.
The OPTICS algorithm was run on a database consisting of 20 967 binding sites. Indistinguishable binding sites, which were assigned exactly to the same binding amino acid sequence sets and ligand identifiers, were contained only once. (The original database without this kind of redundancy filtering consisted of 27 208 binding sites.) The distance of the binding sites was measured with the distance function described above.
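The redundancy filtering mentioned above can be sketched as a keyed de-duplication. The record layout (a list of binding strings plus a ligand identifier) and the function name are assumptions; the order of the sequences within a site is taken to be irrelevant.

```python
def deduplicate(sites):
    """Keep one binding site per (binding sequence set, ligand identifier) pair.

    sites -- iterable of (sequences, ligand_id) records
    """
    seen = set()
    unique = []
    for sequences, ligand_id in sites:
        # Sorting makes the key independent of the order of the sequences.
        key = (tuple(sorted(sequences)), ligand_id)
        if key not in seen:
            seen.add(key)
            unique.append((sequences, ligand_id))
    return unique


# Hypothetical records: the first two differ only in sequence order.
records = [
    (["HT", "DS"], "NAD"),
    (["DS", "HT"], "NAD"),
    (["HT", "DS"], "ATP"),
]
filtered = deduplicate(records)
```

Only the first of the two NAD records survives, while the ATP record is kept because its ligand identifier differs.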
Using labeling encoding binding types
Following the suggestion of an anonymous referee, we
modified the labeling of the bond residues as follows: using
the approach first described in [25], we replaced each amino
acid’s one-letter abbreviation with one of the following five
characters (‘A’, ‘D’, ‘H’, ‘C’, ‘P’) depending on the assumed
type of interaction between the given amino acid and the
ligand. As several atoms of an amino acid can be located
within the ‘binding distance’ (defined to be more than 1.25
times the sum of covalent radii belonging to the protein
and ligand atoms, respectively, but < 1.05 times the sum
of the van der Waals’ radii belonging to these atoms) for a
given amino acid, we only considered its closest atom to
the ligand. Five types of interaction were used: ‘hydrogen-
bond acceptor’ (denoted by ‘A’); ‘hydrogen-bond donor’
(denoted by ‘D’); ‘mixed hydrogen-bond donor ⁄ acceptor’
(denoted by ‘H’, e.g. hydroxyl groups or side-chain nitrogen
atoms in histidine); hydrophobic aliphatic interaction
(denoted by ‘C’); and aromatic (denoted by ‘P’); all are
described in [25].
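The ‘binding distance’ window and the closest-atom rule can be sketched as below. The radii passed in the example are placeholder numbers chosen only to exercise the code, not reference values, and the `(name, distance)` tuple layout for residue atoms is hypothetical.

```python
def within_binding_distance(d, cov_protein, cov_ligand, vdw_protein, vdw_ligand):
    """True if d exceeds 1.25 times the summed covalent radii and is below
    1.05 times the summed van der Waals radii of the two atoms."""
    lower = 1.25 * (cov_protein + cov_ligand)
    upper = 1.05 * (vdw_protein + vdw_ligand)
    return lower < d < upper


def closest_atom(residue_atoms):
    """Pick the atom of a residue that lies closest to the ligand.

    residue_atoms -- list of (atom_name, distance_to_ligand) tuples
    """
    return min(residue_atoms, key=lambda atom: atom[1])


# Placeholder radii (in angstroms) for one protein atom and one ligand atom.
radii = dict(cov_protein=0.75, cov_ligand=0.75, vdw_protein=1.7, vdw_ligand=1.55)
noncovalent = within_binding_distance(3.0, **radii)
covalent = within_binding_distance(1.5, **radii)
nearest = closest_atom([("CA", 4.1), ("OG", 2.9), ("CB", 3.6)])
```

With these placeholder radii the window runs from 1.875 to 3.4125, so a separation of 3.0 qualifies while 1.5 is rejected as covalent; the interaction label (‘A’, ‘D’, ‘H’, ‘C’ or ‘P’) would then be derived from the single closest atom.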
Using this labeling, we applied the OPTICS algorithm,
exactly as described above. The resulting clusters are given in
the second set of online supporting figures at http://pitgroup.
org/seqclust, in four html files, together with a statistical
analysis.
It is easy to see that, for the large clusters, the amino
acid labeling gives better results.
Conclusions
In this article, we have presented a fast, sequence-based
method capable of classifying the binding sites
contained in the publicly available PDB. We determined
the parameter settings yielding a classification with the
best quality (measured by the silhouette coefficient).
Our main result was a sequence-based approach, derived
from three-dimensional structures, used for binding site
clustering (rather than three-dimensional binding site
structure), that allows multiple sequences to occur at
each binding site. We also evaluated our clustering
results with a large, colored diagram (given at the URL
http://pitgroup.org/seqclust), where the colors corre-
spond to the EC numbers of the proteins containing the
binding sites. As witnessed by the colored diagram, and
also by the numerical deviations given in http://
pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt,
our method has a clear-cut biological significance. The
method presented in this work may help to reveal evolutionarily
related binding sites, and may also be used to
filter redundancies (i.e. multiply occurring binding sites)
from the PDB. A possible step for further research could
be the creation of aggregate sequence set profiles for
each binding site cluster, generating binding site families
similar to the Protein Families Database [26,27].
Acknowledgements
This work was supported by Hungarian Scientific
Research Fund (NK-67867, CNK-77780), and by the
Hungarian National Office for Research and Technol-
ogy (OMFB-01295 ⁄ 2006 and OM-00219 ⁄ 2007).
References
1 Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat
TN, Weissig H, Shindyalov IN & Bourne PE (2000)
The Protein Data Bank. Nucleic Acids Res 28,
235–242.
2 Irwin JJ & Shoichet BK (2005). A free database of
commercially available compounds for virtual screening.
J Chem Inf Comput Sci 45, 177–182.
3 Szabadka Z & Grolmusz V (2006). Building a struc-
tured PDB: the RS-PDB database. In: Proceedings of
the 28th IEEE EMBS Annual International Conference,
New York, NY, August 30–September 3, 2006,
pp. 5755–5758. IEEE Press, New York, NY.
4 Artamonova II, Frishman G, Gelfand MS & Frishman
D (2005) Mining sequence annotation databanks for
association patterns. Bioinformatics 21, iii49–iii57.
5 Gunasekaran K, Ma B & Nussinov R (2004) Is
allostery an intrinsic property of all dynamic proteins?
Proteins 57, 433–443.
Fig. 7. A representative of cluster 85 in the online table http://www.pitgroup.org/seqclust/bsites_pseudocenters/bsites_optics_M04_No001.html. Cluster 85 contains PDB entries 3B9J, 1FFU, 1JRP, 1T3Q, 2E3T, 1JRO, 1RM6, 1WY6, 1N5X; all of these contain an Fe2/S2 cluster (FeS) bond.
6 Halperin I, Wolfson H & Nussinov R (2003). Sitelight:
binding-site prediction using phage display libraries.
Protein Sci 12: 1344–1359.
7 Inbar Y, Benyamini H, Nussinov R & Wolfson HJ
(2005) Prediction of multimolecular assemblies by
multiple docking. J Mol Biol 349, 435–447.
8 Inbar Y, Benyamini H, Nussinov R & Wolfson HJ
(2003). Protein structure prediction via combinatorial
assembly of sub-structural units. Bioinformatics 19
(Suppl 1): i158–i168.
9 Keskin O, Gursoy A, Ma B & Nussinov R (2007)
Towards drugs targeting multiple proteins in a systems
biology approach. Curr Top Med Chem 7, 943–951.
10 Keskin O, Nussinov R & Gursoy A (2008) Prism:
protein–protein interaction prediction by structural
matching. Methods Mol Biol 484, 505–521.
11 Keskin O & Nussinov R (2007) Similar binding sites
and different partners: implications to shared proteins
in cellular pathways. Structure 15, 341–354.
12 Tsai CJ, Lin SL, Wolfson HJ & Nussinov R (1996)
A dataset of protein–protein interfaces generated with a
sequence-order-independent comparison technique.
J Mol Biol 260, 604–620.
13 Alesker V, Nussinov R & Wolfson HJ (1996) Detection
of non-topological motifs in protein structures. Protein
Eng 9, 1103–1119.
14 Azarya-Sprinzak E, Naor D, Wolfson HJ & Nussinov
R (1997) Interchanges of spatially neighbouring residues
in structurally conserved environments. Protein Eng 10,
1109–1122.
15 Gold ND & Jackson RM (2006). Sitesbase: a database
for structure-based protein-ligand binding site
comparisons. Nucleic Acids Res 34(Database issue):
D231–D234.
16 Kinnings SL & Jackson RM (2009) Binding site
similarity analysis for the functional classification of
the protein kinase family. J Chem Inf Model 49,
318–329.
17 Kuhn D, Weskamp N, Hüllermeier E & Klebe G
(2007) Functional classification of protein kinase binding
sites using cavbase. ChemMedChem 2, 1432–1447.
18 Kinjo AR & Nakamura H (2009) Comprehensive struc-
tural classification of ligand-binding motifs in proteins.
Structure 17, 234–246.
19 Webb EC (1989) Enzyme nomenclature. recommenda-
tions 1984. Supplement 2: corrections and additions.
Eur J Biochem 179, 489–533.
20 Kaufman L & Rousseeuw P (1990). Finding Groups
in Data: An Introduction to Cluster Analysis. Wiley,
New York, NY.
21 Szabadka Z & Grolmusz V (2007) High throughput
processing of the structural information in the protein
data bank. J Mol Graph Model 25, 831–836.
22 Lovász L & Plummer MD (1986). Matching Theory,
Vol. 121 of North-Holland Mathematics Studies. North-
Holland Publishing Co., Amsterdam. Ann Discrete
Mathematics 29.
23 Ester M, Kriegel H-P, Sander J & Xu X (1996). A
density-based algorithm for discovering clusters in large
spatial databases with noise. In: Proceedings of the 2nd
International Conference on Knowledge Discovery and
Data Mining, Portland, OR, 1996, pp. 226–231. AAAI
Press.
24 Ankerst M, Breunig MM, Kriegel H & Sander J (1999).
Optics: ordering points to identify the clustering
structure. In: Proceedings of ACM SIGMOD ‘99
International Conference on Management of Data,
Philadelphia, PA, 1999, pp. 49–60. ACM Press.
25 Schmitt S, Kuhn D & Klebe G (2002) A new method
to detect related function among proteins independent
of sequence and fold homology. J Mol Biol 323, 387–
406.
26 Sonnhammer EL, Eddy SR, Birney E, Bateman A &
Durbin R (1998) Pfam: multiple sequence alignments
and hmm-profiles of protein domains. Nucleic Acids Res
26, 320–322.
27 Sonnhammer EL, Eddy SR & Durbin R (1997) Pfam:
a comprehensive database of protein domain families
based on seed alignments. Proteins 28, 405–420.