A hybrid clustering of protein binding sites

Gábor Iván 1,2, Zoltán Szabadka 1,2 and Vince Grolmusz 1,2
1 Protein Information Technology Group, Department of Computer Science, Eötvös University, Budapest, Hungary
2 Uratim Ltd., Budapest, Hungary

FEBS Journal 277 (2010) 1494–1502 © 2010 The Authors. Journal compilation © 2010 FEBS
doi:10.1111/j.1742-4658.2010.07578.x
(Received 6 August 2009, revised 7 January 2010, accepted 12 January 2010)

Keywords
binding sites; clustering; distance; OPTICS; PDB; sequence

Correspondence
V. Grolmusz, Protein Information Technology Group, Department of Computer Science, Eötvös University, Pázmány Péter stny. 1/C, H-1117 Budapest, Hungary and Uratim Ltd., H-1118 Budapest, Hungary
Fax: +36 1 381 2231
Tel: +36 1 381 2226
E-mail: grolmusz@cs.elte.hu

The Protein Data Bank contains the description of approximately 27 000 protein–ligand binding sites. Most of the ligands at these sites are biologically active small molecules, affecting the biological function of the protein. The classification of their binding sites may lead to relevant results in drug discovery and design. Clusters of similar binding sites were created here by a hybrid, sequence and spatial structure-based approach, using the OPTICS clustering algorithm. A dissimilarity measure was defined: a distance function on the amino acid sequences of the binding sites. All the binding sites in the Protein Data Bank were clustered according to this distance function, and it was found that the clusters corresponded well to the Enzyme Commission numbers of the entries. The results, carefully color coded by the Enzyme Commission numbers of the proteins, containing the 20 967 binding sites clustered, are available as html files in three parts at http://pitgroup.org/seqclust/.

Abbreviations
EC, Enzyme Commission; gp, gap penalty; OPTICS, Ordering Points to Identify the Clustering Structure; PDB, Protein Data Bank.

Introduction

In recent years, the exploration of the human genome has received wide publicity. Although somewhat less emphasized, another important bioinformatics resource is the exponentially growing, publicly available Protein Data Bank (PDB) [1], containing more than 55 000 biological structures at the present time.

The three-dimensional structures of small molecules, e.g. drug molecules, can usually be calculated from their chemical composition. Several databases exist that contain millions of ligands. An example of this is the freely available ZINC database [2], created from catalogues of compound manufacturers. Unlike those of ligands, the three-dimensional structures of proteins cannot be calculated easily; therefore, the importance of the rapid growth of the PDB cannot be overestimated.

Most antimicrobial drug molecules act as enzyme inhibitors. Inhibitors need to bind more strongly to the enzyme than the substrate of the enzyme does; consequently, the chemical and geometrical properties of the binding sites are of utmost importance in drug discovery and design.

The PDB contains the three-dimensional structures of more than 55 000 entries. In a separate study [3], we collected, verified and cleaned the list of approximately 27 000 binding sites found in the PDB. During the identification of these binding sites, we filtered out crystallization artifacts and covalently bound small molecules, and also considered broken peptide chains, modified amino acids and incorrectly labeled HET groups. The resulting cleaned, strictly structured RS-PDB database [3] can serve as an input for different data mining algorithms. One such technique of classification is clustering.

By clustering binding sites, it is possible to create binding site similarity classes. These classes can be useful for the classification of protein–ligand interactions.

In this article, we present a fast, sequence-based method for binding site clustering that takes into account the amino acid sequences in the close neighborhood of binding sites. Our method is a hybrid, in the sense that it uses the sequence information together with steric data from the PDB in a clearly structured manner.

Previous work

A rich literature describes techniques that identify biological function from structural protein information by applying highly nontrivial mathematical tools [4,5]. Some of these tools have been applied to determine or analyze protein–protein interaction network topology [6–10] or binding sites [6,11]. A considerable amount of work has also been performed to devise structural properties that are independent of the polypeptide sequence order [12–14].

Unlike other binding site clustering solutions in the literature [15–18], we used a hybrid of order-independent methods that analyze the three-dimensional structure of the binding site and an order-analysis method. One of the main features of our order-analysis method is that it can handle multiple polypeptide chains in the same binding site (Fig. 1).

Results and Discussion

Our main result was the OPTICS (Ordering Points to Identify the Clustering Structure)-based clustering of the 20 967 binding sites found. In order to verify the capabilities of the clustering method, we need to compare the clusters found with verified biological functions.

Verification of results: biological relevance

Ideally, proteins of the same or closely related functions ought to be assigned to the same cluster. We considered the Enzyme Commission (EC) number classification of enzymes [19], and color coded the EC numbers such that closely related functions were given similar colors, as provided in http://pitgroup.org/seqclust/bsites_AAcodes/EC_colour.html.
The color-coded clusters, together with the ordinal number of the binding site, the PDB ID, the cluster ID and the EC number, can be found in three large html files (Page1, Page2, Page3) under http://pitgroup.org/seqclust/. The clusters correspond to concave regions in the figure.

The deviations of the EC numbers in all the clusters were also computed, and are given in the online table http://pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt. In most of the clusters, the deviation is zero; the average deviation is 1.71%.

We believe that the validation of the enzymatic functions through EC numbers shows that our clustering method is an adequate solution for binding site clustering and classification.

Parameter settings and examples

We present here, as examples, four binding sites from the largest cluster (element count: 448) (see Fig. 2). All four proteins are blood clotting factors. The whole cluster is given in the online figure http://pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html. It should be noted that the whole cluster is colored blue, and all the members of the cluster (between line numbers 702 and 1149; cluster ID: 28) have EC numbers of the form 3.4.21.X (serine proteases).

From the second largest cluster (element count: 188), three binding sites were visualized (Fig. 3). The whole cluster is given in the online figure http://pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html. It should be noted that the whole cluster is colored deep violet, and almost all members of the cluster (between line numbers 1224 and 1411) have EC numbers 3.4.23.16 (HIV-1 retropepsins). More detailed analysis of the homogeneity of the clusters is given in http://pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt.

Clustering quality measurement

The quality of clustering depends on several parameters.
These include the distance function used to determine the similarity or distance of the objects, and the parameters of the clustering algorithm. In order to obtain appropriate feedback about the quality of clustering with a given parameter setting, quality metrics need to be defined. For this purpose, we used the ‘silhouette coefficient’ [20]. The advantage of the silhouette coefficient is that it is completely independent of the type of data being clustered; it uses only object distances and cluster membership assignments for its determination.

Basically, the silhouette coefficient measures how distinct the clusters are: the ‘silhouette value’ of a cluster is the smallest possible distance between an element of this cluster and an element of the neighboring clusters. The silhouette coefficient of the overall clustering is the average of the silhouette values for the individual clusters. More exactly, the silhouette coefficient is defined as the average of the silhouettes taken over all the objects. The silhouette of object $i$ is defined as $s_i = (b_i - a_i)/\max(a_i, b_i)$, where $a_i$ is the average distance of object $i$ to the points of its own cluster, and $b_i$ is the minimum of the average distances of object $i$ to the other clusters. It should be noted that, typically, $a_i < b_i$, and so the silhouette is equal to $1 - a_i/b_i$. Clearly, for good clustering, the typical $a_i$ value is much less than $b_i$; therefore, the silhouettes of the objects and the silhouette coefficient are close to unity.

Fig. 1. A binding site with four protein chains (PDB ID: 1CT8). Each chain is colored differently.

Fig. 2. Four binding sites (PDB IDs: 1ZPB, 1RXP, 1C5Z, 2BZ6) from the same cluster. The whole cluster is given in the online figure http://pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html. Note that the whole cluster is colored blue, and all the members of the cluster (between line numbers 702 and 1149; cluster ID: 28) have EC numbers of the form 3.4.21.X (serine proteases). More analysis on the homogeneity of the clusters is given in http://pitgroup.org/seqclust/EC_deviation.txt.

Fig. 3. Three binding sites from the same cluster (one site from PDB ID 1BDL and two sites from PDB ID 1W5V); these are HIV-1 proteases. The whole cluster is given in the online figure http://www.pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html. Note that the whole cluster is colored deep violet, and almost all the members of the cluster (between line numbers 1210 and 1435) have EC numbers of the form 3.4.23.16 (HIV-1 retropepsins). More analysis on the homogeneity of the clusters is given in http://www.pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt.

The data contained in Table 1 are based on empirical measurements. The values of the silhouette coefficient are strongly dependent on the applied distance function. Therefore, it is questionable whether clusters can be classified into rigid quality categories on the basis of the silhouette coefficient value. However, the coefficient is undoubtedly useful for comparing the quality of different clusterings.

By definition, the silhouette coefficient requires the clustering algorithm to assign each binding site to a cluster. Thus, the silhouette coefficient value also reflects the amount of noise contained in the database. The clustering algorithm used in this study is the OPTICS algorithm (see later). This algorithm allows some binding sites to be marked as ‘noise’ (thus not assigning them to any cluster).
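The per-object silhouette defined above can be computed directly from a distance matrix and a list of cluster labels. The following is a minimal sketch, not the authors' implementation: the function name, the use of -1 as the ‘noise’ label and the `count_noise_as_zero` switch are assumptions made for illustration, as is the convention of assigning a silhouette of 0 to objects in singleton clusters.

```python
def silhouette_coefficient(dist, labels, noise=-1, count_noise_as_zero=False):
    """Average of s_i = (b_i - a_i) / max(a_i, b_i) over the clustered objects.

    dist   -- square matrix (list of lists) of pairwise distances
    labels -- cluster label per object; `noise` marks unclustered objects
    """
    # Group object indices by cluster, leaving noise out.
    clusters = {}
    for idx, lab in enumerate(labels):
        if lab != noise:
            clusters.setdefault(lab, []).append(idx)

    values = []
    for lab, members in clusters.items():
        rivals = [m for other, m in clusters.items() if other != lab]
        for i in members:
            own = [j for j in members if j != i]
            if not own or not rivals:
                values.append(0.0)  # assumed convention for degenerate cases
                continue
            a = sum(dist[i][j] for j in own) / len(own)           # a_i
            b = min(sum(dist[i][j] for j in m) / len(m) for m in rivals)  # b_i
            top = max(a, b)
            values.append((b - a) / top if top > 0 else 0.0)

    if count_noise_as_zero:
        # Optionally count every noise object with silhouette 0.
        values.extend(0.0 for lab in labels if lab == noise)
    return sum(values) / len(values) if values else 0.0
```

With two tight, well-separated groups the value is close to unity; setting `count_noise_as_zero=True` lowers it in proportion to the number of unclustered objects.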
It does not seem reasonable for binding sites that are ‘noise’ to be taken into account twice (once when the OPTICS algorithm marks them, and once during the calculation of the silhouette coefficient). Therefore, binding sites marked as ‘noise’ were not taken into account when calculating the silhouette coefficient. Nevertheless, for completeness, we show (Fig. 4) how the value of the silhouette coefficient would change if binding sites marked as ‘noise’ were taken into consideration with a silhouette value of 0.

Effects of parameters on the quality of clustering and cluster size distribution

Within our binding site model, distance function and clustering algorithm, three main parameters affected the properties of clustering: OPTICS MINPTS, the OPTICS cut-off level and the gap penalty (gp) of the distance function. We examined how these parameters affected the quality of clustering measured by the silhouette coefficient. The results are given in Figs 4 and 5.

• Effect of gp. Increasing gp slightly improved the quality of clustering. This is understandable if we consider that the introduction of a less strict gp function automatically decreases the average distance between the clusters.
• Effect of MINPTS. On increasing MINPTS, two main effects were observed. An increase in MINPTS yields better quality clustering. However, it also yields many more binding sites classified as ‘noise’. The main cause of the latter effect is that clusters that exist in the database, but contain fewer points than MINPTS, are not recognized; they are marked as ‘noise’. On the basis of this observation, it can be stated that our binding site database contains numerous small clusters.
• Effect of OPTICS cut-off level. Increasing the cut-off level decreases the quality of clustering, and also the number of binding sites marked as ‘noise’. The application of an extremely high cut-off level places almost all binding sites into the same cluster; the quality of such clustering can by no means be considered as high.

In conclusion, low MINPTS and low cut-off levels yield the best clustering quality (whilst covering 70–80% of the binding sites found in the PDB). In Figs 4 and 5, we represent the dependence of clustering quality on these parameters.

Table 1. Cluster quality descriptions based on silhouette coefficient values in [20].
Silhouette coefficient    Clustering quality
0.00–0.25                 Clusters cannot be adequately identified; cluster borders are not obvious
0.25–0.50                 Clusters can be identified, but there are numerous unclassifiable points (‘noise’)
0.50–0.70                 Most of the data/points can be classified
0.70–1.00                 Excellent distinguishable clusters

Methods

Binding site representation

As a first step, an exact definition of a binding site must be provided. For easy algorithmic handling, we stored the binding sites found in the PDB in a compact data structure.

The definition of binding sites

A binding site is defined as a set of atom pairs; the first atom of the pair belongs to the protein, and the second atom to the bound ligand, such that their distance is equal to the sum of the van der Waals’ radii, calculated differently for different atom types. That is, only pairs within noncovalent binding distances are included in the list. Binding sites containing covalently bound ligands are not considered in this work, as our main motivation was to review pharmacologically significant binding sites.

A ‘binding amino acid (or residue)’ is an amino acid with at least one of its atoms in the binding atom pair. A ‘binding amino acid sequence’ is an amino acid sequence that contains at least one binding amino acid.
Basically, binding sites are represented by storing all the binding amino acid sequences of all the protein chains that are present at the particular binding site.

Binding sites were extracted from the RS-PDB database described in [21] and [3]. By using this definition for binding sites, all amino acids from a given amino acid sequence that have at least one atom contained in an atom pair set (describing a binding site) can be identified.

Residue sequence representation

An amino acid sequence refers to a sequence of amino acids connected by peptide bonds that is of maximal length (i.e. it cannot be continued with further amino acids on either end).

It should be noted that multiple amino acid sequences might occur in the immediate vicinity of a single binding site, making binding site distance/similarity determination fairly complicated. An example of a binding site with four neighboring polypeptide chains can be seen in Fig. 1.

Binding amino acid sequences were first extracted from the binding sites of the RS-PDB database [3,21] and then simplified as follows. A string was assigned to each amino acid sequence in a binding site. In this string, residues participating in the bond were indicated by their one-character code; nonbinding amino acids were indicated by ‘-’. As our purpose was to deal with only the binding sections, the prefixes and postfixes consisting purely of nonbinding amino acids (or, in our notation, ‘-’) were deleted. Hence, all the strings constructed in this way start and end with a binding amino acid.

A binding amino acid sequence constructed and transformed in this way (from PDB entry 2BZ6) is as follows: H TT–D P DSCK S VSWGQGC G.

Distance function

In order to use a clustering algorithm, we need to define a distance function. The binding sites are represented by all amino acid sequences that participate in the bond with the ligand.
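The string representation described under ‘Residue sequence representation’ can be sketched in a few lines. This is an illustrative sketch only: the function name and the input format (a chain as a string of one-letter codes plus a set of binding positions) are assumptions, and the example chain is hypothetical rather than taken from a PDB entry.

```python
def binding_string(residues, binding_positions):
    """Build the binding amino acid string for one chain.

    residues          -- one-letter codes of a maximal peptide chain, as a string
    binding_positions -- set of 0-based indices of binding residues
    """
    # Binding residues keep their one-letter code; all others become '-'.
    marked = "".join(
        aa if i in binding_positions else "-"
        for i, aa in enumerate(residues)
    )
    # Drop the purely nonbinding prefix and postfix, so that the result
    # starts and ends with a binding amino acid (or is empty).
    return marked.strip("-")


# Hypothetical chain with binding residues at positions 2, 4, 5 and 7.
example = binding_string("MKHLTTAD", {2, 4, 5, 7})
```

Here `example` is the string `H-TT-D`; a chain with no binding residue yields the empty string and would not be stored for the binding site.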
Consequently, we need to define the distance of the sequence sets situated in the binding sites. This is accomplished first by defining the distance of two sequences (described in the next section), and then by defining the distance of the sequence sets. The reason for this complexity is the fact that more than one binding sequence can be present in a binding site (see Fig. 1).

Sequence comparison algorithm

To measure the distances of the binding sections of amino acid sequences constructed in this way, we used a modified version of the algorithm employed to calculate the Levenshtein distance (denoted as L). The modifications involved the assignment of different costs to gaps depending on where they were inserted, whereas amino acid mismatches were simply penalized by the value unity.

Fig. 4. Silhouette coefficient dependence on parameter MINPTS when unclustered binding sites are also taken into account at silhouette coefficient determination (gp = 1/10). The color coding is given in Table 2.

Fig. 5. Number of binding sites contained in clusters as a function of the number of clusters allowed to be used (gp = 1/10). The color coding is given in Table 2.

The costs of aligned binding and nonbinding amino acids were as follows:
• The cost of two aligned, different amino acids is unity.
• The cost of aligned, matching amino acids is zero.

Gaps were penalized as follows:
• The insertion of a gap with a length of one unit (one amino acid) costs gp if the gap is aligned with a nonbinding amino acid in the other sequence. If a gap is aligned with a binding amino acid, its cost is unity.
• The insertion of gaps at the end of sequences is only penalized if they are aligned with binding amino acids. Gaps inserted at either end of a sequence have a zero cost if they are aligned with nonbinding amino acids.
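The cost rules listed above fit the usual dynamic-programming scheme for edit distances. The sketch below is one possible reading of those rules, not the authors' code: the function and helper names are hypothetical, ‘-’ is treated as an ordinary character standing for a nonbinding amino acid, and a gap is taken to be ‘at the end’ when it lies before the first or after the last character of the other string.

```python
def sequence_distance(s, t, gp=0.1):
    """Modified Levenshtein distance between two binding amino acid strings."""
    n, m = len(s), len(t)

    def gap_cost(ch, pos, other_len):
        # ch is aligned with a gap placed at position `pos` of the other string.
        if ch != "-":
            return 1.0            # gap against a binding amino acid
        if pos in (0, other_len):
            return 0.0            # end gap against a nonbinding amino acid
        return gp                 # inner gap against a nonbinding amino acid

    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):     # s aligned entirely with leading gaps in t
        d[i][0] = d[i - 1][0] + gap_cost(s[i - 1], 0, m)
    for j in range(1, m + 1):     # t aligned entirely with leading gaps in s
        d[0][j] = d[0][j - 1] + gap_cost(t[j - 1], 0, n)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0.0 if s[i - 1] == t[j - 1] else 1.0
            d[i][j] = min(
                d[i - 1][j - 1] + mismatch,               # align two characters
                d[i - 1][j] + gap_cost(s[i - 1], j, m),   # gap in t
                d[i][j - 1] + gap_cost(t[j - 1], i, n),   # gap in s
            )
    return d[n][m]
```

Under this reading, `sequence_distance("HD", "H-D")` costs one inner gap (gp), and the distance of any string to the empty string equals its number of binding amino acids.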
It can be shown that the Levenshtein distance (and also our modified version) fulfills the required properties for being a metric. Non-negativity and symmetry can be seen directly from the definition (assuming non-negative costs). It is also obvious that a zero distance can only be achieved by comparing the same objects: $L(x,y) = 0$ if, and only if, $x = y$ (assuming that every compared sequence starts and ends with a binding amino acid). What is left to prove is the triangle inequality: for all strings $s$, $t$, $r$ (binding amino acid sequences), $L(s,t) \le L(s,r) + L(r,t)$. In other words, the triangle inequality asserts that changing $s$ to $t$ via $r$ cannot cost less than changing $s$ to $t$ directly. As the Levenshtein distance (by definition) is the minimum possible total cost of operations transforming $s$ into $t$, and the sequence of operations that transforms $s$ into $r$ and then $r$ into $t$ is also an allowed sequence of operations, it cannot have a lower total cost than $L(s,t)$, as this would contradict the optimality of $L(s,t)$. (What we may need to prove at this point is that the algorithm used indeed calculates the defined distance $L$.)

This reasoning is also applicable to our modified version of the Levenshtein distance; the only difference is that we have a somewhat more sophisticated set of costs for the insertion, deletion and changing of the characters. We assume that the costs are non-negative, and that any binding amino acid sequence compared with our distance function starts and ends with a binding amino acid. We can now reformulate the above defined costs to be used with ‘insert’, ‘delete’ and ‘change’ operations.

Costs for insertion
• Insertion of ‘-’ to the end of the sequence: 0.
• Insertion of ‘-’ between the first and last binding amino acids of the sequence: gp.
• Insertion of a one-letter code of a binding amino acid: 1.

Costs for deletion
• Deletion of ‘-’ from the end of the sequence: 0.
• Deletion of ‘-’ between the first and last unchanged binding amino acids of the sequence: gp.
• Deletion of a one-letter code of a binding amino acid: 1.

Costs for character change
• For matching characters: 0.
• For nonmatching characters: 1.

If we want to transform a binding amino acid sequence $s$ into $t$ using the above operations, we cannot expect to obtain a lower total cost by first transforming $s$ to an arbitrary $r$ and then $r$ to $t$ (compared with the direct transformation of $s$ to $t$). This means that the triangle inequality holds.

Binding site comparison method

The input of the distance function described above is two strings that represent amino acid sequences extracted from binding sites. However, our aim is to measure the distance of the binding sites, not just of single amino acid sequences. We have seen in section ‘Previous work’, in Fig. 1, that multiple amino acid sequences might occur in the immediate vicinity of a binding site. Therefore, we also need to define the distance of the sequence sets representing binding sites.

For this purpose, a complete bipartite graph is defined. This is a graph in which the set of vertices can be divided into two disjoint sets, A and B, such that no edge has both of its endpoints in the same set, |A| = |B| and the number of edges is always |A|·|B|.
• Points of the vertex sets A and B correspond to the amino acid sequences of the first and second binding sites, respectively. If the numbers of amino acid sequences are not equal in the two binding sites, amino acid sequences with zero length are added to the smaller set.
• Weights are assigned to all edges of this graph; each weight is the distance of the two amino acid sequences connected by the edge. By ‘distance’, we mean the distance defined in the previous section.

The distance of the sequence sets A and B is then defined as the weight of the minimum weight perfect matching [22] in the graph defined above.
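The set-to-set distance can be illustrated with a small sketch. It assumes that a binding site is given as a list of binding amino acid strings and that a sequence distance function is passed in as `seq_dist`; both names are hypothetical. For brevity, the minimum weight perfect matching is found by trying every pairing, which is only practical for the handful of chains that meet at one binding site; this brute-force search is a stand-in for a proper matching algorithm and is not claimed to be what the authors used.

```python
from itertools import permutations


def site_distance(site_a, site_b, seq_dist):
    """Weight of the minimum weight perfect matching between two sequence sets."""
    a, b = list(site_a), list(site_b)
    size = max(len(a), len(b))
    # Pad the smaller set with zero-length sequences so that |A| = |B|.
    a += [""] * (size - len(a))
    b += [""] * (size - len(b))

    # Edge weights of the complete bipartite graph.
    weights = [[seq_dist(x, y) for y in b] for x in a]

    # Each permutation p pairs a[i] with b[p[i]], i.e. one perfect matching.
    return min(
        sum(weights[i][p[i]] for i in range(size))
        for p in permutations(range(size))
    )
```

Because the pairing is chosen to minimize the total weight, the order in which the chains of a binding site are listed does not affect the result.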
It should be noted that, by the definition of the previous section, the distance of an arbitrary residue sequence A to a zero-length sequence B is the binding amino acid count of sequence A.

Binding site distance normalization

The expected distance of two randomly generated binding sites will be proportional to the sum of the numbers of binding amino acids occurring at the binding sites. The maximum achievable distance is always less than this sum.

The distance of two binding sites calculated using the function described in the previous section does not describe the binding site dissimilarity alone. If the distance of two binding sites is three, it may be that they have three binding amino acids each, and hence they may be completely different. However, a distance of three between two binding sites with 30 binding residues each is approximately a 10% difference, and so these binding sites might be almost the same. Therefore, it is necessary to ‘normalize’ the distances. We did this by dividing all distances by the sum of the numbers of binding amino acids of the two binding sites being compared. This operation yields a value between zero and unity that can also be interpreted as a percentage of the absolute maximum possible distance of the two binding sites.

Clustering algorithm

For data clustering, we wanted to use an algorithm that was not biased towards even-sized and regular-shaped clusters. One algorithm with these properties is DBSCAN [23], which is a density-based algorithm. The density of objects is defined with a radius-like ε parameter and an object-count lower limit (MINPTS): the neighborhood of a certain object ‘o’ is considered to be dense if there exist at least MINPTS objects within a distance of less than ε. Therefore, MINPTS and ε are input parameters of the algorithm.
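Both the normalization step and the density criterion are simple enough to write down explicitly. The sketch below uses hypothetical function names; whether the object itself counts towards MINPTS is not stated in the text, and counting it (as the distance of an object to itself is zero) is an assumption made here.

```python
def normalized_distance(raw, n_binding_a, n_binding_b):
    """Divide a raw site distance by the total number of binding amino acids."""
    total = n_binding_a + n_binding_b
    return raw / total if total else 0.0


def is_dense(index, dist, eps, min_pts):
    """Density test for one object, given a square distance matrix `dist`.

    Counts the objects closer than `eps` to object `index`; the object itself
    is included in the count (an assumption, see the lead-in).
    """
    neighbours = sum(1 for d in dist[index] if d < eps)
    return neighbours >= min_pts
```

For instance, a raw distance of 3 between two sites with three binding amino acids each normalizes to 0.5, whereas the same raw distance between two sites with 30 binding amino acids each normalizes to 0.05.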
Unfortunately, the clustering structure of many real datasets cannot be characterized by global density parameters, as quite different local densities may exist in different areas of the data space. The OPTICS algorithm [24] overcomes these difficulties by ordering the objects contained in the database, creating a so-called ‘reachability plot’. The reachability plot is a very clever visualization of high-dimensional clusters. It is basically generated by assigning a value, called the ‘reachability distance’, to all the objects of the database, whilst going through the database points in a specific order. The reachability distance is given on the y axis, and the objects (i.e. binding site representations) are numbered on the x axis. Clusters correspond to concave regions in the plot. After the creation of the reachability plot, cluster membership assignments can be created by cutting the reachability plot with a horizontal line referred to as the ‘cut-off level’.

The reachability plot of a small database consisting of binding sites that contain NAD as the ligand is shown in Fig. 6.

Table 2. Colors assigned to different OPTICS cut-off levels.
Color      Cut-off level (%)
Red        20
Green      30
Blue       40
Cyan       50
Magenta    60
Yellow     70

Fig. 6. OPTICS reachability plot of a database consisting of 800 binding sites.

Database parameters and further settings used in the OPTICS algorithm

The parameters used for clustering were as follows: OPTICS MINPTS, 2; OPTICS cut-off level, 20%; gp, 1/10.

The OPTICS algorithm was run on a database consisting of 20 967 binding sites. Indistinguishable binding sites, which were assigned exactly the same binding amino acid sequence sets and ligand identifiers, were included only once. (The original database without this kind of redundancy filtering consisted of 27 208 binding sites.)
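Cutting a reachability plot with a horizontal line can be sketched as follows. This is a simplified illustration, not the extraction procedure of the OPTICS paper [24]: it uses only the reachability distances in OPTICS order (no core distances), treats the first object as unreachable, and decides whether an object above the cut-off starts a new cluster by looking at the next object. The function name and the -1 ‘noise’ label are assumptions.

```python
def cut_reachability_plot(reachability, cutoff, noise=-1):
    """Assign cluster labels from reachability distances given in OPTICS order."""
    labels = []
    cluster_id = -1
    count = len(reachability)
    for k, r in enumerate(reachability):
        if k == 0 or r > cutoff:
            # The object is not reachable from its predecessors at this level.
            following = reachability[k + 1] if k + 1 < count else float("inf")
            if following <= cutoff:
                cluster_id += 1            # it opens a new valley (cluster)
                labels.append(cluster_id)
            else:
                labels.append(noise)       # isolated object
        else:
            labels.append(cluster_id)      # still inside the current valley
    return labels


plot = [float("inf"), 0.1, 0.1, 0.5, 0.1, 0.6, 0.7]
low_cut = cut_reachability_plot(plot, 0.2)    # two clusters, two noise objects
high_cut = cut_reachability_plot(plot, 1.0)   # everything in one cluster
```

The two calls mirror the behaviour described in the text: a low cut-off level separates the valleys and leaves some objects as ‘noise’, whereas a very high cut-off level merges all objects into a single cluster.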
The distance of the binding sites was measured with the distance function described above.

Using labeling encoding binding types

Following the suggestion of an anonymous referee, we modified the labeling of the bond residues as follows: using the approach first described in [25], we replaced each amino acid’s one-letter abbreviation with one of the following five characters (‘A’, ‘D’, ‘H’, ‘C’, ‘P’), depending on the assumed type of interaction between the given amino acid and the ligand. As several atoms of an amino acid can be located within the ‘binding distance’ (defined to be more than 1.25 times the sum of the covalent radii belonging to the protein and ligand atoms, respectively, but less than 1.05 times the sum of the van der Waals’ radii belonging to these atoms), for a given amino acid we only considered its closest atom to the ligand. Five types of interaction were used: ‘hydrogen-bond acceptor’ (denoted by ‘A’); ‘hydrogen-bond donor’ (denoted by ‘D’); ‘mixed hydrogen-bond donor/acceptor’ (denoted by ‘H’, e.g. hydroxyl groups or side-chain nitrogen atoms in histidine); hydrophobic aliphatic interaction (denoted by ‘C’); and aromatic (denoted by ‘P’); all are described in [25].

Using this labeling, we applied the OPTICS algorithm exactly as described above. The resulting clusters are given in the second set of online supporting figures at http://pitgroup.org/seqclust, in four html files, together with a statistical analysis. It is easy to see that, for the large clusters, the amino acid labeling gives better results.

Conclusions

In this article, we have presented a fast, sequence-based method capable of classifying the binding sites contained in the publicly available PDB. We determined the parameter settings yielding a classification with the best quality (measured by the silhouette coefficient).
Our main result was a sequence-based approach, derived from three-dimensional structures, used for binding site clustering (rather than three-dimensional binding site structure), that allows multiple sequences to occur at each binding site. We also evaluated our clustering results with a large, colored diagram (given at the URL http://pitgroup.org/seqclust), where the colors correspond to the EC numbers of the proteins containing the binding sites. As witnessed by the colored diagram, and also by the numerical deviations given in http://pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt, our method has a clear-cut biological significance. The method presented in this work may help to reveal evolutionarily related binding sites, and may also be used to filter redundancies (i.e. multiply occurring binding sites) from the PDB.

A possible step for further research could be the creation of aggregate sequence set profiles for each binding site cluster, generating binding site families similar to the Protein Families Database [26,27].

Acknowledgements

This work was supported by the Hungarian Scientific Research Fund (NK-67867, CNK-77780), and by the Hungarian National Office for Research and Technology (OMFB-01295/2006 and OM-00219/2007).

References

1 Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN & Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28, 235–242.
2 Irwin JJ & Shoichet BK (2005) A free database of commercially available compounds for virtual screening. J Chem Inf Comput Sci 45, 177–182.
3 Szabadka Z & Grolmusz V (2006) Building a structured PDB: the RS-PDB database. In: Proceedings of the 28th IEEE EMBS Annual International Conference, New York, NY, August 30–September 3, 2006, pp. 5755–5758. IEEE Press, New York, NY.
4 Artamonova II, Frishman G, Gelfand MS & Frishman D (2005) Mining sequence annotation databanks for association patterns. Bioinformatics 21, iii49–iii57.
5 Gunasekaran K, Ma B & Nussinov R (2004) Is allostery an intrinsic property of all dynamic proteins? Proteins 57, 433–443.
6 Halperin I, Wolfson H & Nussinov R (2003) SiteLight: binding-site prediction using phage display libraries. Protein Sci 12, 1344–1359.
7 Inbar Y, Benyamini H, Nussinov R & Wolfson HJ (2005) Prediction of multimolecular assemblies by multiple docking. J Mol Biol 349, 435–447.
8 Inbar Y, Benyamini H, Nussinov R & Wolfson HJ (2003) Protein structure prediction via combinatorial assembly of sub-structural units. Bioinformatics 19 (Suppl 1), i158–i168.
9 Keskin O, Gursoy A, Ma B & Nussinov R (2007) Towards drugs targeting multiple proteins in a systems biology approach. Curr Top Med Chem 7, 943–951.
10 Keskin O, Nussinov R & Gursoy A (2008) PRISM: protein–protein interaction prediction by structural matching. Methods Mol Biol 484, 505–521.
11 Keskin O & Nussinov R (2007) Similar binding sites and different partners: implications to shared proteins in cellular pathways. Structure 15, 341–354.
12 Tsai CJ, Lin SL, Wolfson HJ & Nussinov R (1996) A dataset of protein–protein interfaces generated with a sequence-order-independent comparison technique. J Mol Biol 260, 604–620.
13 Alesker V, Nussinov R & Wolfson HJ (1996) Detection of non-topological motifs in protein structures. Protein Eng 9, 1103–1119.
14 Azarya-Sprinzak E, Naor D, Wolfson HJ & Nussinov R (1997) Interchanges of spatially neighbouring residues in structurally conserved environments. Protein Eng 10, 1109–1122.
15 Gold ND & Jackson RM (2006) SitesBase: a database for structure-based protein–ligand binding site comparisons. Nucleic Acids Res 34 (Database issue), D231–D234.
16 Kinnings SL & Jackson RM (2009) Binding site similarity analysis for the functional classification of the protein kinase family. J Chem Inf Model 49, 318–329.
17 Kuhn D, Weskamp N, Hüllermeier E & Klebe G (2007) Functional classification of protein kinase binding sites using Cavbase. ChemMedChem 2, 1432–1447.
18 Kinjo AR & Nakamura H (2009) Comprehensive structural classification of ligand-binding motifs in proteins. Structure 17, 234–246.
19 Webb EC (1989) Enzyme nomenclature. Recommendations 1984. Supplement 2: corrections and additions. Eur J Biochem 179, 489–533.
20 Kaufman L & Rousseeuw P (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York, NY.
21 Szabadka Z & Grolmusz V (2007) High throughput processing of the structural information in the protein data bank. J Mol Graph Model 25, 831–836.
22 Lovász L & Plummer MD (1986) Matching Theory, Vol. 121 of North-Holland Mathematics Studies. North-Holland Publishing Co., Amsterdam. Ann Discrete Mathematics 29.
23 Ester M, Kriegel H-P, Sander J & Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, 1996, pp. 226–231. AAAI Press.
24 Ankerst M, Breunig MM, Kriegel H-P & Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: Proceedings of ACM SIGMOD ‘99 International Conference on Management of Data, Philadelphia, PA, 1999, pp. 49–60. ACM Press.
25 Schmitt S, Kuhn D & Klebe G (2002) A new method to detect related function among proteins independent of sequence and fold homology. J Mol Biol 323, 387–406.
26 Sonnhammer EL, Eddy SR, Birney E, Bateman A & Durbin R (1998) Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res 26, 320–322.
27 Sonnhammer EL, Eddy SR & Durbin R (1997) Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 28, 405–420.

Fig. 7. A representative of cluster 85 in the online table http://www.pitgroup.org/seqclust/bsites_pseudocenters/bsites_optics_M04_No001.html. Cluster 85 contains PDB entries 3B9J, 1FFU, 1JRP, 1T3Q, 2E3T, 1JRO, 1RM6, 1WY6, 1N5X; all of these contain an Fe2/S2 cluster (FeS) bond.

Date posted: 15/03/2014, 10:20
