DISULFIDE CONNECTIVITY PREDICTION USING SECONDARY STRUCTURE INFORMATION AND DIRESIDUE FREQUENCIES

F. Ferrè (a), P. Clote (b,*)

(a) Department of Biology, Boston College, Chestnut Hill, MA (USA) 02467
(b) Departments of Biology and Computer Science (courtesy appointment), Boston College, Chestnut Hill, MA (USA) 02467
(*) To whom correspondence should be addressed. E-mail: clote@bc.edu; Phone: +1 617-552-1332; Fax: +1 617-552-2011

Running Title: Disulfide connectivity prediction

Keywords: Disulfide connectivity, Machine learning, Neural networks, Position specific scoring matrices

Abstract

Motivation: We describe a stand-alone algorithm to predict disulfide bond partners in a protein given only the amino acid sequence. The algorithm uses a novel neural network architecture (the diresidue neural network), whose input consists of symmetric flanking regions of N- and C-terminus half-cystines augmented with residue secondary structure (helix, coil, sheet) as well as evolutionary information. The approach is motivated by the observation of a bias in the secondary structure preferences of free cysteines and half-cystines, and by promising preliminary results we obtained using diresidue position specific scoring matrices.

Results: As calibrated by ROC curves from 4-fold cross-validation, our conditioning on secondary structure allows our novel diresidue neural network to perform as well as, and in some cases better than, the current state-of-the-art method. A slight drop in performance is seen when secondary structure is predicted rather than derived from three-dimensional protein structures.

Availability: http://clavius.bc.edu/~clotelab/DiANNA

Contacts: ferref@bc.edu, clote@bc.edu

Supplementary Information: Supplementary Tables and Figures, and the complete list of PDB codes of monomers used, can be found at http://clavius.bc.edu/~clotelab/

Introduction

Disulfide bonds (covalently bonded sulfur atoms from nonadjacent cysteine residues) play a critical role in protein structure, as noted by C. Anfinsen (Anfinsen
1973), whose pioneering work first provided evidence that the native state of a protein is that conformation which minimizes its free energy. [Footnote 1: Anfinsen reduced disulfide-bonded cysteines of bovine pancreatic ribonuclease by adding the denaturant urea; upon removal of the denaturant, the original disulfide bonds were reestablished, thus suggesting that the native state is in a global free energy minimum.] There are relatively good algorithms, whose predictive accuracy is somewhat better than that of algorithms for secondary structure prediction, to determine whether a cysteine is in a reduced state (sulfur occurring in reactive sulfhydryl group SH), or oxidized state (sulfur covalently bonded). [Footnote 2: Disulfide-bonded cysteines are known as half-cystines; oxidized cysteines may either be half-cystines or instead covalently bonded to a metallic ligand; reduced cysteines are also called free cysteines.]

Early methods of Fiser et al. (Fiser et al. 1992) and of Muskal et al. (Muskal et al. 1990) used sequence information alone to predict cysteine oxidation state. The former used a statistical method and achieved 71% accuracy, while the latter used a neural network and claimed 81% accuracy on a small test database. In 1999 P. Fariselli and R. Casadio (Fariselli et al. 1999) designed a jury of neural networks, trained on flanking sequence information in neighborhoods of oxidized versus reduced cysteines. Their algorithm obtained an accuracy of 71%; when additionally trained on flanking evolutionary information (i.e. multiple sequence alignments of homologous proteins) the accuracy improved to 81%. In 2000 A. Fiser and I. Simon (Fiser and Simon 2000) used multiple sequence alignments in a different manner to obtain an accuracy of 82%. In 2002 M.H. Mucchielli-Giorgi et al. (Mucchielli-Giorgi et al. 2002) used a combination of perceptrons, trained on sets of proteins homogeneous in terms of their amino acid content, to obtain an accuracy of 84%. In the same year Martelli et al. (Martelli et al. 2002) used a hybrid hidden Markov model and neural network system, reaching 88% accuracy. [Footnote 3: Direct comparison of these accuracies is somewhat misleading, as testing was performed on different data sets.]

Despite success in predicting cysteine oxidation state, there have been fewer attempts to solve the problem of determining whether two half-cystines form a disulfide bond with each other, i.e. the disulfide bond partner prediction problem. In 2001 and 2002 papers, P. Fariselli and R. Casadio (Fariselli and Casadio 2001; Fariselli et al. 2002) designed a neural network to score the likelihood that given half-cystine pairs form a disulfide bond, using flanking sequence information, and subsequently applied the Edmonds-Gabow maximum weight matching algorithm to pair the most likely partners. A recent paper by Vullo and Frasconi (Vullo and Frasconi 2004) describes the successful application of recursive neural networks (Frasconi et al. 1998) to score undirected graphs that represent cysteine connectivity; evolutionary information is included to improve the prediction (in the form of vectors that label the graph vertices, i.e. the protein cysteines). This method is currently the state of the art.

In this paper, we describe our method to determine which cysteines are involved in disulfide bonds and, for those that are, to list the disulfide bond partners. Starting from the previous observation that there is a bias in the secondary structure preference of free cysteines and half-cystines (Petersen et al. 1999), we develop a novel neural network to learn amino acid environments constituting the window contents of a symmetric region centered at partner half-cystines; the network architecture is designed with the aim of including in the training the signal that arises when using diresidue position specific scoring matrices (PSSM). Our final stand-alone program, called DIANNA (for DiAminoacid Neural Network Application), uses a diresidue neural network on the symmetric flanking residues about both cysteines of a potential disulfide bond, along with the PSIPRED-determined (Jones 1999) secondary structure of the residues and PSIBLAST-determined (Altschul et al. 1997) evolutionary information. Finally, following Fariselli and Casadio (Fariselli and Casadio 2001), the algorithm applies Ed Rothberg’s implementation (Rothberg) of the Edmonds-Gabow maximum weight matching algorithm (Gabow 1973; Lovasz and Plummer 1985) to assign disulfide bond partners, given the weighted complete graph whose nodes are half-cystines and whose weights are values output from the neural network. This novel approach, as calibrated using receiver operating characteristic (ROC) curves (Gribskov and Robinson 1996), shows a marked improvement over previous work of Fariselli and Casadio (Fariselli and Casadio 2001; Fariselli et al. 2002), and is comparable to or better than the method of Vullo and Frasconi (Vullo and Frasconi 2004).

System and Methods

Data Preparation

To test our method on the same dataset used in previous papers describing methods for disulfide connectivity prediction (Fariselli and Casadio 2001; Fariselli et al. 2002; Vullo and Frasconi 2004), we selected 445 monomers from the SWISS-PROT database (Boeckmann et al. 2003) (release 39) having at least two and at most five intra-chain disulfide bonds, and for which structural data are available in the PDB (Berman et al. 2002) database. If one SWISS-PROT entry is associated with more than one PDB chain, we selected the one with the best resolution. Monomers are divided into four groups of approximately the same size, trying to minimize the inter-set redundancy, as described in Fariselli et al. (Fariselli et al. 2002), in order to perform four-fold cross-validation experiments. For the sake of comparison, we repeated the same experiments described in the Vullo and Frasconi paper (Vullo and Frasconi 2004) on subsets of the whole dataset following the SCOP classification (Andreeva et al. 2004). The majority (309 out of 446, 69%) of protein chains in the
Vullo and Frasconi dataset (Vullo and Frasconi 2004) were unclassified in release 1.63 of SCOP, the version used by Vullo and Frasconi. In contrast, we used the latest release of SCOP (1.65) in classifying our dataset, prepared following the procedure used by Vullo and Frasconi. The corresponding SCOP classification for our data is as follows: α (7.3%), β (25.1%), α+β (19%), α/β (7.7%), small proteins (29.3%), peptides (3.5%) and unclassified proteins (8.2%). The number of proteins in each subset is shown in Table. The list of PDB monomers used in Martelli et al. (Martelli et al. 2002) has been employed for the training of an oxidation state prediction tool implemented in the DIANNA web server. Finally, we used the PDBSELECT25 dataset (Hobohm and Sander 1994) to test our method on an unbiased list of proteins that includes monomers that may or may not have disulfide bonds.

Secondary structure and cysteine oxidation state annotations are derived from the Dictionary of Secondary Structure of Protein (DSSP) of Kabsch and Sander (Kabsch and Sander 1983). We clustered the seven different DSSP secondary structure notations into three classes: (i) helix (H): alpha helix, 3/10 helix and pi helix; (ii) coil (C): hydrogen bonded turn, bend and coil; (iii) sheet (E): beta-bridge and extended strand. We checked the validity of disulfide bond annotation by computing the distance between sulfur atoms of annotated half-cystine partners in the dataset (average distance 2.04 Angstroms, standard deviation 0.105; maximum distance 2.93 Angstroms).

Machine Learning

We applied two machine learning methods, neural networks (Stuttgart Neural Network Simulator, SNNS, URL: http://www-ra.informatik.uni-tuebingen.de/SNNS/) and position specific scoring matrices, to calibrate the effect of considering secondary structure in disulfide bond prediction. Throughout the following sections, P and N represent a training file of positive and negative examples, respectively, of sequence length w, e.g. two
11-mers corresponding to the symmetric cysteine-centered size $w = 2n + 1 = 11$ window contents of cysteines (i.e. the $n$ residues N-terminal and C-terminal to each cysteine, where $n = 5$). Let $P$ denote the pairs of window contents for all the half-cystines involved in an intra-chain bond, and let $N$ denote the corresponding set of possible pairs of cysteines (intra-chain half-cystines, inter-chain half-cystines and free cysteines) that are not intra-chain disulfide bonds. True positive predictions occur when a half-cystine pair that is a known bond is correctly predicted as such, while false negative predictions occur when known disulfide bonds are predicted not to be such. Accordingly, a true negative is a cysteine pair correctly predicted not to form a disulfide bond, while a false positive is a pair of cysteines that is not a bond though predicted as such. Letting $TP$, $TN$, $FP$, $FN$ denote respectively the number of true positives, true negatives, false positives and false negatives, recall the definitions of accuracy, or $Q_2$:

$$Q_2 = \frac{TP + TN}{TP + TN + FP + FN} = \frac{TP + TN}{|P| + |N|},$$

the sensitivity, TP rate (tpr), or $Q_c$:

$$Q_c = \frac{TP}{TP + FN} = \frac{TP}{|P|},$$

the specificity, or $Q_{nc}$:

$$Q_{nc} = \frac{TN}{TN + FP} = \frac{TN}{|N|},$$

and Matthews correlation coefficient:

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}.$$

The false positive rate, FP rate (fpr), is 1 minus specificity. Finally, $Q_p$ is the fraction of correctly assigned connectivity patterns, i.e. the fraction of chains for which all the predictions are correct ($FP = 0$ and $FN = 0$). To quantify the sensitivity/specificity trade-off of various methods, we considered 4-fold cross-validation.

Generalized weight matrices

Weight matrices [Footnote 4: In the literature, weight matrices are also known as position specific scoring matrices (PSSM), or alternatively as profiles. In this paper, we sometimes denote collectively mono- and diresidue weight matrices, explained later in the text, by PSSM.] can be constructed using the relative frequencies of the 20 amino acids in different positions of a set of training instances, and then used to score a test instance. Define the background set $B = P \cup N$. For set $X \in \{P, N, B\}$ and amino acid $a$, let $num(X, i, a)$ denote the number of occurrences of $a$ in $X$ in position $i$, and let $f(X, i, a)$ denote the relative (monoresidue) frequency of $a$ in $X$ at position $i$, i.e.

$$f(X, i, a) = \frac{num(X, i, a)}{|X|}.$$

To avoid numerators equal to 0, we add pseudocounts (Karplus 1995), i.e. for fixed $c > 0$ (in this paper we used $c = 0.2$):

$$f(X, i, a) = \frac{num(X, i, a) + c}{|X| + 20c}.$$

For amino acid sequence $s_1 \ldots s_n$ and $1 \le i \le n$, define the positional log odds score:

$$\sigma(i, a) = \log \frac{f(P, i, a)}{f(B, i, a)}.$$

Once the positional log odds scores are computed for a training set of sequences, the score of a test sequence can be obtained as the sum of log odds scores

$$\sigma(s) = \sum_{1 \le i \le n} \sigma(i, s_i).$$

We denote this monoresidue weight matrix method by WM1. As reported in Zhang and Marr (Zhang and Marr 1993) for the first-order Markov case and in Clote (Clote 2003) for the general case, the notion of monoresidue scoring matrix can be extended immediately to the situation of not necessarily consecutive $k$-tuple frequencies, for any fixed $k$. Under the assumption of positional independence, which often does not hold for biological sequence data, WM1 is provably the maximum likelihood estimator (Clote and Backofen 2000). Nevertheless, in some cases experimental evidence suggests that protein sequences can be more adequately modeled using diresidue (i.e. with $k = 2$), rather than monoresidue, weight matrices (Bulyk et al. 2002). For this reason, in this paper we used diresidue weight matrices, defined as follows. For set $X \in \{P, N, B\}$ of length $n$ sequences, for positions $1 \le i < j \le n$ and amino acids $a$, $b$, let $num(X, i, j, a, b)$ denote the number of occurrences of amino acid $a$ in position $i$ when amino acid $b$ is found in position $j$, and let $f(X, i, j, a, b)$ denote the relative (diresidue) frequency; hence we define:

$$f(X, i, j, a, b) = \frac{num(X, i, j, a, b) + c}{|X| + 20^2 c}.$$

Define the diresidue positional log odds

$$\sigma(i, j, a, b) = \log \frac{f(P, i, j, a, b)}{f(B, i, j, a, b)},$$

and the diresidue score

$$\sigma(s) = \sum_{1 \le i < j \le n} \sigma(i, j, s_i, s_j).$$

We denote this diresidue weight matrix method by WM2.

Neural networks

We used the Stuttgart Neural Network Simulator (SNNS; URL: http://www-ra.informatik.uni-tuebingen.de/SNNS/), and wrote Python programs as well as some batchman (SNNS) code to train and test a variety of neural net architectures implemented in SNNS. All neural networks are layered, feed-forward, fully connected nets (with the exception of the diresidue layer, described below), and are trained by momentum back-propagation with a maximum of 10,000 cycles. To avoid overfitting we checked the error progression on a validation set (one-fifth of the monomers from the training set of each cross-validation step, chosen randomly).

In the unary representation of the neural network input encoding, given two size $w$ windows centered respectively at the N- and C-terminus half-cystines, each window residue is represented by a 20-bit vector; each of the 20 bits is set to zero, except the one that is assigned to the given amino acid type. To include evolutionary information in the input encoding, we ran PSIBLAST (Altschul et al. 1997) (three iterations, against the non-redundant SWISS-PROT + TrEMBL database of sequences) on the input sequence to produce a profile, i.e. frequencies $f(i, a)$, for each of the 20 amino acids $a$ and each position $1 \le i \le 2w$, obtained from the multiple sequence alignment of homologous proteins. The resulting input to our neural net consisted of $2w \times 20$ frequencies. To include secondary structure information, we extracted DSSP secondary structure annotations of each of the $2w$ residues, and we added to the evolutionary encoding vectors $2w \times 3$ additional binary inputs, which encode in unary the secondary structure (H, C, E) of each of the $2w$ residues. [Footnote 5: For example, H, C and E are each encoded by a distinct three-bit vector with exactly one bit set.] [...] fully
connected to the single output unit.

Weighted Match

Disulfide connectivity can be described as a graph whose nodes are the half-cystines and whose edges join pairs of nodes. Connectivity prediction, i.e. prediction of disulfide bond partners, is obtained by applying the Edmonds-Gabow maximum weight matching algorithm (Gabow 1973; Lovasz and Plummer 1985), as implemented in wmatch by Ed Rothberg (Rothberg), to the graph whose nodes are the putative half-cystines and whose edges, which join pairs of nodes, are weighted by either the PSSM (WM1 or WM2) positional log odds scores or the output of the neural net in the disulfide bond prediction module. PSSM scores, which may be negative (negative values are not accepted by wmatch), are scaled to the interval (0, ..., 100). A different version of the connectivity prediction module that uses a greedy approach (i.e. the bonds are chosen starting from the one with the highest predicted score) was tested, but leads to poorer results (not shown).

Implementation and Discussion

The amino acid environment of half-cystines shows peculiar sequence characteristics that allow discrimination between half-cystines and free cysteines using machine learning (Fiser et al. 1992; Fariselli et al. 1999; Fiser and Simon 2000). Moreover, the secondary structure conformation assumed by the cysteines and their neighboring residues is remarkably different when comparing disulfide-bonded versus free cysteines (Petersen et al. 1999). Table and 3a show the secondary structure conformation frequencies detected in the analyzed dataset and computed using DSSP annotations. These values differ to some extent from those of Petersen et al. (Petersen et al. 1999), but this could be due to a different (and in our case larger) dataset. Considering the secondary structure of pairs of half-cystines known to form a disulfide bond, some combinations are preferred, presumably indicating a sort of structural complementarity (Table 3b). Therefore, we explored the possibility
of using sequence and secondary structure information to infer the protein disulfide connectivity, using different machine learning approaches.

Figure (left panel) and Table show the performance of a feed-forward neural network trained with momentum back-propagation (NN2), described in the Methods section, using different input encodings. The inclusion of secondary structure information leads to a marked improvement, as does the inclusion of the 20 frequencies obtained in a multiple sequence alignment for each given residue of the window. This step is known as incorporating evolutionary information and, since the seminal work of Rost and Sander (Rost and Sander 1993), has been shown to substantially increase the accuracy of neural networks for protein secondary structure prediction; similar improvements obtained using evolutionary information in predicting cysteine oxidation state and disulfide connectivity have been demonstrated (Fariselli et al. 1999; Vullo and Frasconi 2004). The use of secondary structure information leads to a clear improvement with both the unary and the evolutionary encoding of the input windows. This is even more evident when looking at receiver operating characteristic (ROC) curves, comparing the sensitivity/specificity trade-off for different inputs used to train NN2 (Figure 1).

Position specific scoring matrices (PSSM) can be constructed using the relative frequencies of the 20 amino acids in different positions of the cysteine-centered symmetric window. These are then used to score an input putative disulfide bond. A monoresidue scoring matrix (WM1) was computed and tested in a four-fold cross-validation experiment, with poor results (Figure right panel). Monoresidue weight matrices are provably the maximum likelihood estimator (Clote and Backofen 2000), but they imply positional independence (i.e. the frequency of an amino acid a in position i of a sequence is independent of the frequency of b in position j), which may not
hold for biological sequences. The notion of monoresidue scoring matrix can be easily extended to k-tuple frequencies, for any fixed k (Zhang and Marr 1993; Clote 2003). Besides, there is experimental evidence from M. Bulyk et al. (Bulyk et al. 2002) indicating that protein-nucleotide binding in zinc fingers is more adequately modeled using diresidue (i.e. with k = 2), rather than monoresidue, weight matrices. For these reasons, we applied diresidue weight matrices (WM2) to the same dataset used for WM1. ROC curves comparing the sensitivity/specificity trade-off for generalized weight matrix methods (see Figure 1) reveal a diresidue frequency signal that is distinctive in the recognition of disulfide bonds versus non-disulfide bonded cysteine pairs. At a 10% false positive rate, WM2 yields a 2.2-fold improvement in the true positive rate over WM1 trained with the same dataset. Any weight matrix method, including WM2, can be turned into predictive software by allowing the user to stipulate a tolerated false positive rate fpr, and then having the program determine the corresponding true positive rate tpr and threshold weight t for ROC point (fpr, tpr), by table lookup in the precomputed ROC table.

Nevertheless, since WM2 is not statistically well-founded, we turned to neural nets, trying to include the diresidue signal that arises from WM2. Two novel diresidue neural network architectures were developed: the first (dNN1) has only one hidden layer (called the diresidue hidden layer) fully connected to the output unit; the second (dNN2) is provided with a second hidden layer containing five units that collect all the output from the diresidue hidden layer. More details can be found in the Methods section. The input includes evolutionary and secondary structure information; therefore each residue in the length w window is encoded by 23 units (including the three units necessary for the secondary structure information encoding). The performance of the different neural networks on the whole dataset,
and on subsets of monomers having the same number of bonds B (2, 3, 4 or 5), or belonging to the same SCOP fold, is shown in Tables and 6; ROC curves for the putative disulfide bond scores are shown in Figures and. For all values of B, the Matthews correlation coefficient and Qp obtained using dNN2 are better than those of all the other attempted approaches. In general, the performance drops when analyzing subsets with a greater number of bonds B (with the remarkable exception of B = 4) and subsets containing a small number of examples (α, α/β, peptides and unclassified proteins). Tables and Figures showing the performance on sub-subsets of proteins belonging to the same SCOP fold and having the same number of bonds B are provided as Supplementary Material at URL http://clavius.bc.edu/~clotelab (Tables 8, 9, 10, 11, 12, 13, 14, and Figures 4, 5, 6, 7, 8, 9, 10); the same general trend can be seen (the performance drops when B grows), even though in some cases the performance may be artificially good or artificially poor, since the number of monomers belonging to certain sub-subsets may be too low. In some cases (α+β having bonds, peptides with or bonds) the sub-subsets contain too few monomers to run the 4-fold cross-validation. The number of proteins in each subset is shown in Table.

All the previously described approaches take a putative disulfide bond as input and produce a score as output. To obtain a connectivity prediction we followed the idea of Fariselli and Casadio (Fariselli and Casadio 2001; Fariselli et al. 2002), applying the Edmonds-Gabow weight matching algorithm for connectivity prediction, using Rothberg’s wmatch (Rothberg). The PSSM or neural network scores are used to weight the edges of a graph whose vertices are all the cysteines of a protein; these scores are appropriately scaled. Tables and show the results of the application of wmatch; the fraction of proteins for which the prediction is correct is given by the Qp index. Results show that, in general, dNN2 performs better than
the other methods. Nevertheless, it should be noted that WM2 (a much simpler and faster approach, though not statistically well founded) often leads to results comparable to or even better than those of dNN2 (in the cases of B = 2, and 4, and for the subsets α+β and small proteins).

Comparison of Methods

A direct comparison to the Fariselli and Casadio disulfide connectivity prediction method (Fariselli and Casadio 2001; Fariselli et al. 2002) and to the recent work of Vullo and Frasconi (Vullo and Frasconi 2004) is immediate, since we used the same protein monomer subsets and evaluated the performance using the same indicators. The fraction of protein chains for which the whole prediction is correct (the Qp value) for our method is comparable to that obtained for all the monomer subsets in Vullo and Frasconi (Vullo and Frasconi 2004), and in some cases (for B = 4 and B = (2, ..., 5)) our method outperforms the Vullo and Frasconi technique.

We were able to compute ROC curves for the Fariselli and Casadio method. The Fariselli-Casadio program CONPRED, when run on input consisting of symmetric flanking regions of an even number $2m$ of half-cystines, outputs two parts: (i) neural network scores for each of the $m(2m-1)$ possible disulfide bond pairs, and (ii) an assignment of disulfide bonding pattern, obtained by applying Edmonds-Gabow maximum weight matching to (i). In Fariselli and Casadio (Fariselli and Casadio 2001), disulfide bond partner assignment from CONPRED was shown to compare very favorably with random assignment. To obtain ROC curves for the Fariselli-Casadio method, we repeatedly ran CONPRED on symmetric half-cystine flanking regions extracted from files in the dataset, parsed the output of (i), and used DSSP disulfide bond annotation to tabulate true positive and true negative rates, necessary for ROC sensitivity/specificity values. In Figure (bottom right panel), we superpose calculated ROC curves for Fariselli-Casadio CONPRED with window size $w = 5$ (FC5), $w = 7$ (FC7), $w = 11$
(FC11) and $w = 15$ (FC15) about each half-cystine, with the dNN2 ROC curve.

DIANNA web server

We developed a web server, called DIANNA for DiAminoacid Neural Network Application, that provides three services: cysteine oxidation state, disulfide bond and disulfide connectivity prediction. The oxidation state prediction is an implementation of the procedure of Fariselli et al. (Fariselli et al. 1999) described above. Evolutionary information is collected by aligning the user-submitted sequence to SWISS-PROT sequences using PSIBLAST. Our disulfide bond connectivity prediction software is a web server that implements the diresidue neural network previously described (dNN2), fully trained with symmetric flanking regions of N- and C-terminus half-cystines augmented with residue secondary structure and evolutionary information. Given two size $w$ windows centered at the N- and C-terminus putative half-cystines, we run PSIPRED on the whole sequence to predict the secondary structure (helix, coil, sheet) of each of the $2w$ residues; subsequently we use the PSIBLAST run performed by PSIPRED to produce the profile of each position $1 \le i \le 2w$. (We tested the accuracy of PSIPRED prediction with respect to the DSSP annotations on the entire dataset, obtaining an accuracy of around 76%, similar to that claimed by D. Jones (Jones 1999).) The connectivity prediction is obtained by wmatch as previously described.

To test how the predicted secondary structure (instead of that extracted from DSSP annotations) affects the performance of the neural network, we trained and tested on the same dataset as before with 4-fold cross-validation, using PSIPRED predictions and evolutionary information. Results show an increase in the false positive rate; nevertheless, the performance is not dramatically affected (Table last column, Table 6). Standard neural networks (NN2) seem to be more affected by the reduced accuracy of the secondary structure annotation, while diresidue NN performance is still rather good.

To
test DIANNA on a dataset containing proteins that may or may not have disulfide bonds, in order to have an unbiased evaluation of the performance, we used the PDBSELECT25 (Hobohm and Sander 1994) list of monomers. Out of 1769 monomers in PDBSELECT25, 1011 have at least two cysteines; of these 1011, 392 contain at least one disulfide bond. The perfect prediction fraction, Qp, for DIANNA is rather low (0.227); the errors mainly arise from proteins that have only free cysteines (Table upper panel). To improve the DIANNA performance, we included an initial filtering step: only those monomers that have at least two predicted half-cystines (by means of the oxidation state prediction tool described above) are submitted to DIANNA. The filtering step eliminates 580 (out of 1011) monomers, of which 65 do in fact contain a disulfide bond. Testing our algorithm on the remaining 431 chains, we obtained a very good global Qp value (0.627); nevertheless, the performance of DIANNA restricted to only those proteins containing disulfide bonds (B ranging up to 12) is lower (0.298). We also tried an approach similar to that used in the previous papers of Fariselli and Casadio (Fariselli and Casadio 2001; Fariselli et al. 2002), filtering individual cysteines that are predicted to be free cysteines, but the improvement in performance is less pronounced. Using DSSP-derived, instead of PSIPRED-predicted, secondary structure annotations, the performance is remarkably improved (global Qp 0.674, disulfide-bond containing proteins Qp 0.418). All the results are shown in Table.

Conclusion

In this paper, we show how to use secondary structure annotations to improve disulfide bond partner prediction in a protein given only its amino acid sequence. Even if the secondary structure is predicted by a machine learning approach instead of derived from the known three-dimensional structure, the performance of the prediction is still remarkable. This allows the reliable application of this procedure to
proteins for which the structure is still unknown. Nevertheless, it should be noted that the software performance is strongly dependent on the knowledge of protein sequences related to the monomer analyzed (both for deriving evolutionary information and for a good quality secondary structure prediction); however, this flaw is inherent in all disulfide connectivity prediction methods available up to now. A novel diresidue neural network architecture is used to simulate the strong performance of diresidue position specific matrices trained on the same dataset. These neural networks can be applied to all those problems in which a diresidue architecture seems to represent a better model of the analyzed system (as in protein cleavage sites (Clote 2003)). Additionally, these diresidue neural networks require a smaller training time compared to fully connected networks with the same number of units. In some cases, diresidue PSSM performs as well as the neural network approach. Built in a modular fashion, our method combines two signals (diresidue and secondary structure) that are very different in nature, and is comparable to, and in some cases better than, the current state-of-the-art methods. Tested on a real case (a list of monomers that may or may not have disulfide bonds, and using predicted, rather than real, secondary structure annotations), the performance is still good, with a perfect prediction ratio higher than 60%.

Acknowledgments

We would like to especially thank P. Fariselli for furnishing us with the executable code for CONPRED, M. Sison for some programming in our initial exploratory approach, S. Alvarez for a reference, M. Muskavitch for a discussion about disulfide bonding in proteins Delta and Notch, and the referees for useful comments and suggestions.

Figures and Tables Captions

Table Protein monomer dataset. Number of monomers having a fixed number of disulfide bonds B (minimum 2, maximum 5), and belonging to different SCOP folds.

Table Cysteine secondary
structure frequencies. Frequencies of secondary structures, computed on the whole dataset using DSSP annotations.

Table Secondary structure of half-cystine neighbors and disulfide bond partners. Left panel: relative frequency of secondary structures flanking the N-terminal and C-terminal half-cystine, respectively, of a disulfide bond in a symmetric window of size 11 (i.e. the five residues upstream and the five downstream of each half-cystine), as tabulated from DSSP. A secondary structure is assigned to the C-terminal and N-terminal flanking residues of each half-cystine using a majority decision (i.e. counting which secondary structure is prevalent in each group of five residues). Note the remarkable asymmetry of the Coil-Sheet and Sheet-Coil frequencies. Right panel: frequencies of secondary structures of disulfide bond-forming half-cystines. The expected frequency for a pair of secondary structures, one for each half-cystine, assuming independence of the two half-cystines, is computed as the product of the corresponding frequencies from Table I. The detected frequency is computed using DSSP annotations. For example, in 9.1% of the cases in the dataset the C-terminal half-cystine is in sheet conformation, while the N-terminal one is in coil conformation (this is the value reported in the ‘%detected’ column). Since the frequency of coil half-cystines in the dataset is 0.46, and the frequency of sheet half-cystines is 0.33 (as reported in Table I), one can expect the frequency of bonds in which one half-cystine is coil and the other is sheet to be 0.33 × 0.46 = 0.152 (15.2%). This is the ‘expected’ frequency, which differs from the detected frequency; moreover, the frequency of bonds in which the N-terminal half-cystine is in sheet conformation and the C-terminal one is in coil is remarkably different (19.3%). In the last column, a secondary structure is assigned to the 11-residue window centered on the half-cystine, using a majority decision.

Table Neural network performance using different input encodings
Performance of the NN2 neural net in a 4-fold cross-validation experiment using different input encodings: unary; unary with secondary structure information; evolutionary; evolutionary with secondary structure information. Performance is evaluated by means of accuracy (Acc), sensitivity (Sen), specificity (Spe) and Matthews correlation coefficient (Mcc).

Table Disulfide connectivity prediction performance of different algorithms. Comparison of the performance of NN2, dNN1, dNN2, WM1 and WM2. All these prediction methods are applied to subsets of the dataset having a number of bonds B = 2, 3, 4 or 5, and to the whole dataset (B = (2,...,5)). Performance is evaluated by means of accuracy (Acc), sensitivity (Sen), specificity (Spe), Matthews correlation coefficient (Mcc) and the fraction of proteins for which the prediction is perfect, Qp (only when wmatch is applied). The first six columns show the performance of the different neural networks in correctly distinguishing true from false disulfide bonds when the secondary structure annotations are extracted using DSSP (columns 1-3) or predicted using PSIPRED (columns 4-6). The last eight columns (columns 7-14) show the performance of the connectivity prediction obtained by applying wmatch to the scores of the neural networks or of the mono- and diresidue weighted matrices, suitably scaled (see text for details); the results of columns 12-14 are obtained using PSIPRED-predicted secondary structure information. The last panel (bottom right panel) shows the performance of the different algorithms when the secondary structure is predicted from the protein sequence, instead of extracted from the three-dimensional structure.

Table Disulfide connectivity prediction performance of different algorithms. Comparison of the performance of
WM1, WM2, NN2, dNN1 and dNN2. All these prediction methods are applied to subsets of the dataset following the SCOP structural classification. The secondary structure information used for training the neural networks is predicted by means of PSIPRED. The first three columns show the performance of the different neural networks in correctly distinguishing true from false disulfide bonds. Columns 4-6 show the performance of the connectivity prediction obtained by applying wmatch to the scores of the neural networks. The last two columns show the performance of mono- and diresidue weighted matrices, respectively, for disulfide connectivity prediction, applying wmatch to the suitably scaled PSSM score (see text for details). Performance is evaluated by means of accuracy (Acc), sensitivity (Sen), specificity (Spe), Matthews correlation coefficient (Mcc) and the fraction of proteins for which the prediction is perfect, Qp (only when wmatch is applied).

Table Disulfide connectivity prediction performance of the diresidue neural network dNN2 on PDBSELECT25. The performance of the fully trained dNN2 has been tested on a non-redundant dataset composed of monomers that may or may not contain disulfide bonds. The first column shows the results obtained on the whole dataset, while the remaining columns refer to subsets having the same number B of disulfide bonds. The upper panel shows the results on the entire PDBSELECT25, while the second panel from the top shows the performance after a preliminary filtering step that deletes from the dataset all monomers that have fewer than two predicted half-cystines, using a neural network trained for oxidation state prediction. The third panel from the top shows the performance of dNN2 when a neural network trained for oxidation state prediction is used to filter out individual cysteines that are predicted to be free; therefore only pairs of predicted half-cystines are submitted to dNN2. The bottom panel shows the performance of dNN2 using the same
filtering procedure described for the second panel, when the secondary structure is predicted by means of PSIPRED.

Figure Disulfide connectivity prediction ROC curves for different input encodings and for PSSM. Left panel: ROC curves for the NN2 neural network performance in a 4-fold cross-validation experiment. Different inputs were tested: unary representation, unary representation together with secondary structure information, evolutionary information, evolutionary and secondary structure information. Right panel: ROC curves for WM1 and WM2 position specific scoring matrices on the whole dataset in a 4-fold cross-validation experiment.

Figure Disulfide connectivity prediction ROC curves. ROC curves for NN2, dNN1, dNN2 neural networks and WM1 and WM2 position specific scoring matrices on the whole dataset and on subsets having the same number of disulfide bonds B. The last panel (bottom right) shows ROC curves for dNN2 compared to ROC curves for CONPRED with window size w = 5 (FC5), w = 7 (FC7), w = 11 (FC11) and w = 15 (FC15).

Figure Disulfide connectivity prediction ROC curves for different protein folds. ROC curves for NN2, dNN1, dNN2 neural networks and WM1 and WM2 position specific scoring matrices on subsets following the SCOP structural classification. Secondary structure information used in the neural network training is predicted from the sequence using PSIPRED, as described in the text.
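
As an illustration of the connectivity step and of the Qp measure referred to in the table captions, the following is a minimal Python sketch and not the authors' implementation. It assumes that every candidate cysteine pair already carries a score (for instance a neural network output). The function names `perfect_matchings`, `best_connectivity` and `qp`, the cysteine positions and the scores are all hypothetical. In place of wmatch, the sketch simply enumerates every perfect matching, which is practical only for small B because the number of connectivity patterns for B bonds is (2B-1)!!.

```python
def perfect_matchings(items):
    """Yield every way of pairing up an even-length list of items."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail


def best_connectivity(cysteines, score):
    """Return the pairing with the maximum total score.

    Brute-force stand-in for a maximum weight matching routine;
    `score` maps frozenset({i, j}) to the score of the pair (i, j).
    """
    best = max(
        perfect_matchings(sorted(cysteines)),
        key=lambda matching: sum(score[frozenset(p)] for p in matching),
    )
    return {frozenset(p) for p in best}


def qp(predicted, observed):
    """Fraction of proteins whose predicted pattern equals the observed one."""
    hits = sum(1 for p, o in zip(predicted, observed) if p == o)
    return hits / len(observed)


# Hypothetical protein with four half-cystines and made-up pair scores.
example_scores = {
    frozenset({5, 12}): 0.2, frozenset({5, 30}): 0.9, frozenset({5, 41}): 0.1,
    frozenset({12, 30}): 0.3, frozenset({12, 41}): 0.8, frozenset({30, 41}): 0.4,
}
pattern = best_connectivity([5, 12, 30, 41], example_scores)
observed = {frozenset({5, 30}), frozenset({12, 41})}
print(qp([pattern], [observed]))
```

For B = 5, the largest case in the training dataset, there are 945 candidate patterns, so exhaustive enumeration is still cheap; for the larger B values found in PDBSELECT25 a polynomial-time maximum weight matching algorithm is the appropriate choice.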