14 lazar: Lazy Structure–Activity Relationships for Toxicity Prediction

CHRISTOPH HELMA, Institute for Computer Science, Universität Freiburg, Georges-Köhler-Allee, Freiburg, Germany

© 2005 by Taylor & Francis Group, LLC

1. INTRODUCTION

The development of the lazar system was more or less a byproduct of editing this book. The intention was to demonstrate how some of the concepts from previous chapters can be used as building blocks for a simple predictive toxicology system that can serve as a reference for further developments. Most of the machine learning and data mining techniques have been described in an earlier chapter (1). Some of the basic ideas (utilizing linear fragments for predictions) were developed already in the 1980s by Klopman and are implemented in the MC4PC system [a description of MC4PC is given elsewhere in this book (2)]. lazar stands for Lazy Structure–Activity Relationships. Please note that lazar is not a reimplementation of an existing system: it uses its own distinct algorithms to generate descriptors and to make predictions.

Figure 1 Example learning set for lazar. Compound structures are written in SMILES notation (17); "1" indicates active compounds, "0" indicates inactive ones.

In this chapter, I first review some observations from the application of machine learning and data mining techniques to non-congeneric data sets (compounds that do not share a common core structure) that led to the basic lazar concept. After a description of the lazar algorithm, I present an example application for the prediction of Salmonella mutagenicity and draw some conclusions for further improvements from an analysis of misclassified compounds.

Figure 2 Example test set. Compound structures are written in SMILES notation (17); toxicological activities are unknown.

2.
PROBLEM DEFINITION

First, a brief review of the problem setting: We have a data set with chemical structures and measured toxicological activities.[a] This is the training set (Fig. 1). We have a second data set with untested structures (the test set, Fig. 2) and intend to predict the toxicological activities of these untested compounds. More specifically, we want to infer the toxicological activity of a test compound from its chemical structure (or some of its properties), assuming that the biological activity of a compound is determined by its chemical structure (or some of its properties).[b] For this purpose, we need:

A description of the chemical features that are responsible for toxicological activity, and
A model that makes predictions based on these features.

Most efforts in predictive toxicology have been devoted to the second task: the identification of predictive models based on a given set of chemical features (Table 1). Especially with toxicological effects, we frequently face the problem that the biochemical mechanisms are diverse and unknown. It is therefore hard to guess (and select) the chemical features that are relevant for a particular effect, and we risk making one of the following mistakes:

Omission of important features that determine activity/inactivity by selecting too few features.
Deterioration of the performance (accuracy and speed) of statistical and machine learning systems[c] by selecting too many features.

[a] Classifications (e.g., active/inactive) or numerical values (e.g., LC50s); for the sake of simplicity, we cover only the classification case in this chapter.
[b] This is, of course, an oversimplification, because toxic action is caused by an interaction between the compound and the biological target. The chapter by Marchal et al. (3) contains more information about this topic.
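The two data sets of the problem setting are straightforward to represent in code. A minimal sketch in Python; the SMILES strings and labels below are hypothetical stand-ins of my own choosing, since the actual compounds of Figs. 1 and 2 are not reproduced in the text:

```python
# Hypothetical stand-in for a learning set (Fig. 1): SMILES strings
# mapped to activity classifications (1 = active, 0 = inactive).
training_set = {
    "Nc1ccccc1": 1,              # an aromatic amine (aniline)
    "CCO": 0,                    # ethanol
    "O=[N+]([O-])c1ccccc1": 1,   # a nitroaromatic (nitrobenzene)
    "CCCC": 0,                   # butane
}

# Hypothetical stand-in for a test set (Fig. 2): untested structures
# whose toxicological activities we want to predict.
test_set = ["c1ccc2ccccc2c1", "CCCl"]
```

The point of the representation is only that every training structure carries a known classification, while test structures carry none.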
Recently, several schemes for feature selection have been devised in the machine learning community (see the Special Issue on Variable and Feature Selection, Journal of Machine Learning Research 3, 2003).[d] In my experience, recursive feature elimination (RFE) (8) performs very well on a variety of problems, but at the risk of overfitting the data (9). In theory, it is possible to counteract this by performing another cross-validation step for feature selection, but this frequently leads to practical problems: long computation times and fragmented learning data sets due to nested cross-validation loops.[e]

[c] Most (if not all) of these systems are sensitive to large numbers of irrelevant features. Based on my experience, this is also true for systems that claim to be insensitive in this respect (e.g., support vector machines).
[d] In predictive toxicology, Klopman (7) has used an automated feature selection process in the CASE and MULTICASE systems since the 1980s.
[e] Recursive feature elimination needs an internal (10-fold) cross-validation step to evaluate the results (8). If we use another 10-fold cross-validation for feature selection, we have to generate 10 × 10 models. Moreover, the data for a single model shrink to 0.9 × 0.9 = 81% of the original size, which can be problematic for small data sets.

Table 1 Summary of Chemical Properties that Can Be Used for the Prediction of Toxicological Effects

Presence of substructures
Physico-chemical properties of the whole molecule (e.g., log P, HOMO, LUMO, ...)
Graph theoretic descriptors
Biological properties (e.g., from screening assays)
Spectra (IR, NMR, MS, ...)

Furthermore, there are many other problems that are frequently encountered in predictive toxicology:

Models that are hard to interpret in terms of toxicological mechanisms (10).
Models that are too general (improper consideration of compounds with a specific mode of action) (11).
Limitations of the models (e.g., substructures that are not in the training set) are often unclear.
No indication whether a structure falls beyond the scope of the model.
Sensitivity toward skewed distributions of actives/inactives in the learning set.
Handling of missing values in the training set.
Ambiguous parameter settings (e.g., cutoff frequencies in MolFea (12); kernel type, gamma, epsilon, tolerance, and other parameters in support vector machines).

My intention was to address these problems with the development of lazar.

3. THE BASIC lazar CONCEPT

In contrast to the majority of the approaches described in this book, lazar is a lazy learning scheme. Lazy learning means that we do not generate a global model from the complete training set. Instead, we create small models on demand: one individual model for each test structure. This has the key advantage that the prediction models are more specific (13), because we can consider the properties of the test structure during model generation. If we want to predict, e.g., the mutagenicity of nitroaromatic compounds, information from chlorinated aliphatic structures will be of little value. As we will see below, the selection of relevant features and relevant examples from the training set is done automatically by the system and does not require any input of chemical concepts. On the practical side, model creation and prediction are integrated. We therefore need no computation time for model generation (and validation), but predictions may require more computation time than predictions from a global model (this is, of course, very implementation dependent).

lazar presently uses linear fragments to describe chemical structures (but it would be easy to use more complex fragments, e.g., subgraphs, or to include other features such as molecular properties, e.g., log P, HOMO, LUMO, etc.). Linear fragments are defined as chains of heavy atoms with connecting bonds.
Branches or cycles are not considered explicitly in linear fragments.[f] 1,1,1-Trichloroethane (SMILES: CC(Cl)(Cl)Cl), e.g., contains the linear fragments {C, Cl, C–C, C–Cl, C–C–Cl, Cl–C–Cl}. Formally, linear fragments are valid SMARTS expressions (see http://www.daylight.com for further references), which can be handled by various computational chemistry libraries (e.g., OpenBabel, http://openbabel.sourceforge.net).

For the sake of clarity, we will discuss separately the three steps that are needed to obtain a prediction for an untested structure:

1. Identification of the linear fragments in the test structure.
2. Identification of the fragments that are relevant for the prediction.
3. Prediction of the activity of the test structure.

In the following sections, we give a more detailed description of the algorithms for each of these steps.

4. DETAILED DESCRIPTION

4.1. Fragment Generation

In lazar, we presently use a much simplified variant of the MolFea (14) algorithm to determine the fragments of a given structure. The procedure is as follows. As a starting point, we use all elements from the periodic table (including aromatic atoms). These are the candidate fragments for the first level. First, we examine which of them occur in (or match) the test structure and eliminate those that do not match. Then we check which of the remaining fragments occur in the training structures. If a candidate fragment does not occur in the training structures, we remove it from the current level and store it in the set of unknown fragments, because we cannot determine whether it contributes to activity or inactivity.

[f] But they are frequently considered implicitly: chains with more than six aromatic carbons, for example, indicate condensed aromatic rings.
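The fragment language can be made concrete with a short sketch that enumerates the linear fragments (simple paths) of a small molecular graph. This is an illustration only, not lazar's implementation (lazar works on SMARTS and delegates matching to OpenBabel); the graph encoding and function name are mine:

```python
def linear_fragments(atoms, bonds):
    """Enumerate all linear fragments of a molecular graph.
    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    A chain and its reversal are the same fragment, so each chain is
    stored in a canonical form (the lexicographically smaller direction)."""
    adj = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        adj[i].add(j)
        adj[j].add(i)

    frags = set()

    def grow(path):
        seq = tuple(atoms[i] for i in path)
        frags.add(min(seq, tuple(reversed(seq))))   # canonical direction
        for nxt in adj[path[-1]]:
            if nxt not in path:                     # simple paths only
                grow(path + [nxt])

    for start in range(len(atoms)):
        grow([start])
    return frags

# 1,1,1-trichloroethane, CH3-CCl3: atom 0 = methyl C, atom 1 = central C
atoms = ["C", "C", "Cl", "Cl", "Cl"]
bonds = [(0, 1), (1, 2), (1, 3), (1, 4)]
frags = linear_fragments(atoms, bonds)
# yields the six fragments from the text:
# {C, Cl, C-C, C-Cl, C-C-Cl, Cl-C-Cl}
```

Exhaustive path enumeration like this is exponential in the worst case; the level-wise procedure with pruning described in Sec. 4.1 and 4.2 exists precisely to avoid it.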
From the remaining candidates (i.e., those that occur in the test structure and in the training structures), we generate the candidates for the next level (i.e., candidates with an additional bond and atom); this step is called refinement and will be described below. The whole procedure is repeated until the candidate pool has been depleted [i.e., until all fragments of the test structure that also occur in the training set have been identified (Fig. 3)].

Figure 3 Procedure for fragment generation.

4.2. Fragment Refinement

A naive way to refine level n to level n + 1 would be to attach a bond and an atom to each fragment of level n. This is, of course, very inefficient, because we would generate (and match) far too many fragments. We want to avoid generating unnecessary fragments because fragment matching is the time-critical step of our algorithm.[g]

Fortunately, we can often determine that a fragment cannot match before generating it. For this purpose, we can use an important property of the language of molecular fragments, the generality relationship: We define that a fragment g is more general than a fragment s (notation: g ≥ s) if g is a substructure of s (e.g., C–O is more general than N–C–C–O). This has the consequence that g matches whenever s does. Linear fragments are symmetric, which means that two syntactically different fragments are equivalent when one is the reversal of the other (e.g., C–C–O and O–C–C denote the same fragment). Therefore, we can conclude that g is more general than s (g ≥ s) if and only if g is a subsequence of s or g is a subsequence of the reversal of s (e.g., C–O ≥ C–C–O and O–C ≥ C–C–O).

[g] lazar delegates fragment matching to the OpenBabel libraries (http://openbabel.sourceforge.net).
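The generality test just defined takes only a few lines of code. A sketch, with fragments represented as tuples of atom symbols and function names of my own choosing:

```python
def is_subchain(g, s):
    """True if the atom chain g occurs contiguously inside the atom chain s."""
    return any(s[i:i + len(g)] == g for i in range(len(s) - len(g) + 1))

def more_general(g, s):
    """g >= s: g matches every structure that s matches. Because linear
    fragments are symmetric, the reversal of s is tested as well."""
    return is_subchain(g, s) or is_subchain(g, tuple(reversed(s)))

# The examples from the text: C-O >= C-C-O and O-C >= C-C-O
```

Note that "subsequence" here means a contiguous subchain: since fragments are chains of bonded atoms, a more general fragment must appear as an unbroken stretch of the longer chain.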
We can use this generality relationship to refine fragments efficiently: As we know, all subfragments of a new candidate fragment have to match; therefore, we only need to combine the fragments of the present level (i.e., those that match the test and training compounds) to reach the next level. The two fragments C–C and C–O, e.g., can be refined to C–C–C, C–C–O, and O–C–O. This reduces the number of candidates considerably in comparison to naively attaching a bond and an atom.

Another method to reduce the search space is to utilize the known matches of the current level to determine the potential matches of a new fragment. If we know, e.g., that C–C occurs in compounds {A, B, C} and C–O occurs in {B, D}, we can conclude that:

C–C–C can occur in compounds {A, B, C}.
C–C–O can occur in compounds {B}.
O–C–O can occur in compounds {B, D}.

Knowing the potential matches of a new fragment allows us (i) to remove candidates that have no potential matches, and (ii) to perform the time-consuming matching step only on the potential matches and not on the complete data set.

As predictions are usually performed for a set of test compounds, fragments (especially those from the lower levels, like C, C–C, etc.) frequently have to be reevaluated on the training set. Storing the matches of the fragments in a database (which can be saved permanently) avoids this reevaluation.

4.3. Identification of Relevant Fragments

After the successful identification of the fragments of the test structure, we have the following information:

The set of linear fragments that occur in the test structure and in the training structures.
The set of the most general fragments that occur in the test structure but not in the training structures (i.e., the shortest unknown fragments).
For each fragment, the set of training structures that the fragment matches.
The activity classifications for the training structures.
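Before turning to fragment evaluation, the refinement step of Sec. 4.2 can be sketched as an overlap join with occurrence-set intersection. This is a simplified illustration (fragments as tuples of atom symbols; `matches` maps each canonical fragment to the set of compound ids it occurs in; all names are mine):

```python
def refine(level, matches):
    """Generate level-(n+1) candidates by combining level-n fragments,
    and intersect their occurrence sets to get each candidate's
    potential matches, as described in the text."""
    def canon(seq):
        return min(seq, tuple(reversed(seq)))

    # each fragment may extend in either direction, so consider reversals too
    variants = set()
    for f in level:
        variants.add(f)
        variants.add(tuple(reversed(f)))

    candidates = {}
    for a in variants:
        for b in variants:
            if a[1:] == b[:-1]:          # chains overlap: join a + last atom of b
                cand = canon(a + b[-1:])
                # a compound can contain the candidate only if it
                # contains both subfragments
                pot = matches[canon(a)] & matches[canon(b)]
                if pot:                  # prune candidates without potential matches
                    candidates[cand] = pot
    return candidates

level = {("C", "C"), ("C", "O")}
matches = {("C", "C"): {"A", "B", "C"}, ("C", "O"): {"B", "D"}}
cands = refine(level, matches)
# (besides C-C-C, C-C-O, and O-C-O, the join also proposes C-O-C,
# whose only length-2 subfragments are likewise C-O and O-C)
```

The real algorithm additionally verifies each surviving candidate by matching it against its potential matches, since the intersection is only an upper bound on the true occurrence set.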
Let us now consider each fragment f as a hypothesis that indicates whether a compound containing this fragment is active or inactive. First, we have to evaluate whether a fragment indicates activity or inactivity; then we have to distinguish between more and less predictive hypotheses (i.e., fragments) and select the most predictive ones.

Fortunately, it is rather straightforward to evaluate the fragments on the training set, because we know which compounds each fragment matches as well as the activity classifications of these compounds. If a fragment matches only active or only inactive compounds, the decision is obvious: We call a fragment activating if it matches only active compounds and inactivating if it matches only inactive compounds. In real life, however, most fragments will occur in active as well as inactive compounds. It is tempting to call a fragment activating if it matches more active than inactive compounds. This is certainly adequate for training sets that contain an equal number of active and inactive compounds. But let us assume that only 10% (e.g., 10 of 100) of the training structures are active and we have identified a fragment that matches five active and five inactive compounds (Fig. 4). This fragment matches 5 of 10 (50%) active compounds but only 5 of 90 (5.6%) inactive compounds, and it is therefore justified to call [...]

(QSTAR). Predictive Toxicology. New York: Marcel Dekker, 2005.
3. Marchal K, De Smet F, Engelen K, De Moor B. Computational biology and toxicogenomics. Predictive Toxicology. New York: Marcel Dekker, 2005.
4. Frasconi P. Neural networks and kernel machines for vector and structured data. Predictive Toxicology. New York: Marcel Dekker, 2005.
5. Eriksson L, Johansson E, Lundstedt T. Regression- and projection-based [...]

A web interface to lazar can be found at http://www.predictive-toxicology.org/lazar/
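The per-class rate comparison of the skewed-training-set example above can be sketched directly. A simplified illustration only (lazar's full scoring of fragment predictivity is not reproduced in this excerpt); all names are mine:

```python
def classify_fragment(matched, activities):
    """Label a fragment by comparing per-class match *rates* rather than
    raw counts, so that a skewed learning set (e.g., 10% actives) is
    handled correctly. `matched` is the set of compound ids the fragment
    occurs in; `activities` maps compound id -> 1 (active) / 0 (inactive)."""
    actives = {c for c, a in activities.items() if a == 1}
    inactives = {c for c, a in activities.items() if a == 0}
    rate_active = len(matched & actives) / len(actives)
    rate_inactive = len(matched & inactives) / len(inactives)
    label = "activating" if rate_active > rate_inactive else "inactivating"
    return label, rate_active, rate_inactive

# The text's example: 10 of 100 training structures are active; the
# fragment matches five actives and five inactives.
activities = {f"act{i}": 1 for i in range(10)}
activities.update({f"inact{i}": 0 for i in range(90)})
matched = {f"act{i}" for i in range(5)} | {f"inact{i}" for i in range(5)}
label, rate_active, rate_inactive = classify_fragment(matched, activities)
# rate_active = 0.5, rate_inactive = 5/90 (about 5.6%): activating
```

Although the fragment matches equal raw numbers of actives and inactives, the rate comparison (50% vs. 5.6%) labels it activating, which is the point of the example.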
ACKNOWLEDGMENTS

I would like to thank Stefan Kramer and Luc De Raedt for the development of MolFea, and Victor Horal-Gurfinkel and Peter Reutemann for programming assistance.

REFERENCES

1. Kramer S, Helma C. Machine learning and data mining. In: Helma C, ed. Predictive Toxicology. New York: Marcel Dekker, 2005.
2. Klopman G, [...]

[...] leave-one-out cross-validation are summarized in Table 2; a comparison with results from the literature on similar (non-congeneric) data sets can be found in Table 3. Please keep in mind that neither the data sets nor the validation methods are identical in Tables 2 and 3.

6. LEARNING FROM MISTAKES

For a model developer, the inspection of misclassified instances is the most instructive task. A case-by-case [...]

5. Eriksson L, Johansson E, Lundstedt T. Regression- and projection-based approaches in predictive toxicology. Predictive Toxicology. New York: Marcel Dekker, 2004.
6. Guba W. Description and representation of chemicals. Predictive Toxicology. New York: Marcel Dekker, 2004.
7. Klopman G. Artificial intelligence approach to structure–activity studies: [...]
[...] of the Predictive Toxicology Challenge. Bioinformatics 2003; 19:1179–1182.
11. Benigni R, Giuliani A. Putting the Predictive Toxicology Challenge into perspective: reflections on the results. Bioinformatics 2003; 19:1194–1200.
12. Kramer S, Frank E, Helma C. Fragment generation and support vector machines for inducing SARs. SAR QSAR Environ Res 2002; 13:509–523.
13. Mitchell TM. Machine Learning. The McGraw-Hill [...]

Table 3 Accuracy of Salmonella Mutagenicity SAR Models for Non-congeneric Compounds from the Public Literature

Author                       Citation   Method        Accuracy (%)
Perotta et al.               (18)       —             74[a]
Klopman and Rosenkranz       (19)       CASE          72
Klopman and Rosenkranz       (19)       MULTICASE     80
Klopman and Rosenkranz       (19)       CASE/GI       47
Helma, Kramer and De Raedt   (20)       MolFea + ML   78

[a] Leave-one-out cross-validation.

[...] therefore relevant for mutagenicity. To obtain [...]
[...] e.g., epoxides are the only compounds where (explicit) ring systems[k] are needed, it might be more efficient to use a three-ring or epoxide flag as an additional feature.

7. CONCLUSION

In this chapter, I have presented a brief analysis of the major problems of current predictive toxicology systems. Based on this analysis, I have applied some of the concepts from this book to develop a simple system called [...]

[...] J Chem Inf Comput Sci 2004; 44:1402–1411.

INTRODUCTORY LITERATURE

1. Witten IH, Frank E. Data Mining. San Francisco, CA: Morgan Kaufmann Publishers, 2000.
2. Mitchell TM. Machine Learning. The McGraw-Hill Companies, Inc., 1997.

GLOSSARY

Aromatic: A planar ring system with delocalized electrons, e.g., a benzene ring.
Congeneric: Chemical structures with a common substructure.
Cross-validation: A method for estimating [...]

[...] to an overestimation of the effect of the biphenyl structure; therefore, we use only the most predictive fragment (i.e., the fragment with the highest |p_f|) from a set of redundant fragments and define redundancy as follows.

Figure 5 A set of redundant fragments for a particular compound. The most predictive, non-redundant fragments are marked by an arrow.

[...] that are not in the training set). It is likely that at least some of them are activating or deactivating and [...]

Table 2 Accuracy of lazar Predictions for Salmonella Mutagenicity Determined by Leave-One-Out Cross-Validation

True positive rate    tp/np             0.66
True negative rate    tn/nn             0.84
Total accuracy        (tp + tn)/nall    0.78

tp: number of true positive predictions; np: total number [...]

[...] The CPDB contains only an overall classification for Salmonella mutagenicity, but no information about metabolic activation or tester strains.
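Table 2's summary statistics reduce to simple ratios of raw prediction counts. A minimal sketch; the chapter reports only the rates, not the underlying counts, so the counts below are hypothetical values chosen to be consistent with all three table entries:

```python
def summary_stats(tp, num_pos, tn, num_neg):
    """The three statistics of Table 2 from raw prediction counts:
    true positive rate tp/np, true negative rate tn/nn, and total
    accuracy (tp + tn)/nall, where nall = np + nn."""
    return tp / num_pos, tn / num_neg, (tp + tn) / (num_pos + num_neg)

# Hypothetical counts consistent with Table 2 (0.66, 0.84, 0.78);
# the actual np and nn are not given in this excerpt.
tpr, tnr, acc = summary_stats(tp=66, num_pos=100, tn=168, num_neg=200)
```

Note that the total accuracy is not the mean of the two rates: it weights each class by its size, which is why consistent counts require about twice as many inactives as actives here.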
The results from leave-one-out cross-validation [...]

[...] instructive task. A case-by-case inspection of each compound can reveal systematic errors in the prediction system. It avoids, on the other hand, desperate efforts to increase the predictive accuracy [...]