Poster Session IV: Computational Aspects of Molecular Diversity and Combinatorial Libraries

MOLDIVS - A NEW PROGRAM FOR MOLECULAR SIMILARITY AND DIVERSITY CALCULATIONS

Vadim A. Gerasimenko, Sergei V. Trepalin, Oleg A. Raevsky

Institute of Physiologically Active Compounds of Russian Academy of Sciences, 142432, Chernogolovka, Moscow region, Russia

At present, molecular similarity and diversity calculations are important tools for lead generation and optimization, especially in the fields of high-throughput screening and combinatorial chemistry. There are many approaches to this problem, which differ in the descriptors used, the similarity and diversity measures, and the compound selection algorithms. Descriptors of different types (topological indexes, physical property descriptors, 2D and 3D structural keys) can be used for this purpose. It was shown [1] that structural 2D descriptors perform better than others in their ability to distinguish between biologically active and inactive compounds. The discriminating power of these descriptors depends on the degree to which they encode information relevant to ligand-receptor binding (hydrophobic, dispersion, electrostatic, steric and hydrogen bonding interactions) [2].
One of the approaches to this problem is to produce composite descriptors from structural and global physical property descriptors by means of principal component analysis and multidimensional scaling [3]. However, this method is unable to handle sets of compounds of realistic size (10,000 to 1,000,000 compounds) because of the computational limitations of the multidimensional scaling required for transforming discrete structural descriptors into continuous variables.

In this report we propose an alternative approach based on a combination of structural fragments and local physicochemical property descriptors. On the basis of this approach the new program MOLDIVS (MOLecular DIVersity and Similarity) for Microsoft Windows 95/NT was created. MOLDIVS has a friendly graphical user interface and performs the whole range of similarity and diversity calculation tasks on large sets of compounds.

The program can use structural descriptors of two types: plain structural fragments and combined structural-physicochemical fragments. Both are defined as atom-centered concentric environments [4]. A fragment consists of a central atom and the neighboring atoms connected to it within a predefined sphere size (the number of bonds between the central and edge atoms). For each fragment the complete connection table is stored. For each atom in a fragment, information on the atom and bond type, charge, valency, cycle type and size is coded into fixed-length variables, which are subsequently used to define a pseudo-random hash value for this fragment. The complete set of fragments with the selected sphere size is created automatically and forms a fragment library. For each fragment in the library the frequency of occurrence is calculated. An unlimited number of fragments and spheres of any size can be used.

In structural-physicochemical fragments each atom is characterized by three parameters: partial atomic charge [5], polarizability [6] and H-bond donor/acceptor factor [7], instead of atomic
element type as in plain structural fragments. Adjustable ranges of these properties are used as atomic types. There are many examples in which similarity based on plain structural fragments differs substantially from similarity based on structural-physicochemical fragments. In many cases structural-physicochemical fragments produced better separation of biologically active compounds because they explicitly encode information relevant to ligand-receptor interactions.

The program estimates the similarity of each molecule in the database to all other molecules and sorts them by their similarity to the initial molecule. Different molecular similarity coefficients can be used: Tanimoto, Euclidean and cosine. Different measures of the diversity of the whole database A are available in this program:

$$\mathrm{DIVERSITY}(A) = \frac{\sum_{I,J} \mathrm{DISSIMILARITY}(I, J)}{N^2} \tag{1}$$

$$\mathrm{DIVERSITY}(A) = \frac{\sum_{I} \min_{J} \mathrm{DISSIMILARITY}(I, J)}{N} \tag{2}$$

The program allows rapid estimation of the diversity of the whole database according to equation (1) using the cosine similarity coefficient on the basis of the centroid algorithm [8]. Different compound selection algorithms for diverse subset formation (stepwise elimination and cluster sampling [9], a number of maximum dissimilarity selection algorithms [10]) are used in this program. The program was successfully tested on databases with biological and medicinal activity data and in real drug design work. The comparison of results obtained by MOLDIVS and other commercially available programs is carried out.

1. R.D. Brown, Y.C. Martin, Use of structure-activity data to compare structure-based clustering methods and descriptors for use in compound selection, J. Chem. Inf. Comput. Sci. 36:572 (1996).
2. R.D. Brown, Y.C. Martin, The informational content of 2D and 3D structural descriptors relevant to ligand-receptor binding, J. Chem. Inf. Comput. Sci. 37 (1997).
3. E.J. Martin, J.M. Blaney, M.A. Siani, D.C. Spellmeyer, A.K. Wong, W.H. Moos, Measuring diversity: experimental design of combinatorial libraries for
drug discovery, J. Med. Chem. 38:1431 (1995).
4. S.V. Trepalin, A.V. Yarkov, L.M. Dolmatova, N.S. Zefirov, WinDat: an NMR database compilation tool, user interface and spectrum libraries for personal computers, J. Chem. Inf. Comput. Sci. 35:405 (1995).
5. D.B. Kireev, V.I. Fetisov, N.S. Zefirov, Approximate molecular electrostatic potential computations: applications to quantitative structure-activity relationships, J. Mol. Struct. (Theochem) 304:143 (1994).
6. K.J. Miller, Additivity methods in molecular polarizability, J. Am. Chem. Soc. 112:8533 (1990).
7. O.A. Raevsky, Hydrogen bond strength estimation by means of the HYBOT program package, in: Computer-Assisted Lead Finding and Optimization, H. van de Waterbeemd, B. Testa, G. Folkers, eds., Wiley-VCH, Basel (1997).
8. J.D. Holliday, S.S. Ranade, P. Willett, A fast algorithm for selecting sets of dissimilar molecules from large chemical databases, Quant. Struct.-Act. Relat. 14:501 (1995).
9. R. Taylor, Simulation analysis of experimental design strategies for screening random compounds as potential new drugs and agrochemicals, J. Chem. Inf. Comput. Sci. 35:59 (1995).
10. D. Chapman, The measurement of molecular diversity: a three-dimensional approach, J. Comput.-Aided Mol. Design 10:501 (1996).

EASY DOES IT: REDUCING COMPLEXITY IN LIGAND-PROTEIN DOCKING

Djamal Bouzida, Daniel K. Gehlhaar, and Paul A. Rejto

Agouron Pharmaceuticals, Inc., 3301 North Torrey Pines Court, La Jolla, CA 92037

INTRODUCTION

Computational methods in structure-based drug design are used in a number of applications, including prediction of the structure of ligand-protein complexes (also known as the docking problem), estimation of ligand-protein binding affinity, and de novo design. Depending on the level of detail incorporated into the model, as well as the number of times the calculation is performed, the computational demands of these studies range from a few seconds on a small workstation to months on dedicated supercomputers. In the pharmaceutical industry, the criterion for a useful computational
technique is simple: it must provide information of sufficient quality to impact the discovery or optimization of lead compounds, and it must do so in a timely manner. In our work, we have found that a critical decision that governs the successful application of computational methods in structure-based drug design is the choice of the model used to represent the problem.

Traditionally, computational chemists have developed highly detailed force fields to describe atomic interactions. While in principle such efforts provide accurate representations of chemical systems, two significant practical problems arise in their application. First, it is difficult to obtain high-quality parameters for the force field rapidly, and second, it is not possible to adequately sample the enormous conformational space of ligand-protein systems. As a consequence, the computational requirements of detailed atomic-level simulations are not compatible with the large number of molecules that are now available in commercial databases or in typical combinatorial libraries. As such, there is a need for methods that efficiently reduce the size and complexity of the problem while still providing useful information.

Previously, we have developed a method for the prediction of bound ligand-protein complexes based on a simplistic, short-ranged potential [1].
Because structure prediction is a much easier problem than free energy calculation, this potential, while not sufficiently accurate to estimate ligand-protein binding affinities, correctly predicts the bound conformation for a variety of ligand-protein complexes. Unlike detailed force fields, this potential yields a smoother energy landscape and is more compatible with high-throughput computational database screening. More recently, we have extended this method to two types of docking simulations where some features of the resulting ligand-protein complex are known a priori [2]. In complexes where a covalent bond is formed between a nucleophilic cysteine or serine residue and an electrophilic ligand atom, constraints are placed on the location of the ligand. Likewise, when combinatorial libraries are developed that include a substructure whose bound conformation is known [3], the size of the available conformational space is reduced.

RESULTS

To validate the ability to identify leads from a database, ligands containing ketones and esters were screened against the reactive enzyme porcine pancreatic elastase. A known inhibitor was ranked in the top one percent of all compounds that satisfied the screening criteria, which included a generalization of the LUDI scoring function to estimate binding affinity [4]. In addition, the correct stereoisomer and binding mode for this compound were selected. Compounds unrelated to this inhibitor were also found, some of which form favorable hydrogen bonds in the active site, though none have been tested for activity.

A virtual library was generated by direct alkylation of the pteridine ring in methotrexate with 7,677 compounds, each of which had a molecular weight less than 250 and an amine group with at least one hydrogen and one neighbor in an aromatic group. From this virtual library, only 516 compounds (7% of the original library) satisfied the screening criteria. As anticipated, methotrexate was predicted to have the best binding energy, but a number
of other compounds were generated that also form good hydrogen bond interactions within the active site.

CONCLUSIONS

In order to successfully apply computational tools in structure-based drug design, it is important to use all information about the system of interest prior to beginning the computational study. We have developed a simplified representation of ligand-protein interactions that provides a balance between accuracy and speed, and software that takes advantage of knowledge about the structure of certain types of ligand-protein complexes in order to reduce computational complexity. We have shown that the predicted structures of known inhibitors of dihydrofolate reductase and porcine pancreatic elastase correspond to the experimentally observed structures with increased probability compared to an unrestricted simulation. When combined with a simple estimate of binding affinity, these inhibitors were ranked favorably, thus enriching the hit rate of the targeted library.

REFERENCES

1. D.K. Gehlhaar, G.M. Verkhivker, P.A. Rejto, C.J. Sherman, D.B. Fogel, L.J. Fogel, and S.T. Freer, Molecular recognition of the inhibitor AG-1343 by HIV-1 protease: conformationally flexible docking by evolutionary programming, Chem. Biol. 2:317 (1995).
2. D.K. Gehlhaar, D. Bouzida, and P.A. Rejto, Reduced dimensionality in ligand-protein structure prediction: covalent inhibitors of serine proteases and design of site-directed combinatorial libraries, in: ACS Symposium Series on Rational Drug Design (in press).
3. E.K. Kick, D.C. Roe, A.G. Skillman, G. Liu, T.J. Ewing, Y. Sun, I.D. Kuntz, and J.A. Ellman, Structure-based design and combinatorial chemistry yield low nanomolar inhibitors of cathepsin D, Chem. Biol. 4:297 (1997).
4. P.W. Rose, Scoring methods in ligand design, in: 2nd UCSF Course in Computer-Aided Molecular Design, San Francisco (1997).

STUDY OF THE MOLECULAR SIMILARITY AMONG THREE HIV REVERSE TRANSCRIPTASE INHIBITORS IN ORDER TO VALIDATE GAGS, A GENETIC ALGORITHM FOR GRAPH SIMILARITY SEARCH

Nathalie
MEURICE, Gerald M. MAGGIORA, Daniel P. VERCAUTEREN

F.R.I.A. PhD Fellowship
Laboratoire de Physico-Chimie Informatique, Facultés Universitaires Notre-Dame de la Paix, Rue de Bruxelles, 61, B-5000 NAMUR (Belgium)
Computer-Aided Drug Discovery, Pharmacia & Upjohn, 301 Henrietta Street, Kalamazoo, MI 49007-4940

INTRODUCTION

The design of potent therapeutic agents relies on knowledge of the interaction mode between the ligands and their receptor sites. Very often, however, the direct study of these interactions is difficult because the three-dimensional (3D) structure of the receptor sites is not completely known. Consequently, an indirect approach consists in comparing the ligands of interest on the basis of their physico-chemical properties, so as to deduce the nature of their common molecular sites involved in the binding to the macromolecule and/or responsible for their particular activity. In this general framework, we have focused our efforts on the elaboration and improvement of an original genetic algorithm method, named GAGS (Meurice et al., 1997), for computing the similarity between ligands of biopharmacological interest, especially those whose receptor crystal structures have not been determined. In order to validate our GAGS approach, we study a system of ligands whose receptor structures are available and compare the molecular alignments to the available experimental (XRAY) and theoretical (MIMIC) models.

STUDIED SYSTEM

We have selected a set of three HIV Reverse Transcriptase Inhibitors (HIV RTIs), namely Nevirapine, α-APA, and TIBO. The crystal structures of these ligands bound to HIV Reverse Transcriptase (RT) are available. An "experimental" model is thus obtained by superimposing the crystal structures of HIV RT with the bound inhibitor, and then removing the protein.

TOPOLOGICAL ANALYSIS OF 3D SMOOTHED ELECTRON DENSITY MAPS

Ab initio 3D electron density maps (EDMs) of the three selected ligands have been obtained using RHF/SCF/6-31G* calculations. Removal of the details contained in these maps using wavelet multiresolution analysis (Daubechies filter, 20 coefficients, levels of smoothing) produces smoothed 3D grids. The information contained in such 3D smoothed EDMs can be further simplified into molecular graphs using topological analysis, which locates the critical points of the electron density function, i.e., peaks and passes in our study. Point values of the density and distances between critical points are set as the diagonal and off-diagonal elements of property matrices, respectively. As a result, the molecular graphs of TIBO, Nevirapine, and α-APA contain 6, 12, and 14 critical points, respectively. The three molecular graphs are then compared using our GAGS method.

GAGS, GA FOR GRAPH SIMILARITY SEARCH

GAs are optimization techniques inspired by the natural concepts of Darwinian evolution. Within GAGS, the chromosomes are defined as 2D integer arrays where the first dimension is the number of ligands to be compared, and the second is determined by the number of fitting points. In this way, each chromosome is a hypothesis of a subgraph match between the initial set of molecular graphs. The evaluation function measures an RMS value between the property matrices built from the evaluated subgraph match, and is thus minimized during the GA generations. An automated decoding process has been implemented in order to create molecular overlays corresponding to each of the solution chromosomes.

The GAGS comparison leads to overlays in agreement with the "experimental" model. When optimized in the MIMIC steric and electrostatic fields, the GAGS overlay converges towards the superimposition of the RTIs that was obtained by the MIMIC model, with a similarity of 61% (Mestres et al., 1997).

Figure 1 - (a) GAGS, (b) experimental, and (c) MIMIC superimposition models.

CONCLUSIONS AND OUTLOOK

This work allows us to assess the GAGS approach as a valuable tool for the discovery of good ligand
alignments. As a consequence, GAGS might also be used as a powerful search engine, and the resulting overall molecular superimpositions might then be quickly optimized in MIMIC fields in order to produce precise overlays and yield quantitative similarity indices.

REFERENCES

Mestres, J., Rohrer, D.C., Maggiora, G.M., 1997, MIMIC: a molecular-field matching program. Exploiting applicability of molecular similarity approaches, J. Comput. Chem., 18:934.
Meurice, N., Leherte, L., Vercauteren, D.P., Bourguignon, J.-J., Wermuth, C.G., 1997, Development of a genetic algorithm method especially designed for the elucidation of the benzodiazepine receptor pharmacophore, in: Computer-Assisted Lead Finding and Optimization, H. van de Waterbeemd, B. Testa, G. Folkers, eds., Verlag Helvetica Chimica Acta (VHCA), Basel, 497.

A DECISION TREE LEARNING APPROACH FOR THE CLASSIFICATION AND ANALYSIS OF HIGH-THROUGHPUT SCREENING DATA

Michael F.M. Engels, Hans De Winter, Jan P. Tollenaere

Dept. Theoretical Medicinal Chemistry, Janssen Research Foundation, Beerse, Belgium

High-throughput screening (HTS) of large libraries of compounds is applied by drug companies to pick up active molecules. Besides this “fishing” part, HTS can also play an important role as a source for structure-activity analysis, which relates physicochemical or structural features to biological activity. The latter aspect, although of great importance, has hardly been explored over the last years due to i) the size and complexity of the data sets, ii) the lack of, or rather ignorance about, suitable mathematical tools, iii) the quality of the biological data (biological activity is frequently described by just two categories, active and not active, or three, active, medium active and not active) and iv) the lack of appropriate molecule descriptors.

In this study, decision tree learning (DTL) and rule induction [1] (RI) have been used for the classification and structure-activity analysis of an in-house set of data on 27000 compounds
tested for dopamine D2 binding activity (biological activity indicated as either “active” or “not active”; around 1300 compounds were found to be active). Both DTL and RI are machine learning methods for finding complex interactions between many variables which try to explain a distinct set of responses. Both methods are able to deal with large numbers of data (observations) and show good performance in analysing noisy data [2], as HTS data may be. Compounds have been represented either by a set of topological keys created by a customized Daylight program (Daylight Chemical Information Systems Inc.), which calculates all possible substructures consisting of up to four atoms, or by a set of 3D keys as implemented in the ChemX software.

DTL identifies the descriptor with the strongest association with the biological activity. Using that descriptor, the set of data is split into two sets, one in which all compounds possess that feature, and one in which all compounds lack that feature. This procedure is repeated with all resultant subsets as long as the degree of association is above a given threshold criterion. Since sets of data are always split into two subsets, the procedure results in a binary decision tree. RI uses the resultant decision tree for extracting rules. Rendering the decision tree as a set of ‘rules packages’ creates another way to show the complex interactions in such a set of data.

Application of the DTL method to the dopamine D2 data using topological keys resulted in a binary decision tree with some 200 terminal nodes. This is a drastic reduction of information in comparison to the original data. The figure shows two compounds, ocaperidone and risperidone, which are members of one class of compounds predicted to be “active” (89 terminal nodes have been classified as “active”). For such a large set of data, it is possible that groups of compounds will act in different ways. It can be discussed whether compounds in different terminal nodes bind in alternative modes or at different sites.

Figure: Ocaperidone and risperidone, two molecules classified as being active by the DTL method. Substructural fragments which are mandatory for activity are highlighted.

When making use of the ChemX 3D keys, the application of the RI method to the dopamine D2 set of data resulted in 84 sets of rules, 37 of which predict “active” compounds. The table shows a typical example of such a set of rules. In comparison to the topological keys, the interpretation of the set of rules using ChemX 3D keys is burdensome; however, in combination with systematic conformational searching and molecular modeling, it turned out to provide valuable filters for pharmacophore modeling.

Table: Set of rules characterising ocaperidone, risperidone and 30 other molecules as being “active”.

IF
- molecule contains fluorine
- contains one positive charge center
- distance between two H-bond acceptor pharmacophore points is between 5.5 and Å
- distance between an aromatic group and one H-bond acceptor function is between 12 and 13 Å
- not possible to bring one aromatic group and a positive charge centre in a distance range between 5.5 and 6.0 Å together
THEN class active

1. Quinlan, J.R., “C4.5: Programs for Machine Learning”, Morgan Kaufmann Publishers, 1993.
2. Mitchell, T.M., “Machine Learning”, McGraw-Hill, 1997.
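
The DTL splitting procedure and the extraction of rules from the resulting binary tree, as described in the abstract above, can be illustrated with a minimal Python sketch. It is not the authors' implementation. Information gain is assumed here as the measure of association between a key and the activity class, since the abstract does not name the measure used. The key names ("F", "N+", "arom"), the six toy compounds and the threshold value are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, key):
    """Entropy reduction obtained by splitting on presence/absence of one key.

    Assumption: information gain stands in for the unspecified
    'degree of association' between a descriptor and the activity.
    """
    with_key = [y for x, y in zip(rows, labels) if key in x]
    without = [y for x, y in zip(rows, labels) if key not in x]
    if not with_key or not without:
        return 0.0  # the key does not separate anything
    n = len(labels)
    remainder = (len(with_key) / n) * entropy(with_key) \
        + (len(without) / n) * entropy(without)
    return entropy(labels) - remainder


def build_tree(rows, labels, keys, threshold=0.1):
    """Split recursively while the best key's gain exceeds the threshold."""
    majority = Counter(labels).most_common(1)[0][0]
    if not keys:
        return majority
    best = max(sorted(keys), key=lambda k: information_gain(rows, labels, k))
    if information_gain(rows, labels, best) <= threshold:
        return majority  # terminal node labelled with the majority class
    has = [(x, y) for x, y in zip(rows, labels) if best in x]
    lacks = [(x, y) for x, y in zip(rows, labels) if best not in x]
    rest = keys - {best}
    return {
        "key": best,
        "present": build_tree([x for x, _ in has], [y for _, y in has], rest, threshold),
        "absent": build_tree([x for x, _ in lacks], [y for _, y in lacks], rest, threshold),
    }


def extract_rules(tree, path=()):
    """Emit one (conditions, class) rule per terminal node of the tree."""
    if not isinstance(tree, dict):
        return [(path, tree)]
    return (extract_rules(tree["present"], path + ((tree["key"], True),))
            + extract_rules(tree["absent"], path + ((tree["key"], False),)))


# Hypothetical toy data: each compound is the set of keys it possesses.
rows = [{"F", "N+"}, {"F", "N+", "arom"}, {"arom"}, {"N+"}, {"F"}, set()]
labels = ["active", "active", "not active", "not active", "not active", "not active"]

tree = build_tree(rows, labels, {"F", "N+", "arom"})
rules = extract_rules(tree)
for conditions, cls in rules:
    clause = " AND ".join(("has " if present else "lacks ") + key
                          for key, present in conditions)
    print(f"IF {clause} THEN class {cls}")
```

Each path from the root to a terminal node becomes one IF ... THEN rule, which mirrors the way a 'rules package' restates the tree. With real HTS data the rows would hold thousands of topological or 3D keys, and the threshold would control how many terminal nodes survive.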