PREDICTIVE TOXICOLOGY - CHAPTER 13

13 PASS: Prediction of Biological Activity Spectra for Substances

VLADIMIR POROIKOV and DMITRI FILIMONOV
Institute of Biomedical Chemistry of Russian Academy of Medical Sciences, Moscow, Russia

© 2005 by Taylor & Francis Group, LLC

1. INTRODUCTION

Each pharmaceutical research and development project is aimed at discovering new drugs for the treatment of certain diseases. The investigation of new pharmaceuticals is carried out in a stepwise manner, because drug discovery is a time-consuming process that demands enormous financial resources and manpower and carries a substantial risk of failure. On average, introducing a new medicine to the market requires 12 years and approximately $800 million (1), with a high risk of negative results (1 out of 10,000 substances studied is developed into a safe and potent drug). Drug research starts with identification of a "lead molecule" with the required biological activity. Subsequently, the lead molecule is developed to obtain more potent compounds with appropriate pharmacodynamic and pharmacokinetic properties that can qualify as drug candidates (2).

The general biological potential of any molecule under study is also evaluated in stages. The emphasis is first laid on testing for specific activity, followed by general pharmacology and toxicology studies, clinical trials, postmarketing registration of adverse effects, etc. As a result, adverse/toxic actions are often discovered at a stage when a lot of time and money have already been expended (3). At the same time, it is practically impossible to test experimentally all compounds against each known kind of biological activity and possible toxic effects. So, computer-aided prediction is the "method of choice" at the early stage of drug research. Relying on predicted results, one may establish the priorities for testing a particular compound and the basis for selecting the most promising hits/leads/candidates from the set of compounds available for screening.
Application of computational methods has significantly decreased the time and financial expenditure required to obtain a compound with the required properties. In addition, it helps to obtain more effective and safer medicines. Both computer-aided analysis of quantitative structure–activity/structure–property relationships (QSAR/QSPR) and molecular modeling are widely used for finding and optimizing lead compounds. However, the majority of such methods are limited to studying a single targeted biological activity within a particular chemical series (4–6). Typically, they are applied step by step to analyze different activities/properties, in correspondence with the sequential study of biologically active compounds mentioned above.

On the other hand, most of the known biologically active compounds demonstrate several or even many kinds of biological activity, which constitute the so-called "biological activity spectrum" of the compound (3). Some components of the biological activity spectrum may serve as a basis for the treatment of certain pathologies, while others may be a source of adverse/toxic effects. For instance, thalidomide was prescribed worldwide (1950s to early 1960s) to pregnant women as a treatment for morning sickness. Subsequently, it was discovered that thalidomide was teratogenic (about 12,000 babies were born with tiny or no limbs, flipper-like arms and legs, serious facial deformities, and defective organs). Because of this, the drug was withdrawn from the market in 1962 (7). However, thalidomide is now again considered a promising pharmaceutical agent because of some newly discovered activities, e.g., angiogenesis inhibitor, tumor necrosis factor antagonist, and others (8).
If, at the early stage of study, researchers could predict the most probable biological activities of drugs like thalidomide, they might avoid the dramatic consequences of their adverse/toxic action and could suggest wider pharmacotherapeutic applications.

2. BRIEF DESCRIPTION OF THE METHOD FOR PREDICTING BIOLOGICAL ACTIVITY SPECTRA

The computer program PASS (Prediction of Activity Spectra for Substances) was developed as a tool for evaluating the general biological potential of a molecule under study (9). There had been several earlier attempts to develop this kind of computer system (10–13). In particular, the feasibility of computer-aided prediction of the biological activity of chemical compounds on the basis of their structural formulae was studied within the State System for Registration of New Chemical Compounds Synthesized in the USSR in 1972–1990 (14). For some objective and subjective reasons, this problem was not completely solved, but the studies carried out at that time provided the background and experience necessary for the development of such a computer program.

The latest version of PASS (1.911) predicts about 1000 kinds of biological activity with a mean prediction accuracy of about 85%. By comparison, PASS could predict only 541 kinds of biological activity in 1998 (15) and 114 kinds in 1996 (16), when the mean prediction accuracy was only 78%. The default list of predictable biological activities includes main and side pharmacological effects (e.g., antihypertensive, hepatoprotective, sedative, etc.), mechanisms of action (5-hydroxytryptamine agonist, acetylcholinesterase inhibitor, adenosine uptake inhibitor, etc.), and specific toxicities (mutagenicity, carcinogenicity, teratogenicity, etc.). Information about novel activities and new compounds can be straightforwardly included into PASS and used for further prediction of biological activity spectra for new chemical compounds.
A complete list of biological activities predicted by PASS, along with a detailed description of the algorithm, applications, and efficiency of PASS, is available on the web site (17). It is also possible to obtain predictions of biological activity spectra, or to estimate the accuracy of prediction by submitting substances with known activities, via the internet (18).

2.1. Biological Activity Presentation

In PASS, biological activities are described qualitatively (active or inactive). Reflecting the result of a chemical compound's interaction with a biological object, the biological activity depends on both the compound's molecular structure and the terms and conditions of the experiment. Therefore, structure–activity relationship analysis based on qualitative presentation of biological activity describes the general "biological potential" of the molecule being studied. On the other hand, qualitative presentation allows integrating information concerning compounds tested under different terms and conditions and collected from many different sources, as in the PASS training set.

Any property of chemical compounds determined by their structural peculiarities can be used for prediction by PASS, so the applicability of PASS is broader than the prediction of biological activity spectra. For example, we use this approach to predict drug-likeness (19) and biotransformation of drug-like compounds (20).

2.2. Chemical Structure Description

The 2D structural formulae of compounds were chosen as the basis for the description of chemical structure, because this is the only information available at the early stage of research (compounds may only be designed but not yet synthesized). Plenty of characteristics of chemical compounds can be calculated on the basis of structural formulae (21).
Earlier (22), we applied the substructure superposition fragment notation (SSFN) codes (23). But SSFN, like many other structural descriptors, reflects the abstraction of chemical structure by the human mind rather than the nature of the biological activity revealed by chemicals. The multilevel neighborhoods of atoms (MNA) descriptors (24–26) have certain advantages in comparison with SSFN. These descriptors are based on a molecular structure representation that includes the hydrogens according to the valences and partial charges of other atoms and does not specify the types of bonds. MNA descriptors are generated as a recursively defined sequence:

• the zero-level MNA descriptor for each atom is the mark A of the atom itself, and
• any next-level MNA descriptor for the atom is the substructure notation $A(D_1 D_2 \cdots D_i \cdots)$, where $D_i$ is the previous-level MNA descriptor for the ith immediate neighbor of the atom A.

The mark of the atom may include not only the atomic type but also any additional information about the atom. In particular, if the atom is not included in a ring, it is marked by "–". The neighbor descriptors $D_1 D_2 \cdots D_i \cdots$ are arranged in a unique manner, e.g., in lexicographic order. The iterative process of MNA descriptor generation can thus be continued to cover the first, second, etc., neighborhoods of each atom.

For instance, starting from the N atom in the piperidine-2,6-dione part of the thalidomide molecule, the following MNA descriptors of the zero to the third level can be generated:

MNA/0: N
MNA/1: N(CCC)
MNA/2: N(C(CCN–H)C(CN–O)C(CN–O))
MNA/3: N(C(C(CCC)N(CCC)–O(C))C(C(CCC)N(CCC)–O(C))C(C(CC–H–H)C(CN–O)N(CCC)–H(C)))

In the latest version of PASS (1.911), which is discussed in this paper, molecular structure is represented by the set of unique MNA descriptors of the third level (MNA/3). The list of thalidomide's MNA/3 descriptors is given below:

1. C(C(C(CCC)C(CC–H)C(CN–O))C(C(CCC)C(CC–H)–H(C))C(C(CCC)N(CCC)–O(C)))
2. C(C(C(CCC)C(CC–H)C(CN–O))C(C(CC–H)C(CC–H)–H(C))–H(C(CC–H)))
3. C(C(C(CCC)C(CC–H)C(CN–O))N(C(CCN–H)C(CN–O)C(CN–O))–O(C(CN–O)))
4. C(C(C(CCC)C(CC–H)–H(C))C(C(CC–H)C(CC–H)–H(C))–H(C(CC–H)))
5. C(C(C(CCN–H)C(CC–H–H)–H(C)–H(C))C(C(CCN–H)N(CC–H)–O(C))N(C(CCN–H)C(CN–O)C(CN–O))–H(C(CCN–H)))
6. C(C(C(CCN–H)C(CC–H–H)–H(C)–H(C))C(C(CC–H–H)N(CC–H)–O(C))–H(C(CC–H–H))–H(C(CC–H–H)))
7. C(C(C(CC–H–H)C(CN–O)N(CCC)–H(C))C(C(CC–H–H)C(CN–O)–H(C)–H(C))–H(C(CC–H–H))–H(C(CC–H–H)))
8. C(C(C(CC–H–H)C(CN–O)N(CCC)–H(C))N(C(CN–O)C(CN–O)–H(N))–O(C(CN–O)))
9. C(C(C(CC–H–H)C(CN–O)–H(C)–H(C))N(C(CN–O)C(CN–O)–H(N))–O(C(CN–O)))
11. N(C(C(CCC)N(CCC)–O(C))C(C(CCC)N(CCC)–O(C))C(C(CC–H–H)C(CN–O)N(CCC)–H(C)))
12. N(C(C(CCN–H)N(CC–H)–O(C))C(C(CC–H–H)N(CC–H)–O(C))–H(N(CC–H)))
13. –H(C(C(CCC)C(CC–H)–H(C)))
14. –H(C(C(CCN–H)C(CC–H–H)–H(C)–H(C)))
15. –H(C(C(CC–H–H)C(CN–O)N(CCC)–H(C)))
16. –H(C(C(CC–H–H)C(CN–O)–H(C)–H(C)))
17. –H(C(C(CC–H)C(CC–H)–H(C)))
18. –H(N(C(CN–O)C(CN–O)–H(N)))
19. –O(C(C(CCC)N(CCC)–O(C)))
20. –O(C(C(CCN–H)N(CC–H)–O(C)))
21. –O(C(C(CC–H–H)N(CC–H)–O(C)))

Substances are considered equivalent in PASS if they have the same set of MNA descriptors. Since MNA descriptors do not represent the stereochemical peculiarities of a molecule, substances whose structures differ only stereochemically are formally considered equivalent.

2.3. Training Set

The PASS estimations of biological activity spectra of new compounds are based on the structure–activity relationships knowledgebase (SARBase), which accumulates the results of the training set analysis. The in-house-developed PASS training set includes about 50,000 known biologically active substances (drugs, drug candidates, leads, and toxic compounds).
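The recursive MNA definition from Section 2.2 can be illustrated with a minimal Python sketch. This is not the PASS implementation: the function name `mna_descriptors`, the input layout (a list of atom marks plus a neighbor table), and the use of plain string sorting as the "unique manner" of ordering are all assumptions made for illustration. The toy molecule is cyclopropane with explicit hydrogens, which does not appear in the chapter; ring membership is supplied by the caller through the marks.

```python
from typing import Dict, List

RING_FREE_MARK = "–"  # en dash, as printed in the chapter's descriptor strings


def mna_descriptors(marks: List[str], neighbors: Dict[int, List[int]], level: int) -> List[str]:
    """Return the MNA descriptor of the requested level for every atom.

    marks[a] is the mark of atom a (element symbol, prefixed with the dash
    for atoms outside rings); neighbors[a] lists the atoms bonded to atom a.
    """
    current = list(marks)  # level 0: the mark of the atom itself
    for _ in range(level):
        # Next level: A(D1 D2 ...), built from the neighbors' previous-level
        # descriptors. Plain string sorting is an assumed stand-in for the
        # "unique manner" of arranging neighbor descriptors.
        current = [
            marks[a] + "(" + "".join(sorted(current[b] for b in neighbors[a])) + ")"
            for a in range(len(marks))
        ]
    return current


# Cyclopropane with explicit hydrogens (hypothetical toy input):
# atoms 0-2 are ring carbons, atoms 3-8 are hydrogens, two per carbon.
marks = ["C", "C", "C"] + [RING_FREE_MARK + "H"] * 6
neighbors = {
    0: [1, 2, 3, 4], 1: [0, 2, 5, 6], 2: [0, 1, 7, 8],
    3: [0], 4: [0], 5: [1], 6: [1], 7: [2], 8: [2],
}

level1 = mna_descriptors(marks, neighbors, 1)
level2 = mna_descriptors(marks, neighbors, 2)
structure_code = set(level2)  # a molecule is represented by its unique descriptors

print(level1[0])
print(sorted(structure_code))
```

Because a structure is reduced to the set of its unique descriptors, two inputs that yield the same set would be treated as equivalent, which matches the equivalence rule stated above.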
Since new information about biologically active compounds is discovered regularly, we perform a dedicated information search and analyze the new information, which is then used for updating and correcting the PASS training set.

2.4. Algorithm of Activity Spectra Estimation

The algorithm of prediction was chosen from a large number of options examined over the past several years. It is based on the specially designed B-statistics, in which the well-known Fisher's arcsine transformation is used. On the basis of a molecule's structure, represented by the set of $m$ MNA descriptors $\{D_1, \ldots, D_m\}$, the following $B_k$ values are calculated for each kind of activity $A_k$:

$$B_k = \frac{S_k - S_{0k}}{1 - S_k \cdot S_{0k}}$$

$$S_k = \sin\left[\sum_i \arcsin\bigl(2P(A_k \mid D_i) - 1\bigr) / m\right]$$

$$S_{0k} = 2P(A_k) - 1$$

where $P(A_k \mid D_i)$ is the conditional probability of activity of kind $A_k$ if the descriptor $D_i$ is present in the set of the molecule's descriptors, and $P(A_k)$ is the a priori probability of finding a compound with activity of kind $A_k$. For any kind of activity $A_k$: if $P(A_k \mid D_i)$ is equal to 1 for all descriptors of a molecule, then $B_k = 1$; if $P(A_k \mid D_i)$ is equal to 0 for all descriptors of a molecule, then $B_k = -1$; if there is no relationship between the molecule's descriptors and activity of kind $A_k$, so that $P(A_k \mid D_i) \approx P(A_k)$, then $B_k \approx 0$.

Up to PASS version 1.703, the algorithm of prediction was based on the following data: $n$ is the total number of compounds in the SARBase; $n_i$ is the number of compounds containing descriptor $D_i$ in the structure description; $n_k$ is the number of compounds containing the kind of activity $A_k$ in the activity spectrum; $n_{ik}$ is the number of compounds containing both the kind of activity $A_k$ and the descriptor $D_i$.
The estimates of the probabilities $P(A_k)$ and $P(A_k \mid D_i)$ are given by

$$P(A_k) = n_k / n; \quad P(A_k \mid D_i) = n_{ik} / n_i$$

In PASS version 1.703 and later, instead of the integers $n_i$ and $n_{ik}$, the sums $g_i$ and $g_{ik}$ of descriptor weights $w$ are used, where $w = 1/m$ and $m$ is the number of MNA descriptors of the individual molecule. This modification increases the accuracy of prediction significantly. So, the estimates of the probabilities $P(A_k \mid D_i)$ are now given by

$$P(A_k \mid D_i) = g_{ik} / g_i$$

The main purpose of PASS is to predict the activity spectra for new substances. To provide more accurate predictions, if the compound under prediction has an equivalent structure in the SARBase, this structure is "excluded" from the SARBase during the prediction, together with all associated information about its biological activities. The calculations are done by using $n - 1$, $g_i - w$, and, when the kind of activity $A_k$ is contained in its activity spectrum in the SARBase, by using $n_k - 1$ and $g_{ik} - w$. Here $w = 1/m$, and $m$ is the number of MNA descriptors in the molecule under prediction and its equivalent in the SARBase. The $B_k$ values are calculated using the MNA descriptors that are found in the SARBase, i.e., for descriptors of the molecule under prediction with $g_i > 0$, or with $g_i - w > 0$ in the case of structure "exclusion."

To make a "yes/no" qualitative prediction, it is necessary to determine B-statistics threshold values for each kind of activity $A_k$. Using statistical decision theory, this can be done by minimizing a risk function. But nobody can specify a priori the risk functions for all activity kinds and all possible practical tasks. Therefore, the predicted activity spectrum in PASS is presented as a rank-ordered list of activities with probabilities "to be active" Pa and "to be inactive" Pi, which are functions of the B-statistics for the molecule under prediction.
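The weighted probability estimates and the B-statistic of Section 2.4 can be sketched in Python as follows. This is a minimal illustration, not the PASS code: the names `estimate_probabilities` and `b_statistic`, the representation of the SARBase as a list of (descriptor set, activity set) pairs, and the toy training data are all hypothetical. The sketch omits the "exclusion" correction, takes $m$ as the number of descriptors found in the training data, and returns 0 when no descriptor is known; the last two choices are assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Compound = Tuple[Set[str], Set[str]]  # (descriptor set, activity set)


def estimate_probabilities(training: List[Compound], activity: str) -> Tuple[float, Dict[str, float]]:
    """Return P(A_k) = n_k / n and P(A_k | D_i) = g_ik / g_i for one activity."""
    n = len(training)
    n_k = sum(1 for _, acts in training if activity in acts)
    g = defaultdict(float)    # g_i: summed weights of compounds containing D_i
    g_k = defaultdict(float)  # g_ik: same, restricted to active compounds
    for descs, acts in training:
        w = 1.0 / len(descs)  # w = 1/m for this molecule
        for d in descs:
            g[d] += w
            if activity in acts:
                g_k[d] += w
    return n_k / n, {d: g_k[d] / g[d] for d in g}


def b_statistic(descs: Iterable[str], prior: float, cond: Dict[str, float]) -> float:
    """B_k = (S_k - S_0k) / (1 - S_k * S_0k) over descriptors known to the SARBase."""
    known = [d for d in descs if d in cond]
    if not known:
        return 0.0  # assumption: no known descriptor means no evidence either way
    # Clamp guards asin against rounding just outside [-1, 1].
    angles = [math.asin(max(-1.0, min(1.0, 2 * cond[d] - 1))) for d in known]
    s_k = math.sin(sum(angles) / len(known))
    s_0k = 2 * prior - 1
    # Undefined if S_k * S_0k == 1 (e.g. prior of exactly 0 or 1); not handled here.
    return (s_k - s_0k) / (1 - s_k * s_0k)


# Hypothetical toy training set with one activity "X".
training = [({"a", "b"}, {"X"}), ({"a", "c"}, set()), ({"c"}, set()), ({"b"}, {"X"})]
prior, cond = estimate_probabilities(training, "X")
print(prior, cond)
print(b_statistic({"b"}, prior, cond), b_statistic({"c"}, prior, cond))
```

In the toy data, descriptor "b" occurs only in active compounds and "c" only in inactive ones, so the sketch reproduces the limiting cases stated in the text: $B_k = 1$, $B_k = -1$, and $B_k \approx 0$ for the uninformative descriptor "a".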
The B-statistics functions Pa and Pi are the results of the training procedure described below. The list is arranged in descending order of Pa − Pi; thus, the more probable activity kinds are at the top of the list. The list can be shortened at any desired cutoff value, but Pa > Pi is used by default. If the user chooses a rather high value of Pa as a cutoff for selection of probable activities, the chance of confirming the predicted activities by experiment is also high, but many existing activities will be lost. For instance, if Pa > 80% is used as a threshold, about 80% of real activities will be lost; for Pa > 70%, the portion of lost activities is 70%, etc.

2.5. Training Procedure

For each compound from the training set, MNA descriptors are generated, and its known activity spectrum and set of descriptors are stored in the SARBase. If the compound has an equivalent structure in the SARBase, only new activities are added to the activity spectrum. After inclusion of all information from the training set(s) into the SARBase, the values $n$, $g_i$, $n_k$, $g_{ik}$ are calculated.

For each compound in the SARBase and for each activity kind $A_k$, values $B_k$ of the B-statistics are calculated. Calculations take into account the "exclusion" of the processed compound described above. For each activity kind $A_k$, the calculated values $B_k$ are subdivided into two samples: for active and for inactive compounds. These samples are used to calculate smooth estimates of the B-statistics distribution functions on the following basis.

Suppose we have the sample $x_1, \ldots, x_n$ of $n$ values of a random variable $X$ that has an unknown distribution function $F(x)$. Using an empirical step function to approximate $F$ often fails because of small $n$. To provide a smooth estimate of $F(x)$, the inverse function $x(F)$ is calculated as the conditional expectation of the random variable $X$:

$$x(F) = \sum_i \frac{(n-1)!}{(i-1)!\,(n-i)!}\, F^{i-1} (1-F)^{n-i}\, x'_i$$

where $\frac{(n-1)!}{(i-1)!\,(n-i)!} F^{i-1} (1-F)^{n-i}$ is the binomial distribution, and $x'_1, \ldots, x'_n$ ($x'_1 < x'_2 < \cdots < x'_n$) is the ranked sample $x_1, \ldots, x_n$. The distribution function $F(x)$ is given by the inverse of the quantile function $x(F)$.

Each sample of B values for active compounds is arranged in ascending order; each sample of B values for inactive compounds is arranged in descending order. The quantiles $b(F)$ described above are calculated. As a result, for each kind of activity, the probabilities Pa and Pi are given by

$$b_{\text{active}}(Pa) = B; \quad b_{\text{inactive}}(Pi) = B$$

[...]

... 0.154 0.183 0.186 0.211 0.098 0.132 0.135 0.163 0.144 0.134 0.172 0.160 0.146 0.296 0.335 0.190 0.295 0.268

Biological activity: GABA receptor agonist; Cardiovascular analeptic; Antidyskinetic; Oxidoreductase inhibitor; Lipocortins synthesis antagonist; Superoxide dismutase stimulant; Hypnotic; Platelet adhesion inhibitor; Immunomodulator; Antiinflammatory; Alzheimer's disease treatment; l-lactate dehydrogenase stimulant ...

... reflects all kinds of its biological activity, which can be found in the compound's interaction with biological entities.

First kind error of prediction: "False-positives," when an inactive compound is predicted to be active.

LOO CV: Leave-one-out cross-validation is applied to all compounds from the training set. Each compound is excluded from the training set with the information about its activities, and ...

26. Poroikov VV, Filimonov DA, Borodina YuV, Lagunin AA, Kos A. Robustness of biological activity spectra predicting by computer program PASS for non-congeneric sets of chemical compounds. J Chem Inf Comput Sci 2000; 40:1349–1355.
27. See http://www.mdli.com
28. Voigt JH, Bienfait B, Wang S, Nicklaus MC. Comparison of the NCI open database with seven large chemical structural databases. J Chem ...
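Returning to the training procedure of Section 2.5, the smooth quantile estimate $x(F)$ and its inversion can be sketched in Python. This is an illustrative reading of the formulas, not the PASS implementation: the function names, the bisection used to invert $x(F)$, and the sample values are hypothetical. The descending arrangement of the inactive sample is handled as $Pi = 1 - F$ on the ascending sample, which follows from substituting the reversed ranking into the same binomial sum.

```python
import math
from typing import Sequence, Tuple


def smooth_quantile(sample: Sequence[float], f: float) -> float:
    """x(F): binomially weighted average of the ranked sample (ascending)."""
    xs = sorted(sample)
    n = len(xs)
    # j = i - 1 runs over 0..n-1; the weights are the Binomial(n-1, F) probabilities.
    return sum(
        math.comb(n - 1, j) * f**j * (1 - f) ** (n - 1 - j) * x
        for j, x in enumerate(xs)
    )


def smooth_cdf(sample: Sequence[float], b: float, iterations: int = 60) -> float:
    """Invert x(F) = b by bisection; x(F) is non-decreasing in F on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if smooth_quantile(sample, mid) < b:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def pa_pi(active_b: Sequence[float], inactive_b: Sequence[float], b: float) -> Tuple[float, float]:
    """Solve b_active(Pa) = B and b_inactive(Pi) = B for a given B value."""
    pa = smooth_cdf(active_b, b)
    pi = 1.0 - smooth_cdf(inactive_b, b)  # descending ranking equals 1 - F
    return pa, pi


# Hypothetical B values of active and inactive training compounds.
pa, pi = pa_pi([0.2, 0.5, 0.8], [-0.6, -0.2, 0.2], 0.5)
print(round(pa, 6), round(pi, 6))
```

Values of B below the smallest sample value map to probability 0 and values above the largest map to 1, because the bisection simply converges to the corresponding end of the interval.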
... Cheng CC. Affinity capillary electrophoresis in biomolecular recognition. Cell Mol Life Sci 1998; 54:663–683.
8. Deplanque G, Harris AL. Anti-angiogenic agents: clinical trial design and therapies in development. Eur J Cancer 2000; 36:1713–1724.
9. Poroikov V, Filimonov D. Computer-aided prediction of biological activity spectra. Application for finding and optimization of new leads. In: Holtje HD, Sippl W, eds. Rational ...

... atoms and do not specify the types of bonds. MNA descriptors are generated as recursively defined sequence: zero-level MNA descriptor for each atom is the mark A of the atom itself; any next-level MNA descriptor for the atom is the substructure notation $A(D_1 D_2 \cdots D_i \cdots)$, where $D_i$ is the previous-level MNA descriptor for ith immediate neighbor of the atom A. If the atom is not included into the ring it is ...

... many biological activities, which are probable according to the predictions, were never tested. Fortunately, there exists the NCI database with 42,689 heterogeneous compounds, each tested in anti-HIV assays (28). We used this database to estimate to what extent the application of PASS predictions enriches the population of active compounds in the ...

... Blinova V, Dmitriev A, Poroikov V. Predicting biotransformation potential from molecular structure. J Chem Inform Comput Sci 2003; 43:1636–1646.
21. Guba W. Representation of chemicals. In: Helma C, ed. Predictive Toxicology. Marcel Dekker, 2003; 11–35.
22. Filimonov DA, Poroikov VV, Karaicheva EI, et al. Computer-aided prediction of biological activity spectra of chemical substances on the basis of their structural ...

... are the probabilities to be active and inactive respectively

Pa      Pi      Biological activity
0.988   0.011   4-Aminobutyrate transaminase inhibitor
0.964   0.002   Inflammatory bowel disease treatment
0.883   0.013   Lysase inhibitor
0.742   0.007   Antiarthritic
0.751   0.023   Ovulation inhibitor
0.731   0.008   Steroid synthesis inhibitor
0.776   0.085   Ligase ...
0.626   0.003   ...
0.623   0.006   ...
0.619   0.022   ...
0.603   0.016   ...
0.581   0.034   ...
0.576   0.036   ...

... Validation

Leave-one-out cross-validation for all ~1000 kinds of biological activity and ~50,000 substances provides the estimate of PASS prediction accuracy at the training procedure. Average accuracy of prediction is about 86% according to the LOO CV estimation, while that for particular kinds of activity varies from 63% (antacid, multiple sclerosis treatment) to 99% (urokinase-type plasminogen activator ...

... Angiogenesis inhibitor; Nootropic; Calcium channel agonist

Forty-four kinds of biological activity are predicted as probable with Pa > 30%. Most known pharmacotherapeutic and adverse/toxic effects of thalidomide from the literature appeared in the PASS predicted biological activity spectrum (marked in bold). Some new activities are predicted too, such as 4-aminobutyrate transaminase inhibitor, lysase inhibitor, ...
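The presentation rule for a predicted spectrum (keep activities with Pa > Pi by default, order by descending Pa − Pi) can be sketched in a few lines of Python. The function name `rank_spectrum` and the optional `pa_cutoff` parameter are hypothetical, and treating a user cutoff as a replacement for the default filter is an assumption. Three rows use values from the thalidomide prediction above, pairing Pa, Pi, and activity name by position; the fourth row is an invented entry added only to show the filter.

```python
from typing import List, Optional, Tuple

Prediction = Tuple[str, float, float]  # (activity, Pa, Pi)


def rank_spectrum(predictions: List[Prediction], pa_cutoff: Optional[float] = None) -> List[Prediction]:
    """Filter a predicted spectrum and order it by descending Pa - Pi."""
    if pa_cutoff is None:
        kept = [p for p in predictions if p[1] > p[2]]     # default: Pa > Pi
    else:
        kept = [p for p in predictions if p[1] > pa_cutoff]  # assumed: cutoff on Pa only
    return sorted(kept, key=lambda p: p[1] - p[2], reverse=True)


spectrum = [
    ("Ovulation inhibitor", 0.751, 0.023),
    ("Antiarthritic", 0.742, 0.007),
    ("4-Aminobutyrate transaminase inhibitor", 0.988, 0.011),
    ("Hypothetical activity", 0.100, 0.400),  # invented row, dropped because Pa < Pi
]

ranked = rank_spectrum(spectrum)
strict = rank_spectrum(spectrum, pa_cutoff=0.75)
for name, pa, pi in ranked:
    print(f"{pa:.3f}  {pi:.3f}  {name}")
```

Note that "Antiarthritic" ranks above "Ovulation inhibitor" despite its lower Pa, because the ordering uses the difference Pa − Pi rather than Pa alone.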
