Mol Divers
DOI 10.1007/s11030-015-9571-9

FULL-LENGTH PAPER

Multi-output model with Box–Jenkins operators of linear indices to predict multi-target inhibitors of ubiquitin–proteasome pathway

Gerardo M. Casañola-Martin · Huong Le-Thi-Thu · Facundo Pérez-Giménez · Yovani Marrero-Ponce · Matilde Merino-Sanjuán · Concepción Abad · Humberto González-Díaz

Received: 11 August 2014 / Accepted: 14 February 2015
© Springer International Publishing Switzerland 2015

Abstract The ubiquitin–proteasome pathway (UPP) plays an important role in the degradation of cellular proteins and in the regulation of cellular processes that include cell cycle control, proliferation, differentiation, and apoptosis. Disruption of proteasome activity leads to pathological states linked to clinical disorders such as inflammation, neurodegeneration, and cancer. The use of UPP inhibitors is one of the proposed approaches to manage these alterations. On the other hand, the ChEMBL database contains >5,000 experimental outcomes for >2,000 compounds tested as possible proteasome inhibitors using a large number of pharmacological assay protocols. These assays report many experimental parameters of biological activity, such as EC50, IC50, percent of inhibition, and others, determined under many different conditions, targets, organisms, etc. Although this large amount of data offers new opportunities for the computational discovery of proteasome inhibitors, its complexity is a bottleneck for the development of predictive models. In this work, we used linear molecular indices calculated with the software TOMOCOMD-CARDD and Box–Jenkins moving average operators to develop a multi-output model that can predict outcomes for 20 experimental parameters in >450 assays carried out under different conditions. This multi-output model showed values of accuracy, sensitivity, and specificity above 70 % for training and validation
series. Finally, this model is considered multi-target and multi-scale, because it predicts the inhibition of the UPP for drugs against 22 molecular or cellular targets of different organisms contained in the ChEMBL database.

Keywords Ubiquitin–proteasome pathway inhibitors · ChEMBL · Multi-target · Multi-scale and multi-output models · Moving averages · QSAR

Electronic supplementary material The online version of this article (doi:10.1007/s11030-015-9571-9) contains supplementary material, which is available to authorized users.

Y. Marrero-Ponce
Facultad de Química Farmacéutica, Universidad de Cartagena, Cartagena de Indias, Bolivar, Colombia

G. M. Casañola-Martin (✉) · C. Abad
Departament de Bioquímica i Biologia Molecular, Universitat de València, 46100 Burjassot, Spain
e-mail: gmaikelc@gmail.com; gerardo.casanola@uv.es

M. Merino-Sanjuán
Department of Pharmacy and Pharmaceutical Technology, University of Valencia, Valencia, Spain

G. M. Casañola-Martin · F. Pérez-Giménez
Unidad de Investigación de Diseño de Fármacos y Conectividad Molecular, Departamento de Química Física, Facultad de Farmacia, Universitat de València, Valencia, Spain

M. Merino-Sanjuán
Institute of Molecular Recognition and Technological Development (IDM), Inter-Universitary Institute from Polytechnic University of Valencia and University of Valencia, Valencia, Spain

G. M. Casañola-Martin
Faculty of Environmental Science, Pontifical University Catholic of Ecuador in Esmeraldas (PUCESE), C/ Espejo y Santa Cruz S/N, 080150 Esmeraldas, Ecuador

H. González-Díaz (✉)
Department of Organic Chemistry II, University of the Basque Country UPV/EHU, 48940 Leioa, Spain
e-mail: humberto.gonzalezdiaz@ehu.es

H. Le-Thi-Thu
School of Medicine and Pharmacy, Vietnam National University Hanoi (VNU), 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam

H. González-Díaz
IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Spain

Introduction

The ubiquitin–proteasome pathway is one of the two main proteolytic systems in mammalian cells
[1]. This pathway is involved in a great number of cellular processes that include cellular homeostasis, cell cycle control, gene expression, DNA repair, signal transduction, immune responses, and apoptosis [2]. The growing list of human diseases in which protein homeostasis is disrupted reveals the importance of the ubiquitin–proteasome pathway for normal cellular function and its potential as a therapeutic target [3]. The proteasome core has been a primary target in cancer therapy since the discovery of the proteasome inhibitor bortezomib, and at present the development of proteasome inhibitors involves many methods [4]. Current efforts in this field are aimed at finding new drugs against the ubiquitin–proteasome pathway with greater selectivity, potency, and safety, in order to minimize side effects.

In this sense, it is important to develop new in silico models in order to predict novel, potent, and selective ubiquitin–proteasome pathway inhibitors. This requires compiling large datasets of these compounds from accessible public sources. The ChEMBL database [5,6] (https://www.ebi.ac.uk/chembldb) includes more than 11,420,000 activity data for >1,295,500 compounds, and 9,844 targets. This vast quantity of data opens a wide field for the application of computational approaches to activity prediction [6,7]. The analysis of these data is very complex because of three types of chemical and pharmacological information that appear: (1) multi-targeting, (2) multi-outputting, and/or (3) multi-scaling. The multi-targeting complication emerges from the formation of different pairs of interactions ($I_{qr}$) between drugs ($d_q$) and targets ($t_r$) [8–10]. In our case, the target interactions are represented as networks of nodes (proteins, genes, RNAs, miRNAs) interconnected by a link when there is a target–target interaction between two of them. The multi-output
complication comprises the use of different types of targets, assay conditions, assays, organisms, experimental measures, etc., in order to decide whether two nodes (assays, drugs, targets, etc.) are linked ($I_{ij} = 1$) or not ($I_{ij} = 0$). Multi-scaling arises from the different structural levels of the organization of matter, which can be described by different input variables. In this sense, the models need to be multi-scale to collect information at some of the following levels: molecular structure (drugs), macromolecular structure (molecular targets), cellular (cellular line targets), and organisms (species from which the targets were extracted).

In our previous study, we used the MARCH-INSIDE (MI) approach to obtain the Shannon entropy measures of a molecular graph (G), which we used in turn as inputs for the Box–Jenkins moving average (MA) operators used in time series analysis [11]. MA models gained popularity after the initial work of Box and Jenkins [12] on autoregressive integrated moving average (ARIMA) and similar models. The Box–Jenkins MA operators used in time series are the average values of one characteristic of the system for different intervals of time or seasons. In multi-output modeling, we calculate the MA operators as the average of a property of the system (molecular descriptors or any other property to be considered) for all drugs or targets with a specific response in one assay carried out under a sub-set of reference conditions ($c_r$). Consequently, our MA operator acts over a sub-set of conditions of the pharmacological assays. The application of MA operators to domains other than time is increasing because of their wide applicability. In this sense, the main objective of this kind of work is to assess interactions or links between drugs and targets, proteins, brain regions, and other complex systems. For this, the use of MA properties of network nodes (drugs, proteins, reactions, laws, neurons, etc.)
that form links ($I_{qr}$) under the specific $r$th sub-set of reference conditions ($c_r$) is adequate. For this reason, we decided to call this strategy assessing links with moving averages (ALMA), in a similar manner as other authors for different multi-target and/or multi-output (mo) models [13–15]. The method is very versatile, because we can use molecular descriptors calculated by different chemoinformatics software as input.

The software TOMOCOMD-CARDD (TC), developed by Marrero-Ponce et al. [16], is a well-known tool for the calculation of several families of 2D/3D molecular descriptors. In particular, we can use TC to calculate different types of atom-based linear indices $f^q(G, N, M, w)_g$ for a given compound ($q$th compound). We can compute these indices for the molecular graph G of the compound, taking into consideration a specific norm (N), a matrix (M), a vector of physicochemical weights (w) for atoms, etc. In addition, we can determine linear indices for different groups (g) of atoms in the molecule and assign them different values according to the specific molecular fragments selected. Applications of linear indices include the estimation of chemical, physical, and kinetic properties of compounds [17,18]. The method has also been applied to the study of different biological activities, for example antibacterials [19], tyrosinase inhibitors [20], trypanosomal inhibitors [21], and others [22]. Besides, linear indices are flexible enough to study different complex systems, including RNA secondary structures [23] and protein stability effects [24].

Fig. 1 A representative sample of the compounds used in this study together with their ChEMBL codes

In a recent work [25], an ALMA model for neuroprotective drugs present in ChEMBL was capable of predicting $I_{qr}$ of drugs with targets in multi-output tests, taking into account the drug responses. In the parametrization of structural
parameters of compounds, the TOPS-MODE program [26] was used. In a more recent work [27], an ALMA classifier with good performance was found using the MI scheme. Both models were able to predict the links between drugs and targets. However, we did not carry out a formal construction or a comparison of the drug–target networks for the ChEMBL data in the previous papers. In any case, despite the high versatility of entropy measures to codify structural information, there is no report of a multi-target model of drug–target interactions for compounds with inhibitory activity against the ubiquitin–proteasome pathway. Therefore, in this study, we describe for the first time a multi-target, multi-output, and multi-scale ALMA model based on atom-based linear indices for ChEMBL data of ubiquitin–proteasome pathway inhibitory compounds.

Materials and methods

ChEMBL dataset: assembling of training and validation sets

We searched and downloaded from the public database ChEMBL a general data set composed of >5,602 results of multiple assay endpoints [5,6] for UPP inhibitors. The value of the observed (obs) class variable, $A_q(c_r)_{obs} = 1$ (active compound) or $A_q(c_r)_{obs} = 0$ (non-active compound), was assigned to every $q$th drug biologically assayed under different conditions $c_r$. The dataset used to train and validate the model includes N = 5,602 statistical cases formed by $N_d$ = 2,954 unique drugs, each assayed for at least one out of 20 possible standard type measures, which were determined in at least one out of 474 assays. Each assay involves, in turn, at least one out of 20 protein or cellular targets from seven different organisms.

Table 1 Dataset used in this study

Dataset     Active    Non-active    Total
Training    1,827     2,376         4,203
CV          607       792           1,399
Total       2,434     3,168         5,602

In Fig. 1 some examples of the compounds used in this work are shown. Structural diversity is encountered among the chemicals extracted from the ChEMBL dataset with UPP inhibitory activity, as can be
observed. Likewise, the SMILES codes of the 2,897 compounds used in this study are given in the Supplementary Material. As noted above, the total set of statistical cases (5,602) forms the whole experimental space used here. When choosing the training and validation sets, we ensured that each of the different conditions was included in both sets, for the active and inactive cases, to guarantee an adequate and representative sample. With this constraint, we randomly picked the compounds for the training (T) and validation (CV) sets. As shown in Table 1, there are 1,827 active cases and 2,376 inactive ones in the training set (4,203 cases). The validation set consists of 1,399 cases, 607 active and 792 inactive. The cases in the validation set were never used in the development of the ALMA models.

Molecular descriptors: TOMOCOMD-CARDD atom-based linear indices

TOMOCOMD-CARDD is a molecular descriptor (MD) calculating program comprised of two suites with parallel functionalities. The first is a comprehensive collection of MD calculating modules based on the so-called “relations frequency matrices,” molecular fingerprints, and a pool of the most relevant MDs reported in the literature. The second suite comprises a set of modules derived from algebraic considerations, collectively known as QuBiLS (acronym for Quadratic, Bilinear and Linear MapS). This suite includes three modules: (1) QuBiLS-MAS (QuBiLS based on Graph–Theoretical Electronic-Density Matrices and Atomic weightingS), (2) QuBiLS-MIDAS (QuBiLS based on MInkowski Distance matrices and Atomic weightingS), and (3) QuBiLS-POMAS (QuBiLS based on molecular surface-based POtential Matrices and Atomic weightingS). In this application, only the QuBiLS-MAS module is included. QuBiLS-MAS constitutes a unique combination of methods to calculate MDs on an algebraic basis. These MDs can be used for a wide range of
applications in all areas of chemistry, in particular in drug design, lead compound discovery and optimization, QSAR/QSPR studies, similarity searching, diversity assessment of compound libraries, and prediction of absorption, distribution, metabolism, excretion and toxicity (ADMET) properties.

In this work, we use the atom-based linear indices calculated with the software TC ver. 1.0 [16] as molecular descriptors $D_{ik}$. For each $q$th chemical, we calculated the different types of atom-based linear indices $f^q(N, M, w)_g$. The norm selected was the Manhattan distance ($N_1$). The M matrix used was the graph-theoretical electronic-density matrix [called non-stochastic (NS)] [28,29]. The atoms in each molecular structure were differentiated with the following physicochemical weights (w): Ghose–Crippen LogP, electronegativity, and van der Waals volume, which can allow a better understanding of the problem. Moreover, in our study the groups of atoms considered for the compounds were H bond acceptors (A), C atoms in aliphatic chains (C), H bond donors (D), C atoms in aromatic portions (P), and heteroatoms (X). The general equation for the definition of the atom-based linear indices is shown below (Eq. 1):

$$f_k^q(G, N_1, M, w)_g = f_k^q(w)_g = \sum_{i=1}^{n_g} |f_i|_g \qquad (1)$$

Computational methods

A theoretical framework in ALMA models

ALMA models may be classified as a general type of model to assess the links in different systems. These approaches are adaptable to all molecular descriptors, graph invariants, or descriptors for complex networks. Here we used $f_k^q$ linear indices of $k$th type for the $q$th compound represented by a matrix M. The aim of this model is to link the scores $S_q(c_r)$ with the molecular descriptors $D_{ik}$ of a given compound $d_r$ and the deviation terms $\Delta f_k^q(c_r) = f_k^q - \langle f_k^q(c_r) \rangle$. The model has the following general form:

$$S_q(c_r) = a_0 + \sum_{r=1}^{7} \sum_{k=0}^{5} a_{rk} \times \Delta f_k^q(c_r) = a_0 + \sum_{r=1}^{7} \sum_{k=0}^{5} b_{rk} \times \left[ f_k^q - \langle f_k^q(c_r) \rangle \right] \qquad (2)$$

The output-dependent variable is $S_q(c_r) = S_q(c_1, c_2, c_3, c_4, c_5, c_6, c_7) = S_q$(measure type, target, target mapping, assay type, data curation, assay protocol, organism). In our case, the attribute $S_q(c_r)$ is a numerical score for the effect of the $q$th compound, defined as $d_r$, in the $r$th test carried out under the conditions $c_r$. In Eq. (2), $f_k^q$ and $\langle f_k^q(c_r) \rangle$ are used as independent attributes. The input variable $\langle f_k^q(c_r) \rangle$ is the mean of the $k$th descriptors $f_k^q$ of all chemicals assayed in one test procedure carried out under the reference conditions $c_r$. The attributes $\langle f_k^q(c_r) \rangle$ are defined like the Box–Jenkins moving average operators proposed previously [12] and used in other successful applications [30–33]. In this MA approach, $\langle f_k^q(c_r) \rangle$ is obtained by summing the $f_k^q$ descriptors of the $n_r$ compounds evaluated under the same conditions $c_r$ and then dividing this sum by $n_r$, as can be observed in Eq. (3):

$$\langle f_k^q(c_r) \rangle = \frac{1}{n_r} \sum_{q=1}^{n_r} f_k^q \qquad (3)$$

Fig. 2 Graphical flowchart of all the steps taken in this work to develop the new ALMA model for UPP inhibitors

Developing and performance of QSAR ALMA model

In order to assemble the ALMA model, we used the linear discriminant analysis (LDA) technique implemented in the software package STATISTICA 6.0 [34]. This heuristic technique is very useful for separating two or more classes, as described in detail in the technical literature [35,36]. The algorithm is capable of finding models and giving as output the predicted group membership of new observations. Besides, this technique is one of the most commonly used in drug discovery for different biological activities, including theoretical studies of acetylcholinesterase inhibitors, modeling of anti-allergic natural compounds by molecular topology, and predictive modeling of human monoamine oxidase inhibitors [37–39], and more recently in high-dimensional datasets [40]. In our study, the STATISTICA software [41] was
used to develop the ALMA model, and performance parameters were considered to assess the quality of the classification functions. Likewise, the number of variables in the models was kept to a minimum, following the principle of parsimony (Occam’s razor). The quality of this ALMA model was determined by examining Wilks’ λ parameter (U statistic), which for the overall discrimination ranges from 0 (perfect discrimination) to 1 (no discrimination). The squared Mahalanobis distance ($D^2$) indicates the separation of the respective groups, showing whether the model possesses an appropriate discriminatory power for differentiating between the two groups. The Fisher ratio (F), the corresponding p level [p(F)], the accuracy (Ac), specificity (Sp), and sensitivity (Sn) were also used to assess the performance of the ALMA model [33]. In Fig. 2 we depict the graphical flowchart of all steps taken in this work to develop the new ALMA model for UPP inhibitors.

Table 2 Results of ALMA models

Sub-set   Stat. (a)   %      Groups (b)    Iqt(cr)pred = 1 (c)   Iqt(cr)pred = 0 (d)
Train     Sp          73.0   Iqt(cr)obs    1,007                 372
          Sn          71.0   Iqt(cr)obs    820                   2,004
          Ac          71.6   Total
CV        Sp          70.9   Iqt(cr)obs    334                   137
          Sn          70.6   Iqt(cr)obs    273                   655
          Ac          70.7   Total

(a) Sn = positive correct/positive total; Sp = negative correct/negative total; Ac = total correct/total
(b) Iqt(cr)obs: observed experimental measure of interaction/non-interaction (1/0) with the rth target
(c) Prediction of the experimental measure of interaction with the rth target
(d) Prediction of the experimental measure of non-interaction with the rth target

Results and discussion

Model training and validation

Here, we report the first ALMA model to predict the experimental measure of interaction with the $r$th target ($I_{qr} = 1$) or not ($I_{qr} = 0$) when the $q$th drug presents a value higher than average. The output $S_q(c_r)$ of our ALMA model depends on both the chemical structure of the $q$th compound and the set of conditions selected to carry out the biological assay ($c_r$). Therefore, different outputs in terms of probabilities should be expected if the test conditions $c_r$ are changed for the same compound [42]. The boundary conditions $c_r$ included here are those defined previously in the “Computational methods” section. As can be noted in Table 2, the values of accuracy, specificity, and sensitivity of the ALMA classification equation for the training and validation (CV) sets are above 70 %. These values are considered adequate in bioactivity data modeling studies [38]. The statistical parameters used to measure the quality of the equation were the number of cases used to train the model (N), Chi-square ($\chi^2$), and p level [33]. The probability cutoff for this LDA model is ${}^{i}p_1(c_j) > 0.5 \Rightarrow A_i(c_j) = 1$. Likewise, because of the complexity of the molecular descriptors in the equation, a more detailed description of the meaning of the seven variables included in the model is given in Table 3.

Table 3 Variables used as input for the model

Variable   Symbol                   Molecular descriptor details
f1         $f_1^q(N_1, M, e)_A$     Linear index of M calculated for the set of atoms A using e
f2         $f_2^q(N_1, M, e)_D$     Linear index of M calculated for the set of atoms D using e
f3         $f_3^q(N_1, M, v)_A$     Linear index of M calculated for the set of atoms A using v
f4         $f_4^q(N_1, M, v)_D$     Linear index of M calculated for the set of atoms D using v
f5         $f_5^q(N_1, M, e)_D$     Linear index of M calculated for the set of atoms D using e
f6         $f_6^q(N_1, M, e)_D$     Linear index of M calculated for the set of atoms D using e
f7         $f_7^q(N_1, M, e)_X$     Linear index of M calculated for the set of atoms X using e

M is the graph-theoretical electronic-density matrix. The sets of atoms for local indices are: A, the set of H bond acceptor atoms (N, O, F, Cl); D, the set of H bond donors (N and O atoms that have one bond with an H atom); and X, heteroatoms (all atoms other than C and H). The weight vectors used to calculate the linear indices were v for atomic van der Waals volumes and e for atomic electronegativities.

In this case, a chemical $d_i$ for which the equation predicts an outcome above zero has a positive response in the $r$th test carried out under the conditions $c_r$. The equation of the best ALMA model found in this work was

$$S_q(c_r) = -0.2159 - 0.0004 \times f_1 - 0.2265 \times f_2 + 0.0007 \times f_3 + 0.0002 \times f_4 - 0.0027 \times f_5 + 0.1358 \times f_6 + 0.0408 \times f_7 \qquad (4)$$

$$N = 4203 \qquad \chi^2 = 838.661 \qquad p < 0.005$$

As can be observed from Eq. 4, the parameters $f_1$, $f_2$, and $f_5$ have a negative impact on the activity, and these are the boundary conditions related to measure, target, and data curation, respectively. On the other hand, the variables $f_3$, $f_4$, $f_6$, and $f_7$ (mapping, assay type, protocol, and organism) have a positive influence on the activity. Besides, from this equation we can identify the parameters that contribute most to the activity. One is $f_6$, with a coefficient of 0.1358, which is a very reasonable result because the most important variations in the activity, even for the same compounds, arise from the different protocols used to quantify the activity. The same occurs with the $f_2$ parameter, which has a coefficient of −0.2265 in the equation, a significant negative contribution to the activity.

Table 4 Examples of multi-scale average values for different measures, targets, assays, and organisms
For each entry the table reports the number of cases n(cr) and the average value ⟨f⟩avg, grouped by boundary condition:
Experimental measure (units), n(c1): Ki (nM), IC50 (nM), IC50 (µg ml−1), EC50 (nM), ED50 (nM), ED50 (µM), Inhibition (%), Inhibition (nM), Activity, Activity (%), Activity (s−1), Ratio, Ratio (M/s), Ratio IC50, Ratio CC50/IC50, FC
Target name, n(c2): Proteasome component C5, 26S Proteasome non-ATPase regulatory subunit 14, Proteasome macropain subunit MB1, Proteasome
Target mapping, n(c3): Protein, Protein complex, Homologous protein, Homologous protein complex, Subcellular fraction, Non-molecular
Assay description, n(c6): Inhibition of caspase-like activity of rabbit 26S proteasome β subunit; Inhibition of chymotrypsin-like activity of human 20S proteasome β subunit in human erythrocyte; Inhibition of caspase-like activity of 20S proteasome catalytic core from rabbit erythrocytes; Inhibitory activity was evaluated against trypsin-like (proteasome) enzyme; Inhibition of trypsin-like activity of yeast 20S proteasome at 80 µM; Luminescent cell-based dose titration retest counterscreen to identify proteasome inhibitors; Inhibition of chymotrypsin-like proteasome activity of human 20S proteasome
Organism, n(c7): Homo sapiens, Rattus norvegicus, Mus musculus, Oryctolagus cuniculus, Saccharomyces cerevisiae, Candida albicans
Target and assay IDs: CHEMBL4662, CHEMBL612443, CHEMBL1274497, CHEMBL2327275, CHEMBL4208, CHEMBL2007629, CHEMBL2037857, CHEMBL817282, CHEMBL859730, CHEMBL861973, CHEMBL1614339

To use this model in predictive studies, we only have to substitute into Eq. 4 the value of the molecular descriptor of the compound ($f_k^q$) and the respective average value of the descriptor for all compounds measured under the same boundary conditions [$\langle f_k^q(c_r) \rangle$]. In Table 4 some examples of these average values for different targets, measures, assays, and organisms are depicted. Moreover, in a table of the online Supplementary Material the values of these
parameters for the seven reference conditions are listed. In addition, each boundary condition is represented by a different number of experimental values. As can be observed in Table 4, among the experimental measures, some such as IC50 (nM), EC50 (nM), and Ki (nM) are the most represented, which gives more accuracy when performing predictions. Other measures, such as ratio (M/s) and ED50 (µM), are less represented, but they add diversity and are representative of other experimental parameters that researchers could explore. The same occurs for the other boundary conditions, illustrating the wide diversity of experimental assays, targets, and organisms for which predictions can be made.

Our current approach versus previous methods

For any modeling study, such as our case of UPP inhibitory activity, adequate performance on the training and test sets should first be demonstrated, and then the comparison with previous methodologies should be assessed. Therefore, we reviewed the literature to search for QSAR studies on UPP inhibitory activity. We found only one study on this topic, with a database of 705 cases (compounds) [43]. In that report, the accuracy in the training and prediction sets exceeds the results of our experiment, with values greater than 85 and 80 %, respectively. However, the dataset used in the previous study consists of only one target and one experimental assay under one condition. The techniques used there are based on machine learning algorithms, in contrast to our case, where a simple technique, linear discriminant analysis, was used. As mentioned in previous sections, the main advantage of our proposed ALMA model is the use of different reference conditions (assays, targets, organism, etc.)
from which a wide variety of predictions can be made.

Conclusions

The ubiquitin–proteasome pathway (UPP) plays a main role in many human pathologies, such as multiple myeloma, neurodegenerative diseases, and others that have a great impact on humankind. However, identifying hit or lead compounds to be introduced in the drug research process by traditional methods is becoming more difficult. In this sense, in silico techniques in drug discovery are proposed as one of the solutions that could help this process become more efficient and faster. New algorithms that involve information technology, statistical procedures with complex methodologies, and drug discovery can be useful to accelerate the identification of compounds with high qualities using minimum resources. This new method should be successful for fast and parallel evaluation of huge structural chemical databases [44]. These strategies, which are more efficient, can be used as a complement to QSAR models in virtual assays, and the costs of massive screening can be reduced [45,46]. In this sense, we developed a useful mt-QSAR and mo-QSAR model based on TC atom-based linear indices to fit large and complex data extracted from ChEMBL. This ALMA model, based on multiplexing, was capable of discriminating with good performance under different boundary conditions that include assay conditions, targets, and organisms, among others. Moreover, the influence (positive or negative) of each parameter on the activity was explained in some detail, together with the contribution (high or low) of the different boundary conditions to the UPP inhibitory activity. This opens the way to new insights in the field of UPP inhibitor research in the years to come, and shows how the combination of molecular descriptors and Box–Jenkins moving average operators helps to develop useful multi-output models. This new type of QSAR model could be used as an innovative technology to increase hit rates in biomolecular screening tasks for the identification of potentially active compounds. Finally, the present report opens new ways for the search of drugs that interact with different targets in the UPP, linked to the search for new chemical entities that are active against neurodegenerative diseases, inflammation, or cancer.

Acknowledgments Casañola-Martin, G. M. thanks the program Estades Temporals per a Investigadors Convidats for a fellowship to research at Valencia University (2013–2014). Le-Thi-Thu, H. gratefully acknowledges the support from the Vietnam National University, Hanoi. Marrero-Ponce, Y. thanks the International Professor program for a fellowship to work at Cartagena University in the year 2013–2014. Also, thanks to Prof. Aroa Reguero from the Pontifical Catholic University of Ecuador in Esmeraldas (PUCESE) for her help in the review of the manuscript. Finally, the authors also thank the anonymous referees and editor for their useful comments that contributed to the improvement of this work.

References

1. Ciechanover A (2005) Proteolysis: from the lysosome to ubiquitin and the proteasome. Nat Rev Mol Cell Biol 6:79–87. doi:10.1038/nrm1552
2. Tu Y, Chen C, Pan J, Xu J, Zhou ZG, Wang CY (2012) The ubiquitin proteasome pathway (UPP) in the regulation of cell cycle control and DNA damage repair and its implication in tumorigenesis. Int J Clin Exp Pathol 5:726–738
3. Zhang J, Wu P, Hu Y (2013) Clinical and marketed proteasome inhibitors for cancer treatment. Curr Med Chem 20:2537–2551. doi:10.2174/09298673113209990122
4. Pevzner Y, Metcalf R, Kantor M, Sagaro D, Daniel K (2013) Recent advances in proteasome inhibitor discovery. Expert Opin Drug Discov 8:537–568. doi:10.1517/17460441.2013.780020
5. Heikamp K, Bajorath J (2011) Large-scale similarity search profiling of ChEMBL compound data sets. J Chem Inf Model 51:1831–1839. doi:10.1021/ci200199u
6. Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y,
McGlinchey S, Michalovich D, Al-Lazikani B, Overington JP (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40:D1100–D1107. doi:10.1093/nar/gkr777
7. Mok NY, Brenk R (2011) Mining the ChEMBL database: an efficient chemoinformatics workflow for assembling an ion channel-focused screening library. J Chem Inf Model 51:2449–2454. doi:10.1021/ci200260t
8. Hu Y, Bajorath J (2010) Molecular scaffolds with high propensity to form multi-target activity cliffs. J Chem Inf Model 50:500–510. doi:10.1021/ci100059q
9. Erhan D, L’Heureux PJ, Yue SY, Bengio Y (2006) Collaborative filtering on a family of biological targets. J Chem Inf Model 46:626–635. doi:10.1021/ci050367t
10. Namasivayam V, Hu Y, Balfer J, Bajorath J (2013) Classification of compounds with distinct or overlapping multi-target activities and diverse molecular mechanisms using emerging chemical patterns. J Chem Inf Model 53:1272–1281. doi:10.1021/ci400186n
11. Tenorio-Borroto E, Garcia-Mera X, Penuelas-Rivas CG, Vasquez-Chagoyan JC, Prado-Prado FJ, Castanedo N, Gonzalez-Diaz H (2013) Entropy model for multiplex drug–target interaction endpoints of drug immunotoxicity. Curr Top Med Chem 13:1636–1649. doi:10.2174/15680266113139990114
12. Box GEP, Jenkins GM (1970) Time series analysis: forecasting and control. Holden-Day, San Francisco
13. Speck-Planche A, Kleandrova VV, Cordeiro MN (2013) Chemoinformatics for rational discovery of safe antibacterial drugs: simultaneous predictions of biological activity against streptococci and toxicological profiles in laboratory animals. Bioorg Med Chem 21:2727–2732. doi:10.1016/j.bmc.2013.03.015
14. Speck-Planche A, Kleandrova VV, Luan F, Cordeiro MN (2012) Chemoinformatics in multi-target drug discovery for anti-cancer therapy: in silico design of potent and versatile anti-brain tumor agents. Anti-Cancer Agent Med Chem 12:678–685. doi:10.2174/187152012800617722
15. Speck-Planche A, Kleandrova VV, Luan F, Cordeiro MN (2012) Chemoinformatics in anti-cancer chemotherapy:
multi-target QSAR model for the in silico discovery of anti-breast cancer agents. Eur J Pharm Sci 47:273–279. doi:10.1016/j.ejps.2012.04.012
16. Marrero-Ponce Y, Valdés-Martini JR, Jacas CRG (2012) TOMOCOMD-CARDD QuBiLS Software QUBILs-MAS Version 1.0, CAMD-BIR Unit, Universidad Central “Marta Abreu” de Las Villas
17. Marrero-Ponce Y, Medina-Marrero R, Castillo-Garit JA, Romero-Zaldivar V, Torrens F, Castro EA (2005) Protein linear indices of the ‘macromolecular pseudograph alpha-carbon atom adjacency matrix’ in bioinformatics. Part 1: prediction of protein stability effects of a complete set of alanine substitutions in Arc repressor. Bioorg Med Chem 13:3003–3015. doi:10.1016/j.bmc.2005.01.062
18. Marrero-Ponce Y, Castillo-Garit JA, Torrens F, Romero-Zaldivar V, Castro E (2004) Atom, atom-type, and total linear indices of the “molecular pseudograph’s atom adjacency matrix”: application to QSPR/QSAR studies of organic compounds. Molecules 9:1100–1123. doi:10.3390/91201100
19. Marrero-Ponce Y, Medina-Marrero R, Martinez Y, Torrens F, Romero-Zaldivar V, Castro EA (2006) Non-stochastic and stochastic linear indices of the molecular pseudograph’s atom adjacency matrix: a novel approach for computational -in silico- screening and “rational” selection of new lead antibacterial agents. J Mol Mod 12:255–271. doi:10.1007/s00894-005-0024-8
20. Rescigno A, Casañola-Martin GM, Sanjust E, Zucca P, Marrero-Ponce Y (2011) Vanilloid derivatives as tyrosinase inhibitors driven by virtual screening-based QSAR models. Drug Test Anal 3:176–181. doi:10.1002/dta.187
21. Vega MC, Montero-Torres A, Marrero-Ponce Y, Rolón M, Gómez-Barrio A, Escario JA, Arán VJ, Nogal JJ, Meneses-Marcel A, Torrens F (2006) New ligand-based approach for the discovery of antitrypanosomal compounds. Bioorg Med Chem Lett 16:1898–1904. doi:10.1016/j.bmcl.2005.12.087
22. Brito-Sánchez Y, Castillo-Garit JA, Le-Thi-Thu H, González-Madariaga Y, Torrens F, Marrero-Ponce Y, Rodríguez-Borges JE (2013) Comparative
study to predict toxic modes of action of phenols from molecular structures. SAR QSAR Environ Res 24:235–251. doi:10.1080/1062936x.2013.766260
23. Marrero-Ponce Y, Castillo-Garit JA, Nodarse D (2005) Linear indices of the ‘macromolecular graph’s nucleotides adjacency matrix’ as a promising approach for bioinformatics studies. Part 1: prediction of paromomycin’s affinity constant with HIV-1 psiRNA packaging region. Bioorg Med Chem 13:3397–3404. doi:10.1016/j.bmc.2005.03.010
24. Marrero-Ponce Y, Medina-Marrero R, Castillo-Garit JA, Romero-Zaldivar V, Torrens F, Castro EA (2005) Protein linear indices of the ‘macromolecular pseudograph α-carbon atom adjacency matrix’ in bioinformatics. Part 1: prediction of protein stability effects of a complete set of alanine substitutions in Arc repressor. Bioorg Med Chem 13:3003–3015. doi:10.1016/j.bmc.2005.01.062
25. Luan F, Cordeiro MN, Alonso N, Garcia-Mera X, Caamano O, Romero-Duran FJ, Yanez M, Gonzalez-Diaz H (2013) TOPS-MODE model of multiplexing neuroprotective effects of drugs and experimental-theoretic study of new 1,3-rasagiline derivatives potentially useful in neurodegenerative diseases. Bioorg Med Chem 21:1870–1879. doi:10.1016/j.bmc.2013.01.035
26. Marzaro G, Chilin A, Guiotto A, Uriarte E, Brun P, Castagliuolo I, Tonus F, Gonzalez-Diaz H (2011) Using the TOPS-MODE approach to fit multi-target QSAR models for tyrosine kinases inhibitors. Eur J Med Chem 46:2185–2192. doi:10.1016/j.ejmech.2011.02.072
27. Alonso N, Caamano O, Romero-Duran FJ, Luan F, Dias Soeiro Cordeiro MN, Yanez M, Gonzalez-Diaz H, Garcia-Mera X (2013) Model for high-throughput screening of multi-target drugs in chemical neurosciences; synthesis, assay and theoretic study of rasagiline carbamates. ACS Chem Neurosci 4:1393–1403. doi:10.1021/cn400111n
28. Marrero-Ponce Y, Castillo-Garit JA, Olazabal E, Serrano HS, Morales A, Castanedo N, Ibarra-Velarde F, Huesca-Guillen A, Sanchez AM, Torrens F, Castro EA (2005) Atom, atom-type and total molecular linear indices as a promising approach for
bioorganic and medicinal chemistry: theoretical and experimental assessment of a novel method for virtual screening and rational design of new lead anthelmintic. Bioorg Med Chem 13:1005–1020. doi:10.1016/j.bmc.2004.11.040
29. Marrero-Ponce Y, Machado-Tugores Y, Pereira DM, Escario JA, Barrio AG, Nogal-Ruiz JJ, Ochoa C, Aran VJ, Martinez-Fernandez AR, Sanchez RN, Montero-Torres A, Torrens F, Meneses-Marcel A (2005) A computer-based approach to the rational discovery of new trichomonacidal drugs by atom-type linear indices. Curr Drug Discov Technol 2:245–265. doi:10.2174/157016305775202955
30. Concu R, Dea-Ayuela MA, Perez-Montoto LG, Prado-Prado FJ, Uriarte E, Bolas-Fernandez F, Podda G, Pazos A, Munteanu CR, Ubeira FM, Gonzalez-Diaz H (2009) 3D entropy and moments prediction of enzyme classes and experimental–theoretic study of peptide fingerprints in Leishmania parasites. Biochim Biophys Acta 1794:1784–1794. doi:10.1016/j.bbapap.2009.08.020
31. Speck-Planche A, Kleandrova VV, Luan F, Cordeiro MN (2011) Multi-target drug discovery in anti-cancer therapy: fragment-based approach toward the design of potent and versatile anti-prostate cancer agents. Bioorg Med Chem 19:6239–6244. doi:10.1016/j.bmc.2011.09.015
32. Tenorio-Borroto E, Rivas CGP, Chagoyan JCV, Castanedo N, Prado-Prado FJ, Garcia-Mera X, Gonzalez-Diaz H (2012) ANN multiplexing model of drugs effect on macrophages; theoretical and flow cytometry study on the cytotoxicity of the anti-microbial drug G1 in spleen. Bioorg Med Chem. doi:10.1016/j.bmc.2012.07.020
33. Hill T, Lewicki P (2006) Statistics: methods and applications: a comprehensive reference for science, industry and data mining. StatSoft, Tulsa
34. StatSoft Inc (2002) STATISTICA (data analysis software system), version 6.0
35. Tabachnick BG, Fidell LS (1996) Using multivariate statistics. HarperCollins College, New York
36. Duart MJ, García-Domenech R, Anton-Fos GM, Galvez J (2001) Optimization of a mathematical topological pattern for the prediction of
antihistaminic activity. J Comput Aided Mol Des 15:561–572. doi:10.1023/A:1011115824070
37. Prado-Prado FJ, Escobar M, García-Mera X (2013) Review of bioinformatics and theoretical studies of acetylcholinesterase inhibitors. Curr Bioinform 8:496–510. doi:10.2174/1574893611308040012
38. García-Domenech R, Zanni R, Galvez-Llompart M, De Julián-Ortiz JV (2013) Modeling anti-allergic natural compounds by molecular topology. Comb Chem High Throughput Screen 16:628–635. doi:10.2174/1386207311316080005
39. Helguera AM, Pérez-Garrido A, Gaspar A, Reis J, Cagide F, Vina D, Cordeiro MNDS, Borges F (2013) Combining QSAR classification models for predictive modeling of human monoamine oxidase inhibitors. Eur J Med Chem 59:75–90. doi:10.1016/j.ejmech.2012.10.035
40. Mai Q (2013) A review of discriminant analysis in high dimensions. Wiley Interdisciplin Rev Computat Statist 5:190–197. doi:10.1002/wics.1257
41. StatSoft Inc (2001) STATISTICA (data analysis software system), version 6.0. StatSoft Inc., Tulsa
42. Gerets HH, Dhalluin S, Atienzar FA (2011) Multiplexing cell viability assays. Methods Mol Biol 740:91–101. doi:10.1007/978-1-61779-108-6-11
43. Casañola-Martin GM, Le-Thi-Thu H, Marrero-Ponce Y, Castillo-Garit JA, Torrens F, Perez-Gimenez F, Abad C (2014) Analysis of proteasome inhibition prediction using atom-based quadratic indices enhanced by machine learning classification techniques. Lett Drug Des Discov 11:705–711. doi:10.2174/1570180811666140122001144
44. Oprea TI (2002) Current trends in lead discovery: are we looking for the appropriate properties? J Comput Aid Mol Des 16:325–334. doi:10.1023/A:1020877402759
45. Xu J, Hagler A (2002) Chemoinformatics and drug discovery. Molecules 7:566–700. doi:10.3390/70800566
46. Seifert HJM, Wolf K, Vitt D (2003) Virtual high-throughput in silico screening. Biosilico 1:143–149. doi:10.1016/S1478-5382(03)02359-X

...
describe for the first time a multi-target, multi-output, and multi-scale ALMA model based on atom-based linear indices for ChEMBL data of ubiquitin–proteasome pathway inhibitory compounds Materials... measures to codify structural information, there is no report of a multi-target model for drug–target interactions for compounds with inhibitory activity against the ubiquitin–proteasome pathway... therapeutic target [3] The proteasome core has been a primary target for cancer therapy since the discovery of the proteasome inhibitor bortezomib, and at present, the process of proteasome inhibitors