Exploring different strategies for imbalanced ADME data problem: case study on Caco-2 permeability modeling

Mol Divers
DOI 10.1007/s11030-015-9649-4

FULL-LENGTH PAPER

Hai Pham-The1 · Gerardo Casañola-Martin2,3,4 · Teresa Garrigues5 · Marival Bermejo6 · Isabel González-Álvarez6 · Nam Nguyen-Hai1 · Miguel Ángel Cabrera-Pérez5,6,7 · Huong Le-Thi-Thu8

Received: 23 May 2015 / Accepted: 13 November 2015
© Springer International Publishing Switzerland 2015

Abstract In many absorption, distribution, metabolism, and excretion (ADME) modeling problems, imbalanced data could negatively affect classification performance of machine learning algorithms. Solutions for handling imbalanced datasets have been proposed, but their application for ADME modeling tasks is underexplored. In this paper, various strategies including cost-sensitive learning and resampling methods were studied to tackle the moderate imbalance problem of a large Caco-2 cell permeability database. Simple physicochemical molecular descriptors were utilized for data modeling. Support vector machine classifiers were constructed and compared using multiple comparison tests. Results showed that the models developed on the basis of resampling strategies displayed better performance than the cost-sensitive classification models, especially in the case of oversampling data, where misclassification rates for the minority class have values of 0.11 and 0.14 for training and test set, respectively. A consensus model with enhanced applicability domain was subsequently constructed and showed improved performance. This model was used to predict a set of randomly selected high-permeability reference drugs according to the biopharmaceutics classification system. Overall, this study provides a comparison of numerous rebalancing strategies and displays the effectiveness of oversampling methods to deal with imbalanced permeability data problems.

Keywords ADME modeling · Caco-2 cell permeability · Biopharmaceutics classification system · Support vector machine · Cost-sensitive learning · Resampling technique

Electronic supplementary material The online version of this article (doi:10.1007/s11030-015-9649-4) contains supplementary material, which is available to authorized users.

Correspondence: Huong Le-Thi-Thu, ltthuong1017@gmail.com

1 Hanoi University of Pharmacy, 13-15 Le Thanh Tong, Hanoi, Vietnam
2 Departament de Bioquímica i Biologia Molecular, Universitat de València, Burjassot, 46100 Valencia, Spain
3 Unidad de Investigación de Diseño de Fármacos y Conectividad Molecular, Departamento de Química Física, Facultad de Farmacia, Universitat de València, Valencia, Spain
4 Facultad de Ingeniería Ambiental, Universidad Estatal Amazónica, Paso lateral km 1/2 via Napo, Puyo, Ecuador
5 Department of Pharmacy and Pharmaceutical Technology, University of Valencia, Burjassot, 46100 Valencia, Spain
6 Department of Engineering, Area of Pharmacy and Pharmaceutical Technology, Miguel Hernández University, 03550 Sant Joan d’Alacant, Alicante, Spain
7 Unit of Modeling and Experimental Biopharmaceutics, Chemical Bioactive Center, Central University of Las Villas, 54830 Santa Clara, Villa Clara, Cuba
8 School of Medicine and Pharmacy, Vietnam National University, 144 Xuan Thuy, Hanoi, Vietnam

Abbreviations
AD          Applicability domain
ADME        Absorption, distribution, metabolism, and excretion
AURC        Area under the ROC
BCS         Biopharmaceutics classification system
BE          Bioequivalence
C           Penalty parameter
CD          Critical distance
EMA         European medicines agency
F           Bioavailability
FN          False negative
FDA         US Food and Drug Administration
FP          False positive
H class     High-permeability class
HIA         Human intestinal absorption
IVIVC       In vitro–In vivo correlation
MD          Molecular descriptor
M-P class   Moderate-to-poor permeability class
Papp        Apparent permeability coefficient
RBF         Radial basis function
ROC         Receiver operator curve
SMOTE       Synthetic minority oversampling technique
SVM         Support vector machine
SVs         Support vectors
WHO         World health organization

Introduction

Issues
associated with imbalanced class distribution are frequently encountered in real-world applications of machine learning and data mining methods [1]. As reported in the literature, data imbalance-related issues normally originate from two distinct problems: (i) the interclass imbalance, where the distribution of class labels varies widely, and (ii) the within-class or intraclass imbalance, where the distribution of members within each class is unequal [2]. In most applications, regardless of the degree of imbalance, classifiers tend to learn from prevalent classes while ignoring the small classes. Consequently, the overall predictions are often biased toward the majority class, and the resulting “apparent” high overall accuracy is meaningless once the minority class is also considered [1].

Numerous strategies have been proposed to handle imbalanced datasets. In general, these can be grouped into two categories based on their rebalancing targets: (i) reconsidering the misclassification cost, or cost-sensitive learning, and (ii) creating a more balanced class distribution in the training set, or data sampling [2]. Since the cost-sensitive approaches require modification at the algorithm level, resampling methods that only change the data distribution are considered more applicable [3]. However, to date there is little evidence showing that one strategy is better than the other. Consequently, a comparative analysis is necessary to draw valid conclusions when applying a strategy.

On the other hand, for the classification of many ADME properties, the collected data often display imbalanced class distribution. Evidence can be found in reported studies of blood-brain barrier (BBB) penetration [4], adenosine triphosphate (ATP)-binding cassette (ABC) transporter [5], protein binding [4], Cytochrome P450 (CYP) enzyme substrate/inhibitor specificity [6], human intestinal absorption (HIA) [7,8], bioavailability (F), and so on [4]. In most cases, researchers agreed that the class imbalance problem
is critical in data modeling. Therefore, the management of imbalanced datasets should receive special attention in order to improve and rebalance model performance. Solutions for imbalanced data exist, but their applications in ADME modeling are still limited. Therefore, an exploratory analysis of the most common strategies reported to deal with the imbalanced ADME data problem is needed.

Among ADME properties, permeability, the capability of a drug to penetrate across the human gastrointestinal tract (GIT), is a key factor governing human intestinal absorption (HIA) [9]. In vitro models, such as MDCK (Madin-Darby Canine Kidney) and Caco-2 (adenocarcinoma cells derived from colon), are widely used as high throughput screening (HTS) methods for the permeability assessment of drug-like molecules in the early stages of drug discovery [10]. In particular, the Caco-2 monolayer, which exhibits morphological as well as functional similarities to human intestinal enterocytes, is considered a better surrogate marker for estimating in vivo drug absorption than other epithelial cell cultures [11,12]. Currently, Caco-2 monolayers are recommended by numerous regulatory agencies for the permeability classification of drugs according to the biopharmaceutics classification system (BCS) [11,13,14]. Prior to running in vitro Caco-2 assays, reliable permeability prediction through the use of computational (in silico) tools is very useful, e.g., to prioritize compounds to be tested as well as to guide the structural optimization in order to improve the absorption profile of lead compounds. However, accurate in silico permeability prediction is one of the most difficult tasks in ADME modeling [12,15,16]. Data reported in the literature show considerable inter- and intra-laboratory variability [17]. We therefore consider that permeability data analysis by classification is better than by regression methods. By dividing the dataset into two groups (high vs medium–low permeability), the within-group
variability of the experimental data could be greatly reduced. In our previous studies, we developed various classification models based on a large and heterogeneous Caco-2 permeability database compiled from different sources [12,15]. Unsurprisingly, the results supported our hypothesis that classification models could overcome the data variability problem and accurately estimate experimental in vitro measurements. These studies also demonstrated the potential of quantitative structure-property relationship (QSPR) models in the HIA prediction of compounds that undergo passive transport mechanisms. In this regard, a suitable permeability cut-off value that maximizes in vitro–in vivo correlation (IVIVC) should be defined [18]. Given that the choice of threshold value is to some extent arbitrary and directly affects both the data skewness and the model performance, it is of the utmost importance that imbalance problems be treated beforehand in order to develop accurate permeability classification models [11].

Based on the challenges mentioned above, the main goals of this work are as follows: (i) to investigate the potential of the most common rebalancing strategies (at the data and the algorithm levels) reported in the literature for the prediction of a large and imbalanced Caco-2 permeability database; (ii) to obtain reliable classification models that improve the prediction of high-permeable compounds taking into account the permeability class definitions of the BCS; and (iii) to corroborate the in silico predictions with in vitro Caco-2 permeability of 47 drugs belonging to different classes of the BCS. In particular, we briefly review the principles of all the rebalancing strategies, concentrating on the way they should be applied for handling imbalanced permeability data classification problems using a standard machine learning technique, such as support vector machine (SVM).

Materials and methods

Computational method

Data collection and permeability class definition

In this study, a large and heterogeneous database composed of 1116 organic compounds was carefully assembled from more than 320 published articles. The data collection was routinely performed taking into account those factors that could mainly contribute to the variability of experimental results as described in the literature [11,12,15,17,18]. Since a number of compounds appeared with two or more in vitro assays, the mean values were employed, excluding values lying outside of the mean ± 2SD (standard deviation) range. In a preliminary inspection of the dataset, 12 compounds with very low or very high molecular weight (MW ≤ 60 or MW ≥ 1400 Da) were excluded from the modeling set because they are most likely to cross the cell membrane via carrier-mediated (active) transport [15]. Furthermore, after calculating molecular descriptors, a set of 14 quaternary amines and a sodium salt that appeared with some missing descriptor values was also excluded from our modeling procedure (see Supplemental Table S3). The remaining 1089 molecules, having different physicochemical characteristics such as molecular size, polarizability, hydrophilicity, lipophilicity, and molecular charge, were used to develop global classification models for the prediction of Caco-2 cell permeability.

Based on the guidance provided by the US Food and Drug Administration (FDA) for the application of in vitro permeability data in the context of BCS, we defined the high permeability class that maximizes the fundamental tradeoff between in vitro permeability and human oral absorption [14]. In the literature, Metoprolol (average apparent permeability Papp = 20 × 10^−6 cm/s with HIA rate of 96 %) is widely used as a reference compound to discriminate high from low permeable drugs. However, current definitions of the high permeability class based on an HIA value of 90 % are arguably too constrained [19]. Therefore, a new threshold was adapted taking into account the lower confidence interval rule (0.8–1.25) commonly recommended by the FDA for assessing the bioequivalence (BE) of drug products (available at http://www.fda.gov/cder/orange/default.htm) [11]. This method was successfully developed by Kim et al. to differentiate between high and low in situ permeability compounds, using the 90 % confidence interval of the permeability ratio of the test compound to a reference compound such as Metoprolol [20]. Finally, 352 compounds were assigned to the high-permeability class (H class), while the remaining 737 compounds belonged to the moderate-to-poor permeability class (M-P class). The latter class outnumbered the H class by more than two times. This is a typical case of a moderately imbalanced binary dataset with overlapped classes.

Molecular descriptor calculation

Oral absorption is a complex process affected by physicochemical properties of the drug, drug formulation, and gastrointestinal physiology factors [9]. Therefore, physicochemical descriptors should be preferentially considered for the construction of classification models. In this study, 115 molecular descriptors (MDs) belonging to different families (constitutional, ring, functional group counts, atom-centered fragments, charge descriptors, molecular properties and drug-like) were calculated using the SMILES code of each compound as input to the DRAGON software v.6.0 [21]. For the calculation of charge descriptors, a preliminary semi-empirical PM3 structural optimization based on the Polak–Ribiere algorithm implemented in Hyperchem v.8.0 [22] was performed for each compound.

Selection of training and test sets

The selection of training and test sets was performed using k-means cluster analysis (k-MCA) implemented in STATISTICA v.8.0 [23]. Firstly, the dataset was split into k different clusters (k < 20) of the highest possible dissimilarity. The Fisher ratio and p-level of significance (p < 0.05) were considered to select the optimal number of variables included in the analysis and the number of clusters that represented the structural information
of the dataset. Subsequently, compounds of the training and test sets were randomly selected from the previous clusters. In this procedure, the linkage distance of the members in each cluster was taken into account in such a way that for each compound in the training set there is always a similar compound in the test set. Finally, the dataset was divided into two sets of 871 and 218 compounds, which corresponded to the training and test sets, respectively. The test set was never used in the development of classification models.

Additionally, a set of 47 compounds with experimental Caco-2 permeability was collected to corroborate the computational predictions. Many of them are well-studied drugs that have been commonly used as internal reference standards in Caco-2 cell permeability assays for testing new compounds. The prediction of this set is very useful for demonstrating the suitability and validity of the proposed threshold in the context of the BCS.

Support vector machine and effects of imbalanced data on model construction

Support vector machine (SVM) was used for model building in this work. Since the pioneering theoretical studies of Vapnik and Burges [24,25], the SVM technique has gained extensive popularity in many research fields, particularly for modeling ADME and toxicity data. SVM embodies the structural risk minimization (SRM) principle, which takes into account the capacity of the classifier (similar to model complexity) and the trade-off between minimization of training error and reduction of the model complexity [24]. A more detailed description of the SVM theory can be widely found in the literature. Herein, we only briefly describe basic SVM theory and methods to resolve the imbalance problem based on the SVM algorithm.

Generally, each object in SVM is described by an input vector $x_i$ of $k$ real numbers (features or descriptors), which corresponds to a point in a $k$-dimensional space. If the number of training cases is $n$, then the training set is described as $X_{Train} = \{(x_i, y_i)\},\ i = 1, \ldots, n$, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ik})$ are input vectors. The response variable $y_i = +1/-1$ corresponds to the H and M-P classes in this work. Here we applied the classification SVM type to find an optimal separating hyperplane (maximum margin) by solving the following primal optimization problem [26]:

$$\min_{w,b,\xi}\; \frac{1}{2} w^{T} w + C \sum_{i=1}^{k} \xi_i \quad \text{subject to} \quad y_i\left(w^{T}\phi(x_i) + b\right) \ge 1 - \xi_i, \tag{1}$$

where $C$ is the capacity constant, $w$ is the vector of coefficients, $b$ a constant, and $\xi_i$ are parameters for handling non-separable input data (non-negative slack variables, $\xi_i \ge 0,\ i = 1, \ldots, k$). The term $\phi(x_i)$ is a feature function (implicitly defined by the kernel function) that maps $x_i$ into a higher dimensional space. It should be noted that the parameter $C$, also called penalty, represents the trade-off between the empirical error and the margin. In theory, when a large value of $C$ is set, the model optimization procedure will choose a small margin for the optimum separation hyperplane (OSH) so that the final model is less tolerant of errors and keeps the number of misclassifications small. However, increasing $C$ too much can make the model lose its generalization capability and easily overfit. Choosing an appropriate value of $C$ is one of the main tasks for developing a good SVM classifier [24]. Eq. 1 can be solved using Lagrange multipliers, and the final classification function is

$$\operatorname{sgn}\left(w^{T}\phi(x) + b\right) = \operatorname{sgn}\left(\sum_{i=1}^{k} y_i \alpha_i K(x_i, x) + b\right) \tag{2}$$

For linearly separable classes, it is not necessary to introduce a slack variable or feature function. In this case, the optimization problem from Eq. 1 can be solved as a simple quadratic program under linear constraints. Conversely, for the classification of linearly non-separable data, a kernel function $K(x_i, x)$ that maps the input variable into a higher dimensional space is defined in order to determine the shape of the OSH. Several kernel functions have been used, such as linear, polynomial, radial basis function (RBF), and sigmoid kernels. Among them, RBF is the most widely used and achieves, almost invariably, good performance [27]:

$$K(x_i, x_j) = \exp\left(-\gamma \left\lVert x_i - x_j \right\rVert^{2}\right) \tag{3}$$

In this work, the RBF kernel was considered due to its advantages over other kernels [27]. The value of gamma ($\gamma$) was held at a fixed value during the model optimization. In order to determine the best value of parameter $C$, a selection procedure using 10-fold cross-validation was performed. For the construction of SVM models, we applied the extreme decomposition algorithm of Platt (SMO) and libSVM [26,28] implemented in WEKA v.6.0 [29]. The principle of maximal parsimony (Occam’s razor) was taken into account. Consequently, only classifiers displaying good performance with fewer variables and fewer support vectors (SVs) were selected.

Various studies have analyzed the negative effect of data imbalance on SVM performance [30,31]. In summary, three main problems were pointed out. The first one is the boundary skewness. In this problem, the imbalanced ratio of the two classes in the training set causes the minority examples to reside away from the “ideal boundary.” The SVM then tends to learn a boundary that is too close to these instances and therefore displays imbalanced performance. The second one is related to the specification of the constant $C$, also known as the weakness of soft margins [30]. As can be seen from Eq. 1, the single trade-off $C$ assumed for two unbalanced classes does not account for the different cumulative errors between classes. Veropoulos et al. proposed a strategy to increase the $C$ associated with the minority class [32]. The last effect is the imbalance of the SV ratio. It is believed that this imbalance can make the classification of a test instance close to the separation hyperplane skew toward the majority class [33]. However, according to the Karush–Kuhn–Tucker (KKT) conditions [24], the $\alpha_i$ terms in Eq. 2, which act as weights in the final decision function, must satisfy

$$0 \le \alpha_i \le C \quad \text{and} \quad \sum_{i=1}^{k} \alpha_i y_i = 0 \tag{4}$$

As described in Eq. 4, the $\alpha_i$ values corresponding to the minority class must be larger than
those of the majority class. Therefore, SVs of the smaller class generally receive higher weights than those of the prevalent class, which can partly reduce the effect of imbalance in SVs [30]. Herein, the classification imbalance was analyzed by comparing the number and the ratio of the SVs corresponding to the H and M-P classes.

Solutions for imbalanced data problem

There are two main approaches for dealing with the imbalanced data problem: algorithm learning level (cost-based) and data level (resampling) strategies.

Algorithmic level strategies

At the algorithmic level, choosing an appropriate inductive bias for the target class is a common strategy. As described above, the penalty constants associated with classification error for each class should be modified. Consequently, the H class was assigned a higher penalty than the M-P class, and Eq. 1 becomes

$$\min_{w,b,\xi}\; \frac{1}{2} w^{T} w + C_{H} \sum_{y_i = 1} \xi_i + C_{M\text{-}P} \sum_{y_i = -1} \xi_i \quad \text{subject to} \quad y_i\left(w^{T}\phi(x_i) + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0,\ i = 1, \ldots, k, \tag{5}$$

where $C_{H}$ and $C_{M\text{-}P}$ are different penalty constants associated with the error terms for the H class and M-P class, respectively. For this method, we used the C-SVM algorithm of LibSVM integrated in the WEKA environment, where different weight parameters can be added for each class [26]. In this case, for two classes, we considered the ratio of the number of compounds in every class to estimate the suitable weights. As a consequence of this setting, the “weights” parameter was “0.5 − 1,” so that the penalty parameters for the M-P and H classes are 0.5×C and 1×C, respectively (the C optimization procedure was described above).

Another algorithmic level method is cost-sensitive learning. In theory, standard classification algorithms assume a balanced data distribution and an equal level of importance for each class. Therefore, standard data mining techniques could be inefficient when trying to predict a minority class in an imbalanced dataset or when a false negative (FN) is considered to be more important than a false positive (FP) [34]. Regarding this problem, cost-sensitive
classifiers that take the cost matrix into consideration during model building and generate models with the lowest expected cost were employed. The cost matrix can be seen as an overlay to the standard confusion matrix. In this case, the user can control the FN and FP rates by increasing the misclassification cost of minority (H) class compounds with respect to the cost for misclassifying the majority (M-P) class. Metacost was selected to build models based on the criteria of a previous study [35]. This method re-labels the training samples with their minimum expected cost classes, and then rebuilds the models based on the modified training set. For WEKA-SVM, using the BuildLogisticModels procedure is recommended for better estimation of probability [34]. We gradually augmented the FN cost 20 times in stages from 1.5 to 2.5, while maintaining a cost of 1 for FP, and subsequently compared the resulting FN rates to the original SVM performance without any modified cost. The cost was increased until this difference reached 10 % for the cross-validation model, which coincided with an FN cost of 1.8. The misclassification cost of the H class was therefore set to 1.8, while that of the M-P class was maintained at the value of 1.

Data level strategies

At the data level, several approaches to resample (over- or undersample) the original data were used in this work. The resampling approaches are well known to be more flexible than cost-based methods [36]. However, there are some limitations of resampling approaches that one should take into account before developing computational models. Firstly, in medicinal chemistry, changing the number of instances could lead to different ADs in predictive space, especially with undersampling, which can affect the robustness of in silico models. Secondly, since the optimal class distribution of the training data is usually unknown, learning from a forced sample distribution may not reflect the real-world population distribution, which is usually randomly drawn. Nevertheless, resampling continues to be
the most widely used strategy because it could approximate the target population (rebalancing sample) better than the original (biased sample), even though random effects can no longer be considered [30].

For the undersampling approach, a random subsample of the dataset was drawn using the supervised SpreadSubsample filter in WEKA. This technique allows us to specify the maximum “spread” between the rarest and most common class. This means that the selected subsample not only can achieve a desirable uniform distribution of the two classes, but also preserves the relationship among classes in the original data sets (the maximum class distribution spread). Therefore, “in theory,” this method may be useful in rebalancing the data distribution without throwing away valid instances. In this work, after choosing a distribution spread of 1, the training set was divided equally into two classes. The instance weights were not considered, so this factor cannot influence the global error.

Since undersampling data may cause information loss, other approaches that keep the majority class, or at least remove fewer instances, are desirable. A common strategy is based on multiplication of the minority class, namely the oversampling method. However, the main drawback of this method is the possible over-fitting problem [30]. Therefore, a sophisticated method proposed by Chawla et al. [37], namely SMOTE (Synthetic Minority Oversampling Technique), was used in this work to preprocess the data. This approach generates synthetic samples of the minority class on the basis of the similarity principle. In brief, each new minority sample is created from the two nearest neighbors and has a structure similar to theirs [37]. Generally, SMOTE presents some advantages over the random oversampling approach because of its informed properties and its rebalancing capability without causing the over-fitting problem. In WEKA, the user can specify the amount of SMOTE and the number of nearest neighbors. As default configuration, the number of nearest neighbors
was kept at 5, and 100 % of SMOTE instances were created.

Lastly, we explored another sampling approach in the WEKA filter module, namely Resample. This method refers to a supervised filter that combines undersampling and oversampling. It works effectively in cases of large and sparse data in which much noise must be eliminated. In the oversampling, random sampling with replacement guarantees the same covariance between two samples. This filter can be made to bias the class distribution toward a uniform distribution, and we set its bias parameter accordingly. In this step, we resampled a new dataset with the same number of samples as in the original training set.

Modeling procedure

According to the initial purpose, various SVM models were developed to identify imbalance problems and to find optimal solutions. Figure 1 describes our model building sequence. As can be seen in Fig. 1, a support vector classifier was first developed with the original imbalanced dataset (Primal-SVM) to identify the problems. Performance of internal calibration (by 10-fold cross-validation) was the main criterion for the model selection. The best variable subset was selected using the wrapper method [38]. The negative effect of the imbalanced data distribution was revealed by comparing misclassification rates.

After identifying the problem, eight models corresponding to strategies for combating the classification imbalance problem were constructed. For the analysis of algorithmic level methods, two models (Metacost and CSVM) were developed. As described above, Metacost is a classifier implemented in WEKA that reweights the training samples according to the whole cost assigned to each class in the cost matrix. The CSVM model was constructed by increasing the penalty constant associated with misclassification rates of the H class with respect to those of the M-P class. At the data level, subsampling, oversampling, and a combination thereof were analyzed. For the subsampling method, the number of majority class (M-P class) compounds was
halved by applying the SpreadSubsample filter (models SS1 and SS2). The oversampling method was carried out using the SMOTE algorithm, doubling the number of minority H class compounds (SMOTE1 and SMOTE2). The Resample filter of WEKA was applied for combining the previous approaches (RS1 and RS2). The influence of the feature selection procedure on the sampling techniques was analyzed by building two models for each preprocessing filter. The models numbered 1 (SS1, SMOTE1, and RS1) were built directly from the same variable set as Primal-SVM, whereas the models numbered 2 (SS2, SMOTE2, and RS2) were developed with new variables selected from the new balanced data distribution. All models were subjected to a revalidation process with the original imbalanced training set (871 compounds) to check the AD. Finally, an external test set of 218 compounds was predicted by each model for the robustness analysis of the applied methods.

As an additional point of interest, a consensus system was constructed by voting on the predictions of the obtained models. This simple ensemble method was analyzed according to the imbalance degree of the classification results on the test set. The BCS permeability prediction of the 47 reference drugs was also discussed.

Wrapper feature selection

In this study, the wrapper approach was used for feature selection. Based on SVM algorithms, two search algorithms were used in sequence for the wrapper: the hill-climbing (greedy) and the best-first method. In the first step, we performed a greedy backward search through the entire space of attributes. This method, so-called backward elimination, starts with the full set of features and greedily adds or removes the one that most improves performance or degrades performance slightly (without backtracking) [39]. In theory, going backward from the full set of features may capture interactions among features more easily; however, the main drawback of this algorithm is its expensive computational cost. In order to improve the hill-climbing feature subset selection, best-first search was
subsequently executed on the GreedyStepwise results. Essentially, best-first search selects the best variable from the entire feature space and subsequently adds new variables to the model so that this new subset still displays significant improvement [38]. This procedure stops when no improvement is found, and the final variable subset is returned. All these feature selection methods were performed using WEKA v.6.0 [29]. For the GreedyStepwise method, backward search was chosen rather than forward.

Fig. 1 Strategies explored for overcoming the imbalanced Caco-2 data problem. Models developed with variable subset selected from [a] Primal-SVM model and [b] independently from Primal-SVM model. Asterisk: all sampling techniques were performed only on the training set

Classifier evaluation: applicability domain and performance measurements

The AD is an important aspect for the evaluation of all rebalancing approaches, especially when subsampling strategies have been applied. The need to define an AD for the developed models is associated with their ability to generate reliable predictions in terms of chemical structures, physicochemical properties, and mechanisms of action. In this regard, the AD of each model was determined based on three methods (ranges, Euclidean distance, and probability density) integrated in the AmbitDiscovery software [40].

As for the performance assessment, global accuracy (Q) is not an appropriate criterion to evaluate classifiers developed from imbalanced datasets [1]. The main concern is related to the prediction of high-permeable compounds. For this purpose, seven performance measures derived from the confusion matrix were used (Table 1).

Table 1 Confusion matrix and common performance measures for the classifier evaluation

                        Actual H class          Actual M-P class
Predicted H class       True positives (TP)     False positives (FP)
Predicted M-P class     False negatives (FN)    True negatives (TN)

Specificity (Sp) = TN/(TN+FP)
Sensitivity (Se) = Recall = TP/(TP+FN)
Precision (Pr) = TP/(TP+FP)
G-mean (G) = (Sp × Se)^{1/2}
F-measure (F1) = 2/(Se^{-1} + Pr^{-1})
Matthews correlation coefficient (MCC) = (TP × TN − FN × FP)/[(TP + FN)(TP + FP)(TN + FN)(TN + FP)]^{1/2}
Accuracy (Q) = (TN+TP)/(TN+TP+FN+FP)

Another criterion to assess the classification performance is the Receiver Operator Curve (ROC). The ROC graphs are two-dimensional graphs in which the true positive rate, TPrate = TP/(TP+FN), is plotted on the Y-axis, and the false positive rate, FPrate = FP/(FP+TN), is plotted on the X-axis by varying the decision threshold. Here, we evaluated the quality of classifiers by means of the area under the ROC curve, abbreviated AURC. It has been shown that there is a clear similarity between AURC and the well-known Wilcoxon statistics [41,42].

Statistical comparison of classifiers

In this study, various strategies have been explored in order to address the imbalanced data classification challenges. The obtained models have been analyzed in terms of classification accuracy and rebalancing ability. However, there remains a basic need to provide a general comparison between classifiers that can lead us to a better understanding of strategy improvements [43]. To this end, several non-parametric statistical tests were performed [44]. This comparison procedure starts with multi-comparison statistical tests using the Friedman test [45] and Iman-Davenport tests [46], with the null hypothesis that all the classification models have no difference on average. If the null hypothesis was rejected in any of the previous tests, post-hoc tests were subsequently applied using the Bonferroni-Dunn test at α = 0.05 and 0.10 [47]. Essentially, this test measures an average rank $R_j = \frac{1}{N}\sum_i r_i^j$ corresponding to each classifier and uses a critical distance (CD) to declare the significant difference of ranked classifiers from the best one [43]. An advantage of the Bonferroni-Dunn test is that it is easier to describe and to visualize because it uses the same CD for all
comparisons. The value of CD can be computed from the formula:

$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}} \qquad (6)$$

where $q_\alpha$ is the critical value for the chosen significance level, $k$ is the number of classifiers compared, and $N$ is the number of cases over which the classifiers are ranked. This procedure was performed using in-house software adapted from Demsar's study [43]. More details of the methodology can be found in our previous comparative study of other non-linear machine learning techniques [44,48].

Results and discussion

Identification of unbalanced data problem with standard SVM algorithm

The Primal-SVM model was obtained with 10 variables from the original imbalanced training set of 871 compounds. Unsurprisingly, this model was significantly biased toward M-P compounds. A set of 505 compounds from 611 cases were correctly predicted as M-P class (about 83 % of Sp), while only 176 instances were correctly classified as H class, representing a sensitivity (Se) of 62.41 % for this class. Although the overall accuracy of this model remained at an acceptable level (Q > 78 %), this classifier cannot be used for screening high-permeable compounds. The relatively low G-mean (about 73 %) reflected this situation. In addition, the selected variables still correlated well with permeability (MCC ∼ 0.5). Note that the model included only 10 variables and used a relatively low penalty (C = 320) for classification errors, as obtained from the parameter optimization procedure. The imbalance level of our data was 589/282, and Primal-SVM presented a similar (label) bias in its prediction outcome for the dataset (611/260). The above results showed that the class distribution was the main reason for the low performance. In addition, we did not observe any small clusters in the k-MCA analysis; therefore, the within-class imbalance problem can be ruled out. After checking the behavior of the SVs on the hyperplane surface, a balanced number of them was observed (of 455 SVs, 215 are of H class and 240 are of M-P class), suggesting that the skewed distribution and overlapped data are the basic reasons behind the imbalance problem.

Solutions at algorithmic level

As described above, the Metacost classifier was
developed by modifying the misclassification costs associated with each class. The performance of this model was slightly improved with respect to the Primal-SVM model. The accuracy of H class prediction was 71.28 %, an increase of nearly 9 percentage points over the first model (see Table 2). Using the same variable subset, only 193 (20 % of training set) boundary SVs were found, suggesting that Metacost is the simpler classifier. However, precision was relatively weak, and the F-measure was therefore still low. On the other hand, CSVM appeared to be effective in rebalancing the prediction results. The TP rates were 77.30 and 77.08 % for H and M-P class, respectively. However, increasing recall (Se) so much at the expense of precision (Pr) made Pr drop considerably (to about 61 %), and the overall accuracy did not improve. In comparison with Metacost, CSVM performance seemed to be slightly better. Generally, at the algorithmic level, Metacost and CSVM were tolerable; however, the overall improvements were not significant, and there was a trade-off between the Se and Pr measurements.

Solutions at data level

Subsampling method

Firstly, 307 compounds belonging to the M-P class were excluded from the training set. The remaining 564 compounds have the two classes equally distributed. The same variables previously selected for Primal-SVM were used to develop SS1, while for SS2 a new variable selection was carried out (Table 2). In comparison with Primal-SVM, the performances of the SS1 and SS2 models were better balanced, since the distribution of the data had been changed. However, the difference between the variable subsets selected by the SS1 and SS2 models clearly affected model performance. The classification accuracy of the M-P class by the SS1 model was very low (72.70 %) in comparison with Primal-SVM (85.74 %). The overall accuracy was therefore lower than that of Primal-SVM. Meanwhile, SS2 performance was significantly improved in comparison with Primal-SVM. A further analysis of the validation results of these models on the overall training and
test sets is necessary to confirm the robustness of the obtained models, because 307 "hidden" compounds were left out of the current training process.

Table 2 Performance of SVM models under 10-fold cross-validation following different strategies

Strategy           Model (a)        SV/data ratio (b)  Sp (%)  Se (%)  Pr (%)  G (%)  F1 (%)  MCC   Q (%)
Non                Primal-SVM (10)  455/871            82.65   62.41   67.69   73.15  64.94   0.49  78.19
Algorithmic level  Metacost (10)    193/871            85.43   71.28   63.81   75.82  67.34   0.51  77.61
                   CSVM (10)        –                  87.64   77.30   61.76   77.19  68.66   0.52  77.15
Data level         SS1 (10)         292/564            76.78   78.01   74.07   75.31  75.99   0.51  75.35
                   SS2 (8)          306/564            79.04   79.79   77.05   77.99  78.40   0.56  78.01
                   SMOTE1 (10)      599/1153           80.29   80.50   76.30   78.24  78.34   0.57  78.23
                   SMOTE2 (11)      543/1153           82.91   83.16   78.56   80.68  80.79   0.61  80.66
                   RS1 (10)         459/871            81.51   84.57   79.88   80.25  82.15   0.61  80.60
                   RS2 (9)          445/871            83.24   86.52   79.44   80.52  82.83   0.62  81.06

(a) SVMs obtained by the 10-fold cross-validation method; the quantity in parentheses indicates the number of variables included in each classifier
(b) Proportion between the number of support vectors and the size of the training set (original or resampled set). Results of the C optimization procedure by cross-validation: Primal-SVM (320), Metacost (510), CSVM (200 for H class and 100 for M-P class), SS1 (1100), SS2 (421.05), SMOTE1 (810), SMOTE2 (540), RS1 (490), and RS2 (572.73)

Oversampling method

After creating 282 new "artificial" cases by the SMOTE method, the training set size increased to 1153 "compounds." The performance of SMOTE1 and, especially, SMOTE2 was significantly rebalanced and improved in comparison with Primal-SVM (Table 2). The SMOTE2 model exhibited highly balanced performance, with an F-measure of about 81 % and a TP rate of >83 %. From a simple inspection, it is clear that oversampling was one of the most promising strategies for treating imbalanced data. Above all, the problem of a reduced AD could be ruled out. However, since these models considered an additional distribution and the FN numbers of SMOTE1 and
SMOTE2 were 110 and 95, respectively, which were similar to that of Primal-SVM (106), a revalidation on the original training set distribution is necessary to recognize the real discriminative capacity of these models.

Combination of simple subsample and oversample strategy: resampling approach

To generate a new balanced distribution, resampling, which makes use of the advantages offered by both subsampling and oversampling methods, was applied. A set of 280 cases from the H class was randomly chosen and duplicated, while 411 M-P cases were deliberately extracted, giving a final balanced dataset with the same number of cases as the original training set (871 compounds). As can be seen in Table 2, the two models RS1 and RS2 display similar and high performance. Interestingly, the distribution of support vectors (H vs M-P classes) and the number of variables selected in RS1 and RS2 were the same as in the Primal-SVM model. Additionally, given the high MCC and G-mean values, the current strategy should be a promising solution for overcoming the imbalance problem. As with the other data-based techniques, validation on both the original training set and the test set is essential to recognize the real effectiveness of this strategy.

Validation of classification models

Two validation processes were carried out: (i) on the original imbalanced dataset and (ii) on the external test set. In the revalidation of the training set, besides calculating performance measures, the number of compounds out of the model AD was taken into account. Results of the test set validation are presented in Table 3. A consensus model was also developed. The models developed by each strategy were compared with Primal-SVM. Unsurprisingly, models based on algorithm modifications showed only a slight improvement, while models of the other category are all noticeably better. Among the first group, CSVM still performs better than Metacost on the test set. Conversely, subsample, oversample, and a combination thereof should be appropriately applied for overcoming
imbalanced data problems. Note that there was an acceptable number (≤10) of compounds out of the ADs determined by all models. Therefore, the possibility of throwing out valuable information, which remained our principal concern when applying undersampling techniques, could be discarded by applying the currently proposed workflow.

An interesting finding was that the variable selection procedure significantly affected the predictive capability of the classification models, although the cross-validation results of the same algorithm seemed to be similar. That is, when sampling methods were applied, models built with a new variable set selected from the new balanced data distribution did not outperform models built with the variables selected from the original imbalanced data distribution. The 10 variables selected by Primal-SVM showed a surprisingly good predictive ability on the test set, especially when an appropriate sampling strategy was applied. Note that the two classes of the test set display the same unbalanced distribution as the original training set.

Table 3 Validation performance of obtained models on test set

Strategy           Model            OUT (a)  Sp (%)  Se (%)  Pr (%)  G (%)  F1 (%)  MCC   Q (%)  AURC
Non                Primal-SVM       0/0      80.00   53.62   66.07   68.33  59.20   0.43  76.39  0.834
Algorithmic level  Metacost         0/0      86.01   71.43   67.57   77.31  69.44   0.54  79.72  0.843
                   CSVM             0/1      87.94   75.71   68.83   79.65  72.11   0.58  81.19  0.840
Data level         SS1              4/2      91.06   83.58   62.22   80.07  71.34   0.57  78.87  0.845
                   SS2              4/1      87.83   79.71   55.56   74.51  65.48   0.46  72.90  0.830
                   SMOTE1           0/0      89.47   80.00   65.88   80.20  72.26   0.58  80.28  0.847
                   SMOTE2           0/1      86.76   74.29   67.53   78.29  70.75   0.56  79.81  0.863
                   RS1              7/2      90.23   81.43   69.51   82.09  75.00   0.62  82.33  0.858
                   RS2              2/0      91.67   85.71   62.50   80.36  72.29   0.58  78.70  0.871
                   Consensus model  0/0      89.13   78.57   68.75   80.81  73.33   0.60  81.65  –

(a) Number of compounds out of the applicability domain (training set/test set)

Analyzing the prevalence (ratio between TP and TN) of SMOTE1 and RS1, we observed a highly balanced
degree, namely 80.00 %/81.41 % in SMOTE1 and 81.43 %/82.76 % in RS1, suggesting that these two strategies are suitable for the current unbalanced data. Interestingly, the AURC analysis did not give a clear conclusion on the performance improvement over Primal-SVM (see Tables 2 and 3). There is little change from the first model, although the prevalence of the predictions has been rebalanced. A previous study showed that the proportion of positive to negative instances in the test set does not affect the ROC curve [49]. Here, on the unbalanced distribution of the test set, the use of the AURC for selecting an appropriate strategy might be misleading. The AURC might depend not on the class distribution but on the overlapped nature of the data.

Finally, a consensus model was constructed based on a voting mechanism. In general, this model slightly outperforms the best standalone models (SMOTE1 and RS1), with 80.81 % of G and 73.33 % of F1. As displayed in Table 3, this consensus model displayed a great advantage over the other classifiers in covering the AD. Furthermore, using the current multiclassifier could eliminate the concern about choosing the wrong solution [44,48].

Statistical comparison of classifiers

In order to illustrate the differences between the obtained models, all performances of the nine SVM models in the test set validation were subjected to multiple comparison procedures. As the first step, the average ranks corresponding to each classifier were calculated. Accordingly, all classifiers were ranked as follows: RS1 – RS2 – SMOTE1 – CSVM – SS1 – Metacost – SS2 – Primal-SVM (see Fig. 2).

Fig. 2 Rankings obtained through the Friedman test and graphical representation of the Bonferroni-Dunn procedure considering RS1 as control model. Significance levels α of 0.05 and 0.10 are expressed as continuous and dotted lines, respectively

The null hypothesis of the Friedman test was rejected (p = 0.00), so there is a significant difference between the classifiers obtained. The same result was observed with the Iman-Davenport test, with p <
0.0005. Subsequently, the post-hoc Bonferroni–Dunn test (at α = 0.05 and 0.10) was applied in order to reveal which classification models performed equivalently to the best-ranked model (RS1). The resulting CD values were 3.145 for α = 0.05 and 2.884 for α = 0.10. In comparison to the lowest bar, which corresponded to the best model (RS1), CSVM, SS1, SMOTE1, and RS2 were considered to have similar performances, since none of them exceeds the critical difference (CD) of the Bonferroni–Dunn test. Additionally, it is possible to identify models that are significantly worse than the others, viz. Primal-SVM and SS2.

Analysis of molecular descriptors and physicochemical impacts

Caco-2 cell permeability is a very complex process in which different mechanisms can operate simultaneously during the movement of a molecule across the monolayer. Among the main mechanisms described are transcellular and paracellular passive diffusion, influx/efflux mediated by transporters, passive transport mediated by membrane-bound proteins, receptor-mediated endocytosis, and electrostatic gradient-driven mechanisms [9]. Drug permeability is also affected by multiple physicochemical properties such as lipophilicity, hydrogen bonding interactions between solute and solvent, intramolecular hydrogen bonding, the "shape and size" characteristics, and the charge state of drugs [12].

Even though the SVM is considered a black-box modeling technique, an understanding of the relationships among the selected descriptors is possible and useful for a modeler who attempts to analyze the discriminative feature combinations for a further study of absorption screening. In this sense, a general physical interpretation, in structural terms, of the feature combinations according to their discriminant robustness is given. It is noteworthy that of the 115 zero- to one-dimensional MDs calculated, 35 were selected for the construction of models, and each model obtained has its own variable set (see Table 4).
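The Frequency column of Table 4 can presumably be read as the number of models whose variable set contains a given descriptor. A minimal sketch of such a tally follows; the per-model subsets in `model_descriptors` are hypothetical placeholders chosen for illustration, not the variable sets actually selected in this study, and the function name `descriptor_frequency` is likewise an assumption.

```python
from collections import Counter

# Hypothetical variable subsets: placeholders for illustration only,
# NOT the subsets actually selected for the models in Table 4.
model_descriptors = {
    "Primal-SVM": ["TPSA(NO)", "MLOGP", "nHDon"],
    "SMOTE1": ["TPSA(Tot)", "MLOGP", "nHDon"],
    "RS1": ["TPSA(Tot)", "ALOGP", "nHAcc"],
    "RS2": ["TPSA(Tot)", "ALOGP2", "DLS_02"],
}


def descriptor_frequency(subsets):
    """Count in how many models each descriptor appears."""
    counts = Counter()
    for descriptors in subsets.values():
        # set() ensures a descriptor is counted once per model
        counts.update(set(descriptors))
    return counts


frequency = descriptor_frequency(model_descriptors)

# List descriptors from most to least frequently selected
for name, count in frequency.most_common():
    print(f"{name}: {count}")
```

Descriptors that recur across independently selected subsets are natural candidates for the physicochemical interpretation that follows.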
The nine models were obtained with a relatively small number of variables (between 8 and 11). As depicted in Table 4, the MDs that best described the permeability in Caco-2 were those related to hydrogen-bonding capacity, polar surface area, logP/D, E-State, and molecular shape and size. The drug-like filter proposed by Oprea et al. [50] and the lead-like filter described by Congreve et al. [51] were included in the RS2 and CSVM models, respectively. These descriptors are defined as the number of violations of every rule. A simple inspection of both descriptors (DLS_02 and LLS_01) reveals some incongruence. About 80 % of the training set (696 compounds) did not fulfill at least of Oprea's rules; meanwhile, 63 % (546 compounds) almost fulfilled Congreve's lead-like rules (LLS_01 ≤ 0.33). In this sense, it is important to remark that many compounds of our dataset might not become drugs, so little knowledge could be extracted from them. In our opinion, the use of a "rule of thumb" to find new drug candidates should not be based on a general concept. The chemical spaces of "drug-like" and/or "lead-like" compounds should be identified by integrating knowledge about specific processes or mechanisms (rather than a global rule), such as solubility and permeability [12,52]. On the other hand, of all the descriptors selected, six were used more than times. Among them, TPSA(Tot) appeared in the three best models (RS1, SMOTE1, RS2), and octanol-water partition coefficients (MLOGP, ALOGP, ALOGP2) were included in models, highlighting the role of the hydrogen bonding capacity of molecules and of lipophilicity information, respectively. Both factors in combination could clearly describe two movements of a compound across the intestinal epithelial barrier: from the aqueous to the lipid environment, and from the apical to the basolateral side of the bi-lipid layer [12,15].

Comparison with other studies dealing with imbalanced ADME data problem

There are only a few studies reported in the literature addressing the class imbalance issues
in ADME modeling. In most cases, this problem is managed by directly applying specific methods, e.g., cost-sensitive and subsample approaches. These approaches rebalance classification performance without any analysis of imbalance effects and possible modeling strategies [5–8]. It was not easy to make a direct comparison between the findings of this study and those published previously because of the particularity of the data used, the imbalance degree, and the modeling techniques.

Among ADME properties, published datasets of oral absorption (HIA) were very biased toward the high absorption (HIA+) group [7]. Using the libSVM technique, Hou et al. explored the cost-sensitive approach to deal with the imbalance problem by modifying the penalty parameters (C) associated with the HIA+ and HIA− classes [7]. In that work, the value of C for the HIA− class was set to 5.5 times that of the HIA+ class. However, the classification of the final SVM model is still biased toward HIA+, with an accuracy of >99 % compared to 72 % for the HIA− class. This result is in agreement with the findings of the present study. In order to improve the classification of the HIA− class, Newby et al. developed various classification trees using two strategies for modeling the imbalanced absorption data of Hou [7,8]: undersampling and misclassification cost. However, that study did not clearly analyze the undersampling method or the effect of the random selection of majority cases on the AD. In contrast, the cost-sensitive strategy was studied well, taking into account various factors that contribute to the final model performance, such as variable selection and the costs assigned to the minority class. As the authors did not explore other strategies, it is difficult to compare their approach with other methods currently used for coping with the class imbalance problem.

In an interesting study by Eitrich et al., numerous strategies for modeling imbalanced ADME datasets were explored on the basis of the SVM technique [6]. In this study, the authors developed a new strategy combining
oversampling and a modified SVM algorithm. As an example, the proposed strategy was applied to classify a data set of CYP2D6 inhibitors (185 drug-like compounds). In general, this method seems to have potential for dealing with imbalanced ADME data problems. However, the proposed algorithm has not been widely implemented, and the validation of the new strategy remains limited.

In terms of modeling strategy, our study provided a comprehensive analysis using a large number of methods to deal with the imbalanced ADME data problem. The current study did not attempt to develop a new method but to find the best modeling procedure when facing a moderate imbalance problem such as that of the Caco-2 cell permeability data. There is no universal strategy for imbalanced datasets. Therefore, prior to developing models for classifying imbalanced data, the nature of the imbalance problem, the influence of feature selection, and the possible strategies should be analyzed and compared. In this context, our study could serve as a guideline for modeling imbalanced data in general and ADME-related properties in particular.

Table 4 MDs selected by the wrapper methods for the construction of SVM models

Family                   Descriptor  Frequency  Meaning
Constitutional           SCBO                   Sum of conventional bond orders (H-depleted)
                         nO                     Number of oxygen atoms
                         nX                     Number of halogen atoms
                         C (%)                  Percentage of C atoms
                         O (%)                  Percentage of O atoms
Ring                     nCIR                   Number of circuits
                         Rbrid                  Ring bridge count
                         NRS                    Number of ring systems
                         nR06                   Number of 6-membered rings
                         D/Dtr05                Distance/detour ring index of order 5
                         D/Dtr06                Distance/detour ring index of order 6
Functional group count   nCar                   Number of aromatic C(sp2)
                         nCb-                   Number of substituted benzene C(sp2)
                         nR=Cs                  Number of aliphatic secondary C(sp2)
                         nHDon                  Number of donor atoms for H-bonds (N and O)
                         nHAcc                  Number of acceptor atoms for H-bonds (N, O, F)
                         nHBonds                Number of intramolecular H-bonds (with N, O, F)
Atom-centered fragments  C-002                  CH2R2
                         C-025                  R--CR--R
                         C-040                  R-C(=X)-X/R-C#X/X=C=X
                         H-050                  H attached to heteroatom
                         O-058                  =O
Charge                   qnmax                  Maximum negative charge
                         Q2                     Total squared charge
                         RPCG                   Relative positive charge
                         PCWTE1                 Partial charge weighted topological electronic index
Molecular properties     TPSA(NO)               Topological polar surface area using N,O polar contributions
                         TPSA(Tot)              Topological polar surface area using N,O,S,P polar contributions
                         MLOGP                  Moriguchi octanol-water partition coeff. (logP)
                         ALOGP                  Ghose-Crippen octanol-water partition coeff. (logP)
                         ALOGP2                 Squared Ghose-Crippen octanol-water partition coeff. (logP^2)
                         SAtot                  Total surface area from van der Waals surface area (P_VSA-like) descriptors
                         SAdon                  Surface area of donor atoms from van der Waals surface area (P_VSA-like) descriptors
Drug-like                DLS_02                 Modified drug-like score from Oprea et al. (6 rules)
                         LLS_01                 Modified lead-like score from Congreve et al. (6 rules)

Suitability of proposed permeability threshold for BCS permeability class prediction

Permeability measurement using Caco-2 monolayers is recommended by the FDA and the European Medicines Agency (EMA) for provisionally assigning BCS class [11]. These data can be used for developing in silico models for BCS classification, since the clinical dose and the human absorbed fraction are generally unknown at the initial stage of drug discovery [11]. A small set of 47 drugs with a broad range of structural and permeability properties was randomly collected from reported homogeneous experiments [12,15,53,54]. Subsequently, their Caco-2 permeability classes were predicted by the obtained models and analyzed in the context of regulatory acceptance. The absorption profiles and BCS classification of this subset have been studied [11,12]. Analysis of the concordance between the permeability predicted in the Caco-2 model and the BCS permeability class can give further support for the applicability of in silico approaches in BCS classification [11,13]. The average values of experimental permeability are shown in Table 5. All models predicted this set of drugs. The FN/FP ratio was: Primal-SVM (8/2), Metacost
(4/5), CSVM (3/5), SS1 (4/4), SS2 (1/6), SMOTE1 (4/3), SMOTE2 (5/2), RS1 (4/4), RS2 (4/5), and consensus model (3/3). Given the use of Caco-2 permeability data for BCS classification, a model with a lower FN is desirable. Of course, FP should also be kept at an acceptable level. Table 5 summarizes the predicted results of the consensus model. Only six compounds out of 47 were incorrectly predicted (Amlodipine, Chloramphenicol, Cimetidine, Cortisol, Rimonabant, and Trimethobenzamide), while Haloperidol was non-classified. As can be seen, the consensus model displayed very good performance. Its accuracy was 86.9 % for both the M-P and H classes, whereas the corresponding results for Primal-SVM were 95.7 and 65.2 %, respectively. The predictive power and rebalancing capacity of the consensus model and the applied strategies were thus evidenced.

Regarding the suitability of the cut-off value selection, it is of great interest to analyze the concordance between the in vitro permeability classification of drugs and their allocations in the BCS reported in the literature [11,13]. The first case is Acetylsalicylic acid (ASA), which is classified by the World Health Organization (WHO) as class I (high solubility, high permeability) [13]. Higher Caco-2 permeability values have been reported, and the high permeability class of ASA was justified [55]. Therefore, it is suggested that our criterion is too strict for classifying ASA. On the other hand, compounds such as Azithromycin (HIA of 37 %), Etoposide (HIA ≤ 60 %), and Ivermectin (HIA of 56 %) do not have a conclusive permeability class yet [13]. Their poor absorption could be due to poor solubility, or to poor solubility and poor permeability in combination. Nevertheless, their quite low permeability values in in vitro Caco-2 assays (Papp < × 10−6 cm/s) suggest that the absorption process for these compounds is solubility/permeability-limited. Also, it is well known that Etoposide and Ivermectin are strongly affected by efflux mechanisms in Caco-2 cells [56]. There
are drugs (Chloramphenicol, Codeine, Digoxin, and Haloperidol) whose permeability classification according to the in vitro Caco-2 permeability values differs from the reported BCS classes [11]. The BCS classification for these compounds is questionable. Chloramphenicol, Codeine, and Haloperidol were commonly reported in the literature with high HIA and F values [57]. Digoxin has a high absorption variability, which could be explained by its active transport and the inhibition of efflux transporters such as P-glycoprotein (P-gp) [11]. The current BCS class for Digoxin could be regarded as misleading, and the BCS classification (I or III) depends on particle size [58]. In general, the Caco-2 permeability and Fa data showed a clear rank-order relationship. All H class compounds have HIA ≥ 85 %, and 81.8 % of the M-P class have HIA < 85 %. In accordance with our previous study [11], these findings confirm that in silico modeling of Caco-2 permeability using the proposed threshold is suitable for provisional BCS permeability classification.

Concluding remarks and future direction

It has been shown that solutions for combating imbalanced data problems are essential in model construction for Caco-2 permeability predictions. Although the imbalance degree is moderate, it can seriously influence model quality, for example by delivering poor performance or making it impossible to predict the high-permeability class, among other misleading conclusions. ADME processes are, in general, complex and difficult to predict; modeling an imbalanced dataset can exacerbate the problems. Identifying the nature of the problems and proposing effective solutions remain two essential points in model construction. In this study, these important ADME modeling issues were addressed. Some important remarks can be made. Firstly, oversampling and the combination of oversampling and subsampling are the most effective strategies. Secondly, the variable selection procedure in the case of resampling methods must be carried out
based on the original imbalanced distribution of the data. The procedure should be (i) developing a preliminary model from the imbalanced data distribution, and then (ii) using the same variable subset selected in step (i) for the further application of subsampling, oversampling, and so on. The results also indicated that our models, based on the penalty-modified SVM, were accurate and very consistent. The SpreadSubsample method implemented in WEKA can work well without a loss of any valuable information. Finally, the ROC curve must be used with care when evaluating models constructed on an imbalanced database, since it can lead to the same kind of conclusion as overall accuracy, which tends to misrepresent this problem.

Table 5 Caco-2 experimental permeability, HIA and in silico predictions of 47 reference drugs for BCS classification (Papp in ×10−6 cm/s)

Compound name         Papp(a)  In vitro class  Fa(b) (%)  BCS(c)     Therapeutic class              In silico class(d)
Acetylsalicylic acid  2.2      M-P             100        H          NSAID                          M-P
Alfuzosin             2.3      M-P             71         –          Antihypertensive               M-P
Amiodarone            12.9     M-P             85         –          Antidepressant                 M-P
Amitriptyline         54.7     H               95         H          Antidepressant                 H
Amlodipine            21.6     H               85         H          Antihypertensive               M-P
Amoxicillin           1.6      M-P             94         H          Antibacterial                  M-P
Atenolol              2.7      M-P             56–65      M-P        Antihypertensive               M-P
Antipyrine            28       H               98         H          NSAID                          H
Azithromycin          0.6      M-P             37         M-P(*)     Antibacterial                  M-P
Caffeine              29.1     H               95         H          CNS stimulant                  H
Carbamazepine         23       H               100        H          Antiepileptic                  H
Celiprolol            0.89     M-P             50         –          Antihypertensive               M-P
Chloramphenicol       30.1     H               90         M-P        Antibacterial                  M-P
Cimetidine            3.5      M-P             87         –          Antihistamine                  H
Codeine               52.6     H               93–95      M-P        Opioid analgesic               H
Colchicine            0.9      M-P             55         –          For gout treatment             M-P
Cortisol              13.1     M-P             –          –          Wide therapeutic index         H
Darunavir             12       M-P             80         –          Antiretroviral                 M-P
Diazepam              41       H               100        H          Anxiety disorders              H
Digoxin               1.49     M-P             78–96      H          Antiarrhythmic                 M-P
Etoposide             0.7      M-P             50–60      M-P(*)     Cytotoxic                      M-P
Fexofenadine          1.9      M-P             30         –          Antihistamine                  M-P
Fluoxetine            27.9     H               95         –          Antidepressant                 H
Furosemide            0.4      M-P             60         M-P        Diuretic                       M-P
Haloperidol           19.7     H               100        M-P        Antipsychotic                  NC(**)
Ibuprofen             48.8     H               92–100     H          NSAID                          H
Indomethacin          18.92    H               100        H          NSAID                          H
Ivermectin            0.8      M-P             56         M-P(*)     Antifilarial                   M-P
Levofloxacin          28.4     H               100        H          Antibacterial                  H
Linopirdine           17.1     H               100        –          Anti-Alzheimer                 H
Methotrexate          0.4      M-P             59–70      M-P        Antineoplastic                 M-P
Metoprolol            21.33    H               95         H          Antihypertensive               H
Naproxen              26.1     H               87         H          NSAID                          H
Nevirapine            32.2     H               93         H          Antiretroviral                 H
Norfloxacin           2.1      M-P             71         –          Antibacterial                  M-P
Paracetamol           24.6     H               90         H          Analgesic and antipyretic      H
Phenobarbital         23.6     H               90–100     H          Antiepileptic                  H
Propranolol           25.2     H               95         H          Antimigraine                   H
Ranitidine                     M-P             57         M-P        Antiulcerative                 M-P
Rimonabant            126.6    H               90         –          Antiobesity                    M-P
Terbutaline           4.65     M-P             73         –          Bronchodilator and tocolytic   M-P
Theophylline          24.9     H               98         –          Bronchodilator                 H
Thymitaq              4.1      M-P             82         –          Cytotoxic                      M-P
Ticlopidine           241.6    H               90         –          Anticoagulant                  H
Trimethobenzamide     10.9     M-P             60         –          Antiemetic                     H
Verapamil             35       H               95         H          Antiarrhythmic                 H
Vinblastine           5.1      M-P                        M-P        Antineoplastic                 M-P

(a) Average in vitro permeability extracted from Refs. [12,15,53,54]
(b) Fraction absorbed in human (http://modem.ucsd.edu/adme/databases/databases_intestinal_absorption.htm)
(c) BCS permeability classes reported in the literature [11]
(d) In silico permeability classification of the consensus model
(*) Compounds without a clear classification in the BCS according to Refs. [11,13]
(**) Non-classified drug

On the other hand, Caco-2 cell permeability data have been demonstrated to be useful for provisionally assigning the BCS permeability class. Our newly proposed threshold correlates well with HIA ≥ 85 %, which is a regulatory benchmark boundary used to identify compounds belonging to classes I and II in the BCS. In addition, the two molecular properties that most affect the in vitro permeability of drugs were identified: lipophilicity (LogP) and polar surface area (PSA). These
findings agree with our previous analysis [12]. In general, many ADME databases, such as those for HIA and F and for substrates and inhibitors of metabolizing proteins, are available in the literature with a moderate to high imbalance degree. Therefore, there is an ongoing need to apply the strategies studied here with other machine learning techniques in order to develop QSPR models able to accurately predict these imbalanced ADME data. It is anticipated that the current study can help researchers to identify the best solution to the imbalance problem and to improve data management in ADME profiling and prediction.

Supporting information

Supplemental Tables S1 and S2 (.xls) list all training and test data set compounds, respectively, with the MDs calculated. Supplemental Table S3 (.xls) lists compounds excluded from the modeling procedure due to extreme characteristics (MW ≤ 75) and permanent charge (quaternary ammonium and sal). In addition, all Tables (S1, S2, and S3) include a List of References (in the second sheet) from which chemical structures and permeability values were compiled.

Acknowledgments

H.L-T-T is supported by Vietnam National University. H.P-T, M.B, I.G-A, T.G, and M.A.C-P acknowledge the financial support of AECID (Grant No 1- D/031152/10 and DCI-ALA/19.09.01 /10/21526/245-297/ALFA 111(2010)29). We greatly appreciate Mr Aaron Burns from Oxford English UK Vietnam for his careful review and helpful editing of this manuscript.

References

1. Chawla NV (2010) Data mining for imbalanced datasets: an overview. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook, vol 45, 2nd edn. Springer, 233 Spring Street, New York, NY 10013, USA, pp 875–886. doi:10.1007/978-0-387-09823-4
2. Japkowicz N (2003) Class imbalances: are we focusing on the right issue?
Paper presented at the ICML’2003 Workshop on learning from imbalanced data sets (II), Washington, DC, 21 August 2003
3. Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the international conference on machine learning (ICML 2003) Workshop on learning from imbalanced data sets II, Washington, DC
4. Trotter MWB, Holden SB (2003) Support vector machines for ADME property classification. QSAR Comb Sci 22:533–548. doi:10.1002/qsar.200310006
5. Pinto M, Trauner M, Ecker GF (2012) An in silico classification model for putative ABCC2 substrates. Mol Inf 31:547–553. doi:10.1002/minf.201200049
6. Eitrich T, Kless A, Druska C, Meyer B, Grotendorst J (2007) Classification of highly unbalanced CYP450 data of drugs using cost sensitive machine learning techniques. J Chem Inf Model 47:92–103. doi:10.1021/ci6002619
7. Hou T, Wang J, Li Y (2007) ADME evaluation in drug discovery. The prediction of human intestinal absorption by a support vector machine. J Chem Inf Model 47:2408–2415. doi:10.1021/ci7002076
8. Newby D, Freitas AA, Ghafourian T (2013) Coping with unbalanced class data sets in oral absorption models. J Chem Inf Model 53:461–474. doi:10.1021/ci300348u
9. Avdeef A (2003) Absorption and drug development: solubility, permeability, and charge state, 1st edn. Wiley, Hoboken. doi:10.1002/047145026X
10. Oltra-Noguera D, Mangas-Sanjuan V, Centelles-Sangüesa A, Gonzalez-Garcia I, Sanchez-Castaño G, Gonzalez-Alvarez M, Casabo V-G, Merino V, Gonzalez-Alvarez I, Bermejo M (2015) Variability of permeability estimation from different protocols of subculture and transport experiments in cell monolayers. J Pharmacol Toxicol Methods 71:21–32. doi:10.1016/j.vascn.2014.11.004
11. Pham-The H, Garrigues T, Bermejo M, González-Álvarez I, Monteagudo MC, Cabrera-Pérez MÁ (2013) Provisional classification and in silico study of biopharmaceutical system based on Caco-2 cell permeability
and dose number. Mol Pharm 10:2445–2461. doi:10.1021/mp4000585
12. Pham-The H, González-Álvarez I, Bermejo M, Garrigues T, Le-Thi-Thu H, Cabrera-Pérez MÁ (2013) The use of rule-based and QSPR approaches in ADME profiling: a case study on Caco-2 permeability. Mol Inf 32:459–479. doi:10.1002/minf.201200166
13. Annex 8: Proposal to waive in vivo bioequivalence requirements for WHO Model List of Essential Medicines immediate-release, solid oral dosage forms (2006) WHO Expert Committee on specification for pharmaceutical preparations. WHO Technical Report Series No 937:391–461. http://www.who.int/medicines/publications/essentialmedicines/en/index.html
14. CDER/FDA. FDA Guidance for industry: waiver of in vivo bioavailability and bioequivalence studies for immediate-release solid oral dosage forms based on a biopharmaceutics classification system (2000) Food and Drug Administration, Rockville. www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm070246.pdf
15. Pham-The H, Gonzalez-Diaz I, Bermejo-Sanz M, Mangas-Sanjuan V, Centelles I, Garriges TM, Cabrera-Perez MA (2011) In silico prediction of Caco-2 permeability by a classification QSAR approach. Mol Inf 30:376–385. doi:10.1002/minf.201000118
16. Le-Thi-Thu H, Canizares-Carmenate Y, Marrero-Ponce Y, Torrens F, Castillo-Garit JA (2015) Prediction of Caco-2 cell permeability using bilinear indices and multiple linear regression. Lett Drug Des Discov, vol 12 (E-pub ahead of print). doi:10.2174/1570180812666150630183511
17. Prieto P, Hoffmann S, Tirelli V, Tancredi F, González I, Bermejo M, De Angelis I (2010) An exploratory study of two Caco-2 cell models for oral absorption: a report on their within-laboratory and between-laboratory variability, and their predictive capacity. Altern Lab Anim 38:367–386
18. Volpe DA (2008) Variability in Caco-2 and MDCK cell-based intestinal permeability assays. J Pharm Sci 97:712–725. doi:10.1002/jps.21010
19. Polli JE, Yu LX, Cook JA, Amidon GL, Borchardt RT, Burnside BA, Burton PS, Chen ML,
Conner DP, Faustino PJ, Hawi AA, Hussain AS, Joshi HN, Kwei G, Lee VH, Lesko LJ, Lipper RA, Loper AE, Nerurkar SG, Polli JW, Sanvordeker DR, Taneja R, Uppoor RS, Vattikonda CS, Wilding I, Zhang G (2004) Summary workshop report: biopharmaceutics classification system-implementation challenges and extension opportunities. J Pharm Sci 93:1375–1381. doi:10.1002/jps.20064
20. Kim JS, Mitchell S, Kijek P, Tsume Y, Hilfinger J, Amidon GL (2006) The suitability of an in situ perfusion model for permeability determinations: utility for BCS Class I biowaiver requests. Mol Pharm 3:686–694. doi:10.1021/mp060042f
21. Maenner MJ, Denlinger LC, Langton A, Meyers KJ, Engelman CD, Skinner HG (2009) Detecting gene-by-smoking interactions in a genome-wide association study of early-onset coronary heart disease using random forests. BMC Proc 3(Suppl 7):S88. doi:10.1186/1753-6561-3-S7-S88
22. HyperChem (TM) Professional 8.0.5. Hypercube, Inc., 1115 NW 4th Street, Gainesville, Florida 32601, USA (www.hyper.com/)
23. STATISTICA (data analysis software system) (2007) 8.0 edn. StatSoft, Inc., Tulsa (www.statsoft.com)
24. Vapnik V (1995) The nature of statistical learning theory. Springer, New York
25. Burges CJC (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Disc 2:127–167. doi:10.1234/12345678
26. Chang C-C, Lin C-J (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:1–27. doi:10.1145/1961189.1961199
27. Hsu C-W, Chang C-C, Lin C-J (2003) A practical guide to support vector classification. Department of Computer Science, National Taiwan University. http://www.csie.ntu.edu.tw/~cjlin. Accessed 17 October 2014
28. Platt JC (1999) Fast training of support vector machines using sequential minimal optimization. In: Advances in kernel methods. MIT Press, Cambridge, pp 185–208
29. Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco
30. Akbani R, Kwek S,
Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Boulicaut J-F, Esposito F, Giannotti F, Pedreschi D (eds) Machine learning: ECML 2004, vol 3201. Lecture notes in computer science. Springer, Berlin, pp 39–50. doi:10.1007/978-3-540-30115-8_7
31. Wang BX, Japkowicz N (2010) Boosting support vector machines for imbalanced data sets. Knowl Inf Syst 25:1–20. doi:10.1007/s10115-009-0198-y
32. Veropoulos K, Campbell C, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: International joint conference on AI (IJCAI 99), Stockholm, pp 55–60
33. Wu G, Chang EY (2003) Adaptive feature-space conformal transformation for imbalanced-data learning. In: Proceedings of the 20th international conference on machine learning (ICML-2003), Washington DC, pp 816–823
34. Schierz AC (2009) Virtual screening of bioassay data. J Cheminform 1:1–12. doi:10.1186/1758-2946-1-21
35. Domingos P (1999) MetaCost: a general method for making classifiers cost-sensitive. In: KDD ’99 Proceedings of the 5th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, San Diego, pp 155–164. doi:10.1145/312129.312220
36. Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6:40–49. doi:10.1145/1007730.1007737
37. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) Synthetic minority over-sampling technique. J Artif Intell Res 16:321–357. doi:10.1613/jair.953
38. Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97:273–324. doi:10.1016/S0004-3702(97)00043-X
39. John GH, Kohavi R, Pfleger K (1994) Irrelevant features and the subset selection problem. In: Cohen WW, Hirsh H (eds) Machine learning: proceedings of the eleventh international conference. Morgan Kaufmann, San Francisco, pp 121–129
40. Jaworska J, Nikolova-Jeliazkova N, Aldenberg T (2005) QSAR applicability domain estimation by projection of the training set descriptor space: a review. Altern Lab Anim 33:445–459
41. Provost F, Fawcett T (1997) Analysis and
visualization of classifier performance: comparison under imprecise class and cost distributions. In: Proceedings of the 3rd international conference on knowledge discovery and data mining (KDD-97), Newport Beach, August 1997, pp 43–48
42. Le-Thi-Thu H, Casanola-Martín GM, Marrero-Ponce Y, Rescigno A, Abad C, Khan MT (2014) A rational workflow for sequential virtual screening of chemical libraries on searching for new tyrosinase inhibitors. Curr Top Med Chem 14:1473–1485. doi:10.2174/1568026614666140523120336
43. Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
44. Le-Thi-Thu H, Marrero-Ponce Y, Casanola-Martin GM, Cardoso GC, Chávez MC, Garcia MM, Morell C, Torrens F, Abad C (2011) A comparative study of nonlinear machine learning for the “in silico” depiction of tyrosinase inhibitory activity from molecular structure. Mol Inf 30:527–537. doi:10.1002/minf.201100021
45. Friedman M (1940) A comparison of alternative tests of significance for the test of m rankings. Ann Math Stat 11:86–92. doi:10.2307/2235971
46. Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Commun Stat 9:571–595. doi:10.1080/03610928008827904
47. Dunn OJ (1961) Multiple comparisons among means. J Am Stat Assoc 56:52–64. doi:10.2307/2282330
48. Le-Thi-Thu H, Cardoso GC, Casañola-Martin GM, Marrero-Ponce Y, Puris A, Torrens F, Rescigno A, Abad A (2010) QSAR models for tyrosinase inhibitory activity description applying modern statistical classification techniques: a comparative study. Chemom Intell Lab Syst 104:249–259. doi:10.1016/j.chemolab.2010.08.016
49. Fawcett T (2003) ROC graphs: notes and practical considerations for data mining researchers. Technical Report HPL-2003-4. HP Laboratories, Palo Alto
50. Oprea T (2000) Property distribution of drug-related chemical databases. J Comput Aided Mol Des 14:251–264. doi:10.1023/A:1008130001697
51. Congreve M, Carr R, Murray C, Jhoti H (2003) A rule of three for fragment-based
lead discovery? Drug Discov Today 8:876–877. doi:10.1016/S1359-6446(03)02831-9
52. Cabrera-Perez MA, Pham-The H, Bermejo M, Alvarez IG, Alvarez MG, Garrigues TM (2012) QSPR in oral bioavailability: specificity or integrality? Mini-Rev Med Chem 12:534–550. doi:10.2174/138955712800493753
53. Tremblay P, Auger S, Picard P, Blachon G, Julian B, Laplanche L, Sarcy C, Estoul S, Moliner P, Fedeli O, Fabre G (2010) LDTD384MS/MS for in vitro assays. Paper presented at the 58th ASMS Conference on Mass Spectrometry, Salt Lake City
54. Hu M, Ling J, Lin H, Chen J (2004) Use of Caco-2 cell monolayers to study drug absorption and metabolism. In: Yan Z, Caldwell GW (eds) Optimization in drug discovery: in vitro methods, vol 2. Methods in pharmacology and toxicology. Humana Press Inc., Totowa, pp 19–35. doi:10.1385/1-59259-800-5:019
55. Dressman JB, Nair A, Abrahamsson B, Barends DM, Groot DW, Kopp S, Langguth P, Polli JE, Shah VP, Zimmer M (2012) Biowaiver monograph for immediate-release solid oral dosage forms: acetylsalicylic acid. J Pharm Sci 101:2653–2667. doi:10.1002/jps.23212
56. Letcher SG (2010) Phylogenetic structure of angiosperm communities during tropical forest succession. Proc Biol Sci 277:97–104. doi:10.1098/rspb.2009.0865
57. Zhao YH, Le J, Abraham MH, Hersey A, Eddershaw PJ, Luscombe CN, Butina D, Beck G, Sherborne B, Cooper I, Platts JA (2001) Evaluation of human intestinal absorption data and subsequent derivation of a quantitative structure-activity relationship (QSAR) with the Abraham descriptors. J Pharm Sci 90:749–784. doi:10.1002/jps.1031
58. Butler JM, Dressman JB (2010) The developability classification system: application of biopharmaceutics concepts to formulation development. J Pharm Sci 99:4940–4954. doi:10.1002/jps.22217
