Current Bioinformatics, 2015, 10, 000-000

Optimum Search Strategies or Novel 3D Molecular Descriptors: Is there a Stalemate?

Yovani Marrero-Ponce*,1,2,3, César R García-Jacas1,4, Stephen J Barigye1,8, José R Valdés-Martiní1, Oscar Miguel Rivera-Borroto1,5, Ricardo W Pino-Urias1,5, Néstor Cubillán6, Ysaías J Alvarado7 and Huong Le-Thi-Thu9

1 Unit of Computer-Aided Molecular “Biosilico” Discovery and Bioinformatics Research (CAMD-BIR International), Cartagena de Indias, Bolívar, Colombia
2 Institut Universitari de Ciència Molecular, Universitat de València, Edifici d'Instituts de Paterna, P.O Box 22085, E-46071, València, Spain
3 Grupo de Investigación en Estudios Químicos y Biológicos, Facultad de Ciencias Básicas, Universidad Tecnológica de Bolívar, Cartagena de Indias, Bolívar, Colombia
4 Grupo de Investigación de Bioinformática, Centro de Estudio de Matemática Computacional (CEMC), Universidad de las Ciencias Informáticas (UCI), La Habana, Cuba
5 Faculty of Computing and Systems, Pontifical University Catholic of Ecuador in Esmeraldas (PUCESE), C/ Espejo y Santa Cruz S/N, 080150 Esmeraldas, Ecuador
6 Laboratorio de Electrónica Molecular, Universidad del Zulia, Facultad Experimental de Ciencias, Departamento de Química, Maracaibo, República Bolivariana de Venezuela
7 Laboratorio de Caracterización Molecular y Biomolecular, Departamento de Investigación en Tecnología de los Materiales y el Ambiente (DITeMA), Instituto Venezolano de Investigaciones Científicas (IVIC), Avenida 74 calle 14A, Maracaibo, República Bolivariana de Venezuela
8 Departamento de Química, Universidade Federal de Lavras, UFLA, Caixa Postal 3037, 37200-000 Lavras, MG, Brazil
9 School of Medicine and Pharmacy, Vietnam National University, Hanoi (VNU), 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam

Abstract: The present manuscript describes a novel alignment-free 3D-QSAR method (QuBiLS-MIDAS Duplex) based on algebraic bilinear, quadratic and linear
forms on the kth two-tuple spatial-(dis)similarity matrix. Generalization schemes for the inter-atomic spatial distance using diverse (dis)-similarity measures are discussed. In addition, normalization approaches for the two-tuple spatial-(dis)similarity matrix using simple-stochastic, double-stochastic and mutual probability schemes are introduced. With the aim of taking into consideration particular inter-atomic interactions in total or local-fragment indices, path and length cut-off constraints are used. Also, in order to generalize the use of the linear combination of atom-level indices to yield global (molecular) definitions, a set of aggregation operators (invariants) is applied. A Shannon's entropy-based variability study of the proposed 3D algebraic form-based indices and the DRAGON molecular descriptor families demonstrates superior performance for the former. A principal component analysis reveals that the novel indices codify structural information orthogonal to that captured by the DRAGON indices. Finally, a QSAR study of the binding affinity to the corticosteroid-binding globulin using Cramer's steroid database is performed. This study reveals that the QuBiLS-MIDAS Duplex approach yields performance statistics similar or superior to those of all the 3D-QSAR methods reported in the literature so far, even with fewer degrees of freedom, using both the 31 steroids as the training set and the popular division of Cramer's database into training (compounds 1-21) and test (compounds 22-31) sets. It is thus expected that this methodology provides useful tools for the diversity analysis of compound datasets and high-throughput screening structure–activity data.

Keywords: Alignment-free method, aggregation operator, Minkowski distance matrix, principal component analysis, QuBiLS-MIDAS, 3D-QSAR, two-tuple spatial-(dis)similarity matrix, TOMOCOMD-CARDD, variability analysis

INTRODUCTION

The advent of 3D-QSAR methods represents a fundamental shift from the classical
Hansch-Fujita (2D-QSAR) approach, motivated by the rationale that the spatial arrangement of molecular structures plays a determinant role in comprehending ligand–receptor interactions [1]. Right from the pioneering work by Cramer [2], 3D-QSAR methods have enjoyed considerable enthusiasm over their capability to adequately model the biological activities of chemical structures. In principle, 3D-QSAR techniques can be divided into two main groups: alignment-based techniques (CoMFA-related methods) and alignment-independent methods [e.g. CoMASA (Comparative Molecular Active Site Analysis)]. However, the use of 3D-QSAR methods has been far from a fairy tale; several problems have been met. On one hand, the use of alignment rules comes with a number of challenges, such as their subjectivity and the fact that they are generally inapplicable to structurally diverse datasets (albeit there are works in this sense, e.g. see reference [3]). In addition, the computation of steric and/or electrostatic interaction energies yields numerous variables (a high-dimensional MD space) relative to the dataset size, usually including noisy variables that tend to compromise the quality of the QSAR models [4-6].

*Address correspondence to this author at the Unit of Computer-Aided Molecular “Biosilico” Discovery and Bioinformatics Research (CAMD-BIR International), Cartagena de Indias, Bolívar, Colombia; Tel: 3043926347; E-mails: ymarrero77@yahoo.es, ymponce@gmail.com
1574-8936/15 $58.00+.00 © 2015 Bentham Science Publishers

Efforts have been made to address the limitations of 3D-QSAR methods. For example, techniques aimed at addressing the high dimensionality problem include filtering data points prior to QSAR modeling [7, 8], variable selection procedures [5, 9, 10] and grouping points [11]. On the other hand, similarity matrix correlations defined in terms of shape or electrostatic potentials were introduced with the aim of lowering the computational cost of
3D methods, though at the expense of losing significant features of the molecules [12]. Strategies aimed at improving the alignment rules have also been proposed, such as the Monte Carlo algorithm [13] and least squares fitting [14]. Other approaches, such as the hypothetical active site lattice (HASL), convert superposed molecular sets into a regularly spaced set of points (lattice) defined by 3D Cartesian coordinates and atom types [15-17].

On the other hand, rather than improving the alignment rules, several alignment-independent techniques have been proposed, such as the use of 3D models based on Cartesian coordinates [18], molecular transforms [19, 20], molecular spectra [21, 22], as well as the extension of traditional 2D molecular indices to consider 3D information [23-27]. Other alignment-free methods include CoMMA (Comparative Molecular Moment Analysis) [28], the van der Waals excluded volume [29], etc. These methods are invariant to both translation and rotation of the molecular structures, and have generally yielded results comparable to those of the alignment-based methods.

However, although relentless efforts have been made to improve or provide alternative, robust and computationally cheap 3D-QSAR techniques, improvements in the quality of 3D-QSAR models have in reality been minimal, either because of the complexity of modeling biological activities or because of weaknesses inherent to the present methods, creating a kind of “out of reach” model performance. So is it possible to penetrate through these “boundaries”?
Looking at the current state of 3D-QSAR modeling in general, the balance of responses to this question may well lean towards the negative. However, our argument is that it is imperative to diversify the space spanned by the 3D molecular parameters, to yield variables that correctly “fit” or adjust to the “troublesome” behavior of the molecules, rather than concentrating almost exclusively on the quest for the correct relationship among variables by using more powerful (linear or non-linear) search strategies and optimization functions.

In previous reports, Marrero-Ponce et al. introduced outstanding features related to the topological (2D) and chiral (2.5D) aspects of molecules through the atom-based and bond-based TOMOCOMD-CARDD (acronym for Topological Molecular Computer Design – Computer Aided Rational Drug Design) molecular descriptors (MDs) (now condensed in the QuBiLS-MAS module) [30-39]. These MDs codify molecular information by means of bilinear, quadratic and linear algebraic forms and the graph-theoretical electronic-density matrices. Thus, bearing in mind these successful results and based on the same linear algebraic concepts, this manuscript is dedicated to the definition and generalization of the 3D algebraic-based QuBiLS-MIDAS (acronym for Quadratic, Bilinear and N-Linear Maps based on n-Tuple Spatial Metric [(Dis)Similarity] Matrices and Atomic WeightingS) Duplex MDs for relations between atom-pairs, which constitute a module of the TOMOCOMD-CARDD framework.

THEORETICAL FRAMEWORK

2.1 Bilinear, Quadratic and Linear Form-based Indices for Atom-Level and Total (Whole-Molecule) Definitions

If a molecule is composed of n atoms, then the kth bilinear, quadratic and linear MDs for each atom “a” are computed as bilinear, quadratic and linear algebraic maps (forms) in $\mathbb{R}^n$, in the canonical basis set, and are mathematically expressed as follows, respectively:

$$ {}_{b}L_a = b^{a,k}(\bar{x},\bar{y}) = \sum_{i=1}^{n}\sum_{j=1}^{n} g^{a,k}_{ij}\, x_i y_j = [X]^T\, \mathbb{G}^{a,k}\, [Y] \quad \forall\, a = 1,2,\ldots,n \tag{1} $$

$$ {}_{q}L_a = q^{a,k}(\bar{x}) = b^{a,k}(\bar{x},\bar{x}) = \sum_{i=1}^{n}\sum_{j=1}^{n} g^{a,k}_{ij}\, x_i x_j = [X]^T\, \mathbb{G}^{a,k}\, [X] \quad \forall\, a = 1,2,\ldots,n \tag{2} $$

$$ {}_{f}L_a = f^{a,k}(\bar{x}) = b^{a,k}(\bar{u},\bar{x}) = \sum_{i=1}^{n}\sum_{j=1}^{n} g^{a,k}_{ij}\, u_i x_j = [U]^T\, \mathbb{G}^{a,k}\, [X] \quad \forall\, a = 1,2,\ldots,n \tag{3} $$

where n is the number of atoms of the chemical structure, $\bar{u}$ is a vector with all coefficients equal to 1, and $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$ are the coordinates (or components) of the molecular (or property) vectors $\bar{x}$ and $\bar{y}$ in a system of canonical (“natural”) basis vectors of $\mathbb{R}^n$. The use of atom-based molecular vectors as representations of chemical structures has been explained in detail elsewhere [35-37, 40]. In the present report, the components of these molecular vectors are computed from the following atom- and fragment-based properties (weighting schemes): 1) atomic mass (m), 2) van der Waals volume (v), 3) atomic polarizability (p), 4) atomic Pauling electronegativity (e), 5) atomic Ghose-Crippen LogP (a) [23, 41, 42], 6) atomic Gasteiger-Marsili charge (c) [43], 7) atomic polar surface area (psa) [44], 8) atomic refractivity (r) [23, 41, 42], 9) atomic hardness (h) and 10) atomic softness (s). These properties were implemented in the QuBiLS-MIDAS program [45, 46], mainly using the Chemistry Development Kit (CDK) library [47].

The coefficients $g^{a,k}_{ij}$ are the elements of the kth two-tuple atom-level spatial-(dis)similarity matrix (SDSM), $\mathbb{G}^{a,k}$, for atom “a”. These are obtained from the coefficients $g^{k}_{ij}$ of the matrix $\mathbb{G}^{k}$ as follows:

$$ g^{a,k}_{ij} = \begin{cases} g^{k}_{ij} & \text{if } i = a \wedge j = a \\ \tfrac{1}{2}\, g^{k}_{ij} & \text{if } i = a \vee j = a \\ 0 & \text{otherwise} \end{cases} \tag{4} $$

The factor 1/2 splits each off-diagonal element equally between the two atoms it relates. So, if a molecule is partitioned into its atoms a = 1, 2, …, n, then the matrix $\mathbb{G}^{k}$ can be divided into n atom-level matrices $\mathbb{G}^{a,k}$, in such a way that the matrix $\mathbb{G}^{k}$ is exactly equal to the sum of the atom-level matrices $\mathbb{G}^{a,k}$. In this way, each $\mathbb{G}^{a,k}$
matrix determines an atom-level index for each atom “a”, which is denoted as $L_a$ (see Eqs. 1-3). In this way, the total (whole-molecule, that is, considering all atoms) bilinear, quadratic and linear indices may be represented as a vector $\bar{L}$ of size n, where each entry $L_a$ corresponds to the kth bilinear, quadratic or linear atom-level index (descriptor) for atom “a”. From this decomposition, the total bilinear, quadratic and linear indices are calculated as a linear combination (summation) of the atom-level indices (the values of the vector $\bar{L}$). Generalizations of this approach using several aggregation operators will be discussed later (see section 2.6).

The summation over $\bar{L}$ is equivalent to the product of the property vector $[X]^T$ (or $[U]^T$), the $\mathbb{G}^{k}$ matrix and the property vector $[Y]$, analogous to the original approach for 2D global bilinear, quadratic and linear algebraic forms [37, 48, 49], as shown in Eqs. 5-7 (see also Eqs. 1-3):

$$ b^{k}(\bar{x},\bar{y}) = \sum_{a=1}^{n} {}_{b}L_a = [X]^T\, \mathbb{G}^{k}\, [Y] \tag{5} $$

$$ q^{k}(\bar{x}) = \sum_{a=1}^{n} {}_{q}L_a = [X]^T\, \mathbb{G}^{k}\, [X] \tag{6} $$

$$ f^{k}(\bar{x}) = \sum_{a=1}^{n} {}_{f}L_a = [U]^T\, \mathbb{G}^{k}\, [X] \tag{7} $$

where $[X]$ and $[Y]$ are column vectors (n×1 matrices) corresponding to the coordinates of the vectors $\bar{x}$ and $\bar{y}$ in the canonical basis of $\mathbb{R}^n$, while $[U]^T$ and $[X]^T$ (1×n matrices) are the transposes of the vector $[U]$ and the property vector $[X]$, respectively. Finally, $\mathbb{G}^{k}$
is the kth two-tuple spatial-(dis)similarity matrix for a molecule, which constitutes a generalization of the well-known geometric distance matrix [20, 50]. The geometric distance matrix (or geometry matrix) of a molecule is a square symmetric n×n matrix, where each entry $r_{ij}$ is computed as the Euclidean (geometric) distance between atoms i and j, and the diagonal entries are always zero [12, 20, 50, 51]. In the present report, several approaches are proposed as extensions/generalizations of the traditionally used geometric distance matrix. These are discussed in the next section.

2.2 The Two-Tuple Spatial-(Dis)Similarity Matrix (SDSM) and its Physicochemical Nature

Keen interest in the codification of the geometric and topographic aspects of molecular structures, as a logical extension of the topological representation, can be traced back to the mid-1980s. This approach codifies information related to the molecular geometry represented by a geometric distance matrix [12, 19, 20, 24]. As previously mentioned, the geometric distance matrix uses the Euclidean distance to codify inter-atomic interactions within a molecule.

Formally, let N be a set of elements and D: N × N → ℝ a function that may possess the following properties, for all a, b, c ∈ N:

1. D(a,b) ≥ 0 (non-negative)
2. D(a,b) = D(b,a) (symmetric)
3. D(a,a) = 0 (reflexive)
4. D(a,b) ≤ D(a,c) + D(c,b) (triangle inequality)

If D satisfies properties 1-3 it is called a distance on N, while if D satisfies properties 1-4 it is called a distance metric. On the other hand, if D satisfies axioms 1, 2 and 4 it is called a pseudometric, and if D does not satisfy property 4 it is a nonmetric.

To compute the distance between two atoms, the 3D Cartesian coordinates x, y, z are considered. These coordinates are continuous variables, and the Euclidean metric is the most common measure employed to compute distances for this type of variable. It is striking that up to the moment
the Euclidean distance has been considered practically the exclusive inter-atomic metric in the computation of 3D MDs, although there is no evidence, other than intuitive reasoning, that upholds it as the most suitable distance metric. Therefore, if a molecule is in a Euclidean space, and taking into account the previous distance and metric definitions, it is possible to generalize the distance between atoms i and j through the Minkowski distance [52]:

$$ d_{ij} = \left( |x_i - x_j|^{p} + |y_i - y_j|^{p} + |z_i - z_j|^{p} \right)^{1/p} \tag{8} $$

where x, y and z represent the coordinates along the Cartesian axes, and p is the Minkowski distance norm or order [e.g., p = 1 is the city-block (Manhattan) distance, p = 2 the well-known Euclidean distance, and p = ∞ the Chebyshev (Lagrange) distance]. It should be noted that the “Minkowski distance matrix” with elements defined in Eq. 8 is the more general (extended or expanded) case of the well-known geometric distance matrix (recovered if p = 2). However, there exist numerous metrics that have been used successfully in machine learning algorithms and similarity studies [53-55] and that could be used to compute the inter-atomic dissimilarity, thus serving as generalization schemes for the spatial distance, $g_{ij}$, between atoms i and j. Table 1 shows the set of metrics used in this report for the computation of the inter-atomic geometric distance. So, why use diverse (dis)-similarity metrics?
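Before that question is taken up, the Minkowski generalization of Eq. 8 can be made concrete with a minimal sketch. The function names (`minkowski_distance`, `distance_matrix`) and the three coordinates below are hypothetical and chosen only for illustration; this is not the QuBiLS-MIDAS implementation, just a plain-Python reading of Eq. 8 in which p = ∞ is handled as the Chebyshev limit.

```python
import math


def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two 3D points (Eq. 8)."""
    diffs = [abs(u - v) for u, v in zip(a, b)]
    if math.isinf(p):
        # Limit p -> infinity: Chebyshev (Lagrange) distance.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def distance_matrix(coords, p=2.0):
    """Symmetric matrix of pairwise Minkowski distances (zero diagonal)."""
    n = len(coords)
    return [[minkowski_distance(coords[i], coords[j], p) for j in range(n)]
            for i in range(n)]


# Hypothetical atomic coordinates, used only as an example.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
manhattan = distance_matrix(coords, p=1)         # p = 1, city-block
euclidean = distance_matrix(coords, p=2)         # p = 2, geometric distance matrix
chebyshev = distance_matrix(coords, p=math.inf)  # p = infinity
```

For the first two points the three orders give 7, 5 and 4 respectively, which illustrates how the same pair of atoms receives different "distances" depending on the order p.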
On one hand, the values obtained from different metrics may exhibit a high degree of correlation, indicating similarity between the objects under study, as shown by Holliday et al. in a comparative analysis of the Cosine and Tanimoto coefficients [56]. Conversely, if these values show low correlations among them, this may reflect very different features among the objects being compared. Therefore, it must not be assumed as a premise that there exists any single “best” distance metric, even if this report is only addressed to the domain of chemical structure handling. In fact, as noted by Jones and Curtice [57] in a debate regarding the association between indexing terms in information retrieval systems: “What is annoying is that no clear-cut criterion for choice among the alternatives has emerged. As a result, few candidate measures have been permanently dismissed from consideration, and a rather large set of formulas remains available.” Accordingly, there is a continuing interest in analyzing and comparing the available coefficients (metrics) in order to ensure that the most suitable one(s) are used in any concrete similarity-based system.

Table 1. Metrics used to compute the “distance” between two atoms of a molecule.

Metric | Formula (a) | Range (b)
Minkowski (M1-M7), p = 0.25, 0.5, 1, 1.5, 2, 2.5, 3 (p = 1 is the Manhattan, city-block or taxi distance, also known as the Hamming distance between binary vectors; p = 2 is the Euclidean distance) | $d_{XY} = \left( \sum_{j=1}^{h} |x_j - y_j|^{p} \right)^{1/p}$ | [0, ∞)
Chebyshev/Lagrange (M8) (Minkowski formula when p = ∞) | $d_{XY} = \max_j |x_j - y_j|$ | [0, ∞)
Canberra (M10) | $d_{XY} = \sum_{j=1}^{h} \frac{|x_j - y_j|}{|x_j| + |y_j|}$ | [0, n]
Lance-Williams/Bray-Curtis (M11) | $d_{XY} = \frac{\sum_{j=1}^{h} |x_j - y_j|}{\sum_{j=1}^{h} \left( |x_j| + |y_j| \right)}$ | [0, 1]
Clark/Coefficient of Divergence (M12) | $d_{XY} = \sqrt{ \sum_{j=1}^{h} \left( \frac{x_j - y_j}{|x_j| + |y_j|} \right)^{2} }$ | [0, n]
Soergel (M13) | $d_{XY} = \frac{\sum_{j=1}^{h} |x_j - y_j|}{\sum_{j=1}^{h} \max(x_j, y_j)}$ | [0, 1]
Bhattacharyya (M14) | $d_{XY} = \sqrt{ \sum_{j=1}^{h} \left( \sqrt{x_j} - \sqrt{y_j} \right)^{2} }$ | [0, ∞)
Wave-Edges (M15) | $d_{XY} = \sum_{j=1}^{h} \left( 1 - \frac{\min(x_j, y_j)}{\max(x_j, y_j)} \right)$ | [0, n]
Angular Separation/[1-Cosine (Ochiai)] (M16) | $d_{XY} = 1 - \mathrm{Cos}_{XY}$, where $\mathrm{Cos}_{XY} = \frac{\sum_{j=1}^{h} x_j y_j}{\sqrt{\sum_{j=1}^{h} x_j^{2}\, \sum_{j=1}^{h} y_j^{2}}}$ | [0, 2]

(a) The variables $x_j$ and $y_j$ are the values of coordinate j of the atoms X and Y of a molecule, respectively. The h value is equal to 3 and corresponds to the 3D Cartesian coordinates (x, y, z) of an atom. The p values in the Minkowski metric are 0.25, 0.5, 1 (Manhattan), 1.5, 2 (Euclidean), 2.5 and 3 (Minkowski).
(b) “Range” is defined as $\mathrm{Range} = \max(x) - \min(x)$.

In conclusion, the use of several (dis)-similarity metrics is necessary because they have some degree of orthogonality; thus, the corresponding MDs will carry independent information, which will also be highly complementary because each metric reflects very different characteristics of the atom-pairs in a molecule.

On the other hand, with the aim of taking into account close and distant inter-atomic interactions within the molecular skeleton, we adopt a generalized expression for the kth two-tuple spatial-(dis)similarity matrix, denoted by $\mathbb{G}^{k}$, where the superscript k indicates the power to which the elements of the $\mathbb{G}$ matrix are raised. In this sense, for k = 0, the matrix $\mathbb{G}^{0}$ has each entry equal to 1; for k = 1, the elements $g_{ij}$ of the matrix $\mathbb{G}^{1}$
represent the (dis)-similarity between atoms i and j (see Table 1). Furthermore, to achieve greater discrimination of molecular structures, the diagonal entries may be assigned one of two values: 1) the number of lone-pair electrons of each atom, or 2) the geometric distance, $g_{io}$, between each atom i and the center of the molecule, o. In this case, the elements $g_{ij}$ of the matrix $\mathbb{G}^{1}$ are defined as follows:

$$ g^{1}_{ij} = \begin{cases} D_{ij} & \text{if } i \neq j \wedge i, j \text{ are atoms of the molecule} \\ L_{ij} \ (\text{or } D_{io}) & \text{if } i = j \wedge \text{lone-pairs are considered} \\ 0 & \text{otherwise} \end{cases} \tag{9} $$

where $D_{ij}$ is the (dis)-similarity between atoms i and j (see Table 1), and $L_{ij}$ can be: 1) the number of lone-pair electrons on atom i, or 2) the (dis)-similarity between atom i and the center of the molecule, $g_{io}$ ($D_{io}$). The matrices $\mathbb{G}^{k}$ (k ≥ 2) are calculated by multiplying the elements of the matrix $\mathbb{G}^{k-1}$ by the corresponding elements of the matrix $\mathbb{G}^{1}$, in such a way that the elements of the matrix $\mathbb{G}^{k}$ are equal to $(g_{ij})^{k}$ for all atom-pairs i, j of a molecule. When no normalizing procedure is employed for the elements of $\mathbb{G}^{k}$ (see section 2.3 below), this matrix is designated the kth non-stochastic two-tuple spatial-(dis)similarity matrix (NS-SDSM, ${}_{ns}\mathbb{G}^{k}$). This matrix ${}_{ns}\mathbb{G}^{k}$
can be classified as a generalized matrix because it is determined through the Hadamard product, that is, by raising the elements of the matrix to different real powers [20]. Generalized reciprocal matrices, where k takes negative values (k ≤ -1), are also employed as matrix forms. That is to say, the matrices employed in this report are calculated by raising the matrix coefficients to both positive and negative exponents. When the matrix exponent is negative and the number of lone pairs of each atom i in the molecule is selected as the diagonal element, the reciprocal is not applied to the diagonal. Nonetheless, the reciprocal is computed if the (dis)-similarity between each atom i and the center of the molecule is chosen as the diagonal coefficient.

The use of the elements of the matrix ${}_{ns}\mathbb{G}^{k}$ and its corresponding reciprocal for computing the bilinear, quadratic and linear indices is based on the physicochemical nature of distinct non-covalent interactions, such as van der Waals terms, gravitational interactions, the Coulomb potential and so on. Indeed, the kth power of the matrix ${}_{ns}\mathbb{G}^{k}$ is related to the powers of its coefficients, where k = 0, ±1, ±2, ±3, …, ±12. These exponents take into account the different interactions between atoms in a molecule; for example, for k = ±1 and k = ±2, the ${}_{ns}\mathbb{G}^{k}$
matrix reflects interactions like Coulombic and/or gravitational ones, respectively. The maximum k value, ±12, is related to non-bonded (mainly steric) interactions associated with the functional form of the Lennard-Jones 6-12 potential, as in most CoMFA-like studies.

2.3 Normalization Formalisms based on Simple-Stochastic, Double-Stochastic and Mutual Probability Schemes

Matrices constitute the most common mathematical representation to codify structural information of molecules [20]. Of particular interest are the matrices related to molecular geometry, such as the geometry matrix, the molecular influence matrix, and others, which serve as a starting point for the calculation of many 3D MDs. However, it is unusual to apply probabilistic transformations to matrices in general. As every rule has an exception, stochastic matrices are defined in the framework of the MARCH-INSIDE descriptors [58, 59], the TOMOCOMD-CARDD 2D descriptors (now condensed in the QuBiLS-MAS module of the TOMOCOMD-CARDD software) [33, 60], and in walk counts (random walk Markov matrix). In addition, Carbó-Dorca [61] also employed a stochastic scaling by means of a simple stochastic transformation. This transformation was applied to Quantum Similarity Matrices (QSM), providing a stochastic QSM. In these methods a simple stochastic scaling has been employed, where the summation of the coefficients of each row is utilized as a scale factor. In this way, unsymmetrical matrices whose rows can be interpreted as discrete probability distributions are created.

Formally, stochastic matrices are square matrices where each column sum (left stochastic matrices) or each row sum (right stochastic matrices) is equal to 1; that is, the coefficients of each column or row consist of non-negative real numbers that can be interpreted as probabilities [62]. In contrast, MDs defined to date do not use the double-stochastic matrix, which is a stochastic matrix in which each column and each row sums to 1.

With the aim of normalizing the “extended” kth non-stochastic two-tuple spatial-(dis)similarity matrix, ${}_{ns}\mathbb{G}^{k}$ (NS-SDSM), three probability schemes are applied. These schemes are associated with inter-atomic interactions in the chemical structure. For the TOMOCOMD-CARDD 2D and 2.5D indices (QuBiLS-MAS program), the stochastic graph-theoretical electronic-density matrix of a molecule describes changes in the electron distribution over time throughout the molecular backbone. In this scheme, a hypothetical case is considered in which a set of atoms is initially free in space (discrete objects in space). Later, the outer-shell electrons of the atoms are distributed around the atomic cores in discrete time intervals. In this sense, the electrons of an arbitrary atom can move to other atoms at different discrete time periods throughout the chemical-bonding framework. In the geometrical approach, this matrix can be interpreted as the change in the probability of atoms in a molecule to interact with each other. Consequently, this probability can be considered a measure of the spreading of the atoms (taken as discrete objects) in space.

On this basis, the kth simple-stochastic two-tuple spatial-(dis)similarity matrix, ${}_{ss}\mathbb{G}^{k}$ (SS-SDSM), has been defined, which is obtained from ${}_{ns}\mathbb{G}^{k}$ as follows:

$$ {}_{ss}g^{k}_{ij} = \frac{g^{k}_{ij}}{\sum_{j} g^{k}_{ij}} \tag{10} $$

where $g^{k}_{ij}$ are the coefficients of the ${}_{ns}\mathbb{G}^{k}$ matrix, and the sum of the elements of the ith row of ${}_{ns}\mathbb{G}^{k}$
is called the k-order spatial-(dis)-similarity vertex degree of atom i (see Schemes 1 and 2). However, this matrix is not necessarily symmetric, in that the probability for atom i to interact with atom j differs from the probability for atom j to interact with atom i. With the purpose of equalizing the probabilities in both directions, a double-stochastic matrix is used, defined as a matrix with real non-negative entries whose column and row sums are all equal to 1. These matrices are referred to as the kth double-stochastic two-tuple spatial-(dis)similarity matrix, ${}_{ds}\mathbb{G}^{k}$ (DS-SDSM). The procedure to compute the double-stochastic matrix associated with a non-stochastic matrix is not trivial. Sinkhorn showed that a strictly positive matrix A can be scaled to a double-stochastic matrix B by [63]:

$$ B = D_g\, A\, D_g \tag{11} $$

where $D_g$ is a diagonal matrix. Later, Sinkhorn and Knopp extended this theorem to non-negative matrices and proposed a well-known iterative algorithm for matrix balancing, named after the authors [64]. In this sense, a DS-SDSM (${}_{ds}\mathbb{G}^{k}$) may be calculated from ${}_{ns}\mathbb{G}^{k}$ using Eq. 11 and the Sinkhorn-Knopp algorithm.

Finally, the kth mutual probability two-tuple spatial-(dis)similarity matrix, ${}_{mp}\mathbb{G}^{k}$ (MP-SDSM), is introduced. The elements of ${}_{mp}\mathbb{G}^{k}$ are obtained as follows:

$$ {}_{mp}g^{k}_{ij} = \frac{g^{k}_{ij}}{S} = \frac{g^{k}_{ij}}{\sum_{i=1}^{n}\sum_{j=1}^{n} g^{k}_{ij}} \tag{12} $$

where ${}_{mp}g^{k}_{ij}$ denotes the mutual probability between vertices i and j, and S the sample space. The sample space is computed by summing all elements of ${}_{ns}\mathbb{G}^{k}$.
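The three normalization schemes (Eqs. 10-12) can be sketched in a few lines of plain Python. This is an illustrative reading, not the QuBiLS-MIDAS code: the function names are hypothetical, the double-stochastic step uses the standard alternating row/column form of the Sinkhorn-Knopp iteration, and every row and column of the input is assumed to have a positive sum.

```python
def simple_stochastic(g):
    """Eq. 10: divide every entry by the sum of its row."""
    out = []
    for row in g:
        s = sum(row)
        out.append([v / s for v in row])
    return out


def mutual_probability(g):
    """Eq. 12: divide every entry by the sum S of all entries."""
    total = sum(sum(row) for row in g)
    return [[v / total for v in row] for row in g]


def double_stochastic(g, max_iter=10000, tol=1e-12):
    """Sinkhorn-Knopp style balancing of a square non-negative matrix:
    alternate row and column scaling until all row sums are ~1."""
    n = len(g)
    b = [list(row) for row in g]
    for _ in range(max_iter):
        row_sums = [sum(row) for row in b]
        b = [[b[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
        col_sums = [sum(b[i][j] for i in range(n)) for j in range(n)]
        b = [[b[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
        if max(abs(sum(row) - 1.0) for row in b) < tol:
            break
    return b


# k = 0 case for a 12-atom molecule: every entry of the NS matrix is 1.
ones = [[1.0] * 12 for _ in range(12)]
ss0 = simple_stochastic(ones)    # every entry 1/12
ds0 = double_stochastic(ones)    # every entry 1/12
mp0 = mutual_probability(ones)   # every entry 1/144
```

For a 12-atom molecule at k = 0, the entries 1/12 and 1/144 round to 0.083 and 0.007, which agrees with the k = 0 values tabulated for the worked example in Table 2.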
It should be pointed out that while the simple-stochastic probability scheme has been previously used in the TOMOCOMD-CARDD approach [33, 60], the double-stochastic and mutual probability schemes are presented here for the first time as alternative normalization strategies. The accompanying scheme shows the steps followed in the computation of the NS-, SS-, DS- and MP-SDSMs. In order to illustrate the calculation of these matrix approaches, the molecular structure of (E)-3-(4,5-dihydrooxazol-4-yl)-2-fluoro-3-(methylthio)acrylonitrile is considered. Table 2 depicts the zero (k = 0), first (k = 1), second (k = 2) and third (k = 3) powers of the NS-, SS-, DS- and MP-SDSMs for this molecular structure. An example of the computation of the atom-level SDSM is shown in a separate table, using the atom-level NS-SDSM.

2.4 Local-Fragment (Group, Atom-Type) Bilinear, Quadratic and Linear Algebraic Indices

In addition, the proposed matrices (${}_{ns}\mathbb{G}^{k}$, ${}_{ss}\mathbb{G}^{k}$, ${}_{ds}\mathbb{G}^{k}$ and ${}_{mp}\mathbb{G}^{k}$) can be used to codify information on a specific molecular fragment (F) of the molecule. To this end, an SDSM for the molecular fragment F, $\mathbb{G}^{k}_{F}$, is obtained from the total matrix $\mathbb{G}^{k}$. The elements $g^{k}_{ijF}$ of the local-fragment matrix are defined as follows:

$$ g^{k}_{ijF} = \begin{cases} g^{k}_{ij} & \text{if } i, j \in F \\ \tfrac{1}{2}\, g^{k}_{ij} & \text{if } i \in F \vee j \in F \\ 0 & \text{otherwise} \end{cases} \tag{13} $$

Similar to the total atom-level indices (see Eqs. 1-3), the local-fragment two-tuple atom-level indices are computed as a vector of LOVIs, ${}^{F}\bar{L}$, where each entry ${}^{F}L_a$ corresponds to the value of a local-fragment index for the atom “a” considered. These indices are defined as follows:

$$ {}^{F}_{b}L_a = b^{a,k}_{F}(\bar{x},\bar{y}) = \sum_{i=1}^{n}\sum_{j=1}^{n} g^{a,k}_{ijF}\, x_i y_j = [X]^T\, \mathbb{G}^{a,k}_{F}\, [Y] \quad \forall\, a = 1,2,\ldots,n \tag{14} $$

Scheme. Steps followed in the computation of the non-stochastic, simple-stochastic, double-stochastic and mutual probability-based matrices.

Table 2. A) Chemical structure of (E)-3-(4,5-dihydrooxazol-4-yl)-2-fluoro-3-(methylthio)acrylonitrile and its labeled molecular scaffold. B), C), D) and E) The zero (k = 0), first (k = 1), second (k = 2) and third (k = 3) powers of the non-stochastic (NS), simple-stochastic (SS), double-stochastic (DS) and mutual probability (MP) spatial-(dis)similarity matrices (SDSM) of the molecule, respectively.

A) 2D (left) and 3D (right) molecular structure (structure drawing).

B) NS-, SS-, DS- and MP-SDSM, $\mathbb{G}^{k}$, for k = 0 (12 × 12 matrices with constant entries)

NS0: every entry equals 1.000
SS0: every entry equals 0.083
DS0: every entry equals 0.083
MP0: every entry equals 0.007

C) NS-, SS-, DS- and MP-SDSM, $\mathbb{G}^{k}$, for k = 1 (one matrix row per line; left block | right block)

NS1 | SS1
0.000 1.111 1.957 2.958 2.283 3.797 1.977 1.083 1.672 1.630 1.064 2.173 | 0.000 0.051 0.090 0.136 0.105 0.175 0.091 0.050 0.077 0.075 0.049 0.100
1.111 0.000 1.074 1.925 1.798 2.731 1.258 1.847 2.498 2.400 1.739 2.145 | 0.054 0.000 0.052 0.094 0.088 0.133 0.061 0.090 0.122 0.117 0.085 0.105
1.957 1.074 0.000 1.065 0.963 1.928 2.021 2.257 2.807 2.829 2.411 3.154 | 0.087 0.048 0.000 0.047 0.043 0.086 0.090 0.100 0.125 0.126 0.107 0.140
2.958 1.925 1.065 0.000 1.640 0.863 2.376 3.310 3.853 3.823 3.362 3.694 | 0.102 0.067 0.037 0.000 0.057 0.030 0.082 0.115 0.133 0.132 0.116 0.128
2.283 1.798 0.963 1.640 3.000 2.400 2.928 2.137 2.506 2.726 2.626 3.937 | 0.079 0.062 0.033 0.057 0.104 0.083 0.101 0.074 0.087 0.094 0.091 0.136
3.797 2.731 1.928 0.863 2.400 1.000 2.937 4.169 4.705 4.650 4.174 4.286 | 0.101 0.073 0.051 0.023 0.064 0.027 0.078 0.111 0.125 0.124 0.111 0.114
1.977 1.258 2.021 2.376 2.928 2.937 2.000 2.919 3.599 3.358 2.513 1.379 | 0.068 0.043 0.069 0.081 0.100 0.100 0.068 0.100 0.123 0.115 0.086 0.047
1.083 1.847 2.257 3.310 2.137 4.169 2.919 0.000 1.022 1.618 1.663 3.209 | 0.043 0.073 0.089 0.131 0.085 0.165 0.116 0.000 0.040 0.064 0.066 0.127
1.672 2.498 2.807 3.853 2.506 4.705 3.599 1.022 2.000 0.963 1.567 3.743 | 0.054 0.081 0.091 0.125 0.081 0.152 0.116 0.033 0.065 0.031 0.051 0.121
1.630 2.400 2.829 3.823 2.726 4.650 3.358 1.618 0.963 0.000 0.915 3.362 | 0.058 0.085 0.100 0.135 0.096 0.164 0.119 0.057 0.034 0.000 0.032 0.119
1.064 1.739 2.411 3.362 2.626 4.174 2.513 1.663 1.567 0.915 1.000 2.480 | 0.042 0.068 0.094 0.132 0.103 0.164 0.099 0.065 0.061 0.036 0.039 0.097
2.173 2.145 3.154 3.694 3.937 4.286 1.379 3.209 3.743 3.362 2.480 0.000 | 0.065 0.064 0.094 0.110 0.117 0.128 0.041 0.096 0.112 0.100 0.074 0.000

DS1 | MP1
0.000 0.075 0.117 0.132 0.105 0.129 0.089 0.059 0.073 0.078 0.057 0.085 | 0.000 0.003 0.006 0.009 0.007 0.011 0.006 0.003 0.005 0.005 0.003 0.007
0.075 0.000 0.067 0.090 0.087 0.097 0.060 0.105 0.115 0.120 0.097 0.087 | 0.003 0.000 0.003 0.006 0.005 0.008 0.004 0.006 0.008 0.007 0.005 0.006
0.117 0.067 0.000 0.044 0.041 0.060 0.085 0.114 0.114 0.125 0.119 0.114 | 0.006 0.003 0.000 0.003 0.003 0.006 0.006 0.007 0.008 0.008 0.007 0.009
0.132 0.090 0.044 0.000 0.052 0.020 0.074 0.124 0.116 0.126 0.124 0.099 | 0.009 0.006 0.003 0.000 0.005 0.003 0.007 0.010 0.012 0.011 0.010 0.011
0.105 0.087 0.041 0.052 0.099 0.058 0.094 0.083 0.078 0.093 0.100 0.109 | 0.007 0.005 0.003 0.005 0.009 0.007 0.009 0.006 0.008 0.008 0.008 0.012
0.129 0.097 0.060 0.020 0.058 0.018 0.070 0.119 0.108 0.117 0.117 0.088 | 0.011 0.008 0.006 0.003 0.007 0.003 0.009 0.013 0.014 0.014 0.013 0.013
0.089 0.060 0.085 0.074 0.094 0.070 0.063 0.111 0.110 0.112 0.094 0.038 | 0.006 0.004 0.006 0.007 0.009 0.009 0.006 0.009 0.011 0.010 0.008 0.004
0.059 0.105 0.114 0.124 0.083 0.119 0.111 0.000 0.038 0.065 0.075 0.105 | 0.003 0.006 0.007 0.010 0.006 0.013
0.009 0.000 0.003 0.005 0.005 0.010 0.073 0.115 0.114 0.116 0.078 0.108 0.110 0.038 0.060 0.031 0.057 0.099 0.005 0.008 0.008 0.012 0.008 0.014 0.011 0.003 0.006 0.003 0.005 0.011 0.078 0.120 0.125 0.126 0.093 0.117 0.112 0.065 0.031 0.000 0.036 0.097 0.005 0.007 0.008 0.011 0.008 0.014 0.010 0.005 0.003 0.000 0.003 0.010 0.057 0.097 0.119 0.124 0.100 0.117 0.094 0.075 0.057 0.036 0.044 0.080 0.003 0.005 0.007 0.010 0.008 0.013 0.008 0.005 0.005 0.003 0.003 0.007 0.085 0.087 0.114 0.099 0.109 0.088 0.038 0.105 0.099 0.097 0.080 0.000 0.007 0.006 0.009 0.011 0.012 0.013 0.004 0.010 0.011 0.010 0.007 0.000 E) NS NS-, SS-, DS- and MP-SDSM, 𝔾𝔾! for k = SS2 0.000 1.235 3.829 8.749 5.212 14.419 3.909 1.172 2.796 2.655 1.132 4.723 0.000 0.025 0.077 0.176 0.105 0.289 0.078 0.024 0.056 0.053 0.023 0.095 1.235 0.000 1.153 3.706 3.231 7.460 1.583 3.411 6.242 5.758 3.025 4.601 0.030 0.000 0.028 0.090 0.078 0.180 0.038 0.082 0.151 0.139 0.073 0.111 3.829 1.153 0.000 1.134 0.927 3.719 4.083 5.095 7.880 8.004 5.812 9.947 0.074 0.022 0.000 0.022 0.018 0.072 0.079 0.099 0.153 0.155 0.113 0.193 8.749 3.706 1.134 0.000 2.688 0.745 5.646 10.959 14.843 14.619 11.305 13.648 0.099 0.042 0.013 0.000 0.031 0.008 0.064 0.124 0.169 0.166 0.128 0.155 5.212 3.231 0.927 2.688 9.000 5.760 8.572 4.568 6.279 7.431 6.895 15.499 0.069 0.042 0.012 0.035 0.118 0.076 0.113 0.060 0.083 0.098 0.091 0.204 14.419 7.460 3.719 0.745 5.760 1.000 8.624 17.382 22.133 21.622 17.419 18.373 0.104 0.054 0.027 0.005 0.042 0.007 0.062 0.125 0.160 0.156 0.126 0.133 3.909 1.583 4.083 5.646 8.572 8.624 4.000 8.518 12.952 11.279 6.317 1.900 0.051 0.020 0.053 0.073 0.111 0.111 0.052 0.110 0.167 0.146 0.082 0.025 1.172 3.411 5.095 10.959 4.568 17.382 8.518 0.000 1.044 2.617 2.767 10.296 0.017 0.050 0.075 0.162 0.067 0.256 0.126 0.000 0.015 0.039 0.041 0.152 2.796 6.242 7.880 14.843 6.279 22.133 12.952 1.044 4.000 0.927 2.456 14.013 0.029 0.065 0.082 0.155 0.066 0.232 0.136 0.011 0.042 0.010 0.026 0.147 2.655 5.758 8.004 
14.619 7.431 21.622 11.279 2.617 0.927 0.000 0.837 11.301 0.031 0.066 0.092 0.168 0.085 0.248 0.130 0.030 0.011 0.000 0.010 0.130 1.132 3.025 5.812 11.305 6.895 17.419 6.317 2.767 2.456 0.837 1.000 6.150 0.017 0.046 0.089 0.174 0.106 0.268 0.097 0.042 0.038 0.013 0.015 0.094 4.723 4.601 9.947 13.648 15.499 18.373 1.900 10.296 14.013 11.301 6.150 0.000 0.043 0.042 0.090 0.124 0.140 0.166 0.017 0.093 0.127 0.102 0.056 0.000 DS2 MP2 0.000 0.060 0.134 0.163 0.123 0.166 0.090 0.036 0.057 0.060 0.035 0.076 0.000 0.001 0.004 0.009 0.005 0.015 0.004 0.001 0.003 0.003 0.001 0.005 0.060 0.000 0.045 0.077 0.086 0.096 0.041 0.116 0.144 0.145 0.105 0.083 0.001 0.000 0.001 0.004 0.003 0.008 0.002 0.004 0.007 0.006 0.003 0.005 0.134 0.045 0.000 0.017 0.018 0.035 0.076 0.125 0.131 0.145 0.145 0.129 0.004 0.001 0.000 0.001 0.001 0.004 0.004 0.005 0.008 0.008 0.006 0.010 0.163 0.077 0.017 0.000 0.027 0.004 0.055 0.142 0.130 0.140 0.150 0.094 0.009 0.004 0.001 0.000 0.003 0.001 0.006 0.012 0.016 0.015 0.012 0.014 0.123 0.086 0.018 0.027 0.116 0.036 0.107 0.075 0.070 0.091 0.116 0.136 0.005 0.003 0.001 0.003 0.009 0.006 0.009 0.005 0.007 0.008 0.007 0.016 0.166 0.096 0.035 0.004 0.036 0.003 0.052 0.139 0.120 0.128 0.143 0.078 0.015 0.008 0.004 0.001 0.006 0.001 0.009 0.018 0.023 0.023 0.018 0.019 0.090 0.041 0.076 0.055 0.107 0.052 0.049 0.136 0.141 0.134 0.103 0.016 0.004 0.002 0.004 0.006 0.009 0.009 0.004 0.009 0.014 0.012 0.007 0.002 0.036 0.116 0.125 0.142 0.075 0.139 0.136 0.000 0.015 0.041 0.060 0.116 0.001 0.004 0.005 0.012 0.005 0.018 0.009 0.000 0.001 0.003 0.003 0.011 0.057 0.144 0.131 0.130 0.070 0.120 0.141 0.015 0.039 0.010 0.036 0.107 0.003 0.007 0.008 0.016 0.007 0.023 0.014 0.001 0.004 0.001 0.003 0.015 Novel 3D Algebraic Molecular Descriptors Current Bioinformatics, 2015, Vol 10, No (Table 2) contd… F) NS-, SS-, DS- and MP-SDSM, 𝔾𝔾! 
for k = 0.060 0.145 0.145 0.140 0.091 0.128 0.134 0.041 0.010 0.000 0.013 0.094 0.003 0.006 0.008 0.015 0.008 0.023 0.012 0.003 0.001 0.000 0.001 0.012 0.035 0.105 0.145 0.150 0.116 0.143 0.103 0.060 0.036 0.013 0.022 0.071 0.001 0.003 0.006 0.012 0.007 0.018 0.007 0.003 0.003 0.001 0.001 0.006 0.076 0.083 0.129 0.094 0.136 0.078 0.016 0.116 0.107 0.094 0.071 0.000 0.005 0.005 0.010 0.014 0.016 0.019 0.002 0.011 0.015 0.012 0.006 0.000 G) NS3 NS-, SS-, DS- and MP-SDSM, 𝔾𝔾! for k = 4.67 4.33 SS3 0.00 1.37 7.49 25.88 11.90 54.75 7.73 1.27 1.21 10.26 0.000 0.010 0.057 0.198 0.091 0.418 0.059 0.010 0.036 0.033 0.009 0.078 1.37 0.00 1.24 7.14 5.81 20.37 1.99 6.30 15.60 13.82 5.26 7.49 1.24 0.00 1.21 0.89 7.17 25.88 7.14 1.21 0.00 4.41 0.64 13.42 36.28 57.19 55.89 38.01 50.42 0.089 0.025 0.004 0.000 0.015 0.002 0.046 0.125 0.197 0.192 0.131 0.174 9.87 0.015 0.000 0.014 0.080 0.065 0.230 0.022 0.071 0.176 0.156 0.059 0.111 8.25 11.50 22.12 22.64 14.01 31.37 0.059 0.010 0.000 0.009 0.007 0.056 0.065 0.090 0.173 0.177 0.110 0.245 11.90 5.81 0.89 4.41 27.00 13.82 25.10 9.76 15.73 20.26 18.11 61.02 0.056 0.027 0.004 0.021 0.126 0.065 0.117 0.046 0.074 0.095 0.085 0.285 54.75 20.37 7.17 0.64 13.82 1.00 25.33 72.47 104.13 100.54 72.70 78.76 0.099 0.037 0.013 0.001 0.025 0.002 0.046 0.131 0.189 0.182 0.132 0.143 7.73 1.99 8.25 13.42 25.10 25.33 8.00 24.86 46.61 37.88 15.88 2.62 0.036 0.009 0.038 0.062 0.115 0.116 0.037 0.114 0.214 0.174 0.073 0.012 1.27 6.30 11.50 36.28 9.76 72.47 24.86 0.00 1.07 4.23 4.60 33.04 0.006 0.031 0.056 0.177 0.048 0.353 0.121 0.000 0.005 0.021 0.022 0.161 4.67 15.60 22.12 57.19 15.73 104.13 46.61 1.07 8.00 0.89 3.85 52.46 0.014 0.047 0.067 0.172 0.047 0.313 0.140 0.003 0.024 0.003 0.012 0.158 4.33 13.82 22.64 55.89 20.26 100.54 37.88 4.23 0.89 0.00 0.77 37.99 0.014 0.046 0.076 0.187 0.068 0.336 0.127 0.014 0.003 0.000 0.003 0.127 1.21 5.26 14.01 38.01 18.11 72.70 15.88 4.60 3.85 0.77 1.00 15.25 0.006 0.028 0.073 0.199 0.095 0.381 0.083 0.024 0.020 
0.004 0.005 0.080
10.26 9.87 31.37 50.42 61.02 78.76 2.62 33.04 52.46 37.99 15.25 0.00 | 0.027 0.026 0.082 0.132 0.159 0.206 0.007 0.086 0.137 0.099 0.040 0.000

DS3 | MP3
0.000 0.047 0.146 0.189 0.137 0.200 0.087 0.020 0.042 0.043 0.020 0.067 | 0.000 0.000 0.002 0.009 0.004 0.018 0.003 0.000 0.002 0.001 0.000 0.003
0.047 0.000 0.030 0.064 0.082 0.091 0.028 0.125 0.173 0.170 0.110 0.080 | 0.000 0.000 0.000 0.002 0.002 0.007 0.001 0.002 0.005 0.005 0.002 0.003
0.146 0.030 0.000 0.006 0.007 0.018 0.064 0.128 0.138 0.156 0.165 0.143 | 0.002 0.000 0.000 0.000 0.000 0.002 0.003 0.004 0.007 0.007 0.005 0.010
0.189 0.064 0.006 0.000 0.013 0.001 0.039 0.152 0.135 0.146 0.168 0.086 | 0.009 0.002 0.000 0.000 0.001 0.000 0.004 0.012 0.019 0.018 0.013 0.017
0.137 0.082 0.007 0.013 0.128 0.021 0.116 0.064 0.058 0.083 0.126 0.165 | 0.004 0.002 0.000 0.001 0.009 0.005 0.008 0.003 0.005 0.007 0.006 0.020
0.200 0.091 0.018 0.001 0.021 0.000 0.037 0.151 0.122 0.130 0.161 0.067 | 0.018 0.007 0.002 0.000 0.005 0.000 0.008 0.024 0.034 0.033 0.024 0.026
0.087 0.028 0.064 0.039 0.116 0.037 0.036 0.160 0.168 0.151 0.108 0.007 | 0.003 0.001 0.003 0.004 0.008 0.008 0.003 0.008 0.015 0.012 0.005 0.001
0.020 0.125 0.128 0.152 0.064 0.151 0.160 0.000 0.006 0.024 0.045 0.124 | 0.000 0.002 0.004 0.012 0.003 0.024 0.008 0.000 0.000 0.001 0.002 0.011
0.042 0.173 0.138 0.135 0.058 0.122 0.168 0.006 0.023 0.003 0.021 0.111 | 0.002 0.005 0.007 0.019 0.005 0.034 0.015 0.000 0.003 0.000 0.001 0.017
0.043 0.170 0.156 0.146 0.083 0.130 0.151 0.024 0.003 0.000 0.005 0.089 | 0.001 0.005 0.007 0.018 0.007 0.033 0.012 0.001 0.000 0.000 0.000 0.013
0.020 0.110 0.165 0.168 0.126 0.161 0.108 0.045 0.021 0.005 0.010 0.061 | 0.000 0.002 0.005 0.013 0.006 0.024 0.005 0.002 0.001 0.000 0.000 0.005
0.067 0.080 0.143 0.086 0.165 0.067 0.007 0.124 0.111 0.089 0.061 0.000 | 0.003 0.003 0.010 0.017 0.020 0.026 0.001 0.011 0.017 0.013 0.005 0.000

$$ {}^{q}L_{a,F}^{k} = q_{F}^{a,k}(\bar{x},\bar{x}) = \sum_{i=1}^{n}\sum_{j=1}^{n} g_{ijF}^{a,k}\, x^{i} x^{j} = [X]^{T}\,\mathbb{G}_{F}^{a,k}\,[X] \qquad \forall\, a = 1,2,\ldots,n \tag{15} $$

$$ {}^{f}L_{a,F}^{k} = f_{F}^{a,k}(\bar{x}) = \sum_{i=1}^{n}\sum_{j=1}^{n} g_{ijF}^{a,k}\, u^{i} x^{j} = [U]^{T}\,\mathbb{G}_{F}^{a,k}\,[X] \qquad \forall\, a = 1,2,\ldots,n \tag{16} $$

where $g_{ijF}^{a,k}$ is the element in row “i” and column “j” of the kth local-fragment two-tuple atom-level matrix, $\mathbb{G}_{F}^{a,k}$, for the atom “a”. This matrix is computed for each atom of the molecule from the local-fragment matrix $\mathbb{G}_{F}^{k}$, which contains the distances between each atom pair belonging to the molecular fragment (F) considered. Summation over the vector of LOVIs yields the kth local-fragment bilinear, quadratic and linear indices for atom types or groups (see Schemes 1 and 2). In this report, these local MDs can be calculated on seven chemical (or functional) groups in the molecule: hydrogen bond acceptors (A), carbon atoms in aliphatic chains (C), hydrogen bond donors (D), halogens (G), terminal methyl groups (M), carbon atoms in aromatic portions (P) and heteroatoms (O, N and S in all valence states, denoted as X).

Up to this section, summation of the total or local-fragment atom-level contributions has been the only operator used to obtain the kth total (or local-fragment) NS-, SS-, DS- and MP-bilinear, quadratic and linear molecular indices. In subsection 2.6, we propose alternative strategies (invariants) for obtaining indices from LOVIs other than the summation.

Table. A) Non-stochastic matrix of order 1 (NS-SDSM1) of the chemical structure (E)-3-(4,5-dihydrooxazol-4-yl)-2-fluoro-3-(methylthio)acrylonitrile. This matrix belongs to the total bilinear index, using the Euclidean distance and the properties mass and vdW volume. B) The atom-level non-stochastic matrices, NS-SDSMa,k, derived from the total NS-SDSM1, for all atoms of the molecule.

A) NS-SDSM of order 1 (NS1)
0.000 0.900 0.511 0.338 0.438 0.263 0.506 0.924 0.598 0.614 0.940 0.460
0.900 0.000 0.931 0.519 0.556 0.366 0.795 0.541 0.400 0.417 0.575 0.466
0.511 0.931
0.000 0.939 1.039 0.519 0.495 0.443 0.356 0.353 0.415 0.317 0.338 0.519 0.939 0.000 0.610 1.158 0.421 0.302 0.260 0.262 0.297 0.271 0.438 0.556 1.039 0.610 3.000 0.417 0.342 0.468 0.399 0.367 0.381 0.254 0.263 0.366 0.519 1.158 0.417 1.000 0.341 0.240 0.213 0.215 0.240 0.233 0.506 0.795 0.495 0.421 0.342 0.341 2.000 0.343 0.278 0.298 0.398 0.725 0.924 0.541 0.443 0.302 0.468 0.240 0.343 0.000 0.979 0.618 0.601 0.312 0.598 0.400 0.356 0.260 0.399 0.213 0.278 0.979 2.000 1.039 0.638 0.267 0.614 0.417 0.353 0.262 0.367 0.215 0.298 0.618 1.039 0.000 1.093 0.297 0.940 0.575 0.415 0.297 0.381 0.240 0.398 0.601 0.638 1.093 1.000 0.403 0.460 0.466 0.317 0.271 0.254 0.233 0.725 0.312 0.267 0.297 0.403 0.000 B) Atom-level NS-SDSM order for all atoms of the molecule NS1,1 NS2,1 0.000 0.450 0.256 0.169 0.219 0.132 0.253 0.462 0.299 0.307 0.470 0.230 0.000 0.450 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.450 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.450 0.000 0.466 0.260 0.278 0.183 0.397 0.271 0.200 0.208 0.287 0.233 0.256 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.466 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.169 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.260 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.219 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.278 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.132 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.183 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.253 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.397 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.462 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.271 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.299 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 
0.000 0.000 0.200 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.307 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.208 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.470 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.287 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.230 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.233 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 NS3,1 NS4,1 0.000 0.000 0.256 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.169 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.466 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.260 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.256 0.466 0.000 0.469 0.519 0.259 0.247 0.222 0.178 0.177 0.207 0.159 0.000 0.000 0.000 0.469 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.469 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.169 0.260 0.469 0.000 0.305 0.579 0.210 0.151 0.130 0.131 0.149 0.135 0.000 0.000 0.519 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.305 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.259 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.579 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.247 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.210 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.222 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.151 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.178 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.130 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.177 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.131 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.207 0.000 0.000 0.000 
0.000 0.000 0.000 0.000 0.000 0.000 | 0.000 0.000 0.000 0.149 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.159 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 | 0.000 0.000 0.000 0.135 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

C) Atom-level NS-SDSM of order 1 for all atoms of the molecule
NS5,1 | NS6,1
0.000 0.000 0.000 0.000 0.219 0.000 0.000 0.000 0.000 0.000 0.000 0.000 | 0.000 0.000 0.000 0.000 0.000 0.132 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.278 0.000 0.000 0.000 0.000 0.000 0.000 0.000 | 0.000 0.000 0.000 0.000 0.000 0.183 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.519 0.000 0.000 0.000 0.000 0.000 0.000 0.000 | 0.000 0.000 0.000 0.000 0.000 0.259 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.305 0.000 0.000 0.000 0.000 0.000 0.000 0.000 | 0.000 0.000 0.000 0.000 0.000 0.579 0.000 0.000 0.000 0.000 0.000 0.000
0.219 0.278 0.519 0.305 3.000 0.208 0.171 0.234 0.200 0.183 0.190 0.127 | 0.000 0.000 0.000 0.000 0.000 0.208 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.208 0.000 0.000 0.000 0.000 0.000 0.000 0.000 | 0.132 0.183 0.259 0.579 0.208 1.000 0.170 0.120 0.106 0.108 0.120 0.117

Fig. (1). Shannon’s entropy distribution of the QuBiLS-MIDAS Duplex indices according to: A) the non-stochastic, simple-stochastic, double-stochastic and mutual probability matrix formalisms; B) the metrics for inter-atomic distance; C) the norms, statistical operators of central tendency and operators for dispersion and form; D) the “classical algorithms” invariants; E) pair-wise combinations of the three best atomic properties (c, logp, h) and the worst properties (r, e) in bilinear indices; and F) constraints (here lag k and lag r correspond to the lag p and lag l cut-offs in Section 2.5, respectively).
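The atom-level matrices tabulated above can apparently be reproduced from the total matrix with a simple rule that the printed values suggest: for atom a, the off-diagonal entries of row a and column a are halved, the diagonal entry of atom a is kept whole, and every other entry is zero, so that the atom-level matrices add back up to the total matrix. The sketch below illustrates that inferred rule in Python; it is not the QuBiLS-MIDAS implementation. The function names, the 3 x 3 matrix (the top-left corner of the NS1 panel, truncated for brevity) and the weight vectors x and y are assumptions chosen only for illustration.

```python
def atom_level_matrices(G):
    """Split a symmetric total matrix G into one matrix per atom (inferred rule)."""
    n = len(G)
    parts = []
    for a in range(n):
        Ga = [[0.0] * n for _ in range(n)]
        for j in range(n):
            if j == a:
                Ga[a][a] = G[a][a]        # diagonal entry of atom a, kept whole
            else:
                Ga[a][j] = G[a][j] / 2.0  # row a, off-diagonal, halved
                Ga[j][a] = G[j][a] / 2.0  # column a, off-diagonal, halved
        parts.append(Ga)
    return parts


def bilinear_form(x, G, y):
    """Evaluate x^T G y for plain Python lists."""
    n = len(G)
    return sum(x[i] * G[i][j] * y[j] for i in range(n) for j in range(n))


# Top-left 3 x 3 corner of the NS1 panel (truncated; illustrative only).
G = [
    [0.000, 0.900, 0.511],
    [0.900, 0.000, 0.931],
    [0.511, 0.931, 0.000],
]
# Hypothetical atomic weightings, not taken from the source.
x = [1.0, 2.0, 3.0]
y = [0.5, 1.0, 1.5]

parts = atom_level_matrices(G)
lovis = [bilinear_form(x, Ga, y) for Ga in parts]  # one atom-level value per atom
total = bilinear_form(x, G, y)

print(parts[0][0])        # first row of the matrix for atom 1
print(sum(lovis), total)  # atom-level values compared with the total index
```

Under this rule each off-diagonal entry is shared equally between its two atoms, so summing the atom-level values should recover the total index, which is the property the summation operator relies on.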
3.2 Variability of QuBiLS-MIDAS Duplex Indices in Terms of the Inter-Atomic Distance Metrics

We evaluate the variability of the MDs computed according to the different metrics used to compute the inter-atomic distances. To this end, 325 variables were compared for each metric. Fig. (1B) reveals that the best SE distribution is obtained for MDs computed with the Canberra, Lance-Williams and Clark metrics. This is a logical result, given that these metrics are derived from similar mathematical considerations (see Table 1). This group is followed by the MDs based on Angular Separation and Bhattacharyya, in that order. Next, with comparable behavior, comes a group of MDs based on Minkowski distances with different p values (see Table 1). This result suggests that the choice of the p value does not “significantly” alter the performance of the MDs in variability terms. Finally, the MDs based on the Soergel and Wave-Edges metrics show low variability. This study supports the incorporation of other metrics (traditionally used as similarity measures) in the computation of the inter-atomic distance, beyond the classical inter-atomic distance metrics [Euclidean distance, i.e., Minkowski (p = 2), leverage and topographic distance] [20], in the sense that MDs with superior variability are obtained with these generalized metric schemes.

3.3 Comparison of Variability of Mathematical Invariants in QuBiLS-MIDAS Duplex Indices

The purpose of the present study is to analyze the variability of the 3D-indices according to the mathematical operator applied to the vector of LOVIs computed for each of the molecules. In this study, we compared 325 variables calculated for each mathematical operator. The study comprises two parts. In the first, we compare the mathematical operators collectively denominated as norms, statistical operators of central tendency and operators for dispersion and form; in the second, we analyze the “classical
algorithms”. As can be observed in Fig. (1C), the highest variability is shown by the statistical operators skewness, kurtosis and variation coefficient, with approximately 96%, 95% and 60% of these variables, respectively, having SE above the threshold corresponding to 60% of the maximum entropy. The indices based on the variance show low variability. The rest of the indices show comparable behavior. In the second study (see Fig. 1D), a high SE distribution is obtained with the operators mean information content, total information content and total sum. The standardized information content yields some variables with high SE values, although overall it presents a low SE distribution. The electrotopological state, gravitational and autocorrelation algorithms possess comparable SE distributions, with the last two presenting an identical SE distribution pattern, which is explained by the similar mathematical definitions of these algorithms. The lowest SE distribution is presented by the Ivanciuc-Balaban algorithm. The divergent variability among the MDs obtained with different mathematical operators justifies the use of different mathematical operators for the global characterization of molecular structures. While some operators have a better SE distribution than others, this does not mean that only these operators should be used. In practical applications the operators should be combined, because a collection of the best is not always the best.

3.4 Comparative Analysis of Atomic Properties and Molecular Vectors

Bearing in mind that the atom-based molecular vectors are a key component of the 3D bilinear, quadratic and linear index formalisms, it is of particular interest to have an overview of the performance of the atom- and fragment-based weighting schemes, or pair-wise combinations of these, employed in the present report. Fig. (1E) shows the SE distribution for bilinear MDs computed for pair-wise combinations of the best three atomic properties in SE
terms [i.e., atomic charge (c), atomic Ghose-Crippen Log P (a) and atomic hardness (h)] and the two properties with the lowest SE [i.e., atomic refractivity (r) and atomic electronegativity on the Pauling scale (e)] (see SI2 for the variability analysis with ten individual properties). It is interesting to note that pair-wise combinations of the lowest-SE weights (r_e), or of low-SE weights with the best atomic properties, yield schemes that perform similarly to combinations of the best atomic properties. This justifies the theoretical contribution of pair-wise combinations of atom weights, specifically in bilinear form-based indices. On the other hand, weighting pairs that include the atomic charge (c) generally do not yield high-SE MDs.

3.5 Comparison of Variability of Constraints

In Section 2.5, the theoretical basis for the importance of constraint schemes was given. Here, we evaluate the variability of MDs computed according to these constraints, following the premise that considering the inter-atomic interactions for all separations is not necessarily the optimal approach, as some of them may instead introduce “noise” in the variables. Fig. (1F) shows the SE distribution of the QuBiLS-MIDAS Duplex MDs computed according to single and pair-wise constraints, as well as for the case in which no constraints are considered. Strikingly, the lowest variability is obtained when no constraints are considered (ka = keep all). A superior SE distribution is also obtained when pair-wise separations are considered. The constraints generally possess comparable SE distributions, although with a slight edge for pair-wise combinations and angstrom-based separations (lag r). A similar scenario is encountered in CoMFA-like methods, where the consideration of interactions at all grid points generates numerous variables, including noisy ones, which need to be filtered out prior to model building.

3.6 Comparative Analysis of
QuBiLS-MIDAS Duplex 3D-Indices Versus DRAGON Descriptor Families

The objective of this study is to compare the entropy of the DRAGON descriptor families with that of the 3D linear, bilinear and quadratic indices of the QuBiLS-MIDAS program. Some of the DRAGON MD families were merged into bigger families, namely: 0D_others (constitutional descriptors, charge descriptors, molecular properties), 1D-fragment (functional group counts, atom-centered fragments), 2D-conn_autocorr_inf (connectivity indices, information indices, 2D autocorrelations), 2D-edge_walk (walk and path counts, edge adjacency indices), 2D-eigenvalues (Burden eigenvalues, topological charge indices, eigenvalue-based indices) and 3D-Randic_geometrical (Randic molecular profiles, geometrical descriptors). The best 91 variables of each of these families were considered, this cut-off being determined by the family with the fewest variables, i.e., 0D_others.

Fig. (2). Shannon’s entropy distribution for DRAGON descriptor families and linear, bilinear and quadratic QuBiLS-MIDAS Duplex 3D-indices.

As can be observed from Fig. (2), the 3D bilinear, quadratic and linear QuBiLS-MIDAS indices present a better SE distribution than the DRAGON MD families, with all the linear and bilinear indices and 66% of the quadratic indices having SE greater than 9.0 bits (90.63% of max SE), while the DRAGON MD families present practically no variable above this value. Among the DRAGON MD families, the best SE distribution is offered by 2D-edge_walk (walk and path counts, edge adjacency indices) and the worst by the 1D-fragment MDs. The present study offers a critical understanding of the performance of the QuBiLS-MIDAS Duplex indices, particularly of the contribution of the proposed generalization schemes to improving the variability of the previously proposed quadratic, linear and bilinear indices. It
could thus be inferred that the QuBiLS-MIDAS Duplex approach is a valuable tool for chemoinformatics tasks. It is important to point out that, when molecular parameters are proposed, it is desirable that they codify structural information orthogonal among themselves and to the existing ones. However, the SE-based method gives no information about the redundancy (correlation) among the variables. Bearing this in mind, the next section is dedicated to exploring the possible linear independence of the proposed parameters.

ANALYSIS OF ORTHOGONALITY OF QUBILS-MIDAS DUPLEX 3D INDICES

In this section, we examine the possible orthogonality of the QuBiLS-MIDAS Duplex 3D-MDs using principal component analysis. This is a versatile technique mainly employed for dimensionality reduction of datasets through the computation of orthogonal projections (principal components) that capture the majority of the variance in the data. It follows that variables with strong loadings in different components represent (or describe) linearly independent information. For details on this method, see refs. [71, 72]. For all these studies, the respective 3D-indices were computed for the curated spectrum database (1962 molecules). Here, we compare the QuBiLS-MIDAS Duplex indices in terms of: 1) the applied algebraic forms, 2) the matrix-based approaches and 3) the inter-atomic measures. Later, a comparison with other approaches is performed.

4.1 Linear Independence of QuBiLS-MIDAS Duplex Indices in Terms of the Applied Algebraic Form

We evaluate the information captured by the 3D-indices with respect to the algebraic form used. To this end, we calculated 260 variables. SI3A shows the eigenvalues and the percentages of variance explained by the principal components obtained in this analysis, which together explain approximately 88.18% of the cumulative variance. The bilinear form-based indices for orders 4-12 are strongly loaded in the factor
explaining 34.61% of the variance. In the factor explaining 20.244%, robust loadings are observed for the bilinear and quadratic indices of orders 0-2, as well as for the linear indices of orders 0-3, which suggests that this factor is relevant for low-order indices (see factor loadings in SI4A). Local-fragment bilinear and quadratic indices for terminal methyl groups are strongly loaded in the factor explaining 9.83%, while the factor explaining 6.82% is important for local linear indices for halogens. The important conclusion from this study is that the indices of the three algebraic forms are correlated, and thus may codify similar structural information.

4.2 Orthogonality of Matrix-Based Approaches

In this study, we evaluate the information captured by the proposed 3D indices according to the NS, SS, DS and MP approaches, using 208 variables calculated for each of the matrix approaches. SI3B shows the eigenvalues and the percentages of variance explained by the factors (see SI4B for factor loadings), which collectively explain 84.26% of the cumulative variance. The factor explaining 30.655% exhibits robust loadings for MDs based on the SS and DS matrix approaches, as well as on the NS approach for orders 0-3. In contrast, the mutual probability matrix-based MDs are exclusively loaded in the factor explaining 6.26%, and thus capture information orthogonal to the rest of the formalisms. Likewise, MDs computed according to the NS matrix approach for higher orders are exclusively loaded in the factor explaining 9.591%, suggesting orthogonality for these MDs. These results support the incorporation of normalization schemes for the spatial-(dis)similarity matrix, as linearly independent information is captured using these schemes.

4.3 Linear Independence of Inter-Atomic Distance Metrics

The aim of the present study is to evaluate the possible linear independence of the information codified by the QuBiLS-MIDAS Duplex approach in terms of the different inter-atomic distances used. To this end, 600 variables were computed. SI3C shows the
eigenvalues and percentages of variance explained by the obtained factors (see SI4C for factor loadings), which explain approximately 97.38% of the cumulative variance. The factor explaining 57.311% possesses strong loadings for the Minkowski distances for orders 0-3, for the Canberra (M10), Lance-Williams (M11) and Clark (M12) metrics, and for Bhattacharyya (M14) at certain orders and Angular Separation (M16) at orders 0-2. On the other hand, the Minkowski distances M1 (p = 0.25) and M2 (p = 0.5) at higher orders are strongly loaded in the factor explaining 3.822%, while the metrics M3-M8 (p = 1, p = 1.5, p = 2, p = 2.5, p = 3 and p = infinity, respectively) are strongly loaded in the factor explaining 18.037%. The Bhattacharyya-based MDs for orders 7-9 are strongly loaded in the factor explaining 1.849%. Finally, the MDs based on the Soergel (M13) and Wave-Edges (M15) metrics are exclusively loaded in the factors explaining 6.001% and 5.546%, respectively. It can thus be deduced that the introduction of generalization schemes for the spatial inter-atomic distance, using diverse (dis)similarity measures, permits the codification of orthogonal structural information, which is a practical contribution of this approach.

4.4 Orthogonality Among QuBiLS-MIDAS Duplex and DRAGON 3D-Indices

For this study, 721 DRAGON 3D-MDs were computed. The DRAGON families considered in this study are GETAWAY with 197 variables, 3D-MoRSE with 160 variables, RDF with 150 variables, WHIM with 99 and (Geometric and Randic Molecular Profiles) with 109 indices. SI5 shows the eigenvalues and percentages of variance explained by the 15 factors (see SI6) of this analysis, which explain approximately 73.24% of the cumulative variance. The QuBiLS-MIDAS Duplex indices computed according to the three algebraic forms (linear, bilinear and quadratic) and based on the NS- and SS-matrix formalisms are strongly loaded in the factor explaining 26.527%, as are the following DRAGON MDs: 1) GETAWAY (ITH, H6u, H7u, H8u, HTu, H5m, H6m, H8m, H5v, H6v, H7v, H8v,
HTv, H6e, H7e, H8e, H5p, H6p, H7p, H8p, and HTp), 2) 3D-MORSE (Mor01u, Mor05u, Mor01m, Mor05m, Mor01v, Mor05v, Mor01e, Mor05e, Mor01p, and Mor05p), 3) Geometrical (W3D, H3D, AGDD, DDI, ADDD, G1, G2, SEig, QXXm, QYYm, QZZ, QXXv, QYYv, QZZv, QXXe, QYYe, QZZe, QXXp, QYYp, QZZp and G(N O)), 4) all RDF MDs, 5) WHIM (L2u, L2m, L3m, L2v, L3v, L2e, L2p, L3p, L2s, L3s, Au, Am, Av, Ae, Ap, As, Vu, Vm, Vv, Ve, Vp and Vs) A similar trend is observed in Factor (13.232%) and Factor (4.505%) indicating co-linearity between the QuBiLSMIDAS Duplex MDs and the DRAGON MDs: 1) GETAWAY (R2u, R4u, R5u and R4e) and 2) 3D-Morse (Mor11u, Mor13u, Mor15u, Mor19u, Mor28u, Mor29u, Mor31u, Mor12m, Mor14m, Mor18m, Mor20m, Mor12v, Mor14v, Mor15v, Mor20v, Mor26v, Mor28v, Mor29v, Mor31v, Mor11e, Mor13e, Mor15e, Mor19e, Mor28e, Mor29e, Mor31e, Mor12p, Mor14p, Mor15p, Mor20p, Mor26p, Mor28p, Mor29p and Mor31p) (see factors loading in SI6) On the other hand, Factor (5.879%), (2.784%) and Factors to 15 (10.596%) are exclusive for the QuBiLSMIDAS Duplex MDs, while Factor (3.979%) is important for the GETAWAY MDs, Factor (3.323%) 3D-MoRSE, Geometrical and RDF MDs, and Factor (2.424%) for the Randic Molecular Profiles, Geometrical and WHIM MDs Two important conclusions could be made from this study: 1) The QuBiLS-MIDAS Duplex MDs seem to codify a degree of structural information captured by the DRAGON MDs (although with some exceptions) evidenced by the colinearity between the two groups, and 2) There is structural information codified by the QuBiLS-MIDAS Duplex MDs linearly independent to that of the DRAGON MDs These findings justify the theoretical and practical contribution of the QuBiLS-MIDAS Duplex 3D-MDs in the sense that orthogonal information is codified by these indices, and could thus ideally complement the 3D-MDs reported in the literature in the construction QSAR models with greater predictive power and a wider applicability domain in the chemical structure space QSAR MODELING FOR THE LOG K 
(CBG) WITH CRAMER’S STEROID DATABASE In the following set of studies, we evaluate the correlation ability of the QuBiLS-MIDAS Duplex approach, Novel 3D Algebraic Molecular Descriptors as an alignment-free 3D QSAR method, following the conjecture that the applied generalization schemes provide a wider and flexible span of the chemical space, which should yield good correlations with determined biological activities For this study, a search for QSAR models for the binding affinity to the corticosteroid-binding globulin (CBG) for popular Cramer’s steroid database (also known as “CoMFA or Tripos Steroid”) was performed (see SI7 for names and CBG values of compounds) This chemical dataset was first utilized by Cramer et al when proposed the COMFA methodology [2], and since then has been employed by several authors as “benchmark” for comparing the performance of different 3D-QSAR procedures Several approaches have been used in the comparative studies with the steroid dataset: some authors develop their predictive models using all 31 CBG data values, while others build the models only considering the first 21 compounds and keeping the last 10 to assess the external predictive ability The results obtained from both approaches are exposed here (see section 5.4 below) The models were built using the software MobyDigs with statistical technique Multiple Linear Regression (MLR) and the Genetic Algorithm (GA) as the variable subset selection strategy, using the statistical parameter Q2loo as the optimization function [73] For in house comparative studies, five variables were used as the model size The predictive ability of the QSAR models was assessed utilizing the validation technique bootstrapping (Q2boot) In this section, we perform an in house comparison of the diverse extensions in the QuBiLS-MIDAS Duplex 3D-MDs in terms of the matrix approach, atomic properties, algebraic forms, interatomic distances, aggregation operators, and finally, we compare the results obtained 
with those reported in the literature.

5.1 Evaluation of the Non-Stochastic, Simple-Stochastic, Double-Stochastic and Mutual Probability Matrix Approaches

In this section, we evaluate the performance of the NS, SS, DS and MP matrix-based approaches in QSAR modeling. To this end, 260 variables for each matrix-based approach were used in the search for the best model. As can be observed in Fig. (3A), superior statistical parameters are obtained with the MP approach (Q2loo = 86.93%, Q2boot = 85.11%), followed by the SS matrix-based MDs (Q2loo = 76.54%, Q2boot = 73.09%) and the non-stochastic matrix-based MDs (Q2loo = 79.91%, Q2boot = 71.69%), respectively, and finally by the DS matrix-based indices (Q2loo = 71.34%, Q2boot = 67.63%). This result is consistent with that obtained in the variability analysis, in which the highest-entropy MDs are obtained with the MP matrix formalism.

5.2 Comparison of the Atom-based Chemical Properties as Weighting Schemes and Algebraic Forms

The following properties are compared: charge (c), electronegativity (e), hardness (h), Log P (a), mass (m), polarizability (p), refractivity (r), softness (s) and Van der Waals volume (v). Fig. (3B) shows the performance of models for 3D-indices computed according to the considered properties. As can be observed from the Q2boot values for the obtained models, superior performance is obtained for the model based on 3D-indices computed with the charge property, followed by the hardness and Log P properties, while refractivity yields the model with the lowest predictive ability. The rest of the properties demonstrate comparable performance. It is important to clarify that, regardless of the performance of the individual properties, in practical applications combinations of these are used, and in this way better correlations are obtained (see Fig. 3C). Indeed, it was observed in section 3.4 that pair-wise combinations of properties yielded high-variability bilinear indices. It is thus not surprising that the bilinear indices yield superior performance statistics to the linear and quadratic indices, respectively (see Fig. 3D). A similar approach is observed in several 3D-QSAR methods, where combinations of molecular interaction fields (for example, electronic and hydrophobic or hydrogen bonding fields, coulombic fields and shape) normally yield better results [74].

5.3 Examination of the Performance of the Inter-Atomic Metrics

In this study, 260 variables are computed for each metric and a search for the best QSAR models according to each of the metrics is performed. As is evident in Fig. (3E), all the metrics, as generalizations of the classical Euclidean distance [the Minkowski metric orders are 0.25 (M01), 0.5 (M02), 1 (M03, Manhattan), 1.5 (M04), 2 (M05, Euclidean), 2.5 (M06), 3 (M07, Minkowski) and ∞ (M08, Chebyshev)], yield models with good predictive ability. The best performance is obtained with the model computed with angular separation (M16) based indices, followed by the models obtained for indices computed with the M01, M02, M03, M04, Lance-Williams (M11), Clark (M12), M08 and Bhattacharyya (M14) distances, respectively. Finally, the model computed with the Wave-Edges (M15) distance shows low predictive capacity. Generally this trend corroborates, with a few exceptions, the results obtained with the SE-based variability analysis. The varied performance of the inter-atomic dissimilarity measures upholds the practical contribution of these generalization schemes, suggesting that these codify different aspects of the inter-atomic interactions. It would thus be expected that models built with a mixture of these metrics exhibit an improved span of the chemical space, and thus possess greater predictive power. Nonetheless, more studies are needed to obtain a clear picture of the performance of these metrics and, particularly, of combinations of these.

5.4 Analysis of the Performance of the Aggregation Operators

The objective of this study is to assess the performance of the different
aggregation operators proposed in the generalization scheme for obtaining global or local indices from LOVIs. Generally, the norms yield superior performance to the statistical invariants and the means, respectively. As for the classical algorithms, the Gravitational (GV), Electrotopological State (ES), Kier-Hall (KH), Total Sum (TS) and Autocorrelation (AC) algorithms yield comparable performance, while the Ivanciuc-Balaban (IB) and Information-theoretic operators, respectively, yield models with particularly low predictive power. For details see SI8-9.

Fig. (3). Comparison of the performance of some features of the QuBiLS-MIDAS software in QSAR modeling: A) non-stochastic (NS), simple-stochastic (SS), double-stochastic (DS) and mutual probability (MP) matrix-based approaches; B) influence of the weighting properties Charge (c), Electronegativity (e), Hardness (h), Log P (a), Mass (m), Polarizability (p), Refractivity (r), Softness (s) and Van der Waals Volume (v) on the predictive power of QuBiLS-MIDAS Duplex QSAR models; C) pair-wise combinations of the three best atomic properties (c, logp, h) and the two worst properties (r, e) in bilinear indices; D) comparison of algebraic bilinear, quadratic and linear forms; E) comparison of inter-atomic metrics in QSAR modeling. Here, the Minkowski distance orders are 0.25 (M01), 0.5 (M02), 1 (M03, Manhattan), 1.5 (M04), 2 (M05, Euclidean), 2.5 (M06), 3 (M07, Minkowski) and ∞ (M08, Chebyshev). The other metrics are Canberra (M10), Lance-Williams distance (M11), Clark (M12), Soergel (M13), Bhattacharyya (M14), Wave-Edges (M15) and angular separation (M16); and F) constraints considering lag k (lag p in Section 2.5) or lag r (lag l in Section 2.5) and combinations of these.

An important finding from this evaluation is the fact that the generalization of the linear combination of LOVIs as procedure for
the global definition of indices permits obtaining MDs with varied performance in regression models, and thus combinations of these should yield greater performance.

5.5 The QuBiLS-MIDAS Duplex 3D-MDs versus Other Approaches Reported in the Literature

One of the methods of evaluating the true contribution and relevance of new molecular parameters, or generalizations of these, is to assess their performance in correlation studies with determined molecular properties with respect to the existing methods, on the premise that the novelty of a method should be assessed in terms of better correlations with molecular structural properties, or at least of improvements when combined with existing methods. In this sense, a search for models for the binding affinity to the CBG for a database of 31 steroid molecules was performed. This study comprised two parts: 1) regression models were obtained with the 31 structures as the training set (1-31, or also 1-30 with compound 31 as an outlier), and 2) the database was divided into training (1-21) and test sets (22-31). Both sampling schemes are reported in the literature, and permit comparisons with our new approach. The validation of the obtained models represents one of the most important steps in the QSAR model development workflow, as it provides criteria about the true predictive capacity of the generated models. In this sense, dividing the dataset into a training and a test set helps to avoid “overfitting” of the model to the training dataset, which compromises the model quality. Several approaches have been proposed in the literature for the generation of the training and test sets, of which the most popular takes the first 21 compounds as the training set and the remaining compounds as the test set.

Table 6 shows the equations and statistics for the best 3-6 variable QuBiLS-MIDAS Duplex regression models using the 31 steroids as the training set, and Table 7 shows the best 2-4 variable QuBiLS-MIDAS Duplex regression models when the first 21 steroids are used as the training set and the rest as the test set. Comparisons with other approaches are reported in Tables 8 and 9. As can be seen in Tables 6-9, the QuBiLS-MIDAS Duplex descriptors yield good models, with statistics comparable-to-superior to those of the approaches reported in the literature. Note that some of the studies performed in the literature with the 21 compounds as the training set did not report the SDEPext (external standard deviation of the errors of prediction) values using the 10 steroids usually considered as the external validation set, and thus the true predictive power of these methods cannot be compared. Tables SI10 and SI11 show the experimental and calculated values for log K, as well as the corresponding residual values for the obtained models. Tables 8 and 9 show the best models for log K with their corresponding statistics obtained with the QuBiLS-MIDAS Duplex descriptors and their comparison with the results reported in the literature.

Table 6. Statistical parameters for the best models for 3-6 variables for the physicochemical property log K, considering the 31 structures as the training set (test set is not taken into account)

Size | R2 | Q2loo | Q2boot | a(Q2) | F (df) | Eq.
3 | 0.939 | 0.917 | 0.916 | -0.284 | 139.7 (3, 27) | (20)
4 | 0.966 | 0.952 | 0.949 | -0.312 | 189.6 (4, 26) | (21)
5 | 0.977 | 0.964 | 0.960 | -0.434 | 216.2 (5, 25) | (22)
6 | 0.986 | 0.978 | 0.974 | -0.423 | 286.3 (6, 24) | (23)

Models* (intercept and descriptor coefficients; the descriptor symbols are not recoverable):
Eq. (20): log K intercept -4.673 (±0.714); coefficients -5.028 (±0.893), +66.882 (±11.766), -0.090 (±0.011)
Eq. (21): log K intercept -15.840 (±1.082); coefficients +0.305 (±0.055), -0.051 (±0.012), +23.339 (±2.666), +13.845 (±1.357)
Eq. (22): log K intercept -29.015 (±3.970); coefficients +0.433 (±0.059), -0.047 (±0.010), +1.816 (±0.533), +20.788 (±2.366), +2.821 (±0.243)
Eq. (23): log K intercept -16.228 (±0.729); coefficients +11.664 (±1.531), -0.193 (±0.036), +0.518 (±0.042), +28.826 (±1.681), -0.011 (±0.003), +16.421 (±0.622)

*In the equations the molecular descriptors are expressed with a nomenclature that combines AF, M, P, I and MaO, where AF is the algebraic form (B: bilinear; Q: quadratic; F: linear), M is the (dis-)similarity metric, P is the atomic property, I is the invariant, and MaO is the matrix approach (Ma) with the corresponding order (O).

Table 7. Statistical parameters for the best models for 2-4 variables for the physicochemical property log K, considering the first 21 structures as the training set and the last 10 structures as the test set

Size | R2 | Q2loo | Q2boot | a(Q2) | SDEPext (a) | F (df) | Eq.
2 | 0.946 | 0.927 | 0.926 | -0.387 | 0.582 (0.271) | 159.3 (2, 18) | (24)
3 | 0.968 | 0.950 | 0.948 | -0.431 | 0.559 (0.243) | 172.9 (3, 17) | (25)
4 | 0.984 | 0.974 | 0.964 | -0.550 | 0.559 (0.295) | 249.4 (4, 16) | (26)

Models** (intercept and descriptor coefficients; the descriptor symbols are not recoverable):
Eq. (24): log K intercept -4.707 (±0.107); coefficients -0.109 (±0.019), -0.003 (±0.001)
Eq. (25): log K intercept -3.004 (±0.506); coefficients -2.618 (±0.767), -0.105 (±0.015), -0.002 (±0.001)
Eq. (26): log K intercept -2.100 (±0.414); coefficients -0.111 (±0.011), -0.928 (±0.208), -0.009 (±0.002), -17.786 (±7.143)

(a) In parentheses, the SDEPext value for the test set of 9 steroids (compound 31 outside, taken as outlier).
**In the equations the molecular descriptors are expressed with a nomenclature that combines AF, M, P, I and MaO, where AF is the algebraic form (B: bilinear; Q: quadratic; F: linear), M is the (dis-)similarity metric, P is the atomic property, I is the invariant, and MaO is the matrix approach (Ma) with the corresponding order (O).

As can be observed, the QuBiLS-MIDAS Duplex models for all data (31 steroids) yield statistics superior to all the models reported in the literature so far, obtained with well-known 3D-MD approaches and/or more complex methods. Among the models reported in the literature for the 31 (or 30) compounds, the best in terms of the statistical parameter q2 is a six-parameter model (q2 = 0.941) that uses the combined electrostatic and shape similarity matrix and genetic neural networks (GNN). First, it should be noted that the computation of the electrostatic and shape similarity matrix includes the molecular alignment procedure and, as a grid-based method, it is computationally involved. In addition, it has been reported in the literature that the use of the GNN leads to better optimized models compared to traditional methods [75, 76]. Nonetheless, applying a simple statistical technique such as MLR to the QuBiLS-MIDAS Duplex indices yields better statistical parameters, even with a much lower degree of freedom [q2 (6 var) = 0.978, q2 (5 var) = 0.964, q2 (4 var) = 0.952 and q2 (3 var) = 0.917]. The MEDV-13 (Molecular Electronegativity Distance Vector based on 13 atomic types) method yields the best model with alignment-free MDs and MLR, but has a q2 of 0.882, which is a much lower value than that obtained with the QuBiLS-MIDAS Duplex approach. Likewise, for the 21-compound dataset, the best model reported in the literature is obtained using kNN-MFA (k Nearest Neighbor – Molecular Field Analysis) and simulated annealing [q2 (6 var) = 0.950], which is a combination of a more complex 3D method and search strategy compared to the QuBiLS-MIDAS Duplex/GA-MLR. Even then, much better statistical parameters [q2 (4 var) = 0.984] are obtained with the 4-variable models, while comparable behavior is achieved with the smaller QuBiLS-MIDAS Duplex models (see Table 8). The EEVA (Novel Electronic Eigenvalue) method and PLS (Partial Least Squares) yield a model with a q2 value of 0.84. While EEVA is an alignment-free method, it is based on more complex quantum-chemical computations. Even then, the QuBiLS-MIDAS Duplex based models possess superior statistics.

Table 9 lists the SDEPext of the test set compounds using well-known 3D-QSAR methods and the proposed QuBiLS-MIDAS Duplex approach. It is interesting to note that the QuBiLS-MIDAS Duplex method ranks well in both the 10- and 9-compound (steroid 31 out) test sets, with the lowest SDEPext value in the 9-compound case. Here, it is also important to highlight that, just as in several studies, compound 31 demonstrates outlier behavior when included in the test set (see Tables 9 and SI11). In fact, there is a significant improvement of the SDEPext values when 9 steroids are considered in place of 10, achieving better prediction accuracy than the remaining methodologies. Other than yielding superior statistics, the QuBiLS-MIDAS Duplex approach as an alignment-free
3D-QSAR method offers a remarkable advantage over the CoMFA-like techniques in avoiding all the limitations related to molecular superposition.

CONCLUSION

The QuBiLS-MIDAS Duplex approach, in its diverse extensions and generalizations, seems to renew the prospect of obtaining 3D-QSAR models with greater correlative and predictive power. Motivated by the famous “no free lunch” theorem [77], which postulates that there is no single best approach for tackling combinatorial optimization problems, the different extensions represent an innovative undertaking to adequately characterize the different phenomena that affect the molecular spatial configuration and intermolecular interactions, and thus the biological activity of molecules. As expected, the best results were achieved with combinations of parameters from the diverse generalization schemes. The theoretical structure of the QuBiLS-MIDAS Duplex approach is based on sound mathematical and physicochemical concepts in algebra, similarity and topology. While a claim of the end of all the difficulties encountered in 3D-QSAR modeling may not be made, it is reasonable to infer that the QuBiLS-MIDAS Duplex approach codifies relevant information that is orthogonal to that of the existing 3D MDs reported in the literature, as demonstrated by SE-based variability analysis and principal component methods, respectively, and offers significant gains in the predictive power of the 3D-QSAR models. It is thus expected that this methodology will provide useful tools for the diversity analysis of compound datasets and high-throughput screening structure–activity data.

Program Availability

The QuBiLS-MIDAS program is freely available at the ToMoCoMD Framework web site (http://tomocomd.com/).

CONFLICT OF INTEREST

The authors confirm that this article content has no conflict of interest.

ACKNOWLEDGEMENTS

Marrero-Ponce, Y. thanks the program ‘International Professor’ for a fellowship to work at “Universidad Tecnológica de Bolívar” in 2014-2015. Last, but not least, the
authors want to express their acknowledgements to Prof. Jorge Galvez (VU), Prof. Ramón García-Domenech (VU), Facundo Pérez-Giménez (VU) and Francisco Torrens (VU) for their help and useful comments about these new MDs.

Table 8. Comparison of Q2loo statistics of nD-QSAR methods for the property log K (CBG) for 31 (or 30) and 21 Cramer’s steroids

nD-QSAR Method | PCs/Var (a) | Statistical Method | Q2loo | Eq./Ref

31/30 Steroids (All Dataset)
QuBiLS-MIDAS | 6 | GA and MLR | 0.978 | Eq. 23
QuBiLS-MIDAS | 5 | GA and MLR | 0.964 | Eq. 22
QuBiLS-MIDAS | 4 | GA and MLR | 0.952 | Eq. 21
Combined electrostatic and shape similarity matrix | 6 | Genetic NN | 0.941 | [75]
QuBiLS-MIDAS | 3 | GA and MLR | 0.917 | Eq. 20
Hodgkin SM |  | Genetic NN | 0.903 | [75]
Fragment QS-SM |  | PLS | 0.886 | [78]
MEDV-13 |  | GA-MLR | 0.882 | [79]
“compounds” |  | - | 0.88 | [80]
SOM (a) |  | - | 0.85 (R2) | [81]
Tuned-QSAR |  | MLR and PCA | 0.842 | [74]
Autocorrelation vector 30 | - | - | 0.84 | [13]
CoMMA |  | PLS | 0.828 | [82]
SOMFA/esp + ALPHA | - | SOR | 0.82 | [83]
Combined electrostatic and shape similarity matrix |  | GA and MLR | 0.819 | [75]
EEVA |  | PLS | 0.81 | [84]
SOM-4D-QSAR |  | SOM Neural Network | 0.80 | [85]
Charges and Properties from MEPS-AM1 |  | MLR | 0.80 | [86]
HE State/E-State (a,d) |  | - | 0.80 | [87]
E-State (a,d) |  | - | 0.79 | [87]
CoSA “Bins” |  | PLS | 0.78 | [88]
QSAR/E-State “atoms” |  | - | 0.78 | [89]
TQSI |  | MLR | 0.775 | [74]
EVA |  | PLS | 0.77 | [22]
CoMSA |  | PLS | 0.76 | [90]
MQSM |  | MLR and PCA | 0.759 | [74]
EVA + ALPHA | - | SOR | 0.75 | [83]
GRIND | - | PLS | 0.75 | [91]
SEAL |  | PLS | 0.748 | [92]
Methods MiDSASA – “template”, SOMFA/esp, CoSCoSA (a) and Similarity Indices (ESP MC matrix 30), with entries PLS 0.74 [83]; - 0.74 [93]; PLS 0.820 [13] (method-to-value pairing not recoverable)
CoSASA “atoms” |  | PLS | 0.73 | [88]
E-State and kappa shape index |  | MLR | 0.72 | [94]
TARIS |  | - | 0.71 | [95]
MQSM |  | MLR | 0.705 | [74]
Combined electrostatic and shape similarity matrix |  | PLS | 0.70 | [75]
SAMFA-RF | - | RF | 0.69 | [96]
SAMFA-PLS | 4-5 | PLS | 0.69 | [96]
4D-QSAR |  | PLS | 0.69 | [85]
CoMMA (ab initio) |  | PLS | 0.689 | [28]
QSAR (a) |  | - | 0.68 | [97]
SOM-4D-QSAR |  | SOM Neural Network | 0.68 | [85]
Wagener’s (AMSP Method) | - | k-NN and FNN | 0.630 | [6]
SAMFA-SVM | - | SVM | 0.60 | [96]
ALPHA |  | PLS | 0.57 | [83]

21 Steroids (Training Set)
QuBiLS-MIDAS | 4 | GA and MLR | 0.984 | Eq. 26
QuBiLS-MIDAS | 3 | GA and MLR | 0.968 | Eq. 25
kNN-MFA/Simulated Annealing (SA) | 6 | kNN-SA | 0.95 | [98]
QuBiLS-MIDAS | 2 | GA and MLR | 0.946 | Eq. 24
kNN-MFA/Genetic Algorithm (GA) |  | kNN-GA | 0.93 | [98]
MEP |  | PLS | 0.919 | [99]
CoMFA: variable select |  | PLS | 0.903 | [100]
Apex-3D | - | - | 0.897 | [101]
COMPASS | - | neural network | 0.89 | [102]
kNN-MFA/Stepwise (SW) Variable Selection |  | kNN-SW | 0.89 | [98]
CoMSA |  | PLS | 0.88 | [90]
CoMFA-QOVS (b) |  | QOVS | 0.875 | [103]
Fragment QS-SM |  | PLS | 0.865 | [78]
SOM-4D-QSAR |  | SOM Neural Network | 0.86 | [85]
EEVA |  | PLS | 0.84 | [84]
4D-QSAR |  | PLS | 0.84 | [85]
TARIS |  | - | 0.84 | [95]
Tuned-QSAR |  | MLR after PCA | 0.832 | [74]
CoMMA |  | PLS | 0.828 | [28]
CoMASA |  | PLS | 0.822 | [104]
CoMFA-QOVS (c) |  | QOVS | 0.807 | [103]
PARM | - | - | 0.806 | [105]
CoMFA: E-state fields |  | PLS | 0.803 | [87]
EVA |  | PLS | 0.80 | [22]
CoMFA-PWPLS (b) |  | PWPLS | 0.792 | [103]
CoMFA: g2-GRS |  | PLS | 0.79 | [106]
CoMFA-PWPLS (c) |  | PWPLS | 0.773 | [103]
SEAL |  | PLS | 0.768 | [92]
Tominaga’s model (van der Waals volume excluded) |  | - | 0.766 | [29]
CoMFA - shape |  | PLS | 0.761 | [12]
CoMFA – full field |  | PLS | 0.716 | [12]
HQSAR | - | PLS | 0.71 | [107]
Minimal Steric Difference (MTD) | - | - | 0.70 | [108]
CoMFA – full field |  | PLS | 0.689 | [12]
TDQ-Surface |  | PLS | 0.68 | [109]
WHIM |  | PLS | 0.667 | [110]
CoMSIA |  | - | 0.665 | [111]
CoMFA |  | PLS | 0.662 | [2]
Receptor surface model (RSM) |  | GFA | 0.646 | [112]
CoMFA - electrostatics |  | PLS | 0.644 | [12]
GRIND |  | PLS | 0.64 | [91]
Shape Matrix |  | PLS | 0.633 | [12]
MS-WHIM |  | PLS | 0.631 | [110]
ALPHA |  | PLS | 0.63 | [83]
3D-TDB |  | PLS | 0.558 | [113]
SHESP matrix |  | PLS | 0.533 | [12]
TSAR |  | PLS | 0.505 | [113]
ESP matrix |  | PLS | 0.501 | [12]
Richards SM |  | - | 0.501 | [97]
EGSITE |  |  | 0.23 | [114]

(a) When applicable, specifies the number of components (PCs); (b) steric field; (c) electrostatic field; (d) 1.0 Å models.

Table 9. Comparison of external SDEP values
for models obtained with QuBiLS-MIDAS and other 3D-QSAR methods for the 10 (9) test set Cramer’s steroids

nD-QSAR Method | SDEPext* | Eq./Ref
CoMFA-QOVS (d) | 0.258 | [103]
MFTA | 0.30 | [115]
kNN-MFA/Genetic Algorithm (GA) | 0.374 | [98]
Receptor surface model (RSM) | 0.38 | [112]
CoMFA-QOVS (c) | 0.404 | [103]
kNN-MFA/Simulated Annealing (SA) | 0.449 | [98]
CoMFA-PWPLS (c) | 0.522 | [103]
EVA | 0.53 | [22]
Fragment QS-SM | 0.544 | [78]
CoMFA-PWPLS (d) | 0.545 | [103]
QuBiLS-MIDAS | 0.559 | Eq. 25
QuBiLS-MIDAS | 0.559 | Eq. 26
EEVA | 0.58 | [84]
QuBiLS-MIDAS | 0.582 | Eq. 24
3D-TDB | 0.583 | [113]
SOMFA | 0.584 | [116]
ESP matrix | 0.595 | [12]
CoMFA - electrostatics | 0.619 | [12]
kNN-MFA/Stepwise (SW) Variable Selection | 0.624 | [98]
CoMASA | 0.632 | [104]
MEDV-13 | 0.650 | [117]
MS-WHIM | 0.662 | [110]
SHESP matrix | 0.646 | [12]
CoMSA | 0.70 | [90]
COMPASS | 0.705 | [102]
PARM | 0.709 | [105]
ALPHA | 0.71 | [83]
CoMFA – FFD | 0.716 | [116]
SHAPE matrix | 0.710 | [12]
CoMFA – full field (a) | 0.746 | [12]
SOM-4D-QSAR | 0.75 | [85]
CoMFA - shape | 0.760 | [12]
Tuned-QSAR | 0.762 | [74]
CoMFA – full field (b) | 0.835 | [12]
DAPPER | 0.988 | [118]
TSAR | 1.055 | [113]
WHIM | 1.563 | [110]

nD-QSAR Method | SDEPext** | Eq./Ref
QuBiLS-MIDAS | 0.243 | Eq. 25
GRIND | 0.26 | [91]
QuBiLS-MIDAS | 0.271 | Eq. 24
QuBiLS-MIDAS | 0.295 | Eq. 26
MFTA | 0.31 | [115]
COMPASS | 0.339 | [102]
CoMFA - electrostatics | 0.352 | [12]
CoMFA – FFD | 0.356 | [116]
kNN-MFA/Genetic Algorithm (GA) | 0.361 | [98]
SOMFA | 0.367 | [116]
CoMASA | 0.386 | [104]
Receptor surface model (RSM) | 0.39 | [112]
CoMFA – full field (a) | 0.396 | [12]
EEVA | 0.40 | [84]
SHAPE matrix | 0.404 | [12]
MS-WHIM | 0.411 | [110]
ESP matrix | 0.411 | [12]
CoMFA - shape | 0.421 | [12]
CoMSA | 0.44 | [90]
kNN-MFA/Simulated Annealing (SA) | 0.447 | [98]
Fragment QS-SM | 0.493 | [78]
EVA | 0.51 | [22]
SHESP matrix | 0.514 | [12]
Tuned-QSAR | 0.555 | [74]
CoMFA – full field (b) | 0.567 | [12]
MEDV-13 | 0.589 | [117]
kNN-MFA/Stepwise (SW) Variable Selection | 0.636 | [98]
PARM | 0.74 | [105]
ALPHA | 0.75 | [83]
DAPPER | 1.007 | [118]
WHIM | 1.502 | [110]

(a) 2.0 Å grid density; (b) 1.0 Å grid density; (c) steric field; (d) electrostatic field. *Test set of 10 steroids (compound 31 inside). **Test set of 9 steroids (compound 31
outside, taken as outlier).

Supplementary Information Available

The supplementary information includes several Shannon’s entropy distribution diagrams, some results of the factor analysis by the principal component method, the names of the structures in Cramer’s steroid database and their corresponding experimental and predicted values for the CBG, as well as a list of acronyms (SI12).

REFERENCES

[1] Kubinyi H. QSAR and 3D QSAR in Drug Design: Methodology. Drug Disc Today 1997; 2: 457-67.
[2] Cramer RD, Patterson DE, Bunce JD. Comparative Molecular Field Analysis (CoMFA). Effect of Shape on Binding of Steroids to Carrier Proteins. J Am Chem Soc 1988; 110: 5959-67.
[3] Fichera M, Cruciani G, Bianchi A, Musumarra G. A 3D-QSAR Study on the Structural Requirements for Binding to CB1 and CB2 Cannabinoid Receptors. J Med Chem 2000; 43(12): 2300-9.
[4] Martin YC. 3D QSAR: Current State, Scope, and Limitations. In: Kubinyi H, Folkers G, Martin YC, editors. Three-Dimensional Quantitative Structure Activity Relationships. Netherlands: Springer; 2002; p. 3-23.
[5] Baroni M, Costantino G, Cruciani G, Riganelli D, Valigi R, Clementi S. Generating optimal linear PLS estimations (GOLPE): An advanced chemometric tool for handling 3D-QSAR problems. Quant Struct-Act Relat 1993; 12: 9-20.
[6] Wagener M, Sadowski J, Gasteiger J. Autocorrelation of Molecular Surface Properties for Modeling Corticosteroid Binding Globulin and Cytosolic Ah receptor. J Am Chem Soc 1995; 117: 7769-75.
[7] Cruciani G, Watson KA. Comparative molecular field analysis using GRID force-field and GOLPE variable selection methods in a study of inhibitors of glycogen phosphorylase b. J Med Chem 1994; 37: 2589-601.
[8] Ortiz AR, Pastor M, Palomer A, Cruciani G, Gago F, Wade RC. Reliability of comparative molecular field analysis models: effects of data scaling and variable selection using a set of human synovial fluid phospholipase A(2) inhibitors. J Med Chem 1997; 40(6): 1136-48.
[9] Allen MS, La Loggia AJ, Dorn LJ, Martin MJ, Costatino G, Hagen TJ, et al. Predictive binding of β-Carboline Inverse Agonists and Antagonists via the CoMFA/GOLPE Approach. J Med Chem 1992; 35: 4001-10.
[10] Cruciani G, Clementi S, Baroni M. Variable Selection in PLS Analysis. Kubinyi H, editor. Leiden: ESCOM; 1993.
[11] Pastor M, Cruciani G, Clementi S. Smart region definition: A new way to improve the predictive ability and interpretability of three-dimensional quantitative structure activity relationships. J Med Chem 1997; 40: 1455-64.
[12] Good AC, So SS, Richards WG. Structure-activity relationships from molecular similarity matrixes. J Med Chem 1993; 36: 433-8.
[13] Parretti MF, Kroemer RT, Rothman JH, Richards WG. Alignment of Molecules by the Monte Carlo Optimization of Molecular Similarity Indices. J Comput Chem 1997; 18: 1344-53.
[14] Allen MS, Tan YC, Trudell ML, Narayanan K, Schindler LR, Martin MJ, et al. J Med Chem 1990; 33: 2343.
[15] Doweyko AM. Three-Dimensional Pharmacophores from Binding Data. J Med Chem 1994; 37: 1769-78.
[16] Doweyko AM. The Hypothetical Active Site Lattice. An Approach to Modelling Active Sites From Data on Inhibitor Molecules. J Med Chem 1988; 31: 1396-406.
[17] Guccione S, Doweyko AM, Chen H, Barretta GU, Balzano F. 3D-QSAR Using ‘Multiconformer’ Alignment: the Use of HASL in the Analysis of 5-HT1A Thienopyrimidinone Ligands. J Comput-Aided Mol Des 2000; 14: 647-57.
[18] Consonni V, Todeschini R, Pavan M. Structure/Response Correlations and Similarity/Diversity Analysis by GETAWAY Descriptors. Theory of the Novel 3D Molecular Descriptors. J Chem Inf Comput Sci 2002; 42(3): 682-92.
[19] Gasteiger J, Sadowski J, Schuur J, Selzer P, Steinhauer L, Steinhauer V. Chemical Information in 3D Space. J Chem Inf Comput Sci 1996; 36: 1030-7.
[20] Todeschini R, Consonni V. Molecular Descriptors for Chemoinformatics. 1st ed. Mannhold R, Kubinyi H, Folkers G, editors. Weinheim: WILEY-VCH; 2009.
[21] Bursi R, Dao D, van Wijk T, de Gooyer M, Kellenbach M, Verwer P. Comparative Spectra Analysis (CoSA): Spectra as Three-Dimensional Molecular Descriptors for the Prediction of Biological Activities. J Chem Inf Comput Sci 1999; 39: 861-7.
[22] Turner DB, Willett P, Ferguson AM, Heritage TW. Evaluation of a novel molecular vibration-based descriptor (EVA) for QSAR studies: Model validation using a benchmark steroid dataset. J Comput Aided Mol Des 1999; 13: 271-96.
[23] Balaban AT, editor. From Chemical Topology to Three-Dimensional Geometry. New York: Plenum Press; 1997.
[24] Bogdanov B, Nikolic S, Trinajstic N. On the Three-Dimensional Wiener Number. J Math Chem 1989; 3: 299-309.
[25] Mekenyan O, Peitchev D, Bonchev D, Trinajstic N, Bangov IP. Modelling the Interaction of Small Organic Molecules with Biomacromolecules. I. Interaction of Substituted Pyridines with anti-3-azopyridine Antibody. Arzneim Forsch 1986; 36: 176-83.
[26] Randić M. Molecular Profiles. Novel Geometry-Dependent Molecular Descriptors. New J Chem 1995; 19: 781-91.
[27] Randić M, Razinger M. On Characterization of Molecular Shapes. J Chem Inf Comput Sci 1995; 35: 594-606.
[28] Silverman BD, Platt DE, Pitman M, Rigoutsos I, editors. Comparative molecular moment analysis (COMMA). Dordrecht, The Netherlands: Kluwer Academic Publishers; 1998.
[29] Tominaga Y, Fujiwara I. Novel 3D Descriptors Using Excluded Volume: Application to 3D Quantitative Structure-Activity Relationships. J Chem Inf Comput Sci 1997; 37: 1158-61.
[30] Marrero-Ponce Y. Total and local (atom and atom type) molecular quadratic indices: significance interpretation, comparison to other molecular descriptors, and QSPR/QSAR applications. Bioorg Med Chem 2004; 12: 6351-69.
[31] Marrero-Ponce Y, Torrens F, Alvarado YJ, Rotondo R. Bond-Based Global and Local (Bond and Bond-Type) Quadratic Indices and Their Applications to Computer-Aided Molecular Design. QSPR Studies of Octane Isomers. J Comput Aided Mol Des 2006; 20: 685-701.
[32] Marrero Ponce Y. Total and Local Quadratic Indices of the Molecular Pseudograph’s Atom Adjacency Matrix: Applications to the Prediction of Physical Properties of Organic Compounds. Molecules 2003; 8: 687-726.
[33] Marrero-Ponce Y, Huesca-Guillén A, Ibarra-Velarde F. Quadratic indices of the ‘molecular pseudograph's atom adjacency matrix’ and their stochastic forms: a novel approach for virtual screening and in silico discovery of new lead paramphistomicide drugs-like compounds. Journal of Molecular Structure: THEOCHEM 2005; 717(1-3): 67-79.
[34] Marrero-Ponce Y, Castillo-Garit JA, Olazabal E, Serrano HS, Morales A, Castañedo N, et al. Atom, atom-type and total molecular linear indices as a promising approach for bioorganic and medicinal chemistry: theoretical and experimental assessment of a novel method for virtual screening and rational design of new lead anthelmintic. Bioorganic & Medicinal Chemistry 2005; 13(4): 1005-20.
[35] Castillo-Garit JA, Marrero-Ponce Y, Torrens F. Atom-based 3D-chiral quadratic indices. Part 2: Prediction of the corticosteroid-binding globulin binding affinity of the 31 benchmark steroids data set. Bioorg Med Chem 2006; 14: 2398-408.
[36] Castillo-Garit JA, Marrero-Ponce Y, Torrens F, Rotondo R. Atom-based stochastic and non-stochastic 3D-chiral bilinear indices and their applications to central chirality codification. Journal of Molecular Graphics and Modelling 2007; 26(1): 32-47.
[37] Marrero-Ponce Y, Torrens F, García-Domenech R, Ortega-Broche SE, Romero Zaldivar V. Novel 2D TOMOCOMD-CARDD molecular descriptors: atom-based stochastic and non-stochastic bilinear indices and their QSPR applications. J Math Chem 2008; 44: 650-73.
[38] Marrero Ponce Y, Martinez-Albelo ER, Casanola-Martin GM, Castillo Garit JA, Echeveria Diaz Y. Bond-based linear indices of the non-stochastic and stochastic edge-adjacency matrix. Theory and modeling of ChemPhys properties of organic molecules. Mol Divers 2009; 11030.
[39] Marrero-Ponce Y, Martínez-Albelo E, Casañola-Martín G, Castillo-Garit JA, Echevería-Díaz Y, Zaldivar V, et al. Bond-based linear indices of the non-stochastic and stochastic edge-adjacency matrix. Theory and modeling of ChemPhysical properties of organic molecules. Mol Divers 2010; 14(4): 731-53.
[40] Castillo-Garit JA, Martinez-Santiago O, Marrero-Ponce Y, Casañola-Martín GM, Torrens F. Atom-based non-stochastic and stochastic bilinear indices: Application to QSPR/QSAR studies of organic compounds. Chemical Physics Letters 2008; 464(1-3): 107-12.
[41] Ghose AK, Viswanadhan VN, Wendoloski JJ. Prediction of Hydrophobic (Lipophilic) Properties of Small Organic Molecules Using Fragmental Methods: An Analysis of ALOGP and CLOGP Methods. J Phys Chem A 1998; 102: 3762-72.
[42] Balaban AT. Steric fit in quantitative structure-activity relations. Springer-Verlag; 1980.
[43] Gasteiger J, Marsili M. Iterative partial equalization of orbital electronegativity - a rapid access to atomic charges. Tetrahedron 1980; 36: 3219-28.
[44] Ertl P, Rohde B, Selzer P. Fast Calculation of Molecular Polar Surface Area as a Sum of Fragment-Based Contributions and Its Application to the Prediction of Drug Transport Properties. J Med Chem 2000; 43: 3714-7.
[45] García-Jacas CR, Marrero-Ponce Y, Acevedo-Martínez L, Barigye SJ, Valdés-Martiní JR, Contreras-Torres E. QuBiLS-MIDAS: A Parallel Free-Software for Molecular Descriptors Computation Based on Multilinear Algebraic Maps. J Comput Chem 2014; 35(18): 1395-409.
[46] García-Jacas CR, Aguilera-Mendoza L, González-Pérez R, Marrero-Ponce Y, Acevedo-Martínez L, Barigye SJ, et al. Multi-Server Approach for High-Throughput Molecular Descriptors Calculation based on Multi-Linear Algebraic Maps. Mol Inf 2015; 34(1): 60-9.
[47] Steinbeck C, Han YQ, Kuhn S, Horlacher O, Luttmann E, Willighagen EL. The Chemistry Development Kit (CDK): An open-source Java library for chemo- and bioinformatics. J Chem Inf Comput Sci 2003; 43(2): 493-500.
[48] Marrero-Ponce Y, Castillo-Garit J, Torrens F, Romero Zaldivar V, Castro E. Atom, Atom-Type, and Total Linear Indices of the “Molecular Pseudograph’s Atom
Adjacency Matrix”: Application to QSPR/QSAR Studies of Organic Compounds Molecules 2004; 9(12): 1100-23 Marrero-Ponce Y, Medina-Marrero R, Torrens F, Martinez Y, Romero-Zaldivar V, Castro EA Atom, atom-type, and total nonstochastic and stochastic quadratic fingerprints: a promising approach for modeling of antibacterial activity Bioorg Med Chem 2005; 13(8): 2881-99 Nikolic S, Trinajstic N, Mihalic Z, Carter S On the geometricdistance matrix and the corresponding structural invariants of molecular systems Chem Phys Lett 1991; 179(1-2): 21-8 Devillers J, Balaban AT Topological Indices and Related Descriptors in QSAR and QSPR Amsterdam, The Netherland: Gordon and Breach; 1999 Vargas-Quesada B, Anegón FM Visualizing the structure of science New York: Springer; 2007 Willett P Chemoinformatics – Similarity and Diversity in Chemical Libraries Curr Opinion Biotechnol 2000; 11: 85–8 Balaban AT, Bertelsen S, Basak SC New centric topological indexes for acyclic molecules (trees) and substituents (rooted trees), and coding of rooted trees MATCH Commun Math Comput Chem 1994; 30: 55–72 Balaban AT, Feroiu V Correlations between structure and critical data or vapor pressures of alkanes by means of topological indices Rep Mol Theor 1990; 1: 133-9 Holliday JD, Ranade SS, Willett P A Fast Algorithm for Selecting Sets of Dissimilar Molecules from Large Chemical Databases Quant Struct-Act Relat 1995; 14: 501-6 Jones PE, Curtice RM A Framework for Comparing Document Term Association Measures Am Doc 1967; 18: 153-61 González-Díaz H, Uriarte E Proteins QSAR with Markov average electrostatic potentials Bioorg Med Chem Lett 2005; 15(22): 508894 Ramos de Armas R, González Díaz H, Molina R, Uriarte E Markovian Backbone Negentropies: Molecular Descriptors for Protein Research I Predicting Protein Stability in Arc Repressor Mutants PROTEINS: Struc Funct Bioinform 2004; 56: 715–23 Marrero-Ponce Y, Castillo-Garit JA, Castro EA, Torrens F, Rotondo R 3D-chiral (2.5) atom-based TOMOCOMD-CARDD 
Current Bioinformatics, 2015, Vol 10, No [62] [63] [64] [65] [66] [67] [68] [69] [70] [71] [72] [73] [74] [75] [76] [77] [78] [79] [80] [81] [82] [83] [84] 31 descriptors: theory and QSAR applications to central chirality codification J Math Chem 2008; 44: 755-86 Carbo-Dorca R Stochastic Transformation of Quantum Similarity Matrixes and Their Use in Quantum QSAR (QQSAR) Models Int J Quantum Chem 2000; 79: 163-77 Edwards CH, Penney DE Elementary linear algebra Englewoods Cliffs: Prentice Hall; 1988 Sinkhorn R A Relationship Between Arbitrary Positive Matrices and Doubly Stochastic Matrices Ann Math Statist 1964; 35: 876-9 Sinkhorn R, Knopp P Concerning nonnegative matrices and doubly stochastic matrices Pacific J Math 1967; 21(2): 343-8 Janežič D, Miličević A, Nikolić S, Trinajstić N Graph Theoretical Matrices in Chemistry Mathematical Chemistry Monographs 2007 Barigye SJ, Marrero-Ponce Y, Santiago OM, López YM, Torrens F Shannon’s, Mutual, Conditional and Joint Entropy-Based Information Indices Generalization of Global Indices Defined from Local Vertex Invariants Curr Comput-Aided Drug Des 2013; Barigye SJ, Marrero-Ponce Y, Martínez López Y, Artiles Martínez LM, Pino-Urias RW, Martínez Santiago O, et al Relations Frequency Hypermatrices in Mutual, Conditional and Joint Entropy-Based Information Indices J Comput Chem 2013; 34: 259-74 García-Jacas CR, Marrero-Ponce Y, Barigye SJ, Valdés-Martiní JR, Rivera-Borroto OM, Olivero-Verbel J N-Linear Algebraic Maps for Chemical Structure Codification: A Suitable Generalization for Atom-pair Approaches? 
Curr Drug Metab 2014; 15(4): 441–69 J W Godden, F L Stahura, Bajorath J Variability of Molecular Descriptors in Compound Databases Revealed by Shannon Entropy Calculations J Chem Inf Comput Sci 2000; 40( ): 796-800 Pino-Urias RW, Barigye SJ, Marrero-Ponce Y, García-Jacas CR, Pérez-Giménez F, Valdés-Martiní JR IMMAN: Free Software for Information Theory-based Chemometric Analysis Mol Diversity 2015; 19(2): 305-19 Massey WF Principal components regression in exploratory statistical research J Amer Stat Assoc 1965; 60: 234–56 Mardia KV, Kent JT, Bibby JM Multivariate Analysis London: Academic Press; 1979 Todeschini R, Consonni V, Mauri A, Pavan M, Leardi R MobyDigs: software for regression and classification models by genetic algorithms Data Handling in Science and Technology: Elsevier; 2003 p 141-67 Robert D, Amat L, Carbo-Dorca R Three-Dimensional Quantitative-Activity Relationships from Tuned Molecular Quantum Similarity Measures: Prediction of the CorticosteroidBinding Globulin Binding Affinity for a Steroid Family J Chem Inf Comput Sci 1999; 39: 333-44 So SS, Karplus M Three-dimensional quantitative structureactivity relationships from molecular similarity matrices and genetic neural networks Method and validations J Med Chem 1997 Dec 19; 40: 4347-59 Sung-Sau S, Karplus M Three-Dimensional Quantitative Structure-Activity Relationships from Molecular Similarity Matrices and Genetic Neural Networks Applications J Med Chem 1997; 40: 4360-71 Wolpert DH, Macready WG No free lunch theorems for optimization IEEE Trans Evolut Comput 1997; 1: 67-82 Amat L, Besalu E, Carbo-Dorca R Identification of Active Molecular Sites Using Quantum-Self-Similarity Measures J Chem Inf Comput Sci 2001; 41: 978-91 Shu-Shen L, Chun-Sheng, Lian-Sheng W Combined MEDV-GAMLR Method for QSAR of Three Panels of Steroids, Dipeptides, and COX-2 Inhibitors J Chem Inf Comput Sci 2002; 42: 749-56 Beger RD, Harris SH, Xie Q Models of Steroid Binding Based on the Minimum Deviation of Structurally 
Assigned 13C NMR Spectra Analysis (MiDSASA) J Chem Inf Comput Sci 2004; 44: 1489-96 Polanski J The receptor-like neural network for modeling corticosteroid and testosterone binding globulins J Chem Inf Model 1997: 553-61 Silverman BD, Platt DE Comparative molecular moment analysis (CoMMA): 3D-QSAR without molecular superposition J Med Chem 1996; 39: 2129-40 Tuppurainen K, Viisas M, Peräkylä M, Laatikainen R Ligand intramolecular motions in ligand-protein interaction: ALPHA, a 32 [85] [86] [87] [88] [89] [90] [91] [92] [93] [94] [95] [96] [97] [98] [99] [100] [101] Current Bioinformatics, 2015, Vol 10, No Marrero-Ponce et al novel dynamic descriptor and a QSAR study with extended steroid benchmark dataset J Comp-Aided Mol Design 2004; 18: 175-87 Tuppurainen K, Viisas M, Laatikainen R, Perakyla M Evaluation of a Novel Electronic Eigenvalue (EEVA) Molecular Descriptor for QSAR/QSPR Studies: Validation Using a Benchmark Steroid Data Set J Chem Inf Comput Sci 2002; 42: 607-13 Polanski J, Bak A Modeling Steric and Electronic Effects in 3Dand 4D-QSAR Schemes: Predicting Benzoic pKa Values and Steroid CBG Binding Affinities J Chem Inf Comput Sci 2003; 43: 2081-92 De K, Sengupta C, Roy K QSAR modeling of globulin binding affinity of corticosteroids using AM1 calculations Bioorg Med Chem 2004; 12: 3323-32 Kellogg GE, Kier LB, Gaillard P, Hall LH E-state fields: Applications to 3D QSAR J Comput-Aided Mol Design 1996; 10: 513-20 Beger RD, Wilkes JE Developing 13C NMR quantitative spectrometric data-activity relationship (QSDAR) models of steroid binding to the corticosteroid binding globulin J Comp-Aided Mol Design 2001; 15: 659-69 Carolina de Gregorio LBK, Lowell H Hall QSAR modeling with electrotopological state indices: Corticosteroids J Comp-Aided Mol Design 1998; 12: 557-61 Polanski J, Walczak B The comparative molecular surface analysis (COMSA): a novel tool for molecular design Comput Chem 2000; 24: 615–25 Pastor M, Cruciani G, McLay I, Pickett P, Clementi S 
GRidINdependent Descriptors (GRIND): A Novel Class of AlignmentIndependent Three-Dimensional Molecular Descriptors J Med Chem 2000; 43: 3233-43 Kubinyi H, Hamprecht FA, Mietzner T Three-Dimensional Quantitative Similarity-Activity Relationships (3D QSiAR) from SEAL Similarity Matrices J Med Chem 1998; 41: 2553-64 Beger RD, Buzatu D, Wilkes JG, Lay J, J O Developing comparative structural connectivity spectra analysis (CoSCSA) models of steroid binding to the corticosteroid binding globulin J Chem Inf Comput Sci 2002; 42: 1123-31 Maw HH, Hall LH E-State Modeling of Corticosteroids Binding AffinityValidation of Model for Small Data Set J Chem Inf Comput Sci 2001; 41: 1248-54 Marín RM, Aguirre NF, Daza EE Graph Theoretical Similarity Approach To Compare Molecular Electrostatic Potentials J Chem Inf Model 2008; 48: 109-18 Manchester J, Czerminski R SAMFA: Simplifying Molecular Description for 3D-QSAR J Chem Inf Model 2008; 48: 1167-73 Andrew C Good SSS, W Graham Richards Structure-activity relationships from molecular similarity matrices J Med Chem 1993: 433-8 Ajmani S, Jadhav K, Kulkarni SA Three-Dimensional QSAR Using the k-Nearest Neighbor Method and Its Interpretation J Chem Inf Model 2005 Cosentino U, Moro G, Bonalumi D, Bonati L, Lasagni M, Todeschini R, et al A combined use of global and local approaches in 3D-QSAR Chemom Intell Laborat Sys 2000; 52: 183-94 Norinder U Single and domain mode variable selection in 3D QSAR applications J Chemom 1996; 10: 95-105 Received: February 12, 2014 [102] [103] [104] [105] [106] [107] [108] [109] [110] [111] [112] [113] [114] [115] [116] [117] [118] [119] Revised: May 10, 2014 Vorpagel ER Analysis of steroid binding using apex-3D and 3D QSAR models 210th American Chemical Society Meeting, Chicago, COMP-0125 1995 Ajay NJ, Kimberle K, Chapman D Compass: Predicting Biological Activities from Molecular Surface Properties Performance Comparisons on a Steroid Benchmark J Med Chem 1994; 37: 2315-27 Tominaga Y, Fujiwara I 
Prediction-Weighted Partial LeastSquares Regression Method (PWPLS) 2: Application to CoMFA J Chem Inf Comput Sci 1997; 37: 1152-7 Kotani T, K H Comparative Molecular Active Site Analysis (CoMASA) An Approach to Rapid Evaluation of 3D QSAR J Med Chem 2004; 47: 2732-42 Chen H, Zhou J, Xie G PARM: A Genetic Evolved Algorithm To Predict Bioactivity J Chem Inf Comput Sci 1998; 38: 243-50 Cho SJ, Tropsha A Cross-validated R2-guided region selection for comparative molecular field analysis: A simple method to achieve consistent results J Med Chem 1995; 38: 1060-6 Lowis DR HQSAR: A New, Highly Predictive QSAR Technique Tripos Technique Notes; Tripos: St Louis, MO, USA 1997 Oprea TI, Ciubotariu D, Sulea TI, Simon Z Comparison of the minimal steric difference (MTD) and comparative molecular field analysis (CoMFA) methods for analysis of binding of steroids to carrier proteins Quant Struct-Act Relat 1993; 12: 21-6 Norinder U 3D-QSAR Investigation of the Tripos Benchmark Steroids and some Protein-Tyrosine Kinase Inhibitors of Styrene Type using the TDQ Approach J Chemom 1996; 10(5-6): 533-45 Bravi G, Gancia E, Mascagni P, Pegna M, Todeschini R, Zaliani AJ MSWHIM, New 3D Theoretical Descriptors Derived from Molecular Surface Properties: A Comparative 3D QSAR Study in a Series of Steroids J Comput Aided Mol Des 1997; 11: 79-92 Klebe G, Abraham U, Mietzner T Molecular Similarity Indices in a Comparative Analysis (CoMSIA) of Drug Molecules to Correlate and Predict Their Biological Activity J Med Chem 1994; 37: 413046 Hahn M, Rogers D Receptor Surface Models Application to Quantitative Structure-Activity Relationships Studies J Med Chem 1995; 38: 2091-102 Klein CT, Kaiser D, Ecker G Topological Distance Based 3D Descriptors for Use in QSAR and Diversity Analysis J Chem Inf Comput Sci 2004; 44: 200-9 Schnitker J, Gopalaswamy R, Crippen GM Objective models for steroid binding sites of human globulins J Comput-Aided Mol Design 1997; 11: 93-110 Palyulin VA, Radchenko EV, Zefirov NS 
Molecular Field Topology Analysis Method in QSAR Studies of Organic Compounds J Chem Inform Comput Sci 2000; 40: 659-67 Robinson DD, Winn PJ, Lyne PD, Richards WG Self-organizing molecular field analysis: A tool for structure-activity studies J Med Chem 1999; 42: 573–83 Shu-Shen L, Chun-Sheng Y, Zhi-Liang L, Shao-Xi C QSAR Study of Steroid Benchmark and Dipeptides Based on MEDV-13 J Chem Inf Comput Sci 2001; 41: 321-9 Wildman SA, Crippen GM Three-dimensional molecular descriptors and a novel QSAR method J Mol Graphics Modell 2002; 21: 161-70 Accepted: June 25, 2014 ... 325 variables calculated for each mathematical operator This study comprises of two parts In the first study, we compare the mathematical operators collectively denominated as norms, statistical... different mathematical operators are used justifies the use different mathematical operators for the global Marrero-Ponce et al characterization of molecular structures While there operators with... this way, it could be useful to construct the matrices from the information about the contact (interaction) among atoms separated at a determined distance (or distance range) in the 2D structure