The Fourth Scientific Conference - SEMREGG 2018
APPLICATION OF MLR, PCR AND ANN MODELS FOR THE PREDICTION OF STABILITY CONSTANTS OF DIVERSE METAL CATIONS WITH THIOSEMICARBAZONE DERIVATIVES IN ENVIRONMENTAL MONITORING

Nguyen Minh Quang1,2, Huynh Nhat Lam2, Tran Xuan Mau1, Tran Thi Thanh Ngoc3, Pham Van Tat4*

1 Faculty of Chemistry, University of Sciences, Hue University, 77 Nguyen Hue, Hue City
2 Faculty of Chemical Engineering, Industrial University of Ho Chi Minh City, 12 Nguyen Van Bao, Go Vap district, Ho Chi Minh City
3 Faculty of Geology and Mineralogy, Ho Chi Minh City University of Natural Resources and Environment, 236B Le Van Sy, Tan Binh district, Ho Chi Minh City
4 Faculty of Science and Technology, Hoa Sen University, Dinh Cong Trang, District 1, Ho Chi Minh City
*Email: vantat@gmail.com

ABSTRACT

Multivariate linear regression (MLR), principal component regression (PCR) and artificial neural network (ANN) methods were used to construct quantitative relationships between molecular structure and the stability constants (logβ11) of metal-thiosemicarbazone complexes. In this study, the QSPR models were built with the descriptors knotp, SHBa, HOMO, xvpc4, N4, LUMO, ionization potential, dipole, molecular weight, Maxneg and Hf, selected from the descriptor sets. The quality of the QSPR models was assessed with the statistical parameters R²train = 0.9296, Q²LOO = 0.8673, MSE = 0.5878 and Fstat = 45.5829 for the QSPRMLR model, and R²train = 0.9236, R²CV = 0.9423, MSE = 0.4190 and Fstat = 30.7886 for the QSPRPCR model. The neural network model QSPRANN, with architecture I(11)-HL(14)-O(1), was constructed using the descriptors of the 11-variable QSPRMLR model. The QSPRANN model gave R²train = 0.9912, Q²CV = 0.9938 and R²test = 0.9948. The logβ11 values predicted by the QSPRMLR, QSPRPCR and QSPRANN models were validated externally on ten randomly selected complexes. The results can guide the design of new thiosemicarbazones for the rapid
analysis of metal ions in environmental monitoring.

Keywords: QSPR models, stability constants logβ11, multivariate linear regression, artificial neural network, thiosemicarbazone

1. INTRODUCTION

Pollution is the introduction of contaminants into the natural environment that cause adverse change. Pollution can take the form of chemical substances or energy, such as noise, heat or light. Major forms of pollution include air pollution, light pollution, littering, noise pollution, plastic pollution, soil contamination, radioactive contamination, thermal pollution, visual pollution and water pollution [1]. Here we are concerned with environmental pollution caused by chemical agents, especially heavy metals.

In water pollution, heavy metal ions come from a variety of sources. Most of them are emitted by the metallurgical and electroplating industries [2-3]. Trace amounts of metal ions are important in industry [3], as environmental pollutants [3, 4] and as occupational hazards [3, 4]. There are many ways to monitor them in the environment. One of them is spectrophotometry with UV-VIS equipment, which is widely used because it is cheap and easy to handle. This method uses organic substances as complexing reagents [5] to determine various metals. In addition, thiosemicarbazones have rapidly gained popularity in environmental chemistry for the determination of metal ions [6]. A survey of the literature reveals that a few thiosemicarbazones are employed for the direct spectrophotometric determination of metals in aqueous solution. In published studies, the authors proposed new thiosemicarbazone analytical reagents for the spectrophotometric analysis of metals. In the complex formation of metal ions with a thiosemicarbazone ligand, the stability constant plays a very important role. This quantity is related to the metal ions and the structural properties of the ligands.
These relationships can be built with the quantitative structure-property relationship (QSPR) method, a modeling approach that has been successfully applied in many fields [7]. Furthermore, QSPR modeling is an effective way to establish the relationship between the structural descriptors of molecules and their properties, with a view to designing new compounds [8].

The present study focuses on the construction of QSPR models for reliably predicting the stability constants of metal-thiosemicarbazone complexes. The QSPR models were developed from the experimental stability constants and the chemical structures. The structural descriptors were calculated using the new semi-empirical quantum chemistry methods PM7 and PM7/sparkle [9] and molecular geometry calculations. The QSPR models comprise the QSPRMLR, QSPRPCR and QSPRANN models. The QSPRMLR model is established by multivariate linear regression; the QSPRPCR model is built by principal component regression; and the QSPRANN model is a multilayer perceptron trained by error back-propagation, whose input layer consists of the variables of the best QSPRMLR model. The stability constants logβ11 of the complexes between metal ions and thiosemicarbazones predicted by the QSPR models are also validated externally against experimental data from the literature. To our knowledge, this is the first QSPR study of the stability constants of metal-thiosemicarbazone complexes.

2. COMPUTATIONAL METHODS

2.1 Experimental Datasets

The experimental stability constants (logβ11) of the (M:L) complexes of transition metal ions (Ni2+, Co2+, Mo6+, Cu2+, Mn2+, Zn2+, Ag+, Pb2+, Fe2+ and Zn2+) with different thiosemicarbazone ligands in water were taken from the published literature [10-21], at temperatures from 288 K to 323 K, pH from 2.4 to 10 and an average ionic strength I = 0.1 M. The 50
metal-thiosemicarbazone complexes of the training set involve the metal ions 3 (Cd2+), 5 (Co2+), 2 (Cr3+), 1 (Cr6+), 3 (Cu2+), 4 (Fe2+), 1 (Mg2+), 9 (Mn2+), 12 (Ni2+), 5 (Pb2+) and 5 (Zn2+), as presented in Table 1. The test set includes 10 substances with the metal ions Cu2+, Ho3+, Dy3+, Co2+, Zn2+, Mn2+, Ni2+ and Fe3+ [15,16,18,22,23], as shown at the end of Table 1. The metal-thiosemicarbazone complexes (ML) are formed by the reaction between a metal ion (M) and a thiosemicarbazone ligand (L) in aqueous solution [25]; the general structures of the thiosemicarbazones and their complexes are shown in Fig. 1 [6].

Figure 1. Schematic of complex formation and general structure of thiosemicarbazone (a) and metal-thiosemicarbazone complex (b)

Here the logβ11 values are the log-transformed stability constants of the metal-thiosemicarbazone complexes, and the stability constant β11 is calculated by the following equation [25]:

$$\beta_{11} = \frac{[\mathrm{ML}]}{[\mathrm{M}][\mathrm{L}]} \qquad (1)$$

Table 1. The experimental logβ11,exp values for the ML complexes and the cross-validated and externally predicted logβ11,cal values resulting from QSPRMLR, QSPRPCR and QSPRANN

Training set and internal test set

| No. | R1 | R2 | R3 | R4 | Metal ion | logβ11,exp | logβ11,cal MLR | logβ11,cal PCR | logβ11,cal ANN | Ref. |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | H | H | H | -C6H3(OH)OCH3 | Co2+ | 11.970 | 12.598 | 12.784 | 11.926 | [10] |
| 2 | H | H | H | -C6H3(OH)OCH3 | Mn2+ | 10.550 | 11.042 | 11.371 | 10.555 | [10] |
| 3 | H | -CH3 | -CH3 | -C5H4N | Cu2+ | 6.114 | 6.648 | 6.608 | 6.077 | [11] |
| 4 | H | H | H | -C6H3BrOH | Cu2+ | 5.633 | 6.075 | 6.061 | 5.778 | [12] |
| 5 | H | H | -CH3 | -CH=N-NHC6H5 | Cu2+ | 11.950 | 10.353 | 10.388 | 11.662 | [13] |
| 6 | H | H | -CH3 | -CH=N-NHC6H5 | Co2+ | 10.220 | 11.463 | 11.320 | 10.206 | [13] |
| 7 | H | H | -CH3 | -CH=N-NHC6H5 | Ni2+ | 10.890 | 10.914 | 11.058 | 10.873 | [13] |
| 8 | H | H | H | -C6H3(OH)OCH3 | Cr6+ | 4.842 | 5.125 | 5.202 | 4.871 | [14] |
| 9 | H | H | -CH3 | -CH=N-NHC6H5 | Mn2+ | 9.870 | 9.324 | 9.439 | 9.716 | [15] |
| 10 | H | H | -CH3 | -CH=N-NHC6H5 | Mn2+ | 9.720 | 9.324 | 9.439 | 9.716 | [15] |
| 11 | H | H | -CH3 | -CH=N-NHC6H5 | Mn2+ | 9.600 | 9.324 | 9.439 | 9.716 | [15] |
| 12 | H | H | H | -C9H5NOH | Zn2+ | 6.680 | 6.761 | 6.671 | 6.701 | [16] |
| 13 | H | H | H | -C6H3(OH)OCH3 | Mn2+ | 4.120 | 5.432 | 5.492 | 3.436 | [17] |
| 14 | H | H | H | -C6H3(OH)OCH3 | Fe2+ | 8.150 | 7.454 | 7.619 | 7.952 | [17] |
| 15 | H | H | H | -C6H3(OH)OCH3 | Fe2+ | 7.990 | 7.454 | 7.619 | 7.952 | [17] |
| 16 | H | H | H | -C6H3(OH)OCH3 | Fe2+ | 7.840 | 7.454 | 7.619 | 7.952 | [17] |
| 17 | H | H | H | -C6H3(OH)OCH3 | Fe2+ | 7.690 | 7.454 | 7.619 | 7.952 | [17] |
| 18 | H | H | H | -C6H3(OH)OCH3 | Ni2+ | 8.650 | 8.229 | 8.228 | 8.402 | [17] |
| 19 | H | H | H | -C6H3(OH)OCH3 | Ni2+ | 8.480 | 8.229 | 8.228 | 8.402 | [17] |
| 20 | H | H | H | -C6H3(OH)OCH3 | Ni2+ | 8.370 | 8.229 | 8.228 | 8.402 | [17] |
| 21 | H | H | H | -C6H3(OH)OCH3 | Ni2+ | 8.110 | 8.229 | 8.228 | 8.402 | [17] |
| 22 | H | H | -CH3 | -C6H4OH | Ni2+ | 5.940 | 5.492 | 5.433 | 5.558 | [18] |
| 23 | H | H | -CH3 | -C6H4OH | Ni2+ | 5.310 | 5.492 | 5.433 | 5.558 | [18] |
| 24 | H | H | -CH3 | -C6H4OH | Ni2+ | 5.140 | 5.492 | 5.433 | 5.558 | [18] |
| 25 | H | H | H | -C10H6OH | Mg2+ | 3.250 | 3.916 | 3.858 | 4.081 | [19] |
| 26 | H | H | H | -C10H6OH | Mn2+ | 4.660 | 3.709 | 3.759 | 4.665 | [19] |
| 27 | H | H | H | -C10H6OH | Pb2+ | 6.680 | 7.061 | 7.235 | 6.644 | [19] |
| 28 | H | H | H | -C10H6OH | Pb2+ | 6.570 | 7.061 | 7.235 | 6.644 | [19] |
| 29 | H | H | H | -C9H8NO | Ni2+ | 8.221 | 7.964 | 7.913 | 8.098 | [20] |
| 30 | H | H | H | -C9H8NO | Ni2+ | 8.124 | 7.964 | 7.913 | 8.098 | [20] |
| 31 | H | H |  | -C9H8NO | Ni2+ | 7.910 | 7.964 | 7.913 | 8.098 | [20] |
| 32 | H | H |  | -C9H8NO | Ni2+ | 7.709 | 7.964 | 7.913 | 8.098 | [20] |
| 33 | H | H |  | -C9H8NO | Pb2+ | 7.861 | 7.357 | 7.176 | 7.536 | [20] |
| 34 | H | H |  | -C9H8NO | Pb2+ | 7.653 | 7.357 | 7.176 | 7.536 | [20] |
| 35 | H | H |  | -C9H8NO | Pb2+ | 7.307 | 7.357 | 7.176 | 7.536 | [20] |
| 36 | H | H |  | -C9H8NO | Co2+ | 7.668 | 7.388 | 7.203 | 7.463 | [20] |
| 37 | H | H |  | -C9H8NO | Co2+ | 7.591 | 7.388 | 7.203 | 7.463 | [20] |
| 38 | H | H |  | -C9H8NO | Co2+ | 7.251 | 7.388 | 7.203 | 7.463 | [20] |
| 39 | H | H |  | -C9H8NO | Zn2+ | 7.820 | 7.269 | 7.241 | 7.272 | [20] |
| 40 | H | H |  | -C9H8NO | Zn2+ | 7.534 | 7.269 | 7.241 | 7.272 | [20] |
| 41 | H | H |  | -C9H8NO | Zn2+ | 7.423 | 7.269 | 7.241 | 7.272 | [20] |
| 42 | H | H |  | -C9H8NO | Zn2+ | 7.039 | 7.269 | 7.241 | 7.272 | [20] |
| 43 | H | H |  | -C9H8NO | Cd2+ | 7.015 | 6.924 | 6.927 | 6.774 | [20] |
| 44 | H | H |  | -C9H8NO | Cd2+ | 6.863 | 6.924 | 6.927 | 6.774 | [20] |
| 45 | H | H |  | -C9H8NO | Cd2+ | 6.611 | 6.924 | 6.927 | 6.774 | [20] |
| 46 | H | H |  | -C9H8NO | Mn2+ | 5.820 | 5.860 | 5.971 | 5.529 | [20] |
| 47 | H | H |  | -C9H8NO | Mn2+ | 5.621 | 5.860 | 5.971 | 5.529 | [20] |
| 48 | H | H |  | -C9H8NO | Mn2+ | 5.439 | 5.860 | 5.971 | 5.529 | [20] |
| 49 | H | H |  | -C6H4NO2 | Cr3+ | 10.150 | 11.007 | 10.952 | 10.696 | [21] |
| 50 | H | H |  | -C6H4NO2 | Cr3+ | 11.250 | 11.007 | 10.952 | 10.696 | [21] |

External test set

| No. | R1 | R2 | R3 | R4 | Metal ion | logβ11,exp | logβ11,cal MLR | logβ11,cal PCR | logβ11,cal ANN | Ref. |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -CH3 | -CH3 | -C5H4N | -C5H4N | Cu2+ | 7.080 | 7.1140 | 7.0279 | 6.560 | [22] |
| 2 | H | H | -CH3 | -C5H4N | Ho3+ | 8.640 | 8.9132 | 8.6831 | 9.198 | [23] |
| 3 | H | H | -CH3 | -C5H4N | Dy3+ | 8.240 | 8.6043 | 8.3225 | 8.765 | [23] |
| 4 | H | H | -CH3 | -CH=N-NHC6H5 | Cu2+ | 11.700 | 10.3534 | 10.3882 | 11.662 | [15] |
| 5 | H | H | -CH3 | -CH=N-NHC6H5 | Co2+ | 10.020 | 11.4634 | 11.3203 | 10.206 | [15] |
| 6 | H | -CH3 | -CH3 | -CH=N-NHC6H5 | Cu2+ | 12.300 | 12.2869 | 12.4456 | 11.903 | [15] |
| 7 | H | -C2H5 | H | -C9H5NOH | Zn2+ | 6.130 | 7.0925 | 7.1365 | 6.623 | [16] |
| 8 | H | H | -CH3 | -C6H4OH | Mn2+ | 5.000 | 4.7429 | 4.8538 | 5.280 | [18] |
| 9 | H | H | H | -C6H4NH2 | Ni2+ | 12.710 | 12.1015 | 12.2011 | 11.523 | [22] |
| 10 | H | H | -C6H4OH | -C6H4OH | Fe3+ | 5.496 | 6.2633 | 6.3243 | 6.251 | [24] |

2.2 Calculation and selection of descriptors

Based on the experimental results, the 2D structures of the metal-thiosemicarbazone complexes were sketched using ChemBioDraw Ultra 2013 [26]. The structures were then optimized by quantum mechanics in MoPac2016 [27]. The quantum descriptors were also computed in MoPac2016 using the new semi-empirical quantum methods PM7 and PM7/sparkle for lanthanides [9]. The 2D and 3D topological descriptors were obtained from the QSARIS system [28, 29]. After all parameters had been computed, unsuitable variables were filtered out to assemble a database of observations comprising the logβ11 values and the structural parameters. This database was then used to develop the models.

2.3 QSPR modeling method

2.3.1 MLR method

In a quantitative structure-property relationship, the dependent variable (Y), here the logβ11 values, is correlated with independent quantitative or qualitative variables, here the structural descriptors (X). The relationship is assumed to be well represented by a multivariate linear regression (MLR) model. In this case, the regression model with k
explanatory variables is expressed as [28, 30, 31]

$$Y = \beta_0 + \sum_{j=1}^{k} \beta_j X_j + \varepsilon \qquad (2)$$

where β0 is the intercept of the model, βj are the regression coefficients (slopes) and ε is the random error with expectation 0 and variance σ².

2.3.2 PCR method

In statistics, principal component regression (PCR) is a regression analysis technique based on principal component analysis (PCA). It regresses the dependent variable on a set of explanatory (independent) variables using a standard linear regression model, but uses PCA to estimate the unknown regression coefficients of the model. PCR can be divided into three steps: first, a PCA is run on the table of explanatory variables; second, an MLR is run on the selected components; third, the parameters of the model that correspond to the input variables are computed [32].

In PCR, instead of regressing the dependent variable on the explanatory variables directly, the principal components (PCs) of the explanatory variables are used as regressors. Normally only a subset of all the PCs is used for regression, which makes PCR a kind of regularized procedure. Often the PCs with higher variances are selected as regressors. However, for the purpose of predicting the dependent variable, the PCs with low variances may also be important, in some cases even more important [32].

One major use of PCR lies in overcoming the multicollinearity problem, which arises when two or more of the explanatory variables are close to being collinear [33]. PCR can deal with such situations by excluding some of the low-variance PCs in the regression step. In addition, by regressing on only a subset of all the principal components, PCR achieves dimension reduction by substantially lowering the effective number of parameters that characterize the underlying model. This can be particularly useful in settings with high-dimensional covariates. Also, through appropriate selection of
the PCs to be used for regression, PCR can lead to efficient prediction of the outcome based on the assumed model. If there are linear relationships between the independent variables of an MLR model, the independent variables are multicollinear. The main advantage of the PCR model is that it eliminates this collinearity among the independent variables [32].

2.3.3 Artificial neural network

For ANN analysis, a multilayer perceptron (MLP), for which many learning algorithms exist, is normally used. The MLP can have more than one hidden layer. However, some studies have pointed out that one hidden layer is enough for an ANN to approximate any complex non-linear function [34]. In the present work, suitable numbers of hidden layers and epochs were determined by trial and error. The hyperbolic tangent sigmoid function was used as the transfer function for the input and output data. It is written as equation (3) [35]:

$$a = \mathrm{tansig}(n) = \frac{2}{1 + e^{-2n}} - 1 \qquad (3)$$

In this study, all ANN calculations were performed on the data set of compounds in Matlab version 2016a [36] with the Neural Network tool (nntool) toolbox. We used a typical feed-forward neural network trained with the Levenberg-Marquardt learning algorithm (trainlm) [37]. This algorithm appears to be the fastest method for training moderate-sized feed-forward neural networks.

2.3.4 Models Validation

The modeling methods minimize the sum of squared differences between the observed and predicted values. This minimization yields the estimates of the model parameters. The models were screened using R²train for the training set, Q²LOO or Q²CV for the cross-validation set, R²test for the independent test set of the ANN model only, and Q²test for the external validation set of all models [28, 30, 31]. These were all calculated by the same formula:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} \qquad (4)$$

where Yi, Ŷi and Ȳ are the experimental, calculated and average values, respectively. The mean squared error (MSE) of
regression methods is defined by equation (5) [28, 30, 31]:

$$MSE = \frac{\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2}{N - k - 1} \qquad (5)$$

The ANN model is trained until the mean squared error (MSEANN) is minimized, after which the network output is compared with the actual output values obtained from the experimental results [38]. MSEANN is the average squared error between the network outputs (o) and the target outputs (t). It is written as follows [38]:

$$MSE_{ANN} = \frac{1}{n} \sum_{i=1}^{n} (t_i - o_i)^2 \qquad (6)$$

To compare the quality of the models, we use the mean absolute value of the relative errors, MARE (%), where ARE (%) is the absolute value of the relative error. They are calculated by the following expressions [28, 29]:

$$MARE,\% = \frac{\sum_{i=1}^{n} ARE_i,\%}{n}, \quad \text{where} \quad ARE,\% = \frac{\left| \log\beta_{11,exp} - \log\beta_{11,cal} \right|}{\log\beta_{11,exp}} \times 100 \qquad (7)$$

where n is the number of test substances; β11,exp and β11,cal are the experimental and calculated stability constants.

3. RESULTS AND DISCUSSION

3.1 Constructing models

QSPRMLR

To build the QSPRMLR model, the data set was divided into a training set and a test set, with the test set accounting for 20 %. The QSPRMLR models were constructed using backward elimination and forward regression in Regress [39] and MS-EXCEL [28, 40, 41]. The QSPR models were cross-validated by the leave-one-out (LOO) method using the statistic Q²LOO. The quality of the models was evaluated through statistical values such as R²train, Q²LOO, Fstat (Fisher's value) and MSE. A good calibration model has high R²train, Q²LOO and Fstat values and a low MSE value with the smallest number of descriptors. The results of the QSPRMLR models are shown in Table 2. The number of descriptors k ranged from 1 to 12.

Table 2. The results of QSPRMLR models (k from 1 to 12) with statistical values. The best model is in bold.

No. 1: logβ11 = 10.9658 + 2.0345knotp
n = 50; k = 1; MSE = 1.7505; R²train = 0.2106; Q²LOO = 0.1526; Fstat = 12.8093

No. 2: logβ11 = 6.1372 + 2.0769knotp + 0.2107SHBa
n = 50; k = 2; MSE = 1.6150; R²train = 0.3421; Q²LOO = 0.2696; Fstat = 12.2220

No. 3: logβ11 = 16.2732 + 2.8514knotp + 0.2374SHBa + 1.2022HOMO
n = 50; k = 3; MSE = 1.4937; R²train = 0.4493; Q²LOO = 0.3635; Fstat = 12.5088

No. 4: logβ11 = 16.2307 + 3.6618knotp + 0.2864SHBa + 1.3207HOMO + 0.3637xvpc4
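The statistics listed for each model in Table 2 (R²train, Q²LOO, MSE, Fstat) and the MARE measure of equation (7) can be sketched as follows. This is a minimal illustration, not the authors' Regress or MS-EXCEL workflow: it uses ordinary least squares from NumPy, random synthetic data in place of the real descriptor matrix, and the hypothetical helper names `mlr_statistics` and `mare_percent`. It also assumes N - k - 1 residual degrees of freedom for MSE and Fstat.

```python
import numpy as np


def mlr_statistics(X, y):
    """Fit y = b0 + X @ b by ordinary least squares and return fit statistics.

    Hypothetical helper: the names and the choice of N - k - 1 residual
    degrees of freedom are assumptions of this sketch.
    """
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])          # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficients
    resid = y - A @ coef
    sse = float(resid @ resid)                    # residual sum of squares
    sst = float(((y - y.mean()) ** 2).sum())      # total sum of squares
    r2 = 1.0 - sse / sst
    mse = sse / (n - k - 1)
    f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))

    # Leave-one-out: refit without observation i, then predict observation i.
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        c_i, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += float((y[i] - A[i] @ c_i) ** 2)
    q2_loo = 1.0 - press / sst

    return {"coef": coef, "R2": r2, "Q2_LOO": q2_loo, "MSE": mse, "F": f_stat}


def mare_percent(y_exp, y_cal):
    """Mean absolute relative error in percent, relative to the experimental values."""
    y_exp = np.asarray(y_exp, dtype=float)
    y_cal = np.asarray(y_cal, dtype=float)
    return float(np.mean(np.abs(y_exp - y_cal) / y_exp) * 100.0)


# Synthetic stand-in for 50 complexes and 3 descriptors (not the paper's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 8.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=50)

stats = mlr_statistics(X, y)
print({name: np.round(value, 4) for name, value in stats.items()})
```

Because every leave-one-out residual is at least as large in magnitude as the corresponding fitted residual, Q²LOO computed this way can never exceed R²train. A large gap between the two suggests that the model depends heavily on individual complexes.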