QSPR modelling of stability constants of metal thiosemicarbazone using artificial neural network and multivariate linear regression in environmental analysis


The fourth Scientific Conference - SEMREGG 2018

QSPR MODELLING OF STABILITY CONSTANTS OF METAL-THIOSEMICARBAZONE USING ARTIFICIAL NEURAL NETWORK AND MULTIVARIATE LINEAR REGRESSION IN ENVIRONMENTAL ANALYSIS

Nguyen Minh Quang1,2, Phạm Thi Thu Trang2, Tran Xuan Mau1, Tran Thi Thanh Ngoc3, Pham Van Tat4*

1 Faculty of Chemistry, University of Sciences, Hue University, 77 Nguyen Hue, Hue City
2 Faculty of Chemical Engineering, Industrial University of Ho Chi Minh City, 12 Nguyen Van Bao, Go Vap district, Ho Chi Minh City
3 Faculty of Geology and Mineralogy, Ho Chi Minh City University of Natural Resources and Environment, 236B Le Van Sy, Tan Binh district, Ho Chi Minh City
4 Faculty of Science and Technology, Hoa Sen University, Dinh Cong Trang, District 1, Ho Chi Minh City
*Email: vantat@gmail.com

ABSTRACT

Quantitative structure-property relationships (QSPRs) between the molecular structure and the stability constants (logβ11) of metal-thiosemicarbazone complexes were constructed by the multivariate linear regression (MLR) and artificial neural network (ANN) methods. The structural descriptors of the complexes include 2D and 3D molecular descriptors and physicochemical descriptors. The stability constants and their experimental parameters were collected from the published literature. The best model, QSPR_MLR (with k = 11), consists of the molecular descriptors xvp6, xvpc4, xvp7, xp5, xp4, N3, electric energy, cosmo area, dipole, knotp and volume. The quality of this QSPR_MLR model was verified by the statistical values R²train = 0.926, Q²LOO = 0.842, SE = 0.790 and Fstat = 58.992. The neural network model QSPR_ANN, with architecture I(11)-HL(8)-O(1), was characterized by the statistical values R²train = 0.994, Q²CV = 0.998 and R²test = 0.993. The logβ11 values predicted by the QSPR_MLR and QSPR_ANN models in external validation are in good agreement with the experimental values from the literature. The results can guide the design of new thiosemicarbazone derivatives for the quantitative analysis of
metal ions in environmental samples.

Keywords: QSPR models, stability constants logβ11, multivariate linear regression, artificial neural network, thiosemicarbazone

1. INTRODUCTION

As industry develops, environmental pollution increases, and heavy metal ions are released into the environment from factories, industrial parks and production facilities. The control and analysis of heavy metal ions therefore have to be fast and cheap to meet practical demands. Many methods for determining heavy metal content are in use worldwide, such as AAS, ICP-AES, ICP-MS and spectrophotometry [1-3], among which complexes of ligands with metal ions have been used extensively [4]. In this study, we focus on thiosemicarbazone derivatives because of their good complexing ability and the many published studies on their use in photometric analysis, a simple and inexpensive method.

QSPR modelling studies provide many advantages. QSPR helps in achieving efficient, effective, safe and environmentally benign chemicals and processes, and thereby facilitates sustainable chemical processes [5]. However, for a ligand to be used in the form of its complexes, the stability constant is an important consideration: it measures the strength of the interaction between the ligand and the metal ion that form the complex. In addition, with the development of computer science, it has become very common to guide empirical research quickly through theoretical studies that combine computational chemistry with chemical software and appropriate mathematical methods to solve complex problems [6].

In this work, we construct quantitative structure-property relationships (QSPRs) using the structural descriptors and the stability constants of complexes between metal ions and thiosemicarbazones. The structural descriptors are calculated by using the
semi-empirical quantum chemistry methods PM7 and PM7/sparkle [7], molecular mechanics, and connectivity calculations. The multivariate linear model QSPR_MLR is established by using the least-squares technique and the structural descriptors. The artificial neural network model QSPR_ANN is constructed by the error back-propagation method using the multilayer perceptron algorithm, with an input layer that consists of the structural descriptors of the best selected QSPR_MLR model. The stability constants logβ11 of the complexes between the metal ions and thiosemicarbazones in the test set, as predicted by the QSPR models, are validated and compared with the experimental data in the literature.

2. METHODOLOGY

2.1 Data set selection

The selection of the complexes that constitute the data set is the first step of the study. The logβ11 values of the metal-thiosemicarbazone complexes were taken from the literature (Table 1). A schematic representation of the complexes is given in Fig. 1 [8].

Figure 1. Structure of metal-thiosemicarbazone complex: a) General complex structure; b) Complex between Co2+/Zn2+ and nicotinaldehyde thiosemicarbazone [9]

The logβ11 values are the log-transformed stability constants of the metal-thiosemicarbazone complexes. The stability constant β is calculated from the reaction between a metal ion (M) and a thiosemicarbazone ligand (L) in aqueous solution. The reaction is

$$pM + qL \rightleftharpoons M_pL_q \qquad (1)$$

In the case of one step with p = 1 and q = 1, the stability constant for the formation of ML is given by [10]

$$\beta_{11} = \frac{[ML]}{[M][L]} \qquad (2)$$

Table 1. Complexes of metal ions and thiosemicarbazone and stability constants

      Thiosemicarbazone
Ord   R1   R2      R3        R4             Metal ion   logβ11   Ref.
1     H    H       H         -C6H4OH        Cu2+        5.280    [11]
2     H    -C6H5   -CH3      -CHNOH         Cu2+        7.418    [12]
3     H    -CH3    -CH3      -NHC6H5        Mn2+        10.050   [13]
4     H    H       H         -C9H5NOH       Cu2+        14.560   [14]
5     H    -C2H5   H         -C9H5NOH       Zn2+        6.130    [14]
6     H    H       H         -C5H3NCH3      Cu2+        19.100   [15,16]
7     H    H       H         -C6H4N(CH3)2   Cu2+        15.300   [15,16]
8     H    H       H         -C6H3OHOCH3    Pb2+        7.100    [17]
9     H    H       H         -C6H3OHOCH3    Zn2+        7.470    [17]
10    H    H       H         -C6H4OH        Mn2+        5.000    [18]
11    H    H       H         -C6H4OH        Cu2+        6.840    [18]
12    H    H       H         -C9H7OH        Mg2+        3.400    [19]
13    H    H       H         -C9H7OH        Mn2+        5.670    [19]
14    H    H       H         -C9H7OH        Cd2+        6.560    [19]
15    H    H       H         -C9H7OH        Pb2+        7.640    [19]
16    H    H       -         -C9H9NO        Cu2+        8.289    [20]
17    H    H       -         -C9H9NO        Ni2+        7.998    [20]
18    H    H       -         -C9H9NO        Pb2+        7.852    [20]
19    H    H       -         -C9H9NO        Co2+        7.806    [20]
20    H    H       -         -C9H9NO        Zn2+        7.645    [20]
21    H    H       -         -C9H9NO        Cd2+        7.599    [20]
22    H    H       -         -C9H9NO        Mn2+        6.041    [20]
23    H    H       H         -C6H4NH2       Cu2+        11.610   [21]
24    H    H       H         -C6H4NO2       Cd2+        10.630   [22]
25    H    H       -C6H4OH   -C6H4OH        Fe2+        5.496    [23]
26    H    H       H         -C5H4N         Cu2+        5.491    [24]

2.2 Multivariate linear regression

Multivariate linear regression (MLR) is a widely used statistical method for modelling the dependency between two or more variables by fitting a linear equation to the observed data. The relationship between the independent and dependent variables is described by the following equation [25, 26]

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon \qquad (3)$$

where Y is the dependent variable, β0 is the intercept of the model, βi is the slope associated with the independent variable Xi, k is the number of variables in the equation, and ε is the error. In this study, MLR is chosen to develop a relationship between the log-transformed stability constants (logβ11) and the structural parameters affecting them. The logβ11 value is the dependent variable, while the other parameters are independent variables. The MLR method is used to build the QSPR_MLR model. The method estimates the model parameters by the principle of least squares, that is, by minimizing the sum of squared differences between the observed and predicted values. This minimization yields the estimators of the model parameters. The models were screened by using the values R²train and Q²LOO [25-27]. Both were calculated by the same formula (4)

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} \qquad (4)$$

where n is the number of observations; Yi, Ŷi, and Ȳ are the
experimental, calculated and average values, respectively. Another statistic for evaluating the MLR model is the standard error (SE) of the estimate, a measure of the accuracy of the predictions. Recall that the regression line is the line that minimizes the sum of squared deviations of prediction. The standard error of the estimate is closely related to this quantity and is defined as [28]

$$SE = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - k - 1}} \qquad (5)$$

The QSPR_MLR model was constructed from the database of complexes between metal ions and ligands, which includes the 2D and 3D molecular descriptors, the quantum parameters and the stability constants logβ11 in Table 1. The two-dimensional structures of the metal-thiosemicarbazone complexes were drawn using BIOVIA Draw 2017 [29] and optimized by means of quantum mechanics on the MoPac2016 system [30]. The quantum descriptors were calculated by using the semi-empirical quantum method with the new version PM7 and PM7/sparkle for lanthanides [7]. The resulting geometry was transferred into the QSARIS system [27, 31], which calculated the 2D and 3D topological descriptors. First, the data set was divided into training and test sets: the test set contains about 20 % of the initial set, and the training set was used for constructing the regression model. The QSPR_MLR models were constructed using the backward elimination and forward regression techniques on the Regress system [25] and MS-EXCEL [26,27,32]. The artificial neural network model QSPR_ANN was built using the multilayer training technique on the Matlab system [33]. The predictability of the QSPR models was cross-validated by means of the leave-one-out (LOO) method using the statistic Q²LOO.

To assess the influence of each variable in the QSPR_MLR models, we introduced a quantity, the average contribution percentage MPxk,i. It is the percentage contribution of each independent variable in the selected QSPR models (with i from 1 to k) and is determined according to formula (6) [27, 31]

$$MPx_{k,i}\,(\%) = \frac{1}{N} \sum_{m=1}^{N} \frac{100\,|b_{k,i}\,x_{m,i}|}{\sum_{j=1}^{k} |b_{k,j}\,x_{m,j}|} \qquad (6)$$

where N is the number of observations; m runs over the substances used to calculate the Pxk,i values; and bk,i are the parameters of the model.

2.3 Artificial neural network

An artificial neural network (ANN) is a mathematical model that tries to simulate the functional aspects of biological neural networks. It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation. The ANN method has been applied successfully in various fields, including mathematics, robotic control, medicine and chemistry [34, 35]. An ANN model includes an input layer, one or more hidden layers, and an output layer. Each layer consists of neurons, and weights connect the neurons to one another. There are many kinds of ANN architectures for various applications, among which the multi-layer perceptron (MLP) is the simplest and the most commonly used for prediction [36]. The ANN architecture used in this study is therefore a multilayer feed-forward network with a single hidden layer: the model is composed of an input layer, a hidden layer and an output layer. We trained this typical feed-forward neural network with an error back-propagation learning algorithm. The way this type of network propagates information in the feed-forward direction can be written as [37, 38]

$$o_j = f\left(\sum_{i=1}^{N} w_{ij}\,x_i - q_j\right) \qquad (7)$$

where xi is the input factor, oj is the output factor, wij is the weight factor between two nodes, qj is the internal threshold, and f is the transfer function. In this work, we used the hyperbolic tangent sigmoid transfer function to train the ANN models. It is described by the following equation [37,38]

$$a = \mathrm{tansig}(n) = \frac{2}{1 + e^{-2n}} - 1 \qquad (8)$$

The training of the ANN model is carried out until the mean square error (MSE_ANN) is minimized, and the network output is then compared with the actual output values obtained from experimental results [38]. MSE_ANN is the average squared error between the network outputs (o) and the target outputs (t). It is written as follows [38]

$$MSE_{ANN} = \frac{1}{n} \sum_{i=1}^{n} (t_i - o_i)^2 \qquad (9)$$

The QSPR_ANN model is developed with the neural network technique using the "nntool" tool on the Matlab system [33]. The QSPR_ANN model is trained with the Levenberg-Marquardt back-propagation algorithm and the hyperbolic tangent sigmoid transfer function. The data set is divided randomly into three subsets: 70 % for the training set, 15 % for the cross-validation set and the remaining 15 % for the independent test set.

2.4 External validation

External validation ensures the predictability and applicability of the developed QSPR models for the prediction of untested molecules [5]. The models are externally validated by the statistical parameter Q²test, which is calculated with the same equation (4) for the test set. To evaluate the discrepancies between the experimental and predicted logβ11 values from the models, we used single-factor ANOVA. In addition, error analysis is an important part of QSPR studies. To assess the overall error and predictive performance of the developed models, the mean absolute value of the relative error, MARE, is calculated according to formula (10)

$$MARE\,(\%) = \frac{\sum_{i=1}^{n} ARE_i\,(\%)}{n}, \quad \text{where} \quad ARE\,(\%) = \frac{|\log\beta_{11,exp} - \log\beta_{11,cal}|}{\log\beta_{11,exp}} \times 100 \qquad (10)$$

where n is the number of test substances; β11,exp and β11,cal are the experimental and calculated stability constants.

3. RESULTS AND DISCUSSION

3.1 QSPR_MLR modelling

In the surveyed models, the statistical parameters SE, R²train, Q²LOO and Fstat (Fisher's value) were used to evaluate the models. A good calibration model has high R², Q² and F values and a low SE value with the smallest number of descriptors, among which the R² and Q² values are the most important. The QSPR_MLR models and their statistical values are shown in Table 2. The results in Table 2 show that when the k value increases the R²train and Q²LOO values
generally increase and the SE value decreases. Once k reaches 11, the R²train and Q²LOO values satisfy the statistical conditions [39]. When k increases to 12, R²train and Q²LOO continue to increase and SE continues to decrease; however, this variation is negligible. Models with k of 13 or more are unnecessary because they only increase the number of variables. Thus, the QSPR_MLR model with k = 11 is the best of all the models. The quality of this QSPR_MLR model is shown by the R²train value of 0.926, the standard error SE of 0.790, the Fstat value of 58.992 and the Q²LOO value of 0.842. The linear regression equation of the QSPR_MLR model is as follows

logβ11 = 7.984 - 5.997x1 + 3.044x2 + 5.960x3 - 24.356x4 + 26.688x5 + 22.313x6 - 0.00127x7 - 0.227x8 + 1.148x9 + 13.437x10 + 0.089x11   (11)

Table 2. Selected QSPR_MLR models (k = 4 to 13) and statistical values

k    Variables                                      SE      R²train   Q²LOO   Fstat
4    x1/x2/x3/x4                                    1.998   0.462     0.312   12.657
5    x1/x2/x3/x4/x5                                 1.747   0.596     0.420   17.085
6    x1/x2/x3/x4/x5/x6                              1.629   0.654     0.491   17.985
7    x1/x2/x3/x4/x5/x6/x7                           1.535   0.699     0.537   18.538
8    x1/x2/x3/x4/x5/x6/x7/x8                        1.546   0.700     0.528   16.027
9    x1/x2/x3/x4/x5/x6/x7/x8/x9                     1.376   0.766     0.574   19.674
10   x1/x2/x3/x4/x5/x6/x7/x8/x9/x10                 1.108   0.851     0.668   30.327
11   x1/x2/x3/x4/x5/x6/x7/x8/x9/x10/x11             0.790   0.926     0.842   58.992
12   x1/x2/x3/x4/x5/x6/x7/x8/x9/x10/x11/x12         0.699   0.943     0.879   70.313
13   x1/x2/x3/x4/x5/x6/x7/x8/x9/x10/x11/x12/x13     0.621   0.956     0.905   83.502

Notation of molecular descriptors: x1 = xvp6; x2 = xvpc4; x3 = xvp7; x4 = xp5; x5 = xp4; x6 = N3; x7 = electric energy; x8 = cosmo area; x9 = dipole; x10 = knotp; x11 = volume; x12 = surface; x13 = N4.

The number of descriptors k was varied in the range 4 to 13. Changing the number of structural parameters changes the values of SE, R²train and Q²LOO (Figure 2a).

Figure 2. a) Change of the SE, R²train and Q²LOO values with the number of descriptors k (x-axis: number of variables, k; y-axes: standard error, SE, and R²train and Q²LOO); b) Comparison of experimental vs. predicted logβ11 values of the data set using the QSPR_MLR model (with k = 11)

According to Table 3, the contributions of the molecular descriptors in each complex are ranked by their GMPxi values (GMPxi is the average value of MPxk,i, calculated from the results of the three good models with k = 11 to 13): xp4 > xp5 > cosmo area > volume. The xp4 parameter (x5), with a GMPx5 value of 31.2463, strongly influences the stability constant of the complexes. The xp4 parameter is called Chi path 4, the simple 4th-order path Chi index. Next, the xp5 parameter (x4) is called Chi path 5, the simple 5th-order path Chi index. The last two parameters that strongly affect the stability constant are cosmo area (x8) and volume (x11), which are geometric parameters of the molecule.
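The screening statistics of Section 2.2 (R²train, SE and Q²LOO) can be illustrated with a minimal sketch. It is not the authors' workflow, which used the Regress system and MS-EXCEL: the helper names (`solve`, `fit_mlr`, `predict`, `r_squared`, `standard_error`, `q2_loo`) are hypothetical, the data are synthetic with two made-up descriptors, and the SE denominator n - k - 1 is the usual definition, assumed here for Eq. (5).

```python
import math


def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_mlr(X, y):
    """Least-squares coefficients [b0, b1, ..., bk] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(coef, x):
    """Eq. (3) without the error term: b0 + sum(bi * xi)."""
    return coef[0] + sum(b * v for b, v in zip(coef[1:], x))


def r_squared(y, y_hat):
    """Eq. (4): 1 - SS_res / SS_tot."""
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def standard_error(y, y_hat, k):
    """Eq. (5), assuming the denominator n - k - 1."""
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    return math.sqrt(ss_res / (len(y) - k - 1))


def q2_loo(X, y):
    """Leave-one-out Q2: each point is predicted by a model fitted without it."""
    preds = []
    for i in range(len(y)):
        coef = fit_mlr(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        preds.append(predict(coef, X[i]))
    return r_squared(y, preds)


# Synthetic data (NOT the paper's descriptors): y = 1 + 2*x1 - 0.5*x2 + noise
X = [[0.0, 1.0], [1.0, 0.5], [2.0, 2.0], [3.0, 1.5],
     [4.0, 3.0], [5.0, 2.5], [6.0, 4.0], [7.0, 3.5]]
noise = [0.1, -0.2, 0.15, -0.05, 0.2, -0.1, -0.15, 0.05]
y = [1.0 + 2.0 * a - 0.5 * b + e for (a, b), e in zip(X, noise)]

coef = fit_mlr(X, y)
y_hat = [predict(coef, x) for x in X]
print("R2_train:", round(r_squared(y, y_hat), 4))
print("SE:", round(standard_error(y, y_hat, k=2), 4))
print("Q2_LOO:", round(q2_loo(X, y), 4))
```

For an ordinary least-squares model Q²LOO cannot exceed R²train, which is the pattern seen in every row of Table 2; a large gap between the two is a common sign of overfitting.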

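The feed-forward computation of Section 2.3 and the error measures of Eqs. (9) and (10) can be sketched as follows. This is an illustration only, not the Matlab "nntool" model: the function names are hypothetical, the weights and predicted values are made up, the threshold q_j is assumed to be subtracted from the weighted sum, and the linear output node is an assumption (the text does not state the output transfer function).

```python
import math


def tansig(n):
    """Eq. (8): 2 / (1 + exp(-2n)) - 1, mathematically equal to tanh(n).

    math.tanh(n) is the numerically safer choice for very negative n,
    where exp(-2n) would overflow.
    """
    return 2.0 / (1.0 + math.exp(-2.0 * n)) - 1.0


def layer(inputs, weights, thresholds, transfer):
    """Eq. (7) for every node j, assuming the threshold q_j is subtracted."""
    return [transfer(sum(w * x for w, x in zip(w_j, inputs)) - q_j)
            for w_j, q_j in zip(weights, thresholds)]


def forward(x, hidden_w, hidden_q, out_w, out_q):
    """One hidden tansig layer followed by a single linear output node."""
    hidden = layer(x, hidden_w, hidden_q, tansig)
    return layer(hidden, out_w, out_q, lambda n: n)[0]


def mse(targets, outputs):
    """Eq. (9): mean squared error between targets t and outputs o."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


def mare(exp_vals, calc_vals):
    """Eq. (10): mean of |exp - calc| / exp * 100, in percent."""
    total = sum(abs(e - c) / e * 100.0 for e, c in zip(exp_vals, calc_vals))
    return total / len(exp_vals)


# Hypothetical 2-input, 2-hidden-node network (the paper uses I(11)-HL(8)-O(1)).
hidden_w = [[0.4, -0.3], [0.1, 0.8]]
hidden_q = [0.05, -0.10]
out_w = [[1.5, -0.7]]
out_q = [0.2]
print("network output:", forward([0.6, 0.9], hidden_w, hidden_q, out_w, out_q))

# Experimental logβ11 values of complexes 1-3 in Table 1; the "calculated"
# values are invented for illustration and are not model results.
log_beta_exp = [5.280, 7.418, 10.050]
log_beta_cal = [5.10, 7.60, 9.80]
print("MSE:", mse(log_beta_exp, log_beta_cal))
print("MARE, %:", mare(log_beta_exp, log_beta_cal))
```

Training, here left out, would adjust the weights and thresholds until the MSE stops decreasing on the cross-validation subset.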
Date posted: 03/03/2023, 08:35
