Journal of Chromatography A, 1333 (2014) 25–31

Contents lists available at ScienceDirect

Journal of Chromatography A journal homepage: www.elsevier.com/locate/chroma

Application of random forests method to predict the retention indices of some polycyclic aromatic hydrocarbons N. Goudarzi a,∗ , D. Shahsavani b , F. Emadi-Gandaghi a , M. Arab Chamjangali a a b

Faculty of Chemistry, Shahrood University of Technology, Shahrood, Iran Faculty of Mathematics, Shahrood University of Technology, Shahrood, Iran

a r t i c l e

i n f o

Article history: Received 24 September 2013 Received in revised form 15 January 2014 Accepted 17 January 2014 Available online 30 January 2014 Keywords: Random forest (RF) Artificial neural network (ANN) Quantitative structure–retention relationship (QSRR) Polycyclic aromatic hydrocarbons (PAHs)

a b s t r a c t In this work, a quantitative structure–retention relationship (QSRR) investigation was carried out based on the new method of random forests (RF) for prediction of the retention indices (RIs) of some polycyclic aromatic hydrocarbon (PAH) compounds. The RIs of these compounds were calculated using the theoretical descriptors generated from their molecular structures. Effects of the important parameters affecting the ability of the RF prediction power such as the number of trees (nt ) and the number of randomly selected variables to split each node (m) were investigated. Optimization of these parameters showed that in the point m = 70, nt = 460, the RF method can give the best results. Also, performance of the RF model was compared with that of the artificial neural network (ANN) and multiple linear regression (MLR) techniques. The results obtained show the relative superiority of the RF method over the MLR and ANN ones. © 2014 Elsevier B.V. All rights reserved.

1. Introduction Polycyclic aromatic hydrocarbons (PAHs) are defined to be composed of two or more fused aromatic rings containing only carbon and hydrogen atoms. They are formed mainly by the incomplete combustion of fossil fuels and coal, petrochemical cracking processing and the degradation of lubricating oils and dyes [1]. Also, they are released from burning oil, gasoline, trash, tobacco, wood or other organic substances such as charcoal-broiled meat. They can occur naturally when they are released from forest fires and volcanoes, and can also be manufactured. Other activities that release PAHs include driving, agricultural burning, roofing or working with coal tar products, coating pipes, steel making, and paving with asphalt. Currently, more than 200 types of PAHs have been found, some of which are highly carcinogenic. PAH contamination of food results not only from cooking processes such as smoking, grilling, baking and frying but also from the air through deposits, soil through transformation and water through deposition and transfer [2,3]. When ingested, PAHs can be absorbed by the gastrointestinal tract and distributed throughout the body, resulting in acute or chronic injury [4,5]. Animal tests have shown that longterm intake of PAHs can cause cancer and immature egg cell death [6]. Exposure to large amounts of coal tar creosote may result in

∗ Corresponding author. Tel.: +98 273339544. E-mail addresses: [email protected], [email protected] (N. Goudarzi). 0021-9673/$ – see front matter © 2014 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.chroma.2014.01.048

convulsions, unconsciousness, and even death. Breathing vapors of coal tar, coal tar pitch, and creosote can irritate the respiratory tract. Eating large amounts of herbal supplements that contain creosote leaves may cause liver damage [7]. PAHs affect organisms through various toxic actions. The mechanism of their toxicity is considered to be interferenced by the function of cellular membranes as well as the enzyme systems that are associated with the membrane. Identification and determination of PAHs are very important for investigation of the environmental pollution level [8]. High-resolution gas chromatography is a commonly applied method for the analysis of PAHs in various matrices. Correlations of chromatographic retention with the physico-chemical and structural characteristics of substances are the basis for the choice of appropriate chromatographic systems, and are of great significance for solving problems involving identification of the components of complex mixtures. In some cases, the highly hazardous properties or unavailability of the PAHs may limit a chemist’s access to a pure standard for comparison to unknowns. Quantitative structure–retention relationship (QSRR) On the other hand, it is hard work to determine experimentally the retention indices (RIs) for all compounds of a specific class. Therefore, alternative methods have overcome during the last years to improve the performance of substitute approaches to obtain theoretical RIs. QSRR (quantitative structure–retention relationship) represents one of the most successful tools, so that the principal aim of this work is to predict the RI data from a set of molecular properties. This technique correlates the variation of a dependent variable (RI) to the variations of a set of descriptors, and the

26

N. Goudarzi et al. / J. Chromatogr. A 1333 (2014) 25–31

purpose is to achieve predictive and explanatory elements for a group of structurally analog compounds. It can be noted that RI is one of the parameters that is used as a criterion for the identification and possible separation of compounds. Thus we can predict the possibility of separation and identification by calculating the RI values for an unknown, unmeasured or even unsynthesized compound without performing any experiment. It is applicable for each compound in the presence of other compounds with the specific experimental conditions such as types of the GC column and detector, temperature, etc. Also, we can access this relation between the activity of a compound and the molecular structure properties. Quantitative structure–property/activity relationship (QSPR/QSAR) analysis is a powerful method for relating the physically measurable properties/activities to structurally related quantities of any compound. Recently, Drosos and co-workers have developed a QSRR study for some PAHs using quantum mechanics and other sources of descriptors estimated by different approaches. The B3LYP/6-31G* level of theory was used for geometrical optimization and quantum mechanics related variables. A good linear relationship was found between gas-chromatographic RI and electronic or topologic descriptors by stepwise linear regression analysis [9]. A molecular electronegativity–distance vector (MEDV) has been proposed to describe the structure of PAHs and relate their RIs for the programmed temperature SE-52 capillary-column gas chromatography by Liu and co-workers [10]. They applied multiple linear regression (MLR) in combination with the cross-validation (CV) technique, and a four-parameter QSRR of 209 PAHs was developed with the correlation coefficient (R) of 0.9812 and the root mean square error (RMS) of 15.533 between the estimated and experimental RIs. In 2001, Ferreira used partial least-squares (PLS) regression and electronic, geometric, and topological descriptors to predict some physiochemical properties of 48 PAHs such as boiling temperature, RI, n-octanol/water partition coefficient, and solubility in water [11]. Also, principal component regression (PCR) and PLS with leave-one-out cross-validation were used for building the regression models to predict the octanol–water partition coefficient and retention time indices of some PAH compounds [12]. In 2007, Héberger established QSPR models for prediction of GC RI on polar and non-polar stationary phases [13]. On the other hand, QSPR techniques were used for calculation of retention time of peptides [14], polychlorinated biphenyls (PCBs) [15], and polybrominated diphenylethers (PBDEs) [16]. The other properties such as electrophoretic mobility [17], soil sorption coefficient [18], flash point [19], octanol-water partition coefficient [20–22], impact sensitivities [23], normal boiling points [24], elution conductivity [25], retention factor [26], water solubility [27], and bioconcentration factor [28] were modeled using different QSPR techniques. QSPR analysis consists of mathematical equations relating the chemical structure to a wide variety of physical, chemical, biological, and technological properties. QSPR models, once established, can be used to predict the properties of compounds as yet unmeasured or even unknown. One of the novel and powerful tools used to calculate the rank of variables and predict the property–activity simultaneously is called random forests (RF). The current study is devoted to explore the application of the RF algorithm in order to predict the RIs of some PAHs. Also, to evaluate the modeling power of the RF model, the results obtained were compared with those obtained using the ANN and MLR techniques.

2. Materials and method 2.1. Data The experimental RIs of some PAH compounds were taken from Ref. 29. Analyses of PAHs were performed on the Agilent

6890GC-5973MSD. The GC separation was achieved using an HP5MS capillary column (30 m × 0.25 mm I.D., 0.25 ␮m film thickness) with a GC oven temperature program 60 ◦ C for 2 min, heated to 258 ◦ C at 6 ◦ C/min, then to 300 ◦ C at 2 ◦ C/min and hold for 4 min at 300 ◦ C. Samples were injected in splitless mode with the injector temperature at 280 ◦ C with helium as carrier gas at constant flow rate (1 mL/min). The mass spectrometer was operated in both the scan and secondary ion mass spectrometry (SIMS) mode for compound identification and quantitation, respectively. The QSPR model for estimation of the RIs of these compounds was established according to the following steps: the molecular structure input and the generated files containing the chemical structures were stored in a computer–readable format; the quantum mechanics geometry was optimized using the semi-empirical AM1 method; the structural descriptors were computed; these structural descriptors were selected; and the structure-RI model was generated by means of the MLR, ANN, and RF techniques. The names of the PAH compounds and their experimental and predicted RI values are shown in Table 1 . The dataset was split randomly into a training set (70%) and a test or prediction set (30%). A prediction set of 25 compounds was selected randomly from the original 83 PAH compounds, with the remaining ones constituting the training set. The molecules included in the training set with the RI values in the range of 200.00–559.90 were used to adjust the parameters of the model, and 25 compounds with the RI values in the range of 253.56–547.71 were used to evaluate the prediction ability.

2.2. Descriptor generation and screening A major step in constructing the QSPR model is to find a set of molecular descriptors that represents variation in the structural properties of the molecules. The RI values of solutes are related to some of their structural, electronic, and geometric properties. The values for these molecular features can be encoded quantitatively by numerical values named molecular descriptors. These molecular parameters can be used to search for the best QSPR model for the RIs. The process of calculation of the molecular descriptors carried out was as follows: the two-dimensional structures of the molecules were drawn using the Hyperchem 7.5 software [30]. The final geometries were obtained using the semi-empirical AM1 method. The molecular structures were optimized using the Polak-Ribiere algorithm until the root mean square gradient was 0.001 kcal mol−1 . The resulting geometry was transferred into the Dragon software package for calculation of 18 classes of descriptors, developed by the Milano Chemometrics and QSPR Group [31,32]. 361 descriptors were removed because they included zero or other constant/near constant values, and did not have enough structural information. Also, only one of any pair of variables with a correlation coefficient greater than 0.90 was considered in developing the model. Finally, 193 descriptors were retained for the variable selection and construction of the model.

2.3. Random forests (RF) model The fundamental of RF is based upon the method of classification and regression trees (CART), in which the feature space is split into disjoint cuboids (nodes), and then the response is approximated by a simple model like averaging the data in each region. The splitting algorithm is hierarchical and designed in a binary fashion, so that in each step it determines the splitting variable and the split-point along that variable. In each step of the algorithm (each node of a tree), the best splitting variable and the best input data

N. Goudarzi et al. / J. Chromatogr. A 1333 (2014) 25–31

27

Table 1 Dataset and corresponding observed and predicted values for the RIs. Name

Exp.

Cal. (RF)

Cal. (SR-ANN)

Cal. (RF-ANN)

Cal. (MLR)

Naphthalene 2-Methyl naphthalene 1-Methyl naphthalene Biphenyl Acenahthylene Acenaphthenea Di benzo furan Flourene 2-Methyl flourene 1-Methyl flourene 9-Methyl flourenea Di benzo thiophene Phenanthrene Anthracene Naphtha[2,3-b]thiophene Carbazole 1H-indene,1-phenyl 1H-Indene,2-phenyla 2-Methyl phenantherene 3-Methyl phenanthrenea 2-Methyl anthracene 9-Methyl phenantherene 1-Methyl phenanthrenea Cyclopentha[def]phenantheren 2-Methyl carazol Fluoranthenea Acephenanthrylenea Phenanthro[4-5bcd]thiophene Pyrene Benzo[b]naphtha[2,3-d]furan Phenyl methyl naphthalenea p-Trephenyl Benzo[b]naphtha[1,2-d]furana 2-Methyl fluoranthan Benzo[a]flourenea 4-Methyl pyrene 1-Methyl pyren Benzo[ghi]flouranthenea Benzo[c]phenanthrenea Cyclopenta[cd]pyrenea Benzo[b]naphtha[2,3-d]thiophe Methylbenzo[a]anthracene Benzo[b]naphtha[2,3-d]thiophe Benzo[a]Anthracenea Chrycene 1,1-Binaphthalene 1,2-Binaphthalene 9-Phenyl anthracene Methylbenzo[b]naphtha[2,3-d]thiophenea Methylbenzo[a]anthracenea 2,2Binaphthalenea 2-Phenylphenanthrene Benzo[b]fluorescence Benzo[c]acephenanthrylene Benzo[c]pyrenea Dibenzo[a,h]flourane Perylene Phenylpyrene Dinaphthothiophene Benzo[b]trihenylenea 1-Phenyl pyrene Indeno[1,2,3-cd]flpurancene Indeno[7,1,2,3-cdef]chrycene Indeno[1,2,3-cd]pyrene Dibenzo[ah]anthracenea Pentaphene Benzo[b]chrycene Picene Benzo[ghi]perylenea Benzo[def,mno]chrysene Dibenzo[be]fluoranthenea Naphtha[1,2-k]fluoranthenea Dibenzo[b,k]fluranthenea Dibenzo[a,e]flouranthene Naphtha[2,3,k]fluoranthene

200.00 221.10 224.12 236.56 248.27 253.56 259.63 270.39 288.28 289.24 292.28 295.95 300.00 301.53 304.28 309.80 311.12 314.15 318.83 319.67 321.12 322.74 323.54 325.73 326.94 344.96 348.14 349.42 352.41 353.47 356.04 357.65 357.86 362.82 366.28 368.75 373.55 390.18 390.57 391.10 391.73 392.70 394.76 398.36 400.00 402.77 404.28 406.47 409.54 410.27 421.55 424.19 443.93 448.07 452.93 456.60 457.98 459.01 481.96 483.27 486.34 488.38 491.53 494.91 495.89 496.67 498.86 500.00 501.64 505.97 533.22 536.92 539.59 540.71 543.06

216.18 233.15 232.29 244.20 249.28 351.74 266.61 280.16 295.73 293.67 301.14 298.68 298.68 304.00 286.79 300.19 315.08 313.60 322.40 324.35 324.71 321.55 322.80 328.97 315.99 352.95 353.07 347.68 352.62 363.47 363.41 366.54 366.02 343.17 373.69 367.93 368.00 386.39 387.43 389.89 388.65 404.31 387.11 392.95 395.22 408.30 408.15 411.20 406.81 404.31 411.84 416.95 451.02 450.69 455.30 468.43 450.46 470.78 471.76 485.49 482.95 483.89 492.08 487.95 487.54 501.63 490.72 495.36 491.74 491.41 531.99 533.79 541.43 535.93 535.75

202.82 217.16 224.13 238.31 244.29 253.38 258.01 272.73 288.80 292.64 299.61 300.56 300.56 29928 326.99 310.98 307.17 306.44 323.65 323.26 316.19 329.47 327.77 327.97 324.15 353.19 348.21 351.71 352.55 363.41 389.72 355.90 362.20 357.70 364.50 371.36 370.78 389.97 392.38 391.48 391.96 405.22 388.61 395.19 394.70 404.26 407.48 404.14 399.96 405.20 414.99 427.06 444.01 445.72 452.03 460.86 458.67 462.43 482.51 503.64 482.64 494.71 489.83 492.01 494.16 494.30 495.59 500.15 512.14 501.06 569.47 539.78 543.07 540.74 533.93

218.79 234.15 234.17 249.73 249.98 249.46 278.49 275.19 292.76 292.79 291.93 295.88 295.88 294.03 381.59 309.68 310.55 310.80 311.87 311.21 311.99 320.37 311.47 320.37 327.54 340.04 339.66 342.88 341.72 367.67 350.76 371.77 368.25 352.95 360.15 361.63 361.33 390.24 381.05 390.62 386.26 402.84 387.42 381.84 381.15 427.42 427.45 428.10 409.64 402.84 427.68 425.48 436.69 436.45 438.25 462.03 435.28 487.09 493.35 485.22 484.35 494.14 494.73 496.25 486.99 485.99 486.23 485.15 495.38 496.08 580.43 546.12 547.91 544.60 544.14

211.82 224.49 230.96 241.28 247.41 255.33 262.10 274.84 287.73 290.58 296.65 297.10 297.10 298.73 320.20 309.87 301.13 300.87 319.42 319.11 312.77 324.92 323.49 324.43 320.74 350.13 344.22 346.80 349.87 361.56 332.06 354.73 359.65 354.70 371.28 369.64 368.97 393.51 396.28 400.58 394.81 412.16 390.22 401.40 403.35 408.42 412.24 408.58 405.55 412.13 419.47 429.38 448.19 450.88 455.78 462.35 460.68 461.26 480.85 499.41 477.80 493.25 486.27 491.70 492.87 491.60 493.65 497.59 509.14 499.74 569.30 537.46 540.15 538.00 531.59

28

N. Goudarzi et al. / J. Chromatogr. A 1333 (2014) 25–31

Table 1 (Continued) Name a

Naphtha[2,3,e]pyrene Coronene Naphtha[2,3-a]pyrene Dibenzo[a,e]pyrene Dibenzo[a.l]pyrene Dibenzo[a,i]pyrene Dibenzo[a,h]pyrene a

Exp.

Cal. (RF)

547.71 550.43 551.10 551.65 552.82 556.32 559.90

548.67 543.73 546.08 548.60 547.96 550.91 553.44

Cal. (SR-ANN) 549.76 549.86 557.36 558.15 555.31 553.61 555.58

Cal. (RF-ANN)

Cal. (MLR)

549.74 553.66 446.49 551.02 546.68 550.55 550.55

546.44 548.43 554.37 557.00 554.43 551.11 554.25

Test set.

s (split point) is determined by solving the following minimization problem:



arg min

j ∈ {1,2,...,p},s





¯ js )2 + (yi − y1

xi ∈ R1js





¯ js )2 ⎦ (yi − y2

xi ∈ R2js

where R1js = {X|Xj < s} and R2js = {X|Xj > s} are the two cuboids defined by splitting the current space from the point s in the axis ¯ js , y2 ¯ js are the corresponding mean value of response (Y) Xj , and y1 in these regions. This process is recursively applied to the created sub-regions until a maximal tree is created. Since a very large tree might overfit the new data, a pruning algorithm is employed to select an optimal tree structure. Finally, the prediction of response at untried point xo is calculated as the mean value of response values in the region where xo is located on. The Random forests (RF) model, which was introduced by Breiman [33], is a collection of many unpruned regression trees. Each tree is grown on its own learning data set that is a bootstrap replicate of the training data. In each bootstrap sample, that is a random sample of the original data with replacement and with the same length, some of the data is repeated, and the left out samples are called “out-of-bag (OOB)”. To construct a tree, at each node, a small random subset of size m of p descriptors or explanatory variables (m < p) is chosen, and the best split of the feature space is done using only this subset instead of using all variables. The value of m is held constant during the forest growing. Each tree is grown to the largest extent possible. Then predictions are made by averaging all the tree outputs. Since the OOB data are not used in the construction of each tree, therefore, it is regarded as a test set to get a running unbiased estimate of the prediction error and variable importance as trees are added to the forest. The OOB error estimate is an unbiased estimate of the generalization error, and this is almost identical to that obtained by N-fold cross-validation [34]. In practice, the number of trees (ntree) and the size of the variable subset (m) should be optimized to reach the ideal forest by minimizing the OOB error. The RF method has also the potential for variable selection by quantifying the variable importance. The importance of each variable is estimated by looking at how much the prediction error increases when the OOB data for that variable is permuted randomly, while all the other variables are left unchanged. The number of trees (ntree) and the size of the variable subset (mtry) in the RF modeling are the two key user-defined parameters that should be optimized to reach the ideal forest. The RF modeling can be implemented by using an R package that is based upon the seminal contribution of Breiman and Cutler30, described in Liaw and Wiener [35].

parameter setting is visualized in Fig. 1. As it can be seen in this figure, a broad range of parameter setting, especially those whose values are located in the first quartile of the plane, provide the least RMSE. Although when ntree > 100 and m > [p/3] = 65 (suggested by Breiman), there is no significant improvement on the model fit by increasing ntree and m, we used ntree = 460 and m = 70 to construct the final model. In this condition, the performance of the RFs model on the test set was evaluated by the scatterplot of calculated versus experimental RI values (Fig. 2). The high values for R2 (0.9940 and 0.9953 for the training and test sets, respectively) imply that the RF model could perfectly predict the RIs for these compounds. The achievements are also shown in Fig. 3, where the predicted errors are plotted against the experimental values. The scatter of residuals at both sides of the zero line indicates that no systematic error exists in the development of the RF model. As mentioned earlier, the importance of each variable depends upon all the OOB samples generated during the construction of the

Fig. 1. The OOB errors calculated during the construction of the RFs model for different values of m and nt .

3. Results and discussion The RF method was used to predict the RIs of some PAHs. The values of the parameters ntree and m were optimized simultaneously by using the grid-search method ranging from 20 to 700 (with step size 10) and from 20 to 180 (with step size 5), respectively. The root mean squared error (RMSE) of the OOB data based on each

Fig. 2. Scatter charts of predicted values vs. experimental values for the training set and the test set when nt = 460 and m = 70.

N. Goudarzi et al. / J. Chromatogr. A 1333 (2014) 25–31

29

Fig. 3. Prediction errors obtained for the training and test sets when nt = 460 and m = 70.

RF model, and as rerunning of the RF algorithm gives different OOB sets, the score of importance of each variable is changed in each run. Thus the RF algorithm was run 100 times for the selected setting nt = 460, m = 70, and the mean of the importance of all 193 variables was recorded. The obtained results showed that 188 variables had very low scores of importance near zero, so they can be removed, and five variables had high scores, namely gravitational index G2 -bond-restricted (G2 ), conventional bond order ID number (PilD), superpendentic index (SPI), hydrophilic factor (Hy), and lowest Eigen value n.8 of Burden matrix weighted by the atomic Sanderson electronegativities (BELe8). The 5 important variables, with their relative importance and ranks, are listed in Table 2. For comparison of the results obtained by the RF model with those obtained by other methods, the MLR model and a backpropagation ANN model were used to create the models for the prediction of RIs. Firstly, the stepwise variable selection procedure was employed to select the most relevant descriptors for fitting the MLR and ANN. The selected variables are listed in Table 3. A linear relationship was obtained between the RIs and calculated descriptors through the MLR analysis. The MLR model coefficients were calculated using the training data, and then used to predict the RIs of validation samples. To build an adequate linear model, the predictability powers of the first 7 models obtained from the stepwise regression were investigated. To do so, the MLR model was fitted to the training data by using the first two descriptors and then the RIs of the validation set were predicted. This process was continued until all the 7 descriptors were entered to the MLR model one by one. The obtained results implied that a regression model with 5 descriptors had the least MSE value, and so, this model was

Table 2 Important variables for prediction of RIs according to the RF modeling over 100 replicates. Variable symbol

Importance (mean)

Relative importance

Rank

G2 piID SPI Hy BELe8

20.98 13.81 8.74 8.58 8.24

1.00 0.66 0.42 0.41 0.39

1 2 3 4 5

Fig. 4. Plot of the calculated RI values against the experimental ones for the test set by the MLR and ANN models.

selected as the best MLR model: RI = 38.274G2 − 5.827RWW + 60.89GATS1e − 34.115GATS3e − 21.356Mor15e − 2.174RDF045p − 27.4 To construct an ANN model, a 3-layer feed-forward network with one hidden layer and a back-propagation pattern was used. Two different ANN models were provided; in the first one (SRANN or ANN1), the inputs were the subset of descriptors selected by the MLR model, whereas the distinguished most important descriptors by the RF model were considered as the input of the second ANN model (RF-ANN or ANN2). To have a network with a high predictability power, five parameters including (i) number of descriptors (between 2 and 7); (ii) number of nodes (neurons) in the hidden layer; (iii) type of transfer function (including logsigmoid and tansigmoid); (iv) training function including Bayesian regulation (trainbr) and Levenberg-Marquardt (trainlm)); and finally (v) number of epochs were optimized. According to the results obtained, for the SR-ANN, a network with a Levenberg-Marquardt training function, logsigmoid transfer function with 6 descriptors, two nodes in the hidden layer and the number of 40 has the least MSE value. Also for the optimized RF-ANN, Levenberg-Marquardt as training function, tansigmoid as transfer function with 4 descriptors, two nodes in the hidden layer and the number of 36 were selected. It should be noted that the training of the network for the prediction of RIs was interrupted when the MSE of the validation set started to increase, to avoid overfitting. Fig. 4 shows a plot of predicted RIs against the experimental values by using the ANN and MLR models for the molecules included in the test sets. In order to evaluate the predictive ability of the presented models, and to compare them, we employed the determination coefficient (R2 ), standard error of

Table 3 The specification of selected variables by the MLR method. Number

Class

Symbol

Name

1 2 3 4 5 6 7

Geometrical Topological 2D autocorrelation 2D autocorrelation 3D MoRSE RDF BCUT

G2 RWW GATS1e GATS3e Mor15m RDF045p BELe8

Gravitational index G2 (bond-restricted) Reciprocal hyper-detour index Geary autocorrelation-lag2/weighted by atomic Sanderson electro negativities Geary autocorrelation-lag3/weighted by atomic Sanderson electro negativities 3D MoRSE-signal 15-weighted by atomic Sanderson electro negativities Radial distribution function-4.5/weighted by atomic polarization Lowest eigenvalue n.8 of burden matrix weighted by atomic Sanderson electro negativities

30

N. Goudarzi et al. / J. Chromatogr. A 1333 (2014) 25–31

Table 4 Statistical parameters for comparing models. MLR

RF-ANN

SR-ANN

RF

Parameters

0.9844 10.2929 6.2928 2.5279 105.9434 1.5412

0.9911 7.7961 6.4937 1.9147 60.7796 1.6314

0.9799 11.6790 7.0422 2.8683 136.3985 1.7190

0.9961 5.4941 4.5788 1.3493 30.1847 1.1824

Determination coefficient (R2 ) Standard error prediction (SEP) Mean absolute error (MAE) Relative error of prediction (REP) Mran square error (MSE) Mean relative error (MRE)

prediction (SEP), mean absolute error (MAE), relative error prediction (REP), mean squared error (MSE), and mean relative error (MRE). These statistical parameters for MLR and ANNs and RF models are shown in Table 4. The most important descriptor selected by the RF model is G2 . This descriptor is a gravitational index from the geometrical descriptors class. Geometrical descriptors reflecting the mass distribution in a molecule, defined as other popular molecular steric descriptors are listed below: G1 =

A−1 A   mi mj i=1 j=i+1

G2 =

(1)

rij

 B   mi mj b=1

(2)

rij b

where mi and mj are the atomic masses of the considered atoms, rij >is the corresponding interatomic distances, >and A and B are the numbers of atoms and bonds in the molecule, respectively. The G1 >index takes into account all atom pairs in the molecule, while the G2 >index is restricted to the pairs of bonded atoms. These indices are related to the bulk cohesiveness of the molecules, accounting, simultaneously, for both atomic masses (volumes) and their distribution within the molecular space. Both indices can be extended to any other atomic property different from the atomic mass such as atomic polarizability and atomic Van der Waals volume. Thus by increasing the number of atoms in a molecule, according to Eq. (2), G2 is enhanced, and by increasing G2 , RI is enhanced too. The second important descriptor is PiID, whose name is the superpendentic index, and is classified as a topological descriptor. It is noteworthy that this molecular descriptor was derived from the H-depleted molecular graph. Also, it was calculated from the pendent matrix, which is a submatrix of the distance matrix D with A rows and a number m of columns corresponding to the number of terminal vertices (atoms). The superpendentic index was calculated as the square root of the sum of the products of the non-zero row elements (the topological distance d) in the pendent matrix:

p=

 A 

m

1/2 dim

(3)

i=1

where m is the number of terminal vertices, i.e. columns of the pendent matrix. By increasing the number of atoms in molecules, PiID and RI increase. Hy (hydrophilic factor) is another important descriptor. It is an empirical descriptor. This class of the empirical descriptors is a fuzzy one, not well-defined. In principle, empirical descriptors are those that are not defined on the basis of a general theory such as quantum chemistry or graph theory. Rather, they are defined by practical rules derived from chemical experience, e.g. considering specific or local structural factors present in the molecules, often sets of congeneric compounds. As a consequence, in most cases, empirical descriptors represent limited subsets of compounds, and cannot be extended to classes of compounds different from those for which they were defined. Empirical descriptors should not be confused with experimentally

derived ones, although they are well-known, and several of them are empirically derived. Hy has an inverse relation with RI. By increasing the hydrophilicity, compounds have a low tendency to the non-polar stationary phase. Thus their RIs decrease. The last important descriptor is BELe8, from the BCUT descriptors. The BCUT (Burden–CAS–University of Texas Eigen values) descriptors are the Eigen values of a modified connectivity matrix known as the Burden matrix. The diagonal elements of the Burden matrix Mare the weights wi or atom Ai , where these weights may be some properties associated with the atoms such as m (relative atomic mass), p (polarizability), e (Sanderson electronegativity), and v (Van der Waals volume). The non-diagonal elements Mwk are 1, if k = dij and 0, otherwise, where k is the lag defined as the topological distance d between the atom pair i–j, and may have a value between 0 and 8. BELwk is the lowest negative Eigen value of the matrix Mwk , and zero if there are no negative Eigen values. Thus by increasing this descriptor value, RI increases [32]. 4. Conclusion Random forests (RF), ANN, and MLR models are used for prediction of the retention indices (RIs) of polycyclic aromatic hydrocarbon (PAH) compounds. The results obtained shows that the RF model can predict RIs with better accuracy over the MLR and ANN models. RF model is not needed to make a variable selection in advance, and does not suffer from the overfitting problem. These are advantages of the RF model over the ANN and MLR ones. These properties can highlight the RF method among the others when the bulk of variables have a high dimension. References [1] IARC, IARC Monographs on the Evaluation of the Carcinogenic Risk of Chemicals to Humans, International Agency for Research on Cancer, Lyon, 1983. [2] D.J. Fan, J. Peng, J. China Cuis. Res. 13 (2000) 24. [3] S.A.V. Tfouni, M.C.F. Toledo, Food Control 18 (2007) 948. [4] G.S. Zhang, Q. Zhao, X.X. Yue, J. China Metrol. 24 (2006) 222. [5] N. Aguinaga, N. Campillo, P. Vinas, h.C. Manuel, Anal. Chim. Acta 596 (2007) 285. [6] A.N. Neilson, PAHs and Related Compounds: Chemistry (The Handbook of Environmental Chemistry/Anthropogenic Compounds), Springer, 2010. [7] B. Skrbic , N. Miljevic , J. Environ. Sci. Health A 37 (2002) 1029. [8] H.Y. Xu, J.W. Zou, G.X. Jiang, Q.S. Yu, J. Chromatogr. A 1198/1199 (2008) 202. [9] J.C. Drosos, M. Viola-Rhenals, R. Vivas-Reyes, J. Chromatogr. A 1217 (2010) 4411. [10] S. Liu, C. Yin, S. Cai, Z. Li, Chemometr. Intell. Lab. 61 (2002) 3. [11] M.M.C. Ferreira, Chemosphere 44 (2001) 125. [12] F.A.L. Ribeiro, M.M.C. Ferreira, J. Mol. Struct.: Theochem. 663 (2003) 109. [13] K. Héberger, J. Chromatogr. A 1158 (2007) 273. [14] Z. Dashtbozorgi, H. Golmohammadi, E. Konoz, Microchem. J. 106 (2013) 51. [15] J.J. Marrugo, J.C. Drosos, C. Gueto-Tettay, J. Anaya-Gil, L. Rincón, R. Vivas-Reyes, Chromatographia 76 (2013) 837. [16] N. Goudarzi, D. Shahsavani, Anal. Methods 4 (2012) 3733. [17] C. Xue, H. Liu, X. Yao, M. Liu, Z. Hu, B. Fan, J. Chromatogr. A 1048 (2004) 233. [18] N. Goudarzi, M. Goodarzi, M.C.U. Araujo, R.K.H. Galvao, J. Agric. Food Chem. 57 (2009) 7153. [19] A. Khajeh, H. Modarress, J. Hazard. Mater. 179 (2010) 715. [20] N. Goudarzi, M. Goodarzi, Mol. Phys. 106 (2008) 2525. [21] J.M. Pallicer, C. Calvet, A. Port, M. Rosés, C. Ràfols, E. Bosch, J. Chromatogr. A 1240 (2012) 113. [22] N. Goudarzi, M. Goodarzi, Anal. Methods 2 (2010) 758. [23] V. Prana, G. Fayet, P. Rotureau, C. Adamo, J. Hazard. Mater. 235 (2012) 169. [24] Y.M. Dai, Z. Zhu, Z. Cao, Y. Zhang, J. Zeng, X. Li, J. Mol. Grapg. Model. 44 (2013) 113.

N. Goudarzi et al. / J. Chromatogr. A 1333 (2014) 25–31 [25] T. Yang, C.M. Breneman, S.M. Cramer, J. Chromatogr. A 1175 (2007) 96. [26] H. Golmohammadi, M.H. Fatemi, Electrophoresis 26 (2005) 3438. [27] A.A. Toropov, A.P. Toropova, E. Benfenati, G. Gini, D. Leszczynska, J. Leszczynski, Chemosphere 90 (2013) 877. [28] E.B. De Melo, Ecotox. Environ. Saf. 75 (2012) 213. [29] Z. Wang, K. Li, P. Lambert, Ch. Yan, J. Chromatgr. A 1139 (2007) 14. [30] HYPERCHEM 7.5 (Hypercube), http://www.hyper.com

31

[31] R. Todeschini, Milano Chemometrics and QSPR Group, http://michem.disat. unimib.it/chm/ [32] R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, Wiley VCH, Weinheim, Germany, 2000. [33] L. Breiman, Mach. Learn. 45 (2001) 5. [34] T. Hastie, R. Tibshirani, J. Friedmanm, The Elements of Statistical Learning: Data Mining, Inferences and Prediction, Springer, New York, 2009. [35] A. Liaw, M. Wiener, R. News 2 (2002) 18.

Application of random forests method to predict the retention indices of some polycyclic aromatic hydrocarbons.

In this work, a quantitative structure-retention relationship (QSRR) investigation was carried out based on the new method of random forests (RF) for ...
844KB Sizes 1 Downloads 0 Views