Chemosphere 119 (2015) 438–444

Contents lists available at ScienceDirect

Chemosphere journal homepage: www.elsevier.com/locate/chemosphere

In silico model for predicting soil organic carbon normalized sorption coefficient (KOC) of organic chemicals Ya Wang, Jingwen Chen, Xianhai Yang, Felichesmi Lyakurwa, Xuehua Li ⇑, Xianliang Qiao Key Laboratory of Industrial Ecology and Environmental Engineering (MOE), School of Environmental Science and Technology, Dalian University of Technology, Linggong Road 2, Dalian 116024, China

h i g h l i g h t s  A QSAR model was developed for predicting log KOC of organic compounds.  The descriptor MLOGP2 explained 66.5% variance of the log KOC.  The model covers 31 different chemical classes and has a broad applicability domain.  The model is more suitable to predict log KOC for PBDEs, PFCs and heterocyclic toxins.

a r t i c l e

i n f o

Article history: Received 14 March 2014 Received in revised form 4 July 2014 Accepted 6 July 2014 Available online 31 July 2014 Handling Editor: Klaus Kümmerer Keywords: Soil organic carbon normalized sorption coefficient (KOC) Quantitative structure–activity relationship (QSAR) Multiple linear regression (MLR) Predictive ability

a b s t r a c t As a kind of in silico method, the methodology of quantitative structure–activity relationship (QSAR) has been shown to be an efficient way to predict soil organic carbon normalized sorption coefficients (KOC) values. In the present study, a total of 824 log KOC values were used to develop and validate a QSAR model for predicting KOC values. The model statistics parameters, adjusted determination coefficient (R2adj) of 0.854, the root mean square error (RMSE) of 0.472, the leave-one-out cross-validation squared correlation coefficient (Q2LOO) of 0.850, the external validation coefficient Q2ext of 0.761 and the RMSEext of 0.558 were obtained, which indicate satisfactory goodness of fit, robustness and predictive ability. The squared Moriguchi octanol–water partition coefficient (MLOGP2) explained 66.5% of the log KOC variance. The applicability domain of the current model has been extended to emerging pollutants like polybrominated diphenyl ethers, perfluorochemicals and heterocyclic toxins. The developed model can be used to predict the compounds with various functional groups including ˜C@C—, AC„CA, AOH, AOA, ACHO, ˜C@O, AC@O(O), ACOOH, AC6H5, ANO2, ANH2, ANHA, ˜NA, ANANA, ANHAC(O)ANHA, AOAC(O)ANH2, AC(O)ANH2, AX(F, Cl, Br, I), ASA, ASH, AS(O)2A, AOS(O)2A, ANHAS(O)2A, (SR)2PH(OR)2 and ˜Si—. Ó 2014 Elsevier Ltd. All rights reserved.

1. Introduction To date, more than 140 000 synthetic chemicals that have been preregistered by European REACH (Registration, Evaluation, and Authorization of Chemicals) regulation are used commercially (EC, 2010; Rudén and Hansson, 2010). Ecological risk assessment and management are necessary measures to prevent and control the possible pollution of these chemicals. The sorption of organic chemicals to soils/sediments plays a substantial role in determining their environmental transport and fate. The soil organic carbon normalized sorption coefficient (KOC) is an important behavioral parameter for assessing the ecological risk of organic chemicals

⇑ Corresponding author. E-mail address: [email protected] (X. Li). http://dx.doi.org/10.1016/j.chemosphere.2014.07.007 0045-6535/Ó 2014 Elsevier Ltd. All rights reserved.

(Gramatica et al., 2007; Wang et al., 2009; Phillips et al., 2011; Wen et al., 2012; Shao et al., 2014). The KOC values for organic chemicals are normally measured by the standard testing guidelines, e.g. the Organization for Economic Cooperation and Development (OECD) guidelines for estimation of KOC using HPLC and the batch equilibrium method (OECD, 2000, 2001). However, the experimental determination of the KOC is usually laborious, time-consuming and expensive (Huuskonen, 2003; Doucette, 2003; Phillips et al., 2011). Considering the large and ever-increasing number of chemicals entering into the market each year, it becomes necessary to develop methods that can provide KOC values highly-efficiently. Previous studies have shown that the methodology of quantitative structure–activity relationship (QSAR), a kind of in silico method, can predict KOC values efficiently. There have been many QSAR models published for KOC (e.g. Sabljic, 1984, 1987; Meylan and Howard, 1992; Poole and Poole,

439

Y. Wang et al. / Chemosphere 119 (2015) 438–444

1999; Tao et al., 1999, 2001; Huuskonen, 2003; Schüürmann et al., 2006; Gramatica et al., 2007; Wang et al., 2009; Goudarzi et al., 2009; Wen et al., 2012; dos Reis et al., 2014; Shao et al., 2014). Sabljic et al. (1995), Gawlik et al. (1997) and Doucette (2003) reviewed the existing models for predicting KOC values. Among the QSAR models, some predict KOC values only from the octanol/ water partition coefficients (log KOW) or water solubilities (log SW) of organic compounds (Sabljic et al., 1995; Gawlik et al., 1997; Doucette, 2003). These models are usually only applicable to compounds with a specific log KOW range; let alone for many organics, their log KOW and/or log SW values are unavailable. In 2007, the OECD issued guidelines on the development and validation of QSAR models (OECD, 2007). According to the OECD guideline, a QSAR model should be associated with the following information: (i) a defined endpoint; (ii) an unambiguous algorithm; (iii) a defined domain of applicability; (iv) appropriate measures of goodness-of-fit, robustness, and predictive ability; and (v) a mechanistic interpretation, if possible. In retrospect to the previous QSAR models on KOC prediction (Sabljic et al., 1995; Gawlik et al., 1997; Baker et al., 2000; Doucette, 2003; Huuskonen, 2003; Nguyen et al., 2005; González et al., 2006), we can find that many developed models before 2007 fail to meet one or more above principles. Thereafter, several new QSAR models for predicting KOC values have been developed (Gramatica et al., 2007; Wang et al., 2009; Goudarzi et al., 2009; Wen et al., 2012; dos Reis et al., 2014; Shao et al., 2014). These established models all performed external validation according to the OECD guidelines, and they were constructed based on heterogeneous or specific classes of compounds. It is known that the prediction accuracy and applicability domain (AD) of QSAR models depend on the quality and quantity of experimental data that are employed to construct and validate the models. Since 2006 when Schüürmann et al. (2006) established a QSAR model on KOC with 571 compounds, many new experimental KOC data have been published especially for some emerging pollutants (Razzaque and Grathwohl, 2008; Sun et al., 2010; Pose-Juan et al., 2011; Schenzel et al., 2012; Gellrich and Knepper, 2012; Stenzel et al., 2013). Thus, in this study, the data set was updated to 824 compounds covering 31 different chemical classes, and the QSAR model was developed to predict the KOC values for heterogeneous chemicals with the functional groups including ˜C@C—, AC„CA, AOH, AOA, ACHO, ˜C@O, AC@O(O), ACOOH, AC6H5, ANO2, ANH2, ANHA, ˜NA, ANANA, ANHAC(O)ANHA, AOAC(O)ANH2, AC(O)ANH2, AX(F, Cl, Br, I), ASA, ASH, AS(O)2A, AOS(O)2A, ANHAS(O)2A, (SR)2PH(OR)2 and ˜Si—.

are shown in Fig. S1. It indicates that the data in the training set and validation set have the similar statistical distribution. The CAS numbers, SMILES and corresponding experimental and predicted log KOC values of the compounds are listed in the Supplementary material (Table S1). 2.2. Calculation of the molecular structure descriptors The molecular structures were pre-optimized with the MM2 method (Schnur et al., 1991) contained in the ChemBio3D Ultra (Version 12.0) (http://www.cambridgesoft.com/services/). Thereafter, in order to get a more appropriate geometric structure for calculating the molecular descriptors, they were further optimized with the B3LYP/6-31+G(d,p) method (Lee et al., 1988; Becke, 1993) by using the Gaussian 09 program suite (Frisch et al., 2009). Besides, the implicit solvent effects were taken into account by using the integral equation formalism of the polarizable continuum model (IEFPCM) (Cancès et al., 1997) to simulate the water environment. The intermolecular forces like Van der Waals forces, electrostatic forces, and hydrogen bonding, could influence the distribution of compounds between water and soils/sediments (Nguyen et al., 2005). Thus, 17 quantum chemical descriptors used for characterization of the intermolecular interactions were chosen for the model development (Table S2). In addition, based on the optimized geometric structures, values of 4885 molecular structural descriptors were calculated by using Dragon software (Version 6.0, Talete srl, 2012). Those descriptors with constant and near-constant values were excluded, and those with pair correlation larger than or equal to 0.95 were also discarded. Finally, 1890 Dragon descriptors were retained. 2.3. Model development Stepwise multiple linear regression (MLR) analysis in the SPSS software (SPSS 13.0) was employed to select variables and develop models with the logKOC values as the dependent variable and the molecular structural descriptors as the predictor variables. The best model was selected, which has the lowest number of molecular descriptors, the maximum adjusted determination coefficient (R2adj) and each predictor variable with the variable inflation factor (VIF) < 10 (Norusis, 1997). VIF was used to characterize the multi-collinearity of the predictor variables (Norusis, 1997). R2adj is defined as follows:

R2adj ¼ 1 

Pn

2

^ i¼1 ðyi  yi Þ =ðn  p  1Þ Pn  2 i¼1 ðyi  yi Þ =ðn  1Þ

ð1Þ

^i are the observed value and predicted value for the where yi and y  is the average response value, n is ith compound, respectively, y the number of compounds and p is the number of descriptors.

2. Materials and methods 2.1. Data set

2.4. Model characterization and evaluation of the ADs 1

The experimental KOC values (L kg ) of 824 organic compounds were collected from EPI Suite™ (Version 4.1) (http://www.epa.gov/ oppt/exposure) and the literatures (Yaws, 1999; Wang et al., 2005; Razzaque and Grathwohl, 2008; Sun et al., 2010; Pose-Juan et al., 2011; Schenzel et al., 2012; Gellrich and Knepper, 2012; Stenzel et al., 2013). The 824 organic compounds in the total data set were classified into 31 groups according to the differences in the substituted functional groups (Table 1). The total data set was randomly divided into a training set and a validation set with a ratio of 3:1. The training set covers 618 compounds with a wide range of log KOC values (from 0 to 6.96), and the validation set includes 206 different compounds with a range of log KOC values (from 0 to 6.48). The normal distribution of log KOC values for the total dataset with mean of 2.70 and standard deviation (SD) of 1.22 is shown in Fig. 1, and that for the training set and validation set

The developed QSAR model was evaluated according to the OECD guidelines (OECD, 2007). The R2adj, root mean square error (RMSE), leave-one-out cross-validated Q2 (Q2LOO) and bootstrap method (Q2BOOT) (1/5, 5000 iterations) were applied to evaluate the goodness of fit and robustness of the model. The RMSEext and external explained variance (Q2ext) were used to assess the predictive ability. The statistical parameters were defined as follows:

Pnext ^ i  y i Þ2 ðy Q 2ext ¼ 1  Pni¼1 2 ext i¼1 ðyi  yext Þ sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Pn 2 ^ 2 i¼1 ðyi  yi Þ RMSE ¼ n

ð2Þ

ð3Þ

440

Y. Wang et al. / Chemosphere 119 (2015) 438–444

Table 1 Classification of 824 compounds based on functional groups.

a

Group no.

Category

na

Group no.

Category

na

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Polycyclic aromatic hydrocarbons (PAHs) and halogenated PAHs Alcohols Ethers Aldehydes Ketones Phenols Anilines Acetanilides Chlorohydrocarbons Bromohydrocarbons Nitrobenzenes Organic acids Esters Phthalates Halogenated benzenes Alkylbenzenes

25 25 18 1 6 31 32 15 30 8 9 29 37 9 18 14

17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

Biphenyls and chlorinated biphenyls Polybrominated diphenyl ethers (PBDEs) Benzamides Ureas Quinones Azobenzenes Organophosphorus compounds Sulfonic derivatives Organic sulfides Organofluorine compounds Organic iodides Organosilicones Organic arsenic compounds Aromatic heterocycles Others

23 9 9 17 4 3 67 55 47 43 2 1 1 63 173

n: the number of the compounds.

log K OC ¼ 0:039 MLOGP2 þ 0:010 a  0:342 O  058  0:069 ATSC8v  0:123 nN  0:368 nROH  0:473 P  117 þ 2:335 SpMaxA G=D þ 0:302 Mor16u  1:612

ð6Þ

ntra ¼ 618; R2adj ¼ 0:854; RMSEtra ¼ 0:472; F ¼ 402:193; p < 0:001; Q 2LOO ¼0:850; Q 2BOOT ¼0:797; next ¼206; Q 2ext ¼0:761; RMSEext ¼0:558

Fig. 1. Distribution of experimental log KOC values of the data set.

The AD of the model was characterized by the Williams plot of standardized residuals versus leverage values (hi) (Gramatica, 2007). Compounds with the standardized residual (d*) values that are outside the range (i.e., |d*| > 3) were treated as the outliers. hi was calculated by: 1

hi ¼ xTi ðX T XÞ xi

ð4Þ T

where xi is the descriptor vector of the ith compound; xi is the transpose of xi; X is the descriptor matrix and XT is the transpose of X. The warning leverage value (h*) was calculated by: 

h ¼ 3ðk þ 1Þ=n

ð5Þ

where k stands for the number of predictor variables included in the model. If a compound has hi > h*, it means the compound in the training set is very influential on the model. However, such a compound in the validation set indicates that it is structurally distant from the compounds used in the training set and the predicted KOC value for this compound is an extrapolation of the QSAR model. 3. Results and discussion 3.1. Development and validation of the QSAR model The optimum model is:

where ntra and next represents the number of compounds in the training and validation sets respectively. According to the criteria (Q2 > 0.50, R2 > 0.60) proposed by Golbraikh et al. (2003), this KOC model has high goodness-of-fit, robustness and predictive ability. Besides, we also evaluated the predictive ability of the current model using the new cut-off values developed by Kaneko and Funatsu (2013). When k (the number of the nearest neighbor data points) was equal to 10, the determination coefficient of the midpoints between the k-nearest-neighbor data points (MIDKNNs) (r2midknn) was 0.875 and the corrected RMSE of the MIDKNNs (RMSEmidknn) was 0.157, indicating the developed model has good predictive ability. The definition of the predictor variables is listed in Table 2. The VIF values for all the predictor variables are less than 10, indicating there is no serious multi-collinearity among the variables (Table 2) (Norusis, 1997; Robinson and Schumacker, 2009). As can be seen from Fig. 2, the predicted log KOC values agree well with the observed values. 3.2. Applicability domain As shown in Fig. 3, almost all of the compounds in the training set and validation set are within the domain. This implies that the compounds in the training set have satisfactory representatives. Moreover, 29 compounds in the training set and 7 in the validation set were found with h > h* (h* = 0.049) and |d*| < 3. The 29 compounds in the training set fit the model well, thus they can stabilize the QSAR model and make it more precise. The 7 compounds in the validation set indicate that the current model has an extrapolating ability. Considering h and d* (Fig. 3), there are ten outlier compounds (i.e., |d*| > 3) including cacodylic acid (No. 46), dalapon (No. 201), endothal (No. 216), benzo[ghi]perylene (No. 223), benzo[k]fluoranthene (No. 226), mevinphos (No. 462), amitraz (No. 478), 4-nonylphenyl diphenyl phosphate (No. 536), haloxyfop-methyl (No. 545) in the training set and 2,20 ,4,40 ,6,60 -hexebromobiphenyl (No. 1760 ) in the validation set. The model underestimates log KOC for cacodylic acid, and cacodylic acid is

441

Y. Wang et al. / Chemosphere 119 (2015) 438–444 Table 2 Descriptions on the predictor variables involved in the model and the corresponding t, p, VIF values. Descriptor MLOGP2

a O-058 ATSC8v nN nROH P-117 SpMaxA_G/D Mor16u *

Description a

Squared Moriguchi octanol–water partition coefficient Molecular polarizability Number of @O group in a molecule Centred Broto–Moreau autocorrelation of lag 8b weighted by van der Waals volume Number of Nitrogen atoms Number of hydroxyl groups X3-P = X group (phosphate)c Normalized leading eigenvalue from distance/distance matrix (folding degree index)d 3D-MoRSE – Signal 16/unweightede

t*

p*

VIF*

11.695 23.036 13.918 7.480 7.450 6.858 4.235 5.191 4.552

In silico model for predicting soil organic carbon normalized sorption coefficient (K(OC)) of organic chemicals.

As a kind of in silico method, the methodology of quantitative structure-activity relationship (QSAR) has been shown to be an efficient way to predict...
391KB Sizes 0 Downloads 6 Views