Environmental Toxicology and Chemistry, Vol. 33, No. 12, pp. 2688–2693, 2014
© 2014 SETAC
Printed in the USA

COMPARATIVE STUDY OF BIODEGRADABILITY PREDICTION OF CHEMICALS USING DECISION TREES, FUNCTIONAL TREES, AND LOGISTIC REGRESSION

GUANGCHAO CHEN,†‡ XUEHUA LI,*† JINGWEN CHEN,† YA-NAN ZHANG,† and WILLIE J.G.M. PEIJNENBURG‡§
†Key Laboratory of Industrial Ecology and Environmental Engineering, School of Environmental Science and Technology, Dalian University of Technology, Dalian, China
‡Institute of Environmental Sciences, Leiden University, Leiden, The Netherlands
§Center for Safety of Products and Substances, National Institute of Public Health and the Environment, Bilthoven, The Netherlands

(Submitted 19 April 2014; Returned for Revision 6 June 2014; Accepted 5 September 2014)

Abstract: Biodegradation is the principal environmental dissipation process of chemicals. As such, it is a dominant factor determining the persistence and fate of organic chemicals in the environment, and is therefore of critical importance to chemical management and regulation. In the present study, the authors developed in silico methods assessing biodegradability based on a large heterogeneous set of 825 organic compounds, using the techniques of the C4.5 decision tree, the functional inner regression tree, and logistic regression. External validation was subsequently carried out by 2 independent test sets of 777 and 27 chemicals. As a result, the functional inner regression tree exhibited the best predictability with predictive accuracies of 81.5% and 81.0%, respectively, on the training set (825 chemicals) and test set I (777 chemicals). Performance of the developed models on the 2 test sets was subsequently compared with that of the Estimation Program Interface (EPI) Suite Biowin 5 and Biowin 6 models, which also showed a better predictability of the functional inner regression tree model. The model built in the present study exhibits a reasonable predictability compared with existing models while possessing a transparent algorithm. Interpretation of the mechanisms of biodegradation was also carried out based on the models developed. Environ Toxicol Chem 2014;33:2688–2693. © 2014 SETAC

Keywords: Biodegradability; In silico models
All Supplemental Data may be found in the online version of this article.
* Address correspondence to [email protected]
Published online 10 September 2014 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/etc.2746

Once released into the environment, organic chemicals may undergo microbial, chemical, or photochemical degradation. Microbial degradation is in general the most important process governing the persistence and fate of organic chemicals in the environment [1,2]. Thus it is necessary to obtain information on biodegradability for priority setting and risk assessment of organic chemicals. Currently, chemical biodegradability is assessed mainly through several standardized methods developed by the relevant regulatory organizations, such as the Organisation for Economic Co-operation and Development; the Japanese Ministry of Economy, Trade, and Industry; and the US Environmental Protection Agency (USEPA). As the number of existing chemicals that have to be evaluated is well over 140 000, it is obviously an impossible task to test every single compound to acquire the basic information needed. The development of in silico methods that can predict the biodegradability of organic chemicals from their molecular structures is therefore of the utmost importance.

In recent years, in silico models have been developed for predicting biodegradability of organic compounds from their chemical structures [1,2,4]. Working from the BIODEG data, Howard et al. [5,6] developed linear and nonlinear models with 35 molecular fragments, and achieved overall predictive accuracies of 82% for linear and 89% for nonlinear models for the chemicals that constituted a given test set. A revised version was then built by Boethling et al., which included 5 new or redefined substructures along with the molecular weight, resulting in overall predictive accuracies of 90% and 93% for linear and nonlinear models in the respective training sets. Utilizing the same approaches, Tunkel et al. developed novel models based on biodegradation data for 884 discrete chemicals obtained by means of the MITI test, regressed against counts of 42 substructures plus the molecular weight. Their corresponding linear and nonlinear models correctly classified 81% of the compounds in an independent test set. Moreover, Loonen et al. built a predictive model based on a series of 127 preselected substructures by partial least squares discriminant analysis; an average accuracy of 83% was found across 4 external test sets. These models, however, lack a large and diverse dataset for model construction, which consequently prevents their wide applicability.

Cheng et al. recently developed a support vector machine (SVM) model based on 1604 chemicals, which achieved a 100% predictive accuracy for 27 organic compounds tested by an experimental assay through the Japanese MITI test protocols. Mansouri et al. also built k-nearest neighbor (kNN) and SVM models based on 1725 chemicals using 12 and 14 molecular descriptors, respectively. These models demonstrated good performances, with all accuracies higher than 82% in training, test, and external validation sets, and thus showed comparatively better predictability. However, the models were developed by using less transparent modeling techniques (SVM and kNN). Because of the black-box approaches [12–14], it is difficult for the model user to observe and comprehend how the prediction of biodegradability is made.

In response to this need, we developed in silico methods to assess chemical biodegradability by the comprehensible approaches of logistic regression, functional inner regression tree [16,17], and C4.5 decision tree, building on a diverse and heterogeneous set of 825 organic compounds. Models were externally validated by 2 test sets, 1 test set
In silico models for predicting chemical biodegradability
consisting of 777 organic compounds and 1 set composed of the 27 chemicals that were used originally for external validation in the study of Cheng et al. In all, 487 molecular descriptors were calculated using DRAGON software (Ver 2.1). These descriptors were filtered afterward by the functional tree method, and the 13 most informative descriptors were obtained as output. Specifically, the effect of the class distribution of readily biodegradable and nonbiodegradable chemicals on model statistical performance was evaluated first, showing that a balanced class distribution is optimal for model construction. The performance of the models on the 2 test sets was also compared with that of the Estimation Program Interface (EPI) Suite Biowin 5 and Biowin 6 models. Interpretation of the mechanisms underlying biodegradation was carried out by analyzing the model descriptors, after which structural information on either readily biodegradable or nonbiodegradable chemicals was extracted. In particular, the functional inner regression model performed very well compared with the other models.
Environ Toxicol Chem 33, 2014
The experimental dataset was assembled from 3 sources. The first was the Japanese National Institute of Technology and Evaluation dataset comprising 846 organic compounds, which were judged as being readily biodegradable or nonbiodegradable mainly according to the MITI-I or MITI-II test protocols. The second source, consisting of 26 chemicals, was retrieved from the Environmental Fate Data Base (EFDB). The remaining data, up to 757 chemicals, were primarily from the study of Cheng et al. Duplicated compounds were removed from our database by manually checking the chemical CAS numbers. For the EFDB data, the highest reliability rating (consistent results with 3 or more tests) was chosen to obtain reliable biodegradability information. Our entire dataset comprised 1629 organic compounds, 638 of which were assessed as being readily biodegradable through the experimental assays performed. Chemicals were labeled as either R (readily biodegradable) or N (nonbiodegradable) for modeling purposes. The two-dimensional (2D) structures of the organic chemicals were confirmed with the USEPA Aggregated Computational Toxicology Resource.

Molecular structural descriptors

Molecular structural descriptors were calculated using the DRAGON software (Ver 2.1), including the constitutional descriptors, molecular walk counts, BCUT descriptors, 2D autocorrelations, charge descriptors, aromaticity indices, functional groups, and atom-centered fragments, accounting for a total of 487 different descriptors. These descriptors were subsequently filtered by the functional tree algorithm within the open source software Weka (Ver 220.127.116.11). This yielded the 13 most informative descriptors as output.

Modeling algorithms

Predictive models were built using logistic regression, a functional inner regression tree, and the C4.5 decision tree. The logistic regression algorithm is a type of regression analysis for predicting dichotomous dependent variables by estimating the probability of the event's occurrence. Ridge estimators were employed for parameter approximation. In our case, the probability of a chemical being nonbiodegradable (or readily biodegradable) was compared with a threshold value of 0.5 within the logistic regression model.

As an extension of the earlier ID3 algorithm, C4.5 generates a decision tree from the set of training data, and each inner node contains a test on the original attributes. Prediction is accomplished by traversing the tree from the root to a leaf that directly classifies an organic compound as readily biodegradable or nonbiodegradable. Building on the logistic model tree algorithm, the functional tree expands its choice of descriptors to split at internal nodes by generating synthetic functions, and the node's logistic regression predicts class probabilities, which can be described by Equation 1

$$P(x) = \frac{e^{2F(x)}}{1 + e^{2F(x)}} \tag{1}$$

where P(x) is the class probability, which is then compared with a threshold generated by the algorithm, and F(x) is a combination of the output descriptors automatically generated by the logistic regression method, which has the form

$$F(x) = a + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k \tag{2}$$

In Equation 2, $X_i$ is the value of molecular descriptor $i$, and $b_i$ represents the corresponding coefficient to be learned in the function. The functional tree is grown in a standard top-down recursive partitioning strategy, wherein univariate splitting is considered at each internal node determining the path that a compound will follow. Different conceptual models of a functional tree can be constructed by simplifying the algorithm: functional leaves regression tree (FT-Leaves), functional inner regression tree (FT-Inner), and the full functional regression tree (FT). We employed the FT-Inner for modeling, wherein multivariate combinations of the original attributes are used exclusively at internal nodes and the leaf nodes only generate the chemical classification. Descriptor selection is conducted by the FT method. All procedures were performed with the Weka software (Ver 18.104.22.168).

Model applicability domain
The applicability domain of the model is a response and chemical structure space on which a model has been constructed and in which the model is suited to make predictions for unclassified chemicals with a given reliability. In the present study, the model applicability domain is defined by the Euclidean distance method using the open source software Ambit Discovery (Ver 0.04). The distance is estimated from a point to the center of the training set, and the central point can be measured as follows

$$\bar{X} = \frac{1}{n} \sum_{j=1}^{n} X_j \tag{3}$$

where $X_j$ is the value of a descriptor, $\bar{X}$ is the average, and $n$ represents the total number of chemicals in the training set. The average of all individual descriptors determines the position of the central point. The subsequent Euclidean distance ($d_i$) between compound $i$ and the central point in M-dimensional space (M represents the total number of descriptors used) is calculated as follows

$$d_i = \sqrt{\sum_{k=1}^{M} \left( X_{ik} - \bar{X}_k \right)^2} \tag{4}$$
G. Chen et al.
Table 1. Description of the filtered molecular structural descriptors used in the models

No.  Descriptor  Description^a
1    nCIC        No. of rings
2    nN          No. of nitrogen atoms
3    nS          No. of sulfur atoms
4    nX          No. of halogen atoms
5    SRW10       Self-returning walk count of order 10
6    ATS3p       Broto–Moreau autocorrelation of a topological structure
7    MATS3m      Moran autocorrelation of a topological structure
8    GATS3m      Geary autocorrelation of a topological structure
9    C-001       CH3R / CH4
10   C-007       CH2X2
11   C-040       R-C(=X)-X / R-C≡X / X=C=X
12   O-061       O--^b
13   Cl-089      Cl attached to C1 (sp2)

^a R represents any group linked through a carbon atom; X represents any electronegative atom (e.g., O, N, S, P, Se, halogens).
^b As in nitro, N-oxides.
Figure 1. Model applicability domain characterized by the Euclidean distance. ATS3p = Broto–Moreau autocorrelation of a topological structure; SRW10 = self-returning walk count of order 10. [Color figure can be viewed in the online issue which is available at wileyonlinelibrary.com]
In the Euclidean distance expression for $d_i$, $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_M$ are the averages of the individual descriptors at the center, and $X_{i1}, X_{i2}, \ldots, X_{iM}$ are the descriptor values of chemical $i$.
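The centroid and distance calculation described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names and the 2-descriptor toy data are hypothetical, the descriptors are used unscaled (the text does not say how Ambit Discovery scales them), and the boundary is taken as the largest training-set distance, as described later for Figure 1.

```python
import math

def centroid(training):
    """Column-wise mean of the descriptor matrix (the central point)."""
    n = len(training)
    m = len(training[0])
    return [sum(row[k] for row in training) / n for k in range(m)]

def distance_to_center(x, center):
    """Euclidean distance between one descriptor vector and the central point."""
    return math.sqrt(sum((xk - ck) ** 2 for xk, ck in zip(x, center)))

def domain_boundary(training, center):
    """Largest distance found in the training set, used here as the domain limit."""
    return max(distance_to_center(row, center) for row in training)

def in_domain(x, center, boundary):
    """True when a chemical lies no farther from the center than the boundary."""
    return distance_to_center(x, center) <= boundary

# Hypothetical toy data with 2 descriptors per chemical (not from the study).
training = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
center = centroid(training)
boundary = domain_boundary(training, center)

print(in_domain([1.5, 1.5], center, boundary))  # inside the toy domain
print(in_domain([4.0, 4.0], center, boundary))  # outside the toy domain
```

With the 13 descriptors of Table 1, each row would hold 13 values instead of 2; the functions themselves do not change.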
Model performance evaluation

The developed models were evaluated based on the concepts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), which were stored in a confusion matrix. Predictive accuracy was reflected by 3 indices: sensitivity (SE = TP/[TP + FN]), specificity (SP = TN/[TN + FP]), and overall predictive accuracy (Q = [TP + TN]/[TP + FP + TN + FN]) [29,30]. The receiver operating characteristic was also considered in model evaluation, and its quantitative measurement of area under the curve was used. The learning procedure on the training set was executed in 10-fold cross-validation.

The effect of the class distribution of readily biodegradable and nonbiodegradable chemicals on a model's statistical performance was evaluated first, and a balanced class distribution turned out to be optimal for building models (Supplemental Data, Figure S1). We therefore constructed a training set of balanced class distribution, which comprised 416 nonbiodegradable and 409 readily biodegradable chemicals. Two test sets consisting of 777 (test set I) and 27 (test set II) organic compounds were used for external validation (Supplemental Data, Table S3). The method of dividing training and test sets is described in the Supplemental Data. It is worth noting that the 27 chemicals in test set II were originally used for external validation in the study of Cheng et al.

To determine the optimal model for predicting biodegradability, we initially developed the 3 models by using logistic regression, FT-Inner regression tree, and C4.5 decision tree. The 13 descriptors present in the 3 models are given in Table 1. The performances of the 3 models on the training set and test set I are summarized in Table 2. The comparison results show that on the basis of 13 molecular descriptors, the FT-Inner model was the top performing method among the 3 models, with predictive accuracies in both training and test set exceeding 80% (81.5% and 81.0%, respectively). The FT-Inner model has the form

where the probability of a chemical belonging to the nonbiodegradable class can be measured by Equation 1. (See Table 1 for definitions of the descriptors present in the model.) A threshold of 0.388115 was generated by the machine learning procedure of FT-Inner, and the P(x) in the FT-Inner model was compared with this threshold. A chemical with P(x) higher than 0.388115 is judged as nonbiodegradable; otherwise, the chemical is judged as readily biodegradable. Information on the logistic regression and decision tree models is given in the Supplemental Data.

We also predicted biodegradability of the compounds in test set I by utilizing the widely used EPI Suite Biowin 5 and Biowin 6 models. The results showed that the predictive accuracies of Biowin 5 and Biowin 6 models on test set I were 78.2% and 80.6%, respectively (see Supplemental Data). Thus the FT-Inner model exhibits a better predictability on test set I. To further validate the developed models, the predictive performance of the C4.5 decision tree, FT-Inner, and logistic regression models on test set II was subsequently evaluated and compared with that of the EPI Suite Biowin 5 and Biowin 6 models and the 2 best black-box models reported in the study of Cheng et al. (the descriptor-based model CHAID-SVM and the fingerprint-based model PubChemFP-SVM). As shown in Supplemental Data, Table S4, the FT-Inner model yielded an excellent performance on external test set II.

Table 2. Performance of the C4.5 decision tree, FT-Inner, and logistic regression models on the training set and test set I

Model                 Data set
C4.5 decision tree    Training set
C4.5 decision tree    Test set I
FT-Inner              Training set
FT-Inner              Test set I
Logistic regression   Training set
Logistic regression   Test set I

^a Data provided are the percentage, with the number/total provided in parentheses.
FT-Inner = functional inner regression tree; SE = sensitivity (SE = TP/[TP + FN]); SP = specificity (SP = TN/[TN + FP]); Q = overall predictive accuracy (Q = [TP + TN]/[TP + FP + TN + FN]); AUC = area under the curve; TP = true positives; FP = false positives; TN = true negatives; FN = false negatives.
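To make the decision rule and the evaluation indices concrete, the following Python sketch chains Equation 2, Equation 1, the reported threshold of 0.388115, and the SE, SP, and Q definitions. It is a minimal illustration, not the Weka implementation: the coefficients, descriptor values, and labels are hypothetical, and treating the nonbiodegradable class (N) as the positive class is an assumption, since the text does not state which class is counted as positive.

```python
import math

THRESHOLD = 0.388115  # FT-Inner threshold reported in the text

def linear_score(a, b, x):
    """Equation 2: F(x) = a + b1*X1 + ... + bk*Xk."""
    return a + sum(bi * xi for bi, xi in zip(b, x))

def class_probability(f_x):
    """Equation 1: P(x) = e^(2F) / (1 + e^(2F)), written to avoid overflow."""
    z = 2.0 * f_x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def classify(f_x, threshold=THRESHOLD):
    """'N' (nonbiodegradable) if P(x) exceeds the threshold, else 'R'."""
    return "N" if class_probability(f_x) > threshold else "R"

def evaluate(true_labels, predicted_labels, positive="N"):
    """Return (SE, SP, Q); assumes both classes occur in true_labels."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    q = (tp + tn) / (tp + fp + tn + fn)
    return se, sp, q

# Hypothetical coefficients and descriptor values (not the fitted model).
score = linear_score(0.5, [1.0, -2.0], [0.5, 0.5])
print(classify(score))

# Hypothetical labels for five chemicals.
true_labels = ["N", "N", "R", "R", "N"]
predicted_labels = ["N", "R", "R", "R", "N"]
print(evaluate(true_labels, predicted_labels))
```

Because the threshold is below 0.5, a chemical with F(x) = 0 (P(x) = 0.5) is still assigned to the nonbiodegradable class under this rule.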
Applicability domain of models
The Euclidean distance approach was applied to evaluate the applicability domain of the models. Characterized by the 13 descriptors in Table 1, the applicability domain of the models is a 13-dimensional space. For an intuitive understanding of the applicability domain, the continuous variables SRW10 and ATS3p were selected for visualizing the descriptor space instead of the integer-valued ones (e.g., nCIC, nN, nS). The maximum Euclidean distance of 1.481 in the training set was used to set the boundaries of the chemical applicability domain of the model (Figure 1). Because of the large dataset used for model construction, the models possess a wide applicability domain.
Figure 2. Distribution analysis of constitutional descriptors (p < 0.001) between nonbiodegradable and readily biodegradable classes, including nCIC, nN, nS, nX, C-040, O-061, and Cl-089. (See Table 1 for deﬁnitions of the descriptors.) [Color ﬁgure can be viewed in the online issue which is available at wileyonlinelibrary.com]
Only 8 of the 777 organic compounds in test set I fall outside the domain, and all 27 organic compounds in test set II are within it. These results imply that the training set has satisfactory representativeness.

Relevance of the descriptors to chemical biodegradability
Statistical significance of the differences in descriptor distributions between the nonbiodegradable and readily biodegradable classes was assessed by the independent-samples t test using IBM SPSS Statistics 19. A p value of less than 0.001 was considered statistically highly significant. The corresponding p values of the t test for most of the descriptors were less than 0.001 except for GATS3m, C-007, and C-001 (Supplemental Data, Table S5). The p values for GATS3m, C-007, and C-001 were 0.951, 0.117, and 0.046, respectively, none of which meets the p < 0.001 criterion. To extract structural information on either readily biodegradable or nonbiodegradable chemicals, we performed a distribution analysis on the constitutional descriptors (p < 0.001) in the models (Figure 2). The nCIC is distributed between 0 and 6 in the entire dataset. This descriptor was found to have a retarding effect on chemical biodegradability in the model. The mean values of nCIC were 1.33 and 0.54 (p < 0.001) for nonbiodegradable and readily biodegradable chemicals, respectively. For the monocyclic chemicals (nCIC = 1), the number of nonbiodegradable chemicals (441) exceeded the number of readily biodegradable chemicals (225) by a factor of almost 2. Chemicals containing more than 2 rings (404 in total) are more likely to be catabolically recalcitrant, with only 56 chemicals belonging to the readily biodegradable group. Organohalogen compounds are characterized by their high stability in the environment because halogenation of organic compounds tends to hinder biodegradation [8,31–33]. The nX is distributed between 0 and 27 in the dataset, with mean values of 0.88 and 0.11 for the nonbiodegradable and readily biodegradable chemicals, respectively (p < 0.001), whereas those of Cl-089 are 0.38 and 0.03, respectively (p < 0.001), showing a significant difference in nX and Cl-089 between the 2 classes of chemicals.
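As an illustration of the comparison behind these p values, the sketch below computes the pooled-variance (equal-variance) independent-samples t statistic and its degrees of freedom in plain Python. It is a hedged example rather than the SPSS procedure used in the study: the function name and the toy descriptor values are hypothetical, and the p value itself is not computed because that requires the t distribution, which the standard library does not provide.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(group_a, group_b):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    # Pooled sample variance, weighting each group by its degrees of freedom.
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    standard_error = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    t = (mean(group_a) - mean(group_b)) / standard_error
    return t, df

# Hypothetical descriptor values for two small groups (not from the study),
# e.g., ring counts for nonbiodegradable vs. readily biodegradable chemicals.
nonbiodegradable = [1.0, 2.0, 3.0, 4.0, 5.0]
readily_biodegradable = [0.0, 1.0, 0.0, 1.0, 0.0]

t_value, dof = pooled_t_statistic(nonbiodegradable, readily_biodegradable)
print(t_value, dof)
```

A statistics package would normally be used to convert the statistic into a p value and to check the equal-variance assumption before relying on the pooled form.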
A significant difference in nN between the 2 groups of chemicals was also observed (p < 0.001), with mean values of 0.94 and 0.33 for nonbiodegradable and readily biodegradable chemicals, respectively. This means that, in general, chemicals with high nN are more likely to be resistant to biodegradation. This observation cannot be generalized, however, because the biodegradability of N-containing chemicals also depends on the other substituents present in the molecule. Aromatic NO2 and NH2 groups contributed negatively in previous models [7,8,34] and have therefore been regarded as retarding factors for chemical biodegradability. In contrast, the presence of an aliphatic NH2 substituent is viewed as positive for biodegradability [7,8]. Unlike the substructures mentioned above, the chemical fragments (R-C(=X)-X / R-C≡X / X=C=X) characterized by descriptor C-040 can be considered as a positive factor, and its mean value for readily biodegradable chemicals (0.58) exceeds that of the nonbiodegradable class (0.29). Moreover, the corresponding p value of less than 0.001 is indicative of a significant distribution difference between the readily biodegradable and nonbiodegradable classes. Interestingly, only 154 chemicals in the entire dataset have nS > 0, of which 126 are nonbiodegradable. Also, for the structural fragment (O--) characterized by descriptor O-061, only 96 chemicals have O-061 > 0 (88 are nonbiodegradable). These data demonstrate that chemicals with high nS or O-061 in our dataset are more likely to be resistant to biodegradation in the environment. These observations cannot be generalized, however, because the biodegradability of O-containing or S-containing chemicals also depends on other substituents present in the molecules.

Moreover, 4 topological descriptors were also involved in the models, namely SRW10, ATS3p, MATS3m, and GATS3m. These topological descriptors are derived from the hydrogen-suppressed molecular graphs. A self-returning walk (SRW) is a walk starting and ending at the same vertex; it has been used extensively to measure the complexity of graphs and molecules. The length of a walk (e.g., 10 for SRW10) is the total number of edges traversed. The ATS3p, MATS3m, and GATS3m are, respectively, Broto–Moreau, Moran, and Geary autocorrelations of lag 3, which are weighted by atomic polarizability, atomic mass, and atomic mass, respectively. The presence of these 4 descriptors reveals the important role of molecular complexity, atomic polarizability, and atomic mass in chemical biodegradability.

As the principal environmental dissipation process, biodegradation is the primary removal pathway for organic compounds following their release into the environment. The aforementioned analysis of constitutional descriptors in our models provides molecular structural information for chemical biodegradability and therefore offers a reliable reference for the further evaluation and regulation of organic chemicals.

CONCLUSIONS
In summary, we have developed and validated in silico methods for assessing chemical biodegradability by the comprehensible approaches of logistic regression, functional inner regression tree, and C4.5 decision tree, building on a diverse and heterogeneous set of organic compounds. The FT-Inner model was found to have the best ability to predict biodegradability accurately for a large variety of organic chemicals, with accuracies on both training and test sets exceeding 80.0%. In addition, we addressed the practical problem of class distribution selection prior to modeling by evaluating the effect of the class distribution of readily biodegradable and nonbiodegradable chemicals on model performance, and selected a balanced class distribution for the training data. The model applicability domain is defined by the Euclidean distance method. Analysis of constitutional descriptors in the models was also performed to extract information on the structural prerequisites of either readily biodegradable or nonbiodegradable chemicals. The FT model developed in the present study is easily employed, has a concise form, and is able to supply critical information for chemical regulation and the task of designing biodegradable chemicals.

SUPPLEMENTAL DATA
Tables S1–S6. Figure S1. Equation S1. (190 KB DOC).
Data Sets. (92 KB XLS).

Acknowledgment: We acknowledge the WEKA Machine Learning Project for the open-source software used in the present study. This study was supported by the Chinese National Basic Research Program (grant 2013CB430403), the Chinese High-Tech Research and Development Program (grant 2012AA06A301), and the Chinese National Natural Science Foundation (grants 21137001, 21325729, and 21477016). G. Chen gratefully acknowledges the China Scholarship Council for fellowship support at Leiden University, The Netherlands.
REFERENCES

1. Pavan M, Worth AP. 2006. Review of QSAR models for ready biodegradation. EUR 22355 EN. European Commission, Joint Research Centre, Ispra, Italy.
2. Rücker C, Kümmerer K. 2012. Modeling and predicting aquatic aerobic biodegradation—A review from a user's perspective. Green Chem 14:875–887.
3. Daginnus K. 2010. Characterisation of the REACH pre-registered substances list by chemical structure and physicochemical properties. EUR 24138 EN. European Commission. Brussels, Belgium.
4. Howard PH. 2000. Biodegradation. In Boethling RS, Mackay D, eds, Handbook of Property Estimation Methods for Chemicals: Environmental and Health Sciences. CRC Press, Boca Raton, FL, USA, pp 281–311.
5. Howard PH, Hueber AE, Boethling RS. 1987. Biodegradation data evaluation for structure/biodegradation relations. Environ Toxicol Chem 6:1–10.
6. Howard PH, Boethling RS, Stiteler WM, Meylan WM, Hueber AE, Beauman JA, Larosche ME. 1992. Predictive model for aerobic biodegradability developed from a file of evaluated biodegradation data. Environ Toxicol Chem 11:593–603.
7. Boethling RS, Howard PH, Meylan W, Stiteler W, Beaumann J, Tirado N. 1994. Group contribution method for predicting probability and rate of aerobic biodegradation. Environ Sci Technol 28:459–465.
8. Tunkel J, Howard PH, Boethling RS, Stiteler W, Loonen H. 2000. Predicting ready biodegradability in the Japanese Ministry of International Trade and Industry Test. Environ Toxicol Chem 19:2478–2485.
9. Loonen H, Lindgren F, Hansen B, Karcher W, Niemelä J, Hiromatsu K, Takatsuki M, Peijnenburg W, Rorije E, Struijs J. 1999. Prediction of biodegradability from chemical structure: Modeling of ready biodegradation test data. Environ Toxicol Chem 18:1763–1768.
10. Cheng F, Ikenaga Y, Zhou Y, Yu Y, Li W, Shen J, Du Z, Chen L, Xu C, Liu G, Lee PW, Tang Y. 2012. In silico assessment of chemical biodegradability. J Chem Inform Model 52:655–669.
11. Mansouri K, Ringsted T, Ballabio D, Todeschini R, Consonni V. 2013. Quantitative structure-activity relationship models for ready biodegradability of chemicals. J Chem Inform Model 53:867–878.
12. Yang Z, Tang WH, Shintemirov A, Wu QH. 2009. Association rule mining-based dissolved gas analysis for fault diagnosis of power transformers. Systems Man Cybernetics C Applic Rev 39:597–610.
13. Tan SB. 2006. An effective refinement strategy for KNN text classifier. Expert Systems Applic 30:290–298.
14. Martens D, Baesens B, Van Gestel T, Vanthienen J. 2007. Comprehensible credit scoring models using rule extraction from support vector machines. Eur J Operat Res 183:1466–1476.
15. Kleinbaum DG, Klein M. 2010. Introduction to logistic regression. In Logistic Regression (Statistics for Biology and Health). Springer-Verlag, New York, NY, USA.
16. Gama J. 2004. Functional trees. Mach Learn 55:219–250.
17. Witten IH, Frank E, Hall MA. 2011. Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed. Morgan Kaufmann, San Francisco, CA, USA.
18. Quinlan JR. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, CA, USA.
19. National Institute of Technology and Evaluation. 2006. Biodegradation and Bioconcentration Database of the Existing Chemical Substances. [cited 2011 October 31]. Available from: http://www.safe.nite.go.jp/english/db.html.
20. Rosenberg SA, Hueber AE, Aronson D, Gouchie S, Howard PH, Meylan W, Tunkel JL. 2004. Syracuse Research Corporation's chemical information databases: Extraction and compilation of data related to environmental fate and exposure. Sci Tech Libr 23:73–87.
21. Judson R, Richard A, Dix D, Houck K, Elloumi F, Martin M, Cathey T, Transue TR, Spencer R, Wolf M. 2008. ACToR—Aggregated Computational Toxicology Resource. Toxicol Appl Pharmacol 233:7–13.
22. Todeschini R, Consonni V. 2000. Handbook of Molecular Descriptors: Methods and Principles in Medicinal Chemistry. Wiley-VCH, New York, NY, USA.
23. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. 2009. The WEKA data mining software: An update. SIGKDD Explorations 11:10–18.
24. Quinlan JR. 1986. Induction of decision trees. Mach Learn 1:81–106.
25. Landwehr N, Hall M, Frank E. 2005. Logistic model trees. Mach Learn 59:161–205.
26. Worth AP, Hartung H, van Leeuwen CJ. 2004. The role of the European Centre for the Validation of Alternative Methods (ECVAM) in the validation of (Q)SARs. SAR QSAR Environ Res 15:345–358.
27. Zhu H, Rusyn I, Richard A, Tropsha A. 2008. Use of cell viability assay data improves the prediction accuracy of conventional quantitative structure-activity relationship models of animal carcinogenicity. Environ Health Persp 116:506–513.
28. Jeliazkova N, Jaworska J. 2006. Ambit Discovery, version 0.04 (freeware). Ideaconsult Ltd, Bulgaria.
29. Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H. 2000. Assessing the accuracy of prediction algorithms for classification: An overview. Bioinformatics 16:412–424.
30. Osei-Bryson KM. 2004. Evaluation of decision trees: A multi-criteria approach. Comput Oper Res 31:1933–1945.
31. Gamberger D, Horvatić D, Sekušak S, Sabljić A. 1996. Applications of experts' judgement to derive structure biodegradation relationships. Environ Sci Pollut Res 3:224–228.
32. Kompare B. 1998. Estimating environmental pollution by xenobiotic chemicals using QSAR(QSBR) models based on artificial intelligence. Water Sci Technol 37:9–18.
33. Parsons JR, Sáez M, Dolfing J, de Voogt P. 2008. Biodegradation of perfluorinated compounds. Rev Environ Contam Toxicol 196:53–71.
34. Rorije E, Loonen H, Müller M, Klopman G, Peijnenburg WJGM. 1999. Evaluation and application of models for the prediction of ready biodegradability in the MITI-I test. Chemosphere 38:1409–1417.
35. Klein DJ, Palacios JL, Randic M, Trinajstic N. 2004. Random walks and chemical graph theory. J Chem Inform Comput Sci 44:1521–1525.