J Mol Model (2014) 20:2322 DOI 10.1007/s00894-014-2322-5

ORIGINAL PAPER

Industrial applications of in silico ADMET

Bernd Beck & Tim Geppert

Received: 31 March 2014 / Accepted: 27 May 2014
© Springer-Verlag Berlin Heidelberg 2014

Abstract Quantitative structure activity relationship (QSAR) modeling has been in use for several decades. One branch of it, in silico ADMET, has become increasingly important since the late 1990s, when studies indicated that poor pharmacokinetics and toxicity were important causes of costly late-stage failures in drug development. In this paper we describe some of the available methods and best practices for the different stages of the in silico model building process. We also describe some more recent developments, such as automated model building and the prediction probability. Finally, we discuss the use of in silico ADMET for "big data" and the importance and possible further development of interpretable models.

Keywords Automated model building . Data curation . In silico ADMET . Prediction probability . QSAR

This paper belongs to a Topical Collection on the occasion of Prof. Tim Clark's 65th birthday

B. Beck (*) : T. Geppert
Department of Lead Identification and Optimization Support, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorferstrasse 65, 88397 Biberach an der Riss, Germany
e-mail: [email protected]

Introduction

The origins of what we nowadays call in silico ADMET are the first QSAR studies, published more than a century ago. The work of Charles Richet from 1893 [1], and especially that of Emil Fischer from 1894 [2], can be seen as the starting point of QSAR analysis. Richet published a short note on the relationship between the toxicity and the physicochemical properties of a series of organic compounds, concluding that the more soluble the compounds are, the more toxic they are. Fischer investigated the influence of the α- and β-configurations of different glucosides on their enzymatic cleavage. He concluded from his observations that enzyme and glycoside must fit together like lock and key in order to exert a chemical effect on each other. Around 1900, Meyer and Overton formulated the lipoid theory of narcosis [3, 4]. Another important milestone in the history of QSAR was the work published in the early 1960s by Hansch and Fujita on the quantitative relationship between physicochemical properties and biological activities [5]. Hansch analysis and related approaches [6, 7] had a significant impact on the QSAR field in the following decades. One of the most famous QSAR programs developed at that time was ClogP [8], a fragment-additive approach that can still be regarded as one of the standards in logP prediction. In the following decades many new molecular descriptors as well as new statistical methods were published and used in the field. In 1988 the first 3D QSAR approach, CoMFA [9], was published. In 1992 one of the authors started his PhD work in the research team of Prof. Tim Clark. At that time, data sets for many endpoints were very often quite small or not available at all in the public domain [10, 11]. Computer power was also limited compared to today, which restricted the kinds of descriptors used for QSAR/QSPR analysis; quantum-mechanically derived descriptors, for example, were often too expensive for that purpose. Only a few statistical methods, such as linear regression, simple decision trees, and neural networks, were available within the first SAR software packages, for example TSAR [12]. In the late 1990s, studies indicated that poor pharmacokinetics and toxicity were important causes of costly late-stage failures in drug development [13]. The overall picture of these studies is shown in Fig. 1. It became widely appreciated that these areas should be considered as early as possible in the drug discovery process.


Fig. 1 Main reasons for attrition in drug discovery in the 1990s [14]

At that time, combinatorial chemistry and high-throughput screening had significantly increased the number of compounds for which early ADMET data were needed. On the other hand, few medium- and high-throughput in vitro ADMET assays were available. One branch of QSAR, in silico ADMET, became an important means of filling this gap. Nowadays very large virtual libraries are generated in the pharmaceutical industry [15, 16] and can be searched with high-performance virtual screening methods [17, 18]. In silico ADMET predictions are very important for prioritizing virtual screening hits and for supporting the design of virtual libraries. Figure 2 shows a Google books Ngram Viewer diagram [19], an analysis of the literature published between 1975 and 2008 using the keywords QSAR model, ADMET property, and QSPR model. It shows the concurrent growth of QSAR and QSPR publications, and that the number of ADMET property publications has also been growing since about 1995. In the following we describe the current status of in silico ADMET with respect to available data sets, descriptors, and modeling methods. We describe best practice for the different steps in ADMET modeling and some recent developments within the pharmaceutical industry, using examples from Boehringer Ingelheim. At the end we give a short outlook on what we think will become more important in the in silico ADMET research area.

Data sets and data preparation

In the early 1990s ADMET data sets in the public domain were rare, with only a few small published data sets. Databases of drug-like compounds and assay measurements such as the WDI [20], CMC [21], or MDDR [22] were quite small and covered only a few ADMET endpoints. Even within pharmaceutical companies only small ADMET data sets were available, as the corresponding assays were expensive and had a low throughput. Nowadays this situation has changed quite dramatically. Poor pharmacokinetics and toxicity were a major reason for the relatively high number of late-stage failures in the pharmaceutical industry during the late 1990s [13].


Based on this observation, pharmaceutical companies have changed their ADMET-related design strategy. Drug metabolism, pharmacokinetics, toxicity, and related properties are now tested much earlier and with much higher throughput. The increase in testing capacity results in much larger data sets that can be used for ADMET model building. In the public domain there are also a number of ADMET-related data sets available through different databases. Examples are PubChem, which started in 2004 [23], ChEMBL, which started at the end of 2009 [24], and the newest example, UniChem [25]. In addition there are several commercially available databases such as Thomson Integrity [26] or the already mentioned WDI [20]. One point that has to be mentioned here is that many of the novel high-throughput assays measure surrogate parameters, e.g., Caco-2 permeability or hERG binding assays, which of course influences the information content of such measurements.

Data preparation

After the data for a selected endpoint have been extracted from one of the available databases, the next step in ADMET modeling is data preparation. A data set typically consists of structures and associated measurements for the ADMET endpoint of interest. The data curation step is of utmost importance for the generation of a good ADMET model. A few recent publications [27–30] discuss the errors in bioactivity databases and MedChem publications with respect to the chemical depiction and the reported measurements. Consequently, the data curation step consists of inspection and standardization of the chemical structures as well as an analysis of the reported assay data. For the chemical structures the procedure consists of a number of steps. If possible, structural errors should be corrected, e.g., by comparing structures incorporated from different sources or by using the compound name to check the structure. If doubts remain, it is advisable to leave a questionable compound out of the model building process. In the next step the structures should be standardized. Standardization includes the removal of counterions, the normalization of specific chemotypes (e.g., the 2D drawing of nitro groups), aromatization, and a check for undefined stereocenters. Depending on the endpoint, compounds with undefined stereochemistry should be sorted out; we recommend always excluding compounds with several undefined stereocenters. Standardization of tautomeric forms is also often discussed. We think it is helpful in order to be consistent within a data set, but it is often not possible to decide which tautomeric form is the correct one for a given endpoint, so no generally applicable rule can be defined here. The final step should be a brief manual inspection of the structures, or at least of a random subset. This manual inspection can highlight possible methodological errors.
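To make the curation steps above concrete, the following is a minimal sketch of such a standardization pass, assuming the open-source RDKit toolkit (salt stripping and a check for unassigned stereocenters); it illustrates the idea only and is not the procedure or software actually used in-house.

```python
# Minimal sketch of a structure standardization pass, assuming the open-source
# RDKit toolkit; the concrete rules are illustrative, not the in-house procedure.
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover

def standardize(smiles):
    """Return a curated canonical SMILES, or None if the compound should be excluded."""
    mol = Chem.MolFromSmiles(smiles)              # parsing also perceives aromaticity
    if mol is None:                               # unparsable structure -> leave it out
        return None
    mol = SaltRemover().StripMol(mol)             # remove common counterions
    if mol.GetNumAtoms() == 0:
        return None
    # Undefined stereocenters: exclude compounds with more than one of them.
    unassigned = [c for c in Chem.FindMolChiralCenters(mol, includeUnassigned=True)
                  if c[1] == "?"]
    if len(unassigned) > 1:
        return None
    return Chem.MolToSmiles(mol)                  # canonical SMILES as the standard form

print(standardize("CC(N)C(=O)O.[Na+].[Cl-]"))     # counterions are stripped
```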



Fig. 2 Google books Ngram Viewer [19] using QSAR models, QSPR models, and ADMET Properties as search strings

The second important preparation step is to check the experimental values themselves. In the public domain, and also in the pharmaceutical industry, experimental ADMET endpoints are often collected from different assays. If that is the case, it is necessary to carefully check the reported units and the comparability of the assays. To do so, one needs a set of compounds that has been tested in all assays that are to be aggregated. If this is not possible, we suggest taking values from only one assay to avoid data inconsistencies. The next very important point is a careful analysis of the remaining data set for duplicates. Duplicates can give an impression of the experimental error within the data set, which in turn helps in understanding the expected performance of a model. It is also necessary to check how the data are distributed over the result range. Distribution peaks at multiples of three log units above or below the mean might indicate unit errors (e.g., nM vs. μM). Usually the data are not balanced (e.g., many more active than inactive compounds); it has been shown that it is advisable to obtain a balanced training set to improve the performance of an ADMET model [31]. Having carefully analyzed and cleaned the data, it is advisable to convert the experimental results to a useful unit, as has been shown elsewhere [32]. For a classification model the experimental data need to be binned. The binning strategy depends on the planned application of the final model; most popular are binary classifications for filtering and three- to five-class models for prioritization. In a last step the data set is split into a training, a test, and a validation set. At least 10 % of the data should be used for external validation. The remaining data should be used in at least a five-fold cross-validation protocol with 20 % of the data as validation within each fold. The whole process is summarized in Fig. 3. Having done all the necessary cleaning and analysis steps, it is important to keep in mind that many of the reported results are measured for so-called research compounds, whose purity varies somewhere between 85 % and 99 %. This can have a significant impact on the final performance of the model and is often not accounted for in the analysis of a data set and the obtained results. The impurities in these samples affect the assay concentration, might be active themselves, or may interact with the measurement method.
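The splitting strategy described above can be sketched as follows, assuming scikit-learn; the array names and sizes are placeholders for the real descriptor matrix and binned endpoint values.

```python
# Sketch of the split strategy described above (10 % external validation hold-out
# plus five-fold cross-validation on the remainder), assuming scikit-learn;
# X and y are placeholders for the descriptor matrix and the binned endpoint.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.RandomState(0)
X = rng.rand(500, 20)                                  # toy descriptor matrix
y = rng.randint(0, 2, 500)                             # toy binary class labels

# Hold out 10 % of the data strictly for external validation.
X_work, X_ext, y_work, y_ext = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0)

# Five-fold cross-validation on the remaining data; each fold keeps 20 % of it
# as the internal validation set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, valid_idx) in enumerate(cv.split(X_work, y_work)):
    print(f"fold {fold}: {len(train_idx)} training / {len(valid_idx)} validation samples")
```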

Available descriptors – descriptor/feature selection

Molecular descriptors

There are two main categories of molecular descriptors: experimental measurements, such as logP, the dipole moment, or the polarizability, and calculated descriptors, which are described in the following section. Todeschini and Consonni [33] introduced a popular definition of calculated molecular descriptors: "The molecular descriptor is the final result of a logic and mathematical procedure which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment." According to this definition, the information content of a molecular descriptor mainly depends on the chemical representation of a compound and on the algorithm used to calculate the descriptor. Currently several thousand different molecular descriptors are known; in the current version of the software Dragon [34], for example, more than 4800 different descriptors can be calculated. Molecular descriptors can be classified by the data type of the descriptor or by the kind of molecular representation of a compound. Table 1 shows the classification according to the resulting data type. More common is the categorization into 0D, 1D, 2D, 3D, and 4D descriptors; examples are shown in Table 2. More details can be found, for example, in the Handbook of molecular descriptors [33]. Empirically derived descriptors have a long tradition in the QSAR/QSPR field. The calculation of these descriptors is quite fast, and therefore they became very popular in the late 1970s and 1980s; examples are the Kier and Hall indices [40]. Many of the regularly used descriptors are still fragment-based or empirically derived. In the last two decades the use of quantum mechanical (QM) descriptors has increased along with the available computational power.



Fig. 3 Workflow for the preparation of structural and experimental data for model generation. Further details can also be found within reference [30]

In 1998 [41] the calculation of semiempirical descriptors for 54,000 compounds took around one day on 128 processors; this can now be done in much less than an hour on the same number of modern cores. QM-derived descriptors describe the electronic structure of molecules and therefore give a better description of inter- and intramolecular interactions. Examples of the use of QM descriptors can be found in references [11, 42–47]. In the past decades there have been many discussions about the dependence of 3D descriptors on the 3D geometry of the input structure used for the calculation, which is believed to be one main disadvantage of QM and other 3D descriptors. So-called 4D methods have been developed in response to this problem; within these methods an ensemble of structures represents the fourth dimension [36–39].

Table 1 Classification of molecular descriptors according to the resulting data type

Data type: Descriptor example
Integer: counts of atoms
Real: molecular weight
Vector: dipole moment, fingerprint
Tensor: polarizability
Boolean: compound with at least one aromatic ring
Scalar field: electrostatic potential
Vector field: gradient of the electrostatic field

Other approaches try to reduce the dependency on the 3D geometry by using standardized structural processing workflows.

Feature selection

Often, but not always, the final step before model generation is the so-called feature selection or descriptor selection. The aim is to reduce the number of input variables and keep only the relevant ones. There are two general strategies for feature selection: filter methods and wrapper methods. Filter techniques work directly on the properties of a given descriptor set; low-variance or correlated descriptors are removed, as in the sketch below. Wrapper approaches always consist of an objective function (a regression or classification model) and an optimization (selection) method, e.g., a genetic algorithm or simulated annealing. The performance of the objective function guides the selection of descriptors during the optimization. Figure 4 gives an overview of current descriptor selection methods; more details can be found in the recent review by Shahlaei [48] and the references therein.
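A minimal sketch of the filter strategy, assuming pandas and scikit-learn; the variance and correlation thresholds are arbitrary illustrative choices.

```python
# Minimal sketch of a filter-type descriptor selection (low variance and high
# pairwise correlation), assuming scikit-learn and pandas.
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def filter_descriptors(desc: pd.DataFrame, var_cut=1e-4, corr_cut=0.95) -> pd.DataFrame:
    # 1) drop near-constant descriptors
    keep = VarianceThreshold(threshold=var_cut).fit(desc).get_support()
    desc = desc.loc[:, keep]
    # 2) drop one descriptor of every highly correlated pair
    corr = desc.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [col for col in upper.columns if (upper[col] > corr_cut).any()]
    return desc.drop(columns=drop)

# toy example: three descriptors, two of which are perfectly correlated
df = pd.DataFrame({"mw": [100, 200, 300, 400],
                   "mw_copy": [100, 200, 300, 400],
                   "hbd": [1, 0, 2, 1]})
print(filter_descriptors(df).columns.tolist())   # -> ['mw', 'hbd']
```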

Methods for QSAR model generation



Table 2 Molecular descriptors classified by dimensionality (molecular representation, descriptor type, and examples)

0D (atom counts, bond counts, molecular weight, sum of atomic properties): molecular weight; number of atoms, heteroatoms, heavy atoms, specific atom types, or bonds; sum of the atomic van der Waals volumes

1D (fragment counts, properties): number of primary, secondary, or tertiary C atoms; primary, secondary, or tertiary amines; ring counts; H-donor and H-acceptor counts; unsaturation index; hydrophilic factor; molar refractivity; fragment-based polar surface area

2D (topological descriptors): Zagreb index, Wiener index, Balaban index, connectivity indices, BCUT descriptors, TPSA, 2D fingerprints

3D (geometrical descriptors): molecular surface, molecular volume, globularity, WHIM descriptors, GETAWAY descriptors, 3D MoRSE descriptors, 3D pharmacophoric fingerprints

3D surface properties: Politzer-Murray descriptors; local properties such as MEP, local polarizability, local ionization potential, and local electron affinity [35]

3D grid properties: comparative molecular field analysis (CoMFA) [9]

4D: 3D coordinates, conformation sampling [36–39]

The QSAR methodology relies on the inference of a relationship between the explanatory variables (the descriptors) and the responses (the explained variables). During the generation of QSAR models a variety of methods, mainly from the machine learning and artificial intelligence community, have been utilized. Initially, basic algorithms like PLS or MLR were used for QSAR model building; with the maturation of the field, more advanced methods have been introduced from other data mining areas. The best known and most often used methods today are random forests (RF) [49], support vector machines (SVM) [50], artificial neural networks (ANN) [51], and multiple linear regression (MLR) [52–54]. For most model building tasks it is possible to use one of these methods in a classification or a regression mode, as a multitude of research papers and reviews demonstrates [32, 55]. We will now describe a selection of methods which frequently appear in the QSAR literature and have proven useful for model building [55].

Fig. 4 Different strategies for descriptor selection. See also Shahlaei [48]

Classification

The classification problem is defined as the search for an optimal class-separating hyperplane in the usually high-dimensional feature space of the data set. The different methods and their approaches to error minimization are presented below (Fig. 5). The random forest method is a well-known and broadly applicable method for classification. It was introduced by Breiman in 2001 [49]. The method creates a forest of simple tree predictors. A tree predictor contains a collection of nodes, each of which implements a splitting rule. Each splitting rule partitions the data set within the feature space; to do so, a feature has to be selected and a rule has to be generated. The depth of the tree and the leaf size have to be defined as parameters of this method.


Fig. 5 Different classification methods. a: Tree classification separates the input space into corresponding subspaces (indicated by the shaded background) based on selected features. b: SVM classification. The separating hyperplane and the maximized margin are indicated; the blue samples correspond to the support vectors and the red sample is misclassified. c: Feed-forward neural network with an input layer (squares), one hidden layer (middle circles), and the corresponding output layer (right circle). The network weights are indicated as w and a

The number of features considered at each split also has to be determined: a random subset of the features is sampled and used for the split. As the method uses only a subset of the data for each tree and only a subset of the features at each node, it is also possible to generate a reliable estimate of the performance and generalization of the trained model using the out-of-bag error. The support vector machine (SVM) is another widely used method for classification. The SVM generates the optimal separating hyperplane by maximizing the margin between the data points of the classes. To reduce over-fitting, the method allows for noise within the data by introducing so-called slack variables. To separate the data points the SVM uses kernel functions. The kernel function projects the inputs from the input space into a high-dimensional feature space, in which a linear separation of the data is performed. One of the most common kernel functions is the radial basis function. Another kernel-based method is Gaussian process regression, which is described in more detail in reference [56]. A third frequently used approach is the artificial neural network, which consists of connected nodes in a multilayer structure. The nodes of a neural network are called neurons, which expresses their relation to biological neurons with an input and an output for signal transduction.


A neural network consists of the input layer neurons, where the parameters are fed in, and one or several hidden layers which optimize weight functions based on the inputs to generate an output that resembles the learning target. The method reduces to a simple linear regression if the activation functions are linear, but in most cases a sigmoidal activation function is used. As the method can consist of a large number of neurons and connection weights, the number of parameters can become very large; this is why most successfully used ANNs are regularized with respect to the number of weights or the size of the complete weight vector. All of the above-mentioned classification methods are so-called supervised methods. For supervised methods it is necessary to have a classified training set, which is used to determine the optimal parameters of the classification method. In contrast to supervised methods there are also unsupervised methods. Unsupervised learning tries to define clusters of similar data points without taking class assignments into account. Examples of unsupervised methods are k-means clustering and self-organizing maps (SOM) [57]. The SOM method is an unsupervised neural network method that projects the high-dimensional data onto a layer of neurons; as such it combines a clustering and a projection method [58].

Regression

For regression the strategy is to minimize the root mean square error (RMSE) between the training data and the prediction. In many cases it is better to maximize the rank correlation between the sorted data and the sorted predictions, as this is less prone to data fuzziness. The methods mentioned above (SVM, ANN, RF) have also been extended to regression model building. The best known method for regression models, in use for more than 30 years now, is multiple linear regression (MLR), which fits the parameters of a linear equation to a data set. A major benefit (and in some cases also a major drawback) is that MLR provides linear equations that are easy to understand and to interpret. Details can be found elsewhere [52–54].

Performance measurements and over-fitting

To reduce the over-fitting problem, most algorithms allow for a set of misclassified data points during the learning phase. To check for over-fitting it is advisable to use a leave-group-out cross-validation strategy, as it generates a mean error estimate with a standard deviation over all folds. One very early example of a better estimate of the prediction error was published in 2000: by using ten neural nets with different training/test set combinations, one obtains an average predicted value and a standard deviation of the prediction for a given compound, which eliminates predictions that are good or bad merely by chance because of a particular training/test set split [42]. We also want to mention that the number of parameters in a model has to be taken into account.


The number of features and parameters should be much smaller than the number of training samples to generate a useful model. Several strategies for the evaluation of model performance have been introduced [59]. In our opinion the most informative analysis for classification models is obtained by generating a confusion matrix, which simply presents the numbers of correctly and incorrectly classified samples in matrix format. This gives an overview not only of the accuracy but also of the misclassifications across multiple classes. In our experience, misclassifications that span several classes have to be weighted more strongly than misclassifications into a neighboring class, as they represent more severe errors.
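A small sketch of such a confusion-matrix based evaluation, assuming scikit-learn; the class labels and the weighting of multi-class errors are illustrative.

```python
# Sketch of a confusion-matrix based evaluation for a multi-class ADMET model,
# assuming scikit-learn; labels and the error weighting are toy examples.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

classes = ["low", "medium", "high"]
y_true = ["low", "low", "medium", "high", "high", "medium", "low", "high"]
y_pred = ["low", "medium", "medium", "high", "low", "medium", "low", "high"]

cm = confusion_matrix(y_true, y_pred, labels=classes)
print("confusion matrix (rows = true, columns = predicted):")
print(cm)
print("accuracy:", accuracy_score(y_true, y_pred))

# Misclassifications that jump more than one class (e.g., 'high' predicted as
# 'low') can be weighted more strongly than neighboring-class errors:
idx = np.arange(len(classes))
distance = np.abs(idx[:, None] - idx[None, :])        # 0 on the diagonal
severe_errors = cm[distance > 1].sum()
print("severe (more than one class off) errors:", severe_errors)
```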

Model evaluation over time

As the number of new data points constantly increases and the similarity of new data points to already known data points decreases, the applicability of a model will decrease over time. It is therefore important to update models on a regular basis. The criterion for updates is under scientific discussion. Two options can be followed: on the one hand, it can be tested whether the model performance on new data points decreases significantly in comparison to the performance during model building; on the other hand, it can be tested whether a newly built model significantly outperforms the old model, as sketched below. To compare two models the McNemar test [60] has proven to be very useful. To generate a data set for the comparison of classifiers over time, it is advisable to remove validation data at different time points from the training data. These data should never be used for model building but only for the comparison of old and new models.
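A minimal sketch of the McNemar comparison of an old and a new classifier on the same hold-out compounds, assuming NumPy/SciPy; the continuity-corrected chi-square form is used and the prediction vectors are toy data.

```python
# Sketch of a McNemar test for comparing an old and a new classifier on the
# same hold-out compounds (continuity-corrected chi-square form).
import numpy as np
from scipy.stats import chi2

def mcnemar_test(correct_old, correct_new):
    """correct_old / correct_new: boolean arrays, True where the model is right."""
    correct_old = np.asarray(correct_old, dtype=bool)
    correct_new = np.asarray(correct_new, dtype=bool)
    b = np.sum(correct_old & ~correct_new)    # only the old model is correct
    c = np.sum(~correct_old & correct_new)    # only the new model is correct
    if b + c == 0:                            # the two models never disagree
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)    # continuity-corrected statistic
    return stat, chi2.sf(stat, df=1)          # p-value from a chi2 with 1 dof

old = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0], dtype=bool)
new = np.array([1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1], dtype=bool)
stat, p = mcnemar_test(old, new)
print(f"McNemar statistic = {stat:.2f}, p = {p:.3f}")
```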

Current status

Having described the steps and tools which are currently understood to be the standard in the development of a predictive ADMET model, we now want to describe a few newer projects and developments within Boehringer Ingelheim. First, a summary of our automated model generation framework is given. In the second part we discuss the use of the so-called applicability domain versus the use of prediction probabilities.


Case study: automated model building

The field of QSAR has come to maturity within the last decades, as we have outlined above, and several recent reviews describing good practices for QSAR modeling have been published. This also raised the question of whether the manual, labor-intensive process of QSAR model building can be streamlined into an automated model building framework. Results from colleagues at AstraZeneca [61] show that automated model building is in general achievable, and most companies are now shifting away from manual model building toward a more automated approach. Here we describe the technical details and features of an automated model building framework we created in-house. The aim of this study was to create a QSAR model building toolbox which can be used in daily project work with a minimum of manual intervention. From the company perspective, time constraints for model building and the quality of the model have to be taken into account as competing factors. We believe an automated toolbox can optimize both criteria: on the one hand, the time needed to create a model can be reduced because labor-intensive, repetitive steps are automated; on the other hand, best-practice guidelines as well as state-of-the-art methods can be incorporated into the model building process. It also ensures that the whole process can be updated centrally as novel scientific results regarding best practices for model building become available. This guidance and the minimization of time and effort lower the entry barrier to QSAR/QSPR, which results in a broader application of local ADMET models, e.g., already in the synthesis idea generation step of a research project. These local models optimally supplement global models that are not project specific, for example global ADMET-related models. It is worth mentioning that the automated process also offers opportunities for the generation of global models, as the iterative model building process, which consists of regular updates, can be automated. As we have outlined above, there are four main steps in QSAR/ADMET model building: the generation of a consistent, high-quality data set; the calculation and selection of relevant descriptors; the application of statistical methods for classification or regression; and the model evaluation. In addition there are possible update scenarios. We now want to highlight the opportunities for automating each of these steps. First of all we have to raise the point that there is an upfront investment needed to generate an automated framework; in our opinion this is compensated very quickly, as it results in a more efficient process. In the following paragraphs we name a variety of software packages that can be used for automated model building. We want to note that we are not affiliated with any of these companies and that we do not provide a complete list of tools, but only an incomplete selection, which is not meant as a recommendation but merely as examples that could be replaced by other software of the same kind.

Data set preparation

The first step of model building, and also one of the most critical, is the data set creation. We described above the various pitfalls that can occur during data set preparation. If these pitfalls are adequately addressed, the whole process of data set generation poses an optimal scenario for automation.


Common structural standardization tasks can be handled by commercial toolkits, for example the ChemAxon Standardizer [62]. During automated pooling of data sets it is important to check the consistency of multiple measurements; large standard deviations, for example, can point to problematic substances for which the responsible scientist should incorporate additional knowledge or exclude the data where appropriate. If the data are to be used for a class-based model, they can be classified accordingly and tested for consistent class membership and operator annotations; operator values (e.g., "<" or ">") should only occur in the lowest and highest classes. As the quality of the final model has to be evaluated, it is necessary to create a test strategy, which needs some additional preparation steps. First a random fraction of 20 % is split from the whole data set and held out during the complete model building process; this is our final validation data set. The remaining data are split into five folds, and the whole model building process is run five times in parallel, each run using four folds as training set and one fold as validation set. Based on this strategy we can measure the prediction performance for each fold. In-house data sets allow additional strategies for creating external validation data. As each measurement carries a timestamp, it is possible to create training data sets consisting only of compounds measured up to a certain time point, with all newer data points used as an external evaluation set. We think that this time-split based evaluation represents the prediction problem better, as it simulates the drug discovery process of the company over time: novel compounds will be created in unexplored regions, often based on known chemistry and chemical substructures, e.g., the activity anchors. In addition, the time-split evaluation can give an estimate of how fast the model will degrade over time.

Descriptor generation

The descriptor generation follows the data aggregation and annotation. As described above, a variety of descriptors can be calculated for the standardized molecules. A multitude of software is available to do so; common examples are the Talete Dragon [34] or MOE [63] software packages. For 3D descriptors it is necessary to generate a useful 3D structure from the usually 2D input molecules; we use the Corina software package [64]. The recurring steps of descriptor calculation can easily be abstracted into an automated workflow and also handled in a parallel fashion to speed up the modeling process.
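As an illustration of the descriptor calculation step, the following sketch uses the open-source RDKit toolkit; the descriptor choice is arbitrary, and the commercial packages named above (Dragon, MOE, Corina) are what is actually used in the framework.

```python
# Illustrative sketch of an automated descriptor calculation step with the
# open-source RDKit toolkit; not the commercial tools used in the framework.
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

def calculate_descriptors(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    desc = {
        "MolWt": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),
        "TPSA": Descriptors.TPSA(mol),
        "NumHDonors": Descriptors.NumHDonors(mol),
        "NumHAcceptors": Descriptors.NumHAcceptors(mol),
    }
    # For 3D descriptors a 3D structure has to be generated from the 2D input,
    # e.g., by distance-geometry embedding followed by a force-field cleanup.
    mol3d = Chem.AddHs(mol)
    if AllChem.EmbedMolecule(mol3d, randomSeed=42) == 0:
        AllChem.MMFFOptimizeMolecule(mol3d)
    return desc

print(calculate_descriptors("CCOC(=O)c1ccccc1"))   # ethyl benzoate as a toy example
```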


Feature selection

As mentioned earlier, feature selection is important to determine the most relevant descriptors for the model building process. Within the automated model building we first remove descriptors with low variance. We also rescale the descriptors to remove possible influences of the descriptor scale. Missing values within the descriptor set should be handled with care; they can result from non-supported atom types or other calculation failures. If some descriptors are not reproducible, it is advisable to remove them from the analysis. The final descriptor set will still consist of a multitude of descriptors, which can be highly correlated. Therefore the number of descriptors should be reduced to an optimal subset, as this has been shown to increase predictive performance. For an automated model building process it is important to do the selection with a method that does not require extensive parameter optimization. For classification we use the guided regularized random forest (GRRF) method to determine feature importance [64]. The method performs reasonably well in our hands and does not need extensive parameter selection, as shown by Deng and Runger in 2013 [65]. In brief, the method consists of two phases. First a random forest model is trained within the full descriptor space; based on this model the random forest based descriptor importance can be determined. Subsequently a second, regularized random forest is trained on the whole descriptor set, using the descriptor importance from the first round of training for the regularization. This descriptor selection is done for each of the five splits, applying internal cross-validation as described in the previous paragraph. All descriptors that are selected more than once during the internal cross-validation are used for a final GRRF based on the complete fold; the resulting important descriptors of this model are used during the model building process. For descriptor selection for a regression model we apply the LASSO method, using the same internal cross-validation approach as described for classification; details about the LASSO can be found in [66]. It is important to mention here that if one is interested in interpreting the most important features of the input molecules, one should select descriptors that can be mapped to individual substructure features, for example substructure fingerprints or other atom-based descriptors. The CATS descriptor and other substructure descriptors have been used for the determination of important substructure features and for training interpretable machine learning models [67].

Machine learning (statistical methods)

In this step the model training is done. A variety of machine learning techniques have been used in the past to generate useful QSAR models, as described above. In most settings the use of meta-predictors has been shown to be useful and to perform better than single predictors. Therefore we focus on a methodology that combines different machine learning methods into a meta-predictor. A good performance can be expected if the meta-predictor combines orthogonal single learners; a minimal sketch of such a combination is shown below.
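The following sketch illustrates the combination of orthogonal single learners into a voting meta-classifier with internal grid-search parameter optimization, assuming scikit-learn; a multilayer perceptron stands in for the Bayesian regularized neural network, and the parameter grids are illustrative only, so this is not the in-house KNIME/R implementation.

```python
# Sketch of a voting meta-classifier over orthogonal base learners (SVM, random
# forest, neural network), each tuned by an internal five-fold grid search.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=30, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

# Internal five-fold grid search for each single learner (brute-force style).
svm = GridSearchCV(make_pipeline(StandardScaler(), SVC(kernel="rbf")),
                   {"svc__C": [1, 10], "svc__gamma": ["scale", 0.01]}, cv=5)
rf = GridSearchCV(RandomForestClassifier(random_state=0),
                  {"n_estimators": [200, 500]}, cv=5)
nn = GridSearchCV(make_pipeline(StandardScaler(),
                                MLPClassifier(max_iter=2000, random_state=0)),
                  {"mlpclassifier__alpha": [1e-3, 1e-1]}, cv=5)

# Majority-vote meta-classifier built from the three tuned single learners.
meta = VotingClassifier([("svm", svm), ("rf", rf), ("nn", nn)], voting="hard")
meta.fit(X_train, y_train)
print("hold-out accuracy:", meta.score(X_valid, y_valid))
```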


To generate a flexible and generally applicable learner we incorporate three different, widely known machine learning methods: SVMs, random forests, and Bayesian regularized neural networks. In our hands this combination results in predictive meta-classifiers within several different modeling domains. For each individual method we follow a brute-force, grid-based parameter optimization using internal cross-validation. Each training fold is the basis for an internal five-fold cross-validation, from which the optimal parameters are determined and used to train a final model for each complete fold. After each single classifier has been trained and evaluated within the different folds, the voting meta-classifier is calculated from the results of the single classifiers and evaluated on the validation set, which consists of the 20 % hold-out data. Based on this external validation set the performance of the obtained model is assessed, and a retraining using all data can be triggered.

Workflow creation

In our opinion the natural choice for bundling all of the above-mentioned processing steps is a pipelining tool like KNIME [68] or Pipeline Pilot [69]. A pipelining tool provides the flexibility to incorporate different tools and to create specialized workflows that are easy to use for automation. In this example we used the KNIME environment, with the R integration into KNIME for the learning and statistical model evaluation [68, 70]. A schematic representation of the resulting workflow is given in Fig. 6.

Fig. 6 Schematic model generation process

Performance assessment

The above framework can be applied to a variety of possible scenarios. As an illustrative example we now outline the performance based on public measurements of cytochrome P450 isoform activity changes. The cytochrome P450 isoforms are important antitargets which are regularly monitored during drug discovery programs and as such are an important endpoint for QSAR modeling [71]. The data set of Veith used in this example is classified into two classes, active and inactive against CYP [72]. We trained four models with data for CYP2C19, CYP2C9, CYP2D6, and CYP3A4 using the described automated process.


Based on the input data we selected equal numbers of positive and negative samples, following an under-sampling strategy; the expected random accuracy is therefore 0.5. Table 3 summarizes the statistics for the different models. To evaluate the model obtained from the automated model building framework, we compare its performance measures with two other published results that used the same data set and a manual model building procedure; these models were optimized for this specific target set. Cheng et al. [73] published a meta-classifier for the classification of the inhibitor and non-inhibitor classes of cytochrome P450. They built a set of different meta-classifiers based on C4.5 decision trees, k-nearest neighbors, and naïve Bayes, and fused them using different fusion rules [73]. The accuracy on the different data sets is given in Table 3. It can be seen that the accuracy is comparable with what can be achieved using automated model building; only the results for CYP2D6 seem to differ substantially. One reason might be that this data set is highly imbalanced, with 26 % active and 74 % inactive molecules. In our study we used an equal number of active and inactive data points, which sets the background probability of each class to 50 %, while for the results of Cheng the background probability of inactive molecules is 74 %, which in this case seems to affect the calculation of the overall accuracy. As Cheng also published the Matthews correlation coefficient, which is better suited for imbalanced data sets, we compared the Matthews correlation coefficient as well: for the Cheng method it is between 0.408 and 0.461, depending on the method for fusing the single learners, and it is 0.49 for our model. Based on these observations we conclude that our automated model building framework results in models that exhibit at least the same performance as manually generated models. In a second publication, by Sun, the ROC AUC values based on different descriptor sets were presented for all four endpoints; the comparison to this result is also given in Table 3. It can be seen that the ROC AUC values are in the same range, although the Sun data seem to be slightly better than the performance achieved by the model building framework. The descriptor set that performed best consists of 264 descriptors, of which 12 were added specifically for this task. In comparison, the number of descriptors used during automated model building is not higher than 35 for any of the models.



Table 3 Training time (Time), number of descriptors used (#Descriptors), accuracy, Matthews correlation (Matthews), and ROC AUC for four different endpoints using the automated model building framework (AutoMeta). The results from the publications by Cheng and Sun are shown for comparison [73, 74]

CYP    Time [h]   #Descriptors   AutoMeta Accuracy   Cheng Accuracy   AutoMeta Matthews   Cheng Matthews   AutoMeta ROC   Sun ROC
2C9    18         31             0.79                74.2-77.3        0.59                0.438-0.498      0.86           0.62-0.89
2C19   29         34             0.82                71.1-78          0.64                0.466-0.554      0.87           0.61-0.89
2D6    16         21             0.75                81.7-83.7        0.49                0.408-0.461      0.80           0.59-0.85
3A4    17         35             0.77                73.2-76.7        0.54                0.422-0.509      0.83           0.68-0.87

Taking into account that the model building framework uses less than 15 % of the number of descriptors used by Sun, we are confident that the performance differences are not significant. As mentioned above, it is also of interest which descriptors have been selected for the models, as this gives insight into the importance of molecular features for the endpoint. The work by Sun also ranked the descriptors by relevance, and we want to highlight the similarities between both methods. The automated model picked as important features the aqueous solubility, the octanol-water partition coefficient, and the net charge of the molecules; all three descriptors are related to solubility. The work of Sun highlighted the importance of the charge of the molecules as well as the number of aromatic rings and the molecular weight, which also relate to solubility [74]. This observation underlines that an automated framework is able to compete with a manual approach in the feature selection step.

Applicability domain versus prediction probability

The Organisation for Economic Co-operation and Development (OECD) defined five principles for QSAR models proposed for regulatory use in 2004 [75]. The models should have a defined endpoint, an unambiguous algorithm, a defined domain of applicability, appropriate measures of goodness-of-fit, robustness, and predictivity, and, if possible, a mechanistic interpretation. The term applicability domain appeared around 2000 and was described by the OECD as follows: "The applicability domain of a (Q)SAR is the physicochemical, structural, or biological space, knowledge or information on which the training set of the model has been developed, and for which it is applicable to make predictions for new compounds. The applicability domain of a (Q)SAR should be described in terms of the most relevant parameters, i.e., usually those that are descriptors of the model. Ideally, the (Q)SAR should only be used to make predictions within that domain by interpolation not extrapolation." In other words, the aim is to find an approach that is able to distinguish between reliable and unreliable predictions.

To our knowledge, one of the first published approaches was the so-called OPS (optimum prediction space) method, published in 2000 [76]. The applicability domain of a model is always defined by the training data of the model. There are descriptor-based and structure-based (structural fingerprint) techniques to calculate the applicability domain. In the structure-based approach, the similarity between the input structure and its nearest neighbors in the training set is assessed using a fingerprint-based method; if the similarity is above a user-defined threshold, or a given compound has a certain number of nearest neighbors in the training set, the prediction is defined as reliable. This method depends very much on the definition of the chemical similarity and on the fingerprint used [77, 78]. There are several descriptor-based approaches used to define the applicability domain. In general the aim is to define a distance within a given descriptor space within which the prediction for a new compound should be reliable. Bounding-box approaches are based on the minimum and maximum of each descriptor [79, 80] or on the principal components of a PCA [81]. Distance-based models rely in most cases on the Mahalanobis, Euclidean, or city-block distance [80, 82]. Other approaches are based on the estimated probability density of a descriptor set, identifying highest-density regions that contain a known fraction of the whole probability mass [76]. Details of these approaches can be found in [83] and the references therein. With several years of experience in QSAR model building, we have found that the applicability domain methods we have tested so far are not as reliable as expected. Thus we use another approach to determine the reliability of ADMET model predictions. As we often use meta-classifiers, we use the so-called "prediction probability" to flag the reliability of a prediction. The prediction probability is determined by counting the number of models within the meta-classifier which predict the majority class and dividing this number by the number of all models. This very simple measure turns out to be a very solid indicator of the reliability of the prediction for all models we have investigated so far. The accuracy for various prediction endpoints increases from around 0.7 up to 0.9 with increasing prediction probability cut-off. This is also true for cases where the compounds are outside the applicability domain based on structure-based similarity.



On the other hand, increasing the prediction probability threshold can reduce the number of compounds that receive a prediction quite significantly; this is shown in the example in Fig. 7, where a prediction probability threshold of 0.8 reduces the number of compounds with predicted results from 477 down to only 129. The performance of this reliability measure is also underlined by the fact that we have remeasured compounds that were misclassified even though they had a high prediction probability value: around 60 % of these compounds confirm the prediction after being remeasured, or are closer to the predicted class than before. Of the remaining 40 %, some compounds turn out to be impure or already insoluble in the DMSO stock solution used as the starting point for the measurement. In the case of regression models, one can use, for example, the standard deviation of the predictions from different models to estimate the uncertainty of the prediction for a compound; as mentioned earlier, a first approach was published by B. Beck and T. Clark in 2000 [42], and several other approaches exist, one of the most recent presented by R. Clark [84]. Currently we recommend, wherever possible, using the described prediction probability in combination with an applicability domain method in order to have enough additional information to judge the reliability of a prediction. One of our ongoing projects is to use the described prediction probability to rank the predictions over all classes; results of this strategy will be published soon.
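A minimal sketch of the prediction probability as described above (the fraction of single models in the meta-classifier that vote for the majority class); the function name, the toy votes, and the threshold are illustrative.

```python
# Minimal sketch of the prediction probability: the fraction of single models
# within a meta-classifier that predict the majority class.
from collections import Counter

def prediction_probability(single_model_predictions):
    """single_model_predictions: list of class labels, one per model in the ensemble."""
    counts = Counter(single_model_predictions)
    majority_class, votes = counts.most_common(1)[0]
    return majority_class, votes / len(single_model_predictions)

# Ten single classifiers voting on one compound: 8 of 10 predict "active".
votes = ["active"] * 8 + ["inactive"] * 2
cls, prob = prediction_probability(votes)
print(cls, prob)                                   # -> active 0.8

# A threshold (e.g., 0.8) can then be used to flag predictions as reliable.
print("reliable" if prob >= 0.8 else "unreliable")
```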

Future directions

Here we describe two topics which are already of importance, or will become more important, within our organization and which will, in our opinion, also become very important topics in the in silico ADMET field.

Big data

As the number of samples is constantly growing, and as the number of possible descriptors can be rather large in the area of QSAR, the need for large-scale predictive methods is growing. As a consequence the term "big data" has also been introduced into this community. Several approaches that target predictive modeling using big data are under development within the machine learning community. The major challenges in this area of research are the adaptation of available algorithms to scale with large data sets and the introduction of novel methods. Possible options so far are parallelization and the introduction of online and batch learning. In an online learning strategy the training samples are processed consecutively, and thus the amount of memory needed for data storage can be reduced; the batch learning methodology extends the online learning strategy by reducing the number of updates to a specified batch size, in comparison to an update for each new sample, which reduces the learning and processing time. A small sketch of this idea is given below. Methods like random forests can, for example, be parallelized by training each tree of the forest on a single compute node. These methods will become more and more important for generating predictive QSAR and ADMET models that cover the need to characterize and filter large public domain data sets like PubChem [23] and ChemSpider [85], with over 30 million compounds, and/or virtual libraries based on in-house chemistry, like BICLAIM [15] or LiRCS [16], for example during a VS campaign. Fast models can then also be used during library design; they open up opportunities for real-time prediction during the design phase. Using predicted activities or ADMET properties as early as possible will help to keep more of the molecules relevant to a given problem, in comparison to using only empirical rules, e.g., the Lipinski rule [86], or chemical similarity filters.
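A small sketch of the online/mini-batch idea, assuming scikit-learn's partial_fit interface; the linear SGD classifier, the simulated data stream, and the batch size are illustrative assumptions, not the methods discussed above.

```python
# Sketch of an online / mini-batch learning strategy for large data sets using
# scikit-learn's partial_fit interface; the data stream is simulated.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
classes = np.array([0, 1])
model = SGDClassifier(random_state=0)

def descriptor_batches(n_batches=100, batch_size=1000, n_features=50):
    """Simulate streaming descriptor blocks that never reside in memory at once."""
    for _ in range(n_batches):
        X = rng.rand(batch_size, n_features)
        y = (X[:, 0] + 0.1 * rng.randn(batch_size) > 0.5).astype(int)
        yield X, y

# The model is updated batch by batch instead of loading all samples at once.
for X_batch, y_batch in descriptor_batches():
    model.partial_fit(X_batch, y_batch, classes=classes)

X_test = rng.rand(5, 50)
print(model.predict(X_test))
```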


Interpretable models


Interpretability here refers to the possibility for a user to understand and rationalize the relationship between the underlying descriptors of an ADMET model and the predicted endpoint. Over the past few decades the main focus of ADMET models has shifted away from simple linear models toward more complex nonlinear and multi-parameter approaches. This has opened up a gap between the predictive power and the interpretability of models.


Fig. 7 Using the prediction probability to increase the accuracy of the predictions, shown for the prediction of nephelometric solubility. Only results from the validation set are shown. Rows contain the predicted classes, columns the experimentally measured ones


As a consequence, ADMET models have become "black box" approaches to a certain extent. Several factors influence interpretability. First of all, the molecular descriptors used should enable interpretation: fragment descriptors, for example, have a direct connection to substructures, and physicochemical characteristics (H-acceptors, H-donors, or the surface) as well as descriptors describing the electronic configuration (dipole moment, local properties) are also suitable for interpretation. In contrast, descriptors like topological indices or autocorrelation descriptors are not that easy to rationalize. The second very important point is the modeling technique used: linear regression or decision tree approaches are much easier to interpret than, for example, neural networks or support vector machines. For the analysis of the importance of fragments it is necessary to keep in mind that fragment combinations and the fragment environment might also have a significant influence [87]. A careful selection of descriptors and modeling techniques is therefore necessary to obtain interpretable models. In addition, a powerful visualization of the predicted results and of the factors influencing the prediction (in a positive or negative way) is essential. It would be best if the influence could be mapped onto a 2D or perhaps a 3D structure of the compound. One example of such an approach, called "glowing molecules", is used within StarDrop from Optibrium [88] to visualize which parts of a molecule influence the predicted endpoint in a positive or negative way. This enables the scientist to easily identify the areas of a given compound that need to be modified to change the predicted property in the desired direction. In an optimal scenario there is also the possibility of getting a list of possible modifications (substituents) to shift the endpoint in the wanted direction; Fig. 8 illustrates how such a tool could look. We think that we have reached a stage of in silico ADMET modeling which allows us to invest more in the interpretability of models, to increase the usability and impact of models on the design strategy. A toy sketch of how fragment-based descriptors can be mapped back to substructures is given below.

Fig. 8 Substructure influence on compound property prediction
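As a toy illustration of mapping model output back onto substructures (not StarDrop's "glowing molecules" method, nor our in-house tooling), the following sketch trains a random forest on RDKit Morgan fingerprint bits and maps the most important bits that are set in a query molecule back to fragment SMILES.

```python
# Toy sketch of mapping model importance onto substructures via fragment-based
# fingerprint bits (RDKit Morgan bits + a random forest); illustrative only.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

N_BITS = 1024

def morgan_bits(mol, info=None):
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=N_BITS, bitInfo=info)
    arr = np.zeros((N_BITS,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# Tiny toy data set: label 1 if the molecule contains a carboxylic acid group.
smiles = ["CCO", "CCC(=O)O", "c1ccccc1", "CC(=O)O", "CCN", "OC(=O)c1ccccc1", "CCCC", "CCCC(=O)O"]
labels = [0, 1, 0, 1, 0, 1, 0, 1]
X = np.array([morgan_bits(Chem.MolFromSmiles(s)) for s in smiles])
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)

# Map the most important bits that are set in a query compound back to fragments.
query = Chem.MolFromSmiles("CC(=O)O")
info = {}
morgan_bits(query, info=info)
ranked = sorted(info, key=lambda b: model.feature_importances_[b], reverse=True)
for bit in ranked[:3]:
    atom_idx, rad = info[bit][0]                   # first occurrence of this bit
    env = Chem.FindAtomEnvironmentOfRadiusN(query, rad, atom_idx)
    atoms = {atom_idx}
    for bond_idx in env:
        bond = query.GetBondWithIdx(bond_idx)
        atoms.update([bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()])
    frag = Chem.MolFragmentToSmiles(query, atomsToUse=sorted(atoms), bondsToUse=list(env))
    print(f"bit {bit}: importance {model.feature_importances_[bit]:.3f} -> {frag}")
```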


Conclusions

After several decades of development and basic research in the QSAR/QSPR field, the basis for modeling an endpoint has changed quite dramatically. Several thousand different 0D to 4D descriptors are available, and the increased computer power and parallelization methods allow the processing of thousands of molecules in minutes or even seconds. A huge number of methods for feature selection and model building are currently available. Most importantly, the size and the quality of the necessary data sets have increased quite significantly. The use of modern workflow systems enables the implementation of powerful automated model building tools for the generation of high-quality local and global prediction models. This enables users to easily develop or update models for use in actual projects. QSAR/QSPR, and especially in silico ADMET, has become an integral part of academic and industrial NCE research.

Quo vadis in silico ADMET

There are several important points that we want to mention. One of the most important points for us is the need to promote best practice for ADMET model generation, model performance validation, and the application of models. Best practices include data set curation, descriptor calculation, feature selection, as well as sound model validation. During the application it is always important to have the possibility to estimate the reliability of a prediction; information about the reliability should always be an integral part of the model output. This can be achieved by distributing workflows within the organization or by making them available through easily usable web interfaces. Another very important direction is the so-called "big data" field. ADMET models will become an integral early filter step, as the screening and filtering of huge virtual libraries for various endpoints will become daily business in the pharmaceutical industry.


The computational power and the storage capabilities are already available. Interpretable models will also become important in the near future. New technologies for visualization, new algorithms, and the use of workflow tools enable us to generate a toolbox with many more capabilities than was possible 20 years ago. Not only the visualization of critical areas within a molecule but also the suggestion of alternatives for modifying those areas will increase the use and the impact of ADMET models in the future; especially in early lead optimization phases they can have a major impact. Another field where in silico ADMET and QSAR have not yet reached their full potential is the area of peptides and, more generally, new biological entities. There should be enough data available today to generate and apply in silico models in this area of research as well.

Acknowledgments We want to thank our colleagues from Computational Chemistry, Drug Discovery Support, and Medicinal Chemistry. Several of the mentioned approaches have been established together with our colleagues from these departments; without their continuous commitment, assistance, and support most of the described research could not have been done. B. Beck wants to thank Prof. T. Clark for the very good and productive time, starting with a Diploma thesis, followed by a PhD thesis and, after a short interruption, an 18-month postdoc period. T. Geppert wants to thank Prof. T. Clark for the productive collaborations since his time as a PhD candidate in Prof. G. Schneider's lab at ETH Zurich.

