Mol Divers DOI 10.1007/s11030-015-9592-4

FULL-LENGTH PAPER

Fragment virtual screening based on Bayesian categorization for discovering novel VEGFR-2 scaffolds Yanmin Zhang1 · Yu Jiao2 · Xiao Xiong1 · Haichun Liu1 · Ting Ran1 · Jinxing Xu1 · Shuai Lu1 · Anyang Xu1 · Jing Pan1 · Xin Qiao1 · Zhihao Shi2 · Tao Lu1,2 · Yadong Chen1

Received: 24 November 2014 / Accepted: 25 March 2015 © Springer International Publishing Switzerland 2015

Abstract The discovery of novel scaffolds against a specific target has long been one of the most significant but challengeable goals in discovering lead compounds. A scaffold that binds in important regions of the active pocket is more favorable as a starting point because scaffolds generally possess greater optimization possibilities. However, due to the lack of sufficient chemical space diversity of the databases and the ineffectiveness of the screening methods, it still remains a great challenge to discover novel active scaffolds. Since the strengths and weaknesses of both fragment-based drug design and traditional virtual screening (VS), we proposed a fragment VS concept based on Bayesian categorization for the discovery of novel scaffolds. This work investigated the proposal through an application on VEGFR-2 target. Firstly, scaffold and structural diversity of chemical space for 10 compound databases were explicitly evaluated. Simultaneously, a robust Bayesian classification model was constructed for screening not only compound databases but also their corresponding fragment databases. Although analysis of the scaffold diversity demonstrated a very unevenly distribution of scaffolds over molecules, results showed that our Bayesian model behaved better Electronic supplementary material The online version of this article (doi:10.1007/s11030-015-9592-4) contains supplementary material, which is available to authorized users.

B B

Tao Lu [email protected] Yadong Chen [email protected]

1

Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, China

2

School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, China

in screening fragments than molecules. Through a literature retrospective research, several generated fragments with relatively high Bayesian scores indeed exhibit VEGFR-2 biological activity, which strongly proved the effectiveness of fragment VS based on Bayesian categorization models. This investigation of Bayesian-based fragment VS can further emphasize the necessity for enrichment of compound databases employed in lead discovery by amplifying the diversity of databases with novel structures. Keywords Fragment · Virtual screening · Bayesian categorization · Scaffold diversity · VEGFR-2

Introduction To efficiently develop a small molecule that can moderately regulate physiological state has long been one of the significant but challengeable goals in drug discovery [1]. Fragment-based drug design (FBDD) and virtual screening (VS) have emerged as two important alternative methodologies to experimental high-throughput screening (HTS) for the discovery of potential leads. Both computational and experimental technologies share the same target to enrichment of relevant chemical space for hits [1–4]. FBDD is a promising methodology for lead discovery especially in finding novel scaffolds and structural optimization. It has also attracted many spotlights due to its effectiveness and efficiency along with the development of experimental FBDD [5]. Many computational FBDD strategies, such as BREED [6], Murcko frameworks [7], retrosynthetic combinatorial analysis procedure (Recap) analysis [8] and scaffold tree [9] have been developed by targeting the main scope of application for FBDD, including library design, fragment hit identification and subsequent druglike properties opti-

123

Mol Divers

mization [10–12]. However, several defects of FBDD have reduced its application in drug development. Firstly, cocrystal structure is necessary to determine the binding site and its interaction for the fragment optimization. Secondly, evaluating a fragment into a leadlike or a druglike compound is rather time-consuming and resource-intensive in practice. Thirdly, it is difficult to imagine the chemotype of a compound obtained by FBDD, such as a natural product with multiple rings. All these suggest that FBDD largely depends on the chemical space it covers [1]. Traditional VS methods like molecular docking, pharmacophore screening and similarity search, have been proved to be reliable, inexpensive and time-saving methods that widely applied in the discovery of small-molecule inhibitors in recent years [1,4,13,14]. Nevertheless, hits found in these ways are always hindered by the difficulty of improving hit rate and enrichment factor and lowering the false positive rate, and it is very difficult to find active hits with novel scaffolds [15]. However, Bayesian approach, on account of the frequency of various descriptors occurrence, is a robust classification methodology which can differentiate actives from inactives [14,16]. Many examples have demonstrated its application in structure-activity analysis and drug discovery [14,17–21]. Besides, several studies have shown that Bayesian categorization demonstrates higher EF and accuracy than molecular docking in identification of actives from decoys [22]. Due to the lack of sufficient chemical space diversity of the screened databases and the ineffectiveness of the screening methods, the discovery of novel scaffolds against a certain target has long been one of the biggest difficulties in finding lead compounds. For example, it is generally very difficult to capture a novel active scaffold through traditional VS methods (i.e., molecular docking) because of its limited binding energy. Thus, it still remains a great challenge to discover active novel scaffolds and construct a robust method to screen favorable fragments. Moreover, a novel scaffold or fragment which binds in important binding regions of the active pocket as a starting point is more favorable than a larger molecule because fragments generally possess more optimization possibilities. For example, to discover scaffolds that can bind in the hinge region or other important regions (e.g., DFG region) is one of the most frequently adopted methodologies in searching novel kinase inhibitors. Although our group has formerly constructed a novel strategy using a user-defined fragmentation framework for three-dimensional (3D) fragment-based lead discovery [23], only the ligands extracted from the Protein Data Bank (PDB) were employed to generate fragments and construct novel hits. Therefore, to expand the scaffold chemical space, more molecules are required to sample fragments. As no one can jump to the conclusions of the chemical space diversity for any compound database, a detailed analysis of the chemical space

123

diversity by various qualitative and quantitative evaluation criteria should be taken into consideration. Since the advantages and disadvantages of both FBDD and traditional VS, we proposed a fragment VS concept based on Bayesian categorization for the discovery of novel scaffolds. In this study, vascular endothelial growth factor receptor-2 (VEGFR-2, or kinase insert domain receptor [KDR]) kinase was selected for implementing the above proposal. VEGFR2 kinase has been demonstrated to be a key target for the discovery of small-molecule inhibitors against tumorassociated angiogenesis [24,25]. Most VEGFR-2 smallmolecule inhibitors exert biological effect by competitively binding to the highly conserved ATP pocket [26,27]. Many types of VEGFR-2 kinase inhibitors have been already approved by the Food and Drug Administration (FDA) for clinical use and they consist of diaryl ureas (e.g., sorafenib, regorafenib), pyrimidines (e.g., pazopanib), quinazolines (e.g., vatalanib), indolinones (e.g., sunitinib) and some other types (e.g., axitinib) [28,29]. Despite the huge advances in developing VEGFR-2 inhibitors, there are undoubtedly many issues that are not yet fully explored, from both the clinical and biochemical perspectives. Since the drug resistance, selectivity and toxicity remain important concerns, discovering novel inhibitors with satisfactory outcomes is still in great demand [30]. The overall workflow of this work is shown in Fig. 1. Firstly, 10 compound databases were fragmented by Murcko, Recap and user-defined (UserDefined) frameworks. The chemical space diversity of all the databases was evaluated to ensure an adequate novelty not only from the 2D physiochemical properties aspect but also from the scaffold and structural diversity perspectives. The latter were explicated through the count of scaffolds, cumulative scaffold frequency plot (CSFP), Pn values as well as tree maps visualization [31,32]. Simultaneously, Bayesian models were constructed using VEGFR-2 actives and decoys from the directory of useful decoys enhanced database (DUD-E) [33] database and their reliability was validated using internal

Fig. 1 Workflow of combination of FBDD and VS

Mol Divers Table 1 Compound databases used in the scaffold diversity analysis

Database

No. of Cpds

Description

Source website

ChemBridge

100,032

Compounds from ChemBridge database

http://chembridge.com

ChemDiv

700,000

Compounds extracted from ChemDiv database

http://www.chemdiv.com/

DrugBank

6324

Compounds taken from the latest version of DrugBank (release 4.0)

http://www.drugbank.ca

HitFinder

14,400

Compounds from the Maybridge screening collection

http://www.maybridge.com/

Compounds from National Cancer Institute (NCI) Open database

http://cactus.nci.nih.gov/

NCI

252,684

ZINC

1190,088

Compound freely taken from ZINC database

www.zinc.docking.org

ChEMBL

1286,475

Compounds from the EBI’s ChEMBL database (version 10)

https://www.ebi.ac.uk/chembldb

Compounds that meet the Specs selection criteria

http://www.specs.net/

Specs

219,059

KLIFS

1100

Compounds from kinase-ligand interaction fingerprints and structure (KLIFS) database, derived from PDB database

http://www.vu-compmedchem.nl

KDRActive

2542

Compounds from DUD-E database and a recent review

http://dude.docking.org/

No. of Cpds refers to the total number of compounds in each database

and external test sets and comparison with other’s work [16,19,22]. Then the compound databases and their corresponding fragments databases were screened by the best VEGFR-2 Bayesian model. Furthermore, to validate the effectiveness of this hybrid protocol, a retrospective literature research was conducted to ensure that some of the fragments screened by the best Bayesian model indeed exhibit biological activity toward VEGFR-2. Finally, 20 hit scaffolds were selected on the basis of Bayesian score and molecular docking as it can reflect the binding mode of the compounds and the protein as well as our former SAR analysis [28].

Pilot, version 7.5 [34]. Duplicates within each database and those that are the same when compared with the VEGFR-2 active database were removed to avoid bias. DUD-E All 409 VEGFR-2 actives and 24,680 decoys were collected and duplicated from the DUD-E [33] database for constructing the Bayesian druglike categorization models as well as 2053 actives for external validation. ChemBridge

Materials and methods Databases Compounds that used for Bayesian categorization model generation were collected from the DUD-E database [33]. Another 10 databases containing both publicly available and proprietary compounds were adopted for fragmentation and VS. They are listed in Table 1 and explicitly described as follows. All compounds were uniformly added with hydrogens and optimized using the Prepare Ligands protocol in Pipeline

ChemBridge provides as many as 1 million target-focused compounds and 14,000 diverse chemical building scaffolds. 100,032 diverse and target-focused screening compounds were downloaded from the ChemBridge database (http:// chembridge.com). ChemDiv Chemical Diversity (ChemDiv) offers drug development solutions and integrated drug discovery strategies for phar-

123

Mol Divers

maceutical and biotech companies (http://en.wikipedia.org/ wiki/ChemDiv). It provides numerous services concerning drugs and treatments for infectious, oncology, metabolic, central nervous system, inflammation, and some other diseases throughout the drug development process. Here, 700,000 compounds were extracted from ChemDiv (http:// www.chemdiv.com/).

DrugBank DrugBank database contains cheminformatics and bioinformatics data of comprehensive drug targets and drugs. It has also been broadly employed to promote in silico drug (docking or screening or de novo design), and drug target discovery, drug interaction prediction and metabolism prediction as well as basic pharmaceutical education [35]. DrugBank release 4.0 as the latest version [36] consists of 7736 drug items (1584 small-molecule drugs and 158 biotechdrugs approved by FDA, 89 nutraceuticals, as well as more than 6000 drugs under experiment. In addition, 4281 non-duplicate protein sequences related to their drug items are also provided. DrugBank is freely available at http://www.drugbank.ca.

HitFinder The HitFinderTM Collection is a subset of Maybridge screening collection which comprises 14,400 druglike diverse compounds, making the identification of potential drug leads convenient and cost-effective. Standard daylight fingerprints were employed as the clustering algorithm for selecting compounds with the Tanimoto similarity index set at 0.71 [37,38]. All screening compounds conform to Lipinski’s rule of five [39] for “druglikeness”, and all have purity greater than 90 %. Compounds have been selected to be non-reactive to ensure fewer false positives and higher quality results. All 14,400 compounds were downloaded from http://www.maybridge. com/.

NCI The National Cancer Institute (NCI) Open database is composed of more than 250,000 items on the basis of the chemistry information package CACTVS. The NCI’s Developmental Therapeutics Program contains almost all antiHIV and anticancer structures and screening data. A large quantity of additional, mostly calculated information (i.e., systematic names, predicted biological activities, etc.) have been included to enrich the database. 252,684 compounds were extracted from the release-4 (http://cactus.nci.nih.gov/) [40].

123

ZINC ZINC is as a useful and pragmatic compound database for structure-based VS [41]. Its large chemical space and availability for acquiring representative compounds render it a desirable resource for discovering leads for a specific biological target [42]. It currently has about 35 million commercially compounds and is freely available at www.zinc.docking.org. ChEMBL The functional mechanism, binding site and ADMET data for millions of druglike bioactive compounds are included in ChEMBL. These data are firstly manually collected from the original literature, curated and optimized to promote their application and quality by the EBI-ChEMBL team. The database hitherto consists of 5.4 million bioactivity data for as many as 5200 protein targets and 1 million compounds [43]. Version 10 (ChEMBL_10) from website (https://www.ebi. ac.uk/chembldb) was employed in this work. Specs Specs has been offering chemical compounds and chemistry related services to pharmaceutical and biotechnology companies since it was founded in 1987. Over 1.5 million structures released worldwide that confirm to the required criteria are appended to the database annually. 219,059 compounds were downloaded from http://www.specs.net/. KLIFS The ligand binding site residues of 85 kinases are aligned and included in the kinase-ligand interaction fingerprints and structure (KLIFS) database [44]. KLIFS can conveniently identify family specific interaction features and classify ligands based on their binding modes. 1252 Aligned human kinase-ligand cocrystal complexes from PDB database have been adopted to construct KLIFS [45]. Furthermore, different kinase-ligand interactions were featured by molecular interaction fingerprints [44]. 1100 Cocrystal ligands extracted from KLIFS were used (http://www.vu-compmedchem.nl). KDRActive This user-defined KDR active database contains the 409 actives from the DUD-E database, 65 actives collected from a recent review [27] and 2053 actives with known inhibitory activity values below 1 µM were also downloaded from DUD-E database (http://dude.docking.org/). 2542 Active compounds were obtained after removing duplicates.

Mol Divers

Chemical space analysis

Several physicochemical properties like molecular weight, AlogP, number of hydrogen donors and acceptors, number of rings and aromatic rings, number of rotatable bonds, minimized energy as well as molecular fractional polar surface area (Molecular_FPSA) were calculated by the Calculate Molecular Properties module in Discovery Studio 2.5.

ber of connections at the bonded atoms (two, three, four), and types of the bonded atoms (lists of specific elements or query atoms). The component has a rich parameter interface to specify the bonds to break and control the behavior so as to generate more various fragments. By default, this component enumerates all possible fragment combinations by breaking the identified bonds. However, only the single bond type and chain bond topology were selected as the breaking rules. The minimum branching was set to two connections. However, as the maximum molecular weight of several databases (DrugBank: 6179 Da, NCI: 3046 Da and ChEMBL: 10889 Da) were too large to enumerate the possible bonds to break; thus, these three databases were first screened by a molecular weight filter of 500 Da.

Scaffold representations

Scaffold diversity

Three scaffold representations (Murcko frameworks [7], Recap [8] and a user-defined (UserDefined) fragmentation methods) were employed to exemplify the scaffold diversity of the 10 databases. All the three methods were produced by different protocols in Pipeline Pilot 7.5. A simple fragmentlike rule (Molecular_Weight ≤ 350, Molecular_Weight ≥ 150) was initially used as filter to avoid inappropriate fragments. Duplicates were removed for the resulted fragments. All the retrieved fragments were denoted for identification. The fragmentation frameworks are shown in Fig. S1 (Supplementary Material) and described in detail below.

Scaffold diversity analysis was carried out on the 10 databases using the three scaffold representations. The structural diversity for the unique scaffolds [31] was investigated and characterized by the counts of scaffold, cumulative scaffold frequency plots (CSFP) and Tree Maps [47].

In these analysis, we investigated the chemical space of the databases from two perspectives: physicochemical properties and scaffold diversity. Physicochemical property

Murcko The generate fragments component was chosen to exert Murcko frameworks. Only MurckoAssemblies was selected in the FragmentsToGenerate parameter [7] as it combined a contiguous ring systems with chains that may link two or more rings, all other parameters remained default [31]. Recap [46] The recap algorithm excavates building-blocks from structures and privileged motifs that have been biologically determined [8]. Recap fragments a molecule by cleaving specific bonds which can be resynthesized according to general chemistry rules. In this technique, those molecules were not cleaved when their atoms exceeded the maximum number allowed to break. Above all, ringconnecting bonds and hydrogen bonds were not allowed to be fragmented [23]. Pipeline Pilot 7.5 was also adopted for fragmentation of molecules through the Generate Recap Fragments protocol. UserDefined A user-defined fragmentation method was applied using the Enumerate Fragments component. This component identifies bonds to break in the input molecules and outputs fragments as individual data records. The bonds to break are selected based on the specified bond type (single, double, triple), bond topology (ring, chain), minimum num-

Scaffold counts The numbers of non-duplicated and duplicated Murcko, Recap and UserDefined scaffolds for each database as well as scaffold frequency were counted. The number of singleton scaffolds presented in one molecule, was also recorded. Scaffolds filtered by a fragment-like rule (Molecular_Weight: 150–350 Da) were kept for further VS. Moreover, the ratio of singleton scaffolds to compounds (N/M) and singleton scaffolds to all scaffolds (Ns /N ) were mainly adopted to evaluate the chemical space diversity for scaffolds [31]. Cumulative scaffold frequency plots (CSFP) Scaffold frequency was characterized by a percentage of molecules which incorporate a same scaffold in total molecules. In order to produce CSFPs, scaffolds were firstly sorted according to their scaffold frequency and then the scaffold cumulative percentage was plotted. All the three scaffold representations were used to generate CSFPs and Pn values which could be further used to determine the percentage of scaffolds [31,48]. Tree maps Tree Maps, introduced by Shneiderman in 1992 [47], utilize a 2D space-filling method for visualizations of hierarchical data structures. With a clear intuitive advantage over traditional approachs [47], Tree Maps have been widely adopted to demonstrate hierarchical clustering based on the similarity of chemical structures [31] or biological values [49,50]. The biggest advantage of Tree maps is that it can highlight both the distribution of compounds over scaffolds and scaffold structural diversity. Herein, Tree maps calculated by TreeMap software from MacroFocus (http://www.

123

Mol Divers

macrofocus.com/) were used to analyze the structural diversity. Scaffold architectures were first clustered using ECFP_4 fingerprints by the cluster molecules component in Pipeline Pilot with the average number of compounds per cluster was set to 50 [32]. Then, the scaffold cluster number as well as its frequency were utilized to categorize scaffolds by circles of different area and colors [31]. Bayesian categorization models Model generation On the basis of VEGFR-2 actives and decoys, Bayesian classification models were derived using the Build Drug Like Model protocol in Pipeline Pilot (version 7.5). In this approach, a structural descriptor ECFP_4 which yields better performance among 13 extended-connectivity fingerprints [29], and 9 physiochemical descriptors including ALogP, number of hydrogen donors and acceptors, rings, aromatic rings, rotatable bonds, molecular solubility and Molecular_FPSA were employed to construct the models as these molecular descriptors are widely used in ADME predictions [16,20]. Compounds of 409 actives and 24,680 decoys within similar physicochemical space (Fig. S3a) were randomly divided into a training set and an internal test set by a ratio of 1:1. To investigate whether different ratios of actives and decoys would create a bias, we sampled four different ratios which include 1:1; 1:5; 1:20 and 1:60. Fig. S2 shows the major steps for the Bayesian model generation. Model validation Models were confirmed by a Leave-One-Out cross-validation and primitive test sets. In this method, after each compound was left out consecutively, the model based on the rest compounds were used to predict the omitted ones. After every compound was predicted, a Receiver Operator Characteristic (ROC) plot was created and the corresponding area under the curve (AUC) was accordingly computed. In addition to predicting the training set and internal test set, further performance evaluation was implemented by external test sets [22]. Herein, 2053 VEGFR-2 active compounds (IC50 < 1000 nM) and 218,821 compounds in the compound Specs database assumed as decoys constituted the external test sets (Fig. S2). Duplication was executed to avoid any bias. Likewise, five ratios (i.e, 1:1; 1:5; 1:20; 1:60 and 1:107) of actives and assumed decoys in the external test were also investigated. Several statistical parameters (i.e., enrichment factor (EF), accuracy, kappa value) were calculated to evaluate the quality of the models (Table S1 in Supplementary Material). As a measure of agreement corrected for chance, kappa value is considered as true accuracy for classification models and

123

outperforms accuracy to evaluate the prediction capability. A model with a kappa value over 0.4 is regarded rational [21,22]. Molecular docking The self- and cross-docking analysis of VEGFR-2 crystalized complexes in our previous study [29] indicates that the combination of Glide_SP and PDB 3EWH performs best and is considered as the most desirable solution for dockingbased VS of VEGFR-2 inhibitors. Thus, Glide [51] in the Schrodinger 2009 software was adopted for molecular docking analysis [52]. The Protein Preparation Wizard workflow was used to optimize the crystallized ligand of VEGFR-2 (PDB ID: 3EWH) and other selected molecules. The cocrystallized ligand of PDB 3EWH was centered by an enclosing box by the “Receptor Grid Generation” module to construct the docking grid files. Favorable compounds were identified mainly by taking Glide score and binding interactions into consideration. Other parameters remained with default settings. Finally, standard precision (SP) docking mode was selected for molecular docking.

Results and discussion Chemical space analysis Physicochemical property Nine 2D properties of the 10 databases are calculated to avoid bias. As shown in Fig. S3b and Table S2 (Supplementary Material), the 2D properties of DrugBank, ChEMBL, KLIFS and KDRActive were relatively closer and higher than other databases. Specifically, their minimized energy was all above 40 kJ/mol and the number of acceptors and donors were at least more than 5 and 2, respectively. However, there was an exception for DrugBank in its AlogP (1.34), number of rings (2) and aromatic rings (1) which ranked the lowest of all the databases. Overall, all the databases were of similar properties space as the standard error of all properties values were below 1. This was understandable as molecules in these four databases are almost actives and DrugBank even contains a large number of marketed drugs or drugs in clinical trials. The fact that such a simple comparison of 2D properties could easily differentiate the active databases rendered the usefulness of building a 2D property space. Scaffold diversity analysis Scaffold counts Table 2 lists the number of molecules (M) in the 10 databases as well as the fragments [duplicated (N) and non-duplicated (Nraw )] generated by the three fragmentation

Mol Divers

methods. The number of singleton scaffolds (Ns ) which represented the scaffolds that only exist in one molecule and molecules (Nfiltered ) filtered by a molecular weight restriction (150–350 D) to ensure a fragment-like size are also calculated in Table 2. The ratio of scaffolds and singleton scaffolds to compounds (N/M and Ns /M) and the ratio of singleton scaffolds to all scaffolds (Ns /N ) were defined to evaluate scaffold diversity. All the three fragmentation methods generated duplicated fragments, 1–6 times for Murcko, 2–31 times for Recap and 2–13 times for UserDefined. The most obvious was ChemDiv with the highest repetition rate (6, 31 and 13 times, respectively) while DrugBank, HitFinder and KLIFS were up to 1–3 times, indicating a high proportion of singleton scaffolds. The N/M values of Murcko fragments for ChemBridge, DrugBank, HitFinder, KLIFS and KDRActive databases were 0.51, 0.48, 0.67, 0.81 and 0.45, respectively, indicating that about one scaffold corresponding to 2 molecules and indicating that they were the highest scaffold diverse databases. Conversely, ChemDiv had the lowest N/M value of 0.17, almost one scaffold corresponding to six molecules, indicating that it contained heavily represented scaffolds. For the other four databases (NCI, ZINC, ChEMBL and Specs), the N/M values were all around 0.25, nearly one scaffold representing four molecules. To better describe the distribution of scaffolds, the ratio of singleton scaffold (Ns /N ) should be also considered in combination with the ratio of scaffolds to molecules (N/M). For instance, ChemBridge had an N/M of 0.51, but its Ns /N value was 0.84, demonstrating that 84 % scaffolds only represented one molecule (42,515 molecules) and the rest 16 % scaffolds represented the other 57,513 molecules (about 1 molecule corresponding to 7 molecules). The same trend also occurred in databases of DrugBank, KDRActive and HitFinder with N/M values over 0.4, but a high value of Ns /N about 0.8. By contrast, KLIFS was an exception which had an N/M value of 0.81 and an Ns /N of 0.88, suggesting an even distribution of molecules over scaffolds. This could be due to the fact that the molecules in KLIFS were mainly from the ligands of kinase protein drug bank (PDB) and many of them were evenly represented. From Table 2, all the cases (except for the ChemDiv and Spces databases) obtained a high percent of singletons (Ns /N > 0.6), indicating an uneven distribution of molecules over scaffolds. After molecular weight filtration, different proportions of scaffolds conformed to the molecular weight range of 150– 350 D; most of them were over 50 % and ZINC even reached 99 % of scaffolds except KDRActive of only 42 % scaffolds. This illustrated that most fragments generated by Murcko framework were large enough to meet the standard fragment size. Recap fragmentation obtained a similar overall trend with Murcko framework as the N/M values for DrugBank, HitFinder, KLIFS and KDRActive were 0.74, 0.45, 1.09 and

0.63, respectively and their Ns /N values were all over 0.6, demonstrating good scaffold diversity but uneven distribution of molecules over scaffolds. An obvious difference was the ChemBridge database which had a lower N/M value of 0.2 in the Recap analysis than in Murcko framework (N /M ≈ 0.51), indicating that this database could be represented by fewer scaffolds generated by Recap framework. Databases with the lowest N/M values were still ChemDiv and ZINC, but their N/M values were less than the counterparts generated by Murcko (N/M ≈ 0.08 and 0.12, respectively). This could be easily understood as Recap principle breaks a molecule on certain bonds that may be reformed by common reliable chemistry, which may result in limited fragments with many common features. The range of Ns /N (0.41–0.77) in Recap analysis was lower than Murcko framework (0.51– 0.89) and there were four databases with Ns /N values over 0.8, indicating a more uneven distribution of molecules over scaffolds. As an exception, The ChemDiv database had a lower Ns /N value of 0.45 but the lowest N/M value of 0.08, which meant that although molecules in ChemDiv could be represented by a few scaffolds, the distribution of molecules over scaffolds was more even. Additionally, N/M of KDRActive was 1.09, demonstrating that the number of fragments generated was more than the corresponding molecules. However, the high Ns /N value of 0.76 indicated an unequal representation of the molecules and the proportions of scaffolds that met the molecular weight limitation were lower than the Murcko framework on the whole, which mainly in the range of about 40–70 %. Fragments created by UserDefined methods were considerably different from Murcko and Recap analysis with N/M values all over 1, from the least of ZINC (1.17) to the most KLIFS (7.83), and the Ns /N values were also all over 1 except ZINC (0.84). Nevertheless, the trend of all the cases was consistent with the former two frameworks as DrugBank, HitFinder, KLIFS and KDRActive still obtained the higher N/M values, displaying high scaffold diversity. The Ns /N values for the 10 databases were all above 0.7 except ChemDiv (0.67), indicating a more uneven distribution. The proportion of scaffolds through molecular weight filtration varied from 39 % of ChemDiv to 76 % of ZINC, which was between Murcko and Recap analysis. In summary, no matter what fragmentation methods utilized, DrugBank, HitFinder, KLIFS and KDRActive databases possessed high scaffold diversity but an uneven distribution of molecules over scaffolds. The ChemDiv and ZINC databases then contained heavily represented scaffolds. Only the KDRActive database obtained an even distribution of molecules in all three fragmentation frameworks. The same phenomenon also occurred to DrugBank and KLIFS in the Recap analysis and ChemDiv, Specs and KLIFS in the UserDefined framework. Moreover, Murcko framework had the highest proportion of scaffolds filtered by molecular

123

Mol Divers Table 2 Results of the scaffold diversity analysis for the 10 databases M

Nraw

N

Ns

Nfiltered

Nraw /M

N/M

Ns /M

Nfilter /M

P25

P50

P75

ChemBridge

100,032

100,028

50,688

42,515

43,841

1.00

0.51

0.43

0.86

0.007

0.055

0.265

ChemDiv

700,000

700,000

116,341

58,843

69,939

1.00

0.17

0.08

0.60

0.064

0.202

0.470

DrugBank

6324

5456

3006

2421

2096

0.86

0.48

0.38

0.70

0.010

0.143

0.546

HitFinder

14,400

14,270

9599

8547

8320

0.99

0.67

0.59

0.87

0.008

0.122

0.516

NCI

252,684

232,294

60,793

41,948

43,096

0.92

0.24

0.17

0.71

0.001

0.019

0.174

ZINC

11,90,088

11,83,877

319,166

216,330

314,864

0.99

0.27

0.18

0.99

0.006

0.056

0.273

ChEMBL

1286,475

1270,471

3,56,859

219,184

200,630

0.99

0.28

0.17

0.56

0.011

0.102

0.393

Specs

219,059

218,932

49,715

28,661

37,469

1.00

0.23

0.13

0.75

0.014

0.083

0.323

KLIFS

1100

1100

896

792

592

1.00

0.81

0.72

0.66

0.097

0.386

0.693

KDRActive

2542

2542

1136

876

481

1.00

0.45

0.34

0.42

0.026

0.094

0.440

ChemBridge

100,032

240,144

20,303

9660

13,872

2.40

0.20

0.10

0.68

0.005

0.040

0.212

ChemDiv

700,000

1832,899

58,581

24,115

42,917

2.62

0.08

0.03

0.73

0.003

0.025

0.135

DrugBank

6324

10,526

4700

3621

2228

1.66

0.74

0.57

0.47

0.005

0.075

0.440

HitFinder

14,400

22,262

6,498

4,758

4059

1.55

0.45

0.33

0.62

0.002

0.026

0.173

NCI

252,684

325,413

59,742

43,066

33,183

1.29

0.24

0.17

0.56

0.001

0.008

0.082

ZINC

1190,088

2473,329

143,248

81,252

1,08,349

2.08

0.12

0.07

0.76

0.004

0.036

0.185

ChEMBL

1286,475

3310,356

341,908

221,573

209,419

2.57

0.27

0.17

0.61

0.001

0.020

0.124

Specs

219,059

414,945

47,119

27,420

30,551

1.89

0.22

0.13

0.65

0.002

0.022

0.105

Database Murcko

Recap

KLIFS

1100

2299

1197

908

530

2.09

1.09

0.83

0.44

0.020

0.140

0.520

KDRActive

2542

6834

1609

1004

782

2.69

0.63

0.39

0.49

0.005

0.042

0.187

ChemBridge

100,032

950,737

202,083

147,532

148,864

9.50

2.02

1.47

0.74

0.003

0.034

0.220

ChemDiv

700,000

9219,740

1088,581

724,130

422,591

13.17

1.56

1.03

0.39

0.002

0.023

0.166

UserDefined

DrugBank

6324

36,474

19,815

16,632

12,692

5.77

3.13

2.63

0.64

0.004

0.050

0.285

HitFinder

14,400

139,383

58,247

47,784

42,055

9.68

4.04

3.32

0.72

0.003

0.029

0.210

NCI

252684

1693,458

595,423

494,435

385,978

6.70

2.36

1.96

0.65

0.002

0.014

0.133

ZINC

1190,088

10587,370

1395,701

996,897

1066,162

8.90

1.17

0.84

0.76

0.002

0.023

0.177

ChEMBL

1286,475

11502,561

2727,747

2020,809

1621,114

8.94

2.12

1.57

0.59

0.002

0.017

0.142

Specs

219,059

2326,072

561,169

414,894

256,544

10.62

2.56

1.89

0.46

0.001

0.021

0.169

KLIFS

1100

14,106

8613

7228

4817

12.82

7.83

6.57

0.56

0.015

0.121

0.475

KDRActive

2542

48,344

15,551

11,383

7185

19.02

6.12

4.48

0.46

0.007

0.065

0.238

M number of total compounds, Nraw number of all scaffolds, N number of scaffolds after duplication, Ns number of singleton scaffolds, Pn the proportion of scaffolds that indicate n percent of compounds

weight limitation of 150–350 D followed by UserDefined and Recap analysis mainly from about 40 to 70 % for different databases. In total, the distribution of molecules over scaffolds for most cases was uneven with only a very small proportion of highly frequent scaffolds and a great number of singletons. CSFP and Pn values Moreover, we obtained the Pn values from the CSFP curves. Pn values (P25 , P50 and P75 ) are the percentage of scaffolds required to indicate n present of molecules. For example, P50 indicates the percentage of scaffolds that represents 50 % of all the molecules. However, as some

123

of the databases were too large and would generate large amount of fragments, thus only the top 5000 most frequent fragments were adopted for depicting the CSFP curves and calculating the Pn values. To validate the reliability of this processing method, we individually selected the top 2000 and top 5000 most frequent fragments to draw the CSFP curves and calculate Pn values, it did not demonstrate any significant difference. The CSFP curves for Murcko, Recap and UserDefined are illustrated in Fig. 2. In extreme cases, when the scaffolds corresponds to equal number of compounds, the plot would be diagonal from (0, 0 %) to (100, 100 %).

Mol Divers

analysis. For the Murcko framework, KLIFS, HitFinder and ChemDiv was closer to the diagonal, indication of a more even distribution; while NCI was furthest from the diagonal and was the least even distribution. ChemBridge and ZINC were also far from the diagonal and were almost overlapped, demonstrating a very similar distribution of molecules over the top 5000 most frequent scaffolds. All the curves were very steep gradient at the beginning of the CSFPs and a shallow trend at the end of the CSFPs. This indicated a small proportion of scaffolds represented a large proportion of the databases at first and then the curve was represented by the high proportion of singleton scaffolds. ChemDiv was exceptional with a very low proportion of singletons (Ns /N ≈ 0.51) but the lowest proportion of scaffolds to all databases (N/M ≈ 0.17). This phenomenon was explained through a lower gradient early and leveling off later in the plot compared with other databases (NCI, ZINC and ChemBridge, Fig. 4a, b) on account of a lower percent of singletons. CSFPs of Recap and UserDefined frameworks presented similar overall trends with Murcko framework as DrugBank and KLIFS were still the closest to diagonal and demonstrated the most even distribution while NCI demonstrated the opposite trend. However, some of the databases showed significant difference in the three fragmentation methods. For example, molecules of ChemDiv distributed evenly in the Murcko framework but unevenly in Recap and UserDefined analysis. We could conclude that Murcko was the most discriminatory in differentiating databases followed by Recap and UserDefined analysis. Furthermore, Pn values calculated from CSFPs also reflected the results mentioned above. For instance, P25 , P50 and P75 values for NCI were the lowest among all the databases: 0.1, 1.94, 17.38 in Murcko framework; 0.06, 0.84 and 8.16 in Recap analysis; 0.16, 1.4, 13.28 in UserDefined analysis. P50 indicated that 50 % of the database was represented by 1.94 or 0.84 or 1.4 % of Murcko or Recap or UserDefined unique scaffolds, respectively. These confirmed that NCI had an unevenly distribution with most of its compounds were represented by the least scaffolds. In contrast, ChemDiv had the lowest N/M value in the Murcko framework, but its P25 , P50 and P75 values were 6.38, 20.24 and 47.04, indicating a more evenly distribution of molecules over scaffolds. KLIFS and DrugBank also demonstrated the similar trend with a relatively high Pn values and thus a more evenly distribution. Fig. 2 Cumulative scaffold frequency plot (CSFP) analysis for the distribution of the 10 compound databases over. a Murcko scaffolds; b Recap scaffolds; c UserDefined scaffolds

Thus, if the curve is closer to the diagonal, the database will represent a more evenly distribution. Overall, the CSFPs of all the three fragmentation frameworks delivered similar conclusions resulted from the above

Tree maps Although the CSFPs and Pn values provided us with information about scaffold diversity for the compound databases; the diversity of the whole molecules should also be considered when evaluating the whole chemical space diversity of the databases. Therefore, Tree Maps [31,47,50] was used to explicate the structural similarity of the databases. The scaffolds were firstly clustered using the ECFP_4 fingerprints and then visualized using Tree Maps. Only the

123

Mol Divers

top 5000 most frequent scaffolds as well as scaffold screened by the molecular weight filtration (MW: 150–350 Da) were respectively used to generate the Tree Maps. For example, as shown in Figs. 3 and 4, each circle represented a scaffold after grouped into gray circles according to the clusters they belong to and their color and size were both proportional to the scaffold frequency. Tree Maps representations of the three types of scaffolds for the databases are illustrated in Figs. 5–6 and S2–S10. Fig. 3 shows the top 5000 most frequent scaffolds and Table 2 depicts the corresponding scaffolds screened by molecular weight filtration (MW: 150–350 Da) for the KDRActive database. The comparison of the scaffold Tree Maps did not show any significant changes for the whole distribution of scaffolds. Nevertheless, the granular fragments such as benzene ring and some chain structure were filtered as they are usually not suitable as a molecular skeleton. Therefore, the other nine databases were depicted by Tree Maps only used their scaffolds screened by molecular weight. From Figs. S4–S12 (Supplementary Material), we could see the unevenly distribution of scaffolds in different clusters no matter what fragmentation method used. That indicated a few clusters included a large number of scaffolds while the other majority of clusters contained a very small proportion of fragments. The highly populated scaffolds were distributed differently in different clusters. The majority of them (ChEMBL, ChemDiv, NCI, ZINC) were clustered together. Conversely, the highly populated scaffolds in databases like KLIFS, KDRActive and DrugBank were more evenly represented in different clusters, indicating a structural diversity. These results were consistent with our previous CSFP and Pn Values analysis. We also found an interesting phenomenon as the most populated scaffolds were the same or similar regardless of databases from the same fragmentation framework. For example, N-benzylaniline scaffolds emerged as the most popular scaffolds in ChEMBL (3709 molecules), ChemBridge (1711 molecules), HitFinder (66 molecules), Specs (3187 molecules) and ZINC (12,288 molecules), and as the second most popular in NCI (1080 molecules) databases. This means that scaffolds originated from single methods could be very limited. However, different frameworks often produced quite different scaffolds. The Murcko scaffold were generally larger while the Recap scaffold always more granular but available for synthesis. The UserDefined framework then produced scaffolds by incorporating the advantages of the other two methods, however, their availability of being synthesized yet to be verified. Nevertheless, one point was clear that different fragmentation frameworks could guarantee the diversity of our scaffolds, no matter from the scaffold aspect or the structural aspect. In addition, both the distribution and structural diversity of chemical scaffolds were effectively illustrated by Tree Maps.

123

Bayesian model generation and validation We used a Bayesian classification technique to investigate the distinction in structural as well as physiochemical properties between VEGFR-2 actives and decoys. To investigate whether the Bayesian models were dependent on the size of the sample, 4 different ratios of actives versus decoys (i.e, 1:1; 1:5; 1:20 and 1:60) were randomly generated from the VEGFR-2 DUD-E database. Each group was divided into a 1:1 ratio to generate a training set and an internal test set. The physicochemical and structural property parameters of the training set were used to construct the Bayesian models and validation of the models were performed on the test set using a leave-one-out cross-validation method. Statistics results such as leave-one out cross-validation, enrichment, percentile, category statistics, model construction information and ROC AUC value were displayed in the Table S3 (Supplementary Material). Enrichment results revealed the percentage of accurately categorized members on a given percentage cutoff. For instance, column “1 %” would be the percentage of true category members in the top 1 % of the list according to the model score. Table S3 shows that on top 5 % level, 100 % of all actives have been distinguished by the KDRDrugLike1vs60 model while for the other three were all below 100 %. At the top 1 % level, it could retrieve 61.4 % of all true actives, about 3, 6 and 24 times more than models KDRDrugLike1vs20, KDRDrugLike1vs5 and KDRDrugLike1vs1, respectively. This indicated that the KDRDrugLike1vs60 model was the most robust. The percentile results then indicated the cutoff needed to attain a certain percentile of the good samples. These results could also benefit us to choose the cutoff value which could capture good samples as many as possible. The percentile results from Table S3 indicated that the more good samples to capture, the higher the Bayesian score should be. For example, to obtain 50 % of the true positives, a cutoff of 8.8, 22.4, 31.9 and 35.7 was accordingly needed for the four models respectively. We also noticed that for each percentile level the higher ratio of actives versus decoys used to build models, the higher cutoff was required. Model KDRDrugLike1vs60 required a 27.4 cutoff value for achieving 90 % true actives whereas KDRDrugLike1vs20, KDRDrugLike1vs5, KDRDrugLike1vs1 required 24.4, 16.7 and 5.965 individually. The ROC score were described by the AUC values calculated using the internal test set. It can be seen from Table S3 that there were almost no significant difference between the four models as the error was in the range of 0.001–0.003. Thus, our Bayesian models were stable as it showed little dependence on the sample size. Moreover, the minimum AUC value for these four models was 0.997 and the maximum was 1, which could greatly support their reliability and robustness. As KDRDrugLike1vs60 performed well in both enrichment results and ROC score, and a large

Mol Divers

Fig. 3 Tree Maps of the KDRActive database before molecular weight filtration. a Murcko scaffolds; b Recap scaffolds; c UserDefined scaffolds. Scaffolds were depicted as colored circles whose color and area of the circles were proportional to the corresponding scaffold frequency.

Clusters of scaffolds were represented by gray circles. Large green circles in Tree Maps indicated the presence of highly populated scaffolds while small white circles represented the large proportion of singleton scaffolds in the databases (many small white circles)

sample size would give more reliable and significant statistic results, it has been selected for further validation and VS of the compound databases. Statistical parameters including true positives (TP), true negatives (TN), false positives (FP), false negatives (FN), ROC score (i.e, AUC value), enrichment factor (EF), accuracy, sensitivity (Se), specificity (Sp), precision, and kappa value are calculated in Table 3. In this study, only the top 0.1, 0.2, 0.5, 1, 2 and 5 % ranked hit-lists were saved for further evaluation because of the large size of validation set. For the training set and internal test set at top 1 % to top 0.1 % and external test set at top 0.2 % to top 0.1 %, the number of TP were the same as the number on the hit-list and the FP were all 0, indicating that all the compounds on the hit-list were actives. In terms of FN and TN, with the increasing of

compounds on the hit-list, FN were decreasing to 0 and 1 for the training set and internal test set respectively at top 2 and 5 % level, and TN were decreasing gradually from top 0.1 % to top 5 %. FP and FN were mainly because that the wrongly predicted compounds containing some features that were similar to those in the actives or decoys of the training set, respectively. For example, there were 19 FP in the top 0.5 % hit-list for the external test set, 10 of them contain N-phenylacetamide substitution and 6 of them contain thieno[2,3-d]pyrimidine part which are the same structure features with the actives in the training set. From top 5 % to top 0.1 %, compounds on the hit-lists were decreasing while the EF values were increasing. The Bayesian model showed improvement in EF about 3 times for the training set and the internal test set and about 5 times for the external test set on

123

Mol Divers

Fig. 4 Tree Maps of the KDRActive database after molecular weight filtration. a Murcko scaffolds; b Recap scaffolds; c UserDefined scaffolds. Other details are the same as Fig. 3

50-fold decrease of compounds. At top 0.1 %, the EF values were as high as about 59, 63 and 108 for the three data sets, individually. However, from top 0.5 to 0.1 %, the EF values did not show any significant increase which may imply that the increased proportion of actives was almost the same with the increased proportion of compounds on the hit-lists. In terms of AUC values, all the statistics were striking and amazing as for the training and internal test sets it reached 1* [29] from top 1 % to top 0.1 %, demonstrating that the hit-lists only contained actives which is desirable for VS investigation which have been illustrated in our study [29]. For the external test set it also reached this point from top 0.2 % to top 0.1 %. For the other levels, excepting for an AUC of 0.843 (good level) [29] at top 0.5 % for the external test set, it achieved AUC values above 0.9 (excellent level) [29] for all the remaining databases. This strongly supported the fact that

123

compounds on the targeting hit-lists were most actives as the high AUC values provide an objective measure of the model performance due to its independence of the actual number of actives and inactives. However, the selectivity was decreasing with the decrease of hit-list compounds. Nevertheless, at top 5 % and top 2 %, the selectivity for this model were over 0.9 for all three databases, indicating that 90 % of the actives in the screening pool were captured. Additionally, specificity exerted satisfactory performance as it has reached almost 1 for all the three databases at all six stages of hit-lists, signifying that the Bayesian model has distinguished almost all the TP (i.e., decoys). Moreover, an accuracy of above 0.95 was obtained for this model which indicated that the proportion of true prediction (correctly identified TP and TN) in the entire population was above 95 %. From top 1 % to top 0.1 %, the model

12

AUC

TP

Top 2 % Training set

12,183

59.41

0.06

1.00

0.98

1.00

0.11

Top 1 %

Training set

Se

Sp

Accuracy

Precision

Kappa

Statistics

0.76

0.76

1.00

0.99

1.00

0.61

61.35

12,497

80

0

218,821

0.74

0.72

1.00

1.00

0.77

77.30

218,199

466

622

1587

0.904

218,665

2209

External test set

0.19

1.00

0.99

1.00

0.11

107.65

1* represents that no decoys were screened at the given stage

1.00

Se

Kappa

0.61

EF

Precision

61.39

TN

1.00

12,183

FN

0.99

78

FP

Accuracy

0

TP

Sp

124

AUC 127

1∗

1∗

Not-selected

127 12,577

124

12,261

Hit-list

Internal test set

0.12

1.00

0.98

1.00

0.06

62.80

12,497

0

0.90

0.81

1.00

1.00

1.00

50.00

12,137

0

46

202

0.994

12,137

248

0.22

1.00

0.99

1.00

0.12

61.88

12,183

177

EF

0 1832

TN

0 194

0

190

25

1∗

12,360

25

FN

221

1∗

220,653

221

FP

13

1∗

1∗

13 12,691

12

12,373

Training set

Not-selected

Top 0.2 % External test set

Training set

Internal test set

Top 0.1 %

Hit-list

Statistics

Table 3 Virtual screening results of Bayesian model KDRDrugLike1vs60

0.89

0.81

1.00

1.00

1.00

49.76

12,449

1

48

206

0.971

12,450

254

Internal test set

0.21

1.00

0.99

1.00

0.12

60.39

12,497

182

0

25

1∗

12,679

25

Internal test set

0.56

0.42

0.99

0.99

0.90

44.93

216,249

208

2572

1845

0.906

216,457

4417

External test set

0.35

1.00

0.99

1.00

0.22

107.65

218,821

1611

0

442

1∗

220,432

442

External test set

Top 0.5 %

0.48

0.33

0.97

0.97

1.00

20.00

11,766

0

417

202

0.999

11,766

619

Training set

Top 5 %

0.47

1.00

0.99

1.00

0.31

61.39

12,183

140

0

62

1∗

12,323

62

Training set

0.48

0.32

0.97

0.97

1.00

19.90

12,068

1

429

206

0.997

12,069

635

Internal test set

0.47

1.00

0.99

1.00

0.31

61.84

12,497

143

0

64

1∗

12,640

64

Internal test set

0.29

0.18

0.96

0.96

0.96

19.13

209,741

89

9080

1964

0.943

209,830

11044

External test set

0.69

0.98

1.00

1.00

0.53

105.70

218,802

968

19

1085

0.843

219,770

1104

External test set

Mol Divers

123

Mol Divers

achieved approximate 100 % precision for all the three databases except for 98 % accuracy at top 0.5 % for the external test set. As precision reveals the power to correctly predict positive results for the model, with the sharply increase of compounds on the hit-lists of top 5 %, the precision was accordingly reduced to 33, 32 and 18 % for the three databases respectively. The Bayesian model also performed well according to the kappa value (over 0.4), the most important statistic parameter to comprehensively evaluate a model’s predictive ability and robustness. The kappa value presented a trend among high-low on both sides from top 5 % to top 0.1 %. Excepting for a 0.29 of kappa value for the external test set at top 5 % stage, our Bayesian model obtained kappa values over 0.4 for all the three databases from top 5 % to top 0.5 %. At top 1 %, kappa values was up to above 0.7 for the three databases and even got to as high as 0.90 and 0.89 for the training set and the internal test set at top 2 %. All in all, the models performed rather well at top 1 % for all databases, and 1 % could be used as a cutoff when setting the ratio of hit-lists. Moreover, we reviewed the recent published papers and selected three of them [16,19,22] also conducted Bayesian models in the VS study for comparison. As shown in Table S4 (Supplementary Material), statistics criteria including enrichment factor (EF), selectivity (Se), specificity (Sp), accuracy, and precision have been collected to make an evaluation. Through a detailed comparison and analysis, we could conclude that only the selectivity values of their models were comparable to our validated model, the remaining statistics were inferior to our methods in the VEGFR-2 target system, especially the EF value. In summary, the model behaved excellent and considerably well not only for the training set and the internal test set, but also significant for the larger independent external test set. Bayesian model-based VS

Fig. 5 VS results of Bayesian model KDRDrugLike1vs60. a VS of compound DBs. b VS of corresponding fragment databases. c Comparison of EF values between the Bayesian screened compounds and fragments

The validated Bayesian model KDRDrugLike1vs60 was applied to VS in two aspects, of which one was from the whole compound perspective and another was from the fragment perspective. That means the 10 databases (≈ 3.77 million compounds) and their corresponding fragment databases were virtually screened by this model. The selection of results was mainly determined by Bayesian score. According to the percentile results of KDRDrugLike1vs60, when the FP and TN were estimated less than 1 % and over than 99 % respectively, the minimum Bayesian score was at least 19.766 to obtained a ratio of 95 % true positives. Thus, we set the cut-off value of 20 as the minimum starting score and 35 as the maximum starting score for further analysis. Five was defined as the intercept and therefore six levels of results are calculated in Table S5 (Supplementary Material) and the percentile results are depicted in Fig. 5.

For the VS of compound databases (Fig. 5a), ChEMBL has dominated the most of the results and has been increasing from 62.80 % (Score > 20 level) to 78.16 % (Score > 35 level). It was not difficult to understand as compounds in the ChEMBL were manually collected from the primary published literatures and most of them are known active compounds for different targets. Then, KDRActive database has the similar upward trend with ChEMBL with its percentage of compounds has elevated from 2.94 % (Score > 20 level) to 11.08 % (Score > 35 level). This was a larger validation process when only a few VEGFR-2 actives were sampled in a large database pool with millions compounds but could also be found by the robust Bayesian model. Two other databases that also occupied a relatively larger proportion were ZINC and ChemDiv. These two databases obtained

123

Mol Divers Fig. 6 Examples of final hit scaffolds selected that targeting VEGFR-2

13.9 and 12.25 % on the level of score over 20, but only 3.03 and 2.41 % on the level of score over 35. Although ChEMBL, KDRActive, ZINC and ChemDiv occupied 9 out of 10 of the screened results and the trends were similar, there were some differences in their proportion when filtering their corresponding fragment databases (Fig. 5b). The outstanding KDRActive fragment database then had a striking improvement with a proportion of 6.67 % (score > 20 level) and 21.58 % (score > 35 level), about 2 times more than results obtained by screening the KDRActive database. Additionally, as shown in Fig. 5b, EF values were calculated between the screened KDRActive compounds and KDRActive fragments, and results demonstrated that EF values for KDRActive fragments were twice more than its corresponding compound database. Due to the fact that these fragments were generated from compounds in the KDRActive database, we assumed that those fragments bind in the hinge region of the targets were the main contributors to promise their inhibitory activity. This hypothesis has been confirmed by a retrospective investigation of the published literatures [53–57]. Five scaffolds originated from VEGFR-2 actives also exhibited favorable VEGFR-2 activity. Their structures and parent molecules as well as their VEGFR-2 IC50 values are displayed in Table 4. This strongly proved that screening a fragment database may gain much more effective and

satisfactory achievements. As the finally results of the proposed methodology were a series of fragments, they can be preferably employed as the novel scaffolds to be further optimized into leads or drugs by computational methods like fragment evolution, fragment linking, fragment optimization and fragment self-assembly, and experimental methods like X-ray crystallography and cellular assays. Generally, fragments cover greater chemical space diversity than their counterpart compounds and are more favorable as starting points since they generally possess greater optimization possibilities. Therefore, it will be of great use of our screened fragments in developing novel leads, even drugs. Overall, the reliability of our Bayesian model could be further reaffirmed by the inspirational outcomes of the KDRActive compound database and fragment database and behaved better in screening fragments. Since the fragments of active compounds could be obtained, we were safely to believe that we could obtain potential active fragments from large fragment databases by this Bayesian model. Moreover, whenever screening a whole compound database or a corresponding generated fragment database, ChEMBL, ZINC and ChemDiv performed well. ChEMBL outperformed the other databases in screening whole compounds and fragments followed by ZINC. These three databases can be firstly considered for discovering inhibitors of other targets.

123

Mol Divers Table 4 Scaffolds that exhibit VEGFR-2 biological activity through retrospective investigation of published literature

Selection of hit scaffolds Molecular docking was deployed to ultimately predict the binding mode of the 33,557 molecules and 20,927 fragments with Bayesian score over 25. As the generate tautomers in the LigPrep module in schrodinger software was selected, 33,305 molecules and 25,789 fragments were finally docked. The top 20 potential VEGFR-2 scaffolds (Fig. 6 and Fig. S13 in Supplementary Material) were finally selected according to the key hydrogen interactions with VEGFR-2 hinge region (Cys 919 and Glu917) and DFG (Glu885 and Asp1046) region they maintained. These novel scaffolds could be optimized

123

to novel VEGFR-2 inhibitors as they can not only maintain the original binding mode but also permit optimization in other pocket cavities.

Conclusions In this work, we proposed a fragment VS concept based on Bayesian categorization to discover novel scaffolds and developed a hybrid workflow to validate it on the VEGFR-2 target. Firstly, we conducted an explicit chemical diversity analysis based on three fragmentation frameworks including Murcko, Recap and UserDefined for a total of 10 databases.

Mol Divers

The scaffold and structural diversity analysis demonstrated that regardless of fragmentation methods, the DrugBank, HitFinder, KLIFS and KDRActive databases possessed high scaffold diversity but an uneven distribution of molecules over scaffolds. The ChemDiv and ZINC databases then contained heavily represented scaffolds. In total, an uneven distribution of molecules over scaffolds was very common for most cases where only a very small proportion of heavily populated scaffolds and a great number of singletons. For the investigation of Bayesian-based VS, KDRDrugLike1vs60, as the most robust Bayesian model, behaved considerably excellent for the training set as well as internal and external test sets. Particularly, at the top 1 % results, the EF, AUC and kappa values for the training set, internal and external test set were all above 60, 0.9 and 0.74, respectively. The overall accuracy and precision of KDRDrugLike1vs60 even reached 99 and 90 % individually, which strongly supported the robustness of the Bayesian model. Then, comparison of screening compound databases and their corresponding fragment databases indicated that the Bayesian model behaved better in screening fragments. Furthermore, the generated fragments screened with relatively high Bayesian scores indeed exhibited VEGFR-2 biological activity through a retrospective literature research. This strongly proved our proposal for the effectiveness of fragment VS through Bayesian models. Databases including ChEMBL, ZINC and ChemDiv performed quite well. ChEMBL followed by ZINC outperformed the other databases in screening whole compounds and fragments. These three databases can be firstly considered for discovering inhibitors of other targets. Eventually, 20 novel potential VEGFR-2 scaffolds were selected on the basis of molecular docking and our former knowledge of this target. Supporting information Detailed information about the three different scaffold representations on VEGFR-2 inhibitor sorafenib (Fig. S1); workflow for Bayesian classification model building, validation, and VS as applied to internal and external databases (Fig. S2); statistical values comparison of 2D properties (Fig. S3); Tree Maps of the 10 compound databases after molecular weight filtration (Fig. S4–S12); finally selected 20 hit scaffolds based on Bayesian score and molecular docking analysis (Fig. S13); definition of classification model performance measures (Table S1); statistical values of 2D properties for the 10 databases (Table S2); statistics results of Bayesian categorization models (Table S3); comparison of the Bayesian model with those in other published papers (TableS4) and percentile results for Bayesian modelbased VS (Table S5). Acknowledgments This work was financially supported by National Natural Science Foundation of China (81302634, 21302225 and NSFC81473077), the Fundamental Research Funds for the Central Universities (PT2014LX0072); the Natural Science Foundation of Jiangsu

Province (BK20130662), the Postgraduate Innovative Foundation supported by Jiangsu Province (KYLX_0637), and the Jiangsu Qinglan Project. Conflict of interest The authors declare no competing financial interest.

References 1. Wassermann AM, Kutchukian PS, Lounkine E, Luethi T, Hamon J, Bocker MT, Malik HA, Cowan-Jacob SW, Glick M (2013) Efficient search of chemical space: navigating from fragments to structurally diverse chemotypes. J Med Chem 56:8879–8891. doi:10. 1021/jm401309q 2. Rees DC, Congreve M, Murray CW, Carr R (2004) Fragmentbased lead discovery. Nat Rev Drug Discov 3:660–672. doi:10. 1038/nrd1467 3. Congreve M, Chessari G, Tisi D, Woodhead AJ (2008) Recent developments in fragment-based drug discovery. J Med Chem 51:3661–3680. doi:10.1021/jm8000373 4. Walters WP, Stahl MT, Murcko MA (1998) Virtual screeningan overview. Drug Discov Today 3:160–178. doi:10.1016/S13596446(97)01163-X 5. Kumar A, Voet A, Zhang K (2012) Fragment based drug design: from experimental to computational approaches. Curr Med Chem 19:5128–5147. doi:10.2174/092986712803530467 6. Pierce AC, Rao G, Bemis GW (2004) BREED: generating novel inhibitors through hybridization of known ligands. Application to CDK2, p38, and HIV protease. J Med Chem 47:2768–2775. doi:10. 1021/jm030543u 7. Bemis GW, Murcko MA (1996) The properties of known drugs. 1. Molecular frameworks. J Med Chem 39:2887–2893. doi:10.1021/ jm9602928 8. Lewell XQ, Judd DB, Watson SP, Hann MM (1998) RECAP retrosynthetic combinatorial analysis procedure: a powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry. J Chem Inf Comput Sci 38:511–522. doi:10.1021/ci970429i 9. Schuffenhauer A, Ertl P, Roggo S, Wetzel S, Koch MA, Waldmann H (2007) The scaffold tree-visualization of the scaffold universe by hierarchical scaffold classification. J Chem Inf Model 47:47–58. doi:10.1021/ci600338x 10. Baker M (2012) Fragment-based lead discovery grows up. Nat Rev Drug Discov 12:5–7. doi:10.1038/nrd3926 11. Law R, Barker O, Barker JJ, Hesterkamp T, Godemann R, Andersen O, Fryatt T, Courtney S, Hallett D, Whittaker M (2009) The multiple roles of computational chemistry in fragment-based drug design. J Comput Aided Mol Des 23:459–473. doi:10.1007/ s10822-009-9284-1 12. Yuan H, Tai W, Hu S, Liu H, Zhang Y, Yao S, Ran T, Lu S, Ke Z, Xiong X (2013) Fragment-based strategy for structural optimization in combination with 3D-QSAR. J Comput Aided Mol Des 27:897–915. doi:10.1007/s10822-013-9687-x 13. Hou T, Xu X (2004) Recent development and application of virtual screening in drug discovery: an overview. Curr Pharm Des 10:1011–1033. doi:10.2174/1381612043452721 14. Xia X, Maliski EG, Gallant P, Rogers D (2004) Classification of kinase inhibitors using a Bayesian model. J Med Chem 47:4463– 4470. doi:10.1021/jm0303195 15. Xie Q-Q, Zhong L, Pan Y-L, Wang X-Y, Zhou J-P, Di-wu L, Huang Q, Wang Y-L, Yang L-L, Xie H-Z (2011) Combined SVM-based and docking-based virtual screening for retrieving novel inhibitors of c-Met. Eur J Med Chem 46:3675–3680. doi:10.1016/j.ejmech. 2011.05.031

123

Mol Divers 16. Singh N, Chaudhury S, Liu R, AbdulHameed MDM, Tawa G, Wallqvist A (2012) QSAR classification model for antibacterial compounds and its use in virtual screening. J Chem Inf Model 52:2559–2569. doi:10.1021/ci300336v 17. Vijayan R, Bera I, Prabu M, Saha S, Ghoshal N (2009) Combinatorial library enumeration and lead hopping using comparative interaction fingerprint analysis and classical 2D QSAR methods for seeking novel GABAA α3 modulators. J Chem Inf Model 49:2498– 2511. doi:10.1021/ci900309s 18. Lee JH, Lee S, Choi S (2010) In silico classification of adenosine receptor antagonists using Laplacian-modified naive Bayesian, support vector machine, and recursive partitioning. J Mol Graph Model 28:883–890. doi:10.1016/j.jmgm.2010.03.008 19. Prathipati P, Ma NL, Keller TH (2008) Global Bayesian models for the prioritization of antitubercular agents. J Chem Inf Model 48:2362–2370. doi:10.1021/ci800143n 20. Wang S, Li Y, Wang J, Chen L, Zhang L, Yu H, Hou T (2012) ADMET evaluation in drug discovery. 12. Development of binary classification models for prediction of hERG potassium channel blockage. Mol Pharm 9:996–1010. doi:10.1021/mp300023x 21. Hu Y, Unwalla R, Denny RA, Bikker J, Di L, Humblet C (2010) Development of QSAR models for microsomal stability: identification of good and bad structural features for rat, human and mouse microsomal stability. J Comput Aided Mol Des 24:23–35. doi:10. 1007/s10822-009-9309-9 22. Kombo DC, Bencherif M (2013) Comparative study on the use of Docking and Bayesian categorization to predict ligand binding to nicotinic acetylcholine receptors (nAChRs) subtypes. J Chem Inf Model 53:3212–3222. doi:10.1021/ci400493a 23. Yuan H, Lu T, Ran T, Liu H, Lu S, Tai W, Leng Y, Zhang W, Wang J, Chen Y (2011) Novel strategy for three-dimensional fragmentbased lead discovery. J Chem Inf Model 51:959–974. doi:10.1021/ ci200003c 24. Kiselyov A, Balakin KV, Tkachenko SE (2007) VEGF/VEGFR signalling as a target for inhibiting angiogenesis. Expert Opin Investig Drugs 16:83–107. doi:10.1517/13543784.16.1.83 25. Huang L, Huang Z, Bai Z, Xie R, Sun L, Lin K (2012) Development and strategies of VEGFR-2/KDR inhibitors. Future Med Chem 4:1839–1852. doi:10.4155/fmc.12.121 26. Boyer SJ (2002) Small molecule inhibitors of KDR (VEGFR-2) kinase: an overview of structure activity relationships. Curr Top Med Chem 2:973–1000. doi:10.2174/1568026023393273 27. Musumeci F, Radi M, Brullo C, Schenone S (2012) Vascular endothelial growth factor (VEGF) receptors: drugs and new inhibitors. J Med Chem 55:10797–10822. doi:10.1021/ jm301085w 28. Zhang Y, Liu H, Jiao Y, Yuan H, Wang F, Lu S, Yao S, Ke Z, Tai W, Jiang Y (2012) De novo design of N-(pyridin-4-ylmethyl) aniline derivatives as KDR inhibitors: 3D-QSAR, molecular fragment replacement, protein-ligand interaction fingerprint, and ADMET prediction. Mol Divers 16:787–802. doi:10.1007/s11030-0129405-y 29. Zhang Y, Yang S, Jiao Y, Liu H, Yuan H, Lu S, Ran T, Yao S, Ke Z, Xu J (2013) An Integrated Virtual Screening Approach for VEGFR-2 Inhibitors. J Chem Inf Model 53:3163–3177. doi:10. 1021/ci400429g 30. Socinski MA (2011) Multitargeted receptor tyrosine kinase inhibition: an antiangiogenic strategy in non-small cell lung cancer. Cancer Treat Rev 37:611–617. doi:10.1016/j.ctrv.2011.04. 003 31. Langdon SR, Brown N, Blagg J (2011) Scaffold diversity of exemplified medicinal chemistry space. J Chem Inf Model 51:2174– 2185. doi:10.1021/ci2001428 32. Shen M, Tian S, Li Y, Li Q, Xu X, Wang J, Hou T (2012) Druglikeness analysis of traditional Chinese medicines: 1. property distributions of drug-like compounds, non-drug-like compounds

123

33.

34. 35.

36.

37.

38.

39. 40.

41.

42.

43.

44.

45.

46. 47.

48.

49.

50. 51.

and natural compounds from traditional Chinese medicines. J Cheminformatics 4:1–13. doi:10.1186/1758-2946-4-31 Mysinger MM, Carchia M, Irwin JJ, Shoichet BK (2012) Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking. J Med Chem 55:6582–6594. doi:10.1021/ jm300687e Inc. AS (2008) Pilot Pipeline version 7.5. Accelrys Software Inc: San Diego Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B, Hassanali M (2008) DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res 36:D901– D906. doi:10.1093/nar/gkm958 Law V, Knox C, Djoumbou Y, Jewison T, Guo AC, Liu Y, Maciejewski A, Arndt D, Wilson M, Neveu V (2014) DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res 42:D1091–D1097. doi:10.1093/nar/gkt1068 McGregor MJ, Pallai PV (1997) Clustering of large databases of compounds: using the MDL “keys” as structural descriptors. J Chem Inf Comput Sci 37:443–448. doi:10.1021/ci960151e Butina D (1999) Unsupervised data base clustering based on daylight’s fingerprint and Tanimoto similarity: a fast and automated way to cluster small and large data sets. J Chem Inf Comput Sci 39:747–750. doi:10.1021/ci9803381 Leeson P (2012) Drug discovery: chemical beauty contest. Nature 481:455–456. doi:10.1038/481455a Ihlenfeldt W-D, Voigt JH, Bienfait B, Oellien F, Nicklaus MC (2002) Enhanced CACTVS browser of the open NCI database. J Chem Inf Comput Sci 42:46–57. doi:10.1021/ci010056s Irwin JJ, Sterling T, Mysinger MM, Bolstad ES, Coleman RG (2012) ZINC: a free tool to discover chemistry for biology. J Chem Inf Model 52:1757–1768. doi:10.1021/ci3001277 Zauhar RJ, Gianti E, Welsh WJ (2013) Fragment-based shape signatures: a new tool for virtual screening and drug discovery. J Comput Aided Mol Des 27:1009–1036. doi:10.1007/s10822-0139698-7 Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40:D1100–D1107. doi:10.1093/nar/gkr777 van Linden OP, Kooistra AJ, Leurs R, de Esch IJ, de Graaf C (2013) KLIFS: a knowledge-based structural database to navigate kinaseligand interaction space. J Med Chem 57:249–277. doi:10.1021/ jm400378w Kirchmair J, Markt P, Distinto S, Schuster D, Spitzer GM, Liedl KR, Langer T, Wolber G (2008) The Protein Data Bank (PDB), its related services and software tools as key components for in silico guided drug discovery. J Med Chem 51:7021–7040. doi:10.1021/ jm8005977 Mauser H, Stahl M (2007) Chemical fragment spaces for de novo design. J Chem Inf Model 47:318–324. doi:10.1021/ci6003652 Shneiderman B (1992) Tree visualization with tree-maps: 2-d space-filling approach. ACM Trans Graph 11:92–99. doi:10.1145/ 102377.115768 Krier M, Bret G, Rognan D (2006) Assessing the scaffold diversity of screening libraries. J Chem Inf Model 46:512–524. doi:10.1021/ ci050352v Kibbey C, Calvet A (2005) Molecular Property eXplorer: a novel approach to visualizing SAR using tree-maps and heatmaps. J Chem Inf Model 45:523–532. doi:10.1021/ci0496954 Clark AM (2009) 2D depiction of fragment hierarchies. J Chem Inf Model 50:37–46. doi:10.1021/ci900350h Cross JB, Thompson DC, Rai BK, Baber JC, Fan KY, Hu Y, Humblet C (2009) Comparison of several molecular docking programs: pose prediction and virtual screening accuracy. J Chem Inf Model 49:1455–1474. doi:10.1021/ci900056c

Mol Divers 52. Kontoyianni M, McClellan LM, Sokol GS (2004) Evaluation of docking performance: comparative data on docking algorithms. J Med Chem 47:558–565. doi:10.1021/jm0302997 53. Fraley ME, Arrington KL, Buser CA, Ciecko PA, Coll KE, Fernandes C, Hartman GD, Hoffman WF, Lynch JJ, McFall RC (2004) Optimization of the indolyl quinolinone class of KDR (VEGFR-2) kinase inhibitors: effects of 5-amido-and 5-sulphonamido-indolyl groups on pharmacokinetics and hERG binding. Bioorg Med Chem Lett 14:351–355. doi:10.1016/j.bmcl.2003.11.007 54. Dinges J, Akritopoulou-Zanze I, Arnold LD, Barlozzari T, Bousquet PF, Cunha GA, Ericsson AM, Iwasaki N, Michaelides MR, Ogawa N (2006) Hit-to-lead optimization of 1, 4-dihydroindeno [1, 2-c] pyrazoles as a novel class of KDR kinase inhibitors. Bioorg Med Chem Lett 16:4371–4375. doi:10.1016/j.bmcl.2006.05.052 55. Akritopoulou-Zanze I, Albert DH, Bousquet PF, Cunha GA, Harris CM, Moskey M, Dinges J, Stewart KD, Sowin TJ (2007) Synthesis and biological evaluation of 5-substituted 1, 4-dihydroindeno [1,

2-c] pyrazoles as multitargeted receptor tyrosine kinase inhibitors. Bioorg Med Chem Lett 17:3136–3140. doi:10.1016/j.bmcl.2007. 03.031 56. Dinges J, Ashworth KL, Akritopolou-Zanze I, Arnold LD, Baumeister SA, Bousquet PF, Cunha GA, Davidsen SK, Djuric SW, Gracias VJ (2006) 1, 4-Dihydroindeno [1, 2-c] pyrazoles as novel multitargeted receptor tyrosine kinase inhibitors. Bioorg Med Chem Lett 16:4266–4271. doi:10.1016/j.bmcl.2006.05.066 57. Dinges J, Albert DH, Arnold LD, Ashworth KL, AkritopoulouZanze I, Bousquet PF, Bouska JJ, Cunha GA, Davidsen SK, Diaz GJ (2007) 1, 4-Dihydroindeno [1, 2-c] pyrazoles with acetylenic side chains as novel and potent multitargeted receptor tyrosine kinase inhibitors with low affinity for the hERG ion channel. J Med Chem 50:2011–2029. doi:10.1021/jm061223o

123

Fragment virtual screening based on Bayesian categorization for discovering novel VEGFR-2 scaffolds.

The discovery of novel scaffolds against a specific target has long been one of the most significant but challengeable goals in discovering lead compo...
3MB Sizes 1 Downloads 18 Views