Improving data mining strategies for drug design.

Editorial
Special Focus: Computational Chemistry

Keywords: activity cliffs, advanced strategies, compound-activity data, compound promiscuity, computer-aided design, data mining, matched molecular pairs, scaffolds, structure–activity relationships

Jürgen Bajorath
Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology & Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstr. 2, D-53113 Bonn, Germany
Tel.: +49 228 2699 306
Fax: +49 228 2699 341

Future Med. Chem. (2014) 6(3), 255–257
10.4155/FMC.13.208 © 2014 Future Science Ltd
ISSN 1756-8919

In recent years, the looming ‘big data’ challenge has become an intensely discussed topic in computational biology [1]. In drug discovery, compound numbers and activity annotations have not yet approached the volumes of genomic sequence or gene expression data (and this might not happen in the foreseeable future). However, ‘big data’ issues are also beginning to be considered in pharmaceutical R&D [101]. In fact, we are currently witnessing an unprecedented growth of compound activity data, not only in proprietary pharmaceutical environments, but also in the public domain. Here, databases such as PubChem [2] or ChEMBL [3] have become major repositories for compound activity data from sources ranging from biological screening to medicinal chemistry, and have grown rapidly in size. For example, ChEMBL release 17 contains more than 1.3 million unique compounds with more than 12 million activity annotations for approximately 9300 targets [102], an almost astronomical number of activity/target annotations compared with the situation just a few years ago.

These data provide a highly attractive source of knowledge for pharmaceutical research. Therefore, it is not surprising that compound data mining plays an increasingly important role in knowledge extraction for drug discovery. Different data mining strategies have been introduced that are expanding the knowledge base for drug design and development. These also include text mining efforts to extract compound structures and associated information from the wealth of (heterogeneous) scientific and patent literature. Text mining methods represent a growth area in the data mining field, but will not be further reviewed in this concise editorial. In the following, exemplary compound data mining strategies that aid in drug design will be discussed.

An almost ‘classical’ form of database mining is the search for compounds that contain different core structures but have similar bioactivity. This refers to the well-known ‘scaffold hopping’ exercise in virtual compound screening [4], which generally aims to provide alternative starting points for hit-to-lead projects and chemical optimization efforts. In this context, global compound data mining has further refined our views of scaffold hopping potential and associated challenges. When scaffolds were systematically extracted from public domain compounds active against the current spectrum of pharmaceutical target proteins, it was found that, for many targets across different families, an abundance of scaffold hops was already present in known active compounds [5]. These findings indicated that many small-molecule targets are permissive to different compound classes and that detecting or predicting scaffold hops might in many instances be less challenging than often thought.

Going beyond virtual screening, various compound data mining strategies have been introduced that directly support structure–activity relationship (SAR) analysis and molecular design. For example, ‘activity cliffs’ are defined as pairs of structurally similar or analogous compounds sharing the same specific activity but having a large difference in potency [6]. Hence, SAR determinants can often be deduced from activity cliffs, which is an important aspect in lead optimization. The activity cliff concept is also amenable to data mining. For a global exploration of activity cliffs, compound similarity and potency difference criteria must be clearly defined [6]. Then, it is possible to systematically search for activity cliffs. Applying alternative cliff criteria, all activity cliffs formed by public domain compounds have recently been identified on a per-target basis [7]. It has been shown that activity cliffs are generally rare among pairs of structurally similar compounds, as expected, but that bioactive compounds frequently participate in the formation of cliffs. On average, 10–20% of active compounds in currently available data sets form activity cliffs. The SAR information associated with this pool of activity cliffs provides a significant knowledge base for compound design.
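
As a minimal sketch of how such a per-target search might be set up, the following Python example flags activity cliffs once a similarity criterion and a potency difference criterion have been fixed. Compounds are represented by hypothetical sets of fingerprint feature ids, similarity is the Tanimoto coefficient, and both thresholds (a similarity of at least 0.55 and a potency difference of at least two log units, i.e., 100-fold) are illustrative assumptions, not the criteria applied in [6,7].

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def find_activity_cliffs(compounds, min_similarity=0.55, min_potency_diff=2.0):
    """Return (id_1, id_2, similarity, potency_diff) for every cliff pair.

    compounds: dict mapping compound id -> (feature set, pKi) for ONE target.
    Both thresholds are illustrative assumptions.
    """
    cliffs = []
    for id_1, id_2 in combinations(sorted(compounds), 2):
        fp_1, p_1 = compounds[id_1]
        fp_2, p_2 = compounds[id_2]
        sim = tanimoto(fp_1, fp_2)
        diff = abs(p_1 - p_2)
        # A cliff requires BOTH high structural similarity and a large potency gap.
        if sim >= min_similarity and diff >= min_potency_diff:
            cliffs.append((id_1, id_2, sim, diff))
    return cliffs


# Toy data set (hypothetical feature ids and potencies, not taken from any database).
toy_target_set = {
    "cpd_A": ({1, 2, 3, 4, 5, 6}, 8.9),
    "cpd_B": ({1, 2, 3, 4, 5, 7}, 6.1),  # similar to A, much less potent
    "cpd_C": ({1, 2, 3, 4, 5, 8}, 8.5),  # similar to A, similar potency
    "cpd_D": ({10, 11, 12, 13}, 5.0),    # structurally unrelated
}

cliffs = find_activity_cliffs(toy_target_set)
# Compounds that take part in at least one cliff.
cliff_members = {cid for pair in cliffs for cid in pair[:2]}
```

In this toy set, the pairs cpd_A/cpd_B and cpd_B/cpd_C are reported as cliffs, whereas cpd_A/cpd_C (similar structure, similar potency) and all pairs involving cpd_D are not. Dividing the size of `cliff_members` by the number of compounds gives the proportion of cliff-forming compounds in a data set; the toy proportion is of course not representative of real data.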

Matched molecular pairs (MMPs) represent another emerging concept in medicinal chemistry [8] that is also highly relevant for data mining. An MMP is defined as a pair of compounds that differ only by a structural change at a single site, that is, the exchange of two substructures [8], a so-called ‘chemical transformation’ [9]. Applying computationally efficient algorithms for MMP generation [9], MMPs can be systematically extracted from database compounds, and changes in activity or other compound properties as a consequence of chemical transformations can be analyzed. For example, following this approach, bioisosteric replacements have been systematically extracted from known active compounds and many previously unknown bioisosteres have been identified [10], which can be utilized in compound design and optimization. In addition, MMP-based compound data mining has enabled the identification of all currently available ‘SAR transfer’ events [11]. SAR transfer involves series of corresponding analogs that contain different core structures but show similar potency progression. The replacement of one chemical series with another having a comparable SAR is an attractive, but often difficult, task in drug design and lead optimization. However, data mining has revealed the presence of many SAR transfer events in current bioactive compounds [11]. For given targets, transfer events can be studied to aid in the prioritization of scaffolds for series replacement. Moreover, on the basis of MMP analysis, physicochemical compound properties related to absorption, distribution, metabolism and excretion (ADME) have been evaluated and ADME effects predicted [12].
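
The MMP definition above can be illustrated with a short sketch, assuming that each compound has already been fragmented at single sites into a (core, substituent) pair by a cheminformatics toolkit. The fragment strings below are hypothetical and treated as opaque keys. Indexing compounds by their core is in the spirit of fragment-based MMP generation, but this is a simplified illustration, not an implementation of the algorithm in [9].

```python
from collections import defaultdict
from itertools import combinations


def find_mmps(fragmentations):
    """Pair up compounds that share a core and differ in one substituent.

    fragmentations: iterable of (compound_id, core, substituent) tuples, one per
    single-site fragmentation (assumed to be precomputed).
    Returns a list of (id_1, id_2, core, "sub_1>>sub_2") tuples.
    """
    by_core = defaultdict(list)
    for cid, core, sub in fragmentations:
        by_core[core].append((cid, sub))

    mmps = []
    for core, members in by_core.items():
        # Sorting by compound id fixes the direction of each transformation;
        # a production implementation would canonicalize the direction instead.
        for (id_1, sub_1), (id_2, sub_2) in combinations(sorted(members), 2):
            if id_1 != id_2 and sub_1 != sub_2:
                mmps.append((id_1, id_2, core, f"{sub_1}>>{sub_2}"))
    return mmps


def potency_change_by_transformation(mmps, potency):
    """Collect potency differences (second minus first compound) per transformation."""
    changes = defaultdict(list)
    for id_1, id_2, _core, transformation in mmps:
        changes[transformation].append(round(potency[id_2] - potency[id_1], 2))
    return dict(changes)


# Toy input: hypothetical fragment strings and pKi values.
toy_fragmentations = [
    ("cpd_1", "c1ccccc1[*]", "[*]Cl"),
    ("cpd_2", "c1ccccc1[*]", "[*]F"),
    ("cpd_3", "c1ccncc1[*]", "[*]Cl"),
    ("cpd_4", "c1ccncc1[*]", "[*]F"),
    ("cpd_5", "c1ccncc1[*]", "[*]OC"),
]
toy_potency = {"cpd_1": 6.0, "cpd_2": 7.5, "cpd_3": 5.5, "cpd_4": 7.0, "cpd_5": 5.6}

mmps = find_mmps(toy_fragmentations)
changes = potency_change_by_transformation(mmps, toy_potency)
```

Grouping the potency changes by transformation is what makes the analysis of activity changes possible: in the toy data, the same Cl-to-F exchange occurs on two different cores and can be compared across them.
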
The MMP formalism can also be utilized for the definition of activity cliffs, so-called MMP-cliffs, and for mining of cliffs [13]. MMP-cliffs are formed by pairs of compounds that are interconverted by small chemical transformations but have large potency differences. This structure-based activity cliff definition is chemically more conservative and usually more intuitive than cliffs defined on the basis of calculated similarity values. Through compound data mining, MMP-cliffs have been systematically identified in bioactive compounds [7,13], further supporting the chemical interpretability of cliff-associated SAR information.

Another emerging theme in drug discovery is ‘compound promiscuity’ [14], defined as the presence of specific interactions of a small molecule with multiple targets (as opposed to nonspecific binding events). As such, compound promiscuity is at the root of polypharmacology [15], which is often responsible for the efficacy of drug candidates and drugs including, for example, many protein kinase inhibitors applied in oncology. Compound promiscuity can be effectively assessed through mining of compound activity annotations [14]. Here, care should be taken to focus the analysis on high-confidence activity data. Otherwise, the degree of compound promiscuity might be overestimated, due to assay-dependent effects and/or ambiguous target assignments. For example, through data mining, promiscuous scaffolds have been detected and multitarget activity profiles generated for compounds represented by these scaffolds [16]. Such scaffolds and associated activity profiles can serve as templates for the design of compounds with desired multitarget activities. Systematic mining of high-confidence activity annotations has also revealed that the degree of promiscuity is, overall, lower among bioactive compounds than among drugs [14]. These findings suggest interesting opportunities for future research. For example, does this mean that promiscuous drug candidates are often preferentially selected during clinical evaluation?
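
The effect of restricting a promiscuity analysis to high-confidence data can be sketched as follows. The record fields (`compound`, `target`, `confidence`, `potency_nm`), the confidence scale and both cut-offs are assumptions made for this illustration and do not reproduce the selection criteria used in [14].

```python
from collections import defaultdict


def promiscuity_degrees(annotations, min_confidence=9, max_potency_nm=10_000.0):
    """Count distinct targets per compound using only high-confidence records.

    annotations: iterable of dicts with keys 'compound', 'target', 'confidence'
    (hypothetical score, higher is better) and 'potency_nm'.
    Both cut-offs are illustrative assumptions.
    """
    targets = defaultdict(set)
    for rec in annotations:
        if rec["confidence"] >= min_confidence and rec["potency_nm"] <= max_potency_nm:
            # A set ignores repeated measurements against the same target.
            targets[rec["compound"]].add(rec["target"])
    return {cpd: len(tset) for cpd, tset in targets.items()}


# Toy annotations (hypothetical compounds, targets and values).
toy_annotations = [
    {"compound": "cpd_X", "target": "kinase_1", "confidence": 9, "potency_nm": 12.0},
    {"compound": "cpd_X", "target": "kinase_2", "confidence": 9, "potency_nm": 85.0},
    {"compound": "cpd_X", "target": "kinase_2", "confidence": 9, "potency_nm": 120.0},
    {"compound": "cpd_X", "target": "gpcr_1", "confidence": 4, "potency_nm": 900.0},
    {"compound": "cpd_Y", "target": "protease_1", "confidence": 9, "potency_nm": 30.0},
    {"compound": "cpd_Y", "target": "kinase_1", "confidence": 9, "potency_nm": 50_000.0},
]

strict = promiscuity_degrees(toy_annotations)
# Disabling both filters shows how the apparent promiscuity grows.
loose = promiscuity_degrees(toy_annotations, min_confidence=0, max_potency_nm=float("inf"))
```

With the filters applied, cpd_X is counted against two targets and cpd_Y against one; without them, the counts rise to three and two, which illustrates how ambiguous target assignments and weak measurements can inflate the apparent degree of promiscuity.
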
Alternatively, are the target profiles of compounds that ultimately become drugs simply better characterized than those of ‘average’ bioactive compounds? In this context, it is worth noting that compound data mining can only reveal ‘what current data tell us’, but not what might be possible in principle. Accordingly, conclusions drawn from systematic data analysis concerning the activity or property profiles of compounds and their distributions are subject to revision when large amounts of new data, and/or data from previously unexplored sources, become available. However, given the large amounts of compounds and activity data that are already available at present, and the results of systematic analyses monitoring data growth, a number of conclusions drawn from data mining can be considered with confidence.

In summary, as discussed herein, compound data mining strategies have been increasingly refined in recent years by focusing on emerging concepts in medicinal chemistry and drug discovery, such as activity cliffs, MMPs or molecular promiscuity. A number of findings obtained through such data mining efforts have significant potential to impact SAR exploration and compound design. As compound activity data continue to grow, data mining will continue to play an important role going forward. A key challenge will be to seamlessly translate insights obtained from large-scale data analysis into further advanced drug-design strategies. The foundation for making such advances is being laid.

Financial & competing interests disclosure
The author has no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties. No writing assistance was utilized in the production of this manuscript.

References
1 Schadt EE, Linderman MD, Sorenson J, Lee L, Nolan GP. Computational solutions to large-scale data management and analysis. Nat. Rev. Genet. 11(9), 647–657 (2010).
2 Wang Y, Xiao J, Suzek TO et al. PubChem’s BioAssay database. Nucleic Acids Res. 40(database issue), D400–D412 (2012).
3 Gaulton A, Bellis LJ, Bento AP et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40(database issue), D1100–D1107 (2012).
4 Schneider G, Neidhart W, Giller T, Schmid G. ‘Scaffold-hopping’ by topological pharmacophore search: a contribution to virtual screening. Angew. Chem. Int. Ed. 38(19), 2894–2896 (1999).
5 Hu Y, Bajorath J. Global assessment of scaffold hopping potential for current pharmaceutical targets. Med. Chem. Commun. 1(5), 339–344 (2010).
6 Stumpfe D, Bajorath J. Exploring activity cliffs in medicinal chemistry. J. Med. Chem. 55(7), 2932–2942 (2012).
7 Stumpfe D, Bajorath J. Frequency of occurrence and potency range distribution of activity cliffs in bioactive compounds. J. Chem. Inf. Model. 52(9), 2348–2353 (2012).
8 Griffen E, Leach AG, Robb GR, Warner DJ. Matched molecular pairs as a medicinal chemistry tool. J. Med. Chem. 54(22), 7739–7750 (2011).
9 Hussain J, Rea C. Computationally efficient algorithm to identify matched molecular pairs (MMPs) in large data sets. J. Chem. Inf. Model. 50(3), 339–348 (2010).
10 Wassermann AM, Bajorath J. Large-scale exploration of bioisosteric replacements on the basis of matched molecular pairs. Future Med. Chem. 3(4), 425–436 (2011).
11 Zhang B, Wassermann AM, Bajorath J. Systematic assessment of compound series with SAR transfer potential. J. Chem. Inf. Model. 52(12), 3138–3143 (2012).
12 Papadatos G, Alkarouri M, Gillet VJ et al. Lead optimization using matched molecular pairs: inclusion of contextual information for enhanced prediction of HERG inhibition, solubility, and lipophilicity. J. Chem. Inf. Model. 52(10), 1872–1876 (2012).
13 Hu X, Hu Y, Vogt M, Stumpfe D, Bajorath J. MMP-cliffs: systematic identification of activity cliffs on the basis of matched molecular pairs. J. Chem. Inf. Model. 52(5), 1138–1145 (2012).
14 Hu Y, Bajorath J. Compound promiscuity: what can we learn from current data? Drug Discov. Today 18(13–14), 644–650 (2013).
15 Jalencas X, Mestres J. On the origins of drug polypharmacology. Med. Chem. Commun. 4(1), 80–87 (2013).
16 Hu Y, Bajorath J. Polypharmacology directed data mining: identification of promiscuous chemotypes with different activity profiles and comparison to approved drugs. J. Chem. Inf. Model. 50(12), 2112–2118 (2010).

Websites
101 Big Data in Pharma. http://acswebinars.org/big-data
102 ChEMBL. www.ebi.ac.uk/chembl
