Some solved and unsolved problems of chemoinformatics.


SAR and QSAR in Environmental Research Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/gsar20

Some solved and unsolved problems of chemoinformatics J. Gasteiger


Computer-Chemie-Centrum, University of Erlangen-Nuremberg, Erlangen, Germany Published online: 09 Apr 2014.

To cite this article: J. Gasteiger (2014) Some solved and unsolved problems of chemoinformatics, SAR and QSAR in Environmental Research, 25:6, 443-455, DOI: 10.1080/1062936X.2014.898688 To link to this article: http://dx.doi.org/10.1080/1062936X.2014.898688


SAR and QSAR in Environmental Research, 2014 Vol. 25, No. 6, 443–455, http://dx.doi.org/10.1080/1062936X.2014.898688

Some solved and unsolved problems of chemoinformatics J. Gasteiger* Computer-Chemie-Centrum, University of Erlangen-Nuremberg, Erlangen, Germany


(Received 1 September 2013; in final form 13 January 2014)

The field of chemoinformatics has developed from different roots, starting in the 1960s. These branches have now merged into a scientific discipline of its own, exchanging ideas and methods across different areas of chemistry. In the last 40 years chemoinformatics has achieved a lot. Without access to the databases in chemistry developed with chemoinformatics methods, modern chemical research would not be able to work at its present high level of competence. However, there are quite a few challenges, such as drug design and understanding the effect of chemicals on human health and on the environment, as well as furthering our knowledge of chemistry and of biological systems, that can benefit from a more intensive use of chemoinformatics methods. Approaches to meet these challenges will be briefly outlined. All this emphasizes that chemoinformatics has matured into a scientific discipline of its own that reaches out to many other chemical fields and will increase in attractiveness to students and researchers.

Keywords: chemical structure descriptors; prediction of properties; drug design; organic synthesis design; toxicity and metabolism prediction; risk assessment of chemicals

1. Introduction

The aim of this paper is to highlight many of the problems that chemoinformatics has solved in its more than 40 years of history [1, 2]. This should raise the spirit and pride of those working in chemoinformatics by recognizing that they are working in a field that has made decisive contributions to the development of chemistry and related scientific disciplines. On the other hand, it should also be realized that there are still many problems to be solved in the area of chemoinformatics. In fact, this simply emphasizes that chemoinformatics has matured into a scientific field of its own that will endure into the future. There are many open questions and interesting challenges in chemoinformatics waiting for novel ideas and powerful solutions to be developed. Furthermore, it emphasizes that chemoinformatics is an attractive field for students to join in order to meet these challenges.

Although the name ‘chemoinformatics’ did not appear until 1998, its history goes back to the 1960s, when various approaches to using computers for solving chemical problems were initiated [3, 4]. Representation of chemical structures for database storage, structure searching, molecular modelling, quantitative structure–activity relationships (QSAR), computer-assisted organic synthesis design (CASD) and computer-assisted structure elucidation (CASE) were some of the more prominent areas tackled by software developments. In spite of all the progress in these areas, many of these problems are still waiting for further improvements and will keep us busy for many years to come.

*Email: [email protected] © 2014 Taylor & Francis



Clearly, this paper cannot cover all the solved and unsolved problems in chemoinformatics; only some highlights can be discussed, with the selection of topics being somewhat personal. Further information can be found in journals devoted to this field such as the Journal of Cheminformatics, Molecular Informatics, SAR and QSAR in Environmental Research, or the Journal of Chemical Information and Modeling. The latter journal, in particular, has been around for quite some time – in fact 53 years – changing its title from Journal of Chemical Documentation through Journal of Chemical Information and Computer Science to Journal of Chemical Information and Modeling, thus reflecting the development of the field. Textbooks [1, 5] and a handbook [2] on chemoinformatics have been written, and a website providing a cheminformatics portal has been installed [6].

2. Achievements of chemoinformatics

2.1 Access to chemical information

One of the most outstanding achievements of chemoinformatics is that access to chemical information is provided without having to resort to visiting libraries which, because of financial restrictions, have only a limited stock of printed information. Chemical information has been stored in databases covering publications, chemical compounds together with their properties, or chemical reactions. This paper is certainly not the place to list all these valuable databases. Suffice it to say that nowadays it would be completely impossible to obtain an overview of chemical information without access to databases with chemical information. This need has become particularly acute owing to the vast increase of chemical information in the last few decades. Just one illustration: in 1970 about 2.5 million compounds were registered in the Chemical Abstracts Database (and by inference known at that time). Presently, in 2013, 73 million compounds and 64 million sequences (of proteins and nucleic acids) are registered in this database.
Chemists and other scientists might search in the CAS REGISTRY database of chemical compounds or in the Beilstein Database of organic compounds and their properties without being conscious that this has only been made possible by developments in chemoinformatics. What is most important is that these databases understand the language chemists are generally using, the graphical language of a structure diagram. Thus, one is not forced to use the name of a compound, which in most cases is quite cumbersome and error-prone. With the vast increase in inexpensive computer storage space it has become feasible to discard early compact fragment codes and store chemical structures with atomic resolution, providing access to each atom and each bond of a molecule. Thereby, it has become possible to search databases for chemical structures that contain any conceivable substructure, substructure searching in databases certainly being an extremely valuable method. Furthermore, the development of linear structure codes or identifiers such as SMILES or InChI has produced concise codes for chemical structures that allow the rapid exchange of information and searching on the internet.

2.2 Learning from chemical information

Large as these databases on chemical information have become, they will nevertheless always be incomplete. In particular, many properties of compounds may not yet be known. In this situation, the presently available information can be used for making predictions on unknown data. This process is a case of inductive learning: learning from data. There are two approaches to inductive learning: data-driven and model-driven.



In a model-driven, rule-based approach the available data are used to intellectually build a model and then use this model to make predictions. We used such an approach when developing the 3D structure generator CORINA [7]. From known 3D structures of organic molecules, rules were derived about bond lengths [8], bond angles, torsion angles and ring structures. Putting these rules into effect on new structures allows predictions to be made. It was shown that CORINA can convert chemical structures expressed as lists of atoms and bonds into 3D models with more than 99% conversion rate, even for datasets comprising millions of structures. Thus, one can say that CORINA has learned the rules that govern the 3D structure of any organic molecule. In a data-based approach data are put into context with other data, and doing this for an entire collection of compounds one might be able to generalize and thus learn about the relationships between data. A widely used method is to establish quantitative structure–activity or property relationships (QSAR, QSPR). Such an approach is made when the property of interest cannot be directly calculated from the structure of a compound, when the relationship between the structure and the property is too complicated to be derived by theoretical methods. Then, an indirect approach has to be taken, by first deriving structure descriptors from the compound of interest (Figure 1). By doing this for a series of related structures a dataset is obtained that can then be analysed by any data analysis method, such as from statistics, chemometrics, pattern recognition, machine learning or artificial neural networks [9]. Many approaches have been developed for deriving molecular descriptors for chemical structures. The book by R. Todeschini and V. Consonni summarizes 4885 types of molecular descriptors [10]. Our research group, too, has developed a series of structure descriptors [11]. 
However, we were careful in deriving only such structure descriptors that can be interpreted by combining geometric and physicochemical information. Thus, a hierarchy of descriptors is available, from whole molecule descriptors, to descriptors reflecting the 2D structure, the 3D structure, or molecular surface properties [11]. A large set of data analysis software has been developed, too numerous to list them all here. Suffice to say that packages of free software for data analysis are available, such as Weka [13] and R [14]. The QSPR and QSAR methodology has been applied to the prediction of a large set of physical, chemical or biological properties of compounds. It is certainly impossible to even attempt to give an overview of QSPR and QSAR applications in chemistry. Suffice to say that this methodology has been well established in this field and is used on a regular basis by many scientists in their daily work to make predictions on the properties of compounds they are interested in.

Figure 1. Establishing quantitative structure–property/activity relationships (QSPR, QSAR) through data-based learning.
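The workflow of Figure 1 (structure, then descriptor, then data analysis, then prediction) can be illustrated with a deliberately small sketch. It is not one of the models discussed in this paper: the descriptor (a carbon count read from a SMILES string of an unbranched alkane), the function names, and the choice of a one-variable least-squares line are all assumptions made for illustration, and the boiling points are approximate literature values that should be checked before any reuse.

```python
from statistics import mean

# Training set: SMILES of n-alkanes with approximate boiling points in deg C.
# Values are illustrative approximations, not curated data.
TRAINING = {
    "CCCC": -0.5,        # butane
    "CCCCC": 36.1,       # pentane
    "CCCCCC": 68.7,      # hexane
    "CCCCCCC": 98.4,     # heptane
    "CCCCCCCC": 125.7,   # octane
}

def carbon_count(smiles: str) -> int:
    """Toy structure descriptor: number of carbon atoms in an alkane SMILES."""
    return smiles.count("C")

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Step 1: derive descriptors; step 2: relate them to the property.
descriptors = [carbon_count(s) for s in TRAINING]
properties = list(TRAINING.values())
slope, intercept = fit_line(descriptors, properties)

def predict_boiling_point(smiles: str) -> float:
    """Step 3: apply the fitted relationship to a compound outside the dataset."""
    return slope * carbon_count(smiles) + intercept

print(round(predict_boiling_point("CCCCCCCCC"), 1))  # nonane, not in the training set
```

With these numbers the line predicts roughly 160 °C for nonane, whereas the literature value is about 151 °C. Even in this toy case the linear model overshoots on extrapolation, which is one reason why real QSPR work relies on richer descriptors and on validation against compounds not used for fitting.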


2.3 Applications of chemoinformatics

All fields of chemistry have the potential to benefit from applications of chemoinformatics. However, most areas of chemistry only use databases with chemical information, not exploiting the full potential that chemoinformatics can offer. Only analytical chemistry and drug design have seen a large use of chemoinformatics applications.


2.3.1 Drug design

Drug design is certainly by far the most prominent field where chemoinformatics methods have been and are heavily used. Chemoinformatics has become an integral part of the drug design process, assisting at several steps of drug development (Figure 2). For the identification of the target protein, both bioinformatics and chemoinformatics methods are used. Several methods, such as virtual screening or pharmacophore searching, have been developed for lead selection. For lead optimization, too, a variety of methods are used, such as QSAR for the prediction of biological activity, superimposition of 3D molecular models, and molecular docking of candidate molecules into the target protein. Preclinical testing has to ensure that properties other than the desired biological activity are acceptable for a candidate to serve as a drug. These properties are usually summarized under the term ADME-Tox, comprising properties that govern absorption, distribution, metabolism, excretion and toxicity. Many QSPR models have been developed for various aspects influencing these ADME-Tox properties. With good models for the prediction of ADME-Tox properties, preclinical testing can be done in a more focussed way and thus more rapidly and with less expense.

In drug design, these chemoinformatics methods are usually not applied in the strictly sequential order that Figure 2 implies. Rather, they are applied simultaneously in order to focus more rapidly on the most promising drug candidate. Chemoinformatics methods have helped to reduce the time and costs of drug development, and many new drugs have been developed with a decisive contribution from chemoinformatics.

3. Challenges for chemoinformatics

Chemoinformatics has the potential of assisting scientists working in any field of chemistry in their daily tasks. This potential is far from having been tapped; many fields of chemistry have not yet awakened to the area of chemoinformatics. Thus, it might be the task of chemoinformaticians to venture out into these untapped fields and develop methods of use in these domains of chemistry.

Figure 2. The basic steps in drug design.



The major task of chemistry is the production of compounds with desired properties, be it a drug, a dye-stuff, a plastic, a liquid crystal, an LED, etc. In this endeavour, three major tasks have to be addressed:

• what compound would have the desired property?
• how do I make this compound?
• what is the product of my reaction?


For these tasks, chemoinformatics has to develop methods for:

• structure–property relationships (QSPR, QSAR)
• synthesis design
• reaction prediction and structure elucidation

As mentioned in the previous section, methods to address some of these tasks have already been developed, but much more has to be done to assist chemists in solving these fundamental challenges. In recent years, the assessment of the risk of chemicals has become a major focus. In Europe, the REACH initiative (Registration, Evaluation, Authorization and restriction of CHemicals) became law on 1 June 2007. REACH requires that those chemicals used in quantities of more than 1 ton/year are only allowed to be manufactured or imported into the European Union if they are registered. It is estimated that about 35,000 chemicals have to be registered. For compounds used in quantities of more than 10 tons/year, a Chemical Safety Report must be submitted that addresses questions on harmful effects on human health, harmful effects on the environment, on persistence, bioaccumulation and toxicity and evaluation of exposure. Tests to answer these questions are time-consuming, expensive and might need many animals. In this situation, alternatives to animal testing methods are being sought. These can either be in vitro biological tests or in silico computer simulations. Thus, the REACH initiative has opened an important domain for applying chemoinformatics methods. Access to data is of crucial importance in this endeavour. Clearly, computer predictions of properties such as toxicity are still, for most endpoints, in their infancy. However, chemoinformatics methods could already be valuable for ranking compounds according to their risk of toxicity potential, thus providing guidelines for deciding which compounds have the highest risk and therefore should be tested with priority. A similar initiative is the Cosmetics Directive, also law in the European Union, which prohibits chemicals being used in cosmetics that have been tested on animals. Thus, cosmetics offer another highly important area for alternative testing methods such as chemoinformatics methods. 
In the endeavour to understand biological activity, toxicity and the risk of chemicals, research has been initiated into modelling entire human organisms by bioinformatics and chemoinformatics methods, such as in the Virtual Liver Project (http://www.virtual-liver.de/wordpress/en/). Clearly, the endeavour to model entire human organisms provides many challenges for chemoinformatics, foremost the modelling of biochemical reactions of both endogenous metabolism and the metabolism of xenobiotics. A similar view was expressed in 2009 by David Wild, as editor, in the opening statement of the Journal of Cheminformatics (http://www.jcheminf.com/content/1/1/1), which identified four “grand challenges” for chemoinformatics:

• overcoming stalled drug discovery
• green chemistry and global warming
• understanding life from a chemical perspective
• enabling the network of the world’s chemical and biological information to be accessible and interpretable


Drug design, REACH and the Cosmetics Directive emphasize that the general public is increasingly interested in the effect that chemicals have on human health and on the environment. Collecting more data on these topics, making these data generally accessible, and finally making best use of these data by exploiting them by chemoinformatics methods is of major interest to society.

4. Problems still to be solved

In this section, we will present problems that still have to be solved. In some cases, we will indicate research being currently undertaken to tackle these problems. We are doing this in order to stimulate more work on these tasks, and to show inroads that could be pursued to solve these problems.

4.1 Access to chemical information

As widespread and easy to use as structure editors have become in drawing chemical structures as entry points into databases, it should nevertheless be realized that the drawing of a structure diagram by a structure editor is quite slow, certainly slower than drawing it by hand. Therefore, we should strive for easier methods to be used for communication with databases, such as hand-drawing of chemical structures or using voice to name a compound.

Furthermore, there are chemical structures that cannot be represented well by a conventional structure diagram. A structure diagram is essentially a valence bond representation of a compound. There are structures that cannot be well represented in a single valence bond notation, such as boranes, many organometallic compounds like ferrocene, or the difference between a singlet and a triplet carbene. For such structures, a representation that somehow reflects a molecular orbital representation is more appropriate. Such a representation has been developed [15], and it has been made the basis of the chemoinformatics platform of the Molecular Networks company [16]. There are other chemical structures that still await improvements in their handling, such as the Markush structures for representing collections of chemical structures in patents. Furthermore, compounds exist that defy representation by a structure diagram, such as polymers. Here, other forms of characterization of the chemical structure have to be found (e.g. by spectral data).

The acquisition of chemical information, too, still has a long way to go. Much information is only available in printed form.
Scanning these documents does not solve the problem, because the information has to be in a structured form. Methods for optical character recognition have to be developed or improved for converting printed structure diagrams into computer-readable connection tables. Text-mining methods have to parse the text to find chemical names (that have to be converted into connection tables), properties, reaction information, etc. A great deal of information is becoming increasingly available on the internet, and this information should go into databases. The internet offers new and advanced ways for publishing, such as directly giving access to 3D structure information or spectral data (an IR spectrum has much more inherent information than the short list of IR peaks in a publication!).




The information in databases not only has to grow, it also has to be improved. Many entries in databases are full of errors, regarding both the representation of chemical structures and the data contained in the databases. For compounds all known properties should be stored, including spectra. The information in reaction databases is notoriously incomplete. Information should be given on all compounds involved, that is, the full stoichiometry of a reaction has to be given, and the byproducts and their amounts should also be specified. Furthermore, all reaction conditions should be indicated, such as solvent, reaction temperature, reaction time, and concentration of starting materials. Only when we have this information can we really learn the rules that govern chemical reactions. Clearly, much of this information is not available in even printed form in publications. However, decisive progress can be made here by storing all information known to the experimenter in electronic form. This has become feasible by the introduction of electronic laboratory notebooks (eLNs). Methods then have to be developed to transfer the information from eLNs into databases of structures, of spectra, and of reactions. This offers a new dimension in chemical information access and processing. Figure 3 shows how chemical information generated in an experiment is currently handled until it ends up in a reaction database (a reaction database serving simply as an example for the general task of transferring information from the experiment into a database). Many breaks in information technology occur on the path from the experiment to a database. At many points, the information is again written onto paper or manually entered into documents or databases, with the danger that errors are introduced in the new document or that important information is omitted. 
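The reaction details called for above (full stoichiometry, byproducts with their amounts, and all reaction conditions) can be made concrete as a record type with a completeness check. This is a minimal sketch under stated assumptions: the class names, field names and units are hypothetical and do not correspond to the schema of any existing reaction database or eLN product.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Component:
    name: str
    role: str                          # "reactant", "product" or "byproduct"
    amount_mmol: Optional[float] = None

@dataclass
class ReactionRecord:
    components: list = field(default_factory=list)
    solvent: Optional[str] = None
    temperature_c: Optional[float] = None
    time_h: Optional[float] = None
    concentration_mol_l: Optional[float] = None

    def missing_information(self) -> list:
        """List the items named in the text that this record does not supply."""
        missing = []
        # Reaction conditions that were left empty.
        for attr in ("solvent", "temperature_c", "time_h", "concentration_mol_l"):
            if getattr(self, attr) is None:
                missing.append(attr)
        # Roles for which no compound was recorded at all.
        roles = {c.role for c in self.components}
        for role in ("reactant", "product", "byproduct"):
            if role not in roles:
                missing.append(f"no {role} listed")
        # Compounds recorded without an amount, so stoichiometry is incomplete.
        missing.extend(
            f"amount of {c.name}" for c in self.components if c.amount_mmol is None
        )
        return missing

# A record with only a reactant, a product and a solvent (placeholder names).
record = ReactionRecord(
    components=[
        Component("starting material", "reactant", 10.0),
        Component("main product", "product"),
    ],
    solvent="ethanol",
)
print(record.missing_information())
```

The check treats a missing byproduct entry as a gap, which is a simplification: a real scheme would need a way to state explicitly that no byproduct was observed. If such a check ran inside an eLN at the time of the experiment, gaps could be closed by the experimenter rather than discovered later by a database curator.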
eLNs would offer the possibility that the information deposited during the experiment could seamlessly migrate to a series of different information sources all the way to a reaction database without the need for laying down the information again and again by rewriting (Figure 4). In this way, a continuous flow from the information producer to the information consumer could be established without any breaks in information technology. This would ensure that the person who knows an experiment best – the person performing the experiment – can control what goes into a database and therefore secure a maximum of information quality. eLNs have already been introduced to a large extent in the chemical industry, whereas academia is still lagging behind. Furthermore, software for transferring the information from eLNs into databases will, in most cases, still have to be developed.

Figure 3. The many breaks in information processing from an experiment to a reaction database as presently done.

Figure 4. A continuous flow of information in a future approach from an electronic laboratory notebook to a reaction database.

4.2 Learning from chemical information

In QSPR and QSAR applications the emphasis must increasingly move from building models with good predictivity to models suitable for interpretation. The establishment of QSPR and QSAR models should allow us to learn about inherent relationships, to further our understanding of the laws governing the relationships between chemical structures and their properties or biological activity. In order to achieve this, structure descriptors must be selected that are suited to interpretation and convey chemical meaning. Clearly, chemical substructures, as expressed by functional groups, have a lot of meaning for a chemist. However, the chemical environment of a substructure might strongly influence the contribution of a substructure to the property of interest, such as chemical reactivity. We have therefore embarked on enriching substructures with physicochemical properties, such as partial charges on the atoms and bonds of a substructure, to use them for better modelling properties such as chemical reactivity or toxicity. The new representations are termed “chemotypes”. They have been used as QSAR descriptors in the FDA CERES (Chemical Evaluation and Risk Estimation System) project for various toxicity endpoints including Ames mutagenicity, rodent carcinogenicity, and reproductive developmental (pregnancy loss and cleft palate) endpoints [17]. A mechanistic skin irritation model has also been published using the chemotypes [18]. ToxPrint is a set of chemotypes developed by Yang and FDA collaborators [19].
Under an FDA contract, the software application ChemoTyper has also been developed to enable searching with ToxPrint [20]. The software ChemoTyper containing the ToxPrint data is available at www.chemotyper.org and www.toxprint.org.
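The difference between a plain substructure and a substructure enriched with a physicochemical property can be shown with a toy example. The sketch below is not the ToxPrint or ChemoTyper representation: the classes, the carbonyl pattern, the charge threshold and the partial charges in the example molecule are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass

@dataclass
class Atom:
    element: str
    partial_charge: float  # hypothetical value; any charge model could supply it

@dataclass
class Molecule:
    atoms: list
    bonds: list  # tuples (atom index i, atom index j, bond order)

def carbonyl_carbons(mol: Molecule) -> list:
    """Plain substructure: indices of carbon atoms double-bonded to oxygen."""
    hits = []
    for i, j, order in mol.bonds:
        if order != 2:
            continue
        elements = (mol.atoms[i].element, mol.atoms[j].element)
        if elements == ("C", "O"):
            hits.append(i)
        elif elements == ("O", "C"):
            hits.append(j)
    return hits

def electrophilic_carbonyls(mol: Molecule, min_charge: float = 0.3) -> list:
    """Enriched substructure: carbonyl carbons whose partial charge reaches a threshold."""
    return [i for i in carbonyl_carbons(mol)
            if mol.atoms[i].partial_charge >= min_charge]

# Hypothetical molecule with two carbonyl groups in different environments.
mol = Molecule(
    atoms=[Atom("C", 0.45), Atom("O", -0.40), Atom("C", 0.15), Atom("O", -0.35)],
    bonds=[(0, 1, 2), (2, 3, 2), (0, 2, 1)],
)
print(carbonyl_carbons(mol))         # both carbonyl carbons match the bare pattern
print(electrophilic_carbonyls(mol))  # only the more positively charged one remains
```

Both carbonyl groups match the bare pattern, but only one satisfies the charge condition. A descriptor built this way distinguishes chemical environments that a functional-group count would treat as identical.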




In choosing a data analysis method, care should be taken to use a method that is not a black box but one whose results are open to interpretation. One should sacrifice accuracy of prediction in favour of interpretability of a model. In most cases, the data to be modelled are not sufficiently accurate anyway to achieve total accuracy. Furthermore, each dataset should be analysed by several different data analysis methods, in order to avoid the traps of the peculiarities of a specific method.

Clearly, the quality of the data is of crucial importance in the endeavour of establishing a QSPR or QSAR model. Erroneous data might have a devastating influence on the quality of a model. Methods for certifying data and for finding errors in the data have to be established. It is estimated that databases contain between 15% and 40% errors. Presently, in many cases the only solution might be to go back and consult the original publications. Automatic data-mining methods that directly extract the data from the original publications might offer less time-consuming approaches.

Furthermore, it should not be forgotten that a model is just a model of reality, a model that is based on the available information. Any prediction based on a model should therefore be considered with caution and not be over-interpreted. With more – and better – data becoming available, it might be possible to develop much better models.

4.3 Applications

Clearly, we do not have the space here to show all possible applications of chemoinformatics in chemistry and related scientific fields such as toxicology, biology, medical science, or material science. Even when considering only chemistry proper, any area of chemistry, be it analytical, inorganic, medicinal, organic, physical, or theoretical chemistry or chemical engineering, will face tasks whose solution can be fostered by utilization of chemoinformatics. Many new types of applications will be found using one’s own imagination.
Here, we want to highlight a few areas, and within an area only a few topics where chemoinformatics methods could assist scientists to solve their problems.

4.3.1 Drug design

Drug design will continue to be a very important field for the application of chemoinformatics methods. The problem of conformational flexibility of chemical structures is still open for further ideas and developments. The receptors for different biological activities might bind different conformations of the same molecule, which in most cases is not the conformation of lowest energy. Molecular docking is being used for finding the appropriate conformation. However, much is still to be done to make molecular docking an efficient method. The flexibility of the protein has also to be taken into account. The scoring functions for assessing the quality of docking have certainly to be improved. Using several scoring functions and then employing consensus scoring, that is, making a decision on the basis of the results obtained from the majority of functions, is a sign that we have not yet understood binding. We certainly cannot decide on a scientific problem by voting. Rather, the various steps involved in docking, desolvation of the drug candidate, conformational changes of drug and protein target, and binding through electrostatic, hydrophobic, and hydrogen bonding effects have to be modelled by their physicochemical processes. Clearly, this is not a simple and easy task, and it will occupy scientists for some time.
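To make the criticism of consensus scoring concrete, the majority decision described above can be written down in a few lines. This is a minimal sketch, not the procedure of any particular docking program: the scoring-function names, the scores, the cutoffs and the convention that lower scores are better are all assumptions introduced for illustration.

```python
def consensus_vote(scores: dict, cutoffs: dict) -> bool:
    """Accept a docking pose if a majority of scoring functions accept it.

    scores  : scoring-function name -> score (assumed: lower is better)
    cutoffs : scoring-function name -> acceptance threshold
    """
    votes = [scores[name] <= cutoffs[name] for name in scores]
    return sum(votes) > len(votes) / 2

# Hypothetical scoring functions and values.
cutoffs = {"score_a": -6.0, "score_b": -6.0, "score_c": -6.0}
pose_1 = {"score_a": -9.1, "score_b": -7.5, "score_c": -3.0}
pose_2 = {"score_a": -6.5, "score_b": -4.0, "score_c": -5.0}

print(consensus_vote(pose_1, cutoffs))  # two of three functions accept this pose
print(consensus_vote(pose_2, cutoffs))  # only one of three accepts this pose
```

The vote accepts the first pose even though one function rates it as clearly poor. The disagreement between the functions, which is the scientifically interesting part, disappears in the yes-or-no answer; this is the weakness that the physicochemical modelling argued for above would have to address.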


4.3.2 Organic chemistry


The field of organic chemistry offers many opportunities for using chemoinformatics methods over and beyond just querying databases – which have become an integral part of the daily work of organic chemists.

4.3.2.1 Design of organic syntheses. It has already been said that CASD was one of the roots of chemoinformatics, with a few groups initiating work in the late 1960s and early 1970s. Although a lot of work has been devoted, by a few pivotal research groups at Harvard, Stony Brook, Munich, Erlangen, Brandeis, Toyohashi and Tokyo, to the development of systems for synthesis design, no system is widely used by chemists. The reasons for this are many; foremost are, first, that the complexity of the task may have been underestimated, and second, that organic chemists were hesitant to accept computers in doing synthesis design because they wanted to undertake this challenging task intellectually, by themselves. However, the statement made by Herbert Gelernter in 1973 (private communication) is still true: “The amount of information to be processed and the decisions between many alternatives suggests the use of computers in organic synthesis design.” Clearly, synthesis design for an organic compound is a challenging task that needs a lot of insight into chemical reactions and chemical reactivity; it has to develop efficient strategies for constructing the synthesis target from available starting materials, and it has to develop synthesis pathways that are short and need inexpensive starting materials, and should avoid toxic or dangerous intermediates. Modelling all these aspects by software is quite a challenge, but a challenge that should be taken up by chemoinformatics. For, as Gelernter observed, a chemist can hardly maintain an overview any longer of all the available information, particularly on chemical reactions, and cannot see all the different strategies and pathways that could be followed in synthesizing a compound.
Of eminent importance for CASD would be to get access to better reaction databases, as was outlined in section 4.1. In fact, I believe that the best approach to designing the synthesis of an organic compound would be in tackling this task as a team, with software for organic synthesis design giving an overview of all available information and presenting all alternatives, and the chemist-user making decisions as to which steps and pathways to follow, drawing on her/his potential for lateral thinking.

4.3.2.2 Structure elucidation. Work on CASE was initiated in the late 1960s, and thus can also be considered as one of the origins of chemoinformatics. The idea is to bring all available spectral data on an unknown chemical compound together and use this information – usually together with a structure generator – to draw conclusions about the chemical structure of this compound. In this way, more or less the traditional procedure and strategy of a chemist should be followed. The advantage of using computers lies in their capacity for storing and processing large amounts of spectral information. Whereas in the beginnings of CASE spectra had to be manually digitized, nowadays all spectra have become available in computer-readable form, and with many more details, directly from the spectrometer. This offers the opportunity for working with a much higher level of information. Bringing the pieces of information from the various spectral data together at the appropriate time and situation is a challenging task to model, a task that should be met by chemoinformaticians. CASE software could assist chemists in their daily work and free them to pursue more innovative work.




4.3.3 Biochemistry

Great efforts are being made at the European Bioinformatics Institute in building a huge repository of chemical structures, proteins, nucleic acids, reactions and pathways (http://www.ebi.ac.uk/services/). Understanding biochemical reactions is of paramount importance in drug design, in metabolic engineering, and in the risk assessment of chemicals. Biochemical reactions occur in a variety of species, from microbes all the way to humans, and they do so in different ways, being catalyzed by different enzymes. Increasingly, information on biochemical reactions is becoming available in databases. However, what has been said about reaction databases in section 4.1 applies here also: the quality of databases on biochemical reactions has to be improved by storing more details on the individual reactions. We have shown in several publications with the BioPath.Database [21] how we can increase our knowledge of biochemical reactions [22–25]. Real progress in this field can best be made by collaborations between chemoinformatics and bioinformatics. We have shown this by bringing together information on biochemical reactions with genome information [24, 25].

4.3.4 Risk assessment of chemicals

Society is becoming increasingly interested in the effects that chemicals have on living species and on the environment. Legislation introduced in the European Union, such as REACH and the Cosmetics Directive mentioned in section 3, is a vivid indication of this trend. This legislation asks for the development of alternatives to animal testing, with chemoinformatics methods playing a pivotal role in this endeavour. The chemical, pharmaceutical and cosmetics industries have responded to this challenge, and have initiated projects in collaboration with academic research groups and small and medium-sized enterprises (SMEs) to develop chemoinformatics and bioinformatics methods for the prediction of toxicities and of metabolism, and for assessing the risk of chemicals.
Within the Innovative Medicines Initiative (IMI) funded by the European Union and the European Federation of Pharmaceutical Industries and Associations, a consortium of pharmaceutical companies, academic groups, and SMEs has joined forces to work on the eTOX project for modelling toxicity and metabolism of drug candidates (http://www.etoxproject.eu/). The challenge set by the Cosmetics Directive has led the association Cosmetics Europe and the European Union to provide funding for the SEURAT cluster of projects: Safety Evaluation Ultimately Replacing Animal Testing (SEURAT) (http://www.seurat-1.eu/). Within this cluster the COSMOS consortium aims at the development of “Integrated in silico models for the prediction of human repeated-dose toxicity of cosmetics to optimize safety” (http://www.seurat-1.eu/pages/cluster-projects/cosmos.php). The COSMOS project is developing a repeated-dose toxicity database for cosmetics-related chemicals, alternative methods such as the Threshold of Toxicological Concern (TTC), and innovative modelling approaches addressing oral-to-dermal exposures by consideration of biokinetics and metabolism. The methods and tools will be made public through both KNIME nodes and COSMOS DB [26]. Through the COSMOS KNIME node, a public community version of MOSES descriptors and ToxPrint chemotypes will also be made available.

Several other projects have been initiated in Europe, the USA and Japan. All these projects have made much progress in furthering and modelling our understanding of toxicity, metabolism, and the risk of chemicals. However, they can only be considered as inroads into these domains; much further work and research waits to be done. Interesting and challenging topics are the establishment of Adverse Outcome Pathways, the development of models for the toxicity of nanomaterials, the analysis of in vitro information, and the question of how to deal with mixtures.

5. Summary

In the last 50 years, chemoinformatics has developed as a field of its own at the interface of chemistry and computer science, branching out to biology, bioinformatics, toxicology and environmental sciences [1–4]. The achievements of chemoinformatics have been enormous; without databases on chemical information, emanating from the work of chemoinformaticians, modern chemical research at the present high level of competence would not be possible. However, we have somehow failed to emphasize the achievements of chemoinformatics enough, and therefore need to increase its level of recognition within the chemical community.

Great as the achievements of chemoinformatics are, there is still a lot of work to do. The challenges are great in processing and understanding chemical information, in assisting scientists in solving the problems they are working on, and in answering the questions posed by society at large. We should not shy away from these challenges, but should see them as a sign that chemoinformatics is indeed a mature scientific discipline that has a domain, a focus, and tasks of its own. In this sense, it is also attractive for new students to delve into this scientific discipline. Therefore, our efforts in teaching have to increase, both in training specialists in chemoinformatics and in making sure that the major topics of chemoinformatics are integrated into all regular chemistry curricula.

The scientific and political goals of chemoinformatics can be better achieved if all experts in this field come together and exchange ideas and achievements more closely, over and beyond scientific meetings. There is a society that has included this on its agenda: the Chemoinformatics and QSAR Society (http://www.qsar.org/). It needs further input and the more active involvement of many more persons.

Acknowledgements

This paper was presented at the 7th CMTPI conference in Seoul, 8–12 October 2013. I gratefully acknowledge the contributions of my co-workers, both from my research groups at the Technical University of Munich and the University of Erlangen-Nuremberg and from the company Molecular Networks GmbH, who have accompanied me on my journey into chemoinformatics in the last 40 years and have helped me shape this field. I am also grateful to the many projects funded by public institutions, most recently the eTOX project funded by the European Union and the Innovative Medicines Initiative and the COSMOS project funded by the European Union and Cosmetics Europe. Our recent work in Molecular Networks has greatly benefitted from our collaboration with Dr. Chihae Yang and Altamira LLC.

References

[1] J. Gasteiger and T. Engel, Chemoinformatics – A Textbook, Wiley-VCH, Weinheim, 2003.
[2] J. Gasteiger, Handbook of Chemoinformatics, 4 volumes, Wiley-VCH, Weinheim, 2003.
[3] J. Gasteiger, Chemoinformatics – A New Field with a Long Tradition, Anal. Bioanal. Chem. 384 (2006), pp. 57–64.
[4] W.L. Chen, Chemoinformatics – Past, Present, and Future, J. Chem. Inf. Model. 46 (2006), pp. 2230–2255.
[5] A.R. Leach and V. Gillet, An Introduction to Chemoinformatics, Kluwer Academic, Dordrecht, 2003.
[6] ICEP – Indiana Cheminformatics Education Portal; http://icep.wikispaces.com/.
[7] J. Gasteiger, C. Rudolph, and J. Sadowski, Automatic Generation of 3D-Atomic Coordinates for Organic Molecules, Tetrahedron Comp. Method. 3 (1992), pp. 537–547.
[8] The software CORINA, Molecular Networks, Erlangen, Germany; http://www.molecular-networks.com/products/corina.
[9] J. Zupan and J. Gasteiger, Neural Networks in Chemistry and Drug Design, 2nd Edition, Wiley-VCH, Weinheim, 1999.
[10] R. Todeschini and V. Consonni, Molecular Descriptors for Chemoinformatics, 2 volumes, Wiley-VCH, Weinheim, 2009.
[11] J. Gasteiger, Of molecules and humans, J. Med. Chem. 49 (2006), pp. 6429–6434.
[12] ADRIANA.Code, Molecular Networks, Erlangen, Germany; software available at http://www.molecular-networks.com/products/adrianacode; a free version MOSES.Descriptors Community Edition for a subset of descriptors can be accessed at http://www.molecular-networks.com/services/mosesdescriptors.
[13] Weka software suite for data mining; available at http://www.cs.waikato.ac.nz/ml/weka/.
[14] The R project for Statistical Computing; available at http://www.r-project.org/.
[15] S. Bauerschmidt and J. Gasteiger, Overcoming the limitations of a connection table description: A universal representation of chemical species, J. Chem. Inf. Comput. Sci. 37 (1997), pp. 705–714.
[16] MOSES extensive Chemoinformatics Platform, Molecular Networks GmbH; available at http://www.molecular-networks.com/moses.
[17] L. Ye, R. Brown, E. Busta, B. Mugabe, K. Arvidson, and C. Yang, Implementing computational methods in an institutional knowledge-base at FDA’s Center for Food Safety and Applied Nutrition: A mode-of-action-based approach to building QSAR models, 50th National Meeting (2011), Society of Toxicology, Washington DC.
[18] M. Leist, B.A. Lidbury, C. Yang, P.J. Hayden, J.M. Kelm, S. Ringeissen, A. Detroyer, J.R. Meunier, J.F. Rathman, G.R. Jackson, Jr., G. Stolper, and N. Hasiwa, Novel technologies and an overall strategy to allow hazard assessment and risk prediction of chemicals, cosmetics, and drugs with animal-free method, Altex 29 (2012), pp. 373–388.
[19] C. Yang, K. Arvidson, A. Detroyer, J. Gasteiger, J. Marusczyk, J. Rathman, A. Richard, S. Ringeissen, C. Schwab, A. Tarkhov, and A. Worth, Chemotypes: A new structure representation standard for incorporating atom/bond properties into structural alerts for toxicity effects and mechanism, 52nd National Meeting (2013), Society of Toxicology, San Antonio, TX.
[20] A. Tarkhov, J. Marusczyk, C. Yang, L. Terfloth, B. Bienfait, O. Sacher, T. Kleinoeder, T. Magdziarz, C.H. Schwab, K. Arvidson, A. Richard, A. Worth, J.F. Rathman, and J. Gasteiger, manuscript in preparation.
[21] BioPath.Database, Molecular Networks; software available from http://www.molecular-networks.com/databases/biopath; the databases can be freely explored: http://www.molecular-networks.com/biopath3/.
[22] M. Reitz, A. von Homeyer, and J. Gasteiger, Query generation to search for inhibitors of enzymatic reactions, J. Chem. Inf. Model. 46 (2006), pp. 2324–2332.
[23] O. Sacher, N. Reitz, and J. Gasteiger, Investigations of enzyme-catalyzed reactions based on physicochemical descriptors applied to hydrolases, J. Chem. Inf. Model. 49 (2009), pp. 1525–1534.
[24] G. Kastenmüller, J. Gasteiger, and H.W. Mewes, An environmental perspective on large-scale genome clustering based on metabolic capabilities, Bioinformatics 24 (2009), pp. i56–i62.
[25] G. Kastenmüller, M.E. Schenk, J. Gasteiger, and H.W. Mewes, Uncovering metabolic pathways relevant to phenotypic traits of microbial genomes, Genome Biology 10 (2009), R28.
[26] S. Anzali, M.R. Berthold, E. Fioravanzo, D. Neagu, A.R. Péry, A.P. Worth, C. Yang, M.T.D. Cronin, and A.N. Richard, Development of computational models for the risk assessment of cosmetic ingredients, IFSCC Magazine 15 (2012), pp. 249–255.
