Drug Discovery Today: Technologies
Vol. 1, No. 2 2004
Editors-in-Chief: Kelvin Lam – Pfizer, Inc., USA; Henk Timmerman – Vrije Universiteit, The Netherlands
The role of bioinformatics in target validation
Paul A. Whittaker
Novartis Institutes for Biomedical Research, Respiratory Disease Area, Horsham, West Sussex, UK RH12 5AB
Bioinformatics is being increasingly used to support target validation by providing functionally predictive information mined from databases and experimental datasets using a variety of computational tools. The predictive power of these complementary approaches is strongest when information from several techniques is combined, including experimental confirmation of predictions. The aim of this review is to highlight and discuss the key approaches available in this rapidly developing area to facilitate selection of the appropriate tools and databases.

Introduction
Genomic, transcriptomic and proteomic technologies are currently driving the pharmaceutical industry’s search for novel targets that will result in innovative therapies. Building up the case that drug modulation of a target is likely to have a beneficial effect in a given disease (target validation) is a key step in this process and combines data from molecular biology, cell biology, bioinformatics and in vitro and in vivo experiments, with the amount of work needed for validation increasing dramatically for “novel” targets with no known biological function or link to disease. Although experimental work is the key driver in target validation, bioinformatics is playing an increasingly important role in supporting this process as biological knowledge is “mined” from the numerous databases containing data on DNA sequences, protein structures, pathways, organisms and disease that exist to uncover disease-links and provide clues to biological function (Fig. 1). The hypotheses developed as a result of these efforts can then be tested experimentally. This review focuses on the increasingly sophisticated in silico approaches that are being used to support target validation.

1740-6749/$ © 2004 Elsevier Ltd. All rights reserved.

Section Editors: Luis Menéndez-Arias – Universidad Autónoma de Madrid, Cantoblanco, Madrid, Spain; Pierre Chatelain – UCB S.A., Braine-l’Alleud, Belgium; Bernard Masereel – University of Namur, Namur, Belgium
With the advent of genomics, transcriptomics, proteomics, etc. bioinformatics has achieved prominence because of its central role in data storage, management and analysis. The importance of bioinformatics in target validation is justified because a rational and efficient mining of the information that integrates knowledge about genes and proteins is necessary for linking targets to biological function. In addition, new developments in bioinformatics will be helpful to infer structural information from raw sequence data, guiding the identification or design of target-specific ligands.
Predicting function from sequence and structure
The most commonly used approach to assigning function to proteins is sequence similarity, but this approach has its limitations (Box 1). Attention has therefore focussed on complementing and extending it with other methods of function prediction that use sequence and structural information.
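As an illustration of what a similarity score measures, sequence similarity between two proteins is often summarised as percent identity over an alignment. The following is a minimal Python sketch, not the method used by BLAST or any other tool named in this review. It assumes the two sequences have already been aligned and that gaps are written as "-". The function name `percent_identity` and the toy sequences are hypothetical. Several conventions exist for the denominator; this sketch counts only columns in which neither sequence has a gap.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumption: the inputs are two rows of an existing pairwise alignment,
    so they must have the same length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0   # columns with a residue in both sequences
    identical = 0  # compared columns where the residues match

    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == gap or res_b == gap:
            continue  # skip any column that contains a gap
        compared += 1
        if res_a == res_b:
            identical += 1

    if compared == 0:
        return 0.0  # nothing to compare
    return 100.0 * identical / compared


if __name__ == "__main__":
    # Hypothetical toy alignment, for illustration only.
    query = "MKT-AYIAKQR"
    hit = "MKTLAYLAK-R"
    print(f"{percent_identity(query, hit):.1f}% identity")
```

A score of this kind says nothing by itself about shared function, which is why the limitations of similarity methods motivate the complementary approaches discussed in this review.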
Figure 1. Target validation involves linking putative targets to biological function in healthy and diseased states. Because of the “fuzziness” of the term function, this process involves combining data from a range of experimental studies aimed at identifying: (1) what proteins the target interacts with; (2) which biochemical pathway(s) or signalling network(s) the target participates in; and (3) the biological effect of removing the target by generating knockout mice, or reducing the level of the target in cultured cells by knocking down the RNA transcript for the target. In addition, structural, genetic and expression studies can provide valuable information for target validation. By identifying small molecule inhibitors of the target, chemogenomics approaches can be used to gain insights into biological function and disease relevance. Bioinformatics is assuming an increasingly important role in the target validation process by analysing and integrating the datasets from these experimental studies and by making predictions about target function based on “mining” of genomic, transcriptomic and proteomic data. These predictions are then used to form hypotheses that are investigated experimentally. As a result, bioinformatics is being integrated increasingly with bench-based science as hypotheses generated in silico are tested in vitro and in vivo in “wet–dry” cycles.

Sequence-based approaches
The identification of signatures of domains and functional sites in amino acid sequences (e.g. http://www.ebi.ac.uk/interpro) has played an important and complementary role to similarity searching methods in the functional characterisation of proteins. In an extension of this approach, the prediction of sequence motifs associated with post-translational modifications and subcellular localisation of proteins makes it possible to transfer functional information between sequences that are unrelated at the primary sequence or evolutionary level [2,3]. The key principle here is that functionally related proteins will have similar post-translational modifications and sorting signals even if they are unrelated at the sequence level. The ProtFun method (http://www.cbs.dtu.dk/services/ProtFun/) integrates 14 individual attributes (e.g. glycosylation, phosphorylation, signal peptides) to predict functional categories, whereas Proteome Analyst (http://www.cs.ualberta.ca/bioinfo/PA) predicts subcellular location using database text annotations from homologs in addition to sequence information. The Eukaryotic Linear Motif (ELM) server (http://elm.eu.org/) is a resource for investigating short linear peptide motifs that are used for cell compartment targeting, protein–protein interaction, regulation by phosphorylation, acetylation, glycosylation and a range of other post-translational modifications. Scansite (http://scansite.mit.edu/) identifies short sequence motifs within query pro

Box 1. Limitations of similarity methods for prediction of function
Sequence similarity methods have been a mainstay of bioinformatics approaches to assign biological functions to genes and proteins for the past 20 years or so. The redundancy of the genetic code means that this method is most sensitive at the protein level. Therefore, the nucleotide sequences of genes are usually translated into their predicted primary amino acid sequence and used to query protein sequence databases (e.g. UniProt: http://www.uniprot.org) for homologous sequences by automatic pair-wise comparison using Basic Local Alignment Search Tool (BLAST) and its various flavours (http://www.ncbi.nlm.nih.gov/BLAST/). This approach allows the detection of homologous sequences having greater than 50% identity. By building consensus sequences based on multiple alignments and then matching them against databases of consensus sequences (e.g. ProDom: http://prodes.toulouse.inra.fr/prodom/current/html/home.php) using BLAST, sequence identities down to 30% can be detected. To allow the identification of homologous domains with low sequence identity (