Vol. 2, No. 3 2005

Drug Discovery Today: Technologies
Editors-in-Chief: Kelvin Lam – Pfizer, Inc., USA; Henk Timmerman – Vrije Universiteit, The Netherlands

Knowledge management

Managing genomic and proteomic knowledge

Michael W. Lutz, Patrick V. Warren, Rob W. Gill, David B. Searls*

Bioinformatics Division, GlaxoSmithKline Pharmaceuticals, 709 Swedeland Road, P.O. Box 1539, King of Prussia, PA 19406, USA

Section Editor: Manuel Peitsch – Novartis, Switzerland

Genomic and proteomic platform data constitute a hugely important resource for current efforts in disease understanding, systems biology and drug discovery. We review prerequisites for the adequate management of ‘omic’ data, the means by which such data are analyzed and converted to knowledge relevant to drug discovery, and issues crucial to the integration of such data, particularly with chemical, genetic and clinical data.

Introduction In combination with whole-genome sequences of human and several model organisms, the availability of technologies to modify and measure cellular responses at the level of individual transcripts and proteins presents opportunities to accelerate the process of drug discovery across the entire pipeline, from disease understanding and target identification through clinical trials, postmarketing surveillance and diagnostics. However, the availability of genomic and proteomic platform data, in addition to large-scale genetic association studies, high-throughput screening data and results from animal models, creates a sizeable challenge for data integration and knowledge management. This review focuses on knowledge management of genomic and proteomic data, the use of bioinformatics approaches to apply the knowledge to programs in drug discovery and the effective integration of this data with chemical, genetic and clinical data.

*Corresponding author: D.B. Searls ([email protected])
1740-6749/$ © 2005 Elsevier Ltd. All rights reserved.

DOI: 10.1016/j.ddtec.2005.08.001

Prerequisites for management of genomic and proteomic data

In utilizing platform data, certain steps must be taken to facilitate the transition of data to useful information and ultimately knowledge. Effective integration of platform data requires planning at the outset to allow biologically meaningful questions to be asked of the data, both for an experiment that was designed to answer a specific question and for data mining across the entire collection of knowledge available. Table 1 gives a description of key genomic and proteomic technologies from the perspective of knowledge management. The companies and websites listed provide software and bioinformatics approaches to manage data for specific genomic and proteomic platforms and, in many cases, to integrate the data with other platforms. The diversity of the platforms is notable, ranging from transcriptomic analysis to animal models. New technologies are invented and developed at a rapid pace; a key aspect of the table is the power derived from using knowledge from several platforms in the context of a project, for example using microarrays to develop a list of potential biomarkers (see Glossary) based on differences between diseased and normal tissues, and then using this gene list in combination with proteomic and metabonomic samples from blood and urine to develop diagnostic assays [1].

Glossary
Biomarker: A specific biological trait, such as the level of a hormone or molecule in a subject, that can be measured to indicate the progression of a disease, condition, or response to treatment.
Chemogenomics: An integration of genomics with chemistry. Examples include measuring the genomic and/or proteomic response of a biological system to chemicals and the use of gene-family focused approaches for drug discovery.
Normalization: A mathematical procedure that adjusts for systematic differences among data from varying sources to create a common basis for meaningful comparisons.
Ontology: A formal, usually hierarchical, and often richly interconnected representation of objects, concepts and other entities that embodies knowledge about a field.
Pleiotropy: The property of a gene or gene product by which it exhibits multiple phenotypic effects or possesses multiple functions.

Data for genomic and proteomic studies are derived from samples obtained from human and/or animal model studies. To make effective use of data from samples, consistent and standard methods for indexing biological entities and capturing experimental details are necessary (Fig. 1). Samples could be obtained from the same subject over a period of time, for different treatment conditions and for different purposes. To describe content, standard vocabularies are needed to allow meaningful aggregation and query of the data. SNOMED (see Links), for example, offers such standards for clinical practice and pathology. Well-characterized phenotypes are essential to genetic and genomic studies and a prerequisite for conducting analysis that is useful for drug discovery or disease understanding. Phenotypes are based on clinical measurements and findings, for example laboratory measurements, diagnoses, imaging studies, or clinical history notes. Platform data itself can be used to define phenotypes based on patterns of expression, which are then used in the context of drug treatment or disease studies and compared with genotypes. Definition of a phenotype requires consistency in terms of use and interpretation. To achieve consistency in gene-expression-based phenotypes, controlled vocabularies and a common gene index are essential. A valuable resource for microarray standards, which are among the most advanced, is the Microarray Gene Expression Data Society (see Links). To compare genotypes to phenotypes in different organisms, an important step is reconciling the different phenotype descriptions in public databases, which are often derived for a single organism. To compare between species, orthology must first be established. A public resource of phenotypes and genotypes for multiple organisms, including human, mouse, fruit fly, Caenorhabditis elegans and other model organisms, is PhenomicDB (see Links). The software provided with this database allows the comparison of phenotypes of a specific gene over many organisms simultaneously. The authors collected data from public data sources including OMIM, Wormbase, Flybase and MGI and semantically mapped the data to a single data model [2].
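The role of a common gene index can be illustrated with a minimal sketch in Python. All namespaces, identifiers and index entries below are invented for illustration and do not correspond to any real cross-reference resource:

```python
# Sketch: resolving platform-specific identifiers to a common gene index so
# that a query retrieves all measurements about one gene across platforms.
# All identifiers below are hypothetical.

# Cross-reference table: (namespace, platform-specific ID) -> common index entry
XREF = {
    ("microarray_probe", "probe_A1"): "GENE:0001",
    ("protein_spot", "spot_17"): "GENE:0001",
    ("rtpcr_assay", "assay_X1"): "GENE:0001",
    ("microarray_probe", "probe_B2"): "GENE:0002",
}

def resolve(namespace, platform_id):
    """Map a platform-specific identifier to the common gene index."""
    return XREF.get((namespace, platform_id))

def collect_measurements(records):
    """Group (namespace, id, value) measurements by common gene index,
    dropping records whose identifier cannot be resolved."""
    by_gene = {}
    for namespace, platform_id, value in records:
        gene = resolve(namespace, platform_id)
        if gene is not None:
            by_gene.setdefault(gene, []).append(value)
    return by_gene

records = [
    ("microarray_probe", "probe_A1", 8.2),  # microarray log2 intensity
    ("protein_spot", "spot_17", 1.7),       # 2D-gel spot ratio
    ("rtpcr_assay", "assay_X1", 2.1),       # RT-PCR fold change
    ("microarray_probe", "probe_Z9", 5.0),  # unresolvable -> dropped
]
grouped = collect_measurements(records)
```

In practice such cross-reference tables are large, curated resources, but the principle is the same: without consistent indexing, the unresolvable record above would silently vanish from any cross-platform query.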
In the transition from data to information to knowledge, transformations of the original, primary data are often



Box 1. Normalization Normalization methods span a range of complexity from simple calculations based on scaling data from a collection of samples processed in the same batch (e.g. quantitative expression plate, microarray chip, or proteomics gel) to statistical error models based on the analysis of replicates. For microarrays, where there are typically thousands of genes arrayed on the same chip, normalization is often performed by comparison to a global average measure of intensity. Alternative approaches include using linear regression analysis, log centering, locally weighted linear regression and nonlinear methods [22]. For quantitative real-time PCR, where ten to several hundred genes might constitute an experimental batch, normalization calculations are often based on a set of reference or housekeeping genes; however, careful attention needs to be given to the choice of the reference genes and nonlinear response characteristics [23]. Often methods developed for one technology can be successfully applied to another, for example variance stabilizing transformations used for microarray data can be used to remove bias from differential protein expression measurements on 2D gels [24]. Literature references are available that describe and compare normalization methods for genomic and proteomic platforms [22–24]. Additional resources and software are available on academic or public websites such as BioConductor or the Stanford Genome Resources (see Links). Benchmarks and comparisons of normalization techniques are important; a useful resource for comparison of microarray algorithms is Affycomp (see Links).
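Two of the simpler schemes mentioned in Box 1 can be sketched as follows, using toy data; real implementations (e.g. in BioConductor packages) are considerably more robust:

```python
# Sketch of two simple normalization schemes: global (average-intensity)
# scaling, as used for microarray chips, and reference-gene normalization,
# as used for RT-PCR. Data values and gene names are invented.

def global_scale(chip, target=500.0):
    """Scale all intensities on one chip so its mean matches a common
    target value, allowing chip-to-chip comparison."""
    mean = sum(chip) / len(chip)
    return [x * target / mean for x in chip]

def reference_gene_normalize(sample, reference_genes):
    """Normalize a dict of gene -> quantity by the mean of designated
    reference ('housekeeping') genes."""
    ref = sum(sample[g] for g in reference_genes) / len(reference_genes)
    return {gene: value / ref for gene, value in sample.items()}

chip_a = [100.0, 300.0, 600.0]   # one chip's raw intensities
chip_b = [200.0, 600.0, 1200.0]  # same expression pattern, twice as bright
norm_a = global_scale(chip_a)
norm_b = global_scale(chip_b)    # after scaling, the two chips agree

pcr = reference_gene_normalize(
    {"target_gene": 8.0, "ref1": 2.0, "ref2": 2.0}, ["ref1", "ref2"])
```

As the box notes, the choice of reference genes matters: if a "housekeeping" gene in fact responds to treatment, every normalized value is biased.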

required for analytical goals. NORMALIZATION (see Glossary and Box 1) is the most crucial transformation for utilizing platform data in meaningful biological comparisons, biochemical pathway and network analysis, and comparison across studies and multiple platforms. An effective normalization method allows biologically meaningful comparisons to be made, reduces variance between experimental runs and decreases bias. Normalization is necessary to account for different amounts of starting material (RNA, protein, analyte, among others), differences in labeling or detection efficiency and systematic biases introduced by the measurement technique. Different normalization methods can impact downstream analysis. For example, in high-density oligonucleotide arrays such as the Affymetrix microarray chip, advanced methods in signal extraction and normalization at the probe level are now competitive with, and in many cases superior to, the standard Affymetrix MAS 5.0 algorithm. The oligonucleotides that make up a probe set vary in length and position on the transcript of interest and therefore have different hybridization characteristics. The new methods utilize the information collected from the various probe hybridization characteristics, and their respective responses to varying amounts of target transcript, to better evaluate the overall signal intensity [3]. However, unlike the MAS 5.0 global scaling, which makes possible comparison of intensity values from chip to chip, several of the new methods normalize within a specific treatment group, so direct comparison across separately normalized groups is ill advised. To take advantage of the better signal-to-noise values derived from these new methods, it is necessary to devise higher

Table 1. Summary of genomic and proteomic technologies

1. Real-time PCR or quantitative real-time PCR
Companies and websites: Applied Biosystems www.appliedbiosystems.com; Stratagene www.stratagene.com. Statistical analysis tools for both RT-PCR and microarrays: SIMCA www.umetrics.com; JMP www.jmp.com; Spotfire www.spotfire.com; R scripts www.r-project.org
Pros: Method is quantitative, greatly simplifying analysis and normalization. One to hundreds of genes can be measured at once. Instrument vendors often provide analysis packages. A very rich set of tools suitable for several platforms exists for statistical analysis.
Cons: Subsets of genes chosen for study might miss biologically significant effects. The smaller number of tests and highly flexible formats force post-processing analysis into case-dependent methods. Far fewer publications and methods for analysis than for microarrays.
Refs: [23,26–28]

2. High-density oligonucleotide microarrays
Companies and websites: DNA-Chip Analyzer www.biostat.harvard.edu/complab/dchip; MicroArray Suite 5.0 www.affymetrix.com; BioConductor www.bioconductor.org; Rosetta Biosoft www.rosettabio.com; Agilent Technologies www.home.agilent.com; GeneSpring www.silicongenetics.com (see also the statistical tools listed under real-time PCR)
Pros: Tens of thousands to millions of points from a single measurement. Pre-processing and normalization software offered by platform vendors. A large user community contributes substantial intellectual content and software (see, e.g., affycomp.biostat.jhsph.edu).
Cons: Expense and complexity of manufacture and quality-control requirements tend to limit flexibility. High dimensionality of data complicates analysis and visualization. Statistical issues around multiple testing. The sheer number and variety of pre-processing algorithms can be overwhelming. Intimate knowledge of the inferences made in functional annotation of genes is crucial.
Refs: [29–32]

3. Interaction proteomics (Y2H, antibody methods, among others)
Companies and websites: Dual Systems www.dualsystems.com; Hybrigenics www.hybrigenics.com; Prolexys www.prolexys.com
Pros: Y2H (yeast two-hybrid): proteins identifiable by direct sequencing; identifies interactions directly. Antibody methods: can study complexes; can detect post-translational modification; can use native setting; suitable for membrane proteins.
Cons: Y2H: promiscuous proteins result in noisy data; the yeast environment might confer incorrect activity; certain protein classes are not accessible; every interaction pair must be scored for reliability to allow cross-experiment analysis. Antibody methods: cannot be sure whether an interaction is direct or through a secondary protein; extensive workflow architectures required to manage results through to peptide identification.
Refs: [33–35]

4. Two-dimensional gel electrophoresis (PAGE, DIGE)
Companies and websites: Non Linear Dynamics www.nonlinear.com; ThermoLabsystems www.thermo.com; GE Healthcare www.gehealthcare.com; Spotfire www.spotfire.com; scripting in R www.r-project.org; Rosetta Biosoft www.rosettabio.com
Pros: Ability to separate thousands of proteins from a single sample. Differential expression analysis possible. Isoelectric point and relative molecular mass estimation possible. Post-translational modification can be detected directly.
Cons: Reproducibility issues. Follow-up spot identification required. ‘Missing data’ issues. Very acidic, basic or hydrophobic proteins can be hard to detect. Loading can limit detection of low-abundance proteins. Gel analysis is tightly bound to vendor software, leading to dependency. Difficult to map proteins within an experiment and to do cross-experiment analyses.
Refs: [36]

5. Mass spectrometry (MALDI-MS, ESI-MS/MS, PMM, among others)
Companies and websites: Waters-Micromass www.waters.com; Matrix Sciences (MASCOT) www.matrixscience.com; Agilent Technologies www.home.agilent.com; Applied Biosystems www.appliedbiosystems.com; Spotfire www.spotfire.com; scripting in R www.r-project.org; Rosetta Biosoft www.rosettabio.com
Pros: MS gives high-resolution data with high reproducibility. MALDI-MS generates less raw data than other MS approaches. Peptide mass mapping (PMM) allows not only qualitative but also quantitative detection for differential expression analysis.
Cons: Very high data volume and storage/analysis cost. Sampling rate can lead to ‘missed peptides’. Analysis bound to vendor software leads to dependency and integration issues. Peptide-to-protein association might not be definitive. Data transformation issues: MALDI-MS data must be converted to text files of peptide mass and intensity to search databases; ESI-MS/MS generates much more raw data, converted to files of peptide mass, intensity and charge followed by lists of fragment mass and intensity; PMM ions are not associated with peptides, so much additional analysis is required in differential expression studies.
Refs: [37,38]

6. Functional genomics (RNAi, knockout, mutagenesis, transgenics, among others)
Companies and websites: Galapagos www.galapagosgenomics.com; Cenix www.cenix-bioscience.com; Atugen www.atugen.com; Deltagen www.deltagen.com; Genoway www.genoway.com; Trans Genic Inc. www2.transgenic.co.jp
Pros: RNAi: shorter turnaround time to analysis; can be high-throughput. Transgenics: effects seen in animals as opposed to cell lines; physiological phenotypes can be associated with a knockout; conditional knockouts allow effects to be seen at any stage or treatment.
Cons: RNAi: not 100% knockdown, potentially complicating analysis; possible secondary effects; harder to phenotype. Transgenics: long time scale from start to having a model delays analysis; subtle phenotypes can be difficult to capture and compare computationally, or might require a broad range of conditions to elicit.
Refs: [39–42]

7. Other ‘omics’ (lipomics, metabonomics, among others)
Companies and websites: Metabometrix www.metabometrix.com; Surromed www.surromed.com; BG Medicine www.bg-medicine.com; Icoria www.icoria.com; Lipomics www.lipomics.com
Pros: Often minimally invasive, enabling more informative time-series analysis. Can suggest sampling frequencies for other, costlier platforms. Metabolic pathways relatively conserved. Can aid interpretation of species changes in biochemical pathways.
Cons: Many endogenous metabolites change as a result of normal variability, diurnal cycles, diet, among others, complicating analysis. Species differences in metabolism might be sufficient to limit the applicability of findings from animal model studies to humans.
Refs: [43–46]

Figure 1. Proper indexing of biological entities such as genes and gene products is necessary to ensure that database query and data mining will retrieve all related information about those entities. Complete genome sequences now provide a ‘parts list’ and a solid foundation on which to anchor information about, for instance, variation at the level of populations and across evolution. However, phenomena such as alternative transcription, especially alternative splicing, establish a one-to-many relation from genetic loci to mRNA transcripts, often in a tissue-specific manner. Furthermore post-transcriptional modifications such as phosphorylation further expand the repertoire of the proteome, complicating the job of indexing all such variants.

order comparison strategies. For example, comparison at the fold-change level would be permissible, provided the compared groups shared proper controls. It is equally important to understand the inferences made in moving from analysis at the probe set level to the gene level. Frequently in high-density oligonucleotide arrays there are multiple probe sets per gene, and there is also a 3′ enhancement in signal intensity. Understanding the orientation of the probe set to gene structure is also crucial when considering the influence of splice variant populations in the mRNA isolates. The issues raised here are applicable to several genomic and proteomic platforms, and knowledge of the assumptions made is crucial to any analysis attempting the transition from measurement to biological inference. Data integration is a major challenge both for bioinformatics and for drug discovery [4]. Knowledge management of genomic and proteomic data is often a data-driven rather than a hypothesis-driven activity. Hypothesis-driven investigation is based on focused acquisition and analysis of data gathered from controls and treatment groups. For a data-driven investigation, acquisition and analysis of data are performed with the aim of making the data available for ad hoc sampling and interpretation. Different data types are often used in combination to provide support for a relationship identified by a single platform. Although cross-platform normalization is mathematically possible (e.g. by using methods such as centering and scaling of data, principal components analysis and partial least squares), qualitative interpretation of data across platforms is often used to build knowledge from multiple platforms. For example, information on changes in gene expression and associated protein level changes can be overlaid on pathway maps and examined for concordance.
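The two approaches mentioned here, quantitative centering and scaling versus qualitative concordance of direction, can be sketched as follows. The gene names and log-ratios are invented for illustration:

```python
# Sketch: per-platform centering and scaling (z-scores) puts measurements
# with different dynamic ranges on a common basis; a qualitative check then
# asks whether mRNA and protein changes agree in direction per gene.
# Gene names and values are invented.

def zscores(values):
    """Center and scale one platform's measurements (z-scores)."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

def concordant_genes(genes, mrna_change, protein_change):
    """Genes whose mRNA and protein log-ratios agree in direction."""
    return [g for g, m, p in zip(genes, mrna_change, protein_change)
            if m * p > 0]

genes = ["geneA", "geneB", "geneC", "geneD"]
mrna = [2.0, 1.5, 0.1, -1.8]     # log-ratios from a microarray experiment
protein = [1.2, 0.9, -0.2, 0.5]  # log-ratios from a 2D-gel experiment
mrna_z = zscores(mrna)           # common scale for quantitative overlay
agree = concordant_genes(genes, mrna, protein)
```

In this toy example only the first two genes change concordantly at both levels; the qualitative filter flags the others for closer inspection rather than forcing a numerical cross-platform model.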
Software, such as the open-source Cytoscape (see Links), is available to visualize molecular interaction networks and associated platform data [5]. Gene sets from transcriptomic and proteomic experiments based on comparisons between disease and normal states can


be compared for overlap and trends over time. Dose–response is an important comparison for drug treatment response, and it is feasible to develop statistical models based on large numbers of dose–response comparisons, with statistical adjustment for the number of comparisons. Cross-platform comparisons are complicated at a biological level by alternative splicing and post-translational modifications; for example, the primer chosen for a microarray chip might not correspond to the protein measured by a 2D-gel proteomics experiment. As a consequence, the two expression levels (mRNA and protein) might not correlate with changes in treatment. Advanced statistical methods in supervised and unsupervised learning are used to develop complex models that generate biological fingerprints or profiles based on data obtained from multiple genomic and proteomic platforms [6]. Data integration can occur at different phases during the processes of data acquisition, analysis and interpretation. Primary data are stored in operational databases in the form of image files or output from scanners, and are typically not normalized or combined with other data. At the next level, appropriate normalization is done to permit comparison with data from other experiments using the same platform or, in some cases, with data from other platforms. Moving up yet one more level, data are aggregated into specific treatment and control groups; statistical and visual methods are employed at this stage to start transforming the data into results with a statistical measure of variation. Finally, at the highest level, the results are compared and contrasted with other experimental results and/or literature reports for conclusions and interpretations. Trends across dose, time, or other covariates are often the basis for higher level data integration.
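The group-aggregation level described above can be illustrated with a minimal sketch; replicate values are invented, and a real analysis would use an established statistics package rather than this hand-rolled summary:

```python
# Sketch: pooling replicate measurements into treatment and control groups
# for one gene, and summarizing them as a fold change plus a simple Welch
# t-statistic as the accompanying measure of variation. Toy values only.
import math

def summarize(treated, control):
    """Return (log2 fold change, Welch t-statistic) for one gene."""
    def mean_var(xs):
        m = sum(xs) / len(xs)
        v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance
        return m, v
    mt, vt = mean_var(treated)
    mc, vc = mean_var(control)
    fold = math.log2(mt / mc)
    t = (mt - mc) / math.sqrt(vt / len(treated) + vc / len(control))
    return fold, t

# Three replicate intensities per group for a hypothetical gene
fold, t = summarize([820.0, 790.0, 845.0], [405.0, 390.0, 420.0])
```

Here the gene shows roughly a twofold (one log2 unit) increase with a large t-statistic; in practice the multiple-testing adjustment mentioned above must be applied before calling such a change significant.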
Maintaining a record or path to the primary data is crucial; as novel bioinformatics techniques evolve, the availability of the primary data allows new normalization and analytical methods to be employed. However, data integration across platforms for purposes of decision-making is usually done at a higher level, using data from different platforms to strengthen the evidence for a specific result or biological interaction.

From data to knowledge: ‘omics’ in drug discovery

The application of genomic and proteomic data to problems in drug discovery is focused on understanding the molecular basis for disease and on the development of knowledge about efficacy and safety of potential new drugs in humans. Analysis and interpretation of data from several types of genomic and proteomic studies, and of results from tests in animal models, characterize these applications. Comparisons of results between species and between different genomic and proteomic platforms are essential to assess the strength of evidence needed to progress compounds through the drug discovery pipeline. Methods used in bioinformatics analysis must be effective at unifying results between platforms and species. Drug discovery utilizes comparative genomics, largely based on comparisons of drug response in model organisms ranging from yeast to primates, as a crucial step to understand drug efficacy and safety in humans. As described in Table 1, transgenic techniques including gene knockouts, knockins and conditionals are used to modify model organisms to characterize biochemical targets. However, drug discovery programs encounter difficulties when species differences in metabolism or toxicity are observed, either between different model organisms or between model organisms and humans. Establishment of orthology for the drug target or targets is a crucial step, which depends on the effective management and knowledgeable assessment of genome data. Although genes with high homology can often be identified between different pairs of genomes, complexity is introduced by characteristics of both the target and the drug–target complex, including alternative splicing, differences in signal transduction and binding of compound to multiple protein domains. Establishing a knowledge management program for comparative genomics requires data integration to identify models and organisms available for a specific compound, target or disease and to enable cross-species comparisons. Often, an iterative process is used in comparative genomics to provide support for a biological hypothesis; for example, a finding from a yeast model might form the basis for undertaking gene knockouts in mice. Phylogenetic analysis is a powerful tool for disease understanding and drug development that provides a biological context for understanding species differences (Fig. 2). For diseases caused by viral and bacterial pathogens, phylogenetic analysis has been used to understand patterns of drug resistance and the transmission of diseases between species. More recently, the completion of genome sequencing of many diverse species has allowed analysis in terms of prediction of protein function, regulatory region analysis, protein interactions, pathways and networks. By using a rigorous analysis of evolutionary change, drug targets can be more effectively classified and placed into a biological context [7]. Knowledge management of the data needed for phylogenetic analysis is based on the curation of whole-genome data for multiple species. Bioinformatics algorithms for phylogenetic mapping and reconstruction require high-quality sequence data for BLAST searches of various types and for clustering, as well as a variety of phylogenetic software packages including PAUP and PHYLIP. Construction of complete orthology maps provides the basis to understand species differences during screening and lead identification or lead optimization. Combination of orthology maps with other types of genomic and proteomic data, and with detailed annotations on pathways, facilitates inference about species differences in compound efficacy and safety.
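One standard way to build such orthology maps is to call putative orthologs as reciprocal best BLAST hits between two genomes. The sketch below uses the human/mouse TP53-family gene symbols for readability, but the bit scores are invented rather than real BLAST output:

```python
# Sketch: calling putative orthologs as reciprocal best hits (RBH) between
# two genomes. Input dicts map (query, subject) -> bit score, as one might
# parse from tabular BLAST output; the scores below are made up.

def best_hits(scores):
    """For each query, keep the subject with the highest bit score."""
    best = {}
    for (q, s), score in scores.items():
        if q not in best or score > best[q][1]:
            best[q] = (s, score)
    return {q: s for q, (s, score) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b AND b's best hit is a."""
    ab, ba = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((a, b) for a, b in ab.items() if ba.get(b) == a)

human_vs_mouse = {("hsTP53", "mmTrp53"): 900, ("hsTP53", "mmTrp63"): 400,
                  ("hsTP63", "mmTrp63"): 850}
mouse_vs_human = {("mmTrp53", "hsTP53"): 890, ("mmTrp63", "hsTP63"): 840,
                  ("mmTrp63", "hsTP53"): 300}
orthologs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
```

The reciprocity requirement is what guards against paralogs: a one-directional best hit to a related family member (here the cross-family hits with lower scores) does not survive the filter.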
Understanding drug target paralogy is also important for drug discovery,

Figure 2. Genes can be cross-indexed between species by establishing orthology relations, providing clues to function which in many cases is well conserved across phylogeny (as in the case of the genes shown from the PAX-6 gene family on the left, which share eye-related phenotypes [25]). Such mapping can also extend to pathways as shown on the right, allowing further inference of function by analogy between species, but in all cases it is necessary to track and understand species differences arising from gene loss, duplication and functional shifts.




particularly for targets that are members of large superfamilies, including 7-transmembrane receptors, kinases and nuclear receptors. As an example, rigorous phylogenetic analysis provided a biological context for characterization of Aurora kinases and suggested a design for dual-action anticancer drugs that would inhibit multiple kinases [8]. Increasingly, disease understanding and mechanism of drug action studies are viewed from a biological systems perspective [9]. The focus of systems biology is on interactions between the molecular components relevant to the biological question being addressed. Proteins interact with compounds and other proteins, forming networks that encompass, for example, kinases and phosphatases that have roles in signal transduction. Cis control elements and transcription factors regulate gene expression in different tissues and cellular environments. Systems biology presents a great challenge for knowledge management of genomic and proteomic data, integration of which offers the opportunity to correlate multiple lines of evidence while potentially reducing uncorrelated noise. This combination can allow a more cohesive and meaningful biological interpretation to emerge from high-throughput data sources. For instance, protein interaction maps derived from yeast two-hybrid data indicate proteins that can bind to each other for a wide variety of reasons, including relatively nonspecific binding characteristic of certain proteins. At the same time, microarray data reflect expression levels of mRNA that might or might not correlate well with protein expression levels, and which are subject to disparate types of variation at the level of measurement, sample and subject replicates. Yet these two forms of data have been shown to be informationally synergistic when combined, allowing for the reconstruction of known pathways from first principles [10].
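The kind of synergy described here can be illustrated by a toy filter that retains two-hybrid interaction pairs only when the partners' expression profiles also correlate. The yeast gene names are used merely as familiar labels; the interaction list and expression profiles are invented:

```python
# Sketch: combining two noisy evidence sources. Yeast two-hybrid (Y2H)
# pairs are kept only if the partners are co-expressed across conditions,
# filtering out promiscuous binders. All data below are invented toy values.

def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def supported_interactions(y2h_pairs, expression, min_r=0.6):
    """Keep Y2H pairs whose expression profiles correlate above min_r."""
    return [(a, b) for a, b in y2h_pairs
            if pearson(expression[a], expression[b]) >= min_r]

expression = {
    "CDC28": [1.0, 2.0, 3.0, 4.0],
    "CLN2": [1.1, 2.1, 2.9, 4.2],    # tracks CDC28 across conditions
    "HSP104": [4.0, 1.0, 3.5, 0.5],  # unrelated profile
}
y2h = [("CDC28", "CLN2"), ("CDC28", "HSP104")]
kept = supported_interactions(y2h, expression)
```

Neither source alone is reliable, but an interaction supported by both kinds of evidence is a far stronger candidate, which is the essence of the synergy reported in [10].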
The combinations of these forms of data have been used to identify libraries of recurring motifs, where the mixed semantics of the patterns promise to be much more informative in establishing the building blocks of biological networks than any single data source taken in isolation [11]. Systems biology, phylogenetic analysis and many bioinformatics methods share a common goal of defining gene function. Computational and logical frameworks to describe gene function are often based on ontologies, for example the GO (Gene Ontology) (see Links) project, which require subjective and labor-intensive curation. General ontological frameworks that include gene expression annotation and visualization have appeared, including GandrKB [12]. Gene function is complicated by biological phenomena including PLEIOTROPY (see Glossary), when a gene affects more than one phenotypic trait, and redundancy, when a trait is affected by more than one gene. These phenomena make a top–down, strictly hierarchical view of gene function problematic at best. An alternative involves development of a bottom–up, probabilistic framework of gene function [13]. In this framework, the data are used to construct biochemical networks based on integration of multiple data sets. This approach is amenable to the use of diverse types of data as described in this review. In the probabilistic framework, pleiotropy is implicit and statistical methods are used to control the biological noise that is associated with any single measurement. As an example, Fraser and Marcotte [13] demonstrated how a biochemical network constructed from yeast data was used to form a framework for understanding the function of C. elegans genes.

Integration with chemical, genetic and clinical data

Drug discovery projects require integration of genomic and proteomic platform data with other experimental data to progress from potential drug targets to efficacious and safe drugs. Hundreds to thousands of genes and gene products change in response to a drug or development of a disease, or as a result of normal metabolism and regulation. The challenge is to distinguish the relevant biochemical changes, which will assist in the determination of structure–activity relationships and the establishment of efficacy and safety, from other biological variation. Chemical, genetic and clinical data obtained in the course of discovery and clinical research, combined with genomic and proteomic knowledge, are crucial to addressing this challenge. Ultimately, biomarkers developed from this knowledge can be used in the course of drug discovery and clinical research. CHEMOGENOMICS (see Glossary) is a branch of systems biology that is often based on the response of an organism or cell to different types of intervention, principally chemical intervention [14,15]. In the first phase, or learning phase, a set of treatments is applied to the system (organism or cell) that can include compounds, knockouts, RNAi, or genetically engineered overexpression of genes. The genomic response to the treatment is measured, typically using microarrays, although any platform technology could be used. The responses to the treatments create a set of reference profiles that are used to generate a computational model of the biological network. After the network is constructed, a compound or treatment with an unknown mechanism of action or unknown intracellular targets is applied to the system and the corresponding genomic profile is measured. Depending on the bioinformatics algorithms that are used, the genomic profile for the unknown compound is directly compared with the reference profiles or is used as an input to a more complex regulatory network.
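The direct-comparison variant of this step, matching an unknown compound's profile against the reference profiles, might be sketched as follows; the treatment names and expression values are illustrative only:

```python
# Sketch: the simplest chemogenomic comparison step. An expression profile
# measured for a compound of unknown mechanism is matched against reference
# profiles from known treatments by correlation. All data are invented.

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def closest_reference(unknown, references):
    """Return (treatment name, correlation) of the best-matching profile."""
    return max(((name, pearson(unknown, prof))
                for name, prof in references.items()),
               key=lambda item: item[1])

references = {  # log-ratio profiles from the learning phase
    "kinase_inhibitor_A": [2.0, -1.0, 0.5, -2.0],
    "hsp90_knockdown": [-1.5, 2.0, -0.5, 1.0],
}
unknown = [1.8, -0.8, 0.6, -1.9]  # new compound's measured profile
name, r = closest_reference(unknown, references)
```

A high correlation to a reference treatment suggests, but does not prove, a shared mechanism; the more complex network-based algorithms mentioned in the text aim to go beyond this nearest-profile logic.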
The network, it is hoped, will highlight putative molecular targets and pathways. Chemical genomic profiling is a powerful tool for elucidating intracellular signaling pathways; in recent work it has been used to identify the targets of kinase inhibitors that have complex profiles of interaction with multiple members of a signaling pathway [16].

Genetic factors confer susceptibility to disease, resistance to drugs and individual clinical outcomes. A key challenge for genetics is understanding polygenic diseases, in which susceptibility results from the interaction of multiple genes rather than the action of a single gene. Environmental factors (diet, lifestyle and toxins, among others) interact with genetic factors in complex ways for many common human diseases, including cancer and cardiovascular disease. Linking genotypes to phenotypes is an important step in disease understanding, and the availability of high-quality phenotype databases is a crucial prerequisite [17,18]. The combination of high-density, whole-genome SNP screening and statistical methods creates a paradigm for drug discovery and development in which data from patients are used to select the appropriate biochemical targets for a disease [19]. Genetic associations from these studies support pathway analysis and disease understanding from the most relevant source: high-quality collections of data from patients. Genomic and proteomic data augment genetic association studies by providing additional and diverse evidence for the biochemical networks associated with disease and drug response. A promising approach is to apply genetic analysis (e.g. SNP-genotype data and genome-wide mapping techniques) to gene expression patterns. A recent study [20] used this approach to characterize natural variation in gene expression and to identify transcriptional regulatory loci, without prior knowledge of the regulatory mechanism.

Biomarkers include biological markers, surrogates, prognostics and diagnostics. The utility and uptake of a BIOMARKER depend on the statistical and biological validation of the marker and on its acceptance in the medical and regulatory communities. (Definitions of biomarkers can be found on a National Institutes of Health website [see Links].) A set of biomarkers, rather than a single biomarker, is often optimal for specific clinical requirements.
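At its simplest, the genotype-to-expression mapping described above amounts to testing, per transcript, whether expression differs between genotype groups at a candidate SNP; the genotypes and expression values below are invented for illustration, and a real analysis would scan all SNP-transcript pairs with multiple-testing correction:

```python
from statistics import mean, stdev

# Hypothetical expression levels of one transcript in individuals
# grouped by genotype at a candidate regulatory SNP (values invented).
expression_by_genotype = {
    "AA": [5.1, 4.8, 5.3, 5.0, 4.9],
    "GG": [6.9, 7.2, 6.8, 7.1, 7.0],
}

def welch_t(a, b):
    """Welch's t-statistic for a two-sample comparison with
    unequal variances (a standard choice for small groups)."""
    va = stdev(a) ** 2 / len(a)
    vb = stdev(b) ** 2 / len(b)
    return (mean(a) - mean(b)) / (va + vb) ** 0.5

# A large |t| flags the SNP as a candidate regulatory locus
# for this transcript.
t = welch_t(expression_by_genotype["AA"],
            expression_by_genotype["GG"])
```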
Related articles
Cobb, J.P. et al. (2005) Application of genome-wide expression analysis to human health and disease. Proc. Natl. Acad. Sci. U.S.A. 102, 4801–4806
Joshi-Tope, G. et al. (2005) Reactome: a knowledgebase of biological pathways. Nucleic Acids Res. 33, D428–D432
Strange, K. (2005) The end of "naïve reductionism": rise of systems biology or renaissance of physiology? Am. J. Physiol. Cell Physiol. 288, C968–C974
Huang, J. et al. (2004) Finding new components of the target of rapamycin (TOR) signaling network through chemical genetics and proteome chips. Proc. Natl. Acad. Sci. U.S.A. 101, 16594–16599
Weston, A.D. and Hood, L. (2004) Systems biology, proteomics, and the future of health care: toward predictive, preventative, and personalized medicine. J. Proteome Res. 3, 179–196
Hegde, P.S. et al. (2003) Interplay of transcriptomics and proteomics. Curr. Opin. Biotechnol. 14, 647–651

Biomarker identification and validation create several challenges for the management of genomic and proteomic knowledge. Often one platform technology, for example microarrays, is used as a first filter to nominate a set of target genes as a biomarker for a specific purpose (surrogate, prognostic or diagnostic). A second platform technology, such as proteomics or quantitative expression analysis, is then used to validate the potential biomarkers and to better characterize the changes associated with the phenotype. Finally, it is often a requirement that a biochemical assay can be developed as the measurement used in the clinic. Bioinformatics assists in biomarker identification by providing integrated sources of genetic and genomic information associated with an array of potential biomarkers. By making connections between sequences, regulatory regions, transcriptomic data, proteomic data, gene expression maps and the literature, the strength of a biomarker or set of biomarkers can be assessed [21].
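The two-stage filter described above can be sketched as follows; the gene names, fold-changes and the choice of a two-fold threshold are hypothetical, and the second platform here is assumed to be quantitative RT-PCR:

```python
# Hypothetical fold-changes from two platforms (values invented).
microarray_fc = {"GENE_A": 3.2, "GENE_B": 1.1, "GENE_C": 0.3, "GENE_D": 2.8}
qpcr_fc = {"GENE_A": 2.9, "GENE_C": 0.4, "GENE_D": 1.0}

def candidate_markers(stage1, threshold=2.0):
    """First filter: genes regulated up or down beyond the threshold
    on the discovery platform (e.g. microarrays)."""
    return {g for g, fc in stage1.items()
            if fc >= threshold or fc <= 1.0 / threshold}

def validated_markers(candidates, stage2, threshold=2.0):
    """Second filter: the change must replicate, in the same direction
    and magnitude, on the validation platform."""
    return {g for g in candidates
            if g in stage2
            and (stage2[g] >= threshold or stage2[g] <= 1.0 / threshold)}

markers = validated_markers(candidate_markers(microarray_fc), qpcr_fc)
```

Note that GENE_D passes the first filter but fails validation, which is exactly the attrition this two-platform design is meant to expose before a clinical assay is developed.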

Conclusion
Genomic and proteomic data hold great promise for contributing to drug discovery. Fulfilling that promise requires attention first at the level of basic data management: indexing, standardization of descriptors and data normalization that will support valid inferences, especially across platforms. This in turn will better enable the well-founded integration of 'omic' data across the total biological context, and in particular with the chemical, genetic and clinical data that most directly support the needs of the drug discovery enterprise.

Links
- SNOMED (Systematized Nomenclature of Medicine) (www.snomed.org)
- Microarray Gene Expression Data Society (MGED) (www.mged.org)
- PhenomicDB (www.phenomicdb.de)
- National Institutes of Health biomarkers site (www.ospp.od.nih.gov/biomarkers)
- BioConductor (www.bioconductor.org)
- Stanford Genomic Resources (genome-www.stanford.edu)
- Eisen lab (www.rana.lbl.gov)
- Affycomp benchmark (www.affycomp.biostat.jhsph.edu)
- Cytoscape (www.cytoscape.org)
- Gene Ontology (www.geneontology.org)

Outstanding issues
- What data integration architectures are best suited to assembling large-scale and highly diverse genomic and proteomic platform data?
- Will robust methods emerge for standardized cross-platform normalization?
- Will new methods for 'omic' data mining be able to mitigate the pitfalls of analyzing high-dimensional data?
- Will expression profiles extending over thousands of genes, metabolites and other analytes yield biological insight, or only 'black box' results?
- What is the proper role of ontologies in representing and integrating 'omic' platform data?
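One concrete illustration of the normalization issue raised above is quantile normalization, a widely used approach for making arrays within an experiment comparable (cross-platform normalization poses further challenges beyond this sketch); the data are invented:

```python
# Three hypothetical arrays measuring the same four genes; differences
# in overall scale and distribution confound direct comparison.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.0, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]

def quantile_normalize(data):
    """Replace each value with the mean of the values holding the same
    rank across all arrays, so all arrays share one distribution."""
    n_arrays, n_genes = len(data), len(data[0])
    # Mean of the k-th smallest value across arrays, for each rank k.
    rank_means = [sum(sorted(col)[k] for col in data) / n_arrays
                  for k in range(n_genes)]
    normalized = []
    for col in data:
        # Assign each value the mean for its rank within its own array.
        order = sorted(range(n_genes), key=lambda i: col[i])
        out = [0.0] * n_genes
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalized.append(out)
    return normalized

result = quantile_normalize(arrays)
```

After normalization every array has an identical value distribution, so remaining differences reflect per-gene rank changes rather than array-wide artifacts.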


References
1 Bilello, J.A. (2005) The agony and ecstasy of 'OMIC' technologies in drug development. Curr. Mol. Med. 5, 39–52
2 Kahraman, A. et al. (2005) PhenomicDB: a multi-species genotype/phenotype database for comparative phenomics. Bioinformatics 21, 418–420
3 Wu, Z.J. and Irizarry, R.A. (2005) A statistical framework for the analysis of microarray probe-level data. Johns Hopkins University, Dept. of Biostatistics Working Papers
4 Searls, D.B. (2005) Data integration: challenges for drug discovery. Nat. Rev. Drug Discov. 4, 45–58
5 Shannon, P. et al. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504
6 Hastie, T. et al. (2001) The Elements of Statistical Learning. Springer
7 Searls, D.B. (2003) Pharmacophylogenomics: genes, evolution and drug targets. Nat. Rev. Drug Discov. 2, 613–623
8 Brown, J.R. et al. (2004) Evolutionary relationships of Aurora kinases: implications for model organism studies and the development of anticancer drugs. BMC Evol. Biol. 4, 39
9 Hood, L. et al. (2004) Systems biology and new technologies enable predictive and preventive medicine. Science 306, 640–643
10 Steffen, M. et al. (2002) Automated modelling of signal transduction networks. BMC Bioinformatics 3, 34
11 Yeger-Lotem, E. et al. (2004) Network motifs in integrated cellular networks of transcription-regulation and protein–protein interaction. Proc. Natl. Acad. Sci. U.S.A. 101, 5934–5939
12 Schober, D. et al. (2005) GandrKB – ontological microarray annotation and visualization. Bioinformatics 21, 2785–2786
13 Fraser, A.G. and Marcotte, E.M. (2004) A probabilistic view of gene function. Nat. Genet. 36, 559–564
14 di Bernardo, D. et al. (2005) Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nat. Biotechnol. 23, 377–383
15 Bredel, M. and Jacoby, E. (2004) Chemogenomics: an emerging strategy for rapid target and drug discovery. Nat. Rev. Genet. 5, 262–275
16 Kung, C. et al. (2005) Chemical genomic profiling to identify intracellular targets of a multiplex kinase inhibitor. Proc. Natl. Acad. Sci. U.S.A. 102, 3587–3592
17 Roses, A.D. (2004) Pharmacogenetics and drug development: the path to safer and more effective drugs. Nat. Rev. Genet. 5, 645–656
18 Roses, A.D. (2002) Genome-based pharmacogenetics and the pharmaceutical industry. Nat. Rev. Drug Discov. 1, 541–549
19 Roses, A.D. et al. (2005) Disease-specific target selection: a critical first step down the right road. Drug Discov. Today 10, 177–189
20 Morley, M. et al. (2004) Genetic analysis of genome-wide variation in human gene expression. Nature 430, 743–747
21 Frank, R. and Hargreaves, R. (2003) Clinical biomarkers in drug discovery and development. Nat. Rev. Drug Discov. 2, 566–580
22 Quackenbush, J. (2002) Microarray data normalization and transformation. Nat. Genet. 32 (Suppl.), 496–501
23 Bond, B.C. et al. (2002) The quantification of gene expression in an animal model of brain ischaemia using TaqMan™ real-time RT-PCR. Mol. Brain Res. 106, 101–116
24 Kreil, D.P. et al. (2004) DNA microarray normalization methods can remove bias from differential protein expression analysis of 2D gel electrophoresis results. Bioinformatics 20, 2026–2034
25 Chisholm, A.D. and Horvitz, H.R. (1995) Patterning of the Caenorhabditis elegans head region by the Pax-6 family member vab-3. Nature 377, 52–55
26 Heid, C.A. et al. (1996) Real time quantitative PCR. Genome Res. 6, 986–994
27 Higuchi, R. et al. (1992) Simultaneous amplification and detection of specific DNA sequences. Biotechnology 10, 413–417
28 Higuchi, R. et al. (1993) Kinetic PCR: real time monitoring of DNA amplification reactions. Biotechnology 11, 1026–1030
29 Schena, M. et al. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470
30 Fodor, S.P. et al. (1991) Light-directed, spatially addressable parallel chemical synthesis. Science 251, 767–773
31 Fodor, S.P. et al. (1993) Multiplexed biochemical assays with biological chips. Nature 364, 555–556
32 Rajagopalan, D. (2003) A comparison of statistical methods for analysis of high density oligonucleotide array data. Bioinformatics 19, 1469–1476
33 Colland, F. and Daviet, L. (2004) Integrating a functional proteomic approach into the target discovery process. Biochimie 86, 625–632
34 Legrain, P. and Selig, L. (2000) Genome-wide protein interaction maps using two-hybrid systems. FEBS Lett. 480, 32–36
35 Cho, S. et al. (2004) Protein–protein interaction networks: from interactions to networks. J. Biochem. Mol. Biol. 37, 45–52
36 Righetti, P.G. et al. (2004) Critical survey of quantitative proteomics in two-dimensional electrophoretic approaches. J. Chromatogr. A 1051, 3–17
37 Shi, Y. et al. (2004) The role of liquid chromatography in proteomics. J. Chromatogr. A 1053, 27–36
38 Jacobs, J.M. et al. (2005) Ultra-sensitive, high-throughput and quantitative proteomics measurements. Int. J. Mass Spectrom. 240, 195–212
39 Chen, X. et al. (2005) Chemical modification of gene silencing oligonucleotides for drug discovery and development. Drug Discov. Today 10, 195–212
40 Hannon, G.J. and Rossi, J.J. (2004) Unlocking the potential of the human genome with RNA interference. Nature 431, 371–378
41 Harris, S. and Foord, S.M. (2000) Transgenic gene knock-outs: functional genomics and therapeutic target selection. Pharmacogenomics 1, 433–443
42 Tornell, J. and Snaith, M. (2002) Transgenic systems in drug discovery: from target identification to humanized mice. Drug Discov. Today 7, 461–470
43 Griffin, J.L. and Bollard, M.E. (2004) Metabonomics: its potential as a tool in toxicology for safety assessment and data integration. Curr. Drug Metab. 5, 389–398
44 Watkins, S.M. et al. (2002) Lipid metabolome-wide effects of the PPARγ agonist rosiglitazone. J. Lipid Res. 43, 1809–1817
45 Nicholson, J.K. et al. (2002) Metabonomics: a platform for studying drug toxicity and gene function. Nat. Rev. Drug Discov. 1, 153–161
46 Watkins, S.M. (2004) Lipomic profiling in drug discovery, development and clinical trial evaluation. Curr. Opin. Drug Discov. Dev. 7, 112–117
