A primer on precision medicine informatics.

Briefings in Bioinformatics Advance Access published June 5, 2015 Briefings in Bioinformatics, 2015, 1–9 doi: 10.1093/bib/bbv032 Paper

Andrea Sboner and Olivier Elemento
Corresponding authors. Andrea Sboner, PhD. Tel.: 1-646-962-2241. E-mail: [email protected]; Olivier Elemento, PhD. Tel.: 1-646-962-5726. E-mail: [email protected].

Abstract
In this review, we describe key components of a computational infrastructure for a precision medicine program that is based on clinical-grade genomic sequencing. Specific aspects covered in this review include software components and hardware infrastructure, reporting, integration into Electronic Health Records for routine clinical use and regulatory aspects. We emphasize informatics components related to reproducibility and reliability in genomic testing, regulatory compliance, traceability and documentation of processes, integration into clinical workflows, privacy requirements, prioritization and interpretation of results to report based on clinical needs, rapidly evolving knowledge base of genomic alterations and clinical treatments and return of results in a timely and predictable fashion. We also seek to differentiate between the use of precision medicine in germline and cancer.
Key words: precision medicine; informatics; NGS testing; clinical genomics; computational biology

Introduction Precision medicine involves using detailed, patient-specific molecular information to diagnose and categorize disease, then guide treatment to improve clinical outcome [1]. In precision medicine, it is assumed that the underlying molecular causes of disease are at least partly specific to each patient; that is, each patient has a unique set of molecular alterations that are responsible for their disease condition. Identifying these molecular alterations helps select the best treatment, effectively tailoring it to each individual. In some diseases, precision medicine relies on molecular biomarkers, that is, molecular events that are correlated with treatment response and clinical outcome but not necessarily causal for the disease [2]. The concept of precision medicine is appealing because there is a growing number of drugs whose efficacy is tied to the presence of one or more molecular alterations. This is especially true in oncology with Food and Drug Administration (FDA)-approved drugs targeting mutated genes such as crizotinib targeting anaplastic lymphoma kinase rearranged lung cancer [3], gefitinib/erlotinib for epidermal growth factor receptor (EGFR) exon 19 deletions [4], Herceptin for HER-2 amplified or over-expressing breast tumors [5] and many others. What makes precision medicine even more compelling is the realization that targetable mutations are unexpectedly observed in certain diseases, albeit at low frequencies. For example, BRAF V600E mutations are frequently found in melanoma but also found at a lower frequency in lung adenocarcinoma [6]. When such actionable mutations unexpectedly occur, off-label prescription of the drug targeting the alteration is possible and clinical responses have been observed. For example, BRAF V600E-mutated lung adenocarcinomas have been observed to respond to vemurafenib [7]. In the non-oncology area, the use of approaches that broadly interrogate the genome is especially compelling for rare diseases with unknown molecular causes. For example, a recent study showed that 24% of patients with undiagnosed genetic disease could be diagnosed using genomic sequencing [8]. At present, most precision medicine programs rely on high-throughput sequencing (more commonly called next-generation sequencing, NGS) to examine DNA from patient samples. Such analyses range from highly focused assays of specific regions (a few genes are sequenced) to whole-exome (all genes are sequenced) and whole-genome (the entire genome is sequenced). The ability to interrogate a large number of genes is clinically relevant because predicting the efficacy of a growing number of drugs

Andrea Sboner, PhD, is Assistant Professor in the Department of Pathology and Laboratory Medicine at Weill Cornell Medical College. He is a member of the Institute for Computational Biomedicine and the Institute for Precision Medicine.
Olivier Elemento, PhD, is Associate Professor in the Department of Physiology and Biophysics at Weill Cornell Medical College. He is a member of the Institute for Computational Biomedicine and the Institute for Precision Medicine.
Submitted: 4 March 2015; Received (in revised form): 12 April 2015
© The Author 2015. Published by Oxford University Press. For Permissions, please email: [email protected]




Precision medicine assays The vast majority of assays used in precision medicine programs interrogate genomic DNA to identify alterations of single or a few nucleotides (indels) and/or larger events such as amplifications, deletions and other structural variations. DNA is either derived from tumor or normal samples, e.g. blood lymphocytes or buccal swabs. All such assays use high-throughput sequencing. Genomic DNA-based assay types include targeted panel resequencing, whole-exome sequencing (WES) and whole-genome sequencing (WGS) (Table 1). These assays differ when it comes to genomic coverage, sequencing depth, turnaround time, clinical application and sample/tissue type (Table 1). Targeted sequencing is popular in the oncology space because (1) tissue availability is a limiting factor, (2) short turnaround time can be a strict clinical requirement, e.g. for acute myeloid leukemia (AML), and (3) mutations may be present in a small fraction of tumor cells (subclonal mutations). This category of tests includes hotspot sequencing, which has a very fast turnaround time (a few days) and requires very low DNA amounts. A popular test is the Ion AmpliSeq Cancer Hotspot Panel test, a polymerase chain reaction (PCR) amplicon-based approach that covers 40 genes or mutation hotspots, requires as little as 10–25 ng DNA, and uses the Ion Torrent Personal Genome Machine for sequencing [15]. Targeted panels such as those provided by Foundation Medicine [16] or MSKCC-IMPACT [17] query 250–400 genes at the same time, work from FFPE or frozen tissue and have a turnaround time of a few weeks. Targeted panels are especially applicable in oncology, where tumor tissue is obtained from resections or biopsies but blood or adjacent normal samples are frequently unavailable. In such settings, de novo detection of somatic mutations is challenging and the analysis is usually restricted to well-characterized mutations, e.g. mutations found in the COSMIC database [18]. Targeted gene panel resequencing also applies to non-oncology clinical areas. For example, the PulmoGene panel covers 64 genes that are associated with several well-characterized lung hereditary syndromes [19]. Of note, different sequencing platforms, e.g. Illumina, Ion Torrent/Proton, have different error profiles and require specific analytic procedures [20–23]. In assays employing a WES approach, the coding regions of all known genes are sequenced (approximately 20 000 genes or 2.5–3% of the genome) at a sequencing depth that varies between 30× for germline applications and 100× or more for oncology. In the clinical space, WES has been used for germline testing, e.g. diagnosis of rare, undiagnosed diseases. Typically, a trio (mother, father and affected child, the proband) is tested to identify the altered gene. WES is increasingly being used in oncology as well. The extra genomic coverage obtained with WES assays means that in addition to single nucleotide variants (SNV) and indels, these assays can be used to detect somatic copy number alterations (SCNAs). WGS-based testing provides the most comprehensive view of the genome. Widely hailed as the future of genomic testing, its clinical applicability is rapidly increasing. Companies such as Illumina provide clinical-grade (Clinical Laboratory Improvement Amendments, CLIA) WGS-based germline testing. Advantages include broader coverage of genic regions, long noncoding RNAs and regulatory regions, fewer artifacts owing to capture or PCR amplification biases compared with targeted approaches and the ability to uncover structural variations. WGS is the most expensive NGS strategy and the most complex and lengthy approach both in terms of turnaround time (several weeks to a few months) and analysis including data storage and compute power. For these reasons, applications of WGS testing to clinical areas where starting material is low, fast turnaround time is needed and high sequencing depth is required, such as oncology, are not yet mainstream. It is important to note that at the time of writing this review, almost none of the assay types mentioned above are available as FDA-approved tests. Hence these assays must be implemented and validated as laboratory-developed tests (LDT) in

Table 1. Genomic DNA assays. Clinical application and tissue type refer to clinical-grade applicability, not to research grade

Assay type                  Depth    Turnaround time   Genomic coverage   Clinical application   Tissue type
Targeted panel sequencing   >500     Fast (2–4 days)   20–350 genes       Cancer, germline       FFPE, fresh-frozen, FNA
Whole exome sequencing      60–150   1–2 weeks         >20,000 genes      Cancer, germline       FFPE, fresh-frozen
Whole genome sequencing     30–100   Weeks             Entire genome      Germline               Fresh-frozen

Sequencing depth refers to the number of genomic DNA fragments covering each queried nucleotide. Turnaround time is the time it takes to go from extracted DNA to a clinical report. It includes time needed to extract DNA, prepare sequencing libraries, sequence, analyze the results and generate a report. Genomic coverage refers to the fraction of the genome queried or number of genes. Sample or tissue type can be frozen tissues, fine-needle aspirates (FNAs), blood or formalin-fixed paraffin embedded (FFPE) specimens.


requires knowledge of more than one molecular alteration. For example, in lung adenocarcinomas, the EGFR inhibitors gefitinib and erlotinib target mutated EGFR but do not work when V-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog (KRAS) mutations are present [9]. Precision medicine efforts are now in place at a number of academic centers [10–12] including ours at Weill Cornell Medical College—New York Presbyterian Hospital [13]. A frequently encountered hurdle in the implementation of a precision medicine program is the bioinformatics and informatics component required to support such a program [14]. Indeed, informatics plays a key role in nearly every aspect of a precision medicine program, ranging from physician-oriented clinical interfaces that enable them to order tests and visualize and interpret results for decision support and report generation, to systems for sample tracking and handling, data acquisition, and software for analyzing the genomic assays including identification and annotation of variants. The objective of this review is to highlight and explain key informatics challenges in the implementation of a Precision Medicine program and describe approaches to address these challenges. We begin with an overview of the different assays currently being used to support a precision medicine effort. We then describe the informatics infrastructure needed to support these assays. We choose to be conservative in this review and focus on assays that are in clinical use and whose clinical utility is demonstrated. NGS testing is a fast evolving field and many novel assays that are currently being investigated are mentioned in the ‘Discussion’ section.


agreement with state-specific legal requirements as well as other recommendations, e.g. [24].

Data analytics for precision medicine: variation calling and copy number analysis for DNA sequencing

Preprocessing and QC With most next-generation sequencers, the initial ‘usable’ output (after base calling and demultiplexing) is a set of files in FASTQ format, which contain raw short reads together with their quality scores. QC can be performed based on these files. For example, a minimum number of reads with average quality score above a required threshold must be obtained. Programs such as FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) can be used to detect other problems, such as low-quality read extremities. If needed, short reads can be trimmed using tools such as Trimmomatic [27] or Sickle [28].
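As a minimal sketch of the read-level check described above, the following Python snippet counts the reads in a FASTQ stream whose mean Phred quality meets a threshold. The Phred+33 encoding, the default threshold of 30 and all function names are assumptions made for illustration; they are not values or tools prescribed in this review.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality_string) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # skip the '+' separator line
        qual = next(it).rstrip("\n")
        yield header[1:], seq, qual


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def count_passing_reads(lines: Iterable[str],
                        min_mean_quality: float = 30.0) -> Tuple[int, int]:
    """Return (reads with mean quality >= threshold, total reads)."""
    passing = total = 0
    for _, _, qual in parse_fastq(lines):
        total += 1
        if mean_phred(qual) >= min_mean_quality:
            passing += 1
    return passing, total
```

A laboratory would compare the number of passing reads against the minimum defined in its own validated procedure; an open file handle can be passed directly as `lines`.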

Read alignment The next step involves mapping the reads to the reference human genome to identify their provenance. For genomic DNA, tools such as BWA [29] and SOAP [30] are frequently used. All these tools have numerous parameters to control the alignment process. Alignment parameters are important as some might favor/disfavor discovery of specific alterations. For example, the -e flag of BWA controls gap (indel) length extension and therefore impacts the size of indels that can be discovered. Reads with low mapping quality, i.e. reads whose alignment location is unclear, should also be removed. Additional important steps meant to improve alignment and variant detection include quality score recalibration, and realignment around indels. Most frequently, the Genome Analysis ToolKit (GATK) is used to perform these procedures [31]. These tools use external databases, e.g. reference genome builds, dbSNP [32], whose specific version and content impact the overall result. In a clinical environment, changing any of the parameters or tools requires the entire pipeline to be revalidated. If applicable, PCR duplicates are removed after alignment using tools such as Picard (http://broadinstitute.github.io/picard). At the end of these steps, a clean BAM file suitable for further QC and variant calling is produced. A number of QC measures can now be determined, including number of mapped reads, fraction of uniquely mappable reads, fraction of reads on target for capture- or amplicon-based assays, average coverage, and fraction of the captured region with coverage high enough to call variants (typically 10×). If any of these metrics falls below a prespecified threshold, re-testing or confirmatory testing using newly extracted DNA may be required.
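The comparison of post-alignment QC metrics against prespecified thresholds can be sketched as below. The metric names and all threshold values are hypothetical placeholders chosen for illustration; in practice they would be fixed during assay validation and recorded in the laboratory's procedures.

```python
from typing import Dict, List

# Hypothetical minimum thresholds; real values come from assay validation.
QC_THRESHOLDS: Dict[str, float] = {
    "mapped_reads": 20_000_000,
    "fraction_uniquely_mapped": 0.90,
    "fraction_on_target": 0.60,
    "mean_coverage": 80.0,
    "fraction_target_ge_10x": 0.95,
}


def failed_qc_metrics(metrics: Dict[str, float],
                      thresholds: Dict[str, float] = QC_THRESHOLDS) -> List[str]:
    """Return the names of metrics that are missing or below their minimum."""
    failed = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            failed.append(name)
    return failed


def needs_retesting(metrics: Dict[str, float]) -> bool:
    """A sample is flagged for re-testing if any metric fails."""
    return bool(failed_qc_metrics(metrics))
```

Treating a missing metric as a failure is a deliberately conservative choice in this sketch, so that an incomplete QC record cannot pass silently.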

Variant and mutation calling: point mutations and short indels The strategy used for variant calling typically depends on the nature of the assay and the area of application. For germline applications, tools such as GATK [31] and FreeBayes [33] are frequently used. In oncology, tumor/normal pairs are sequenced together in most cases and analyzed using somatic variant detection tools such as VarScan [34], Strelka [35] and MuTect [36]. These tools are commonly used to detect somatic SNV and indels, including in samples with low tumor purity. Several groups have reported comparisons of variant discovery tools and pointed out strengths and weaknesses in each method [37–40]. All variant calling tools come with parameters that need to be carefully explored to limit false positives and false negatives, often including a minimum allele frequency threshold. To calibrate detection thresholds and other parameters, sequencing of reference samples such as NA12878 [41] or commercially available samples (e.g. http://www.horizondx.com) as well as samples independently assessed by a CLIA lab for specific mutations can be used. Initiatives such as the Genome in a Bottle Consortium (https://sites.stanford.edu/abms/giab) strive to provide standard references to help variant calling including parameter calibration. In the clinical context, clinically relevant mutations or variants cannot be missed if present in the sample, even at low allelic frequency. Insufficient coverage of these mutations may impede detection and therefore warnings should be reported. Retesting may be needed if this occurs. Ad hoc procedures that specifically investigate the tumor DNA (and the control) for the presence of clinically relevant mutations, even in a small number of reads, can be used to complement de novo discovery, e.g. using tools such as samtools/mpileup (http://samtools.sourceforge.net). Databases of recurrent mutations such as COSMIC (cancer; http://cancer.sanger.ac.uk [18]) or ClinVar (germline; http://www.ncbi.nlm.nih.gov/clinvar/ [42]) can be used to identify clinically important variants to query. Alternatively, hybrid approaches use prior information (in the form of known mutations) to reinforce observed sequence data and increase the posterior probability of mutations [16]. Historically, detecting indels has been more challenging than detecting point mutations. To overcome limitations of indel detection approaches, frequently two independent methods are considered together. Indel detection tools such as PINDEL [43] and SCALPEL [44] are improving detection rates, including for long indels, using approaches such as de novo read assembly. This is important because several long indels are clinically relevant, e.g. FLT3-ITD in AML [45]. Variant and indel discovery tools typically produce outputs in VCF format (germline) or MAF format (often used for tumors).
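To illustrate how prior knowledge of a known mutation can raise the posterior probability of a call, here is a generic two-hypothesis Bayes calculation. It is not the method of [16]; the binomial read model, the assumed variant allele fraction, the error rate and the example priors are all assumptions made for this sketch.

```python
import math


def posterior_mutation_probability(alt_reads: int,
                                   depth: int,
                                   prior: float,
                                   variant_fraction: float = 0.05,
                                   error_rate: float = 0.01) -> float:
    """Posterior probability that a site is mutated rather than an error.

    Two hypotheses are compared under a binomial model of alternate reads:
    'mutated' (alt reads drawn at variant_fraction) versus 'error'
    (alt reads drawn at error_rate). The binomial coefficient is identical
    under both hypotheses and cancels, so only the log-likelihood ratio
    is needed.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if not (0.0 < error_rate < 1.0 and 0.0 < variant_fraction < 1.0):
        raise ValueError("rates must be strictly between 0 and 1")
    if not 0 <= alt_reads <= depth:
        raise ValueError("alt_reads must be between 0 and depth")

    ref_reads = depth - alt_reads
    log_lr = (alt_reads * math.log(variant_fraction / error_rate)
              + ref_reads * math.log((1.0 - variant_fraction) / (1.0 - error_rate)))
    log_odds = math.log(prior / (1.0 - prior)) + log_lr

    # Numerically stable logistic transform of the posterior log-odds.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)
```

With the same read evidence (for example 5 alternate reads out of 100), a site listed in a database of recurrent mutations could be assigned a larger prior than an arbitrary genomic position, which yields a larger posterior. How priors are derived from a database is outside the scope of this sketch.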


NGS testing produces a large number (tens of millions) of short reads that need to be analyzed using procedures that (1) preprocess the raw data so that it can be analyzed by subsequent steps, (2) perform comprehensive quality control (QC), (3) process the aligned data to remove potential bias and artifacts and (4) perform accurate variant detection. Importantly, the CLIA / Clinical Laboratory Evaluation Program (CLEP) requirements and College of American Pathologists (CAP) guidelines for LDTs do not dictate specific technical choices for these steps [25, 26]. Rather, the methods that are most applicable to each of these procedures depend on a number of parameters including sample origin, assay type, library preparation choice, sequencing parameters and clinical application. CLIA/CLEP considerations and CAP guidelines must be kept in mind because analyses must be reproducible, traceable, be well documented and occur in compliance with privacy requirements. Other requirements typically include sequencing: (1) samples with known mutations assessed by an independent CLIA/CLEP test and demonstrating ability of analytical procedures to recover these alterations; (2) samples with mutational abundance below, at and above stated detection thresholds; (3) samples on different days and by different operators; and many others. The ability of analytical procedures to accurately and reproducibly perform these tasks is critical to their selection. In what follows, we provide an overview of commonly used methods for NGS data analytics.


Structural variations discovery Structural variation discovery plays a key role in precision medicine, as a growing number of such variants have clinical relevance. Examples include HER2 amplification in breast cancer, which (if associated with HER2 expression) can be targeted using Herceptin. Indeed, copy number variations (germline; CNV) or copy number alterations (cancer; CNA or SCNA for somatic variants) represent an important structural variation class. In oncology testing, SCNA discovery is usually performed by comparing depth-adjusted read counts within covered regions between tumor and patient matched control samples. A read count log ratio is calculated, then adjacent log ratios are segmented using procedures such as DNAcopy [48] and VarScan [34] or more sophisticated approaches using Hidden Markov Models (HMM), e.g. Excavator [49]. Tumor purity estimation tools such as CLONET [47] can help obtain a precise estimate of how many copies of a given allele are present in the tumor. For germline application, segmentation and detection of copy number variants are complicated by biases such as GC content and capture efficiency. Tools such as XHMM [50] use principal component analysis and HMM methods to overcome such biases. Table 2 summarizes the most frequently used variant discovery tools and their applications. All tools mentioned here are equally applicable to targeted panels, WES and WGS. As with point mutations and indels, detection of specific events is as important as discovery for CNAs and CNVs. Thus, read count log ratios can be investigated independently of the segmentation process to detect CNAs at specific loci. Such detection approaches are useful to detect complex variants involving breakpoints within long genes, e.g. in the PTEN gene [52].

Table 2. Common tools for variant discovery

Tool        Application                                    Reference
GATK        Germline SNPs and short indels                 [31]
FreeBayes   Germline SNPs and short indels                 [33]
VarScan     Germline and somatic SNP/SNVs, short indels    [34]
Strelka     Somatic SNVs and short somatic indels          [35]
MuTect      Somatic SNVs and short somatic indels          [36]
Pindel      Short and long indels                          [43]
SCALPEL     Short and long indels                          [44]
Excavator   Somatic Copy Number Alteration                 [49]
ExomeCNV    Germline copy number variants                  [51]
XHMM        Germline copy number variants                  [50]

This is a non-exhaustive list. SNP stands for Single Nucleotide Polymorphism.
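The read count log ratio used for SCNA discovery can be sketched as follows. This is a simplified illustration: depth adjustment is done by dividing each region's count by the sample's total count, and the pseudocount of 0.5 is an assumption to avoid taking the logarithm of zero. Segmentation (e.g. with DNAcopy) is not shown.

```python
import math
from typing import List, Sequence


def log2_ratios(tumor_counts: Sequence[float],
                normal_counts: Sequence[float],
                pseudocount: float = 0.5) -> List[float]:
    """Per-region log2 ratio of depth-adjusted tumor vs. control read counts.

    Each count is divided by its sample's total count so that differences
    in overall sequencing depth do not appear as copy number changes.
    """
    if len(tumor_counts) != len(normal_counts):
        raise ValueError("tumor and control must cover the same regions")
    tumor_total = sum(tumor_counts)
    normal_total = sum(normal_counts)
    if tumor_total <= 0 or normal_total <= 0:
        raise ValueError("total read counts must be positive")

    ratios = []
    for t, n in zip(tumor_counts, normal_counts):
        t_norm = (t + pseudocount) / tumor_total
        n_norm = (n + pseudocount) / normal_total
        ratios.append(math.log2(t_norm / n_norm))
    return ratios
```

Because the ratios are available per region, the value at a specific locus can be inspected directly, independently of any segmentation step. Note that a large amplification shifts the normalized ratios of the remaining regions slightly below zero; production tools typically correct for this with a more robust normalization.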

Variant annotation Variant annotation is the analytic process that determines the likely functional impact of a genomic variant (germline or somatic). This is an important step in a precision medicine workflow, yet it is one of the most complex and open-ended tasks because it relies on evolving standards and biological knowledge [53]. Variant annotation depends on the knowledge of a ‘gene’, typically encoded in a gene annotation. Tools such as AnnoVar [54] and SnpEff [55] are widely used. Nonsense mutations are frequently considered deleterious while a subset of missense mutations, e.g. those that change the physicochemical properties of an amino acid, are considered important. Computational tools such as SIFT [56] or PolyPhen [57] can predict deleterious changes by integrating data such as protein domains and evolutionary constraints [58]. Variants predicted to disrupt splice sites potentially leading to aberrant splicing are also considered important. Copy number alterations also need to be annotated, frequently at the gene or subgenic level. This means that all genes within deleted or amplified segments need to be identified. A CNA or CNV can also be annotated as focal, i.e. involving a ‘small’ genomic area, for example including less than 50 genes, or large scale, i.e. regions with more than 50 genes or even entire chromosome arms. Focal alterations are usually more relevant from a clinical point of view than large-scale alterations because the former contain fewer genes and often directly point to the likely driver gene.
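Gene-level annotation of a copy number segment, together with the focal versus large-scale distinction, can be sketched as below. The gene records in any usage are hypothetical placeholders rather than real coordinates, and treating a segment with exactly 50 genes as large-scale is an assumption, since the text only distinguishes fewer than 50 from more than 50.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # closed interval [start, end]
    end: int


def genes_in_segment(chrom: str, start: int, end: int,
                     genes: Sequence[Gene]) -> List[str]:
    """Names of genes that overlap the segment by at least one base."""
    return [g.name for g in genes
            if g.chrom == chrom and g.start <= end and g.end >= start]


def annotate_segment(chrom: str, start: int, end: int,
                     genes: Sequence[Gene],
                     focal_max_genes: int = 50) -> Tuple[List[str], str]:
    """Return (overlapping gene names, 'focal' or 'large-scale')."""
    hits = genes_in_segment(chrom, start, end, genes)
    scale = "focal" if len(hits) < focal_max_genes else "large-scale"
    return hits, scale
```

A linear scan over all genes is adequate for a sketch; a real pipeline annotating many segments would more likely use an interval index.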

Variant interpretation Variant interpretation is the process that connects variant annotations with actionable clinical information—often in a disease-specific context—and classifies variants into clinically distinct categories. Providing clinicians, geneticists and non-geneticists with readily interpretable test reports is a challenging task, complicated by relative absence of standards and institutional/clinical specificities [59]. Traditionally, variants and mutations are categorized into three tiers of importance, depending on their level of actionability (Table 3). Tier 1 contains the most clinically relevant alterations, that is, alterations linked to treatment according to an FDA-sanctioned set of guidelines. For example, EGFR exon 19 is actionable in non-small cell lung cancer using EGFR inhibitors. A mutation can be ‘actionable’ when it leads to a more accurate diagnosis. JAK2 V617F mutation for example can help diagnose patients with myeloproliferative disorders [60]. Importantly, classifying a variant into Tier 1 only makes sense if that variant has previously been clinically linked to the patient’s diagnosis. For example, an unexpected BRAF mutation in a tumor not previously linked to response to anti-BRAF agents may not belong to Tier 1. In the same vein, incidental findings (IF) with clinical


Key information is the variant allele frequency, which conveys the abundance of the variant in the sample. Low abundance may signify subclonality (or mosaicism) and have clinical implications, that is, actionable subclonal events may be linked with partial response to drug treatment compared with mutations present in all cells of a given sample. Importantly, clinical grade sequencing requires that negative results of important mutations or variants be reported. This implies that a variant can have three statuses: detected, undetected and indeterminate when minimum quality metrics criteria are not met (for example, when coverage is below a certain threshold). These criteria should be determined according to well-defined standard operating procedures (SOP) and will depend on the specific test used (targeted panel, WES, etc.). As suggested by CAP guidelines [25], laboratory policies should also define criteria for confirmatory testing and retesting using an independent assay. For example, retesting may be required if the indeterminate mutation status is critical given the diagnosis, e.g. BRAF V600E mutation in metastatic melanoma. Additional steps need to be pursued for oncology testing where contamination of normal cells is frequently present in tumor specimens. Tools such as ABSOLUTE [46] and CLONET [47] can be used to estimate tumor purity. Tumor purity estimation is important to correctly interpret variant allele frequencies, which can be artificially reduced in low-purity tumors.
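The three reporting statuses and the effect of tumor purity on allele frequency can be sketched as follows. All thresholds are hypothetical and would be set by the laboratory's SOP. The purity correction assumes a copy-neutral diploid locus in both tumor and normal cells, which is a simplification compared with tools such as ABSOLUTE or CLONET.

```python
def variant_status(alt_reads: int, depth: int,
                   min_depth: int = 100,
                   min_alt_reads: int = 5,
                   min_vaf: float = 0.05) -> str:
    """Classify a queried variant as detected, undetected or indeterminate.

    A site whose coverage is below min_depth is 'indeterminate', because a
    negative result cannot be trusted there.
    """
    if depth < min_depth:
        return "indeterminate"
    vaf = alt_reads / depth
    if alt_reads >= min_alt_reads and vaf >= min_vaf:
        return "detected"
    return "undetected"


def purity_adjusted_vaf(observed_vaf: float, purity: float) -> float:
    """Estimate the allele frequency within tumor cells only.

    Assumes two copies of the locus in tumor and normal cells, so the
    observed frequency is diluted by the factor `purity`.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    return min(1.0, observed_vaf / purity)
```

For example, an observed allele frequency of 0.10 in a specimen with 40% tumor content corresponds to roughly 0.25 within the tumor cells under these assumptions.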


Precision medicine reports Reporting is a key step in a precision medicine workflow and is what most visibly distinguishes clinical from research NGS. Reports are vehicles to communicate precision medicine results to the referring physicians and to patients in some cases. While the optimal structure of precision medicine reports is still heavily debated, a few accepted themes are starting to emerge. For example, it is widely accepted that they need to be succinct and provide clear and concise clinical recommendations [61]. The multi-tier structure together with interpretive comments is also widely accepted and frequently part of CLIA/CLEP guidelines. Other aspects, such as whether or how to report VUS, are debated. In such cases, it is possible that different clinicians will have distinct preferences and clinician-customized reports may offer an attractive solution. In addition to tiered and interpreted mutations and variants detected, precision medicine reports need to state pertinent negative results regarding disease-specific clinically relevant mutations and provide information about the test (extraction methods for DNA, library preparation protocols, average coverage of the sample(s), analytic pipeline version, disclaimers, etc.). Reports should state if specimens do not meet laboratory criteria for the test. They also need to include information about the patient, e.g. patient name and identification number or unique identifier, lab information, specimen information such as diagnosis, date of collection, type, etc. The amount and type of information to be reported are frequently dictated by CLIA/CLEP requirements. For example, the CAP [62], American College of Medical Genetics and Genomics (ACMG) and Centers for Disease Control have provided recommendations to supplement the CLIA general report format for molecular testing. In many US states, guidelines indicate that reporting of IF must be included in the precision medicine report. Identification of IFs indeed has the potential to impact the health of patients and their families (if germline). Guidelines for reporting IFs are evolving: ACMG recommends reporting IFs in 56 genes at the time of writing this report [63], but this number is likely to increase in the near future [64].

Table 3. Classification of genomic alterations in three tiers

Class    Description of alterations
Tier 1   This Tier contains the most clinically relevant alterations: (1) linked to treatment according to an FDA-sanctioned set of guidelines; (2) leading to a more accurate diagnosis.
Tier 2   Alterations in this Tier have some clinical potential but not yet proven clinical relevance, i.e. those in or near known and frequently mutated cancer genes and/or genes/pathways targeted by molecules in clinical trials.
Tier 3   All remaining alterations are included here, typically called VUS.

Information technology infrastructure for precision medicine There are several information technology (IT) infrastructure considerations for a comprehensive, effective and secure precision medicine program. Most importantly, a precision medicine program must be integrated into a medical institution’s clinical systems [10]. Such integration also facilitates billing and reimbursement. As many institutions run Electronic Health Record (EHR) systems, such as EpicCare, and EHRs are the main entry point for all clinicians’ interactions, it is critical that the precision medicine workflow be able to (1) get orders from the EHR and (2) return results to the EHR. Moreover, as molecular testing (including NGS) is often an activity performed by pathologists, integration with pathology systems such as CoPath or Cerner Millennium Helix is critical. Figure 1 illustrates one way by which this integration can occur. In this workflow, a physician places an NGS order via the EHR, which triggers the appropriate events to coordinate the acquisition of samples (either from archival material or from surgery). Specimens are accessioned in a pathology system, and diagnosis, site and histology are stored along with case images, e.g. H&E staining, and other parameters, e.g. tumor content. Pathologists select the tissue to submit for NGS testing (and the appropriate matched control sample if needed) and the specimens enter the NGS workflow. In this workflow, specimens are tracked via an NGS laboratory information management system (LIMS). The LIMS oversees the entire NGS testing process including the collection of QC metrics and reports. The LIMS must interact with other instruments such as the NGS sequencer and other devices (Agilent Bioanalyzer, Picogreen® assays, etc.). The analysis of NGS raw read data can be performed on dedicated workstations (for smaller targeted panels), but most commonly requires high-performance computing facilities,


relevance can also be discovered, e.g. a germline BRCA1/2 carrier mutation, but would not be a Tier 1 mutation if the diagnosis is not cancer. Tier 2 mutations are typically variants with potential but not yet proven clinical relevance. For example, alterations for which there is a drug in clinical trials are frequently considered as Tier 2 mutations. Variants that are in or near known cancer hotspots or in known cancer genes can also be considered as Tier 2. Finally, Tier 3 mutations are all other variants or mutation—often called variants of unknown significance (VUS). This multi-tier categorization applies to both point mutations, indels and copy number alterations. Rapidly evolving scientific knowledge linking variants to specific phenotypes or treatments give rise to the challenging task to maintain the multi-tier structure of variants up-to-date. The categorization of variants into tiers is usually either done manually for small panels but increasingly done using a database of variants or disease genes as well as rules to apply to variants or groups of variants. Databases such as ClinVar (for R germline variants), HGMDV (http://www.hgmd.cf.ac.uk/ac/ index.php; germline) and MyCancerGenome (http://www. mycancergenome.org; cancer), Personalized Cancer Therapy knowledge base (https://pct.mdanderson.org; cancer) and DOCM (http://docm.genome.wustl.edu; cancer) are increasingly being used to automate the process. Custom databases are also being built to address specific, often local clinical needs. For example a cancer center that specializes in treating patients with sarcomas may develop a targeted assay for this disease, thus requiring collection and curation of disease-specific genomic variants, e.g. gene fusions involving EWSR1. Interpretative comments linked to mutation–disease pairs and validated by expert clinicians can also be stored in these databases. 
The development of standards for encoding variants, variant interpretation and treatments, including drugs, will help create and maintain up-to-date, community-driven knowledge bases. In turn, making these knowledge bases available as shared resources will further speed up the adoption of these standards and improve the quality, reliability and implementation of genomic testing for precision medicine [1].


viewable by the referring physician. A PDF-based strategy is employed by a number of centers and is meant to overcome limitations of existing EHRs [59]. Health Level 7 (HL7) is the de facto standard framework for clinical systems and enables exchange, integration, sharing and retrieval of health information. Hence, it is foreseeable that structured data messages in HL7 format will increasingly be used in the precision medicine context. In that regard, an HL7 clinical genomics format is maturing to enable seamless communication of genomic results between clinical systems (http://www.hl7.org/Special/committees/clingenomics/). In the United States, requirements regarding hardware and software infrastructure for precision medicine vary from state to state. However, all of them include lock-down and documentation. Lock-down means that the exact same pipeline, software (including version) and databases that were described as part of the validation procedure need to be used for all subsequent analyses. If the analytic component is modified after validation, analyses must be rerun on previously sequenced samples to demonstrate the absence of a substantial negative impact on alteration detection. Software versioning is fundamental, and the analytical pipeline version must be included in the precision medicine report. Versioning can be performed using open-source software engineering tools such as Git (https://github.com) and Subversion (https://subversion.apache.org), not only for software code and scripts, but also for the analysis results and the knowledge

Figure 1. Components of an integrated precision medicine workflow. This figure illustrates a possible workflow that connects existing EHR and accessioning systems to a novel precision medicine infrastructure. A LIMS provides the central connection point between these existing systems and the NGS testing workflow. In addition to sample information, the LIMS stores quality control (QC) metrics and final reports. The latter are transmitted back to the accessioning system for sign-off and transfer to the EHR.
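As a rough illustration of the specimen tracking role that Figure 1 assigns to the LIMS, the following Python sketch models a specimen moving through an ordered sequence of workflow stages while an audit trail is recorded. The stage names, the `Specimen` class and its fields are hypothetical simplifications chosen for this example and are not taken from any particular LIMS product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Stage(IntEnum):
    """Hypothetical workflow stages, loosely following Figure 1."""
    ORDERED = 0          # NGS order placed in the EHR
    ACCESSIONED = 1      # specimen registered in the pathology system
    TISSUE_SELECTED = 2  # pathologist selects tissue (and matched control)
    SEQUENCED = 3
    ANALYZED = 4
    REPORTED = 5         # report uploaded to the accessioning system
    SIGNED_OFF = 6       # electronic signature by the laboratory director
    RETURNED_TO_EHR = 7


@dataclass
class Specimen:
    specimen_id: str
    stage: Stage = Stage.ORDERED
    qc_metrics: dict = field(default_factory=dict)
    audit_trail: list = field(default_factory=list)

    def advance(self, user: str) -> Stage:
        """Move to the next stage and record who did it and when."""
        if self.stage == Stage.RETURNED_TO_EHR:
            raise ValueError("workflow already complete")
        self.stage = Stage(self.stage + 1)
        self.audit_trail.append(
            (datetime.now(timezone.utc), user, self.stage.name)
        )
        return self.stage


if __name__ == "__main__":
    specimen = Specimen("S-001")
    specimen.advance("accessioning_clerk")
    specimen.qc_metrics["tumor_content"] = 0.6  # illustrative QC value
    print(specimen.stage.name, len(specimen.audit_trail))
```

Allowing only single-step forward transitions is a deliberate simplification; a production system would also need to handle failed QC, repeat sequencing and report amendments.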


especially for WES. A cluster infrastructure and fast (low-latency) network-based data transfer are must-haves for large-scale NGS testing workflows. In a cluster infrastructure, analyses are executed on multiple compute nodes. Moreover, reliable data storage must be in place, especially given that (1) state requirements frequently include storing NGS data for a number of years and (2) re-analysis of raw NGS data might be required as improved detection techniques and expanded knowledge bases become available [65]. Cloud-based data storage and computing solutions have been explored in the research context [66]. In the clinical setting, several cloud vendors such as Amazon and Microsoft provide HIPAA-compliant cloud-based data storage and offer business associate agreements, meaning that they can legally host Protected Health Information. However, institutions that use these cloud services need to fulfill substantial requirements such as secure data transfer and authorized access. Moreover, institution-specific factors such as data transfer speed, proximity to a data center and privacy rules vary widely and may in some cases delay adoption of cloud computing [67]. After variant calling and report generation, precision medicine reports are uploaded back into the sample accessioning system as Portable Document Format (PDF) files, flat text files or structured data. These systems frequently allow sign-off using electronic signature by a clinical laboratory director. The reports can then be transmitted back to the EHR and become


bases used for generating a report. If the analysis is rerun on previous samples and novel actionable alterations (typically Tier 1) are found, the precision medicine report may need to be amended and communicated to the physician who initially requested the test. In a precision medicine workflow, privacy and HIPAA compliance are important because genetic data are patient data and could be linked to patient identity, even if temporarily anonymized for certain steps, e.g. analytics. From an IT point of view, extra care must be taken to keep software, including operating systems, web servers, etc., up to date on all machines with network access. Strict SOPs for testing software when these updates occur must be put in place to avoid service continuity issues. Likewise, strict access control rules must ensure that only authorized personnel have access to the genomic and clinical data.
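One possible way to make the lock-down requirement checkable in software is to record a manifest at validation time and compare it against the state of the system before each run. The Python sketch below does this with SHA-256 checksums. The manifest layout and the function names `build_manifest` and `verify_lockdown` are assumptions made for this example; they are not part of any regulatory specification or of the tools named above.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum a file in 1 MiB chunks so large databases fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(pipeline_version: str, files: list[Path]) -> dict:
    """Record the pipeline version and a checksum for every tracked file."""
    return {
        "pipeline_version": pipeline_version,
        "checksums": {str(p): sha256_of(p) for p in sorted(files)},
    }


def verify_lockdown(validated: dict, current: dict) -> list[str]:
    """Return human-readable deviations from the validated manifest."""
    problems = []
    if validated["pipeline_version"] != current["pipeline_version"]:
        problems.append("pipeline version changed")
    for name, checksum in validated["checksums"].items():
        if current["checksums"].get(name) != checksum:
            problems.append(f"changed or missing: {name}")
    extra = current["checksums"].keys() - validated["checksums"].keys()
    for name in sorted(extra):
        problems.append(f"unvalidated file: {name}")
    return problems
```

An empty list from `verify_lockdown` indicates that the scripts and databases match what was validated. Any reported deviation would, under the requirements described above, call for rerunning previously sequenced samples before clinical use. The pipeline version string stored in the manifest is also what would be printed on the report.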

Precision medicine is still a young area and many aspects are in a state of flux. There is currently little standardization regarding best practices, especially when it comes to analytical tools and workflows. Knowledge bases such as MyCancerGenome, ClinVar and PCT are still nascent and incomplete. As a result, centers frequently need to reinvent the wheel and create custom mutation databases to accommodate specific, often local, clinical needs and expertise. For example, local expertise in lung disease may necessitate creating a lung disease genetic variant database to accommodate targeted sequencing assays that solely interrogate lung disease genes. This will likely change in the future as community-driven knowledge bases improve, perhaps populated by crowdsourcing [68]. The development of standards for variant description and interpretation will help institutions accommodate specific clinical needs and tap information from multiple independent databases within a standardized framework. Efforts such as the Global Alliance for Genomics and Health will help realize the vision of the 2011 Institute of Medicine precision medicine report [1] by connecting genomic variants to clinical phenotypes and improving actionability of genomes [69]. Additional informatics techniques will need to be developed to address unresolved challenges in precision medicine, for example, to better detect structural variants including germline variants or to detect active/suppressed pathways in n = 1 studies. Noncoding variants, e.g. variants affecting promoters and enhancers and therefore gene expression, may eventually be added to mutation databases, once their clinical utility and actionability are established. Better modeling of mutations (and combinations) via a predictive, network-level understanding of drug-induced perturbations may eventually be included as part of these databases.
In cancer, the clinical relevance of clonality is not well understood and will require extensive clinical trials to investigate [70]. Current efforts to implement precision medicine projects mostly focus on genomic DNA. Other assays, e.g. RNA-seq, proteomics (mass spectrometry, Reverse Phase Protein Arrays, mass-cytometry), metabolomics, microbiome, epigenomics, will likely complement genomic DNA in the near future. As a complementary assay, RNA-seq is probably the most mature technology. RNA-seq can provide orthogonal support and validation for some of the DNA alterations, e.g. sequence support for point mutations and indels, expression validation for genes affected by SCNA, splicing analysis for splice site mutations [71].

Perhaps most importantly, RNA-seq can provide gene fusion detection [72–74] as well as active/suppressed pathway detection using pathway-specific gene sets [75] and computational techniques such as GSEA [76]. Moreover, a comprehensive precision medicine approach for patient care will eventually integrate other data from patients’ health records and health history and from increasingly popular external devices tracking individuals’ activities, from fitness data to food consumed to environmental factors [77–80].
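To give a sense of how gene-set-based pathway detection works, the following Python sketch computes a simplified running-sum enrichment score for one gene set over a ranked gene list. It is a deliberately reduced, unweighted variant written for illustration: the published GSEA method [76] additionally weights hits by their correlation with the phenotype and assesses significance by permutation, neither of which is done here. The function name and gene labels are hypothetical.

```python
def enrichment_score(ranked_genes: list[str], gene_set: set[str]) -> float:
    """Unweighted running-sum enrichment score over a ranked gene list.

    Walks down the list, stepping up for genes in the set and down for
    genes outside it, and returns the largest deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


if __name__ == "__main__":
    # Genes ranked from most up-regulated to most down-regulated (made up).
    ranked = ["G1", "G2", "G3", "G4"]
    print(enrichment_score(ranked, {"G1", "G2"}))  # set at the top of the list
    print(enrichment_score(ranked, {"G3", "G4"}))  # set at the bottom
```

A positive score indicates that the set's genes cluster toward the top of the ranking (suggesting an active pathway under this ranking), and a negative score indicates clustering toward the bottom.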

Conclusion The concept of precision medicine has been rapidly growing in the medical community, pushed by the tremendous expansion of NGS technology. Recently, a major effort by the United States administration, the Precision Medicine Initiative (http://www.nih.gov/precisionmedicine/), has been consolidating the concept of using molecular information to provide better patient care. The establishment of precision medicine programs requires the development and, especially, the integration of many different components. This review highlights some of the computational tools required to implement a precision medicine program and the challenges involved.

Funding This work was partially supported by NSF CAREER (O.E.), LLS SCOR (O.E.), Hirschl Trust (O.E.) and Starr Cancer Consortium I6-A618 (O.E.) grants.

Acknowledgments We are grateful to all members of Weill Cornell Medical College’s Institute for Precision Medicine for discussion and insights. We are also grateful to Keith Robison for insightful comments on the initial version of the manuscript. We are also grateful to our three anonymous reviewers for important and constructive criticisms and suggestions, which substantially improved the manuscript.

References
1. National Research Council (US) Committee on a framework for developing a new taxonomy of disease. In: Toward Precision Medicine: Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease. Washington, DC, 2011.
2. Keutgen XM, Filicori F, Crowley MJ, et al. A panel of four miRNAs accurately differentiates malignant from benign indeterminate thyroid lesions on fine needle aspiration. Clin Cancer Res 2012;18(7):2032–8.
3. Kwak EL, Bang YJ, Camidge DR, et al. Anaplastic lymphoma kinase inhibition in non-small-cell lung cancer. N Engl J Med 2010;363(18):1693–703.
4. Pao W, Miller V, Zakowski M, et al. EGF receptor gene mutations are common in lung cancers from “never smokers” and are associated with sensitivity of tumors to gefitinib and erlotinib. Proc Natl Acad Sci USA 2004;101(36):13306–11.
5. Romond EH, Perez EA, Bryant J, et al. Trastuzumab plus adjuvant chemotherapy for operable HER2-positive breast cancer. N Engl J Med 2005;353(16):1673–84.


Future directions


6. Cancer Genome Atlas Research Network. Comprehensive molecular profiling of lung adenocarcinoma. Nature 2014;511(7511):543–50.
7. Peters S, Michielin O, Zimmermann S. Dramatic response induced by vemurafenib in a BRAF V600E-mutated lung adenocarcinoma. J Clin Oncol 2013;31(20):e341–4.
8. Zhu X, Petrovski S, Xie P, et al. Whole-exome sequencing in undiagnosed genetic diseases: interpreting 119 trios. Genet Med 2015, online January 15:1–8.
9. Pao W, Wang TY, Riely GJ, et al. KRAS mutations and primary resistance of lung adenocarcinomas to gefitinib or erlotinib. PLoS Med 2005;2(1):e17.
10. Lazaridis KN, McAllister TM, Babovic-Vuksanovic D, et al. Implementing individualized medicine into the medical practice. Am J Med Genet C Semin Med Genet 2014;166C(1):15–23.
11. Van Allen EM, Wagle N, Stojanov P, et al. Whole-exome sequencing and clinical interpretation of formalin-fixed, paraffin-embedded tumor samples to guide precision cancer medicine. Nat Med 2014;20(6):682–8.
12. Roychowdhury S, Iyer MK, Robinson DR, et al. Personalized oncology through integrative high-throughput sequencing: a pilot study. Sci Transl Med 2011;3(111):111ra121.
13. Beltran H, Eng K, Mosquera JM, et al. Whole exome sequencing of metastatic cancer and biomarkers of treatment response. JAMA Oncology 2015, in press.
14. Servant N, Romejon J, Gestraud P, et al. Bioinformatics for precision medicine in oncology: principles and application to the SHIVA clinical trial. Front Genet 2014;5:152.
15. Chen K, Meric-Bernstam F, Zhao H, et al. Clinical actionability enhanced through deep targeted sequencing of solid tumors. Clin Chem 2015;61:544–53.
16. Frampton GM, Fichtenholtz A, Otto GA, et al. Development and validation of a clinical cancer genomic profiling test based on massively parallel DNA sequencing. Nat Biotechnol 2013;31(11):1023–31.
17. Brannon AR, Vakiani E, Sylvester BE, et al. Comparative sequencing analysis reveals high genomic concordance between matched primary and metastatic colorectal cancer lesions. Genome Biol 2014;15(8):454.
18. Forbes SA, Beare D, Gunasekaran P, et al. COSMIC: exploring the world’s knowledge of somatic mutations in human cancer. Nucleic Acids Res 2015;43:D805–11.
19. Diana MT, Mark JB, Elizabeth DH, et al. Integrating genetics into subspecialty care: The PulmoGene Test - comprehensive testing for hereditary causes of lung disease. In: A93. Immunologic and Genetic Biomarkers of Inflammatory Lung Disease. American Thoracic Society. International Conference, May 16–21, 2014, San Diego, CA, USA, A2175.
20. Nakamura K, Oshima T, Morimoto T, et al. Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res 2011;39(13):e90.
21. Bragg LM, Stone G, Butler MK, et al. Shining a light on dark sequencing: characterising errors in Ion Torrent PGM data. PLoS Comput Biol 2013;9(4):e1003031.
22. Salipante SJ, Kawashima T, Rosenthal C, et al. Performance comparison of Illumina and ion torrent next-generation sequencing platforms for 16S rRNA-based bacterial community profiling. Appl Environ Microbiol 2014;80(24):7583–91.
23. Schirmer M, Ijaz UZ, D’Amore R, et al. Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform. Nucleic Acids Res 2015;43(6):e37.
24. Gargis AS, Kalman L, Berry MW, et al. Assuring the quality of next-generation sequencing in clinical laboratory practice. Nat Biotechnol 2012;30(11):1033–6.
25. Aziz N, Zhao Q, Bry L, et al. College of American Pathologists’ laboratory standards for next-generation sequencing clinical tests. Arch Pathol Lab Med 2015;139(4):481–93.
26. “Next Generation” Sequencing (NGS) guidelines for somatic genetic variant detection. http://www.wadsworth.org/labcert/TestApproval/forms/NextGenSeq_ONCO_Guidelines.pdf.
27. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30(15):2114–20.
28. Joshi NA, Fass JN. Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files 2011. https://github.com/najoshi/sickle.
29. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009;25(14):1754–60.
30. Li R, Li Y, Kristiansen K, et al. SOAP: short oligonucleotide alignment program. Bioinformatics 2008;24(5):713–4.
31. DePristo MA, Banks E, Poplin R, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011;43(5):491–8.
32. Sherry ST, Ward MH, Kholodov M, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001;29(1):308–11.
33. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv 1207.3907 [q-bio.GN].
34. Koboldt DC, Chen K, Wylie T, et al. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 2009;25(17):2283–5.
35. Saunders CT, Wong WS, Swamy S, et al. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics 2012;28(14):1811–7.
36. Cibulskis K, Lawrence MS, Carter SL, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 2013;31(3):213–9.
37. Wang Q, Jia P, Li F, et al. Detecting somatic point mutations in cancer genome sequencing data: a comparison of mutation callers. Genome Med 2013;5(10):91.
38. Liu X, Han S, Wang Z, et al. Variant callers for next-generation sequencing data: a comparison study. PLoS One 2013;8(9):e75619.
39. Yi M, Zhao Y, Jia L, et al. Performance comparison of SNP detection tools with Illumina exome sequencing data–an assessment using both family pedigree information and sample-matched SNP array data. Nucleic Acids Res 2014;42(12):e101.
40. Roberts ND, Kortschak RD, Parker WT, et al. A comparative analysis of algorithms for somatic SNV detection in cancer. Bioinformatics 2013;29(18):2223–30.
41. Zook JM, Chapman B, Wang J, et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol 2014;32(3):246–51.
42. Landrum MJ, Lee JM, Riley GR, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 2014;42:D980–5.
43. Ye K, Schulz MH, Long Q, et al. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 2009;25(21):2865–71.
44. Narzisi G, O’Rawe JA, Iossifov I, et al. Accurate de novo and transmitted indel detection in exome-capture data using microassembly. Nat Methods 2014;11(10):1033–6.
45. Smith CC, Wang Q, Chin CS, et al. Validation of ITD mutations in FLT3 as a therapeutic target in human acute myeloid leukaemia. Nature 2012;485(7397):260–3.
46. Carter SL, Cibulskis K, Helman E, et al. Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol 2012;30(5):413–21.
47. Prandi D, Baca SC, Romanel A, et al. Unraveling the clonal hierarchy of somatic genomic aberrations. Genome Biol 2014;15(8):439.
48. Olshen AB, Venkatraman ES, Lucito R, et al. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557–72.
49. Magi A, Tattini L, Cifola I, et al. EXCAVATOR: detecting copy number variants from whole-exome sequencing data. Genome Biol 2013;14(10):R120.
50. Fromer M, Moran JL, Chambert K, et al. Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth. Am J Hum Genet 2012;91(4):597–607.
51. Sathirapongsasuti JF, Lee H, Horst BAJ, et al. Exome Sequencing-Based Copy-Number Variation and Loss of Heterozygosity Detection: ExomeCNV. Bioinformatics 2011;27(19):2648–54.
52. Berger MF, Lawrence MS, Demichelis F, et al. The genomic complexity of primary human prostate cancer. Nature 2011;470(7333):214–20.
53. McCarthy DJ, Humburg P, Kanapin A, et al. Choice of transcripts and software has a large effect on variant annotation. Genome Med 2014;6(3):26.
54. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010;38(16):e164.
55. Cingolani P, Platts A, Wang le L, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 2012;6(2):80–92.
56. Ng PC, Henikoff S. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res 2003;31(13):3812–4.
57. Adzhubei IA, Schmidt S, Peshkin L, et al. A method and server for predicting damaging missense mutations. Nat Methods 2010;7(4):248–9.
58. Peterson TA, Doughty E, Kann MG. Towards precision medicine: advances in computational approaches for the analysis of human variants. J Mol Biol 2013;425(21):4047–63.
59. Dorschner MO, Amendola LM, Shirts BH, et al. Refining the structure and content of clinical genomic reports. Am J Med Genet C Semin Med Genet 2014;166C(1):85–92.
60. Kralovics R, Passamonti F, Buser AS, et al. A gain-of-function mutation of JAK2 in myeloproliferative disorders. N Engl J Med 2005;352(17):1779–90.
61. Zawati MH, Parry D, Thorogood A, et al. Reporting results from whole-genome and whole-exome sequencing in clinical practice: a proposal for Canada? J Med Genet 2014;51(1):68–70.
62. Gulley ML. Public policy recommendations for oversight of molecular laboratory tests. N C Med J 2007;68(2):109–11.
63. Smith LA, Douglas J, Braxton AA, et al. Reporting incidental findings in clinical whole exome sequencing: incorporation of the 2013 ACMG recommendations into current practices of genetic counseling. J Genet Couns. Published online 18 November 2014; doi: 10.1007/s10897-014-9794-4.
64. Amendola LM, Dorschner MO, Robertson PD, et al. Actionable exomic incidental findings in 6503 participants: challenges of variant classification. Genome Res 2015;25(3):305–15.
65. Deverka PA, Dreyfus JC. Clinical integration of next generation sequencing: coverage and reimbursement challenges. J Law Med Ethics 2014;42(Suppl 1):22–41.
66. El-Kalioby M, Abouelhoda M, Kruger J, et al. Personalized cloud-based bioinformatics services for research and education: use cases and the elasticHPC package. BMC Bioinformatics 2012;13(Suppl 17):S22.
67. Dove ES, Joly Y, Tasse AM, et al. Genomic cloud computing: legal and ethical points to consider. Eur J Hum Genet, 24 September 2014; doi: 10.1038/ejhg.2014.196.
68. Evanko D. The power of a crowd. Nat Methods 2014;11(1):31.
69. Hayden EC. Geneticists push for global data-sharing. Nature 2013;498(7452):16–7.
70. Laurent-Puig P, Pekin D, Normand C, et al. Clinical relevance of KRAS-mutated subclones detected with picodroplet digital PCR in advanced colorectal cancer treated with anti-EGFR therapy. Clin Cancer Res 2015;21:1087–97.
71. Wang K, Singh D, Zeng Z, et al. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res 2010;38(18):e178.
72. Sboner A, Habegger L, Pflueger D, et al. FusionSeq: a modular framework for finding gene fusions by analyzing paired-end RNA-sequencing data. Genome Biol 2010;11(10):R104.
73. Iyer MK, Chinnaiyan AM, Maher CA. ChimeraScan: a tool for identifying chimeric transcription in sequencing data. Bioinformatics 2011;27(20):2903–4.
74. McPherson A, Hormozdiari F, Zayed A, et al. deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data. PLoS Comput Biol 2011;7(5):e1001138.
75. Liberzon A, Subramanian A, Pinchback R, et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 2011;27(12):1739–40.
76. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005;102(43):15545–50.
77. Bonato P. Wearable sensors and systems. From enabling technology to clinical applications. IEEE Eng Med Biol Mag 2010;29(3):25–36.
78. Quantified self. Fit, fit, hooray! 2013. http://www.economist.com/blogs/babbage/2013/05/quantified-self.
79. The Data-Driven Life. 2010. http://www.nytimes.com/2010/05/02/magazine/02self-measurement-t.html.
80. The Rise of the ‘Quantified Self’ in Health Care. 2013. http://blogs.wsj.com/venturecapital/2013/08/13/the-rise-of-the-quantified-self-in-health-care/.

