Vanno: a visualization-aided variant annotation tool.

Humu-2014-0288 Informatics

Vanno: A Visualization-aided Variant Annotation Tool

Po-Jung Huang1, 2, ¥, Chi-Ching Lee1, ¥, Bertrand Chin-Ming Tan3, Yuan-Ming Yeh4, Kuo-Yang Huang5, Ruei-Chi Gan1, Ting-Wen Chen1, Cheng-Yang Lee1, Sheng-Ting Yang6, Chung-Shou Liao6, Hsuan Liu2, 7, * and Petrus Tang1, 2, 5, *#

1 Bioinformatics Core Laboratory, Chang Gung University, Taoyuan, Taiwan
2 Molecular Medicine Research Center, Chang Gung University, Taoyuan, Taiwan
3 Department of Biomedical Sciences, Chang Gung University, Taoyuan, Taiwan
4 Bioinformatics Division, Tri-I Biotech, Inc., Taipei, Taiwan
5 Molecular Regulation and Bioinformatics Laboratory, Chang Gung University, Taoyuan, Taiwan
6 Department of Industrial Engineering and Engineering Management, National Tsing Hua University, Hsinchu, Taiwan
7 Department of Molecular and Cellular Biology, Chang Gung University, Taoyuan, Taiwan

This article has been accepted for publication and undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process, which may lead to differences between this version and the Version of Record. Please cite this article as doi: 10.1002/humu.22684. This article is protected by copyright. All rights reserved.

¥ Joint first authors
* Joint corresponding authors
# To whom correspondence should be addressed

Petrus Tang Tel: +886 3 2118800(5136) Fax: +886 3 2118122 Email: [email protected]

Contract grant sponsors: Chang Gung Molecular Medicine Research Center (EMRPD1D0671 & EMRPD1D0841 & EMRPD1D0761); Ministry of Education, ROC to Chang Gung University; Ministry of Science and Technology, Taiwan (MOST 103-2632-B-182-001); Chang Gung Memorial Hospital (CMRPD190603).


ABSTRACT

Next-generation sequencing (NGS) technologies have revolutionized the field of genetics and are moving toward clinical diagnostics. Exome and targeted sequencing in a disease context represent a major clinical application of NGS because of their utility and cost-effectiveness. With the ongoing discovery of disease-associated genes, various gene panels have been launched for both basic research and diagnostic tests. However, fundamental inconsistencies among the diverse annotation sources, software packages, and data formats have complicated the subsequent analysis. To manage disease-associated NGS data, we developed Vanno, a web-based application for in-depth analysis and rapid evaluation of disease-causative genome sequence alterations. Vanno integrates information from biomedical databases, functional predictions from available evaluation models, and mutation landscapes from TCGA cancer types. A highly integrated framework that incorporates filtering, sorting, clustering, and visual analytic modules is provided to facilitate exploration of oncogenomics datasets at different levels, such as gene, variant, protein domain, or 3D structure. Such a design is crucial for extracting knowledge from sequence alterations and translating biological insights into clinical applications. Taken together, Vanno supports almost all disease-associated gene tests and exome sequencing panels designed for NGS, providing a complete solution for targeted and exome sequencing analysis. Vanno is freely available at http://cgts.cgu.edu.tw/vanno.

Key Words: NGS, Exomes, SNV annotation, TCGA

INTRODUCTION

Over the past few years, we have witnessed a rapid revolution in sequencing technologies, which has already improved our understanding of cancer genomics and provided new opportunities for cancer diagnostics. As the technology becomes more accessible, whole-genome sequencing (WGS) and whole-exome sequencing (WES) are increasingly used in clinical medicine and are expected to reveal genetic findings of potential clinical importance. While WGS provides insights into genomic determinants in 95% of the individual patient's genome, the cost, time, and expertise required for WGS data acquisition and analysis largely constrain its adoption in a clinical setting. Moreover, only a small subset of genomic variants identified by WGS has been demonstrated to cause health effects. In contrast to WGS, WES focuses on protein-coding regions of the genome, which comprise less than 2% of the entire genome but account for over 85% of all mutations associated with Mendelian disorders (Ng et al. 2010). Such a targeted enrichment strategy can substantially reduce the costs of both sequencing and informatics analysis, achieve higher sequencing coverage over regions of interest, and simplify the subsequent bioinformatics workflow. Accumulating evidence supports the clinical value of WES for the molecular subtyping of cancer and the diagnosis of Mendelian disease (Cancer Genome Atlas Network 2012; Yang et al. 2013). However, several technical issues also exist for exome sequencing. First, a skew in coverage uniformity across the human coding exome can be introduced by various factors, such as GC-richness, repetitive sequences, target capture specificity, or low mapping quality. Second, a substantial fraction of the exome (5-10%) is not covered properly (Bamshad et al. 2011), which may reduce the effectiveness of biological or medical research that analyzes small sets of genes in a known disease.

In contrast, disease-oriented tests that focus on specific disease-associated genes or hotspot mutation regions generally achieve higher sequencing coverage than WES. Disease-targeted testing has been applied to disorders with causative genes using Sanger sequencing, which has been firmly established as a molecular diagnosis tool for many years. With the declining cost of NGS, diagnostic testing is moving from single genes to panels of genes, a transition that is becoming a major NGS application in clinical sequencing. A growing number of gene panels for diagnostic testing of hereditary disorders and cancers have been launched by academic and clinical laboratories and by biotechnology companies, reflecting the fast evolution and importance of targeted sequencing applications in clinical contexts (Costa et al. 2013; Nikiforova et al. 2013; Beck et al. 2014).

As diagnostics moves towards genetic testing, an added challenge is delivering information from the sequencing laboratory to clinicians and research scientists for more in-depth studies. An effective report is a crucial component of disease-targeted tests and relies on detailed and accurate phenotyping and on efficient data-filtering approaches applied to all identified variants. Although most commercial tests ship with analysis packages alongside the sequencer, information about variants is limited to identifiers in dbSNP (Sherry et al. 2001) and COSMIC (Shepherd et al. 2011); thus, sufficient information for functional interpretation of each identified variant is lacking. Several software packages were developed to tackle this issue (Wang et al. 2010; Asmann et al. 2012; San Lucas et al. 2012), but informatics expertise is required to operate command-line driven software and obtain the highest yield of information. Furthermore, most existing packages are designed to interpret variants at the level of individual samples (Wang et al. 2010; Asmann et al. 2012; Douville et al. 2013) and leave cross-sample analysis to the end user, making it difficult to assess variants in a disease cohort study.

Here, we present Vanno, which builds on our previous pipeline, CPAP (Huang et al. 2013), which was designed to support the two cancer-targeted NGS tests that were commercially available at the time. In this new version, we support the analysis of ~30 commercially available disease-targeted gene panels and whole-exome panels, ranging from inherited disease genes to comprehensive cancer-related genes as well as whole human exomes. Although the TCGA has released the analytical results for somatic mutations and small insertions/deletions across 12 tumor types and laid the groundwork for subsequent cancer-related studies, the lack of a linking interface makes it difficult to compare in-house data with TCGA. Vanno has been extended to include a highly integrated framework for visual comparison and functional analysis of genomic variants across datasets, including the comparison of a mutation spectrum with TCGA at the protein domain level. This tool aims to support researchers in the identification and interrogation of disease-relevant variations.


MATERIALS AND METHODS

Pipeline overview

Vanno provides an intuitive, user-friendly interface for annotating and examining genetic variants, and supports a wide variety of variant calling formats and almost all commercially available gene and exome sequencing panels. An integrated database is generated from multiple annotation sources and linked to our in-house analysis pipeline through the Sun Grid Engine queuing system, which handles computationally intensive procedures (Gentzsch 2001). Because annotation sources (e.g., dbSNP (Sherry et al. 2001), dbNSFP (Liu et al. 2013), COSMIC (Shepherd et al. 2011), OMIM (Hamosh et al. 2005), TCGA (Kandoth et al. 2013)) are constantly evolving, we created a batch script that executes monthly to keep data sources up to date with every release of new information. The visual analytics module is inherited from CPAP and is based on the architecture of Circos, with interactive filters to visualize the distribution of variants across multiple datasets. In Vanno, we introduce a new approach that displays the impact of different filter settings on the Circos plot in real time, removing the need to regenerate the whole plot. This is particularly useful in the exploratory stage for grasping the characteristics of a dataset, and the same settings can be applied in the subsequent discovery stage to prioritize candidate targets of interest. New features, such as an interactive heatmap and a protein domain diagram, are included in Vanno. Thus, a mutation spectrum can be further classified by gene symbol, cytogenetic band, and pathway, or mapped onto individual protein domains and the structural and functional units of proteins, which is critical to the accurate assessment of the impact of the mutations. Figure 1 shows the analysis workflow of Vanno.

Data input

Vanno accepts inputs in the standard variant call format (VCF) generated by GATK (McKenna et al. 2010) or VarScan (Koboldt et al. 2012), and in the diverse formats generated by Torrent Variant Caller, MiSeq Reporter, the QIAGEN DNAseq Sequence Variant Analysis package, Agilent SureCall, and various variant calling packages from targeted sequencing panel providers. Variant calling files can be delivered to the Vanno server by drag and drop; the only requirement is that the user select the correct targeted gene panel and the variant calling package that generated the file. After the uploaded variant calling files are received, a custom script converts the different formats into a standard format containing information such as sample name, chromosome, position, reference allele, variant allele, variant frequency, and coverage, which is subsequently stored in an SQLite database. A job identifier based on a timestamp is returned to the user for retrieving the finished job.
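The conversion step described above can be illustrated with a minimal Python sketch. It is not Vanno's actual script: the table layout, the function names `parse_vcf_line` and `load_variants`, and the assumption of single-sample, biallelic VCF lines carrying GATK-style `AD` and `DP` fields are all hypothetical choices made for illustration.

```python
import sqlite3

def parse_vcf_line(line, sample_name):
    """Convert one single-sample, biallelic VCF data line into a normalized record."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt = fields[:5]
    # Column 9 names the per-sample keys, column 10 holds their values.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    # Assumption: AD is "ref_depth,alt_depth" and DP is the total read depth.
    ref_depth, alt_depth = (int(x) for x in sample["AD"].split(",")[:2])
    coverage = int(sample.get("DP", ref_depth + alt_depth))
    frequency = alt_depth / coverage if coverage else 0.0
    return (sample_name, chrom, int(pos), ref, alt, frequency, coverage)

def load_variants(conn, vcf_lines, sample_name):
    """Store normalized records in a hypothetical `variant` table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variant ("
        "sample TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, "
        "frequency REAL, coverage INTEGER)"
    )
    rows = [
        parse_vcf_line(line, sample_name)
        for line in vcf_lines
        if line.strip() and not line.startswith("#")  # skip header and blank lines
    ]
    conn.executemany("INSERT INTO variant VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Formats produced by other callers would need their own parser that returns the same seven-field tuple, so that everything downstream reads a single table regardless of the upstream tool.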

Variant Annotation

To identify nucleic acid-based variants or perturbations that are associated with disease, and to exclude common germline variants found in specific populations from the variant lists, identified variants are compared with data annotated in COSMIC, ClinVar (Landrum et al. 2013), dbSNP, the 1000 Genomes Project, and the NHLBI GO Exome Sequencing Project. The molecular consequences of amino acid changes, INDELs, or frame-shift mutations are examined by ANNOVAR, and the functional impact prediction results from SIFT, PolyPhen, LRT, MutationTaster, MutationAssessor, and FATHMM are extracted from dbNSFP (Ng and Henikoff 2003; Schwarz et al. 2010; Liu et al. 2013; Shihab et al. 2013), a functional prediction database of human non-synonymous SNPs. Information on gene ontology terms, biological pathways, protein domains, protein structures, and interaction networks is also annotated by consulting the Gene Ontology database (Camon et al. 2004), Pfam (Finn et al. 2014), PDB (2000), and ConsensusPathDB (Kamburov et al. 2013).

The NCI Cancer Genome Atlas (TCGA) project (http://cancergenome.nih.gov) identified somatic variants across thousands of tumors using the latest sequencing and analysis methods, providing the most uniform and comprehensive catalog of genomic aberrations in cancer to date. The full set of annotated somatic mutation data from the TCGA project was retrieved from Memorial Sloan-Kettering Cancer Center through the Cancer Genomic Data Server (CGDS) web service interface (http://www.cbioportal.org) (Cerami et al. 2012). We provide a portal for comparing mutations identified in the user-uploaded data with the mutational spectrum of all available cancer types deposited in TCGA. Mutations can be inspected not only on individual genes but also on protein domains, helping users to uncover potential disruptions of protein function that underlie cancer development.

For subsequent confirmation of mutation targets at both the genetic and protein levels, mRNA sequences and protein sequences of human protein-coding genes can be retrieved from the NCBI Reference Sequence Database (RefSeq) and UniProtKB, respectively. The altered cDNA/mRNA sequences and their respective mutant protein sequences can be generated according to the genetic variations identified by targeted sequencing, facilitating mutation confirmation by Sanger sequencing and mass spectrometry. This process provides a fully integrated account of DNA, RNA, and protein abnormalities in individual samples.
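The comparison of identified variants against annotation sources can be pictured as a keyed lookup followed by a population-frequency filter. The sketch below is a simplified illustration under stated assumptions, not the ANNOVAR-based procedure itself: each source is modeled as an in-memory dictionary keyed by (chromosome, position, reference allele, variant allele), and the field name `population_af` and the 1% cutoff are hypothetical.

```python
def annotate(variants, sources):
    """Attach the matching entry from every source to each variant.

    variants: list of dicts with "chrom", "pos", "ref" and "alt" keys.
    sources:  {annotation_name: {(chrom, pos, ref, alt): value}}.
    """
    annotated = []
    for variant in variants:
        key = (variant["chrom"], variant["pos"], variant["ref"], variant["alt"])
        record = dict(variant)
        for name, table in sources.items():
            record[name] = table.get(key)  # None when the source has no entry
        annotated.append(record)
    return annotated

def drop_common_polymorphisms(annotated, freq_field="population_af", max_af=0.01):
    """Keep variants that are absent from, or rare in, the population source.

    The 1% default is an illustrative threshold, not a value taken from Vanno.
    """
    return [r for r in annotated if (r.get(freq_field) or 0.0) <= max_af]
```

Keeping the lookup separate from the filter means the full annotation table stays available for display even when a frequency cutoff is applied to the working variant list.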

Data visualization

Variants of genes and amplicons from a targeted sequencing panel carry information such as functional effects, population-related phenotypes, cancer- and disease-associated annotations, and metabolic pathways. However, efficient interrogation of such complex and multi-dimensional data is a difficult task. To address this issue, we use a circularized heatmap to visualize the most commonly used variant information of genes and amplicons from a user-uploaded panel in a single Circos plot. This concentric heatmap not only provides a global view of the identified variants but also offers sample-wise comparison. Moreover, Vanno updates high-quality plots immediately after conditional filters are applied or viewing criteria are altered, reducing the regeneration time from several minutes with the original version of Circos to only a few seconds. To achieve this, the Scalable Vector Graphics (SVG) image format is used throughout Vanno, so that images can be reconfigured after plotting without regenerating the whole image. For all supported panels, we pre-draw an empty Circos plot and use server-side PHP programs coupled with client-side scripts to fill the heatmap cell colors of the Circos plot as data filtering procedures are applied.

Protein domains are conserved regions of the amino acid sequences encoded by genes and are associated with biological functions. Understanding the distribution of mutation sites within protein domains provides useful information for discovering gene function and disease association. The domain-guided plotting scheme is inspired by cBioPortal (http://www.cbioportal.org), but we implement an additional function that allows side-by-side comparison of user-uploaded data and mutation sites categorized in the TCGA.

Protein three-dimensional (3D) structure provides the most informative connection to biological function. For example, amino acid mutations located in an enzyme catalytic site might cause loss of catalytic activity and ultimately lead to tumorigenesis. To emphasize coordinate-associated variant sites, Vanno integrates all available 3D structures from the Protein Data Bank (PDB) (2000) and renders mutated amino acids in a space-fill style using JSmol (http://www.jmol.org/). Vanno can also export the 3D structure as a high-quality downloadable image to aid the generation of formal reports.
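The idea of filling colors into a pre-drawn plot, rather than redrawing it, can be sketched as follows. Vanno does this with PHP and client-side scripts; the Python version below only illustrates the principle, and the cell `id` naming, the white-to-red gradient, and the function names are assumptions rather than details taken from the tool.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

def count_to_color(count, max_count):
    """Map a variant count onto a white-to-red gradient, returned as a hex color."""
    if max_count <= 0 or count <= 0:
        return "#ffffff"
    level = min(count, max_count) / max_count
    fade = round(255 * (1 - level))  # green and blue channels fade out as counts rise
    return "#ff{0:02x}{0:02x}".format(fade)

def recolor(svg_text, counts):
    """Set the fill of every element whose id appears in `counts`.

    The geometry of the pre-drawn plot is left untouched; only fill
    attributes change, which is why no re-rendering of the layout is needed.
    Cells emptied by a filter should be passed with a count of 0.
    """
    ET.register_namespace("", SVG_NS)  # keep the default SVG namespace on output
    root = ET.fromstring(svg_text)
    max_count = max(counts.values(), default=0)
    for element in root.iter():
        cell_id = element.get("id")
        if cell_id in counts:
            element.set("fill", count_to_color(counts[cell_id], max_count))
    return ET.tostring(root, encoding="unicode")
```

Because only attributes are rewritten, the cost of an update scales with the number of heatmap cells rather than with the complexity of the plot layout.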

Data management

Vanno provides two cross-sample comparison modules: amplicon view and gene view. The alterations identified in multiple data sets are aggregated at the amplicon level and the gene level, respectively, and plotted onto Circos ideograms according to their amplicon or gene targets on specific chromosomes, with a color gradient based on the variant counts. Detailed variant information, such as gene symbol, amino acid mutation, and population frequencies from the 1000 Genomes Project, is displayed in a pop-up window triggered by mouse-over events on the Circos plot. For group-wise comparison, Vanno takes at most six finished job identifiers as input and uses different colors to render samples from distinct jobs in a single Circos plot or heatmap, removing the need to resubmit jobs. In addition, the variant sites located at chromosomal positions are plotted, together with gene annotations, in the UCSC genome browser (Karolchik et al. 2007). With the help of this genome browser, associations between variant sites and genes can be found, which provides valuable information for further experimental analysis.

The TCGA mutation spectrum from various cancer types was retrieved from the cBioPortal (Cerami et al. 2012). With the abundant mutation information collected from the cBioPortal, users gain a more global view of mutation sites and their downstream associations, including biological function and disease types, than can be obtained by analyzing a few samples from a single panel alone. Vanno provides TCGA comparison for the most frequently used visualization modules, including the panel-summarized dynamic heatmap, the protein domain plot, and the gene summary page. This comparison feature can be activated by turning on the “compare with TCGA” option at the bottom left-hand side of the panel on the web site.

We use a dynamic heatmap modified from jHeatmap (Deu-Pons et al. 2014) to meet the need for large-scale data visualization, such as sample-wise or group-wise comparison. Variant counts of target genes or amplicons are represented by color gradients. Multiple ways of sorting are implemented, including value aggregation and mutual exclusiveness. Annotations such as gene symbol, mutation frequencies, pathway, and protein domain can also be displayed and used as sorting parameters. Clicking a heatmap cell opens a pop-up window with detailed mutation information.
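Sorting by mutual exclusiveness can be approximated with a simple two-step ordering that is widely used for alteration matrices: rank genes by how many samples they are altered in, then order samples by their alteration pattern read from the top gene downward. The sketch below shows that generic heuristic; it is an assumption made for illustration and is not claimed to be the algorithm implemented in jHeatmap or Vanno.

```python
def sort_mutually_exclusive(matrix):
    """Order genes and samples so mutually exclusive alterations form a staircase.

    matrix: {gene: {sample: variant_count}}; missing samples count as unaltered.
    Returns (gene_order, sample_order).
    """
    samples = sorted({s for row in matrix.values() for s in row})

    def altered(gene, sample):
        return matrix[gene].get(sample, 0) > 0

    # Most frequently altered genes first; ties broken by name for determinism.
    genes = sorted(matrix, key=lambda g: (-sum(altered(g, s) for s in samples), g))
    # Compare samples gene by gene; False sorts before True, so samples altered
    # in a higher-ranked gene come first.
    sample_order = sorted(
        samples, key=lambda s: tuple(not altered(g, s) for g in genes)
    )
    return genes, sample_order
```

With this ordering, genes whose alterations rarely co-occur in the same sample appear as non-overlapping blocks along the sample axis, which is the visual pattern used to spot candidate driver genes in a shared pathway.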

RESULTS AND DISCUSSION

Examples of Use

As a proof-of-principle experiment, we applied Vanno to targeted sequencing data from 11 cancer cell lines, which can be classified into three tissue types (colon, oral, and lung). Amplicons were generated from the genomic DNA of these cell lines following the Ion AmpliSeq Cancer Panel v2 protocol, which targets mutation hotspot regions of 50 oncogenes and tumor suppressor genes, and were subsequently sequenced on the Life Technologies Ion PGM sequencer using a single Ion 318 chip. Detailed information on these reference cell lines can be found on the demonstration page of Vanno (http://cgts.cgu.edu.tw/vanno/demo.php).

Summary of Output

After a job is successfully submitted, progress indicators are displayed for monitoring the processing status, and a job identifier is provided for retrieving the finished job. Figure 2 shows the standard output page of a Vanno run on the demonstration data sets. The output page is made up of three major blocks.

The first block, located in the upper-left corner, shows the summary of input data, including the number of processed samples, identified genetic mutations, altered genes, records of execution time, and versions of annotation resources (Figure 2A). To provide a global view of the alteration frequency and composition of each gene in a cohort study, each mutation type is rendered as a concentric histogram on a Circos map, with the height illustrating the frequencies of altered samples of relevant genes (Figure 3). The concept of mutually exclusive alteration patterns has been exploited to distinguish driver mutations from passenger mutations (Ciriello et al. 2012; Vandin et al. 2012). Because this feature helps identify recurrent driver mutations in cancer, we also incorporated a new heatmap feature in Vanno with an option to sort genomic alterations by mutually exclusive patterns across multiple samples; this tool facilitates assessment of driver genes that are functionally linked in a common pathway or in the same biological process. The full annotation result is downloadable in MS Excel or tab-delimited file format. The altered cDNA/mRNA sequences and the relevant mutant protein sequences can also be retrieved for further experimental validation using Sanger sequencing and mass spectrometry.

The second block, located in the lower-left corner, comprises a cascade of filters (Figure 2B), which permits filtering based on sequencing coverage, variant allele frequency, or consequence type. Vanno has incorporated several gene-based annotation resources in this release, such as ConsensusPathDB, the OMIM database, and the InterPro domain database. Thus, functional classification can be performed by collapsing genetic alterations at the gene level and then aggregating sets of mutated genes by biological pathway, disease, or protein domain, expediting identification of variants of interest.

The third block (Figure 2D) summarizes genetic alterations according to their gene targets and consequence types, which are rendered as charts, tables, and Circos plots using the same visual analytics concept as our previous implementation (Huang et al. 2013). We redesigned the infrastructure of the web application: Vanno's interactive filter can now connect seamlessly to dynamic charts, tables, Circos plots, and heatmaps, which makes it possible to visualize the impact of different filtering criteria in real time and eliminates the need to regenerate each relevant component. The occurrence of variants is summarized in a pie chart or histogram by items such as chromosome, sample name, gene symbol, and protein domain. A Circos plot is used for identifying recurrent mutations across samples. The Circos plot consists of two layers. Amplicon identifiers and gene symbols are rendered as the outer layer of the Circos plot according to their respective chromosomes. The inner layer is composed of a series of concentric heatmap tracks, sequentially representing samples from the outer to inner tracks, with a color gradient based on the variant counts. Vanno provides two different Circos plots to aggregate the identified variants at the amplicon level (Amplicon view) and gene level (Gene view). Pop-up windows containing basic variant information are displayed as mouse-over effects on the Circos plot for a quick view of alteration events in terms of the amplicon identifier, gene symbol, and sample name. Furthermore, hyperlinks are embedded in the Circos plot as multiple entry points (e.g., chromosome, gene symbol, amplicon identifier, sample name) to small subsets of the full annotation table. The full annotation table (Table view) is divided into 10 categories, which are described in detail on the tutorial page (http://cgts.cgu.edu.tw/vanno/Tutorial/index.php#Table). New categories in Vanno include protein domain, ClinVar, disease description, GO, pathway, and cytogenetic band, with the purpose of maximizing information content.

An additional block colored in red is shown in the upper-right corner when the “compare with TCGA” option is switched on (Figure 2 A, C). The user can directly compare the identified genetic alterations with the mutation spectrum deposited in the TCGA by selecting the specific cancer type of interest.
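The filter cascade and the gene-level collapsing described in this section can be sketched in a few lines of Python. The record fields, default thresholds, and function names below are hypothetical and chosen only to illustrate the logic; they are not taken from Vanno's implementation.

```python
from collections import defaultdict

def apply_filters(variants, min_coverage=0, min_frequency=0.0, consequences=None):
    """Keep variants that pass every active filter in the cascade.

    variants: list of dicts with "sample", "gene", "coverage", "frequency"
    and "consequence" keys. `consequences` is an optional set of allowed types.
    """
    kept = []
    for v in variants:
        if v["coverage"] < min_coverage:
            continue
        if v["frequency"] < min_frequency:
            continue
        if consequences is not None and v["consequence"] not in consequences:
            continue
        kept.append(v)
    return kept

def samples_per_gene(variants):
    """Collapse variants to the gene level: number of distinct altered samples."""
    by_gene = defaultdict(set)
    for v in variants:
        by_gene[v["gene"]].add(v["sample"])
    return {gene: len(samples) for gene, samples in by_gene.items()}

def mutated_genes_by_pathway(variants, pathways):
    """Group mutated genes by pathway; pathways without a mutated gene are omitted.

    pathways: {pathway_name: set_of_gene_symbols}.
    """
    mutated = {v["gene"] for v in variants}
    grouped = {name: sorted(genes & mutated) for name, genes in pathways.items()}
    return {name: genes for name, genes in grouped.items() if genes}
```

Because each step takes and returns plain variant records, the same filtered list can feed charts, tables, and heatmaps alike, which mirrors the way a single filter setting drives all views on the output page.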

Data comparison

Cross-sample comparison has long been known to be a computationally intensive task. With Circos, we provided an interactive concentric heatmap to represent cross-sample mutation frequencies according to gene targets, with a group-wise comparison solution that uses distinct colors to distinguish the experimental group from the control group (Huang et al. 2013). However, the time-consuming nature of generating high-resolution Circos plots becomes a limiting factor when exploring complex and large data sets. Therefore, we incorporate jHeatmap as a new feature for visualizing and analyzing multidimensional datasets from cancer study cohorts. It allows datasets of any size to be explored and delivers multidimensional information (e.g., gene symbol, sample name, mutation type) through each cell of the heatmap matrix. As shown in Supp. Figure S1, the heatmap can be colored, filtered, and sorted based on the values in the cells. Rows and columns of the heatmap can be clustered or sorted according to the encoded annotations, such as clinical features of samples or molecular function of genes, thus expanding the flexibility to visually explore patterns within the alterations.

Because target enrichment is a time- and cost-effective way to detect disease-causing variants, considerable effort has been devoted to developing disease-targeted tests, ranging from single genes to several hundred genes. A series of genetic tests have been designed and launched by clinical laboratories or genetic testing companies, targeting similar causative gene regions with diverse target-enrichment strategies such as PCR or hybridization. However, differences among these approaches can introduce bias in sensitivity, specificity, and uniformity across target regions. When assessing applicability for a particular project, it may thus be necessary to test a set of genes on the same specimens using different commercial products. To address this issue, cross-panel comparison is a newly implemented comparative module of our analysis pipeline. As shown in Supp. Figure S2, genetic variants identified by different gene panel tests (Ion AmpliSeq™ Cancer HotSpot Panel v1 vs. Ion AmpliSeq™ Cancer HotSpot Panel v2) using the same cell lines are displayed in the UCSC Genome Browser as custom tracks. Gene regions targeted by different testing panels are labeled with distinct colors alongside annotation tracks from dbSNP, COSMIC, HapMap, OMIM, and RefSeq, facilitating evaluation of the sensitivity, specificity, and reproducibility between testing panels.
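Displaying the variants of each panel as a separately colored custom track can be sketched with the BED format accepted by the UCSC Genome Browser. The track names, colors, and the function name below are illustrative assumptions; only the general shape of a `track` line followed by tab-separated BED records reflects the browser's documented input format.

```python
def bed_track(name, color, variants):
    """Render variants as a UCSC custom track in BED format.

    name:     track label (hypothetical, e.g. one label per gene panel).
    color:    (red, green, blue) tuple used to distinguish panels.
    variants: iterable of (chrom, pos, ref, alt) with 1-based positions.
    """
    header = 'track name="{0}" description="{0} variants" color={1}'.format(
        name, ",".join(str(channel) for channel in color)
    )
    lines = [header]
    for chrom, pos, ref, alt in sorted(variants):
        start = pos - 1          # BED starts are 0-based, VCF-style positions are 1-based
        end = start + len(ref)   # BED ends are exclusive
        lines.append("{}\t{}\t{}\t{}>{}".format(chrom, start, end, ref, alt))
    return "\n".join(lines)
```

Concatenating one such block per panel yields a single upload in which each panel appears as its own colored track next to the browser's built-in annotation tracks.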

TCGA integration

As mentioned above, the TCGA Pan-Cancer efforts analyzed over 5,000 cases from 12 cancer types, providing a strong foundation for understanding the key genetic alterations in cancer. To facilitate access to these complex datasets and the exploration of multidimensional cancer genomics data, the cBioPortal for Cancer Genomics was designed (Cerami et al. 2012). The portal allows users to visualize gene alteration patterns across samples in a specific cancer study and to compare gene mutation frequencies across multiple cancer types. However, functionality for cross-referencing with users' own data is not available in this tool. As a complement to the cBioPortal, Vanno provides a portal to facilitate integrative analysis of users' targeted sequencing data and TCGA data. Through this function, users can inspect patterns of gene alterations across samples in a study cohort or compare signatures of gene alteration frequencies with TCGA tumor types using interactive heatmaps.

Supp. Figure S3A shows alteration events for a cohort of TCGA colon adenocarcinoma patients. Three highly mutated genes with well-known relevance to colon cancer, APC, TP53, and KRAS (Er et al. 2014), can be readily identified using the mutually exclusive sorting function provided in the interactive heatmap. Because previous studies have stated that mutually exclusive alteration of genes in a pathway is a characteristic of cancer drivers (Ciriello et al. 2012), this function of Vanno facilitates the exploration of driver genes involved in a specific pathway. After a significant set of genes has been identified, a detailed mutational landscape can be inspected at the gene level. A comparison table summarizing the composition of mutant amino acids in the study cohort and in a specific TCGA cancer type is shown in Supp. Figure S3B.

Furthermore, Vanno supports visualization of gene mutations in the context of protein domains and tertiary structures. The mutational landscapes from in-house data and the TCGA can be rendered side by side on specific protein domains. This feature greatly simplifies comparison while still providing an efficient way to gain insight into the functional context of the mutation. The domain visualization module can also be applied to clarify molecular characteristics of mutant genes. For example, KRAS mutations in codons 12 and 13 are recognized as predictors of non-responsiveness to anti-EGFR therapies in metastatic colorectal cancer. Detailed molecular analysis of patients in this regard could be of significant clinical value and could enhance personalized treatment.
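A comparison table of mutant amino acids between a study cohort and a reference cohort can be built by computing, for each amino acid change, the fraction of samples that carry it in each group. The sketch below illustrates that calculation only; the input layout and function names are hypothetical, and retrieving the reference calls from cBioPortal is outside its scope.

```python
from collections import Counter

def change_fractions(calls, n_samples):
    """Fraction of samples carrying each amino acid change.

    calls: iterable of (sample_id, aa_change) pairs; repeated pairs count once.
    n_samples: total number of samples in the cohort, including unmutated ones.
    """
    carriers = Counter(change for _sample, change in set(calls))
    return {change: n / n_samples for change, n in carriers.items()}

def comparison_table(cohort, n_cohort, reference, n_reference):
    """Rows of (aa_change, cohort_fraction, reference_fraction), sorted by change."""
    cohort_frac = change_fractions(cohort, n_cohort)
    reference_frac = change_fractions(reference, n_reference)
    changes = sorted(set(cohort_frac) | set(reference_frac))
    return [
        (change, cohort_frac.get(change, 0.0), reference_frac.get(change, 0.0))
        for change in changes
    ]
```

Reporting fractions rather than raw counts keeps the two columns comparable when the in-house cohort is much smaller than the reference cohort.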

Integration of genomics and proteomics

Specific mutations of cancer-associated genes are used as DNA biomarkers for diagnosis and as molecular markers for therapeutic drug selection in clinical settings. In our previous study, we provided the functionality to retrieve the nucleotide sequences that span the alteration sites for convenient PCR primer design, which is a crucial step for subsequent experimental validation by Sanger sequencing. With the continuously decreasing cost of next-generation sequencing and advances in mass spectrometry-based proteomic technology, post-genomic research is now carried out across multiple layers that link sample-specific variations in DNA, RNA, and proteins to help understand cancer biology and reveal new therapeutic possibilities. To facilitate such efforts, Vanno converts genomic alterations identified from targeted sequencing data into mutated protein entries in FASTA format, according to their functional consequences in transcripts. Aberrant proteins resulting from non-synonymous coding variants and small INDELs are predicted and output as a FASTA file for generating peptide sequence tags as database entries. Subsequent mass spectrometry-based studies may then rely on such information to identify disease-associated peptides and proteins.
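For the simplest case, a single amino acid substitution, producing a mutated protein entry in FASTA format can be sketched as below. This is an illustration under stated assumptions, not Vanno's implementation: it handles only one-letter missense notation such as G12D, it does not cover INDELs or frame shifts, and the header layout is hypothetical.

```python
import re

def apply_missense(protein, change):
    """Return the protein sequence with one substitution such as "G12D" applied."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", change)
    if match is None:
        raise ValueError("unsupported change notation: " + change)
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    # Check the reference residue so that a wrong transcript is caught early.
    if pos < 1 or pos > len(protein) or protein[pos - 1] != ref:
        raise ValueError("reference residue does not match the sequence: " + change)
    return protein[:pos - 1] + alt + protein[pos:]

def to_fasta(identifier, sequence, width=60):
    """Format one FASTA record with fixed-width sequence lines."""
    body = "\n".join(sequence[i:i + width] for i in range(0, len(sequence), width))
    return ">{}\n{}\n".format(identifier, body)
```

A record such as `to_fasta("example|G12D", apply_missense(sequence, "G12D"))` could then be appended to a search database so that spectra from the mutant peptide have a matching entry.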

Feature comparisons

Vanno is a substantial improvement over CPAP (Huang et al. 2013), integrating our existing workflows with the pathogenic variant database from ClinVar and the cancer mutation profiles from TCGA. Vanno supports all major variant calling formats generated by public packages (e.g., GATK, VarScan) and commercial tools (e.g., Torrent Variant Caller, MiSeq Reporter, Agilent SureCall). Table 1 compares Vanno with other variant annotation packages.

Benchmarking The Vanno web server runs Apache 2.2.22 on a CentOS 6.2 machine housing two Intel Xeon E7540 2.0 GHz processors and 8 GB RAM. To test the utility of this pipeline, five hundred variant calling files were randomly selected from sequenced samples generated by the Chang Gung NGS core laboratory. Performance results for Vanno are available at http://cgts.cgu.edu.tw/vanno/Tutorial/#benchmark.

CONCLUSIONS We provide a comprehensive variant annotation tool, Vanno, for the visualization and analysis of genetic alteration profiles. Vanno provides a highly integrated framework for the functional analysis of genomic variants at both the gene level and protein domain level. Vanno is also equipped with a portal for comparing in-house data with those of TCGA to support comprehensive identification of disease-relevant variation. Important efforts have been made with Vanno to develop a universal interface with intuitive visualization options for integrative analysis of cancer genomic data. With these features, Vanno represents a powerful tool that aids researchers in translating cancer genomic data into biological insights and potential clinical applications.

FUNDING This work was supported by grants from the Chang Gung Memorial Hospital (CMRPD190603 to PT); Ministry of Education, Taiwan (EMRPD1D0841 to PJH & EMRPD1D0761 to BT & EMRPD1D0671 to PT) and Ministry of Science and Technology, Taiwan (MOST 103-2632-B-182-001).

ACKNOWLEDGMENTS We would like to thank Dr. Shu-Jen Chen for critical review of this manuscript.


REFERENCES Asmann YW, Middha S, Hossain A, Baheti S, Li Y, Chai H-S, Sun Z, Duffy PH, Hadad AA, Nair A, Liu X, Zhang Y, et al. 2012. TREAT: a bioinformatics tool for variant annotations and visualizations in targeted and exome sequencing data. Bioinformatics 28: 277–278.

Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J. 2011. Exome sequencing as a tool for Mendelian disease gene discovery. Nat. Rev. Genet. 12: 745–755.

Beck J, Pittman A, Adamson G, Campbell T, Kenny J, Houlden H, Rohrer JD, de Silva R, Shoai M, Uphill J, Poulter M, Hardy J, et al. 2014. Validation of next-generation sequencing technologies in genetic diagnosis of dementia. Neurobiol. Aging 35: 261–265.

Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R. 2004. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. 32: D262–6.

Cancer Genome Atlas Network. 2012. Comprehensive molecular characterization of human colon and rectal cancer. Nature 487: 330–337.

Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, Jacobsen A, Byrne CJ, Heuer ML, Larsson E, Antipin Y, Reva B, et al. 2012. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov 2: 401–404.

Ciriello G, Cerami E, Sander C, Schultz N. 2012. Mutual exclusivity analysis identifies oncogenic network modules. Genome Research 22: 398–406.

Costa JL, Sousa S, Justino A, Kay T, Fernandes S, Cirnes L, Schmitt F, Machado JC. 2013. Nonoptical massive parallel DNA sequencing of BRCA1 and BRCA2 genes in a diagnostic setting. Hum. Mutat. 34: 629–635.

Deu-Pons J, Schroeder MP, Lopez-Bigas N. 2014. jHeatmap: an interactive heatmap viewer for the web. Bioinformatics btu094.

Douville C, Carter H, Kim R, Niknafs N, Diekhans M, Stenson PD, Cooper DN, Ryan M, Karchin R. 2013. CRAVAT: cancer-related analysis of variants toolkit. Bioinformatics 29: 647–648.

Er T-K, Chen C-C, Bujanda L, Herreros-Villanueva M. 2014. Cancer Letters. Cancer Letters 343: 1–5.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, Sonnhammer ELL, Tate J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res. 42: D222–30.

Gentzsch W. 2001. Sun Grid Engine: towards creating a compute power grid. IEEE. 35–36.

Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA. 2005. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33: D514–7.

Huang P-J, Yeh Y-M, Gan R-C, Lee C-C, Chen T-W, Lee C-Y, Liu H, Chen S-J, Tang P. 2013. CPAP: Cancer Panel Analysis Pipeline. Hum. Mutat. 34: 1340–1346.

Kamburov A, Stelzl U, Lehrach H, Herwig R. 2013. The ConsensusPathDB interaction database: 2013 update. Nucleic Acids Res. 41: D793–800.

Kandoth C, McLellan MD, Vandin F, Ye K, Niu B, Lu C, Xie M, Zhang Q, McMichael JF, Wyczalkowski MA, Leiserson MDM, Miller CA, et al. 2013. Mutational landscape and significance across 12 major cancer types. Nature 502: 333–339.

Karolchik D, Bejerano G, Hinrichs AS, Kuhn RM, Miller W, Rosenbloom KR, Zweig AS, Haussler D, Kent WJ. 2007. Comparative genomic analysis using the UCSC genome browser. Methods Mol. Biol. 395: 17–34.

Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, Wilson RK. 2012. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research 22: 568–576.

Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, Maglott DR. 2013. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 42: D980–D985.

Liu X, Jian X, Boerwinkle E. 2013. dbNSFP v2.0: a database of human non-synonymous SNVs and their functional predictions and annotations. Hum. Mutat. 34: E2393–402.


McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. 2010. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 20: 1297–1303.

Ng PC, Henikoff S. 2003. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res.

Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM, Huff CD, Shannon PT, Jabs EW, Nickerson DA, Shendure J, Bamshad MJ. 2010. Exome sequencing identifies the cause of a mendelian disorder. Nat. Genet. 42: 30–35.

Nikiforova MN, Wald AI, Roy S, Durso MB, Nikiforov YE. 2013. Targeted next-generation sequencing panel (ThyroSeq) for detection of mutations in thyroid cancer. J. Clin. Endocrinol. Metab. 98: E1852–60.

San Lucas FA, Wang G, Scheet P, Peng B. 2012. Integrated annotation and analysis of genetic variants from next-generation sequencing studies with variant tools. Bioinformatics 28: 421–422.

Schwarz JM, Rödelsperger C, Schuelke M, Seelow D. 2010. MutationTaster evaluates disease-causing potential of sequence alterations. Nat Meth 7: 575–576.


Shepherd R, Forbes SA, Beare D, Bamford S, Cole CG, Ward S, Bindal N, Gunasekaran P, Jia M, Kok CY, Leung K, Menzies A, et al. 2011. Data mining using the Catalogue of Somatic Mutations in Cancer BioMart. Database 2011: bar018–bar018.

Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. 2001. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29: 308–311.

Shihab HA, Gough J, Cooper DN, Stenson PD, Barker GLA, Edwards KJ, Day INM, Gaunt TR. 2013. Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models. Hum. Mutat. 34: 57–65.

Vandin F, Upfal E, Raphael BJ. 2012. De novo discovery of mutated driver pathways in cancer. Genome Research 22: 375–385.

Wang K, Li M, Hakonarson H. 2010. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38: e164.

Yang Y, Muzny DM, Reid JG, Bainbridge MN, Willis A, Ward PA, Braxton A, Beuten J, Xia F, Niu Z, Hardison M, Person R, et al. 2013. Clinical whole-exome sequencing for the diagnosis of mendelian disorders. N. Engl. J. Med. 369: 1502–1511.

Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. 2000. The Protein Data Bank. Nucleic Acids Res. 28: 235–242.


Figure 1. Analysis workflow of Vanno. Next-generation sequencing instrument providers such as MiSeq and IonPGM have built-in analysis packages with sequencing machines for sequence alignment and variant calling. Gene testing panel providers such as QIAGEN and Agilent also have relevant variant calling packages optimized for their own products. However, the inconsistent file formats generated by different variant calling packages have made subsequent annotation and comparison difficult. Vanno provides a consistent and user-friendly interface to resolve this issue. The user is only required to select the correct gene panel and corresponding variant calling package before uploading variant lists to the Vanno server. Variant lists are subsequently compared to a locally stored integrated database, precompiled from the annotation sources listed in the figure, to identify all plausible genetic perturbations and their corresponding molecular consequences. For prioritizing candidate targets of interest, data management modules such as sorting, filtering and comparison are closely connected with data visualization modules such as Circos, pie chart, bar chart, heatmap, protein domain topography and protein 3D structure.



Figure 2. Output of Vanno. (A) The numbers of samples, identified mutations, and altered genes are summarized in the upper-left block of the output page. The alteration frequency and mutant composition of each gene in the cohort are summarized in a concentric histogram, with the height determined by the frequency of altered samples. An interactive heatmap with mutually exclusive sorting functionality is also provided for distinguishing driver mutations from passenger mutations. The download links for the detailed annotation table and the related mutant cDNA and protein sequences are provided in this block. (B) A cascade of filters based on items such as chromosome, gene symbol, coverage, mutation frequency and molecular consequence is provided for selecting variants of interest. The identified variants are collapsed at the gene level and subsequently aggregated by functional classifications such as pathway, disease and protein domain, providing an alternative way to identify variants of interest. (C) After turning on the “compare with TCGA” switch located in the upper-left block, an additional block colored in red is displayed at the upper-right corner, providing a unique feature for comparing an identified mutation with the mutation spectrum of a specific cancer type deposited in TCGA. (D) The distribution of variants is summarized in dynamic pie charts based on items such as chromosome, gene symbol, mutation type, sample name, and protein domain. Alteration events are summarized at the gene level and amplicon level, and rendered as Circos ideograms. Basic variant information and a mutation spectrum on the relevant protein domain are displayed in pop-up windows when hovering the mouse cursor over the Circos plots. Protein tertiary structure with the mutation site highlighted can also be displayed using JSMol when the protein structure is available. The full content of the annotation table can be found on our tutorial page (http://cgts.cgu.edu.tw/vanno/Tutorial/index.php#Table).
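The mutually exclusive sorting mentioned for the heatmap can be sketched as follows. This is one common ordering heuristic and is assumed here for illustration; the caption does not specify the algorithm Vanno uses, and the function and variable names are hypothetical. Genes are ranked by the number of altered samples, then samples are ordered by their alteration pattern across the ranked genes, which pushes non-overlapping alterations into a staircase layout.

```python
from __future__ import annotations


def mutual_exclusive_sort(
    altered: dict[str, set[str]], samples: list[str]
) -> tuple[list[str], list[str]]:
    """Order genes and samples so exclusive alterations form a staircase.

    altered maps each gene to the set of samples carrying an alteration.
    """
    # Most frequently altered genes first; gene name breaks ties.
    genes = sorted(altered, key=lambda g: (-len(altered[g]), g))

    def pattern(sample: str) -> tuple[int, ...]:
        # 0 sorts before 1, so altered samples come first for each gene.
        return tuple(0 if sample in altered[g] else 1 for g in genes)

    # sorted() is stable, so samples with equal patterns keep input order.
    return genes, sorted(samples, key=pattern)


if __name__ == "__main__":
    calls = {"GENE_A": {"s1", "s3"}, "GENE_B": {"s2"}}
    gene_order, sample_order = mutual_exclusive_sort(
        calls, ["s1", "s2", "s3", "s4"]
    )
    print(gene_order, sample_order)
```

With this ordering, samples altered in the most frequent gene appear first, followed by samples altered only in the next gene, and unaltered samples last.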


Figure 3. Population frequencies for mutations in each gene. A compact Circos plot provides a simultaneous exploratory view of all alteration events in each gene as well as their corresponding population frequencies in tumor samples. Gene mutation frequencies for the different alteration types in a given set of samples are organized in layered circles. The outermost circle represents all mutation events (All), with the height determined by the frequency of altered samples, followed by circles for silent mutations (Si), missense mutations (Ms), nonsense mutations (Ns) and INDELs (ID).
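The layered frequencies described in this legend could be computed along the following lines. This is a hedged sketch, not Vanno's code: the input layout (one `(sample, gene, class)` tuple per call), the function name, and the definition of frequency as the fraction of samples with at least one alteration of the given class are assumptions.

```python
from collections import defaultdict

# Layer labels as used in the figure legend.
LAYERS = ("All", "Si", "Ms", "Ns", "ID")


def layer_frequencies(calls, n_samples):
    """Return {gene: {layer: fraction of samples altered}}.

    calls is an iterable of (sample, gene, cls) tuples, where cls is one
    of "Si", "Ms", "Ns" or "ID". A sample is counted once per gene and
    layer, no matter how many calls it contributes.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    altered = defaultdict(lambda: defaultdict(set))
    for sample, gene, cls in calls:
        altered[gene][cls].add(sample)
        altered[gene]["All"].add(sample)
    return {
        gene: {layer: len(by_layer[layer]) / n_samples for layer in LAYERS}
        for gene, by_layer in altered.items()
    }


if __name__ == "__main__":
    toy_calls = [
        ("s1", "TP53", "Ms"),
        ("s1", "TP53", "Ms"),  # second call in the same sample
        ("s2", "TP53", "Ns"),
        ("s3", "KRAS", "Ms"),
    ]
    for gene, freqs in layer_frequencies(toy_calls, n_samples=4).items():
        print(gene, freqs)
```

Counting distinct samples rather than calls keeps the outermost "All" layer bounded by 1.0, matching the legend's description of height as the frequency of altered samples.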


Table 1. Comparison of Vanno with other variant annotation tools

Tool           Availability    1000 Genomes
Vanno          Web             ✔
CPAP           Web             ✔
Annotate-it    Web             ✔
ANNOVAR        Command line    ✔
Anntools       Command line    ✔
KGGSeq         Command line    ✔
SeqAnt         Web             ✔
SVA            Graphical       ✔
TREAT          Command line    ✔
VarioWatch     Web             ✔

[Remaining rows of Table 1; the per-tool check marks were scattered during extraction and cannot be assigned to columns: ESP 6500 exomes; dbSNP & COSMIC; dbNSFP; ClinVar; Customized filter; Filter history; Multi-sample comparison (Circos); Multi-sample comparison (Heatmap); TCGA comparison; Dynamic summarized chart; Mutant DNA sequence retrieval; Mutant protein sequence retrieval; Pathway information; Mutual exclusive sorting; OMIM; Gene Ontology; Protein domain & 3D structure; Supported commercial gene/exome panels (entries: 1, 32).]
