Briefings in Bioinformatics Advance Access published September 22, 2014


doi:10.1093/bib/bbu032

Alternative applications for distinct RNA sequencing strategies

Leng Han*, Kasey C. Vickers*, David C. Samuels and Yan Guo

Submitted: 6th June 2014; Received (in revised form): 19th August 2014

Abstract

Keywords: RNAseq; data mining; SNP; mutation; exogenous RNA

BACKGROUND

High-throughput sequencing technology, also known as next-generation sequencing (NGS), has greatly reshaped researchers’ ability to study the genome. One of the most popular applications of high-throughput sequencing is RNA sequencing (RNAseq), which represents the current state of the art in gene expression analyses [1]. This technique also allows investigators to better understand how genes are regulated and to assess DNA structure and variance. The number of studies using RNAseq technology has increased significantly over the past few years, as evidenced by the number of RNAseq data sets stored in the Sequence Read Archive (SRA) (Figure 1). Previously, it has been shown that high-throughput sequencing of DNA produces unexpected insights and useful information [2]. The same theory has emerged for RNAseq data. To date, most investigators on the periphery of genomics have focused on quantification of mRNA expression [3], detection of alternative splicing [4–6] and identification of gene fusions [7–9]. Beneath the surface of each distinct RNAseq strategy, there are less obvious, but exciting, data-mining opportunities. The key advantage of RNAseq over traditional methods (e.g. hybridization-based microarrays) is the depth and novelty of the output based on unbiased sequence information. Because it provides base-level sequence information, RNAseq can be used to identify single-nucleotide polymorphisms (SNPs) and somatic mutations [10, 11], detect RNA editing events [12–14], measure allele-specific expression (ASE) [15, 16], quantify noncoding RNAs [17, 18] and detect exogenous RNA [19, 20]. Although custom-designed probes can

Corresponding author: Yan Guo, Department of Cancer Biology, Vanderbilt University, Nashville, TN 37027, USA. Tel.: 615-936-0816; Fax: 615-936-2602; E-mail: [email protected]
*These authors contributed equally to this work.
Leng Han is a postdoctoral fellow at the Department of Bioinformatics and Computational Biology, University of Texas MD Anderson Cancer Center.
Kasey Vickers is an assistant professor at Vanderbilt School of Medicine. He is an expert in extracellular smRNA biology.
David Samuels is an associate professor at the Department of Molecular Physiology & Biophysics, Vanderbilt University.
Yan Guo is an assistant professor at the Department of Cancer Biology, Vanderbilt University. He is the Technical Director of Bioinformatics for Vanderbilt Technologies for Advanced Genomics Analysis and Research Design.
© The Author 2014. Published by Oxford University Press. For Permissions, please email: [email protected]

Downloaded from http://bib.oxfordjournals.org/ at Brandeis University library on September 26, 2014

Recent advances in RNA library preparation methods, platform accessibility and cost efficiency have allowed high-throughput RNA sequencing (RNAseq) to replace conventional hybridization microarray platforms as the method of choice for mRNA profiling and transcriptome analyses. RNAseq is a powerful technique to profile both long and short RNA expression, and the depth of information gained from distinct RNAseq methods is striking and facilitates discovery. In addition to expression analysis, distinct RNAseq approaches also allow investigators to assess transcriptional elongation, DNA variance and exogenous RNA content. Here we review the current state of the art in transcriptome sequencing and address epigenetic regulation, quantification of transcription activation, RNAseq output and a diverse set of applications for RNAseq data. We detail how RNAseq can be used to identify allele-specific expression, single-nucleotide polymorphisms and somatic mutations and discuss the benefits and limitations of using RNAseq to monitor DNA characteristics. Moreover, we highlight the power of combining RNA- and DNAseq methods for genomic analysis. In summary, RNAseq provides the opportunity to gain greater insight into transcriptional regulation and output than simply miRNA and mRNA profiling.


address some of these outlined subtleties, this information is inherent to the output of RNAseq platforms. In this review, we describe the benefits and limitations of different RNA- and DNAseq approaches, including (i) the quantification of active transcription through precision nuclear run-on sequencing; (ii) the quantification of long noncoding RNA; (iii) the detection of exogenous RNA; (iv) the detection of RNA editing and ASE; and (v) the identification of genetic mutations. By combining RNA- and DNAseq approaches with novel applications, a greater understanding of the tremendous complexities of the transcriptome and genome can be gained. Moreover, combining RNA- and DNAseq approaches represents a synergistic strategy to identify novel gene expression regulatory modules that may have the potential to be targeted to prevent and treat a myriad of pathophysiologies.

BIOLOGICAL

Detection of transcriptional activation

Mammalian transcription is largely controlled by the chromatin state of DNA (epigenome), which allows cells to express specific genes and isoforms of both coding and noncoding elements. Euchromatic regions of accessible DNA allow transcription factors (both activators and repressors) to bind to regulatory elements within the genome. Recently, a large consortium of investigators cataloged these functional elements in the Encyclopedia of DNA Elements (ENCODE) project [21]. Results from this study suggest that as much as 80% of the genome is biologically active and functional. This large-scale project was completed using multiple types of sequencing strategies. DNaseI hypersensitivity assays were used to identify transcription factor-accessible regions of the genome. Likewise, formaldehyde-assisted isolation of regulatory elements sequencing was used in multiple cell types to denote open chromatin [22]. These data sets provide powerful information on the transcriptional opportunity and activity of DNA elements when combined with variations of chromatin-immunoprecipitation sequencing (ChIP-seq). ChIP-seq uses antibodies that recognize specific histone modifications to immunoprecipitate chromatin elements, which are then sequenced and cataloged. This approach has proven to be fundamental to the characterization of noncoding regulatory elements. Although a diverse set of histone modifications has been identified, the most popular strategies involve ChIP-seq of DNA regions corresponding to active regulatory elements (H3K27ac), poised or active promoters (H3K4me3) and active transcription (H3K79me2 and H3K36me3) [23, 24]. Recently, a systematic characterization of enhancer motifs and patterns (stretch enhancers) was demonstrated to be a viable strategy to define susceptibility regions for disease variants and cell-specific gene regulation [25]. DNAseq data from the ENCODE project and other regulatory atlases are powerful resources for the investigation of chromatin states and the regulatory control of the cellular transcriptome [26, 27]. Although ChIP-seq provides static information on promoter activity, DNAseq approaches cannot quantify actively engaged transcriptional elongation.


Figure 1: Distribution of the number of RNAseq archived data sets and publications. SRA, Sequence Read Archive.


promoters provide greater regulatory control than just initiation and significantly contribute to the position of Pol II stalling [37]. Both GROseq and PROseq take advantage of labeling nascent transcripts with modified bases under conditions in which Pol II initiation is inhibited: 5-bromouridine 5′-triphosphate (BrU, GROseq) and biotinylated nucleotides (PROseq), respectively. One limitation of GROseq is the use of short polynucleotide run-on lengths (50 bases at the 5′ end). PROseq overcomes this limitation through the use of biotinylated nucleotides that provide bp resolution throughout the transcript length. As such, PROseq provides genome-wide assessment of the link between core promoter structure and promoter-proximal pausing to map Pol II pausing sites and quantify transcriptional activity. Although PROseq allows for the assessment of Pol II pausing at exon–intron junctions and 3′ cleavage sites at transcription termination, both PROseq and GROseq predominantly sequence the 5′ ends of transcripts. Another approach is native elongating transcript sequencing (NET-seq), which monitors nascent RNA transcription through high-throughput sequencing of the 3′ ends of transcripts [38]. Recently, this technique was used to demonstrate that deacetylation by Rpd3S dictates promoter directionality and suppresses antisense elongation during divergent transcription [38]. Moreover, this study revealed extensive Pol II pausing and backtracking during elongation, and the data suggest that Pol II must overcome nucleosome-induced pausing for productive elongation [38]. As such, NET-seq allows for the quantification of the 3′ ends of transcripts with single-base resolution of Pol II pausing, which aids in the identification of chromatin structure and factors that influence transcription. The human genome is under constant transcription, and many transcripts, including minimally elongated antisense transcripts, are rapidly degraded. Nevertheless, the quantification of processed mRNA transcripts (RNAseq) provides spatial and quantitative assessment of RNA transcription. In summary, RNAseq approaches are a unique and appropriate class of methods to assess the functional potential of the genome and the impact of the transcriptome (Table 1).
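As a rough illustration of how promoter-proximal pausing might be summarized from run-on sequencing data, the sketch below computes a pausing index, that is, the ratio of read density near the transcription start site to read density over the rest of the gene. This metric is not defined in the text above; it is used here as an assumption, and the function name, the input layout (per-base sense-strand read counts starting at the transcription start site) and the 250-base window are all hypothetical choices for illustration.

```python
from typing import Sequence


def pausing_index(per_base_counts: Sequence[float], promoter_end: int = 250) -> float:
    """Ratio of promoter-proximal read density to gene-body read density.

    per_base_counts: sense-strand read counts per base, index 0 = transcription
        start site (hypothetical input layout).
    promoter_end: number of bases treated as the promoter-proximal window
        (an assumed value, not taken from the text).
    """
    if len(per_base_counts) <= promoter_end:
        raise ValueError("gene must be longer than the promoter-proximal window")

    promoter = per_base_counts[:promoter_end]
    body = per_base_counts[promoter_end:]

    # Densities are reads per base, so windows of different length are comparable.
    promoter_density = sum(promoter) / len(promoter)
    body_density = sum(body) / len(body)

    if body_density == 0:
        # No gene-body signal: the ratio is unbounded (or undefined if both are zero).
        return float("inf") if promoter_density > 0 else float("nan")
    return promoter_density / body_density


# A toy gene with strong signal in the first 100 bases and weak signal downstream.
toy_counts = [10] * 100 + [1] * 900
toy_index = pausing_index(toy_counts, promoter_end=100)
```

A high value indicates that engaged Pol II is concentrated near the promoter relative to the gene body, which is consistent with pausing; a value near 1 indicates uniform density.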

COMPUTATIONAL

Quantification of long noncoding RNA

Long noncoding RNAs (lncRNAs) are arbitrarily defined as transcripts >200 nucleotides in length that do not encode proteins. Strikingly, there are tens of


Moreover, ChIP-seq does not assess transcription potentiation associated with RNA polymerase II (Pol II) pausing or linear transcriptional activation of genes. Conversely, high-throughput RNAseq approaches do offer an opportunity to identify genes that are poised to be rapidly transcribed (elongated) in response to cell signaling or other biological stimuli [28]. High-throughput sequencing can be applied to classic nuclear run-on strategies to quantify genome-wide transcription activation of individual genes [28, 29]. Pol II pauses at promoter-proximal sites in the early stages of elongation, just after transcriptional initiation and promoter escape and before productive elongation. Pausing provides an additional level of gene regulation and can be used to identify and quantify potentiation of transcription in a biological context. Unlike in yeast (Saccharomyces cerevisiae), the levels of Pol II simply bound to mammalian promoters do not correlate with gene transcription and mRNA levels [30–32]. For example, approximately one-third of genes with Pol II density at the 5′ end are not transcriptionally active, and Pol II resides in a poised state, either in preinitiation or initiated-paused forms, but not transcriptionally engaged [29]. Many developmental and metabolic programs require rapid control of gene expression, and regulation of Pol II pausing provides this control. Recent genome-scale studies have reported large numbers of genes that are regulated by Pol II pausing, including an overrepresentation of genes that are rapidly turned on by cell signaling [32–34]. In addition, Pol II pausing is the rate-limiting step of gene transcription post initiation. In 2008, global run-on sequencing (GROseq) was used to quantify genome-wide transcriptionally engaged Pol II [29]. Essentially, GROseq replaced more traditional nuclear run-on assays and provided high resolution of transcriptionally engaged genes and the ability to quantify active expression. In addition to gene activity and pausing, GROseq also provided detailed evidence of divergent transcription, and the data suggested that the majority of productive elongation occurs downstream, not upstream, of the transcriptional start site [29]. The regulatory impact of bimodal divergent transcription is not fully understood; however, small RNAs (smRNAs) and other regulatory RNAs have been reported to be produced from short nascent RNA from divergent transcription [29, 35, 36]. Recently, precision nuclear run-on high-throughput sequencing (PROseq) was demonstrated to provide base-pair (bp) resolution of Pol II pausing. Using this technique, investigators found that


Table 1: Experimental (biological) analysis using RNAseq

Sequencing strategy | Target | Limitations | References
GROseq | Defines genome-wide transcriptionally engaged Pol II pausing (5′-end transcripts) | Short run-on lengths | [29]
PROseq | Defines genome-wide transcriptionally engaged Pol II at base-pair resolution (5′-end transcripts) | Predominantly 5′-end transcripts | [37]
NET-seq | Monitors nascent RNA transcription through high-throughput sequencing of 3′ transcripts | Predominantly 3′-end transcripts | [38]

is associated with some kind of biochemical function through the regulation of the expression of coding genes [40], though this claim has received criticism [56]. Nonetheless, RNAseq data provide us with the opportunity to study noncoding RNA at an unprecedented level.

Detection of exogenous RNA

Based on the diversity and utility of transcriptomic output from RNAseq strategies, many investigators are using these methods to test a diverse set of hypotheses. For example, many groups are using RNAseq to better understand the relationship between viruses and cancer [19, 41, 57–60]. Viruses, whether present as free viral RNA or DNA or integrated into the host genome, play important roles in many diseases, including cancer [61]. It has been estimated that viruses cause 15–20% of all cancers [62, 63]. Viruses can trigger oncogenesis through insertion of viral genes into host genomes near oncogenes (e.g. c-myc [64] or N-ras [65]) or tumor suppressor genes. Viral gene insertion forces cells into G1 phase by turning on nearby oncogenes and turning off nearby tumor suppressor genes. This can occur via cis- and trans-regulation of promoter and enhancer sequences within viral LTRs [66] and can lead to uncontrolled cell division and tumorigenesis. Viral genomes are readily detected using high-throughput sequencing technology [67–71]. Even though most oncogenic viruses are DNA viruses, their transcribed RNAs can be detected through RNAseq (Table 2). Taking advantage of publicly available large-scale RNAseq data sets [The Cancer Genome Atlas (TCGA)], two independent studies analyzed thousands of RNAseq samples across multiple human cancers. Khoury et al. mined RNAseq data from 3775 TCGA samples for viral RNAs. As expected, human papillomavirus (HPV), hepatitis B virus (HBV) and Epstein–Barr virus were found in


thousands of lncRNAs, many of which are structurally similar to mRNAs, including polyadenylation (polyA), which accounts for their inclusion in polyA-enriched RNAseq data sets. The standard RNAseq method for mRNA expression analyses selects for polyA tails, and because some lncRNAs carry this tail, an mRNAseq library typically contains lncRNAs as well. Although lncRNAs are noncoding, recent research has shown evidence of diverse functionality related to subcellular structural organization and embryonic stem cell differentiation, among many other biological processes. The rise in the popularity and affordability of RNAseq technology is largely responsible for the growing interest in and understanding of lncRNAs, as researchers explore the presence of these stowaways in their mRNA data sets. Although lncRNAs can be studied with traditional microarrays, RNAseq is the superior technology for this purpose because of its greater sensitivity and its ability to detect novel lncRNAs. Cabili et al. were pioneers of lncRNA research, describing the defining characteristics of lncRNAs [39], and the ENCODE project [18] made a huge contribution, identifying 13 333 lncRNAs and categorizing them into four classes: (i) antisense, (ii) large intergenic noncoding RNAs, (iii) sense intronic and (iv) processed transcripts. The ENCODE project also examined lncRNA expression and identified tissue-specific expression patterns (Table 2). Noncoding RNA has gained enormous interest in biomedical research in recent years. More and more researchers are starting to realize that the answers to some diseases may lie outside the coding regions. The National Human Genome Research Institute launched the ENCODE consortium to study the noncoding regulatory genome, and the consortium claims that 80% of the human genome


Table 2: Computational analysis on RNAseq

Target | Limitations | Tools/Resources | References
Noncoding RNA | Lack of annotations and coordinates | ENCODE | [18, 39, 40]
Exogenous RNA | Homology; high mutation rate of RNA viruses can induce misalignment | SRSA, PathSeq, ViralFusionSeq, VirusSeq, VirusFinder | [41–45]
RNA editing | Confounded by naturally occurring SNPs and somatic mutations | REDItools, DARNED, REDidb, dbRES, RADAR | [46–49]
Allele-specific expression | Can only infer ASE of genes with variance (SNPs) | asSeq, AlleleSeq | [50, 51]
DNA variation | High false-positive rates for DNA mutations; RNA splicing | RNAmapper, SNVQ, RSMC, SNPiR | [11, 52–55]

Detection of RNA editing and allele-specific expression

As discussed above, RNA editing provides an additional level of posttranscriptional regulation to noncoding RNAs, and thus, is a powerful way to diversify the transcriptome [73, 74]. Recoding RNA editing is a genetic editing mechanism cells use to expand the number of proteins assembled from a single DNA locus, and recoding has been reported to play an important role in psychiatric diseases [75] and cancer [76, 77]. Nevertheless, accurate identification and cataloging of RNA editing sites remain challenging [78] in both coding and noncoding regions [79–81]. Recent algorithms are specifically designed to integrate orthogonal RNA- and DNAseq data for a more accurate accounting of editing events [82–85]. These approaches generally apply stringent mapping criteria with a series of filters to remove false-positive results [86]. For example, RNA editing is generally not uniform on all RNA molecules; therefore, sites with apparent 100% editing are often discarded. Moreover, editing sites within 4 bp of splice junctions are also removed owing to the higher possibility of misalignment [83, 84]. Additional BLAST-like alignment tool (BLAT) searches are required to confirm that reads with mismatches against the reference genome are mapped uniquely and to exclude the effects of homologous regions [82–84]. Moreover, RNA editing sites that overlap with known SNPs in public databases (e.g. dbSNP) are excluded [83, 84]. Likewise, known somatic mutations within pathological contexts (e.g. cancer) are also excluded. Combining orthogonal DNA- and RNAseq approaches is a powerful strategy; however, it requires high-throughput sequencing of both, which can be costly and time-consuming [87]. To reduce costs, Ramaswami et al. developed a pipeline to identify RNA editing sites using RNAseq data alone (Table 2) [87]. After filtering common SNPs, their algorithm distinguishes RNA editing sites from rare SNPs based on the assumption that editing sites, unlike rare SNPs, are highly conserved among different individuals [88]. Assessment of RNA editing site conservation benefits from larger sample sizes and, therefore, will be more


associated head-and-neck, uterine endometrioid and lung cancers, respectively [19]. This study demonstrated that viral integration detection is possible through RNAseq approaches. Moreover, a number of viral integration sites and oncogenes were detected in tumors infected with HPV and HBV [62, 63]. Nonetheless, a more recent study on 4433 tumors and 404 normal samples across 19 cancer types from the TCGA database argued against viral etiology in most cancer types [20]. More interestingly, this study reported co-adaptation between viral transcripts and host mRNA expression: 1897 host genes were altered at least 2-fold in HPV-positive HNSC tumors compared with HPV-negative HNSC tumors. This observation strongly suggests that HPV has a widespread impact on the host transcriptome that can be analyzed using smRNA and RNAseq. Because of the functional importance of virus detection, many tools have been developed for sequencing data, including PathSeq [42], ViralFusionSeq [43], VirusSeq [44] and VirusFinder [45]. Considering the rapid mutation rate of DNA viruses (10⁻⁶–10⁻⁸ mutations per base per generation) and RNA viruses (10⁻³–10⁻⁵ mutations per base per generation) [72], an increased mismatch allowance is required to align RNAseq reads to the viral genome. Even though some of the tools are designed for exome or whole-genome sequencing data, the fundamental algorithms also apply to RNAseq data.
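The effect of the mismatch allowance can be illustrated with a toy example. The sketch below places a read against a reference by brute-force ungapped comparison and accepts the best placement only if it stays within a mismatch budget. It is a minimal illustration under stated assumptions, not how PathSeq, VirusSeq or the other tools named above work internally; real aligners use indexed, gapped alignment. The function names and the example sequences are hypothetical.

```python
from typing import Optional, Tuple


def count_mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def best_ungapped_hit(read: str, reference: str,
                      max_mismatches: int) -> Optional[Tuple[int, int]]:
    """Return (start, mismatches) of the best ungapped placement of `read`
    on `reference`, or None if no placement is within `max_mismatches`."""
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = count_mismatches(read, window)
        # Keep the first placement with the fewest mismatches within budget.
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best


# Hypothetical viral reference and a read from a diverged strain
# (one substitution relative to reference positions 8-15).
viral_reference = "ACGTACGTTAGCCGATAGGCTA"
diverged_read = "TAGCTGAT"

strict_hit = best_ungapped_hit(diverged_read, viral_reference, max_mismatches=0)
relaxed_hit = best_ungapped_hit(diverged_read, viral_reference, max_mismatches=1)
```

With no mismatches allowed, the diverged read is lost; allowing one mismatch recovers it. The trade-off, as noted in Table 2, is that a looser allowance also raises the risk of misalignment to homologous sequence.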



Characterization of DNA variation through RNAseq

Traditionally, Sanger sequencing has been the method of choice to detect DNA variation, including SNPs, somatic mutations, indels and microsatellite (MS) instability mutations. One obvious limitation of this method is that sequencing only covers a small genomic region of interest. As such, genome-scale massively parallel sequencing (DNAseq) has largely replaced older methods, although focused sequencing approaches still use Sanger methods. DNAseq has been extensively used for exome and whole-genome sequencing and has been instrumental in the rapid advancement of genome-wide association studies (GWAS). Although RNAseq has not been widely used for detecting DNA mutations, it has proven to be a viable method to gain insight into DNA variance (Table 2). Combining RNA- and DNAseq methods on matched samples allows for the assessment of the impact of DNA variance on gene expression [10]. Tools for SNP detection in DNAseq data are abundant (GATK [97], Varscan [98] and MuTect [99]); however, relatively few corresponding tools are designed for RNAseq data input. Identifying mutations with RNAseq data poses unique challenges, primarily high false-positive rates for DNA mutations. This is due to several issues, one being cycle bias, which occurs at heterozygous positions when one of the two alleles in the supporting reads is at the beginning or end of the read [96]. Another source of error is RNA splicing, as reads mapped across exon–intron junctions against a DNA reference are prone to alignment errors [11]. Nevertheless, recent advances in our understanding of alternative splicing and exon–intron junctions have challenged this view. Combining RNAseq with DNAseq is an effective way to differentiate alignment errors from alternative splicing and intron retention, and specific tools are now available to address these issues [6, 100, 101]. Alternatively, aligning reads to a reference RNA ‘transcriptome’ is a potential solution; however, novel tools for this are required [11]. SNPs and somatic mutations can be identified by counting mismatches against the reference genome; however, excessive mismatches owing to the errors described above will result in a high false-positive rate for SNPs and somatic mutations. As quality control, false-positive results owing to cycle bias should be filtered by


accurate in large-scale RNAseq data sets. Based on Ramaswami’s approaches, Picardi and Pesole developed REDItools, which systematically identifies RNA editing events by integrating RNA- and DNAseq data and/or through RNAseq data analysis alone [46]. RNA editing databases such as DARNED [47], REDidb [48], dbRES [49] and RADAR [89] can be used to exclude common RNA editing sites, although this can be difficult in the presence of somatic mutations [11]. Nevertheless, because most RNA editing events are A-to-G changes, non-A-to-G variants are more likely to be SNPs or somatic mutations and can be parsed appropriately [11]. Despite the low prevalence of editing events [84, 87], a significant fraction of RNA–DNA discrepancies is attributed to RNA editing. As such, removing common RNA editing sites is highly encouraged when screening for somatic mutations. ASE analysis compares expression between the two alleles of a given gene. Recently, ASE has garnered significant attention because of the ENCODE project and greater interest in cis-acting genetic variants and parent-of-origin regulation [15, 90, 91]. Based on these interests, investigators have developed novel tools for combining RNA- and DNAseq data to assess ASE, including asSeq [50] and AlleleSeq [51]. asSeq models total read counts using discrete distributions, whereas AlleleSeq combines genomic sequence variants (i.e. SNPs, indels and structural variants) to build a diploid personal genome and then identifies ASE for events with significant differences in the number of mapped reads between the two alleles. The fundamental idea common to these tools is assigning reads to specific alleles. The major drawback of these approaches is that they can only infer ASE of genes with variance (SNPs). As discussed above, SNP detection using RNAseq data is possible, but the high false-positive rates associated with RNAseq data will greatly affect the accuracy of ASE analysis. Moreover, reference allele preferential bias can be problematic. This bias arises during alignment when algorithms that penalize mismatches from the reference favor reads carrying the reference allele. In these alignment algorithms, a read with more than the permitted number of mismatches from the reference genome is discarded; reads carrying the nonreference allele therefore start with a one-mismatch penalty. Bias of this nature has been described by multiple studies (Table 2) [92–96].
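To make the idea of testing for a difference in mapped reads between two alleles concrete, the sketch below applies an exact two-sided binomial test to the reference and alternate read counts at a single heterozygous SNP. The choice of a binomial test is an assumption made here for illustration and is not presented as the method used by asSeq or AlleleSeq. The `expected_ref_fraction` parameter is a hypothetical knob that lets the null hypothesis be shifted away from 0.5 to account for reference allele preferential bias; its value would have to be estimated separately.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value: the total probability of all outcomes
    that are no more likely than the observed count k."""
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small relative tolerance guards against floating-point ties.
    total = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, total)


def ase_p_value(ref_reads: int, alt_reads: int,
                expected_ref_fraction: float = 0.5) -> float:
    """P-value for allelic imbalance at one heterozygous site.

    expected_ref_fraction: fraction of reads expected from the reference allele
        under no ASE; values above 0.5 model reference mapping bias (assumed).
    Intended for modest depths (up to several hundred reads); very deep sites
    would need a numerically safer implementation.
    """
    n = ref_reads + alt_reads
    if n == 0:
        raise ValueError("no reads cover this site")
    return binomial_two_sided_p(ref_reads, n, expected_ref_fraction)


balanced = ase_p_value(10, 10)             # no evidence of imbalance
monoallelic = ase_p_value(20, 0)           # all reads from one allele
naive = ase_p_value(12, 8)                 # null of 0.5
bias_aware = ase_p_value(12, 8, 0.6)       # null shifted for reference bias
```

In this toy case, a 12:8 split is exactly what a reference fraction of 0.6 would predict, so correcting the null for mapping bias removes what little apparent imbalance there was.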


small structural variations in a few cancer studies [109, 110]. Microsatellite (MS) instability is a common type of genetic hypermutation characterized by the expansion or contraction of DNA repeat tracts as a consequence of DNA mismatch repair deficiency. As with SNP and somatic mutation identification, DNAseq is preferred over RNAseq for studying MS instability; however, researchers have been creatively using available RNAseq data to detect MS instability. For example, RNAseq data have been used to identify MS markers in Amorphophallus [111] and cancer [109]. Furthermore, Lu et al. developed a novel approach [112] for characterizing MS instability by RNAseq in cancer by comparing results from the software tools DINDEL [113] and Tandem Repeats Finder [114]. Although this approach may miss MS instability in regulatory regions and low-coverage regions, this study was able to find more short MS deletions in MS-instable (MSI) than in MS-stable (MSS) samples. Yoon et al. performed comprehensive analyses using both genome- and transcriptome-wide sequencing data to identify 18 377 MS mutations in Korean gastric cancers, and >90% of these MS mutations are deletions in untranslated regions of genes [23]. Xu et al. identified 116 disruptive mutations in five human prostate cancer tissues, including frameshift indels and nonsynonymous nucleotide substitutions, by using RNAseq only [110].
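The cycle-bias filter described earlier in this section, which removes apparent mutations found disproportionately at the beginning or end of reads, can be sketched as follows. This is a minimal illustration under stated assumptions: the input layout (one `(position_in_read, read_length)` pair per supporting read), the 5-base edge window and the 0.8 threshold are all hypothetical and are not taken from Kleinman et al. or from any of the tools cited above.

```python
from typing import Iterable, Tuple


def near_read_end(position: int, read_length: int, edge: int = 5) -> bool:
    """True if a 0-based position lies within `edge` bases of either read end."""
    return position < edge or position >= read_length - edge


def fails_cycle_bias_filter(observations: Iterable[Tuple[int, int]],
                            edge: int = 5,
                            max_edge_fraction: float = 0.8) -> bool:
    """Flag a candidate variant whose supporting reads carry it mostly at read ends.

    observations: (position_in_read, read_length) for each read supporting the
        variant allele (hypothetical input layout).
    edge, max_edge_fraction: assumed thresholds chosen for illustration only.
    """
    observations = list(observations)
    if not observations:
        raise ValueError("variant has no supporting reads")
    edge_count = sum(near_read_end(pos, length, edge) for pos, length in observations)
    return edge_count / len(observations) > max_edge_fraction


# Variant seen only in the first or last few cycles of 50-base reads: suspicious.
edge_only = [(0, 50), (1, 50), (48, 50), (49, 50), (2, 50)]
# Variant seen throughout the read: retained.
well_spread = [(10, 50), (25, 50), (30, 50), (2, 50)]

flag_edge_only = fails_cycle_bias_filter(edge_only)
flag_well_spread = fails_cycle_bias_filter(well_spread)
```

Consistent with the emphasis on specificity over sensitivity, a filter like this would discard some true variants at low-coverage sites where the few supporting reads happen to place the variant near a read end.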

CONCLUSION

RNAseq is a class of distinct strategies to assess the depth and complexity of the transcriptome and genome. Moreover, many may not be aware that the amount of publicly available RNAseq data rivals the abundance of exome sequencing data. The large amount of available data provides enormous opportunities for researchers to conduct additional genomic analyses beyond traditional RNAseq expression profiling. As such, data mining on RNAseq data provides new insights into pathophysiology that have previously been neglected. The central message of this discussion is that RNAseq technology can be used in many different ways to make sharper observations of the regulatory control and expression of our genome. Moreover, RNAseq provides a dynamic platform to test both biological and computational hypotheses. In summary, RNAseq should be viewed as a platform to gain information other than just smRNA and


removing reads with mutations that are found disproportionately at the beginning or end of reads, which has been effectively demonstrated by Kleinman et al. [79]. False-positive results owing to exon–intron junction alignments present a greater problem, and SNPs and somatic mutations identified near splicing sites should also be removed or flagged for further review. In addition to high false-positive rates, another limitation for detecting mutations in RNAseq data is that coverage is dependent on gene expression. For a normal human tissue, 12 000 genes are expressed in a given cell, which corresponds to 60–70% of the 22 000 protein-coding genes [102]. As such, RNAseq data often do not generate enough coverage for lowly abundant and silent genes to infer mutation status. Given all of the possible complications with identifying SNPs and somatic mutations from RNAseq data, focusing on specificity rather than sensitivity is suggested. One of the first programs for detecting mutations in RNAseq data, RNAmapper (developed for zebrafish), was designed to detect causal mutations for a phenotype of interest in primary genomic regions [52]. Duitama et al. introduced a Bayesian model (single nucleotide variation quality, SNVQ) for SNP discovery in RNAseq data; however, neither SNVQ nor RNAmapper addresses cycle bias [53]. More recent tools, including RSMC [54] and SNPiR [11], are designed to remove apparent mutations associated with cycle bias and intronic SNPs within several bp of a splice junction. Recently, Chepelev et al. introduced a pipeline to reduce false-positive results by implementing a ‘Redundant Read Filter’ for RNAseq data [54]. A recent report suggests that coverage of only >10× is required to ensure 89% accuracy and 92% sensitivity for single nucleotide variations [103]. Another recent study by Miller et al. used high-confidence SNP markers to identify candidate deleterious mutations that directly alter amino acids, splicing events or gene expression [104]. Although RNAseq analyses of DNA variance have multiple limitations, RNAseq is highly capable of identifying small indels (insertions and deletions) and gene fusions, and a few tools have been developed to detect indels and gene fusions from RNAseq, including TopHat2 [105], FX [106], OSA [107] and PRADA [108]. Nonetheless, using RNAseq to detect indels has not been widely applied owing to false-positive results for the same reasons as described above; however, RNAseq has been used to detect


mRNA expression. Combined with proper informatics support, investigators can unlock more data and collect significantly more information from new or archived RNAseq data sets.

Key points
• RNAseq data can be used for in-depth data mining
• RNAseq technology can be used to assess transcriptional elongation
• Noncoding RNA can be quantified by RNAseq
• RNAseq methods capture exogenous RNA
• Detection of RNA editing events and ASE
• SNPs and mutations are easily detected with RNAseq data

FUNDING
K.C.V. is supported by NIH K22HL113039, NIH DK20593, AHA 14CSA20660001 and LRI Novel Grant Award. Y.G. is supported by CCSG (P30 CA068485).

Acknowledgments
The authors would like to thank Leslie A. Roteta and Margot Bjoring for their assistance in drafting the review.

References
1. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009;10:57–63.
2. Samuels DC, Han L, Li J, et al. Finding the lost treasures in exome sequencing data. Trends Genet 2013;29:593–9.
3. Steijger T, Abril JF, Engstrom PG, et al. Assessment of transcript reconstruction methods for RNA-seq. Nat Methods 2013;10:1177–84.
4. Keren H, Lev-Maor G, Ast G. Alternative splicing and evolution: diversification, exon definition and function. Nat Rev Genet 2010;11:345–55.
5. Li HD, Menon R, Omenn GS, et al. The emerging era of genomic data integration for analyzing splice isoform function. Trends Genet 2014;30:340–7.
6. Feng H, Qin Z, Zhang X. Opportunities and methods for studying alternative splicing in cancer with RNA-Seq. Cancer Lett 2013;340:179–91.
7. Maher CA, Kumar-Sinha C, Cao X, et al. Transcriptome sequencing to detect gene fusions in cancer. Nature 2009;458:97–101.
8. Maher CA, Palanisamy N, Brenner JC, et al. Chimeric transcript discovery by paired-end transcriptome sequencing. Proc Natl Acad Sci USA 2009;106:12353–8.
9. Wang Q, Xia J, Jia P, et al. Application of next generation sequencing to human gene fusion detection: computational tools, features and perspectives. Brief Bioinform 2013;14:506–19.
10. Comprehensive molecular portraits of human breast tumours. Nature 2012;490:61–70.
11. Piskol R, Ramaswami G, Li JB. Reliable identification of genomic variants from RNA-seq data. Am J Hum Genet 2013;93:641–51.
12. Peng Z, Cheng Y, Tan BC, et al. Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome. Nat Biotechnol 2012;30:253–60.
13. Ramaswami G, Lin W, Piskol R, et al. Accurate identification of human Alu and non-Alu RNA editing sites. Nat Methods 2012;9:579–81.
14. Ramaswami G, Zhang R, Piskol R, et al. Identifying RNA editing sites using RNA sequencing data alone. Nat Methods 2013;10:128–32.
15. Gregg C, Zhang J, Weissbourd B, et al. High-resolution analysis of parent-of-origin allelic expression in the mouse brain. Science 2010;329:643–8.
16. Zhang R, Li X, Ramaswami G, et al. Quantifying RNA allelic ratios by microfluidic multiplex PCR and sequencing. Nat Methods 2014;11:51–4.
17. Cabili MN, Trapnell C, Goff L, et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev 2011;25:1915–27.
18. Djebali S, Davis CA, Merkel A, et al. Landscape of transcription in human cells. Nature 2012;489:101–8.
19. Khoury JD, Tannir NM, Williams MD, et al. Landscape of DNA Virus Associations across Human Malignant Cancers: Analysis of 3,775 Cases Using RNA-Seq. J Virol 2013;87:8916–26.
20. Tang KW, Alaei-Mahabadi B, Samuelsson T, et al. The landscape of viral expression and host gene fusion and adaptation in human cancer. Nat Commun 2013;4:2513.
21. Consortium EP, Bernstein BE, Birney E, et al. An integrated encyclopedia of DNA elements in the human genome. Nature 2012;489:57–74.
22. Song L, Zhang Z, Grasfeder LL, et al. Open chromatin defined by DNaseI and FAIRE identifies regulatory elements that shape cell-type identity. Genome Res 2011;21:1757–67.
23. Li B, Carey M, Workman JL. The role of chromatin during transcription. Cell 2007;128:707–19.
24. Creyghton MP, Cheng AW, Welstead GG, et al. Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proc Natl Acad Sci USA 2010;107:21931–6.
25. Parker SC, Stitzel ML, Taylor DL, et al. Chromatin stretch enhancer states drive cell-specific gene regulation and harbor human disease risk variants. Proc Natl Acad Sci USA 2013;110:17921–6.
26. Ernst J, Kheradpour P, Mikkelsen TS, et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 2011;473:43–9.
27. Shen Y, Yue F, McCleary DF, et al. A map of the cis-regulatory sequences in the mouse genome. Nature 2012;488:116–20.
28. Danko CG, Hah N, Luo X, et al. Signaling pathways differentially affect RNA polymerase II initiation, pausing, and elongation rate in cells. Mol Cell 2013;50:212–22.
29. Core LJ, Waterfall JJ, Lis JT. Nascent RNA sequencing reveals widespread pausing and divergent initiation at human promoters. Science 2008;322:1845–8.
30. Robert F, Pokholok DK, Hannett NM, et al. Global position and recruitment of HATs and HDACs in the yeast genome. Mol Cell 2004;16:199–209.
31. Guenther MG, Levine SS, Boyer LA, et al. A chromatin landmark and transcription initiation at most promoters in human cells. Cell 2007;130:77–88.
32. Muse GW, Gilchrist DA, Nechaev S, et al. RNA polymerase is poised for activation across the genome. Nat Genet 2007;39:1507–11.
33. Zeitlinger J, Stark A, Kellis M, et al. RNA polymerase stalling at developmental control genes in the Drosophila melanogaster embryo. Nat Genet 2007;39:1512–16.
34. Wang X, Lee C, Gilmour DS, et al. Transcription elongation controls cell fate specification in the Drosophila embryo. Genes Dev 2007;21:1031–6.
35. Zamudio JR, Kelly TJ, Sharp PA. Argonaute-bound small RNAs from promoter-proximal RNA polymerase II. Cell 2014;156:920–34.
36. Jacquier A. The complex eukaryotic transcriptome: unexpected pervasive transcription and novel small RNAs. Nat Rev Genet 2009;10:833–44.
37. Kwak H, Fuda NJ, Core LJ, et al. Precise maps of RNA polymerase reveal how promoters direct initiation and pausing. Science 2013;339:950–3.
38. Churchman LS, Weissman JS. Nascent transcript sequencing visualizes transcription at nucleotide resolution. Nature 2011;469:368–73.
39. Cabili MN, Trapnell C, Goff L, et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev 2011;25:1915–27.
40. Bernstein BE, Birney E, Dunham I, et al. An integrated encyclopedia of DNA elements in the human genome. Nature 2012;489:57–74.
41. Isakov O, Modai S, Shomron N. Pathogen detection using short-RNA deep sequencing subtraction and assembly. Bioinformatics 2011;27:2027–30.
42. Kostic AD, Ojesina AI, Pedamallu CS, et al. PathSeq: software to identify or discover microbes by deep sequencing of human tissue. Nat Biotechnol 2011;29:393–6.
43. Li JW, Wan R, Yu CS, et al. ViralFusionSeq: accurately discover viral integration events and reconstruct fusion transcripts at single-base resolution. Bioinformatics 2013;29:649–51.
44. Chen Y, Yao H, Thompson EJ, et al. VirusSeq: software to identify viruses and their integration sites using next-generation sequencing of human cancer tissue. Bioinformatics 2013;29:266–7.
45. Wang Q, Jia P, Zhao Z. VirusFinder: software for efficient and accurate detection of viruses and their integration sites in host genomes through next generation sequencing data. PLoS One 2013;8:e64465.
46. Picardi E, Pesole G. REDItools: high-throughput RNA editing detection made easy. Bioinformatics 2013;29:1813–14.
47. Kiran A, Baranov PV. DARNED: a DAtabase of RNa EDiting in humans. Bioinformatics 2010;26:1772–6.
48. Picardi E, Regina TMR, Brennicke A, et al. REDIdb: the RNA editing database. Nucleic Acids Res 2007;35:D173–7.
49. He T, Du P, Li Y. dbRES: a web-oriented database for annotated RNA editing sites. Nucleic Acids Res 2007;35:D141–4.
50. Sun W. A statistical framework for eQTL mapping using RNA-seq data. Biometrics 2012;68:1–11.
51. Rozowsky J, Abyzov A, Wang J, et al. AlleleSeq: analysis of allele-specific expression and binding in a network framework. Mol Syst Biol 2011;7:522.
52. Miller AC, Obholzer ND, Shah AN, et al. RNA-seq-based mapping and candidate identification of mutations from forward genetic screens. Genome Res 2013;23:679–86.
53. Duitama J, Srivastava P, Mandoiu I. Towards accurate detection and genotyping of expressed variants from whole transcriptome sequencing data. BMC Genomics 2012;13:S6.
54. Yan Guo, QS, Chung-I Li, David C. Samuels, Yu Shyr. RNA Somatic Mutation Caller (RSMC): identifying somatic mutation using RNAseq data. https://github.com/shengqh/rsmc/wiki.
55. Chepelev I, Wei G, Tang Q, et al. Detection of single nucleotide variations in expressed exons of the human genome using RNA-Seq. Nucleic Acids Res 2009;37:e106.
56. Palazzo AF, Gregory TR. The case for junk DNA. PLoS Genet 2014;10:e1004351.
57. Palacios G, Druce J, Du L, et al. A new arenavirus in a cluster of fatal transplant-associated diseases. N Engl J Med 2008;358:991–8.
58. Nakamura S, Yang CS, Sakon N, et al. Direct metagenomic detection of viral pathogens in nasal and fecal specimens using an unbiased high-throughput sequencing approach. PLoS One 2009;4:e4219.
59. Quan PL, Wagner TA, Briese T, et al. Astrovirus encephalitis in boy with X-linked agammaglobulinemia. Emerg Infect Dis 2010;16:918–25.
60. Briese T, Paweska JT, McMullan LK, et al. Genetic detection and characterization of Lujo virus, a new hemorrhagic fever-associated arenavirus from southern Africa. PLoS Pathog 2009;5:e1000455.
61. Geuking MB, Weber J, Dewannieux M, et al. Recombination of retrotransposon and exogenous RNA virus results in nonretroviral cDNA integration. Science 2009;323:393–6.
62. Parkin DM. The global health burden of infection-associated cancers in the year 2002. Int J Cancer 2006;118:3030–44.
63. Morissette G, Flamand L. Herpesviruses and chromosomal integration. J Virol 2010;84:12100–9.
64. Girard L, Hanna Z, Beaulieu N, et al. Frequent provirus insertional mutagenesis of Notch1 in thymomas of MMTVD/myc transgenic mice suggests a collaboration of c-myc and Notch1 for oncogenesis. Genes Dev 1996;10:1930–44.
65. Martín-Hernández J, Sørensen AB, Pedersen FS. Murine Leukemia virus proviral insertions between the N-ras and unr genes in B-cell lymphoma DNA affect the expression of N-ras only. J Virol 2001;75:11907–12.
66. Wang T, Zhao R, Wu Y, et al. Hepatitis B virus induces G1 phase arrest by regulating cell cycle genes in HepG2.2.15 cells. Virol J 2011;8:231.
67. Barzon L, Lavezzo E, Militello V, et al. Applications of next-generation sequencing technologies to diagnostic virology. Int J Mol Sci 2011;12:7861–84.
68. Radford AD, Chapman D, Dixon L, et al. Application of next-generation sequencing technologies in virology. J Gen Virol 2012;93:1853–68.
69. Chevaliez S, Rodriguez C, Pawlotsky JM. New virologic tools for management of chronic hepatitis B and C. Gastroenterology 2012;142:1303–1313e1.
70. Li L, Delwart E. From orphan virus to pathogen: the path to the clinical lab. Curr Opin Virol 2011;1:282–8.
71. Capobianchi MR, Giombini E, Rozera G. Next-generation sequencing technology in clinical virology. Clin Microbiol Infect 2013;19:15–22.
72. Drake JW, Charlesworth B, Charlesworth D, et al. Rates of spontaneous mutation. Genetics 1998;148:1667–86.
73. Keegan LP, Gallo A, O’Connell MA. The many roles of an RNA editor. Nat Rev Genet 2001;2:869–78.
74. Garrett S, Rosenthal JJ. RNA editing underlies temperature adaptation in K+ channels from polar octopuses. Science 2012;335:848–51.
75. Eran A, Li JB, Vatalaro K, et al. Comparative RNA editing in autistic and neurotypical cerebella. Mol Psychiatry 2012;18:1041–8.
76. Chen L, Li Y, Lin CH, et al. Recoding RNA editing of AZIN1 predisposes to hepatocellular carcinoma. Nat Med 2013;19:209–16.
77. Chan TH, Lin CH, Qi L, et al. A disrupted RNA editing balance mediated by ADARs (Adenosine DeAminases that act on RNA) in human hepatocellular carcinoma. Gut 2013;63:832–43.
78. Bass B, Hundley H, Li JB, et al. The difficult calls in RNA editing. Interviewed by H Craig Mak. Nat Biotechnol 2012;30:1207–9.
79. Kleinman CL, Majewski J. Comment on “Widespread RNA and DNA sequence differences in the human transcriptome”. Science 2012;335:1302; author reply 1302.
80. Lin W, Piskol R, Tan MH, et al. Comment on “Widespread RNA and DNA sequence differences in the human transcriptome”. Science 2012;335:1302; author reply 1302.
81. Pickrell JK, Gilad Y, Pritchard JK. Comment on “Widespread RNA and DNA sequence differences in the human transcriptome”. Science 2012;335:1302; author reply 1302.
82. Bahn JH, Lee JH, Li G, et al. Accurate identification of A-to-I RNA editing in human by transcriptome sequencing. Genome Res 2012;22:142–50.
83. Peng Z, Cheng Y, Tan BC, et al. Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome. Nat Biotechnol 2012;30:253–60.
84. Ramaswami G, Lin W, Piskol R, et al. Accurate identification of human Alu and non-Alu RNA editing sites. Nat Methods 2012;9:579–81.
85. Park E, Williams B, Wold BJ, et al. RNA editing in the human ENCODE RNA-seq data. Genome Res 2012;22:1626–33.
86. Lee JH, Ang JK, Xiao X. Analysis and design of RNA sequencing experiments for identifying RNA editing and other single-nucleotide variants. RNA 2013;19:725–32.
87. Ramaswami G, Zhang R, Piskol R, et al. Identifying RNA editing sites using RNA sequencing data alone. Nat Methods 2013;10:128–32.
88. Danecek P, Nellaker C, McIntyre RE, et al. High levels of RNA-editing site conservation amongst 15 laboratory mouse strains. Genome Biol 2012;13:26.
89. Ramaswami G, Li JB. RADAR: a rigorously annotated database of A-to-I RNA editing. Nucleic Acids Res 2013;42:D109–13.
90. Bray NJ, Buckland PR, Owen MJ, et al. Cis-acting variation in the expression of a high proportion of genes in human brain. Hum Genet 2003;113:149–53.
91. Smith RM, Webb A, Papp AC, et al. Whole transcriptome RNA-Seq allelic expression in human brain. BMC Genomics 2013;14:571.
92. Heap GA, Yang JH, Downes K, et al. Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing. Hum Mol Genet 2010;19:122–34.
93. Degner JF, Marioni JC, Pai AA, et al. Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics 2009;25:3207–12.
94. Guo Y, Samuels DC, Li J, et al. Evaluation of allele frequency estimation using pooled sequencing data simulation. ScientificWorldJournal 2013;2013:895496.
95. Stevenson KR, Coolon JD, Wittkopp PJ. Sources of bias in measures of allele-specific expression derived from RNA-seq data aligned to a single reference genome. BMC Genomics 2013;14:536.
96. Guo Y, Ye F, Sheng Q, et al. Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinform 2013. doi: 10.1093/bib/bbt069.
97. DePristo MA, Banks E, Poplin R, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011;43:491–8.
98. Koboldt DC, Zhang Q, Larson DE, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 2012;22:568–76.
99. Cibulskis K, Lawrence MS, Carter SL, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 2013;31:213–19.
100. Katz Y, Wang ET, Airoldi EM, et al. Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat Methods 2010;7:1009–15.
101. Singh RK, Cooper TA. Pre-mRNA splicing in disease and therapeutics. Trends Mol Med 2012;18:472–82.
102. Ramskold D, Wang ET, Burge CB, et al. An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Comput Biol 2009;5:e1000598.
103. Quinn EM, Cormican P, Kenny EM, et al. Development of strategies for SNP detection in RNA-seq data: application to lymphoblastoid cell lines and evaluation using 1000 Genomes data. PLoS One 2013;8:e58815.
104. Miller AC, Obholzer ND, Shah AN, et al. RNA-seq-based mapping and candidate identification of mutations from forward genetic screens. Genome Res 2013;23:679–86.
105. Kim D, Pertea G, Trapnell C, et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 2013;14:R36.
106. Hong D, Rhie A, Park SS, et al. FX: an RNA-Seq analysis tool on the cloud. Bioinformatics 2012;28:721–3.
107. Hu J, Ge H, Newman M, et al. OSA: a fast and accurate alignment tool for RNA-Seq. Bioinformatics 2012;28:1933–4.
108. Torres-Garcia W, Zheng S, Sivachenko A, et al. PRADA: pipeline for RNA sequencing data analysis. Bioinformatics 2014;30:2224–6.
109. Yoon K, Lee S, Han TS, et al. Comprehensive genome- and transcriptome-wide analyses of mutations associated with microsatellite instability in Korean gastric cancers. Genome Res 2013;23:1109–17.
110. Xu X, Zhu K, Liu F, et al. Identification of somatic mutations in human prostate cancer by RNA-Seq. Gene 2013;519:343–7.
111. Zheng X, Pan C, Diao Y, et al. Development of microsatellite markers by transcriptome sequencing in two species of Amorphophallus (Araceae). BMC Genomics 2013;14:490.
112. Lu Y, Soong TD, Elemento O. A novel approach for characterizing microsatellite instability in cancer cells. PLoS One 2013;8:e63056.
113. Albers CA, Lunter G, MacArthur DG, et al. Dindel: accurate indel calls from short-read data. Genome Res 2011;21:961–73.
114. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 1999;27:573–80.
