COMMUNICATION TO THE EDITOR

An Evaluation of Public Genomic References for Mapping RNA-Seq Data From Chinese Hamster Ovary Cells

Huong Le, Chun Chen, Chetan T. Goudar

Drug Substance Technologies, Process Development, Amgen, Inc., 1 Amgen Center Drive, Thousand Oaks, CA 91320; telephone: +1-805-447-5321; e-mail: [email protected]

ABSTRACT: While RNA-Seq is increasingly used as the method of choice for transcriptome analysis of mammalian cell culture processes, no universal genomic reference for mapping RNA-Seq reads from CHO cells has been reported. In previous publications, de novo transcriptomes assembled from the RNA-Seq reads themselves were subsequently used for mapping. Potential caveats with this approach include the incomplete coverage and the non-universal nature of the de novo assemblies, which make it difficult to compare results across studies. To facilitate future RNA-Seq studies in CHO cells, we performed a comprehensive evaluation of four public genomic references for CHO cells hosted by the NCBI Reference Sequence Database (RefSeq): two annotated genomes released in 2012 and 2014 and their accompanying transcriptomes. Each genome showed significantly higher mapped rates than its accompanying transcriptome. Furthermore, higher mapped rates in deep intra-genic regions, especially within exons, were observed for the more recent genome release (2014) than for the older one (2012), indicating that the 2014 genome was the preeminent reference among the four. Sequential addition of human and mouse genomes increased the total mapped rate to 87.3 and 89.7%, respectively, from 73.5% using the 2014 Chinese hamster genome alone. Thus, the sequential combination of the 2014 RefSeq Chinese hamster genome, the Ensembl human genome (h38), and the Ensembl mouse genome (m38) was suggested as the most effective strategy for mapping RNA-Seq data from CHO cells.

Biotechnol. Bioeng. 2015;112: 2412–2416. © 2015 Wiley Periodicals, Inc.

KEYWORDS: RNA-Seq; mapping algorithms; genome; transcriptome; CHO cells; cell culture

Conflict of interest: None.
Correspondence to: C.T. Goudar
Received 26 January 2015; Revision received 15 April 2015; Accepted 12 May 2015
Accepted manuscript online 22 May 2015; Article first published online 30 June 2015 in Wiley Online Library (http://onlinelibrary.wiley.com/doi/10.1002/bit.25649/abstract). DOI 10.1002/bit.25649
Biotechnology and Bioengineering, Vol. 112, No. 11, November, 2015

Introduction

Deep sequencing of RNA, or RNA-Seq, has transformed the quantification of gene expression dynamics. Tens of thousands of transcripts can now be measured digitally in a single sequencing run with the possibility of generating more than a terabyte of data by the Illumina HiSeq platform (Illumina, 2014). Compared to microarrays, RNA-Seq offers unparalleled advantages to discover novel transcripts and examine gene expression over much wider dynamic ranges (Marioni et al., 2008; Mortazavi et al., 2008; Wang et al., 2009). With declining sequencing costs, RNA-Seq is being used increasingly in an industrial setting for bioprocess characterization (Becker et al., 2011; Birzele et al., 2010; Jacob et al., 2010; Johnson et al., 2013; Vishwanathan et al., 2015), which can ultimately enable mechanism-based rather than empirical decisions around process optimization.

However, robust analysis of RNA-Seq data has been hindered by the lack of a high-quality genome or transcriptome reference for the Chinese hamster species (Cricetulus griseus) and Chinese hamster ovary (CHO) cells, the predominant host for production of recombinant proteins. Consequently, all aforementioned bioprocessing-based RNA-Seq studies have resorted to de novo assembly of their own transcriptome references. Unfortunately, most of these reference transcriptomes are not publicly available and their inherent differences make comparison of results across studies challenging.

The availability in 2011 of the first public genome of a parental CHO-K1 cell line (Xu et al., 2011) provided the first universal mapping reference, which was subsequently supplemented with genomic sequences from the Chinese hamster species (Lewis et al., 2013). Each of these two genome references was further curated and accompanied by a transcriptome reference released by the NCBI Reference Sequence Database (RefSeq) in 2012 and 2014 (RefSeq, 2014). As shown in Table I, the 2014 RefSeq genome added roughly 2,000 genes (27,545 genes compared to 25,651 in the 2012 release). A more substantial increase was seen in the 2014 transcriptome, which had 67,435 transcripts compared to 22,036 in the 2012 release.
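The counts cited from Table I can be cross-checked with simple arithmetic. The following Python sketch is not part of the original analysis; the dictionary layout and the function name `summarize` are illustrative assumptions. It recomputes the increase from the 2012 to the 2014 release and the share of entries that carry a specific gene symbol, using only the numbers listed in Table I.

```python
# Counts transcribed from Table I (2012 and 2014 RefSeq releases).
# The data layout and the function name are illustrative assumptions.
TABLE_I = {
    ("genome", 2012): {"total": 25_651, "with_symbol": 5_597},
    ("genome", 2014): {"total": 27_545, "with_symbol": 15_745},
    ("transcriptome", 2012): {"total": 22_036, "with_symbol": 6_224},
    ("transcriptome", 2014): {"total": 67_435, "with_symbol": 48_531},
}


def summarize(table):
    """Return, per reference type, the 2012-to-2014 increase in entries
    and the rounded percentage of entries with a specific gene symbol."""
    summary = {}
    for kind in ("genome", "transcriptome"):
        old, new = table[(kind, 2012)], table[(kind, 2014)]
        summary[kind] = {
            "increase": new["total"] - old["total"],
            "pct_symbol_2012": round(100 * old["with_symbol"] / old["total"]),
            "pct_symbol_2014": round(100 * new["with_symbol"] / new["total"]),
        }
    return summary


if __name__ == "__main__":
    for kind, stats in summarize(TABLE_I).items():
        print(kind, stats)
```

Under these inputs the genome gains 1,894 genes (the "2,000" quoted in the text is a rounded figure) and the transcriptome gains 45,399 transcripts; the rounded symbol-annotated percentages reproduce the 22%, 57%, 28%, and 72% given in Table I.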
Another substantial difference between the 2012 and 2014 releases was the level of annotation. For the 2012 release, only 22% of the genes and 28% of the transcripts were annotated with a specific gene symbol (instead of a generic LOC number), whereas these proportions increased to 57% and 72%, respectively, in the 2014 release. This annotation improvement enables more robust biological interpretation of gene expression data through pathway analysis approaches, which require gene symbols. Because reference quality is a key determinant of RNA-Seq data analysis, we performed a comprehensive comparison of these four publicly available references for mapping RNA-Seq data from CHO cells. First, each genome was compared to its accompanying transcriptome to determine whether the genome provided any advantage over the corresponding transcriptome. This comparison was followed by a more detailed comparison across releases to determine the best CHO reference. Finally, we explored the possibility of further improvement in the total mapped rate by augmenting this reference with mouse and human genomes.

Table I. Statistics of four genomic references for mapping RNA-Seq data from CHO cells (Source: ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/).

References                                                 Genome                         Transcriptome
                                                           2012 RefSeq    2014 RefSeq     2012 RefSeq    2014 RefSeq
No. of genes/transcripts                                   25,651         27,545          22,036         67,435
No. of genes/transcripts with a specific gene symbol (a)   5,597 (22%)    15,745 (57%)    6,224 (28%)    48,531 (72%)

(a) Not a generic LOC number.

Sixty RNA-Seq samples from a diverse set of CHO cell lines with different diameters, expressing different antibodies, and subjected to different experimental conditions (Fomina-Yadlin et al., 2014) were used for this analysis. These samples were sequenced using the Illumina HiSeq platform with an average output per sample of 39.7 million 50 bp single-end reads (65.1% duplicated). When mapped to the four genomic references listed in Table I, the median percentage of reads mapped to any location in the reference across all 60 samples was 75.0% and 53.2% using the 2012 genome and transcriptome reference, respectively (Figure 1a). When the 2014 genome and transcriptome were used, this total mapped rate was similar for the genome (73.5%) but increased for the transcriptome (63.1%). Two components of this total mapped rate, the uniquely mapped rate (the percentage of reads mapped to unique locations in the reference) and the non-uniquely mapped rate (the percentage of reads mapped to multiple locations in the reference), are shown in Figure 1b and 1c, respectively. The median uniquely mapped rate was 68.7% for the genome and 46.0% for the transcriptome in the 2012 release (P
