Box 2 Six recommendations to facilitate collaboration

When embarking on an industrial collaboration, it is wise to seek counsel from seasoned faculty who already have experience interacting with the private sector, as well as from legal and technology transfer professionals who are familiar with common pitfalls. Below we list six key factors to bear in mind before commencing work with a company:

• Openly discuss intended benefits, requirements and risks for both partners.
• Consider which mode of collaboration optimally fits joint objectives.
• Negotiate professional contracts on IP, confidentiality and publication procedures.
• Retain full transparency within the academic research group about the terms and conditions of the collaboration, and instruct scientists and students on the importance of confidentiality and IP rules.
• Monitor progress in the project frequently, and communicate about alignment with joint and individual objectives.
• Build relationships grounded in mutual trust and respect; acknowledge and celebrate successes, and learn from mistakes.

it combines access to a broad, fundamental research program with the option to explore more competitive, confidential research. In other PPPs, a single research program, with uniform IP regulations for all projects, is funded from a homogeneous blend of public and private funding. This model, in which all research projects are accessible to all academic and industrial partners of the PPP, requires a great deal of trust and may be particularly suitable for fundamental research programs with a relatively large government contribution. In a third model, research in a PPP is organized in thematic subprograms, which are funded from blends of public and private funding. IP ownership in each subprogram is then based on the relative contributions of the participating partners. In large PPPs, this 'intermediate' model enables industries to focus their financial contribution and monitoring of the research on subprograms that are of special relevance to them.

Success of a PPP requires a strong sense of common purpose. Excellent communication, via regular 'live' meetings, and sufficient funding for each of the academic partners are two key prerequisites. If resources are spread too thinly (e.g., a single active researcher per participating group for a 5-year funding period), commitment is generally weaker than when at least 3–4 active researchers are funded in each participating group. A clear, joint vision of the future, embodied by aligned, committed project leadership, further strengthens collaboration. Government funding of PPPs generally has a finite lifespan, but with adequate industrial support, some PPPs use it to build a critical mass and performance level that allow them to become financially self-supporting through patenting and licensing incomes. Irrespective of how industrial participation is arranged, a key challenge for PPPs is to maximize synergy by bringing together competing industries around their research programs. Not all industries, however, are inclined toward this mode of collaboration with academia.

In many industrialized countries, industrial funding of academic research is a fact of life. Interactions between industry and academia offer many advantages and increase possibilities for advanced training and education, but they also come with downsides. Box 2 summarizes six recommendations for successful initiation and execution of collaborations. However, the most important requirement for mitigating risks and reaping mutual benefits is a willingness of industrial and academic collaborators to understand and respect each other's core objectives and to actively seek options to optimally align these in joint research activities.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Jack T Pronk1, Sang Yup Lee2,3, Jeff Lievense4, John Pierce5, Bernhard Palsson3,6, Mathias Uhlen3,7 & Jens Nielsen3,8

1Department of Biotechnology, Delft University of Technology, Delft, The Netherlands. 2Department of Chemical and Biomolecular Engineering, KAIST, Daejeon, Korea. 3Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Hørsholm, Denmark. 4Genomatica, San Diego, California, USA. 5BP, Cambridge, Massachusetts, USA. 6Department of Bioengineering, University of California, San Diego, La Jolla, California, USA. 7Science for Life Laboratory, Royal Institute of Technology, Stockholm, Sweden. 8Department of Chemical and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden.
e-mail: [email protected]

Quality score compression improves genotyping accuracy

To the Editor: Most next-generation sequencing (NGS) quality scores are space intensive, redundant and often misleading. In this Correspondence, we recover quality information directly from sequence data using a compression tool named Quartz, rendering such scores redundant and yielding substantially better space and time efficiencies for storage and analysis. Quartz is designed to operate on NGS reads in FASTQ format, but it can be trivially modified to discard quality scores in other formats in which scores are paired with sequence information. Discarding 95% of quality scores resulted, counterintuitively, in improved SNP calling, implying that compression need not come at the expense of accuracy.

Advances in NGS technologies have produced a deluge of genomic information, outpacing increases in our computational resources1,2. This avalanche of data enables large-scale population studies (such as maps of human genetic variation3, reconstruction of human population history4 and uncovering of cell lineage relationships5), but to fully capitalize on these advances, we must develop better technologies to store, transmit and process genomic data. The bulk of NGS data typically consists of read sequences, in which each base call is associated with a corresponding quality score that consumes at least as much storage space as the base call itself6. Quality scores are often essential for assessing sequence quality (their main use), filtering low-quality reads, assembling genomic sequences, mapping reads to a reference sequence and performing accurate genotyping. Because quality scores require considerable space to store and transmit, they are a major bottleneck in any sequence analysis pipeline, impacting genomic medicine, environmental genomics and the ability to find signatures of selection within large sets of closely related sequenced individuals.
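To make the storage overhead concrete: in FASTQ, each base call is paired with one ASCII character encoding a Phred score, so the quality string is exactly as long as the read itself. The following minimal Python illustration (a toy record of our own; standard Phred+33 encoding) shows the decoding:

```python
# Toy FASTQ record: the quality string has one byte per base call,
# so it occupies at least as much space as the sequence itself.
record = ["@read1",
          "GATTACA",   # base calls
          "+",
          "II5I#II"]   # ASCII-encoded quality scores (Phred+33)

seq, qual = record[1], record[3]
for base, ch in zip(seq, qual):
    q = ord(ch) - 33              # decode Phred+33
    p_err = 10 ** (-q / 10)       # Phred score q implies P(error) = 10^(-q/10)
    print(f"{base}: Q{q}, P(error) ~ {p_err:.4g}")
```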

[Figure 1 | Compressive quality scores using the Quartz algorithm. (a) A dictionary of common k-mers in a corpus of NGS reads is generated; the dictionary is generated once for any species. (b) For each read with sequence R and quality string Q, the read sequence R is broken up into overlapping supporting k-mers, where Ri,j denotes positions j through j+(k−1) in the ith read sequence Ri, with corresponding quality scores Qi. (c) Dictionary k-mers that are within one mismatch of the supporting k-mers are identified. Every position that differs from a dictionary k-mer is annotated as a possible variant unless it is covered by a dictionary k-mer corresponding to a different supporting k-mer, in which case it is nevertheless marked for correction; other covered positions are also marked for correction as high quality. When two dictionary k-mers correspond to the same supporting k-mer, all mismatches are preserved, unless the mismatch position is covered by a dictionary k-mer corresponding to a different supporting k-mer. Uncovered bases are also annotated. (d) Quality scores are smoothed: the scores of all high-quality positions (i.e., bases) are set to a default value, while the scores of uncovered and possible-variant loci are kept. (e) Quartz fits into existing genotyping analysis pipelines as an additional processing step between acquisition of raw reads and mapping and genotyping.]
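As a rough illustration of the dictionary step in Figure 1a (this is not Quartz's actual construction procedure; k and the count threshold here are arbitrary toy values), dictionary construction can be sketched as simple k-mer counting:

```python
# Sketch: collect k-mers that occur at least min_count times in a read corpus.
# In practice this is done once per species over a large corpus; Quartz's real
# dictionary construction and parameters are described in its Supplementary Methods.
from collections import Counter

def common_kmers(reads, k=8, min_count=2):
    counts = Counter()
    for read in reads:
        for j in range(len(read) - k + 1):
            counts[read[j:j + k]] += 1
    return {kmer for kmer, n in counts.items() if n >= min_count}

reads = ["ACGTACGTA", "ACGTACGTC", "TTTTACGTACGG"]
print(sorted(common_kmers(reads)))   # -> ['ACGTACGT']
```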

At the expense of downstream analyses (including variant calling, genotype phasing, disease gene identification, read mapping and genome assembly), biomedical researchers have typically either discarded quality scores altogether or turned to compression, which has been moderately successful when applied to genomic sequence data7–12. Quality score compression is usually lossy, meaning that maximum compression is achieved at the expense of the ability to reconstruct the original quality values13,14. Because of the resulting decline in downstream accuracy, such methods are suboptimal for both transmission and indefinite storage of quality scores. To address these limitations, several newly developed methods either exploit sequence data to boost quality score compression using alignments to a reference genome8,10,15 or use raw read datasets without reference alignment16; however, reference-based compression requires runtime-costly whole-genome alignments of the NGS dataset, whereas alignment-free compression applies costly indexing methods directly to the read dataset. On the other hand, quality score recalibration methods, such as that found in GATK17, increase variant-calling accuracy at the cost of significantly decreasing the compressibility of the quality scores (Supplementary Table 1). To our knowledge, no existing approach simultaneously provides a scalable method for terabyte-sized NGS datasets and addresses the degradation of downstream genotyping accuracy that results from lossy compression10.

To achieve scalable analyses, we take advantage of the redundancy inherent in NGS read data. Intuitively, the more often we see a read sequence in a dataset, the more confidence we have in its correctness; thus, the more frequent a read, the less informative and useful its quality scores are. However, for longer read sequences (e.g., >100 bp), the probability of a read appearing multiple times is extremely low. For such long reads, shorter substrings (k-mers) can instead be used as a proxy to estimate sequence redundancy. By viewing individual read datasets through the lens of k-mer frequencies in a corpus of reads, we can ensure that the 'lossiness' of compression does not deleteriously affect accuracy.

Here we present a highly efficient and scalable compression tool, Quartz (quality score reduction at terabyte scale), which compresses quality scores by capitalizing on sequence redundancy. Compression is achieved by smoothing a large fraction of quality score values based on the k-mer neighborhood of their corresponding positions in the read sequences (Fig. 1, left). We hypothesized that any divergent base in a k-mer is likely to correspond to either a single-nucleotide polymorphism (SNP) or a sequencing error; thus, we preserve only the quality scores of probable variant locations and compress the quality scores of concordant bases by resetting them to a default value. More precisely, frequent k-mers in a large corpus of NGS reads correspond to a theoretical consensus genome with overwhelming probability18; without having to do any explicit mapping, Quartz preserves quality scores at locations that potentially differ from this consensus genome. k-mer frequencies have previously been used to infer the error content of a read sequence (indeed, many sequence-correction and assembly methods directly or indirectly make use of this phenomenon19,20), but never for quality score compression. Unlike other quality score compression methods, Quartz simultaneously maintains genotyping accuracy while achieving high compression ratios, and it does so in orders of magnitude less time. Compression is made possible by Quartz's 'coarse' representation of quality scores, which allows it to store quality scores at roughly 0.4 bits per value (down from the original size of 8 bits in FASTQ format, or 1.4 bits even after standard lossless text compression; Supplementary Tables 2 and 3).
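The smoothing rule can be made concrete with a toy Python sketch. This is not the authors' implementation: the k-mer length, the linear dictionary scan and the default score are illustrative placeholders, and the real tool uses efficient data structures rather than scanning the dictionary per window.

```python
K = 8                     # toy k-mer length (illustrative only)
DEFAULT_Q = chr(50 + 33)  # default replacement score (Phred 50, +33 encoding)

def one_mismatch_positions(kmer, kmer_dict):
    """Yield the position at which kmer differs from a dictionary k-mer
    by exactly one base (linear scan; a sketch, not Quartz's index)."""
    for d in kmer_dict:
        diffs = [i for i, (a, b) in enumerate(zip(kmer, d)) if a != b]
        if len(diffs) == 1:
            yield diffs[0]

def smooth_quality(seq, qual, kmer_dict, k=K, default_q=DEFAULT_Q):
    """Reset scores of concordant, dictionary-covered positions to a default;
    keep original scores at possible variants and at uncovered positions."""
    keep = [True] * len(seq)              # True = retain the original score
    for j in range(len(seq) - k + 1):
        window = seq[j:j + k]
        if window in kmer_dict:           # exact hit: whole window concordant
            for p in range(j, j + k):
                keep[p] = False
        else:
            for m in one_mismatch_positions(window, kmer_dict):
                for p in range(j, j + k):
                    if p != j + m:        # the mismatch position may be a
                        keep[p] = False   # variant and keeps its score unless
                                          # another window marks it concordant
    return "".join(q if keep[p] else default_q for p, q in enumerate(qual))

kmer_dict = {"ACGTACGT", "CGTACGTA"}
# The original score survives only at the divergent base (position 7):
print(smooth_quality("ACGTACGAA", "IIIIIIIII", kmer_dict))
```

As a back-of-the-envelope check of the rates above (our arithmetic, not the paper's): a 30× human whole-genome run yields roughly 9 × 10^10 base calls, so quality strings occupy about 90 GB at 8 bits per value, about 16 GB at 1.4 bits, and about 4.5 GB at 0.4 bits.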

It is important to note that the compression ratio achieved by Quartz depends primarily on the conditional entropy of observing the read sequence given the local consensus k-mer landscape. This contrasts both with lossless compressors, which can reproduce the original quality scores with perfect fidelity but are bound by the entropy of the quality scores themselves, and with lossy methods that directly reduce quality score entropy through smoothing procedures6,13,21.

Surprisingly, by taking advantage of the local consensus k-mer landscape, Quartz, while eliminating >95% of the quality score information, achieves improved genotyping accuracy compared with use of the original, uncompressed quality scores, as measured against a trio-validated (i.e., offspring validated against parents' genomes), gold-standard variant dataset for the NA12878 genome from the GATK 'best practices' bundle17 (Fig. 2, Supplementary Methods). We applied both the GATK17 and SAMtools22 pipelines (Fig. 1, right) to the compressed quality scores generated by Quartz on a commonly used NA12878 benchmarking dataset from the 1000 Genomes Project3 (Supplementary Figs. 1–4, Supplementary Table 4). Genotyping accuracy based on Quartz's compressed data consistently outperforms that based on the uncompressed raw quality scores, as measured by the area under the receiver-operating characteristic curve (ROC AUC; Supplementary Tables 5 and 6); for the experiments in Figure 2, Quartz compression decreased the number of false positives among the million highest-quality variant calls by >4.5% in several of the pipelines (Supplementary Figs. 1 and 2). Although this improvement is most pronounced for SNP calls, indel-calling accuracy is also maintained, if not improved, by Quartz compression (Supplementary Figs. 5 and 6).

[Figure 2 | Scaled ROC curves of genotyping accuracy. (a,b) Curves for NA12878 before and after Quartz compression using (a) Bowtie 2 with GATK UnifiedGenotyper and (b) BWA with SAMtools mpileup; the reported AUCs improve from 0.7543 to 0.7645 and from 0.7291 to 0.7432 with Quartz. Accuracy is improved under both variant-calling pipelines.]

This result emerges from the discovery, through the application of Quartz, that quality score values within an NGS dataset are implicitly encoded in the genomic sequence information with 95% redundancy, so they often do not need to be stored separately. It further indicates that the compression achieved by Quartz reduces noise in the raw quality scores, thus leading to better genotyping results. Notably, removing all quality scores (by setting them all to a default value of 50) caused a substantial drop in genotyping accuracy (~5% decrease in relative ROC AUC; Supplementary Fig. 7), indicating that retaining some quality scores is necessary.

Quartz also scales to large whole-genome datasets. After a one-time construction of the k-mer dictionary for a given species, quality score compression is orders of magnitude faster than read mapping, genotyping and other quality score compression methods (Supplementary Table 1 and Supplementary Figs. 8 and 9). Additionally, Quartz is especially well suited to large cohort-based sequencing projects, because its improvements in genotyping accuracy are particularly useful when samples have lower depths of sequencing coverage (e.g., 2× to 4×; Supplementary Fig. 10).
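Accuracy comparisons of the kind shown in Figure 2 can be reproduced with standard tooling. Below is a minimal scikit-learn sketch with invented toy data; in a real evaluation, the labels would come from intersecting each pipeline's calls with the trio-validated truth set, and the scores would be the variant-call qualities parsed from VCFs.

```python
from sklearn.metrics import roc_auc_score

def auc_from_calls(calls):
    """calls: (variant quality, is_true_variant) pairs for one pipeline."""
    labels = [int(t) for _, t in calls]
    scores = [q for q, _ in calls]
    return roc_auc_score(labels, scores)

# Hypothetical call sets, for illustration only.
original = [(50.0, True), (43.1, False), (38.7, True), (12.2, False)]
quartz = [(50.0, True), (47.9, True), (31.0, False), (12.2, False)]
print(f"original AUC = {auc_from_calls(original):.4f}")  # 0.7500
print(f"Quartz AUC   = {auc_from_calls(quartz):.4f}")    # 1.0000
```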

Quartz alters quality scores and improves downstream genotyping accuracy, in common with some existing base quality score recalibration tools17 that make use of human genome variation from population-scale sequencing3 (e.g., the GATK BaseRecalibrator). Although currently available recalibration tools and Quartz both use distilled information from genome sequences, they differ in several important ways. Most importantly, quality score recalibration tools do not apply compression; in fact, GATK recalibration tends to greatly increase the amount of storage needed, while also requiring much more computing power (Supplementary Table 1). Quartz avoids losing compressibility by using a single default replacement quality value (chosen as described in Supplementary Methods and Supplementary Fig. 11). Furthermore, because recalibration tools such as the GATK BaseRecalibrator employ a list of known SNP locations, reads must first be mapped to the reference. Because Quartz uses only k-mer frequencies and Hamming distances, it can be applied upstream of mapping, which is crucial for both genome assembly and mapping.

Quartz is the first scalable, sequence-based quality score compression method that can efficiently compress the quality scores of terabyte-sized (or larger) sequencing datasets, thereby solving both the problem of indefinite storage and that of transmission of quality scores. Even if our results had merely replicated the genotyping accuracy of existing tools such as GATK and SAMtools, Quartz would still improve storage efficiency by an order of magnitude at almost no additional computational cost (Supplementary Results). Beyond these enhancements to storage through compression, our results also suggest that a significant proportion of quality score data, despite having been thought essential to downstream analysis, is less informative than the k-mer sequence profiles and can be discarded without weakening downstream analysis; indeed, in some cases, Quartz can improve downstream analysis. Even with aggressive lossy data compression, we have shown that it is possible to preserve biologically important data.

A Quartz compression step can be added to almost any pre-existing NGS data processing pipeline. Quartz takes as input a FASTQ file (the standard format for read data) and outputs a smoothed FASTQ file, which can in turn be passed to any compression program (e.g., BZIP2 or GZIP) for efficient storage and transmission, or to any read mapper (e.g., BWA23 or Bowtie 2; ref. 24). Further analysis steps, such as variant calling (e.g., using SAMtools or GATK), can then be carried out in the usual way. Our optimized and parallelized implementation of Quartz is available, along with a high-quality human genome k-mer dictionary (Supplementary Software). We also provide preliminary results on how Quartz changes compression levels and variant-calling accuracy on an Escherichia coli genome, indicating Quartz's utility beyond human genomics (Supplementary Fig. 12).

With improvements in sequencing technologies increasing the pace at which genomic data are generated, quality scores will require ever greater amounts of storage space; compressive quality scores will become crucial to fully realizing the potential of large-scale genomics. Our data, in contrast to previous results21, indicate that the twin goals of compression and accuracy need not be at odds with each other. Although total compression comes at the cost of accuracy (Supplementary Fig. 6), and quality score recalibration generally decreases compressibility (Supplementary Table 1), there is a happy medium at which we can achieve both good compression and improved accuracy. The Quartz software (available at http://quartz.csail.mit.edu and in Supplementary Software) will greatly benefit any researchers who are generating, storing, mapping or analyzing large amounts of DNA, RNA, ChIP-seq or exome sequencing data.

Note: Any Supplementary Information and Source Data files are available in the online version of the paper.

ACKNOWLEDGMENTS
We thank L. Cowen and N. Daniels for helpful discussions and comments. Y.W.Y. gratefully acknowledges support from the Fannie and John Hertz Foundation. D.Y., J.P. and B.B. are partially supported by US National Institutes of Health grant GM108348.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Y William Yu1,2, Deniz Yorukoglu1, Jian Peng1,2 & Bonnie Berger1,2

1Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA. 2Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.
e-mail: [email protected]

1. Berger, B., Peng, J. & Singh, M. Nat. Rev. Genet. 14, 333–346 (2013).
2. Kahn, S.D. Science 331, 728–729 (2011).
3. The 1000 Genomes Project Consortium. Nature 491, 56–65 (2012).
4. Veeramah, K.R. & Hammer, M.F. Nat. Rev. Genet. 15, 149–162 (2014).
5. Shapiro, E., Biezuner, T. & Linnarsson, S. Nat. Rev. Genet. 14, 618–630 (2013).
6. Bonfield, J.K. & Mahoney, M.V. PLoS ONE 8, e59190 (2013).
7. Apostolico, A. & Lonardi, S. in Proceedings of the IEEE Data Compression Conference 2000 (DCC'00) 143–152 (IEEE Computer Society, 2000).
8. Kozanitis, C., Saunders, C., Kruglyak, S., Bafna, V. & Varghese, G. J. Comput. Biol. 18, 401–413 (2011).
9. Jones, D.C., Ruzzo, W.L., Peng, X. & Katze, M.G. Nucleic Acids Res. 40, e171 (2012).
10. Fritz, M.H.Y., Leinonen, R., Cochrane, G. & Birney, E. Genome Res. 21, 734–740 (2011).
11. Deorowicz, S. & Grabowski, S. Bioinformatics 27, 860–862 (2011).
12. Loh, P.R., Baym, M. & Berger, B. Nat. Biotechnol. 30, 627–630 (2012).
13. Ochoa, I. et al. BMC Bioinformatics 14, 187 (2013).
14. Hach, F., Numanagic, I., Alkan, C. & Sahinalp, S.C. Bioinformatics 28, 3051–3057 (2012).
15. Christley, S., Lu, Y., Li, C. & Xie, X. Bioinformatics 25, 274–275 (2009).
16. Janin, L., Rosone, G. & Cox, A.J. Bioinformatics 30, 24–30 (2014).
17. DePristo, M.A. et al. Nat. Genet. 43, 491–498 (2011).
18. Yu, Y.W., Yorukoglu, D. & Berger, B. in Research in Computational Molecular Biology: 18th Annual International Conference, RECOMB 2014—Proceedings (ed. Sharan, R.) 385–399 (Springer, 2014).
19. Kelley, D.R., Schatz, M.C. & Salzberg, S.L. Genome Biol. 11, R116 (2010).
20. Grabherr, M.G. et al. Nat. Biotechnol. 29, 644–652 (2011).
21. Cánovas, R., Moffat, A. & Turpin, A. Bioinformatics 30, 2130–2136 (2014).
22. Li, H. et al. Bioinformatics 25, 2078–2079 (2009).
23. Li, H. & Durbin, R. Bioinformatics 26, 589–595 (2010).
24. Langmead, B. & Salzberg, S.L. Nat. Methods 9, 357–359 (2012).

Ballgown bridges the gap between transcriptome assembly and expression analysis

To the Editor: Analysis of raw reads from RNA sequencing (RNA-seq) makes it possible to reconstruct complete gene structures, including multiple splice variants, without relying on previously established annotations1–3. Downstream statistical modeling of the summarized gene or transcript expression data output by these pipelines is facilitated by the Bioconductor project, which provides open-source tools for the analysis of high-throughput genomics data4. However, the outputs of upstream processing tools are often aggregated across samples or are not in a format readily compatible with downstream Bioconductor packages. This gap has slowed rigorous statistical analysis at the transcript level of expression quantitative trait locus (eQTL), time-course, continuous-covariate and confounded experimental designs, and it has led to considerable controversy in the analysis of population-level RNA-seq data5. In this Correspondence, we report the development of two pieces of software, Tablemaker and Ballgown, that bridge the gap between transcriptome assembly and fast, flexible differential expression analysis (Supplementary Fig. 1).

Tablemaker uses a GTF file (the standard output from any transcriptome assembler) and spliced read alignments to produce files that explicitly specify the structure of assembled transcripts, mappings from exons and splice junctions to transcripts, and several measures of feature expression, including fragments per kilobase of transcript per million reads sequenced (FPKM) and average per-base coverage (Supplementary Note 1). Tablemaker wraps Cufflinks to estimate the FPKM of each assembled transcript. After the transcriptome assembly is processed using Tablemaker, the output files (Supplementary Note 1) can be explored interactively in R using the Ballgown package. Ballgown converts Tablemaker's assembly structure and expression estimates into an easy-to-access R object (Supplementary Fig. 2) for downstream analyses. Alternatively, the Tablemaker step can be skipped: the R object can be created from an assembly produced with StringTie6, a new, efficient assembler, or from a transcriptome whose expression estimates have been calculated with RSEM's 'rsem-calculate-expression'7. Ballgown can be used to visualize the transcript assembly on a gene-by-gene basis, extract abundance estimates for exons, introns, transcripts or genes, and perform linear model–based differential expression analyses (Supplementary Note 2).

The basic linear modeling strategy for differential expression testing implemented in Ballgown allows analysis of eQTL, time-course, continuous-covariate or confounded experimental designs at the exon, gene or transcript level. This approach is similar to the linear modeling strategy implemented in limma8, without empirical Bayes shrinkage, and can be applied to the exon or gene counts available through the Ballgown object after appropriate transformation of the count data9. Alternatively, users may choose to apply the widely used Bioconductor packages for sequence count data10,11. There is no other existing statistical software that allows this level of flexibility for modeling transcript
