Briefings in Bioinformatics Advance Access published December 3, 2013


doi:10.1093/bib/bbt087

High-throughput DNA sequence data compression
Zexuan Zhu, Yongpeng Zhang, Zhen Ji, Shan He and Xiao Yang
Submitted: 17th September 2013; Received (in revised form): 4th November 2013

Abstract

The exponential growth of high-throughput DNA sequence data has posed great challenges to genomic data storage, retrieval and transmission. Compression is a critical tool to address these challenges, and many methods have been developed to reduce the storage size of genomes and sequencing data (reads, quality scores and metadata). However, genomic data are being generated faster than they can be meaningfully analyzed, leaving large scope for novel compression algorithms that could directly facilitate data analysis beyond data transfer and storage. In this article, we categorize and provide a comprehensive review of the existing compression methods specialized for genomic data and present experimental results on compression ratio, memory usage, and time for compression and decompression. We further present the remaining challenges and potential directions for future research.

Keywords: next-generation sequencing; compression; reference-based compression; reference-free compression

Corresponding author. Xiao Yang, The Broad Institute, 7 Cambridge Center, Cambridge, MA 02142, USA. Tel: +1 617 714 7919; Fax: +1 617 714 8932; E-mail: [email protected]

Zexuan Zhu is an Associate Professor at the College of Computer Science and Software Engineering, Shenzhen University. His research interests lie in bioinformatics and computational intelligence.
Yongpeng Zhang is an MSc student at Shenzhen University. His research interests include data compression techniques and their applications to DNA sequencing data.
Zhen Ji is a Professor at the College of Computer Science and Software Engineering, Shenzhen University. His research interests include bioinformatics and signal processing.
Shan He is a Lecturer at the School of Computer Science, the University of Birmingham. His research interests include optimisation, data mining, complex network analysis and their applications to biomedical problems.
Xiao Yang is a computational biologist at the Broad Institute. His current research is focused on high-throughput sequencing data analysis with applications in genomics and metagenomics.

© The Author 2013. Published by Oxford University Press. For Permissions, please email: [email protected]

INTRODUCTION

DNA sequencing has been one of the major driving forces in biomedical research. Sanger sequencing [1], the dominant technology for over three decades, has been gradually replaced by cheaper and higher-throughput next-generation sequencing (NGS) technologies in recent years. Examples of such sequencing technologies include pyrosequencing, sequencing by synthesis and single-molecule sequencing [2–5]. As a result, the growth rate of genomic data in the past few years has outpaced the rate predicted by Moore's law, and data are generated faster than they can be meaningfully analyzed. For example, the NGS data obtained by the 1000 Genomes Project in its first 6 months exceeded the sequence data accumulated in the NCBI GenBank database over the previous 21 years [6].

General-purpose compression algorithms such as gzip (http://www.gzip.org/) and bzip2 (http://www.bzip.org), frequently used for efficient text storage and transfer, are directly applicable to genomic sequences. Specialized compression algorithms [7–15] were developed only when longer genomic sequences were obtained and the storage and transfer of these data became routine. Compared with general-purpose text compression methods, these specialized algorithms better utilize the repetitive subsequences inherent in genomic sequences to achieve higher compression ratios. In addition, evolutionary associations among genomic sequences have been exploited, such that a known genomic sequence can serve as a template for efficient compression of another homologous sequence. Before NGS techniques were introduced, compression algorithms were intended for relatively short sequences with an upper bound of tens of megabases, and compression of raw sequencing data was typically not required.

NGS techniques have introduced major changes in data analysis needs. More scalable compression algorithms [16–36] have been introduced to handle much longer genomic sequences as well as large raw sequencing data. The latter may consist of millions of reads (short sequences) and quality scores indicating the probabilities of the corresponding bases being correctly determined. The distinct properties of genomic sequences and raw sequencing data, such as the differences in number, length and alphabet size, lead to distinct algorithmic strategies for data compression. Nonetheless, these methods can be generally categorized as reference-based or reference-free, depending on whether reference genomic sequences are used to facilitate compression.

In this article, we discuss the compression of 'genomic sequences'—DNA fragments that correspond to partially or fully assembled chromosomes or genomes—and 'raw sequencing data'—reads, quality scores and metadata produced by sequencers. For completeness, we include a brief discussion of traditional genomic sequence compression methods that were developed before NGS techniques were introduced, then focus on more recent methods developed for NGS data and present their performance on biological datasets. Finally, we discuss remaining challenges and future perspectives.

TRADITIONAL GENOMIC SEQUENCE COMPRESSION

Although general-purpose compression methods such as gzip and bzip2 are widely applied to genomic sequences, they do not take into account biological characteristics such as the small alphabet size and the abundance of repeats and palindromes. Compression algorithms tailored to such properties are more likely to achieve better performance. In this section we outline methods intended for compressing genomic sequences of relatively small size (e.g. tens of megabases), developed when Sanger sequencing was the dominant method. In general, these methods can be grouped into substitutional (dictionary-based) and statistical methods [7].

In substitutional compression, highly repetitive subsequences are identified and stored in a codebook or dictionary. Compression is achieved by replacing each repetitive subsequence in the target genomic sequence by the corresponding encoded subsequence in the codebook, the representation of which has a much smaller size. The performances of different algorithms differ mainly in how many repetitive subsequences can be identified and how efficiently they can be encoded. Both BioCompress [8] and BioCompress-2 [9] use a Lempel-Ziv (LZ) style substitutional algorithm [37] to compress exact repeats and palindromes, where the latter also utilizes an order-2 context-based arithmetic encoding scheme [38] to store the non-repetitive regions. Alternatively, GenCompress [10] and DNACompress [11] identify approximate repeats and palindromes so that a larger fraction of the target sequence can be captured, and therefore a higher compression ratio is achieved. Along the same line, GeNML [12] divides a sequence into blocks of fixed size. Blocks rich in approximate repeats are then encoded with an efficient normalized maximum likelihood (NML) model; otherwise, plain or order-1 context-based arithmetic encoding [38] is used.

A typical statistical compression method consists of two steps. In the first step, a prediction model is established, where a conditional probability is assigned to a base given the bases preceding it; such a prediction model can be, for instance, a second-order Markov chain. Then, an encoding scheme such as arithmetic coding is used to carry out efficient compression using the probability distribution of each base provided by the prediction model. The accuracy of the prediction model is the dominant factor that determines the overall performance of the compression. XM [13] is recognized as one of the most efficient statistical compression methods [7, 22]. XM first uses the weighted summation of predictions by a set of expert models to calculate the probability distribution of bases. The expert models consist of a second-order Markov model trained on the full dataset, a first-order context model trained on partial data and a repeat prediction model for a base to be in a repetitive region or to be a complement of a previous base within an offset. Bases are then arithmetic encoded according to these probabilities to achieve compression. CTW-LZ [14] adopts both substitutional and statistical compression. It first uses hashing and dynamic programming to search for palindromes and approximate repeats. Then, it utilizes the context tree weighting (CTW) [39] prediction model, a method that weights multiple models to predict succeeding

base probabilities, to compress short repeats, and uses LZ77 [37] to compress long repeats.

As pointed out in [16, 17], although these traditional genomic sequence compression methods have shown encouraging performance, their compute time, memory usage and compression ratio limit their applicability to NGS data. We refer the readers to a more thorough review [7] of these traditional methods and devote our effort to more recently developed methods for high-throughput sequencing data in the following sections.
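To make the statistical strategy concrete, the sketch below (our simplified illustration, not the XM or CTW-LZ implementation) trains an order-2 context model on a DNA string and reports the average number of bits an ideal arithmetic coder would spend per base under that model; all function names are ours.

```python
from collections import defaultdict
from math import log2

def order2_bits_per_base(seq, alphabet="ACGT", pseudocount=1.0):
    """Estimate arithmetic-coding cost under an order-2 context model.

    Two-pass simplification: counts are collected first, then used to score
    every base. Real compressors update counts adaptively so that encoder
    and decoder stay synchronized without transmitting the model.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(2, len(seq)):
        counts[seq[i - 2:i]][seq[i]] += 1

    bits = 0.0
    for i in range(2, len(seq)):
        ctx, base = seq[i - 2:i], seq[i]
        total = sum(counts[ctx][b] + pseudocount for b in alphabet)
        p = (counts[ctx][base] + pseudocount) / total
        bits += -log2(p)          # ideal coding cost of this base
    return bits / (len(seq) - 2)  # average bits per base

if __name__ == "__main__":
    genome = "ACGTACGTACGTACGTTTTTACGTACGT" * 20
    print("estimated bits/base:", round(order2_bits_per_base(genome), 3))
```

Repeat-rich sequences give highly skewed context distributions and hence fewer bits per base, which is the property that the expert-model mixtures of XM and CTW-LZ exploit far more aggressively.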

ADVANCES IN GENOMIC SEQUENCE COMPRESSION FOR NEXT-GENERATION SEQUENCING

The cost per raw megabase of DNA sequence has dropped from over 5000 USD to under 0.1 USD over the past 10 years (http://www.genome.gov/27541954). As a result, sequencing is no longer limited to a handful of selected model organisms but is being used as a genomic material surveying method [40]. Meanwhile, resequencing has been applied to study human populations [41], clinical sequencing of virulent bacterial strains [42] and beyond. The compression strategies for newly assembled sequences vary depending on whether any similar genomic sequence has previously been deposited in a database. Based on this, compression methods can be categorized as reference-based and reference-free.

Reference-based compression

When a large portion of a newly assembled genomic sequence, i.e. the target, has been captured by a similar, previously documented sequence, i.e. the reference, the target can be fully identified given the reference as well as the differences between the target and the reference. This is the basic idea of reference-based compression, for which a typical procedure is given in Figure 1: the target is first aligned to the reference; then mismatches between the target and the reference are identified and encoded. Each record of a mismatch may consist of the position with respect to the reference, the type of mismatch (insertion, deletion or substitution) and the value. Compression is achieved if the space required to encode the mismatches is smaller than that required to store the target itself.

Figure 1: Schematic of the reference-based compression of genomic sequences.

Brandon et al. [18] used various coders such as Golomb [43], Elias [44] and Huffman [45] coding to encode the mismatches and achieved a best 345-fold compression rate when applied to human mitochondrial genome sequences. Christley et al. [19] proposed the DNAzip algorithm, which relies on the human population variation database, where a variant can be a single-nucleotide polymorphism (SNP) or an indel (an insertion or a deletion of multiple bases). All variants of a newly obtained sequence can be identified by comparing the sequence against the database. Because the number of variants in the human population is substantially smaller than the size of the human genome, DNAzip was able to encode the James Watson genome using merely 4 MB, i.e. a compression ratio of 750-fold. Different from DNAzip, which depends on the knowledge of variation data, Wang et al. [20] presented a de novo compression program, GRS, which directly obtains variant information via a modified UNIX diff program [46]. Following the de novo compression motivation of GRS, GReEn [21] proposed a probabilistic copy model that calculates target base probabilities based on the reference. Given the base probabilities as input, an arithmetic coder is then used to encode the target. GReEn was shown to be more effective than GRS when the target and the reference are less similar.

Reference-based compression has also been applied to multiple homogeneous genomic sequences. Given a set of genomic sequences, the RLZ method [22] arbitrarily selects one to be the reference. Each remaining target sequence is then partitioned into a list of substrings in a greedy manner, where the next substring is selected to be the longest common substring between the reference and the suffix of the target sequence that has not yet been considered. Therefore, every substring in any target sequence can be encoded by a substring in the reference in the LZ77 style [37]. RLZ-opt [23] improved upon RLZ by replacing the greedy partitioning method with the lookahead-by-h algorithm [24], where a shorter common substring can be selected between the target and the reference to achieve an


overall better compression. GDC [25] further improved upon RLZ-opt in the following aspects: (i) non-random selection of the reference, (ii) substrings in a target can be part of a non-reference sequence and (iii) approximate string matching is used to identify a substring. Together, these strategies lead to the identification of more repeats among input sequences. An equally important strategy used by GDC is to divide each input sequence into blocks of approximately equal size, each encoded with Huffman coding, to allow random access to any specific section of the compressed data without decompressing the whole dataset.
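The scheme in Figure 1 can be illustrated with a minimal sketch that records substitutions of a target against an equal-length reference. This is a toy example of the general idea only, not the encoding used by GRS, GReEn or GDC, which also handle indels and entropy-code the records; the function names are ours.

```python
def encode_against_reference(target, reference):
    """Record only the positions where target differs from reference.

    Simplifying assumption: target and reference have equal length, so every
    difference is a substitution. Real tools also handle insertions/deletions
    and entropy-code the records (e.g. Golomb/Huffman coding of position gaps).
    """
    assert len(target) == len(reference)
    return [(i, base) for i, (base, ref) in enumerate(zip(target, reference))
            if base != ref]

def decode_with_reference(records, reference):
    """Reconstruct the target from the reference plus the mismatch records."""
    seq = list(reference)
    for pos, base in records:
        seq[pos] = base
    return "".join(seq)

if __name__ == "__main__":
    ref = "ACGTACGTACGTACGT"
    tgt = "ACGTACCTACGTACGA"          # two substitutions relative to ref
    recs = encode_against_reference(tgt, ref)
    print(recs)                        # [(6, 'C'), (15, 'A')]
    assert decode_with_reference(recs, ref) == tgt
```

The fewer and more clustered the differences, the cheaper the record list becomes relative to storing the target outright, which is why a close reference matters so much in the results reported below.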

Reference-free compression

Reference-based compression methods can achieve excellent compression results when a highly similar reference can be identified or, for methods dependent on such information, when variation data are available. Nonetheless, reference-free methods are required in applications such as de novo sequencing and metagenomic sequencing [15], where target genomes share weaker similarities with known ones and variation information is not readily available. The BIND [17] algorithm encodes each target sequence using two binary strings. In the first string, base A or T is assigned bit 0 and base G or C bit 1, whereas in the second string, T or C is assigned bit 0 and A or G bit 1. These two strings are subsequently stored by recording the lengths of alternating blocks of 1's and 0's using a unary coding scheme. In the DELIMINATE algorithm [26], the positions of the two most dominant types of bases are delta encoded [47] and eliminated from the sequence; the remaining bases are then represented with binary code. BIND and DELIMINATE were shown to achieve higher compression efficiency than general-purpose compression algorithms such as gzip, bzip2 and lzma when applied to prokaryotic genomes in FNA (FASTA Nucleic Acid) format, prokaryotic gene sequences in FFN (FASTA nucleotide coding regions) format, human chromosomes and NGS genomic datasets. Kuruppu et al. [16] proposed a substitutional approach, COMRAD, for compressing a set of highly redundant sequences, i.e. homologous sequences or sequences with multiple identical copies. COMRAD first identifies repetitive substrings with lengths ranging from hundreds to thousands of base pairs in the input data and includes them in a corpus-wide codebook. Each type of repeat in the codebook is then encoded

into a short word. Compression is accomplished by replacing all repeats in the input sequences with the corresponding words from the codebook. Because COMRAD constructs the codebook from the input sequences, no external reference is required during the decoding process. By storing the codebook and the compressed sequences with additional space proportional to the length of each sequence, COMRAD also enables random access to the compressed sequences. Pinho et al. [48] developed the DNAEnc3 algorithm, which employs multiple competing finite-context (Markov) models. For each given genomic region, models of different orders are used to calculate the probability of the next occurring base, taking into account contextual and palindrome information. The best-performing model is then chosen to encode the region under consideration.
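As an illustration of how a four-letter alphabet can be reduced to two binary streams in the spirit of BIND, the sketch below maps each base to two bits and run-length encodes each stream. The exact coding of the run lengths in BIND (unary coding followed by 7-Zip in the published tool) is simplified here, and the function names are ours.

```python
def split_into_bitstreams(seq):
    """Map A/T -> 0, G/C -> 1 in stream one and T/C -> 0, A/G -> 1 in stream two."""
    s1 = "".join("0" if b in "AT" else "1" for b in seq)
    s2 = "".join("0" if b in "TC" else "1" for b in seq)
    return s1, s2

def run_lengths(bits):
    """Lengths of alternating runs of identical bits, plus the starting bit."""
    runs, count = [], 1
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return bits[0], runs

if __name__ == "__main__":
    s1, s2 = split_into_bitstreams("AAAATTTTGGGGCCCCACGT")
    print(run_lengths(s1))   # ('0', [8, 8, 1, 2, 1]) -> long A/T and G/C runs
    print(run_lengths(s2))
```

Long homopolymer or low-complexity stretches produce a few long runs, which compress far better than the raw characters would.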

Experimental results

We selected a set of genomes that differ in species origin, length and repeat content to benchmark a subset of representative methods with publicly accessible implementations. The details of the selected data are given in Table 1 and the compression methods in Table 2. All methods were tested on a cluster running 64-bit Red Hat 4.4.4-13 with a 32-core 3.1 GHz Intel(R) Xeon(R) CPU E31220. The performance of each method was evaluated based on compression ratio, memory usage, and compression and decompression time (Table 3). For completeness, the compressed file sizes are also given. For datasets involving more than one chromosome (e.g. human, Arabidopsis thaliana, Caenorhabditis elegans and Saccharomyces cerevisiae), each chromosome was individually compressed. The run-times of compression and decompression reported in Table 3 are summed over all chromosomes, whereas the memory sizes are averages. We emphasize that the performance results shown in Table 3 across different categories of compression methods are not a direct demonstration of one being better than another but are intended to help researchers choose the appropriate method for their specific applications.

It can be seen that both reference-based and reference-free methods outperform the general-purpose methods in terms of compression ratio. Whereas on these datasets the reference-free methods generally achieved a 15–40% better compression ratio than general-purpose compression programs, the availability of a homologous reference made a striking difference: the reference-based programs achieved a 40-fold to over a 10 000-fold better compression ratio. However, to recover the target sequence, reference-based methods require the presence of the same reference genome as used for compression. Both gzip and GRS process the input data by iteratively loading a small portion of data of fixed size; therefore, their memory usage is constant regardless of input data size. gzip uses the least memory among the methods tested, whereas GReEn tends to consume more memory than the other methods. The reference-based methods can take much longer for compression and decompression of large genomes (at the scale of 100 MB in the current experiment) because the compression/decompression process requires longer query times for bases/substrings against a longer reference. Thus, when an appropriate reference is available and the compression gain on the target sequences can compensate for the extra space required for the reference genome (such as in the case of multiple targets), reference-based methods should be favored. As more genomic sequences are uniquely documented in public databases, we expect good use of reference-based methods in cases that require the transfer of large target sequences over the Internet. On the other hand, reference-free methods are self-contained and have a better balance between time efficiency and compression ratio compared with reference-based methods. Also, as shown in the current experiment, the total time used for compression and decompression is comparable between reference-free and general-purpose methods, making the former a more favorable choice.

Table 1: Genomic sequence datasets used for evaluation

Target genome        Species          Length (bp)     File size (Byte)  Reference genome    Reference genome length (bp)
KOREF_20090224 [49]  Human            3 069 535 988   3 120 695 071     KOREF_20090131      3 069 535 988
TAIR10 [50]          A. thaliana      119 667 743     121 183 082       TAIR9               119 707 899
Ce10                 C. elegans       100 286 070     102 291 839       Ce6 (UCSC)          100 281 426
sacCer3              S. cerevisiae    12 157 105      12 400 379        sacCer2 (UCSC)      12 156 679
NC_017526            L. pneumophila   2 682 626       2 721 068         NC_017525 (NCBI)    2 682 626
NC_017652            E. coli          5 038 385       5 110 458         NC_017651 (NCBI)    5 038 385

Table 2: Overview of genomic sequence compression methods

Method       Input format    Brief description
GRS          FASTA           Reference-based; Huffman coding
GReEn        FASTA           Reference-based; arithmetic encoder with the copy model
BIND         FASTA           Reference-free; 'block-length' encoding with 7-Zip (Lempel-Ziv)
DELIMINATE   FASTA           Reference-free; delta encoding with 7-Zip (Lempel-Ziv)
bzip2        General ASCII   General-purpose; Burrows-Wheeler transform and Huffman coding
gzip         General ASCII   General-purpose; LZ77 and Huffman coding

NEXT-GENERATION SEQUENCING DATA COMPRESSION

Each record of an NGS dataset typically consists of metadata (including read name, platform, project ID, etc.), a read and a quality score sequence, stored in either FASTQ [51] or SAM/BAM format [52]. The SAM/BAM format was designed for storing read alignments against a set of chosen references but has also been used to store unaligned data. Similar to genomic sequence compression, compression methods for reads can be categorized as reference-based and reference-free. Nonetheless, metadata and quality score sequences constitute a large part of sequencing data. As their alphabet size is much larger than that of reads, compression of these data falls back on strategies used for general-purpose compression.
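Because the three components of a FASTQ record have such different statistics, many tools first separate them into distinct streams and model each stream independently. The sketch below only performs that separation; it is our illustration rather than any tool's parser, and the file path in the usage comment is hypothetical.

```python
import gzip

def split_fastq_streams(path):
    """Split a FASTQ file into three streams: read names, bases and qualities."""
    names, seqs, quals = [], [], []
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        while True:
            name = fh.readline().rstrip()
            if not name:
                break
            seqs.append(fh.readline().rstrip())
            fh.readline()                      # '+' separator line
            quals.append(fh.readline().rstrip())
            names.append(name[1:])             # drop the leading '@'
    return names, seqs, quals

# Usage (path is hypothetical):
# names, seqs, quals = split_fastq_streams("SRR935126.fastq.gz")
```

Each stream can then be handed to the models discussed below: a DNA-aware model for the bases, a text model for the names and a quality-specific model for the scores.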


Table 3: Compression results of genomic sequence data

Data / Method                C-Size(MB)  C-Time(s)  D-Time(s)  C-Mem(MB)  D-Mem(MB)  C-Ratio
KOREF_20090224 (2986.7MB)
  GRS                        19.16       1194.03    1141.36    1.25       1.34       155.89
  GReEn                      17.14       367.26     401.08     244.36     231.96     174.27
  BIND                       633.62      402.36     316.06     4.93       10.5       4.71
  DELIMINATE                 604.05      198.55     270.98     3.80       8.01       4.94
  bzip2                      775.23      329.90     149.08     8.70       5.01       3.85
  gzip                       845.64      611.99     53.72      0.63       0.54       3.53
TAIR10 (115.6MB)
  GRS                        0.4583      84.71      38.55      1.25       1.34       252.15
  GReEn                      0.0063      42.06      42.22      522.61     522.40     18261.47
  BIND                       27.58       16.7       12.32      2.89       10.00      4.19
  DELIMINATE                 26.73       12.23      17.12      3.04       7.12       4.32
  bzip2                      32.19       12.61      6.09       8.70       5.01       3.59
  gzip                       34.30       25.84      1.70       0.63       0.54       3.37
Ce10 (97.6MB)
  GRS                        0.0755      33.91      24.88      1.25       1.34       1292.67
  GReEn                      0.1621      35.14      35.77      435.01     436.23     601.70
  BIND                       23.17       12.91      12.38      4.01       10.00      4.21
  DELIMINATE                 22.36       9.70       14.84      3.06       7.15       4.36
  bzip2                      27.57       9.95       5.20       8.7        5.01       3.54
  gzip                       30.35       20.71      1.48       0.63       0.54       3.21
sacCer3 (11.8MB)
  GRS                        0.0054      3.95       6.01       1.25       1.34       2185.86
  GReEn                      0.2902      3.45       3.83       223.35     219.82     40.75
  BIND                       2.90        2.14       3.30       2.92       7.45       4.08
  DELIMINATE                 2.85        9.06       13.84      2.92       6.88       4.15
  bzip2                      3.40        1.87       1.32       6.74       3.75       3.47
  gzip                       3.64        2.70       1.01       0.63       0.54       3.25
NC_017526 (2.60MB)
  GRS                        0.00012     1.25       0.95       1.25       1.34       22303.84
  GReEn                      0.47        0.39       0.43       6.15       5.87       5.50
  BIND                       0.64        0.54       0.13       3.51       9.77       4.05
  DELIMINATE                 0.63        0.98       0.98       2.88       7.00       4.14
  bzip2                      0.74        0.38       0.17       7.82       5.01       3.51
  gzip                       0.79        0.89       0.04       0.63       0.54       3.28
NC_017652 (4.87MB)
  GRS                        0.00012     1.95       43.20      1.25       1.34       39311.22
  GReEn                      0.00011     0.72       0.39       10.00      10.00      42945.03
  BIND                       1.20        0.93       0.50       3.90       10.00      4.05
  DELIMINATE                 1.19        1.16       0.78       2.88       7.00       4.08
  bzip2                      1.38        0.58       0.45       8.70       5.01       3.52
  gzip                       1.48        1.11       0.11       0.63       0.54       3.29

C-Size: file size after compression; C-Time and D-Time denote compression time and decompression time in seconds, respectively; C-Mem and D-Mem denote the average memory used in compression and decompression, respectively; C-Ratio: compression ratio, i.e. the ratio of file sizes pre- and post-compression.

Reference-based read compression

Resequencing is a cost-effective way to identify mutations present in a target genomic region, where mutations are identified by aligning reads against a reference sequence [53]. Only the variant positions with respect to the reference need to be stored rather than the complete read data. Typical reference-based read compression involves two major steps, read mapping and encoding of the mapping results, as illustrated in Figure 2. First, reads are aligned to an appropriately selected reference genome, either de novo assembled or chosen from a database, using one of the NGS aligners such as Bowtie [54], BWA [55] or Novoalign (http://www.novocraft.com). The choice of aligner depends on the type of NGS data and the specific application, for which we refer the readers to a recent review [56]. Then, the mapping results (Table 4) are encoded via methods such as arithmetic coding and Huffman coding. GenCompress [57] uses Bowtie as the read aligner, and then Golomb [43], Elias Gamma [44], MOV [58] or Huffman coding is used to encode the mapping results. In contrast, SlimGene [27] carries out alignment using the CASAVA software toolkit (http://www.illumina.com/pages.ilmn?ID=314). Then, during the encoding step, two binary vectors


are generated. The first vector has length equal to the reference, where a bit '1' indicates that at least one read is mapped to the corresponding position. The second vector is a concatenation of the binary encodings of the read mapping results. Given that the chosen reference is highly similar to the target, the space required for the two binary vectors is substantially smaller than that required to store the reads themselves. In addition to the mapping and encoding procedure, CRAM [53] includes the idea of applying a de Bruijn graph based assembly approach [59] to reads that fail to be mapped to the chosen reference, where the assembled contigs can be used as references (it is unclear whether this strategy is available in the public release). Quip [28] implements the standard reference-based compression component (enabled with parameter -r) and also has a de Bruijn graph based de novo assembly component (enabled with parameter -a) to generate references from the target data instead of relying on externally provided references. Different from the aforementioned methods that treat each read separately, NGC [29] traverses each column of the read alignment to exploit the common features among multiple reads mapped to the same genomic positions and represents the mapped read bases in a per-column way using run-length encoding [60]. Samcomp [33], aimed at the SAM format, performs reference-based compression by anchoring each base to a reference coordinate and then encoding the base according to a per-coordinate model. The more reads aligning to a specific reference coordinate, the better the compression. It is worth noting that, as an alternative to the SAM/BAM format, the multi-tier Goby file format (http://campagnelab.org/software/goby/) also explores the idea of reference-based compression for storing read alignments: sequences of aligned reads are not explicitly stored but are looked up in a reference genome when necessary.

Figure 2: Flow chart of reference-based compression of short reads.

Table 4: Read mapping results

Mapping result of each read    Description
Mapping position               The start position on the reference where the read is aligned.
Mapping type                   Direct repeat or palindrome.
Number of mismatches           The number of mismatches in the alignment.
Mismatch positions             The positions of mismatches.
Mismatch type                  Insertion, deletion or substitution.
Mismatch values                One or multiple bases in {A, C, G, T, N (denoting an ambiguous base)}.


Reference-free read compression

In applications such as de novo sequencing and metagenomic sequencing, where there is no clear choice of a reference, reference-free compression algorithms are needed. Typically, statistical encoding methods such as Huffman coding and arithmetic coding, or substitutional methods such as LZ77, are used. Tembe et al. [30] proposed the G-SQZ method. First, the frequency of each unique tuple is calculated from the input. Then, a Huffman code is generated for each tuple, where tuples with higher frequency are encoded with fewer bits. Lastly, the encoded tuples, along with a header containing meta-information such as the platform and the number of reads, are written to a binary file to complete the compression. In the DSRC algorithm [31], the input is divided into blocks of 32 records, and every 512 blocks are grouped into a superblock. Each superblock is indexed and compressed independently, for example via LZ77 encoding, to enable fast access to individual records. Bhola et al. [32] first divide each read into substrings of a fixed length, where each substring is queried against a dictionary for a highly similar word (string) and, in the meanwhile, first-order Markov encoding [61] is applied to the substring. The query substring is encoded by registering the differences compared with the similar word identified in the dictionary, or Markov encoded if less space is required that way. The Quip [28] program, in addition to the reference-based compression mode discussed previously, also includes an implementation of reference-free statistical compression using arithmetic coding based on high-order Markov chains. Bonfield and Mahoney [33] presented two methods, Fqzcomp and Fastqz, both using an order-N context model and an arithmetic coder. Differently, Fastqz contains an additional preprocessing step before the context modeling to reduce the data volume, and it can also be applied to reference-based compression.

A different set of algorithms use data transformation techniques such as the BWT, or clustering algorithms that group similar reads, to facilitate compression. The ReCoil [34] algorithm first constructs an undirected graph with vertices denoting reads and edges carrying weights equal to the number of common k-mers between the end nodes. Then a maximum spanning tree (MST) [62] is identified and traversed

starting from an arbitrarily selected root node. The read corresponding to the root is stored directly, whereas each remaining read is encoded via the set of maximal longest common substrings shared with its parent. The 7-Zip compressor is further used to compress the encoded reads. Cox et al. [35] presented BEETL, which utilizes the Burrows-Wheeler transform (BWT) [63] to indirectly exploit repeats among input reads. The transformed data are then further compressed via general-purpose compressors such as gzip, bzip2 or 7-Zip. The SCALCE algorithm [36] clusters input reads sharing the same longest 'core' substring identified by the locally consistent parsing algorithm [64]; reads within a cluster are then compressed by gzip or LZ77-like compression methods.
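Several of the reference-free tools above fall back on Huffman coding of symbol (or tuple) frequencies. The sketch below builds such a prefix code with Python's heapq; it is a generic illustration rather than the G-SQZ implementation, and the toy (base, quality) pairs are invented for the example.

```python
import heapq
from collections import Counter

def huffman_code(frequencies):
    """Build a prefix code from a {symbol: count} map.

    More frequent symbols receive shorter bitstrings. Returns {symbol: bits}.
    """
    heap = [[count, i, {sym: ""}]
            for i, (sym, count) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                        # degenerate single-symbol input
        return {sym: "0" for sym in frequencies}
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], lo[1], merged])
    return heap[0][2]

if __name__ == "__main__":
    # Toy example: code (base, quality) pairs observed in a handful of reads.
    pairs = [("A", 40), ("A", 40), ("C", 40), ("G", 35), ("A", 38), ("A", 40)]
    code = huffman_code(Counter(pairs))
    for sym, bits in sorted(code.items(), key=lambda kv: len(kv[1])):
        print(sym, bits)
```

An arithmetic coder driven by the same frequencies would do slightly better than the Huffman code, at the cost of a more involved implementation.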

Compression of metadata and quality scores

Metadata, such as read names, platform and project identifiers, normally contain numeric and non-numeric text characters. Typically, they can be extracted from the sequencing data and compressed separately using general text compression methods such as delta, Huffman, arithmetic, run-length and LZ encoding. Sometimes, part of the metadata can even be discarded when it does not affect downstream analysis [28, 36]. Quality score sequences, on the other hand, are typically more important and are required to be distributed along with the reads. A quality score is denoted by an ASCII character, which can be converted to the probability of the corresponding base having been correctly determined during sequencing. This probability information is directly utilized in many NGS analysis programs. As the alphabet size is large, e.g. 40, general-purpose compression schemes have been directly used to losslessly preserve the quality score information. For instance, as discussed previously, G-SQZ [30] uses Huffman coding to encode tuples based on frequency. SlimGene [27] uses Markov encoding, taking into account the position-dependent distribution of quality scores and the higher-order Markovian property between adjacent quality values. In the DSRC algorithm [31], if the variation of quality score values within a sequence is small, run-length encoding is used to encode the sequence; otherwise, Huffman encoding is used. In both [29] and [36], arithmetic encoding has been used to compress quality scores but achieved no obviously superior performance compared with general-purpose compression algorithms.

Because quality scores may not precisely determine base accuracy, and adjacent bases with similar ASCII values of quality scores are likely to have similar sequencing accuracy, a series of algorithms have been introduced to cluster quality scores with similar values to achieve more aggressive compression. Since the original quality score values are not exactly preserved, such methods are termed lossy compression. For example, SlimGene [27] employs a randomized rounding strategy, a coarse quantization, to reduce the alphabet size of the quality scores, and the resulting score sequences are compressed with Markov encoding. It showed no significant loss of accuracy in downstream analyses such as variant calling. The reference-based method CRAM [53] stores quality scores for read bases that differ from the reference, and a user-specified percentage of quality scores for read bases that are identical to the reference. Quip [28] uses third-order Markov chain-based arithmetic coding to exploit the correlation between neighboring quality scores, where coarse binning is applied to read positions. QScores-Archiver [65] uses three lossy transformations based on coarse binning (UniBinning, Truncating and LogBinning); the transformed quality scores are further compressed using Huffman encoding, gzip or bzip2. The NGC algorithm [29] uses non-uniform quantization to bin quality scores, where smaller quantization intervals are used for quality scores of bases that differ from the reference than for bases that match the reference. The size of the interval can be adjusted to control the loss of information, which positively correlates with the accuracy of the downstream analyses. QualComp [66] utilizes rate distortion theory [67] to minimize the distortion between the input quality sequence and the compressed sequence, given a maximum allowable word size for storing each quality sequence. The word size, specified by the user, controls the balance between compression ratio and accuracy for downstream analyses. Janin et al. [68] proposed a method based on the premise that if a base in a read can, with high probability, be predicted from the context of bases surrounding it, then the base is less informative and its quality score can be discarded or aggressively compressed. The majority of bases can be predicted by aggregating reads into a compressed index via the BWT and the longest common prefix array [69]; the corresponding quality scores can then be converted to an average value and effectively compressed.

Lossy compression of quality scores has obtained substantial reductions in the storage space of NGS data. Yet challenges and opportunities remain in applying lossy compression to improve the efficiency of NGS data storage and transmission, for example, developing lossy compression methods that minimize the loss of accuracy in storing read mapping results and in SNP calling [66].
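To illustrate the coarse-binning idea shared by several of the lossy schemes above, the sketch below uniformly quantizes a Phred+33 quality string into a small number of bins. The bin layout and parameters are ours and do not reproduce any particular tool's quantizer.

```python
def bin_quality_string(qual, n_bins=8, offset=33, max_q=41):
    """Coarsely quantize a Phred quality string into n_bins levels.

    Each score is replaced by the midpoint of its bin, shrinking the alphabet
    so that downstream entropy coders (Huffman, arithmetic) compress better.
    Lossy: the original values are not recoverable.
    """
    width = (max_q + 1) / n_bins
    out = []
    for ch in qual:
        q = min(ord(ch) - offset, max_q)
        b = int(q / width)
        midpoint = int(b * width + width / 2)
        out.append(chr(midpoint + offset))
    return "".join(out)

if __name__ == "__main__":
    original = "IIIIHHGF@@<<::98"          # Phred+33 qualities from one read
    print(bin_quality_string(original))    # fewer distinct symbols afterwards
```

Whether such quantization is acceptable depends on the downstream analysis; the studies cited above evaluate exactly that trade-off for variant calling.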

Experimental results

In this section, the performance of the compression methods for raw sequencing data is tested using nine NGS datasets. Similar to our test for genomic sequence compression, we chose datasets that correspond to different species (human, plant, worm, fungus and bacteria) and have different sizes (ranging from 1.7 GB to 69.5 GB). In addition, we selected data produced by several popular NGS sequencing platforms, including Illumina, 454, SOLiD and Ion Torrent. For the human genomic data, we selected data from all four types of NGS platforms. All sequencing data and the reference genomic sequences (Table 5) were obtained from NCBI/Sequence Read Archive (SRA) in FASTQ and FASTA format, respectively. We selected 'Quip -a', 'Quip -r' and CRAM to represent the reference-based methods; DSRC, Quip and Fqzcomp for the reference-free methods; and the general-purpose compression methods bzip2 and gzip as the baseline (Table 6). Default parameters were used for all methods. For 'Quip -r' and CRAM, the FASTQ files were converted to BAM format [53], i.e. the reads in FASTQ were mapped to the reference using BWA 0.5.9 [55] and the mapping results were then converted to BAM format using samtools 0.1.18 [52]. Note that additional information, such as paired-end read relationships, might be required when compressing SAM/BAM format files, which is not necessary for unaligned FASTQ files. Such information is rather useful for downstream analysis but incurs added cost for compression; programs such as CRAM and 'Quip -r' are capable of encoding it. In addition to the above methods, we selected a lossy compression method, CRAM-L, where the parameter '-L N40D8' is used to preserve quality scores for non-matching bases with full precision and to bin quality scores (eight bins) for positions flanking deletions. All methods were executed on the same compute cluster as previously mentioned. Note that although we intentionally discussed the compression of reads and quality scores separately from an algorithmic point of view, all methods apply compression directly to the complete sequencing data.

The compression results are reported in Table 7. Note that the time spent on read mapping is not included in the compression time. In the current experiment, it is not surprising to observe that CRAM-L obtained the best compression ratio owing to its lossy nature; it also consumes less compression/decompression time and memory than its lossless counterpart CRAM. Both reference-based and reference-free methods outperform the general-purpose methods in terms of compression ratio. However, the advantage of reference-based methods over reference-free methods on sequencing data is not as evident as for genomic sequence compression, because unmapped reads, along with quality score sequences, are compressed much less efficiently than the mapped reads that are compressed relying on the reference. For instance, the reference-based method 'Quip -r' shows slightly better compression ratio and compression time than the other methods. Among the reference-free methods, Quip and Fqzcomp are competitive in terms of both compression time and compression ratio. DSRC obtains a lower compression ratio than the other two methods, but also uses less run time and memory. The reference-based methods, and also the reference-free Quip, use more memory than the other methods. 'Quip -a' uses the most memory because it requires extra space for the initial assembly of reference sequences. Consistent with genomic sequence compression, the general-purpose gzip used a constant and also the smallest amount of memory. gzip is typically faster during decompression, but when decompression time is combined with compression time, it has no clear advantage over the other methods. Note that the parameter settings of the tested methods could be optimized to improve compression efficiency, but that is not the focus of this article.

Table 5: NGS datasets used for evaluation

Data        Sequencing platform            Species         Read length (bp)  Number of reads  File size (Byte)  Reference genome  Reference genome length (bp)
SRR935126   Illumina Genome Analyzer IIx   A. thaliana     2*76              49 719 116       10 526 908 520    AJ270058          3 052 118
SRR489793   Illumina HiSeq 2000            C. elegans      2*101             56 851 258       13 770 411 282    NC_003281         13 783 800
SRR327342   Illumina Genome Analyzer II    S. cerevisiae   138               15 036 699       6 166 561 880     NC_001147         1 091 290
SRR801793   Illumina HiSeq 2500            L. pneumophila  2*100             10 812 922       2 955 009 022     NC_018140         3 492 534
ERR022075   Illumina Genome Analyzer IIx   E. coli         2*101             45 440 200       11 799 799 800    NC_000913         4 639 675
SRR003177   454 GS FLX Titanium            Human           45-4398           1 504 571        1 846 259 474     Chr 21 GRCh37*    48 129 889
SRR010637   AB SOLiD System 2.0            Human           35                12 720 822       3 156 178 930     Chr 21 GRCh37     48 129 889
SRR611141   Ion Torrent PGM                Human           6-406             4 853 655        1 887 295 686     Chr 21 GRCh37     48 129 889
SRR017935   Illumina Genome Analyzer II    Human           76                66 681 139       17 550 499 802    Chr 21 GRCh37     48 129 889
SRR400039   Illumina HiSeq 2000            Human           2*101             124 331 027      68 916 373 282    Chr 21 GRCh37     48 129 889
SRR125858   Illumina HiSeq 2000            Human           2*76              124 815 011      54 706 983 156    Chr 21 GRCh37     48 129 889
SRR111960   Illumina HiSeq 2000            Human           2*76              171 750 536      74 585 925 838    Chr 21 GRCh37     48 129 889

*Chromosome 21 alone is used as the reference.

Table 6: Overview of sequencing data compression methods

Method    Input data format    Brief description
Quip -a   FASTQ, SAM, BAM      Assembly-based approach with de Bruijn graph
Quip -r   SAM, BAM             Reference-based compression with arithmetic coding
CRAM      BAM                  Reference-based approach with Huffman coding
CRAM-L    BAM                  CRAM using lossy input parameter '-L N40 -D8'
DSRC      FASTQ                Lempel-Ziv and Huffman coding
Quip      FASTQ, SAM, BAM      Markov chain model with arithmetic coding
Fqzcomp   FASTQ                Context model with arithmetic coding and delta coding
bzip2     General ASCII        Burrows-Wheeler transform and Huffman coding
gzip      General ASCII        LZ77 and Huffman coding


Table 7: Compression results of NGS data

Data / Method          C-Size(GB)  C-Time(s)  D-Time(s)  C-Mem(MB)  D-Mem(MB)  C-Ratio
SRR935126 (9.80GB)
  Quip -a              1.63        1141.70    566.49     779.00     758.00     6.03
  Quip -r              1.56        312.99     812.45     392.00     389.00     6.29
  CRAM                 2.28        1670.15    542.63     392.00     388.00     4.30
  CRAM-L               1.05        1225.01    254.48     381.00     408.00     9.34
  DSRC                 1.99        394.62     324.34     21.00      18.00      4.93
  Quip                 1.63        881.54     865.50     396.00     391.00     6.03
  Fqzcomp              1.67        1256.93    756.46     75.00      60.00      5.87
  bzip2                2.33        1561.28    1111.74    7.88       5.01       4.20
  gzip                 2.85        1925.50    407.31     0.63       0.54       3.43
SRR489793 (12.82GB)
  Quip -a              2.84        1108.22    836.37     1536.00    1228.00    4.51
  Quip -r              2.84        504.92     1151.33    392.00     392.00     4.51
  CRAM                 3.67        2298.00    1133.37    397.00     393.00     3.50
  CRAM-L               1.45        1684.67    428.46     420.00     422.00     8.85
  DSRC                 3.27        573.11     766.69     27.00      21.00      3.92
  Quip                 2.84        657.97     632.19     384.00     381.00     4.51
  Fqzcomp              2.87        786.96     510.56     75.00      64.00      4.48
  bzip2                3.56        1578.24    834.21     8.70       5.01       3.61
  gzip                 4.31        2508.78    414.32     0.63       0.54       2.98
SRR327342 (5.74GB)
  Quip -a              1.08        485.00     414.19     900.00     777.00     5.30
  Quip -r              1.06        158.97     448.60     382.00     380.00     5.39
  CRAM                 1.50        1035.26    302.74     382.00     389.00     3.82
  CRAM-L               0.60        694.24     243.14     388.00     396.00     9.59
  DSRC                 1.39        320.63     343.00     19.00      19.00      4.13
  Quip                 1.08        379.36     207.90     395.00     391.00     5.30
  Fqzcomp              1.16        225.54     203.88     79.00      63.00      4.94
  bzip2                1.55        715.61     421.15     7.84       5.01       3.69
  gzip                 1.85        1518.61    172.29     0.63       0.54       3.11
SRR801793 (2.75GB)
  Quip -a              0.51        469.79     333.09     1945       1638       5.43
  Quip -r              0.49        124.44     221.86     391.00     390.00     5.63
  CRAM                 0.74        748.98     149.11     362.00     395.00     3.74
  CRAM-L               0.31        358.61     101.2      383.00     399.00     8.99
  DSRC                 0.65        191.34     190.16     25.00      25.00      4.26
  Quip                 0.51        134.78     119.28     396.00     392.00     5.42
  Fqzcomp              0.56        105.11     104.18     75.00      67.00      4.94
  bzip2                0.74        386.47     221.14     7.84       5.01       3.74
  gzip                 0.90        607.76     89.03      0.63       0.54       3.06
ERR022075 (10.99GB)
  Quip -a              1.91        1192.97    624.07     2355.00    1638.00    5.76
  Quip -r              1.85        242.60     876.91     382.00     381.00     5.95
  CRAM                 2.87        1966.83    595.46     395.00     389.00     3.83
  CRAM-L               1.29        1514.47    331.32     398.00     390.00     8.55
  DSRC                 2.50        267.03     264.88     27.00      22.00      4.40
  Quip                 1.91        307.4      422.87     392.00     389.00     5.75
  Fqzcomp              2.13        290.61     382.16     77.00      64.00      5.17
  bzip2                2.86        1055.41    490.81     8.48       5.01       3.85
  gzip                 3.49        1533.01    224.55     0.63       0.54       3.15
SRR017935 (16.35GB)
  Quip -a              2.81        909.61     805.31     774.00     770.00     5.81
  Quip -r              2.74        525.00     1029.39    404.00     402.00     5.96
  CRAM                 3.46        1974.28    878.99     471.00     446.00     4.72
  CRAM-L               1.48        2064.2     409.06     465.00     451.00     11.03
  DSRC                 3.15        708.99     526.68     23.00      14.00      5.18
  Quip                 2.83        1484.59    1956.85    398.00     390.00     5.78
  Fqzcomp              2.79        731.08     583.65     77.00      65.00      5.86
  bzip2                3.80        2121.40    1444.68    8.70       5.01       4.30
  gzip                 4.64        2274.17    462.40     0.63       0.54       3.53
SRR003177 (1.72GB)
  Quip -a              0.37        261.45     273.34     776.00     765.00     4.60
  Quip -r              0.37        52.77      146.94     416.00     418.00     4.68
  CRAM                 0.49        312.45     88.53      445.00     455.00     3.53
  CRAM-L               0.24        231.56     72.94      433.00     442.00     7.31
  DSRC                 0.40        45.02      47.17      74.00      63.00      4.30
  Quip                 0.37        58.67      74.73      402.00     397.00     4.60
  Fqzcomp              0.36        56.79      63.88      75.00      56.00      4.73
  bzip2                0.45        144.54     80.58      8.16       5.01       3.84
  gzip                 0.55        240.11     35.76      0.63       0.54       3.12
SRR010637 (2.94GB)
  Quip -a              0.36        74.39      64.18      763.00     762.00     8.21
  Quip -r              0.34        43.97      88.58      404.00     406.00     8.55
  CRAM                 0.41        188.90     62.38      459.00     451.00     7.11
  CRAM-L               0.15        172.03     60.84      479.00     417.00     19.90
  DSRC                 0.38        77.08      51.37      14.00      13.00      7.73
  Quip                 0.36        64.53      66.44      387.00     382.00     8.21
  Fqzcomp              0.35        74.82      64.47      75.00      59.00      8.45
  bzip2                0.50        352.68     102.04     8.34       5.01       5.86
  gzip                 0.62        129.26     63.08      0.63       0.54       4.72
SRR611141 (1.76GB)
  Quip -a              0.49        188.56     161.22     823.00     753.00     3.58
  Quip -r              0.49        54.26      143.91     407.00     402.00     3.60
  CRAM                 0.58        304.84     95.46      474.00     450.00     3.03
  CRAM-L               0.22        226.80     68.23      467.00     441.00     8.16
  DSRC                 0.52        48.17      53.17      19.00      22.00      3.37
  Quip                 0.49        64.14      73.16      387.00     385.00     3.58
  Fqzcomp              0.45        61.91      72.80      75.00      59.00      3.88
  bzip2                0.58        176.50     89.97      8.70       5.01       3.02
  gzip                 0.71        231.59     52.92      0.63       0.54       2.48
SRR400039 (64.18GB)
  Quip -a              13.60       4872.36    2696.68    1536.00    1126.40    4.72
  Quip -r              13.32       1970.82    4599.55    399.00     399.00     4.82
  CRAM                 17.85       10124.9    3043.19    378.00     401.00     3.60
  CRAM-L               6.98        7638.81    1914.63    403.00     492.00     9.20
  DSRC                 15.84       2485.74    2288.03    39.00      24.00      4.05
  Quip                 13.60       2164.64    2793.59    394.00     392.00     4.72
  Fqzcomp              13.83       2646.33    2615.58    74.00      67.00      4.64
  bzip2                17.62       12932.88   5208.69    8.00       5.01       3.64
  gzip                 21.65       10456.10   1828.98    0.63       0.54       2.97

SRR125858 (50.95GB)
  Quip -a              9.19        2135.49    2016.93    761.00     758.00     5.54
  Quip -r              8.98        2213.16    5217.91    397.00     398.00     5.68
  CRAM                 12.07       7466.20    2477.62    413.00     399.00     4.22
  CRAM-L               5.41        5990.00    1341.18    399.00     395.00     9.42
  DSRC                 10.55       2016.97    1665.63    34.00      20.00      4.83
  Quip                 9.19        1668.28    1869.83    392.00     391.00     5.54
  Fqzcomp              9.27        2100.20    1724.39    77.00      62.00      5.50
  bzip2                12.42       11224.39   3298.37    8.70       5.01       4.10
  gzip                 14.98       8180.89    1490.65    0.63       0.54       3.40
SRR111960 (69.46GB)
  Quip -a              12.32       5542.39    5807.31    758.00     771.00     5.64
  Quip -r              12.03       1560.88    4425.95    397.00     393.00     5.78
  CRAM                 15.96       9607.44    2865.54    382.00     386.00     4.35
  CRAM-L               7.43        7928.37    1844.28    371.00     377.00     9.35
  DSRC                 13.99       3097.17    3303.89    40.00      21.00      4.96
  Quip                 12.32       2017.80    2583.57    389.00     388.00     5.64
  Fqzcomp              12.44       3124.21    2115.36    77.00      65.00      5.59
  bzip2                16.45       15294.35   4994.71    7.73       5.01       4.22
  gzip                 20.06       13662.79   1927.73    0.63       0.54       3.46

C-Size: file size after compression; C-Time and D-Time denote compression time and decompression time in seconds, respectively; C-Mem and D-Mem denote the average memory used in compression and decompression, respectively; C-Ratio: compression ratio, i.e. the ratio of file sizes pre- and post-compression (CRAM-L, being lossy, is not involved in the comparison of lossless methods).

CHALLENGES AND FUTURE PROSPECTS

Despite advances in high-throughput DNA sequencing data compression, challenges remain. We list several aspects that may benefit from further development of compression algorithms for NGS data analysis.

Toward compressive genomics

Most DNA sequence compression methods have been used to improve the efficiency of data storage and transfer. However, algorithms that can work directly on compressed data may reduce both memory and run time. Loh et al. [70] presented compressive versions of the alignment algorithms BLAST and BLAT that operate on compressed, non-redundant large genomic datasets and achieved similar accuracy with a substantial reduction in run time. Self-index data structures such as the FM-index [71], compressed suffix arrays [72, 73] and the LZ-index [74] enable query, extraction, retrieval and random access directly on compressed datasets. They have shown promise in read mapping applications [56] and are expected to be important supports for designing compressive techniques. Routine analysis of compressed sequence data remains a challenge.
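As a toy illustration of the self-index idea, the sketch below builds a Burrows-Wheeler transform and counts pattern occurrences by FM-index backward search. Real self-indexes replace the plain strings used here with compressed rank/select structures so that the text never needs to be decompressed; the function names and example are ours.

```python
def bwt(text):
    """Burrows-Wheeler transform of text (expects a unique terminator '$')."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def fm_count(bwt_str, pattern):
    """Count occurrences of pattern via backward search on the BWT.

    C[c] = number of characters smaller than c; occ(c, i) = occurrences of c
    in bwt_str[:i]. Both are computed naively here for clarity.
    """
    alphabet = sorted(set(bwt_str))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt_str.count(c)
    occ = lambda c, i: bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

if __name__ == "__main__":
    text = "ACGTACGTACGA$"
    b = bwt(text)
    print(b, fm_count(b, "ACG"))   # "ACG" occurs 3 times in the text
```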

High-performance DNA sequence compression

High-performance computing is a natural remedy for processing ever larger data. Although distributed computing on MapReduce clouds [75, 76] has been utilized, such solutions are typically limited to jobs that are inherently embarrassingly parallel and have been applied to process uncompressed data. Parallel compression methods are yet to be developed, and more complicated tasks requiring synchronization among different processors call for more sophisticated parallel algorithms, in which case parallel programming models such as the Message Passing Interface (MPI) can be explored.

Combination of encryption and compression

High-throughput sequencing has greatly promoted personalized genomics. As more people have their own genomes sequenced for the purpose of disease prevention, diagnosis and treatment, transmission of the relevant compressed data over the Internet raises privacy concerns over information access and data security [15]. One solution to this issue could be the combination of encryption and compression.

Redundancy elimination in DNA sequence databases

Currently, most publicly available high-throughput sequencing data are stored in the databases of large data centers such as NCBI, EBI and DDBJ. The data in these public databases suffer from high redundancy, as many projects, such as the 1000 Genomes Project, are set up to investigate genomes with high similarity. The redundancy adds storage cost and reduces query efficiency. Compressed data structures could be utilized to eliminate the redundancy in these databases. Although efforts have


been made [16, 77], developing algorithms that scale to large databases such as NCBI remains an open problem.

Key points

- The compression of high-throughput DNA sequence data is important for efficient data storage and transfer, as well as downstream analyses.
- This review provides a comprehensive categorization and assessment of many high-throughput DNA sequence compression algorithms.
- The future development of NGS data analysis can benefit from compressive genomics, high-performance compression, the combination of encryption and compression, and redundancy elimination in DNA sequence databases.


FUNDING

This work is supported in part by the National Natural Science Foundation of China (NSFC) Joint Fund with Guangdong, under Key Project U1201256; in part by the NSFC, under grants 61001185, 61205092 and 61171125; in part by the NSFC and the Royal Society of UK (NSFC-RS) joint project under grants IE111069 and 61211130120; in part by the Guangdong Natural Science Foundation, under grant S2012010009545; in part by the Shenzhen City Foundation for Distinguished Young Scientists; and in part by the Science Foundation of Shenzhen City under grants JC201105170650A, KQC201108300045A and JCYJ20130329115450637.


References

1. Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci 1977;74:5463–7.
2. Margulies M, Egholm M, Altman WE, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005;437:376–80.
3. Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol 2008;26:1135–45.
4. Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet 2010;11:31–46.
5. Branton D, Deamer DW, Marziali A, et al. The potential and challenges of nanopore sequencing. Nat Biotechnol 2008;26:1146–53.
6. Pennisi E. Will computers crash genomics? Science 2011;331:666–8.
7. Giancarlo R, Scaturro D, Utro F. Textual data compression in computational biology: a synopsis. Bioinformatics 2009;25:1575–86.
8. Grumbach S, Tahi F. Compression of DNA sequences. In: Proceedings of the 1993 IEEE Data Compression Conference (DCC '93), Snowbird, Utah 1993;340–50.

9. Grumbach S, Tahi F. A new challenge for compression algorithms. Genet Seq Inform Process Manag 1994;30:875–86.
10. Chen X, Kwong S, Li M. A compression algorithm for DNA sequences and its applications in genome comparison. Genome Informat Ser 1999;10:51–61.
11. Chen X, Li M, Ma B, et al. DNACompress: fast and effective DNA sequence compression. Bioinformatics 2002;18:1696–8.
12. Korodi G, Tabus I, Rissanen J, et al. DNA sequence compression - based on the normalized maximum likelihood model. IEEE Sign Process Mag 2007;24:47–53.
13. Cao MD, Dix TI, Allison L, et al. A simple statistical algorithm for biological sequence compression. In: Proceedings of the 2007 IEEE Data Compression Conference (DCC '07), Snowbird, Utah 2007;43–52.
14. Matsumoto T, Sadakane K, Imai H. Biological sequence compression algorithms. Genome Inform Ser 2000;11:43–52.
15. Kahn SD. On the future of genomic data. Science 2011;331:728–9.
16. Kuruppu S, Beresford-Smith B, Conway T, et al. Iterative dictionary construction for compression of large DNA data sets. IEEE-ACM Trans Computat Biol Bioinformatics 2012;9:137–49.
17. Bose T, Mohammed MH, Dutta A, et al. BIND - an algorithm for loss-less compression of nucleotide sequence data. J Biosci 2012;37:785–9.
18. Brandon MC, Wallace DC, Baldi P. Data structures and compression algorithms for genomic sequence data. Bioinformatics 2009;25:1731–8.
19. Christley S, Lu Y, Li C, et al. Human genomes as email attachments. Bioinformatics 2009;25:274–5.
20. Wang C, Zhang D. A novel compression tool for efficient storage of genome resequencing data. Nucleic Acids Res 2011;39:E45–U74.
21. Pinho AJ, Pratas D, Garcia SP. GReEn: a tool for efficient compression of genome resequencing data. Nucleic Acids Res 2012;40:e27.
22. Kuruppu S, Puglisi SJ, Zobel J. Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. String Process Informat Retrieval, Lecture Notes in Computer Science (LNCS) 2010;6393:201–6.
23. Kuruppu S, Puglisi SJ, Zobel J. Optimized relative Lempel-Ziv compression of genomes. In: Proceedings of Australasian Computer Science Conference, Darlinghurst, Australia 2011;91–8.
24. Horspool RN. The effect of non-greedy parsing in Ziv-Lempel compression methods. In: Proceedings of the 1995 IEEE Data Compression Conference (DCC '95), Snowbird, Utah 1995;302–11.
25. Deorowicz S, Grabowski S. Robust relative compression of genomes with random access. Bioinformatics 2011;27:2979–86.
26. Mohammed MH, Dutta A, Bose T, et al. DELIMINATE - a fast and efficient method for loss-less compression of genomic sequences. Bioinformatics 2012;28:2527–9.
27. Kozanitis C, Saunders C, Kruglyak S, et al. Compressing genomic sequence fragments using SlimGene. J Comput Biol 2011;18:401–13.
28. Jones DC, Ruzzo WL, Peng X, et al. Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Res 2012;40:e171.
29. Popitsch N, von Haeseler A. NGC: lossless and lossy compression of aligned high-throughput sequencing data. Nucleic Acids Res 2013;41:e27.


30. Tembe W, Lowey J, Suh E. G-SQZ: compact encoding of genomic sequence and quality data. Bioinformatics 2010;26:2192–4.
31. Deorowicz S, Grabowski S. Compression of DNA sequence reads in FASTQ format. Bioinformatics 2011;27:860–2.
32. Bhola V, Bopardikar AS, Narayanan R, et al. No-reference compression of genomic data stored in FASTQ format. In: 2011 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2011), Atlanta, Georgia 2011;147–50.
33. Bonfield JK, Mahoney MV. Compression of FASTQ and SAM format sequencing data. PLoS One 2013;8:e59190.
34. Yanovsky V. ReCoil - an algorithm for compression of extremely large datasets of DNA data. Algorithms Mol Biol 2011;6:23.
35. Cox AJ, Bauer MJ, Jakobi T, et al. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics 2012;28:1415–9.
36. Hach F, Numanagic I, Alkan C, et al. SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics 2012;28:3051–7.
37. Ziv J, Lempel A. A universal algorithm for sequential data compression. IEEE Trans Inform Theory 1977;23:337–43.
38. Rissanen J, Langdon GG. Arithmetic coding. IBM J Res Dev 1979;23:149–62.
39. Willems FM, Shtarkov YM, Tjalkens TJ. The context-tree weighting method: basic properties. IEEE Trans Inform Theory 1995;41:653–64.
40. Peterson J, Garges S, Giovanni M, et al. The NIH Human Microbiome Project. Genome Res 2009;19:2317–23.
41. Altshuler DM, Durbin RM, Abecasis GR, et al. An integrated map of genetic variation from 1,092 human genomes. Nature 2012;491:56–65.
42. Didelot X, Bowden R, Wilson DJ, et al. Transforming clinical microbiology with bacterial genome sequencing. Nat Rev Genet 2012;13:601–12.
43. Golomb S. Run-length encodings. IEEE Trans Inform Theory 1965;12:399–401.
44. Elias P. Universal codeword sets and representations of the integers. IEEE Trans Inform Theory 1975;21:194–203.
45. Huffman DA. A method for the construction of minimum-redundancy codes. Proc IRE 1952;40:1098–101.
46. Miller W, Myers EW. A file comparison program. Software Pract Exp 1985;15:1025–40.
47. Hunt JJ, Vo K-P, Tichy WF. Delta algorithms: an empirical analysis. ACM Trans Software Eng Methodol (TOSEM) 1998;7:192–214.
48. Pinho AJ, Ferreira PJSG, Neves AJR, et al. On the representability of complete genomes by multiple competing finite-context (Markov) models. PLoS One 2011;6:e21588.
49. Ahn S-M, Kim T-H, Lee S, et al. The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Res 2009;19:1622–29.
50. Huala E, Dickerman AW, Garcia-Hernandez M, et al. The Arabidopsis Information Resource (TAIR): a comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant. Nucleic Acids Res 2001;29:102–5.
51. Cock PJA, Fields CJ, Goto N, et al. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 2010;38:1767–71.
52. Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics 2009;25:2078–9.
53. Fritz MH-Y, Leinonen R, Cochrane G, et al. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res 2011;21:734–40.
54. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10:R25.
55. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009;25:1754–60.
56. Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinformatics 2010;11:473–83.
57. Daily K, Rigor P, Christley S, et al. Data structures and compression algorithms for high-throughput sequencing technologies. BMC Bioinformatics 2010;11:514.
58. Baldi P, Benz RW, Hirschberg DS, et al. Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval. J Chem Inform Model 2007;47:2098–109.
59. Pevzner PA, Tang HX, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA 2001;98:9748–53.
60. Storer JA. Data Compression: Methods and Theory. New York, NY, USA: Computer Science Press, Inc., 1988.
61. Nelson M. The Data Compression Book. Foster City, CA, USA: IDG Books Worldwide, Inc., 1991.
62. Dementiev R, Sanders P, Schultes D, et al. Engineering an external memory minimum spanning tree algorithm. Explor New Front Theoretical Informat, International Federation for Information Processing (IFIP) 2004;155:195–208.
63. Burrows M, Wheeler DJ. A block-sorting lossless data compression algorithm. SRC Research Report 1994.
64. Sahinalp SC, Vishkin U. Efficient approximate and dynamic matching of patterns using a labeling paradigm. In: Proceedings of the 37th Annual Symposium on Foundations of Computer Science, Burlington, Vermont 1996;320–8.
65. Wan R, Anh VN, Asai K. Transformations for the compression of FASTQ quality scores of next-generation sequencing data. Bioinformatics 2012;28:628–35.
66. Ochoa I, Asnani H, Bharadia D, et al. QualComp: a new lossy compressor for quality scores based on rate distortion theory. BMC Bioinformatics 2013;14:187.
67. Cover TM, Thomas JA. Elements of Information Theory. New York, NY, USA: John Wiley & Sons, Inc., 1991.
68. Janin L, Rosone G, Cox AJ. Adaptive reference-free compression of sequence quality scores. Bioinformatics; doi:10.1093/bioinformatics/btt257 (Advance Access publication 9 May 2013).
69. Bauer MJ, Cox AJ, Rosone G, et al. Lightweight LCP construction for next-generation sequencing datasets. Algorithm Bioinformatics, Lecture Notes in Computer Science (LNCS) 2012;7534:326–37.
70. Loh P-R, Baym M, Berger B. Compressive genomics. Nat Biotechnol 2012;30:627–30.
71. Ferragina P, Manzini G. Indexing compressed text. J ACM 2005;52:552–81.

72. Makinen V, Navarro G, Siren J, et al. Storage and retrieval of highly repetitive sequence collections. J Comput Biol 2010;17:281–308.
73. Makinen V, Navarro G, Siren J, et al. Storage and retrieval of individual genomes. Res Comput Mol Biol, Lecture Notes in Computer Science (LNCS) 2009;5541:121–37.
74. Arroyuelo D, Navarro G, Sadakane K. Stronger Lempel-Ziv based compressed text indexing. Algorithmica 2012;62:54–101.


75. Afgan E, Baker D, Coraor N, et al. Harnessing cloud computing with Galaxy Cloud. Nat Biotechnol 2011;29:972–4.
76. Stein LD. The case for cloud computing in genome informatics. Genome Biol 2010;11:207.
77. Cochrane G, Cook CE, Birney E. The future of DNA sequence archiving. GigaScience 2012;1:2.

287KB Sizes 0 Downloads 0 Views