Journal of Bioinformatics and Computational Biology, Vol. 13, No. 3 (2015) 1541003 (18 pages). © Imperial College Press. DOI: 10.1142/S0219720015410036

FQC: A novel approach for efficient compression, archival, and dissemination of fastq datasets


Anirban Dutta, Mohammed Monzoorul Haque, Tungadri Bose, C. V. S. K. Reddy and Sharmila S. Mande*

Bio-Sciences R&D Division, TCS Innovation Labs, Tata Consultancy Services Limited, 54-B, Hadapsar Industrial Estate, Pune 411013, Maharashtra, India
*[email protected]

Received 1 September 2014; Revised 17 November 2014; Accepted 24 December 2014; Published 19 March 2015

Sequence data repositories archive and disseminate fastq data in compressed format. In spite of having relatively lower compression efficiency, data repositories continue to prefer GZIP over available specialized fastq compression algorithms. Ease of deployment, high processing speed, and portability are the reasons for this preference. This study presents FQC, a fastq compression method that, in addition to providing significantly higher compression gains over GZIP, incorporates features necessary for universal adoption by data repositories/end-users. This study also proposes a novel archival strategy which allows sequence repositories to simultaneously store and disseminate lossless as well as (multiple) lossy variants of fastq files, without necessitating any additional storage requirements. For academic users, Linux, Windows, and Mac implementations (both 32- and 64-bit) of FQC are freely available for download at: https://metagenomics.atc.tcs.com/compression/FQC.

Keywords: Data compaction and compression; algorithms for biological data management; NGS data; sequencing data archival.

1. Introduction

Recent advances in DNA sequencing technology have drastically increased the overall throughput of sequencing experiments. Present day sequencing platforms can, in principle, generate millions of DNA fragments in a single overnight run. The size of sequencing data (in a typical sequencing experiment) is currently in the order of gigabytes to a few terabytes. Specialized algorithms are required for efficient compression, archival, and dissemination of such huge volumes of data.

*Corresponding author.


In addition to reducing storage costs, archiving data in a compressed format also has the advantage of facilitating quicker data transmission over the network. Major public repositories of sequence data (e.g. NCBI, EBI, JGI, CAMERA, etc.) store/archive raw sequence data in "fastq" format. A file in fastq format consists of multiple fastq records concatenated in the form of a single text file. Each fastq record stores three streams of information, viz., (1) nucleotide sequence information, (2) quality values for the corresponding sequenced nucleotides (encoded as ASCII characters), and (3) associated sequence identifiers (also referred to as headers). The size of individual fastq files typically ranges between a few hundred megabytes and several gigabytes. Storing these files (and enabling their download) in a compressed format is therefore more of a necessity than a convenience.
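For illustration, a fastq record follows the four-line layout shown below (a representative record in the standard format, not drawn from the datasets analyzed in this study): line 1 is the header, line 2 the nucleotide sequence, line 3 the optional repetition of the header, and line 4 the ASCII-encoded quality values, one per base.

```
@SRR000001.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR000001.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
```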

Most compression algorithms for fastq files segregate the data into homogeneous streams of information (viz., header, sequence, and quality). Individual streams are subsequently compressed using methods appropriate for the type/format of data present in the respective stream, and the resulting compressed data streams are bundled as a single archive. This strategy is observed to provide better compression gains than compressing un-segregated information (the strategy typically adopted by general purpose compression tools such as GZIP, BZIP2, and LZMA). In spite of the better compression gains achieved by fastq-specific compression algorithms, as demonstrated by some recently published approaches, viz., DSRC,1 QUIP,2 and FQZCOMP,3 major public sequence repositories continue to employ GZIP (a general purpose compression method) for compressing fastq files. Reliability, ease of deployment, high compression/decompression speeds, and portability across a range of system architectures (32-bit, 64-bit) and operating systems are the likely reasons for data repositories to prefer GZIP over the available specialized fastq compressors.

The compression approaches discussed above belong to the "lossless" category, wherein the integrity of all three streams of information (viz., header, sequence, and quality) is maintained. In order to achieve better compression gains, several "lossy" approaches (for fastq files) have recently been proposed.2–6 Given that any "loss" in nucleotide sequence information is, in principle, unacceptable to end-users, these approaches are "lossy" only with respect to the quality and/or the header portions of fastq files. In general, the strategy of lossy approaches involves (1) adoption of a reduced character mapping scheme, wherein similar quality values (over a defined range) are rounded off to singular values and are subsequently subjected to compression, and/or (2) replacing header information with shorter unique identifiers. Loss in header/quality information (in fastq files compressed using a lossy approach) is "acceptable" as long as the results of downstream analyses are not significantly different from those obtained using the original file. For obvious reasons, the threshold of "acceptable loss" depends on the type of downstream analysis (e.g. read-mapping, taxonomic binning, variant detection, functional annotation, etc.) being performed on the fastq file.


Although significant compression gains may be achieved using lossy compression methods, existing implementations of these approaches are not suitable for adoption by sequence repositories. The probable reason preventing such adoption is the varying threshold of "acceptable loss" mentioned previously. To cater to different types of end-users, a repository would need to store either (1) multiple compressed versions of the fastq file, each with a different level of "information-loss," or (2) the lossless compressed version of the original fastq file, generating "lossy versions" on-the-fly as and when requested by individual end-users. While the first alternative results in data redundancy and requires expanded storage capacity, the second alternative significantly increases the compute requirements at the archiver's end, thereby imposing a "waiting-time" on end-users downloading the data. Moreover, in a scenario wherein an end-user who has already downloaded a lossy variant of a fastq file subsequently requests its lossless variant (for a different set of downstream applications), a duplication of download efforts, time, and costs would be entailed.

This paper presents (1) FQC, an efficient fastq compression approach, and (2) an associated (novel) data archival and dissemination strategy, which enables both end-users and sequence repositories (i.e. data archivers) to store/disseminate fastq data in an efficient manner. The proposed archival strategy helps data archivers simultaneously provide multiple variants (lossy/lossless) of any archived fastq file, without necessitating extra storage space and/or compute power. Furthermore, to address portability issues, the FQC compression-decompression framework has been implemented for all popular OS environments (viz., Windows/Mac/Linux) and for different system architectures (32-bit and 64-bit).

2. Materials and Methods

The present approach has been developed for achieving the following objectives:

(1) Design of a lossless compression technique that can efficiently compress any standard fastq file (generated using any of the existing sequencing platforms), and attain a compression ratio (CR) relatively better than that obtained using existing approaches.

(2) Implementation of a lossy compression strategy for the header and quality streams, to enable creation of several lossy-compressed variants of a fastq file (at various levels of information loss) suitable for different types of downstream analyses.

(3) Development of a novel archival strategy that avoids data redundancy, yet enables simultaneous storage/dissemination of multiple variants (both lossless and lossy) of a fastq file.

The following sections describe the methods adopted for achieving the above objectives.


2.1. Lossless compression of fastq files

As mentioned previously, fastq files contain three genres of information, i.e. headers, nucleotide sequences, and quality values, each being distinctly homogeneous with respect to its content and/or formatting style. Since compression algorithms achieve higher efficiency on files/data-streams containing homogeneous information, the FQC algorithm parses a given fastq file into three data streams and processes them separately. Details of the processing steps for the various streams are as follows:

Header stream: Fastq headers (i.e. the first line of every fastq record) typically contain unique sequence identifiers followed by several fields of information (some of which are redundant). In cases where the fastq file is obtained from a sequencer generating variable-length sequences (like Roche-454), the lengths of the sequences are also captured in the headers. Header information in the first line is sometimes repeated in the third line (i.e. the quality header) of each fastq record. FQC performs a one-time indexing of the invariant portions identified within/across headers. The variable information between consecutive records is stored in a delta-encoded format, as sketched below.
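The paper does not provide source-level detail of this indexing/delta-encoding step; the following minimal Python sketch (all names illustrative, not part of FQC) shows one way the idea can be realized, assuming the headers of consecutive records share a common template:

```python
import re

NUM = re.compile(r"\d+")

def split_header(header):
    """Separate a header into its invariant template and numeric fields,
    e.g. "@X.1 s_7:5:1:817:345" -> ("@X.# s_#:#:#:#:#", [1, 7, 5, 1, 817, 345])."""
    template = NUM.sub("#", header)
    fields = [int(m) for m in NUM.findall(header)]
    return template, fields

def delta_encode_headers(headers):
    """Store the invariant template once, the first record's fields verbatim,
    and for every later record only the field-wise differences."""
    template, prev = split_header(headers[0])
    deltas = [prev]
    for h in headers[1:]:
        _, fields = split_header(h)   # assumes a stable template across records
        deltas.append([a - b for a, b in zip(fields, prev)])
        prev = fields
    return template, deltas
```

Decoding reverses the process by cumulatively summing the deltas and substituting the fields back into the template. A production implementation would also need to preserve field widths (e.g. leading zeros) and handle records whose headers deviate from the common template; the sketch ignores these details.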

Sequence and quality streams: The nucleotide sequence information of each fastq record, taken together, makes up the sequence stream. The quality stream comprises a string of quality values (base-calling quality or PHRED score) represented as single ASCII characters, having one-to-one correspondence with the characters in the sequence stream. The range of quality values (and their corresponding ASCII encoding) varies across different sequencing platforms (Supplementary Table 1). FQC processes the sequence and quality streams simultaneously in the following manner.

(a) The nucleotide sequence is generally represented in a "base-space" representation, i.e. with standard IUPAC recognized nucleotide characters (A, T, G, C, and N). In specific cases, e.g. sequences generated by ABI-SOLiD sequencers, a "color-space" representation consisting of the characters 0, 1, 2, 3, 4, and "." is used. FQC first determines the type of character encoding used in the sequence stream and processes the file accordingly. For color-space representations, FQC converts the sequence stream into a "pseudo base-space" representation following the character-mapping given in Supplementary Table 2.

(b) A, T, G, and C are the most frequent nucleotide characters in the sequence stream, followed by the ambiguous character "N." While the corresponding quality values of A, T, G, and C in the quality stream vary over a wide range (on average between 0 and 40), the quality values of the ambiguous characters are restricted to very low values (hence the ambiguity). Moreover, in several low quality fastq records, stretches of ambiguous bases are frequently encountered. For every occurrence of the ambiguous character "N," FQC replaces the corresponding quality value with a new ASCII character (drawn from a set of hitherto unused ASCII characters in the range 130–210). This new character indicates both the ambiguity in the sequence stream and the original quality value. All ambiguous characters are subsequently removed from the sequence stream. Other ambiguous characters remaining in the sequence stream are likewise removed, and their position, type, and quality values are recorded separately.


Although the number of characters in the quality stream remains the same after the above transformations, the removal of all ambiguous characters significantly reduces the size of the sequence stream. Supplementary Table 3 summarizes the rules involved in transforming the base and quality streams.

(c) The modified sequence stream (now devoid of ambiguous characters) contains only the four standard nucleotides, viz., A, T, G, and C. FQC employs a simple 2-bits/base encoding scheme to further reduce the size of the sequence stream by a factor of 4.

(d) Due to the nature of the sequencing chemistry and imaging techniques employed by sequencing platforms, nucleotide bases in close context share similar quality values. This often results in small stretches of quality-repeats within the quality stream. FQC captures such repeats using appropriate "run-length" encoding techniques (a minimal sketch of steps (c) and (d) is given at the end of this section).

Depending on the nature of their contents, the processed header, sequence, and quality streams are further compressed using appropriate general purpose compression algorithms. While the processed header and sequence streams are compressed using the LZMA algorithm (Lempel-Ziv-Markov chain algorithm, a variant of LZ77), the quality stream is compressed using the ppmd compression algorithm (prediction by partial matching). FQC uses the implementations of these algorithms available in the 7-Zip file archiver software (version 9.20, available under the GNU LGPL license).

It is important to note that many of the steps described above have been employed in earlier published studies on compression of biological data. For instance, segregation of header and sequence data and processing them as distinct data streams has been adopted in earlier studies like DSRC,1 DELIMINATE,7 etc. On a similar note, standard data compression techniques, such as run-length encoding of repeated stretches of similar characters and delta encoding of numeric fields in headers, have been used in approaches like FQZCOMP,3 KungFQ,6 BIND,8 MFCompress,9 etc.
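The following Python fragment is a schematic illustration of steps (c) and (d), not FQC's actual code: it packs a pure-ACGT sequence at 2 bits/base and run-length encodes a quality string.

```python
# Step (c): pack A/C/G/T into 2 bits per base (4 bases per byte).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_2bit(seq):
    """Pack a string over {A,C,G,T} into bytes; the last byte is
    zero-padded if the sequence length is not a multiple of 4."""
    out = bytearray()
    byte = 0
    for i, base in enumerate(seq):
        byte = (byte << 2) | CODE[base]
        if i % 4 == 3:
            out.append(byte)
            byte = 0
    if len(seq) % 4:                        # flush the final partial byte
        byte <<= 2 * (4 - len(seq) % 4)
        out.append(byte)
    return bytes(out)

# Step (d): run-length encode stretches of repeated quality characters.
def rle_encode(qual):
    """Return (character, run_length) pairs for a quality string,
    e.g. "IIII99I" -> [("I", 4), ("9", 2), ("I", 1)]."""
    runs = []
    i = 0
    while i < len(qual):
        j = i
        while j < len(qual) and qual[j] == qual[i]:
            j += 1
        runs.append((qual[i], j - i))
        i = j
    return runs
```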

2.2. Lossy compression of fastq files

Besides being capable of efficiently compressing/decompressing fastq files in a lossless fashion, the FQC implementation also allows archivers/end-users to generate lossy-compressed variants of fastq files. Decompression of such lossy variants generates fastq files which differ from the original file in terms of (1) header information and/or (2) quality-value resolution. The rationale and the procedure followed by FQC for the generation of lossy variants are described in the following paragraphs.

In certain cases, the size of the header stream exceeds the sizes of the sequence and quality streams. Moreover, most downstream analyses usually do not require the information provided in fastq headers.


Given this observation, FQC makes provision to generate a lossy compressed file wherein the original header information is discarded. While decompressing such a file, FQC assigns unique numeric headers to each fastq record.

Typically, quality values in fastq files range between 0 and 40. End-users performing downstream analyses utilize quality values within fastq files as a confidence metric to identify/screen sequence regions (or fastq records) of high quality. Encoding this range of quality values requires an equivalent number of ASCII characters. Previous studies have indicated that encoding quality values using a reduced character set, wherein quality values in a close range are represented with a single character (i.e. storing quality at a lower resolution), does not significantly alter the outcomes of downstream analyses.2 The reduced character set makes the data more homogeneous, thereby making it more amenable to compression. FQC allows end-users to generate lossy fastq variants at three levels of quality discretization (resolution levels 1 to 3).

Level 1: All quality values up to 3 are replaced with 0. All "odd" quality values above 3 are rounded off to their immediately lower "even" value.

Level 2: All quality values up to 3 are replaced with 0. For quality values greater than 3, every subsequent (nonoverlapping) set of four values is rounded off to the second value in that set. For example, the quality value 5 replaces the quality values 4, 5, 6, and 7, while the quality values 8, 9, 10, and 11 are replaced by 9.

Level 3: All quality values up to 7 are replaced with 0. For quality values greater than 7, every subsequent (nonoverlapping) set of eight values is rounded off to the fourth value in that set.

It is to be noted that the rules adopted in FQC for defining the three levels of quality discretization are representative in nature. Based on user-specific needs, the FQC implementation can be customized with a different set of rules.
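Read literally, the three rules translate into the following sketch (our own rendering of the rules; the actual FQC binning code is not shown in the paper):

```python
def discretize(q, level):
    """Map an integer PHRED score q to its reduced representative,
    following a literal reading of the level definitions above."""
    if level == 1:
        if q <= 3:
            return 0
        return q if q % 2 == 0 else q - 1   # round odd values down to even
    if level == 2:
        if q <= 3:
            return 0
        return ((q - 4) // 4) * 4 + 5       # {4..7} -> 5, {8..11} -> 9, ...
    if level == 3:
        if q <= 7:
            return 0
        return ((q - 8) // 8) * 8 + 11      # {8..15} -> 11, {16..23} -> 19, ...
    raise ValueError("level must be 1, 2 or 3")

# Example: level 2 maps {4,5,6,7} -> 5 and {8,9,10,11} -> 9, so
# [discretize(q, 2) for q in (3, 4, 7, 8, 11)] == [0, 5, 5, 9, 9]
```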

2.3. Archival and dissemination strategy for fastq files

Existing lossy-compression techniques provide significant compression gains and consequently facilitate easier dissemination/download. However, none of these techniques retain the information discarded during lossy-compression, thus rendering the process irreversible. Acknowledging this limitation, the FQC implementation retains this discarded information in the form of "patch" files. A "patch," in combination with the corresponding lossy-compressed file, can thus be used to regenerate the original fastq file. This provision (of storing the discarded information as a "patch") in FQC opens up a new data archiving paradigm for sequence repositories. The suggested archival strategy involves compressing (at level 1 quality discretization) a given fastq file (using FQC) to initially generate a lossy-compressed fastq variant (hereafter referred to as BF1: "Base file 1"). The information lost at this stage is stored in a "patch" (hereafter referred to as PF1: "Patch file 1").


Fig. 1. A schematic representation of the proposed archival strategy.

Processing BF1 in a similar fashion (this time using level 2 quality discretization) results in BF2 and PF2. Repeating the same process on BF2 (using level 3 quality discretization) subsequently generates BF3 and PF3. The archiver is therefore required to store only one base file (BF3) and the three patch files (PF1, PF2, and PF3). An appropriate combination of base and patch files can be downloaded by end-users and provided to FQC for regenerating either the original fastq file or any intermediate lossy-compressed variant. In addition, FQC allows users to update a lossy variant of a fastq file to its lossless counterpart (or to a variant with less information loss) by downloading only the required patch file(s). Figure 1 is a schematic representation of the proposed archival strategy.
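The internal format of FQC's base and patch files is not described at this level of detail; the toy model below (reusing the discretize function sketched in Sec. 2.2, all other names hypothetical) only illustrates why the chain BF3 + PF3 + PF2 + PF1 suffices for exact reconstruction: each patch stores the per-position residual discarded by one discretization step, and applying the patches in reverse order undoes the chain. In FQC itself, the base and patch streams would additionally be entropy-compressed.

```python
def make_base_and_patch(quals, level):
    """Discretize a list of PHRED scores and keep the residuals
    needed to undo the step (the 'patch')."""
    base = [discretize(q, level) for q in quals]
    patch = [q - b for q, b in zip(quals, base)]
    return base, patch

def apply_patch(base, patch):
    """Recover the higher-resolution values from a base + patch pair."""
    return [b + p for b, p in zip(base, patch)]

# Archival chain: original -> (BF1, PF1) -> (BF2, PF2) -> (BF3, PF3).
quals = [2, 5, 17, 38, 40, 9]
bf1, pf1 = make_base_and_patch(quals, 1)
bf2, pf2 = make_base_and_patch(bf1, 2)
bf3, pf3 = make_base_and_patch(bf2, 3)

# A user holding BF3 reaches any higher resolution by applying the
# corresponding patches in reverse order:
assert apply_patch(bf3, pf3) == bf2
assert apply_patch(apply_patch(apply_patch(bf3, pf3), pf2), pf1) == quals
```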

3. Results

Compression efficiency is generally evaluated in terms of CR (i.e. the size of the compressed file as compared to the size of the original file) and the time taken for compression/decompression. While CR is primarily dependent on algorithmic design, compression/decompression time is additionally dependent on factors related to system hardware (e.g. processor speed, available RAM, number of available/allocated processing cores, etc.).

Typically, higher availability of primary memory allows appropriately designed algorithms to provide relatively higher levels of compression, but generally increases the overall time taken for compression.


3.1. Evaluation of lossless compression

Depending on the sequencing platform used, fastq datasets contain either fixed-length sequence reads (e.g. from Illumina and SOLiD) or variable-length reads (e.g. from Roche-454 and Ion Torrent). The compression efficiency of FQC was validated with both types of fastq files using 13 publicly available fastq datasets. Eight of these datasets (generated using Illumina, SOLiD, and Roche-454 sequencing platforms) had previously been used by Deorowicz and Grabowski1 for benchmarking the DSRC algorithm. The ninth fastq dataset (a 41 GB SOLiD file) used by Deorowicz and Grabowski1 could not be downloaded from the NCBI Trace Archive, and was compensated for in our validation set by a similar sized SOLiD fastq file (SRR070253) from the same repository. In addition, a very small Illumina dataset (SRR032638, of size 45 MB) and three datasets generated using the Ion Torrent sequencing platform (downloaded from different public repositories like NCBI, DDBJ, EBI, and the Ion Torrent community website) were also used for validation.

For the purpose of comparison, the 13 datasets were compressed using FQC, GZIP version 1.4, LZMA version 9.20 (available as an implementation within the 7-Zip package), DSRC version 1.02, QUIP version 1.1.6, and FQZCOMP version 4.5. All the compression algorithms were run using default parameters. In order to handle certain implementation issues of QUIP and FQZCOMP, these programs had to be executed with suitably modified versions of the test datasets. Results obtained with all programs were compared in terms of CR as well as compression/decompression time. Since CR indicates the ratio of the size of the compressed dataset to that of the original dataset, a better compression algorithm will have a lower numerical value of CR. Given that the implementations available for programs like DSRC work only in a 64-bit environment, all validation experiments were performed on a 64-bit desktop having a 2.13 GHz Intel quad-core processor with 8 GB of RAM. It may be noted that the implementation provided for FQC works in both 32-bit and 64-bit environments.

Table 1 provides a comparison of the CRs obtained using FQC and the other compression algorithms. Results with variable-length datasets indicate that FQC obtained the lowest numerical values of CR, indicating its relatively better compression efficiency. In particular, the compression efficiency of FQC with Ion Torrent datasets was found to be significantly superior to that of all other tested algorithms; the CRs obtained by FQC for these datasets were 16–21% better than their closest counterparts. In the case of fixed-length datasets, QUIP and FQZCOMP were observed to obtain the best CRs. However, on average, their CRs were observed to be only 3–4% better than those of FQC.

Table 1. Results of fastq compression.

Each cell gives SCD (CR), where SCD is the size of the compressed dataset in megabytes and CR = (SCD / size of original dataset) × 100. Original dataset sizes are also in megabytes. A lower CR indicates better compression efficiency; the best CR for each dataset is marked with an asterisk, and a dash indicates that the program failed to process the dataset.

Fixed-length reads:

| Dataset | Platform | Size | FQC | GZIP | LZMA | DSRC | FQZCOMP | QUIP |
|---|---|---|---|---|---|---|---|---|
| SRR032638 | Illumina | 45 | 8 (17.8*) | 13 (28.9) | 11 (24.4) | 9 (20.0) | 8 (17.8*) | 10 (22.2) |
| SRR013951_2 | Illumina | 3,043 | 883 (29.0) | 1,289 (42.4) | 1,062 (34.9) | 941 (30.9) | – | 869 (28.6*) |
| SRR027520_1 | Illumina | 4,586 | 1,058 (23.1) | 1,631 (35.6) | 1,328 (29.0) | 1,151 (25.1) | 1,027 (22.4) | 1,020 (22.2*) |
| SRR027520_2 | Illumina | 4,586 | 1,081 (23.6) | 1,667 (36.3) | 1,359 (29.6) | 1,176 (25.6) | 1,052 (22.9) | 1,042 (22.7*) |
| SRR007215_1 | SOLiD | 663 | 98 (14.8) | 163 (24.6) | 141 (21.3) | 102 (15.4) | 92 (13.9*) | 94 (14.2) |
| SRR010637 | SOLiD | 1,990 | 375 (18.8) | 587 (29.5) | 512 (25.7) | 388 (19.5) | 356 (17.9*) | 364 (18.3) |
| SRR070253 | SOLiD | 41,086 | 8,203 (20.0) | 12,614 (30.7) | 10,786 (26.3) | – | – | 7,971 (19.4*) |

Variable-length reads:

| Dataset | Platform | Size | FQC | GZIP | LZMA | DSRC | FQZCOMP | QUIP |
|---|---|---|---|---|---|---|---|---|
| SRR001471 | 454 | 206 | 42 (20.4*) | 66 (32.0) | 52 (25.2) | 46 (22.3) | 43 (20.9) | 46 (22.3) |
| SRR003177 | 454 | 1,141 | 238 (20.9*) | 371 (32.5) | 297 (26.0) | 269 (23.6) | 249 (21.8) | 256 (22.4) |
| SRR003186 | 454 | 845 | 190 (22.5*) | 291 (34.4) | 234 (27.7) | 210 (24.9) | 196 (23.2) | 202 (23.9) |
| ERR039503 | Ion Torrent | 5,686 | 1,077 (18.9*) | 1,645 (28.9) | 1,412 (24.8) | 1,315 (23.1) | – | – |
| B7-143 | Ion Torrent | 2,208 | 497 (22.5*) | 913 (41.3) | 705 (31.9) | – | 673 (30.5) | 594 (26.9) |
| SRR515926 | Ion Torrent | 658 | 162 (24.6*) | 279 (42.4) | 222 (33.7) | 228 (34.7) | 197 (29.9) | 206 (31.3) |


It is important to note that none of the specialized fastq compression algorithms (except FQC) were able to process all 13 test datasets successfully: QUIP, FQZCOMP, and DSRC failed while compressing and/or decompressing one or more of the evaluated test datasets. In contrast, the general purpose compression tools (GZIP and LZMA), in spite of providing lower compression gains, were able to reliably perform the compression and decompression steps (while maintaining data integrity).

Given that almost all sequence repositories currently employ GZIP as the preferred tool for compression/archival of fastq datasets, the percentage improvement in compression ratios (PICR) of all other tested tools/algorithms over GZIP was evaluated. Results of this evaluation (summarized in Fig. 2) indicate that, on average (across all datasets), the specialized compression algorithms (FQC, DSRC, FQZCOMP, and QUIP) achieved PICR values of 18–46% over GZIP. For both fixed-length and variable-length datasets, FQC was consistently observed to achieve significant compression gains over GZIP. More importantly, unlike all the other tested specialized fastq compression methods, FQC maintained the integrity of compressed/decompressed data for all the test datasets.

With respect to compression speed, although FQC did not achieve compression rates as fast as FQZCOMP and DSRC (the quickest among all the tested programs), its compression speed was comparable to, or better than, that of QUIP, GZIP, and LZMA (Fig. 3).
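The exact formula for PICR is not spelled out in the text; a definition consistent with the reported quantities would be the relative reduction in CR with respect to GZIP:

$$\mathrm{PICR}_{\mathrm{tool}} = \frac{\mathrm{CR}_{\mathrm{GZIP}} - \mathrm{CR}_{\mathrm{tool}}}{\mathrm{CR}_{\mathrm{GZIP}}} \times 100.$$

For example, taking the Table 1 values for SRR013951_2 (CR of 29.0 for FQC against 42.4 for GZIP), this gives PICR = (42.4 - 29.0)/42.4 × 100 ≈ 31.6% under the assumed definition.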

Fig. 2. PICR of various compression algorithms as compared to GZIP.


Fig. 3. A comparison of the cumulative time required for compression and decompression of validation datasets (grouped according to sequencing platforms). The depicted compression and decompression times are normalized to indicate relative performance of the compared algorithms. Note: "FQZ" refers to the method "FQZCOMP."

Decompression speeds of FQC, FQZCOMP, and QUIP were observed to be comparable to, or slightly higher than, their respective compression speeds. Although GZIP and LZMA exhibited extremely rapid decompression (as compared to their compression speeds), it is pertinent to note that these general purpose compression methods lose out significantly in terms of CR as compared to the specialized fastq compression methods.

A scatter plot indicating the compression efficiency of FQC and the other compression methods is depicted in Fig. 4. In this figure, the CRs obtained using the various compression methods (including FQC) are plotted on the x-axis, and the total processing times (for compression and decompression of the validation datasets) are plotted along the y-axis. The values indicated in this figure correspond to only those datasets which could be successfully compressed and decompressed by all the compared methods. For ease of comparison, the datasets were segregated as "fixed-length" and "variable-length." Algorithms with optimal compression efficiency (with respect to both CR and processing time) are expected to have their results plotted nearest to the origin. As expected, the results depicted in Fig. 4 clearly indicate the superiority of the specialized compression approaches over the general purpose compression methods. Moreover, the compression efficiency obtained with fixed-length datasets was observed to be significantly better than that obtained with datasets having variable-length sequences. For "fixed-length" datasets, FQZCOMP was observed to perform the best, followed closely by FQC and QUIP. In contrast, FQC was observed to significantly outperform the other specialized compression methods on "variable-length" datasets.


Fig. 4. Scatter plot illustrating a comparison of the compression ratios attained (on x-axis) and the total processing time required (on y-axis) by FQC and the other compression methods. Arrows on the x-axis indicate the best compression ratios obtained for fixed-length and variable-length datasets. Note: "FQZ" refers to the method "FQZCOMP."

3.2. Evaluation of archival strategy

The objective of incorporating a provision for lossy compression in FQC was to enable the implementation of the novel archival and dissemination strategy (Fig. 1). As one would expect, the proposed archival strategy can be considered for adoption by sequence repositories only if the cumulative size of the compressed base and patch files generated (by FQC) for a given fastq file is significantly less than the cumulative size required for storing multiple compressed variants of the original data file (i.e. size of BF3 + PF3 + PF2 + PF1 ≪ size of BF3 + BF2 + BF1 + compressed original data file). Furthermore, in an ideal scenario, the cumulative size of the compressed base and patch files should be comparable to the size of the compressed (lossless) original fastq file. Once these two criteria are satisfied, sequence repositories will be able to simultaneously cater to different categories of end-users (requesting either lossy or lossless variants of fastq files) without necessitating any significant expansion of their current storage and/or compute capacities.

To check the feasibility of implementing the proposed archival strategy using FQC, the individual and cumulative sizes of the base and patch files (generated by FQC) corresponding to the fastq files SRR013951_2, SRR007215_1, SRR001471, and B7-143 were evaluated. These files represent data generated from four diverse sequencing platforms (viz., Illumina, SOLiD, Roche-454, and Ion Torrent). Results in Table 2 show that the cumulative size of the compressed base and patch files (generated by FQC for a given fastq file) was around 10–25% higher than the size of the corresponding fastq file compressed with FQC (lossless).


Table 2. Comparison of file sizes to be archived.

Comparison of the size of data (in megabytes) to be archived using GZIP (the existing archival method), FQC (lossless), and the proposed archival strategy, for four datasets representing diverse sequencing technologies.

| | SRR013951_2 | SRR007215_1 | SRR001471 | B7-143 |
|---|---|---|---|---|
| Sequencing platform | Illumina | SOLiD | 454 | Ion Torrent |
| Size of original fastq file | 3,043 | 663 | 206 | 2,208 |
| Size of fastq file compressed by GZIP | 1,289 | 163 | 66 | 913 |
| Size of fastq file compressed by FQC | 883 | 98 | 42 | 497 |
| Size of compressed BF3 file | 553 | 57 | 31 | 244 |
| Size of compressed PF1 file | 154 | 17 | 7 | 129 |
| Size of compressed PF2 file | 125 | 14 | 7 | 123 |
| Size of compressed PF3 file | 181 | 20 | 7 | 125 |
| Storage required with proposed strategy (BF3 + PF3 + PF2 + PF1) | 1,013 | 108 | 52 | 621 |

Moreover, this cumulative size was observed to be significantly lower (approximately 27–51%) than the size of the fastq file compressed with GZIP, the popularly used storage/archival tool. These results suggest that the proposed archival strategy can be adopted by sequence repositories without necessitating any additional storage resources.

Furthermore, the results provided in Table 3 highlight the benefits that accrue to end-users under such an archival strategy. End-users content with fastq data having a low resolution of quality values benefit in particular, as the amount of data they need to download is significantly less than the original (compressed) fastq file. For example, an end-user downloading the BF3 file (corresponding to the fastq file B7-143) for downstream analysis needs to download only 244 MB of data. This constitutes only 27% of the size of the compressed original file (913 MB) obtained with GZIP (the method currently adopted by sequence repositories). Even for users requesting the fastq file at a relatively higher quality resolution (say at level 1), the download size (BF3 + PF3 + PF2) is only around 54% of the GZIP compressed file. The proposed archival strategy is also expected to benefit sequence repositories (in terms of network bandwidth), especially in scenarios where a repository simultaneously caters to download requests from a large number of end-users requiring "lossy" data.

4. Discussion

Popular sequence repositories are observed to employ general purpose compression algorithms (GPCAs) like GZIP, LZMA, and BZIP2 for compressing/archiving fastq files. However, given their generic approach to compression (designed to handle a wide variety of file types), GPCAs do not guarantee optimal performance with all types of input data formats (including fastq). In contrast, specialized algorithms, being fine-tuned for identifying and modeling distinct features of specific input file types/formats, are generally observed to provide significantly higher levels of compression than GPCAs.

Table 3. Comparison of file sizes to be downloaded.

Download size (in megabytes) required by a user for reconstruction of fastq files at different quality resolutions, using the proposed archival strategy. Values in parentheses indicate the download size expressed as a percentage of the size of the original dataset compressed with GZIP.

| | SRR013951_2 | SRR007215_1 | SRR001471 | B7-143 |
|---|---|---|---|---|
| Sequencing platform | Illumina | SOLiD | 454 | Ion Torrent |
| Size of original fastq file | 3,043 | 663 | 206 | 2,208 |
| Size of fastq file compressed by GZIP | 1,289 | 163 | 66 | 913 |
| Lowest quality resolution, level 3 (BF3) | 553 (42.9) | 57 (35.0) | 31 (47.0) | 244 (26.7) |
| Intermediate quality resolution, level 2 (BF3 + PF3) | 734 (56.9) | 77 (47.2) | 38 (57.6) | 369 (40.4) |
| Intermediate quality resolution, level 1 (BF3 + PF3 + PF2) | 859 (66.6) | 91 (55.8) | 45 (68.2) | 492 (53.9) |
| Lossless reconstruction (BF3 + PF3 + PF2 + PF1) | 1,013 (78.6) | 108 (66.3) | 52 (78.8) | 621 (68.0) |


For fastq files, existing specialized compression algorithms use a variety of strategies to improve CR (as compared to that obtained using GPCAs). Four recently reported algorithms specialized for compressing sequence data in fastq format, namely DSRC,1 QUIP,2 FQZCOMP,3 and SOLiDZipper,10 provide reasonably higher compression gains than GPCAs like GZIP and LZMA. Compression gains by these methods are primarily achieved by (a) differentially modeling/encoding the different streams of information (within fastq files), and/or (b) selectively combining sequence and quality information for attaining optimal compression.

In generic terms, the FQC algorithm presented in this paper utilizes a compression strategy more or less similar to other specialized fastq compression methods. However, subtle differences in handling the three data streams help FQC achieve relatively higher compression efficiency. Furthermore, the additional option (provided in FQC) of generating lossy variants of fastq files (at various levels of quality discretization/resolution) has been implemented in a novel fashion, wherein information lost at successive stages of lossy compression is retained in the form of "patch" files. This option (of storing lossy file variants along with their corresponding patch files) enables the implementation of the proposed archival framework, the benefits of which are graphically illustrated in Fig. 1. Besides enabling end-users to selectively choose the amount of data to be downloaded (based on the level of quality resolution required for downstream analysis), the FQC archival framework also allows users to easily "update" a previously downloaded fastq file (with a low quality resolution) to fastq variants with relatively higher quality resolution (even to a lossless variant). For this purpose, users simply need to download the appropriate patch files. This archival model therefore greatly reduces duplication of download efforts. For instance, a user intending to update a BF3 fastq variant (corresponding to the file SRR013951_2) to its original lossless format needs to download just 460 MB (the cumulative size of the patch files PF3, PF2, and PF1), instead of downloading 883 MB (i.e. the size of the original compressed fastq file).

It is to be noted that the FQC implementation also allows users to separate out (if required) the header information from the base files into a separate "header patch" file. Considering that the header stream constitutes a significant proportion of fastq files, separating out the headers will result in even smaller base files that are easier to store and/or disseminate.

5. Conclusion

This study presents a novel archival paradigm that makes it possible for sequence repositories to simultaneously disseminate lossless as well as multiple lossy variants of fastq files. It is significant to note that the proposed archival strategy does not entail any significant increase in the existing storage and/or compute requirements of data repositories. The archival strategy is aided by a specialized fastq compression approach (FQC), which outperforms most existing fastq compression methods when run in "lossless" mode.

A. Dutta et al.

In addition, FQC also enables creation of "lossy" fastq file variants. However, in contrast to existing lossy compression approaches, FQC allows retention of the information discarded during lossy compression in the form of (one or more) small patch files. This provision allows users, on requirement, to update lossy fastq variants to their lossless counterparts. In summary, the archival strategy proposed in this study provides significant benefits not only to data repositories but also to end-users downloading fastq data.


Acknowledgments

We thank Mr. Hemang Gandhi for his help in designing the FQC webpage. Mr. Tungadri Bose is also a PhD scholar at the Indian Institute of Technology, Bombay and would like to acknowledge the Institute for its support.

References

1. Deorowicz S, Grabowski S, Compression of DNA sequence reads in FASTQ format, Bioinformatics 27(6):860–862, 2011.
2. Jones DC, Ruzzo WL, Peng X, Katze MG, Compression of next-generation sequencing reads aided by highly efficient de novo assembly, Nucleic Acids Res 40(22):e171, 2012.
3. Bonfield JK, Mahoney MV, Compression of FASTQ and SAM format sequencing data, PLoS One 8(3):e59190, 2013.
4. Kozanitis C, Saunders C, Kruglyak S, Bafna V, Varghese G, Compressing genomic sequence fragments using SlimGene, J Comput Biol 18(3):401–413, 2011.
5. Wan R, Anh VN, Asai K, Transformations for the compression of FASTQ quality scores of next-generation sequencing data, Bioinformatics 28(5):628–635, 2012.
6. Grassi E, Gregorio FD, Molineris I, KungFQ: A simple and powerful approach to compress fastq files, IEEE/ACM Trans Comput Biol Bioinform 9(6):1837–1842, 2012.
7. Mohammed MH, Dutta A, Bose T, Chadaram S, Mande SS, DELIMINATE - A fast and efficient method for loss-less compression of genomic sequences: Sequence analysis, Bioinformatics 28(19):2527–2529, 2012.
8. Bose T, Mohammed MH, Dutta A, Mande SS, BIND - An algorithm for loss-less compression of nucleotide sequence data, J Biosci 37(4):785–789, 2012.
9. Pinho AJ, Pratas D, MFCompress: A compression tool for FASTA and multi-FASTA data, Bioinformatics 30(1):117–118, 2014.
10. Jeon YJ, Park SH, Ahn SM, Hwang HJ, SOLiDzipper: A high speed encoding method for the next-generation sequencing data, Evol Bioinform Online 7:1–6, 2011.


Anirban Dutta did his doctoral research in Computational Biology at the Indian Institute of Chemical Biology-CSIR, Kolkata and received his Ph.D. (in Engineering) from Jadavpur University, Kolkata, India, in 2011. He is currently working as a Scientist in the Bio-Sciences R&D Division of Tata Consultancy Services Ltd. His research interests include bioinformatics, metagenomics, genome informatics, algorithm development, and systems biology.

Mohammed Monzoorul Haque obtained his M.Phil. (Biology) from the Department of Molecular Biology, Madurai Kamaraj University. He also holds a Post Graduate Diploma in Bioinformatics from the Institute of Bioinformatics and Applied Biotechnology, Bangalore. Working as a senior scientist in the Bio-Sciences R&D Division, TCS Innovation Labs, India, he has been involved in bioinformatics research for more than eight years. His overall research interests include development of algorithms for analyzing genomic and metagenomic datasets and compression of biological data.

Tungadri Bose received his M.Tech. in Bioinformatics from Hyderabad Central University, India, in 2009. He is currently working as a Scientist in the Bio-Sciences R&D Division of Tata Consultancy Services Limited. He is also pursuing his Ph.D. degree at the Indian Institute of Technology, Bombay. His research interests include bioinformatics, metagenomics, genome informatics, algorithm development, and systems biology.

C. V. S. K. Reddy received his M.Tech. in Computer Science from Hyderabad Central University, India, in 2005. He is currently working as a Scientist in the Bio-Sciences R&D Division of Tata Consultancy Services Limited. His research interests include the development and optimization of algorithms for analyzing metagenomic datasets and compression of biological data.


Sharmila S. Mande received her Ph.D. degree (in Physics) in 1991 from the Indian Institute of Science, Bangalore, and was later trained in protein crystallography, through which she began to address problems of biological importance. She had her postdoctoral training at the University of Groningen, Netherlands and the University of Washington, Seattle, USA. She joined Tata Consultancy Services (TCS) in 2001 and heads the Bio-Sciences R&D activities at the TCS Innovation Labs. Her research interests include metagenomics, comparative genomics, algorithm development, mathematical modeling of biological systems, and structural biology. She has authored several papers, patented algorithms, and software solutions that address challenges faced by researchers in analyzing data generated from Next Generation Sequencing technologies. She is a member of the editorial board of the journal Gut Pathogens.

