Briefings in Bioinformatics Advance Access published May 27, 2014

B RIEFINGS IN BIOINF ORMATICS . page 1 of 11

doi:10.1093/bib/bbu016

Methodological aspects of whole-genome bisulfite sequencing analysis Swarnaseetha Adusumalli, Mohd Feroz Mohd Omar, Richie Soong and Touati Benoukraf Submitted: 24th February 2014; Received (in revised form) : 17th April 2014

Abstract

Keywords: DNA methylation; epigenetics; high-throughput sequencing; bisulfite sequencing

INTRODUCTION DNA methylation is a biochemical event involved in a number of biological processes, including cellular proliferation and differentiation, pluripotency, development, genomic imprinting and oncogenesis [1]. This DNA modification manifests its effects by enhancing or reducing the affinity of protein–DNA interactions [2–4]. This in turn modulates the recruitment of transcription factors, thereby influencing gene activation or repression depending on the role of the affected regulatory elements (promoter, enhancer, silencer or insulator) and/or transcription factors (activator or repressor) [5–7]. DNA

methylation has been commonly studied at CpG dinucleotides, which are often found in clusters termed CpG islands (CGI). Nonetheless, methylation is also known to occur at non-CpG sites in plants [8, 9], mammalian stem cells [10] and mammalian brain cells [11–13], highlighting an importance for being able to analyze all cytosines in the genome. A large number of methods are available for the assessment of DNA methylation [14], with bisulfite treatment being a central procedure to a majority. The treatment deaminates unmethylated cytosines to uracil (which is later converted into thymine in DNA), while methylated cytosines are protected

Corresponding author. Touati Benoukraf, Cancer Science Institute of Singapore, National University of Singapore, Centre for Translational Medicine, 14 Medical Drive, #11-02H, 117599 Singapore. Tel.: þ65 6516 1025; Fax: þ65 6873 9664; E-mail: [email protected] Swarnaseetha Adusumalli received her Master in Bioinformatics degree from the Nanyang Technological University, Singapore. She has a broad experience in the analysis of different high-throughput sequencing data sets. She was a bioinformatician at the Cancer Science Institute of Singapore, National University of Singapore, and recently joined the Genome Institute of Singapore, A*STAR. Mohd Feroz Mohd Omar is a PhD student in Oncology at the National University of Singapore. He holds a Bachelor in Biotechnology degree from Flinders University, Australia. He is interested in elucidation of epigenetic cancer biomarkers using high-throughput sequencing and molecular biology. Richie Soong is Senior Principal Investigator at the Cancer Science Institute of Singapore, Associate Professor at the National University of Singapore and the Chief of the Centre for Translational Research and Diagnostics, which combines the largest biosample repository in Singapore, a translational research laboratory and two clinical molecular diagnostics centers. Touati Benoukraf holds a PhD in Computational Biology from the University of Aix-Marseille II, France. He is currently Research Assistant Professor at the Cancer Science Institute of Singapore, National University of Singapore. His interests include the study of epigenetic aberrations in cancers using high-throughput sequencing technologies. ß The Author 2014. Published by Oxford University Press. For Permissions, please email: [email protected]

Downloaded from http://bib.oxfordjournals.org/ at Harvard Library on May 28, 2014

The combination of DNA bisulfite treatment with high-throughput sequencing technologies has enabled investigation of genome-wide DNA methylation beyond CpG sites and CpG islands. These technologies have opened new avenues to understand the interplay between epigenetic events, chromatin plasticity and gene regulation. However, the processing, managing and mining of this huge volume of data require specialized computational tools and statistical methods that are yet to be standardized. Here, we describe a complete bisulfite sequencing analysis workflow, including recently developed programs, highlighting each of the crucial analysis steps required, i.e. sequencing quality control, reads alignment, methylation scoring, methylation heterogeneity assessment, genomic features annotation, data visualization and determination of differentially methylated cytosines. Moreover, we discuss the limitations of these technologies and considerations to perform suitable analyses.

page 2 of 11

Adusumalli et al.

OVERVIEW There are many steps involved in the conversion, processing, quality control (QC), interpretation and reporting of WGBS data (Figure 1). There have also been many programs developed to address numerous

combinations of components of the process, with different input and output formats and programming languages (Table 1). In the following sections, we discuss the principles of analysis and relevant programs according to the individual steps in processing.

HTS OUTPUT FILE FORMATS To approach a comprehensive coverage for genomes as large as the human or the mouse genome, hundreds of millions to billions of reads have to be generated, producing a high volume of data per sample. Illumina sequencers generate fastq formatted files, which record both the nucleotide sequence of reads and the base call confidence level for each nucleotide. In contrast, SOLiD sequencers export read sequences and corresponding call quality into two files, csfasta and text, respectively. As the majority of academic WGBS programs are designed for the fastq format, Cock et al. described how to convert SOLiD output to fastq for further downstream analysis [49].

QUALITY CONTROL Conventional QC tools are available for providing informative global and graphical representations of WGBS read quality prior to their alignment. Basically, two processes, adapter removal and reads trimming/filtering, are required for the removal of low-quality bases from further usage. During sequencing library preparation, adapter sequences are added to the 50 end of fragmented or PCR-amplified DNA to prime sequencing. Additional adapter sequences, called barcodes, may also be added to enable sequencing of multiple samples in a single sequencing experiment (multiplex sequencing). The non-genomic adapter and barcode DNA sequences have to be trimmed prior to sequence alignment. This procedure is done either by sequencing software packages associated with the various sequencing platforms, such as the HiSeq Analysis Software from Illumina, or by third-party tools, such as SeqTrim [50], Trimmomatic [51] or FastXtool kit [52]. Additional trimming removes low-quality bases with low phred scores (base call accuracy) from the ends (tails) of reads [53], and is necessary as base quality tends to decrease with read length [54]. Filtering removes entire reads with overall low-quality scores. Programs such as FastQC [22], BIGpre [20] and PIQA [25] provide varying analyses for trimming and filtering reads in an informed manner.

Downloaded from http://bib.oxfordjournals.org/ at Harvard Library on May 28, 2014

from this conversion [15–17]. The differential in DNA bases is subsequently exploited by a large variety of methods to assess methylation, including polymerase chain reaction (PCR) amplification, primer extension and hybridization. Nonetheless, bisulfite sequencing, which couples bisulfite treatment to Sanger sequencing, has remained the gold standard, presumably because of familiarity with the method and the specificity of its sequence output. Recently, advances in DNA sequencing have led to the emergence of so-called high-throughput sequencing (HTS) platforms for massively parallel DNA sequencing. Coupled to bisulfite treatment, HTS has expanded the scope of DNA methylation studies by enabling an unbiased genome-wide analysis of CpG sites at a nucleotide resolution. Although HTS technologies are becoming more robust and affordable, the analysis of bisulfitesequenced DNA remains challenging at several levels. First, accurate alignment of bisulfiteconverted sequences is a challenge because of the lower complexity of bases arising from the conversion of unmethylated cytosines to thymine. Second, scoring cytosine methylation levels require comprehensive genome coverage (more than billions of reads for mammalian genomes). Such procedures have been reportedly optimized; however, software Web sites have recently highlighted the need for their further improvement [18, 19]. Third, elucidating the consequences of cytosine methylation on gene regulation entails integration with large volumes of other genomic knowledge still being elucidated, including chromatin states and transcription factor binding sites. Finally, there is an issue of identifying differentially methylated loci between samples without necessarily having replicates because of the high costs of whole-genome sequencing (WGS). In this review, we describe the challenges associated with the analysis of data sets derived from whole-genome bisulfite sequencing (WGBS) experiments, from the initial treatment of sequenced reads to the determination of differentially methylated nucleotides, and discuss the tools and methods that have emerged to address these issues.

Whole-genome bisulfite sequencing analysis workflow

page 3 of 11

Downloaded from http://bib.oxfordjournals.org/ at Harvard Library on May 28, 2014

Figure 1: Stages in genome-wide methylation analysis. There are many steps (indicated by letters) and output formats (indicated in parentheses) involved in the analysis of genome-wide methylation data generated by nextgeneration sequencers. (A) QC of pre-alignment and post-aligned reads is performed using software programs that assess for features specific to WGBS data, such as the bisulfite conversion efficiency. (B) Programs that specifically take into account the altered distributions of cytosine and thymine nucleotides in WGBS data are used to obtain accurate alignment of reads to reference genomes in standard SAM/BAM formats. (C) Methylation scoring of aligned reads is computed by software that aggregates reads and outputs methylation scores of individual CpG sites or bins/clusters into bedgraph format. (D) Quantitative assessment of the methylation heterogeneity can be determined by the Shannon entropy. (E) Genome DNA methylation can be visualized using conventional genome browsers or plot tools. (F) Methylated cytosines can be annotated for genomic features to allow prioritization of candidates. (G) Statistical comparison of methylation scores between samples can be performed to identify DMCs and/or DMRs.

FASTQ SAM, BAM FASTQ FASTQ FASTQ FASTQ FASTQ FASTQ FASTQ FASTQ FASTQ FASTQ FASTQ, BAM FASTQ FASTQ FASTQ FASTA FASTQ FASTQ FASTQ FASTQ FASTQ FASTQ, BAM, SAM Bismark output (.bis) Bismark output, text BSP, BS-Seeker (text) Tab-delimited text, Bismark ouput BAM, bedgraph, VCF, wig Whitespace-delimited, VCF

BIGpre [20] BSeQC [21] Fastqc [22] HTQC [23] NGS QC Toolkit [24] PIQA [25] QC-Chain [26] SolexaQA [27] RUbioSeq [28] WBSA [29] BRAT [30] BRAT-BW [31] BSMAP [32] BS-Seeker [33] GSNAP [34] SOAP [35] BatMeth [36] Bismark [37] BS-Seeker2 [38] MethylCoder [39] MethyQA [40] PASS-bis [41] MethPipe [42] DMEAS [43] CpG_MPs [44] GBSA [45] MethylKit [46] UCSC Genome Bowser [47] VEP [48]

A A A A A A A A A, B, C A, B, C, E, F, G B B B B B B B, C B, C B, C B, C B, C B, C B, C, G C, D, E C, D, E, G C, E, F C, E, F, G E F

Processing step(s)b

Program type Command line (Perl) Command line (Python) GUI ( Java), Command line( Java) Command line (Cþþ) Command line (R library) Command line (R library) Command line (binary, Cþþ) Command line (Perl, R library) Command line (Perl) Web server Command line(Cþþ) Command line(Cþþ) Command line(Cþþ) Command line(Python) Command line (C/Perl) Command line (Cþþ) Command line (C/Cþþ) Command line (Perl) Command line (Python) Command line (Python) Command line (Cþþ) Command line (Cþþ) Command line (Cþþ) Command line (Perl) Java applet GUI (Windows), Command line (Python) Command line (R library) Web server Web server, Command line (Perl)

Output file format(s)a Tab-delimited text, R file (for plotting) SAM, BAM, PDF, text HTML report FASTQ, tab-delimited text FASTQ, HTML report, tab-delimited text HTML report FASTQ FASTQ, PDF, text SAM/BAM, bed, wig, VCF SAM, HTML report, tab-delimited text, genome browser Tab-delimited text (convertible to SAM) Tab-delimited text (convertible to SAM) SAM/BAM, BSP, text SAM/BAM, tab-delimited text SAM SAM custom text format, PDF SAM, tab-delimited text SAM/BAM, tab-delimited text, wig SAM, tab-delimited text Tab-delimited text Modified SAM BAM, tab-delimited text Tab-delimited text, images Tab-delimited text, xls, images, genome browser Bedgraph, tab-delimited text, images, gene browser Bedgraph, tab-delimited text, images Genome browser Tab-delimited text, VCF

Downloaded from http://bib.oxfordjournals.org/ at Harvard Library on May 28, 2014

a Only major formats were retained. bProcessing steps managed by the programs according to the stages described in Figure 1. A: QC; B: alignment; C: methylation scoring; D: quantitative score assessment; E: visualization; F: annotation; G: determination of differential DNA methylation.

Input file format(s)a

Software

Table 1: Programs for analyzing genome-wide methylation data

page 4 of 11 Adusumalli et al.

Whole-genome bisulfite sequencing analysis workflow

MAPPING BISULFITE-CONVERTED READS TO A GENOME REFERENCE The dissimilarity between reads of bisulfite-treated DNA and standard reference genomes renders conventional alignment tools such as Bowtie [57], BWA [58] or Maq [59] unsuitable. Furthermore, the possibility that every given thymine could either be a genuine genomic thymine or a converted unmethylated cytosine adds another level of complexity. A number of programs have now been developed to address these WGBS-specific issues, with some conventional aligners also incorporating these capabilities [34, 35]. The programs can be categorized

into two groups, namely, wild-card aligners and three-letter aligners [60]. Wild-card bisulfite sequencing aligners like BSMAP [32], RMAP [61] and Segemehl [62] operate by replacing cytosines in the sequenced reads with wild-card Y (the International Union of Pure and Applied Chemistry code for cytosine or thymine) nucleotides. In contrast, three-letter aligners like BS-Seeker [33], Bismark [37], BRAT [30] and MethylCoder [39] convert C to T in both sequenced reads and genome reference prior to performing the reads mapping using modified conventional aligners such as Bowtie for BSSeeker and Bismark. The alignment tools export aligned reads in the standardized SAM/BAM [63] formats, or a custom SAM-like format that includes extra columns describing the conversion status of each cytosine within each read, such as the bsp format from BSMAP.

DNA METHYLATION SCORING Aligned reads do not themselves provide any biological interpretable data and need further analysis to achieve meaningful methylation assessment. Reads have to be aggregated to compute a methylation score for cytosines or genomic regions. From this aggregation, differentially methylated cytosines (DMCs) or differentially methylated regions (DMRs) between cell populations can be determined, and subsequently overlapped with other genomic features to provide context and perspective to the methylation events. Another consideration is that DNA can be methylated in three sequence contexts: CpG, CHG and CHH (where H ¼ A, T or C). In Arabidopsis thaliana, it has been shown that each context is methylated through distinct pathways [64]. Likewise, the impact of DNA methylation on gene regulation has also been demonstrated to depend on sequence context. Indeed, CpG methylation appears to be more involved in nucleosome positioning [65–68] and transcription factor binding site affinity [2–4], while CHG and CHH seem to affect gene regulation at DNA repetitive elements [69].

GENERATING DNA METHYLATION SCORES AT A SINGLE-NUCLEOTIDE RESOLUTION Cytosine methylation scoring consists of aggregating overlapping read segments and determining the proportion of cytosines (indicating methylated cytosines)

Downloaded from http://bib.oxfordjournals.org/ at Harvard Library on May 28, 2014

However, these tools have the drawback of requiring the trimming tools described earlier for read filtering and trimming. NGS QC Toolkit [24] and SolexaQA [27] are QC tools that also have inbuilt read trimming capabilities, while HTQC [23] and QC-Chain [26] have read filtering options in addition. The unique feature of bisulfite-treated DNA is an over- and under-representation of thymine and cytosine bases, respectively. This altered sequence composition is considered biased using conventional QC tools, which are based on patterns in untreated DNA. To address this, MethyQA [40] was devised to allow QC (using FastQC) of reads prior to their alignment, while BSeQC [21] allows QC of reads after read alignment. Both BSeQC and MethyQA are restricted to the analysis of cells where DNA methylation occurs only in context of CpG sites. These programs assess the efficiency of bisulfite treatment by analyzing the conversion of non-CpG cytosines. The non-CpG cytosines are presumably not methylated and, therefore, are expected to be completely converted to thymines with adequate bisulfite treatment. These programs also provide tools for read processing, such as read trimming and duplicate removal. In addition, BSeQC provides a summary report of bias unique to each read length and mapping strand, including the occurrence of duplicates. To overcome the limitations of considering only methylation events in a CpG context, bisulfite conversion efficiency can also be determined by spiking DNA sequences of varying lengths and known methylation status from a different organism, such as bacterial plasmid [55] or l-phage [56], prior to bisulfite treatment. The methylation status of the spiked DNA is then compared with its expected status to determine bisulfite conversion efficiency.

page 5 of 11

page 6 of 11

Adusumalli et al.

or thymines (indicating unmethylated cytosines) for each cytosine in the genome. This proportion of methylated cytosines to all cytosines in genomic DNA, or so-called b-score [70], is calculated as follows: Ci,g ¼

n

1 if Ci,g ¼ M 0 if Ci,g ¼ U

o , bscorei ¼

Pn i

g¼1

Ci,g

cytosine methylation status [72], which corresponds to the overall proportion of methylated cytosines to all cytosines within the bin, and covered by a minimum amount of reads as follows: Ci,j,g ¼

n

1 if Ci,j,g ¼ M 0 if Ci,j,g ¼ U

o

P d P ni , bin scorej ¼

Obtaining complete and high-depth coverage of all cytosines throughout the genome is not always feasible. This presents a problem to studies that aim to compare methylation levels between multiple samples, as only cytosines that have adequate sequence coverage in all studies can be selected for comparison. Software like Methylkit [46] has implemented a strategy to overcome this issue by dividing the genome into several bins comprising regions ranging from a few base pairs up to several kilobases. This approach trades off an increase in the regions of the genome being studied at the expense of a loss of resolution. Bin scores can eventually be computed as a mean of b-scores. Nonetheless, this method does not take into account the sequencing depth variation at each cytosine that impacts the accuracy of methylation level calling. For this reason, it is more appropriate to compute a weighted mean of

g¼1

Ci,j,g

ni,j

where bin scorej denotes the methylation score of a given bin located at the position j in the genome; Ci,j,g denotes a cytosine at the position i (ranging from 1 to d) within the bin j in a corresponding read g (ranging from 1 to ni,j ); and ni,j denotes the depth of sequencing coverage at the position i in the bin j. Alternatively, methylated or unmethylated cytosine clusters can be highlighted dynamically using a sliding window approach [44, 45], which delineates (un)methylated DNA domains regardless of the bin size parameter.

GENERATING DNA METHYLATION HETEROGENEITY SCORES Clonal cell populations can vary in their maintenance of DNA methylation patterns through successive generations, leading to differential methylation at genomic sites such as promoters, CGIs and repetitive elements [73]. These differential methylation patterns cannot be determined using b or bin scores. For example, a region comprising four CpG sites can have a methylation score of 0.5 (50%) with six different patterns: MMUU, MUMU, MUUM, UMMU, UMUM and UUMM, where M and U indicate methylated and unmethylated cytosines, respectively. To address this, DMEAS used a modified Shannon entropy method to generate an entropy score (ME) for a given amount of consecutive CpG sites [43]. This ME considers the number of sequenced reads N generated for a genomic locus and the frequency of each distinct DNA methylation pattern ni observed in a genomic locus as follows: ME ¼

e X  ni ni   log10 ð Þ b N N

The constant be is built with e, the entropy for code bit, defined as e ¼ lnð10Þ= lnð2Þ; and b, the amount of consecutive CpG sites being considered. In this calculation, the larger the ME, the greater the heterogeneity of methylation within the studied cell population.

Downloaded from http://bib.oxfordjournals.org/ at Harvard Library on May 28, 2014

GENERATING REGIONAL DNA METHYLATION SCORES

Pd

i¼1

ni

where M and U denote methylated and unmethylated cytosines, respectively; Ci,g denotes a given cytosine located at the position i in the genome in a corresponding read g (ranging from 1 to ni ); and ni denotes the depth of sequencing coverage at the position i. This calculation also relies on excluding analysis of cytosines with poor depth of coverage to reduce inaccuracy, with some studies suggesting at least three reads are sufficient for methylation call accuracy [71]. Methylation scoring call files generally report cytosine methylation scores as a standard bedgraph file, or non-standard tab-delimited file that includes cytosine methylation scores along with the depth of coverage (BatMeth [36], Bismark [37], BRAT-BW [31], BSSeeker2 [38], GBSA [45], Pass-bis [41] and RuBioSeq [28]). Despite the fact that the latter tab-delimited file is not a standard format, it is still useful for the identification of DMCs and DMRs because the depth of coverage information is used for calculating their statistical relevance.

i¼1

Whole-genome bisulfite sequencing analysis workflow

VISUALIZING DNA METHYLATION

ASSESSING DNA METHYLATION ACCORDING TO GENOMIC CONTEXT To fully understand its role in chromatin modeling and gene regulation, cytosine methylation patterns have to be annotated in their genomic context. Data have indicated that methylation within transcription start site (TSS) regions or first exons has a stronger correlation with gene repression than methylation within the gene body [78, 79]. Emerging evidence correlating gene body methylation with active gene transcription [5, 80–82] has begun to challenge the canonical association of DNA methylation with gene silencing. To address this, programs such as GBSA [45] or Web Service for Bisulfite Sequencing Data Analysis (WBSA) [29] analyze DNA methylation with respect to their location in features such as promoters, TSS regions, gene bodies or repetitive elements. In addition, cytosines or domains provided in bed or bedgraph formats can be further annotated using third-party programs such as VEP [48].

DIFFERENTIAL DNA METHYLATION ASSESSMENT DNA methylation patterns can differ between tissues [83], developmental stages [84] and diseases types [85] and within individuals owing to environmental effects [86]. Comparing methylation patterns between different conditions, at a paired or cohort level, is therefore an essential element of DNA methylation analysis. However, performing replicate experiments of WGBS for statistical robustness is not always feasible owing to the high cost of WGS. Therefore, conventional statistical tools used for other platforms such as gene arrays (e.g. t-test or analysis of variance) are often unsuitable. WBSA compares DNA methylation levels between samples within fixed or dynamic regions using a Wilcoxon test. It identifies DMRs within CpG context only, within CH context only or regardless of the cytosine context. However, DMCs are not calculated by this software. Because cytosine methylation levels are determined by the fraction of methylated cytosines to all cytosines in sequenced reads, statistical significance can also be assessed by comparing these proportions with the proportions of bisulfite-converted and -unconverted cytosines for each position or bins across reads for different samples. For this reason, proportional statistical tests such as Fisher’s exact test (FET) can be appropriate for assessing the statistical relevance of DMCs/DMRs between samples. MethylKit has implemented this approach to identify cytosines (or scalable bins) with differential methylation without necessarily having experimental replicates. The statistical significance is determined from the methylation status of reads between samples. The FET_SW module from the CpG_MPs program [44] also uses FET to delineate differential methylated clusters using a sliding window approach. The difference or the consistency in methylation among samples (where n  3) can further be quantitatively assessed by an ME score [44]. In contrast, MethPipe [42] characterizes hypo- or hyper-methylated cytosine islands within each sample using a hidden Markov model framework, and determines the statistical significance of DMCs by FET. Then, DMRs are identified by overlapping hypo- or hyper-methylated cytosine islands with DMCs.

OUTLOOK Recent genome-wide DNA methylation studies in plants [8, 9], human stem cells [10] and mammalian

Downloaded from http://bib.oxfordjournals.org/ at Harvard Library on May 28, 2014

WGBS outputs for aligned reads to DMCs and DMRs are all provided in the same formats as other HTS data. Hence, all conventional genome browsers and visualization tools can be used for WGBS data. However, it should be noted that the conversion of unmethylated cytosines to thymines in WGBS read files (SAM/BAM formats) leads to their highlighting as single-nucleotide variants. A common approach to enable browsing of the methylome is to use bedgraph files, such as those generated by Bismark, GBSA and Methylkit, to provide methylation scores and depth of coverage. These standard formats are compatible with most genome browsers and provide a clear representation of methylation score at the nucleotide precision [74, 75]. Moreover, bedgraph files are easily convertible to a tab-delimited format and used for global visualization using tools such as Circos [76]. Alternatively, software packages such as GobbyWeb [77] output methylation results in vcf (Variant Call Format) files. Although this format is less used than the bedgraph format, it has the benefit of incorporating additional information such as methylation score, depth of coverage and read quality in a single file.

page 7 of 11

page 8 of 11

Adusumalli et al. elements and other non-coding regions promises to be a rich source of new findings in the near future. RegulomeDB [105] or AnnTools [106] projects aim to annotate non-coding genomic variants by integrating inter alia transcription factor binding sites and histone modification signatures derived from HTS data sets. The extension of these projects to DNA methylation analysis would be desirable. DNA methylation dynamics within CGIs have also been associated with nucleosome formation and positioning [64–68]; however, the mechanism by which the chromatin plasticity is controlled is still unclear. The integration of knowledge of nucleotide resolution of cytosine methylation, protein–DNA interaction and gene expression should help to shed light on the factors involved in transcriptional machinery recruitment and decipher the causes and consequences of DNA methylation.

Key points  Bisulfite sequencing has become the standard technique for genome-wide DNA methylation investigations at the nucleotide resolution.  Whole-genome bisulfite sequencing allows unbiased investigation of DNA methylation (beyond CpG sites and CpG islands).  Comprehensive bisulfite sequencing analysis requires the use of a combination of several complementary tools.  This article provides a broad review of bisulfite sequencing analysis, covering steps from sequencing quality reports to assessing differential cytosine methylation.

Acknowledgements The authors thank Nicolas Bertin and Samuel Collombet for useful discussions and comments during the preparation of this manuscript.

FUNDING National Research Foundation and the Singapore Ministry of Education under its Centres of Excellence initiative, and a grant (NMRC/IRG/ 1278/2010) from the National Medical Research Council (Singapore).

References 1.

2.

Franchini D-M, Schmitz K-M, Petersen-Mahrt SK. 5-methylcytosine DNA demethylation: more than losing a methyl group. Annu Rev Genet 2012;46:419–41. Fujimoto M, Kitazawa R, Maeda S, et al. Methylation adjacent to negatively regulating AP-1 site reactivates TrkA gene expression during cancer progression. Oncogene 2005; 24:5108–18.

Downloaded from http://bib.oxfordjournals.org/ at Harvard Library on May 28, 2014

brain cells [11–13] have revealed a significant amount of non-CpG-methylated sites. Early data suggest this non-CpG methylation may be involved in gene regulation at DNA repetitive elements [69]. Hence, moving on from CGIs to unbiased genome-wide analyses of cytosines promises to expand the understanding of the molecular and organismal effects of DNA methylation. Other studies have highlighted that DNA methylation can occur in an allele-specific manner, leading to parental-specific gene expression [87–90]. Genome-wide allelic-specific methylation analysis could hence provide new insights into the mechanisms that cause disorders associated with genomic imprinting, such as Prader–Willi, Angleman, Silver–Russell or Beckwith–Weidemann syndromes [91]. To this end, emerging statistical methods are allowing assessment of allele methylation scores [92]. However, these methods are unreliable when applied to heterogeneous cell populations such as cancer cells. The development of single-cell sequencing technologies [93] could help address this limitation. The ten–eleven translocation (TET) enzyme family of dioxygenases oxidizes and converts methylcytosines to 5-hydroxymethylcytosines (5h-mC) [94, 95]. Nonetheless, it is still unclear if 5h-mC is a functional epigenetic mark or a transition phase in DNA demethylation [96]. Unfortunately, bisulfite treatment converts both methylcytosine and 5h-mC to uracil and, therefore, cannot be used to distinguish them [97], and this limitation should be taken into account when interpreting WGBS data. Recently, oxidative bisulfite sequencing [98] and TET-assisted bisulfite sequencing [99] protocols to distinguish methylcytosine and 5h-mC have been reported. However, these protocols require a standard bisulfite-treated reference, which requires that two WGS analyses per sample be performed to enable comprehensive analyses of methylcytosine and 5h-mC patterns. Methylation analysis approaches that do not require bisulfite treatment, such as Nanopore technology, promise the next major advance in methylation analysis [100, 101]. Finally, recent HTS investigations have begun to describe the physiological and pathological DNA methylation dynamics of CGI shores [102, 103] and intra- and intergenic distal regulatory elements [6, 104]. These elements include enhancers and insulators that have been shown to strengthen or suppress the binding of transcription factors [2–4]. Therefore, the expansion of methylation investigations beyond gene features and into regulatory

Whole-genome bisulfite sequencing analysis workflow 3.

4.

5. 6.

7.

8.

10.

11.

12.

13.

14.

15.

16. 17.

18.

19.

20.

21.

22.

23. Yang X, Liu D, Liu F, et al. HTQC: a fast quality control toolkit for Illumina sequencing data. BMC Bioinformatics 2013;14:33. 24. Patel RK, Jain M. NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PLoSOne 2012; 7:e30619. 25. Martı´nez-Alca´ntara A, Ballesteros E, Feng C, et al. PIQA: pipeline for Illumina G1 genome analyzer data quality assessment. Bioinformatics 2009;25:2438–9. 26. Zhou Q, Su X, Wang A, et al. QC-Chain: fast and holistic quality control method for next-generation sequencing data. PLoS One 2013;8:e60234. 27. Cox MP, Peterson DA, Biggs PJ. SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 2010;11:485. 28. Rubio-Camarillo M, Go´mez-Lo´pez G, Ferna´ndez JM, et al. RUbioSeq: a suite of parallelized pipelines to automate exome variation and bisulfite-seq analyses. Bioinformatics 2013;29:1687–9. 29. Liang F, Tang B, Wang Y, et al. WBSA: Web Service for Bisulfite Sequencing Data Analysis. PLoS One 2014;9: e86707. 30. Harris EY, Ponts N, Levchuk A, et al. BRAT: bisulfite-treated reads analysis tool. Bioinformatics 2010;26: 572–3. 31. Harris EY, Ponts N, Le Roch KG, et al. BRAT-BW: efficient and accurate mapping of bisulfite-treated reads. Bioinformatics 2012;28:1795–6. 32. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics 2009;10:232. 33. Chen P-Y, Cokus SJ, Pellegrini M. BS Seeker: precise mapping for bisulfite sequencing. BMC Bioinformatics 2010; 11:203. 34. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 2010; 26:873–81. 35. Li R, Li Y, Kristiansen K, et al. SOAP: short oligonucleotide alignment program. Bioinformatics 2008;24: 713–4. 36. Lim J-Q, Tennakoon C, Li G, et al. BatMeth: improved mapper for bisulfite sequencing reads on DNA methylation. Genome Biol 2012;13:R82. 37. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics 2011;27:1571–2. 38. Guo W, Fiziev P, Yan W, et al. BS-Seeker2: a versatile aligning pipeline for bisulfite sequencing data. BMC Genomics 2013;14:774. 39. Pedersen B, Hsieh T-F, Ibarra C, et al. MethylCoder: software pipeline for bisulfite-treated sequences. Bioinformatics 2011;27:2435–6. 40. Sun S, Noviski A, Yu X. MethyQA: a pipeline for bisulfitetreated methylation sequencing quality assessment. BMC Bioinformatics 2013;14:259. 41. Campagna D, Telatin A, Forcato C, et al. PASS-bis: a bisulfite aligner suitable for whole methylome analysis of Illumina and SOLiD reads. Bioinformatics 2013; 29:268–70. 42. Song Q, Decato B, Hong EE, et al. A reference methylome database and analysis pipeline to facilitate integrative and comparative epigenomics. PLoS One 2013;8:e81148.

Downloaded from http://bib.oxfordjournals.org/ at Harvard Library on May 28, 2014

9.

Zhu W-G, Srinivasan K, Dai Z, et al. Methylation of adjacent CpG sites affects Sp1/Sp3 binding and activity in the p21(Cip1) promoter. Mol Cell Biol 2003;23:4056–65. Hu S, Wan J, Su Y, etal. DNA methylation presents distinct binding sites for human transcription factors. Elife 2013;2: e00726. Jones PA. The DNA methylation paradox. Trends Genet 1999;15:34–7. Tatetsu H, Ueno S, Hata H, etal. Down-regulation of PU.1 by methylation of distal regulatory elements and the promoter is required for myeloma cell growth. CancerRes 2007; 67:5328–36. Lai AY, Fatemi M, Dhasarathy A, et al. DNA methylation prevents CTCF-mediated silencing of the oncogene BCL6 in B cell lymphomas. J Exp Med 2010;207:1939–50. Lister R, O’Malley RC, Tonti-Filippini J, etal. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 2008;133:523–36. Cokus SJ, Feng S, Zhang X, et al. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature 2008;452:215–9. Lister R, Pelizzola M, Kida YS, et al. Hotspots of aberrant epigenomic reprogramming in human induced pluripotent stem cells. Nature 2011;471:68–73. Xie W, Barr CL, Kim A, et al. Base-resolution analyses of sequence and parent-of-origin dependent DNA methylation in the mouse genome. Cell 2012;148:816–31. Lister R, Mukamel EA, Nery JR, et al. Global epigenomic reconfiguration during mammalian brain development. Science 2013;341:1237905. Guo JU, Su Y, Shin JH, et al. Distribution, recognition and regulation of non-CpG methylation in the adult mammalian brain. Nat Neurosci 2014;17:215–22. Harrison A, Parle-McDermott A. DNA methylation: a timeline of methods and applications. Front Genet 2011; 2:74. Wang RY, Gehrke CW, Ehrlich M. Comparison of bisulfite modification of 5-methyldeoxycytidine and deoxycytidine residues. Nucleic Acids Res 1980;8:4777–90. El-Maarri O. Methods: DNA methylation. Adv Exp Med Biol 2003;544:197–204. Hayatsu H. Discovery of bisulfite-mediated cytosine conversion to uracil, the key reaction for DNA methylation analysis–a personal account. Proc Jpn Acad Ser B Phys Biol Sci 2008;84:321–30. Akalin A, Kormaksson M, Li S, et al. methylKit. https:// code.google.com/p/methylkit/(14 April 2014, date last accessed). Benoukraf T, Wongphayak S, Hadi LHA, et al. GBSA. http://ctrad-csi.nus.edu.sg/gbsa/(14 April 2014, date last accessed). Zhang T, Luo Y, Liu K, et al. BIGpre: a quality assessment package for next-generation sequencing data. Genomics Proteomics Bioinformatics 2011;9:238–44. Lin X, Sun D, Rodriguez B, et al. BSeQC: quality control of bisulfite sequencing experiments. Bioinformatics 2013;29: 3227–9. Andrews S. FastQC: a quality control tool for high throughput sequence data. http://www.bioinformatics.bab raham.ac.uk/projects/fastqc/(14 April 2014, date last accessed).

page 9 of 11

page 10 of 11

Adusumalli et al. 63. Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009; 25:2078–9. 64. Law JA, Jacobsen SE. Establishing, maintaining and modifying DNA methylation patterns in plants and animals. Nat Rev Genet 2010;11:204–20. 65. Choi JK, Bae J-B, Lyu J, et al. Nucleosome deposition and DNA methylation at coding region boundaries. Genome Biol 2009;10:R89. 66. Chodavarapu RK, Feng S, Bernatavichute Y V, et al. Relationship between nucleosome positioning and DNA methylation. Nature 2010;466:388–92. 67. Fenouil R, Cauchy P, Koch F, et al. CpG islands and GC content dictate nucleosome depletion in a transcriptionindependent manner at mammalian promoters. Genome Res 2012;22:2399–408. 68. Collings CK, Waddell PJ, Anderson JN. Effects of DNA methylation on nucleosome stability. Nucleic Acids Res 2013; 41:2918–31. 69. Guo W, Chung W-Y, Qian M, et al. Characterizing the strand-specific distribution of non-CpG methylation in human pluripotent cells. Nucleic Acids Res 2013;42:3009–16. 70. Laird PW. Principles and challenges of genomewide DNA methylation analysis. Nat Rev Genet 2010;11:191–203. 71. Laurent L, Wong E, Li G, et al. Dynamic changes in the human methylome during differentiation. GenomeRes 2010; 20:320–31. 72. Schultz MD, Schmitz RJ, Ecker JR. ‘Leveling’ the playing field for analyses of single-base resolution DNA methylomes. Trends Genet 2012;28:583–5. 73. Xie H, Wang M, de Andrade A, et al. Genome-wide quantitative assessment of variation in DNA methylation patterns. Nucleic Acids Res 2011;39:4099–108. 74. Nicol JW, Helt GA, Blanchard SG, et al. The Integrated Genome Browser: free software for distribution and exploration of genome-scale datasets. Bioinformatics 2009;25:2730–1. 75. Thorvaldsdo´ttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 2013;14: 178–92. 76. Krzywinski M, Schein J, Birol I, etal. Circos: an information aesthetic for comparative genomics. Genome Res 2009;19: 1639–45. 77. Dorff KC, Chambwe N, Zeno Z, et al. GobyWeb: simplified management and analysis of gene expression and DNA methylation sequencing data. PLoS One 2013;8: e69666. 78. Hodges E, Molaro A, Dos Santos CO, et al. Directional DNA methylation changes and complex intermediate states accompany lineage specificity in the adult hematopoietic compartment. Mol Cell 2011;44:17–28. 79. Li Y, Zhu J, Tian G, et al. The DNA methylome of human peripheral blood mononuclear cells. PLoS Biol 2010;8: e1000533. 80. Zilberman D, Gehring M, Tran RK, et al. Genome-wide analysis of Arabidopsis thaliana DNA methylation uncovers an interdependence between methylation and transcription. Nat Genet 2007;39:61–9. 81. Hellman A, Chess A. Gene body-specific methylation on the active X chromosome. Science 2007;315:1141–3.

Downloaded from http://bib.oxfordjournals.org/ at Harvard Library on May 28, 2014

43. He J, Sun X, Shao X, et al. DMEAS: DNA methylation entropy analysis software. Bioinformatics 2013;29:2044–5. 44. Su J, Yan H, Wei Y, et al. CpG_MPs: identification of CpG methylation patterns of genomic regions from highthroughput bisulfite sequencing data. Nucleic Acids Res 2012;41:1–15. 45. Benoukraf T, Wongphayak S, Hadi LHA, et al. GBSA: a comprehensive software for analysing whole genome bisulfite sequencing data. Nucleic Acids Res 2013;41:e55. 46. Akalin A, Kormaksson M, Li S, et al. methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol 2012;13:R87. 47. Meyer LR, Zweig AS, Hinrichs AS, et al. The UCSC Genome Browser database: extensions and updates 2013. Nucleic Acids Res 2012;41:D64–9. 48. McLaren W, Pritchard B, Rios D, et al. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 2010;26:2069–70. 49. Cock PJA, Fields CJ, Goto N, et al. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 2010; 38:1767–71. 50. Falgueras J, Lara AJ, Ferna´ndez-Pozo N, et al. SeqTrim: a high-throughput pipeline for pre-processing any type of sequence read. BMC Bioinformatics 2010;11:38. 51. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for illumina sequence data. Bioinformatics 2014. doi: 10.1093/bioinformatics/btu170. 52. The FASTX-Toolkit. http://hannonlab.cshl.edu/fastx_ toolkit/ (14 April 2014, date last accessed). 53. Ewing B, Hillier L, Wendl MC, et al. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 1998;8:175–85. 54. Liu Q, Guo Y, Li J, et al. Steps to ensure accuracy in genotype and SNP calling from Illumina sequencing data. BMC Genomics 2012;13(Suppl 8):S8. 55. Trimarchi MP, Murphy M, Frankhouser D, et al. Enrichment-based DNA methylation analysis using nextgeneration sequencing: sample exclusion, estimating changes in global methylation, and the contribution of replicate lanes. BMC Genomics 2012;13(Suppl 8):S6. 56. Lisanti S, von Zglinicki T, Mathers JC. Standardization and quality controls for the methylated DNA immunoprecipitation technique. Epigenetics 2012;7:615–25. 57. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10:R25. 58. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009;25: 1754–60. 59. Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008;18:1851–8. 60. Bock C. Analysing and interpreting DNA methylation data. Nat Rev Genet 2012;13:705–19. 61. Smith AD, Chung W-Y, Hodges E, et al. Updates to the RMAP short-read mapping software. Bioinformatics 2009;25: 2841–2. 62. Hoffmann S, Otto C, Kurtz S, et al. Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Comput Biol 2009;5:e1000502.

Whole-genome bisulfite sequencing analysis workflow

96. Branco MR, Ficz G, Reik W. Uncovering the role of 5-hydroxymethylcytosine in the epigenome. Nat Rev Genet 2012;13:7–13. 97. Nestor C, Ruzov A, Meehan R, et al. Enzymatic approaches and bisulfite sequencing cannot distinguish between 5-methylcytosine and 5-hydroxymethylcytosine in DNA. Biotechniques 2010;48:317–9. 98. Booth MJ, Branco MR, Ficz G, et al. Quantitative sequencing of 5-methylcytosine and 5-hydroxymethylcytosine at single-base resolution. Science 2012;336:934–7. 99. Yu M, Hon GC, Szulwach KE, et al. Base-resolution analysis of 5-hydroxymethylcytosine in the mammalian genome. Cell 2012;149:1368–80. 100. Shim J, Humphreys GI, Venkatesan BM, et al. Detection and quantification of methylation in DNA using solid-state nanopores. Sci Rep 2013;3:1389. 101. Laszlo AH, Derrington IM, Brinkerhoff H, et al. Detection and mapping of 5-methylcytosine and 5-hydroxymethylcytosine with nanopore MspA. Proc Natl Acad Sci USA 2013;110:18904–9. 102. Doi A, Park I-H, Wen B, et al. Differential methylation of tissue- and cancer-specific CpG island shores distinguishes human induced pluripotent stem cells, embryonic stem cells and fibroblasts. Nat Genet 2009;41:1350–3. 103. Kozlenkov A, Roussos P, Timashpolsky A, et al. Differences in DNA methylation between human neuronal and glial cells are concentrated in enhancers and non-CpG sites. Nucleic Acids Res 2013;9000:1–19. 104. Kulis M, Queiro´s AC, Beekman R, et al. Intragenic DNA methylation in transcriptional regulation, normal differentiation and cancer. Biochim Biophys Acta 2013;1829: 1161–74. 105. Boyle AP, Hong EL, Hariharan M, et al. Annotation of functional variation in personal genomes using RegulomeDB. Genome Res 2012;22:1790–7. 106. Makarov V, O’Grady T, Cai G, et al. AnnTools: a comprehensive and versatile annotation toolkit for genomic variants. Bioinformatics 2012;28:724–5.

Downloaded from http://bib.oxfordjournals.org/ at Harvard Library on May 28, 2014

82. Ball MP, Li JB, Gao Y, et al. Targeted and genome-scale strategies reveal gene-body methylation signatures in human cells. Nat Biotechnol 2009;27:361–8. 83. Eckhardt F, Lewin J, Cortese R, et al. DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet 2006;38:1378–85. 84. Straussman R, Nejman D, Roberts D, et al. Developmental programming of CpG island methylation profiles in the human genome. Nat Struct Mol Biol 2009;16:564–71. 85. Bergman Y, Cedar H. DNA methylation dynamics in health and disease. Nat Struct Mol Biol 2013;20:274–81. 86. Breitling LP, Yang R, Korn B, et al. Tobacco-smokingrelated differential DNA methylation: 27K discovery and replication. AmJ Hum Genet 2011;88:450–7. 87. Bartolomei MS, Ferguson-Smith AC. Mammalian genomic imprinting. Cold Spring Harb Perspect Biol 2011;3: a002592. 88. Tarutani Y, Takayama S. Monoallelic gene expression and its mechanisms. Curr Opin Plant Biol 2011;14:608–13. 89. Abu-Amero S, Monk D, Apostolidou S, et al. Imprinted genes and their role in human fetal growth. Cytogenet. Genome Res 2006;113:262–70. 90. Edwards CA, Ferguson-Smith AC. Mechanisms regulating imprinted genes in clusters. Curr Opin Cell Biol 2007;19:281–9. 91. Butler MG. Genomic imprinting disorders in humans: a mini-review. J Assist Reprod Genet 2009;26:477–86. 92. Fang F, Hodges E, Molaro A, et al. Genomic landscape of human allele-specific DNA methylation. Proc Natl Acad Sci USA 2012;109:7332–7. 93. Kantlehner M, Kirchner R, Hartmann P, et al. A highthroughput DNA methylation analysis of a single cell. Nucleic Acids Res 2011;39:e44. 94. Kriaucionis S, Heintz N. The nuclear DNA base 5-hydroxymethylcytosine is present in Purkinje neurons and the brain. Science 2009;324:929–30. 95. Tahiliani M, Koh KP, Shen Y, et al. Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in mammalian DNA by MLL partner TET1. Science 2009; 324:930–5.

page 11 of 11

Methodological aspects of whole-genome bisulfite sequencing analysis.

The combination of DNA bisulfite treatment with high-throughput sequencing technologies has enabled investigation of genome-wide DNA methylation beyon...
731KB Sizes 2 Downloads 3 Views