Genomics 103 (2014) 222–228

Contents lists available at ScienceDirect

Genomics journal homepage: www.elsevier.com/locate/ygeno

H3K4me2 reliably defines transcription factor binding regions in different cells Ying Wang a, Xiaoman Li b,⁎, Haiyan Hu a,⁎ a b

Department of Electric Engineering and Computer Science, University of Central Florida, Orlando, FL 32816, USA Burnett School of Biomedical Science, University of Central Florida, Orlando, FL 32816, USA

a r t i c l e

i n f o

Article history: Received 24 October 2013 Accepted 1 February 2014 Available online 12 February 2014 Keywords: ChIP-seq Transcription factor binding regions Histone modification H3K4me2

a b s t r a c t Histone modification (HM) patterns are widely applied to identify transcription factor binding regions (TFBRs). However, how frequently the TFBRs overlap with genomic regions enriched with certain types of HMs and which HM marker is more effective to pinpoint the TFBRs have not been systematically investigated. To address these problems, we studied 149 transcription factor (TF) ChIP-seq datasets and 33 HM ChIP-seq datasets in three cell lines. We found that on average about 90% of the TFBRs overlap with the H3K4me2-enriched regions. Moreover, the H3K4me2-enriched regions with stronger signals of H3K4me2 enrichment more likely overlap with the TFBRs than those with weaker signals. In addition, we showed that the H3K4me2-enriched regions together with the H3K27ac-enriched regions can greatly reduce false positive predictions of the TFBRs. Our study sheds light on the comprehensive discovery of the TFBRs using the HeK4me-enriched regions, especially when no good antibody to a TF exists. © 2014 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).

1. Introduction Transcription factor binding regions (TFBRs) are genomic regions of several hundred to several thousand base pairs long that contain transcription factor binding sites (TFBSs). Under an experimental condition, a transcription factor (TF) binds its TFBSs only in a small number of TFBRs, although many more genomic regions may contain short DNA segments exactly the same as its TFBSs [1]. The common pattern of the TFBSs bound by a TF is usually called a motif, often represented as a consensus sequence or a position weight matrix (PWM) [1,2]. The binding of TFs to their TFBSs in the TFBRs can activate or repress the transcriptional expression of genes near the TFBRs [3]. It is thus important to discover TFBRs and TFBSs for a given TF. Currently, a routine approach to discover TFBRs and TFBSs is to carry out the chromatin immunoprecipitation followed by massive parallel sequencing (ChIP-seq) experiments with a specific antibody against a TF [4,5]. These TF-based ChIP-seq experiments can define potential TFBRs that are enriched with the binding of this TF on the genome scale, the so-called ChIP-seq peak regions. With several hundreds to thousands of potential TFBRs, computational methods can then be applied to discover TFBSs in these regions [6–12]. Although ChIP-seq experiments with a specific antibody against a TF can be a powerful means to define potential TFBRs and subsequently discover TFBSs, the ⁎ Corresponding authors. E-mail addresses: [email protected] (Y. Wang), [email protected] (X. Li), [email protected] (H. Hu).

challenge is that TF-specific antibodies remain unknown for the majority of TFs [13]. Additional limitation is that it is still costly to perform ChIP-seq experiments for all TFs under a given condition [14]. An alternative approach to discover TFBRs and TFBSs is based on ChIP-seq experiments using antibodies against different histone modification (HM) markers [15,16]. HM-enriched (HME) genomic regions are widely used to define genomic signatures including two major types of TFBRs, an enhancer and a promoter [15,17]. For instance, H3K4me1 has been identified as an enhancer maker since genomic regions enriched with H3K4me1 significantly overlap with p300-defined enhancer regions; and H3K4me3 has been considered as a marker of active promoters since H3K4me3 enriched genomic regions significantly overlap with basal promoters defined by the binding of RNA polymerase II and TBP-associated factor 1 [15]. Different combinations of HMs are also discovered to distinguish different genomic regions. For example, H3K4me1 is used to distinguish active from poised enhancers when combined with H3K27ac and H3K27me3 respectively [18–20]. Compared with individual TF-based ChIP-seq experiments for TFBR identification, HM-based ChIP-seq experiments can potentially define the binding regions of all active TFs under a given condition. Recently, several studies [21–23] from the Encyclopedia of DNA Elements (ENCODE) project [13] have considered both TF-based and HM-based ChIP-seq experiments to study TFBRs. For instance, Wang et al. [23] predicted TF binding motifs in most of the 457 TF-based ChIP-seq datasets and studied the epigenomic and genomic patterns around the TFBSs of these motifs. Arvey et al. [21] explored the contribution of DNA sequence patterns, HMs, and open chromatin regions (OCRs) to cell-

http://dx.doi.org/10.1016/j.ygeno.2014.02.002 0888-7543/© 2014 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).

Y. Wang et al. / Genomics 103 (2014) 222–228

type-specific TF binding motifs of 67 TFs. Kundaje et al. [22] analyzed the patterns of HMs, OCRs, and nucleosome positioning around the summit of TF binding peaks and gene transcriptional start sites for 119 TFs. However, all these studies have not attempted to pinpoint the TFBRs of all active TFs under an experimental condition using a small number of HMs. It has thus not been systematically investigated that how frequently the TFBRs overlap HME regions and which HM patterns are the most effective to pinpoint the TFBRs of all active TFs under an experimental condition. To address these questions, we surveyed 149 TF-based and 33 HM-based ChIP-seq experiments in three cell lines (GM12878, K562, and HeLa-S3) from the ENCODE project [13]. For the convenience of explanation, we named the HME regions corresponding to a certain HM α as α regions. For example, an HME region corresponding to H3K4me1 was simplified as an H3K4me1 region. We observed that in the three cell lines, the average percentage of the TFBRs overlapping with the H3K4me2 regions is from 83% to 90%, and that of the H3K4me2 regions overlapping with the TFBRs is from 56% to 67%. Moreover, we noticed that the H3K4me2 regions with higher level of the H3K4me2 enrichment tend to more likely overlap with the TFBRs than the H3K4me2 regions with lower level of the H3K4me2 enrichment. Similarly, the TFBRs with stronger TF binding signal are more likely to overlap with the H3K4me2 regions than the TFBRs with weaker TF binding signal. In addition, we applied the knowledge learned above to identify the TFBRs on the genome-wide scale and found that the H3K4me2 regions together with the H3K27ac regions or OCRs can greatly reduce false positive predictions of the TFBRs. Our study suggests that computational methods may be developed in the future to systematically discover TFBSs of almost all active TFs under a condition with the H3K4me2 regions, especially when no known antibody for a TF is available.

2. Results 2.1. The majority of potential TFBRs overlap with the same six types of HM regions We first investigated the overlap of potential TFBRs and HME regions for each TF with each combination of 1 to 11 types of HME regions in two cell lines GM12878 and K562. Here the potential TFBRs were the original ChIP-seq peak regions defined by the ENCODE project [13,24] (see the Methods section). Two regions were claimed to overlap if they shared at least 1 bp. For every HM combination that is composed of 1, 2,…, 11 HMs, we calculated the percentages of the potential TFBRs that overlapped with its HME regions. We observed that when the number of HMs in a HM combination increases, the percentage of the potential TFBRs overlapping with the HM combination also increases (Supplementary Table S1). We also observed that the increased fraction (b0.1%) becomes negligible after six HMs in the combinations (Supplementary Fig. S1). Therefore, we further studied the combination of six types of HME regions. We found that the combination of the following 6 types of HME regions overlaps with the most number of the potential TFBRs in both GM12878 and K562 cells: H3K36me3, H3K9me3, H2AZ, H4K20me1, H3K27me3, and H3K4me1 regions. We called these six HMs as the covering HMs and the six types of HME regions as the covering HME regions. On average, for each TF, 92.93% and 96.88% of the potential TFBRs overlap with the covering HME regions in GM12878 and K562, respectively (Supplementary Table S1). We applied the hypergenometric testing to evaluate the significance of the overlap between the potential TFBR and the covering HME regions (see the Methods section). All TFs in GM2878 and K562 have small p-values (b10− 5) (Supplementary Table S2). This observation implies that ChIP-seq experiments with the covering HMs are sufficient to identify almost all the potential TFBRs under a specific condition.

223

The above observed percentage of overlap was calculated for each TF and then was averaged across all TFs. We also calculated the percentages of overlap between the potential TFBRs and HME regions for all TFs in each chromosome and then averaged across all chromosomes. We found that on average, more than 92.71% and 94.62% of the potential TFBRs overlap with the covering HME regions across all chromosomes in GM12878 and K562, respectively (Supplementary Table S2), with small p-values (b 10−5) of overlapping in these two cell lines. Therefore, the above overlap percentages are not affected by the different numbers of the potential TFBRs of different TFs. We then investigated whether alternative criteria of overlap affect the above observation. Alternatively, we used 10%, 20%, …, 100% of the length of the potential TFBRs as cutoffs to determine whether a potential TFBR overlapped with a HME region (Fig. 1). Interestingly, even when the cutoff was 100%, which meant that the potential TFBRs were completely within the HME regions, we still noticed that 90.98% and 95.13% of the potential TFBRs overlap with the covering HME regions in the two cells. So in the following, we only reported results of the overlap of at least 1 bp when we claimed the overlap of two regions. Next, we studied whether alternative definitions of the potential TFBRs affect the observed percentages. Instead of using the original ChIP-seq peaks defined by ENCODE as potential TFBRs [13], for each TF, we defined the ChIP-seq peak regions using CisGenome [25] and model-based analysis of ChIP-seq (MACS) [26]. CisGenome uses a fixed width window that is shifted through the genome to find enriched regions of ChIP-seq tags, while MACS empirically models the shift size of ChIP-Seq tags and uses it to improve the spatial resolution of predicted TFBSs. For the CisGenome-defined TFBRs, we observed that on average, for a TF, 90.97% and 91.33% of the potential TFBRs overlap with the covering HME regions in GM12878 and K562, respectively. Similarly, for the MACS-defined TFBRs, for a TF, we observed that the average percentages of the potential TFBRs overlapping with the covering HME regions are 92.73% and 91.65% in GM12878 and K562, respectively. Therefore, different definitions of the potential TFBRs do not change the conclusion: the majority of the potential TFBRs overlap with the covering HME regions. Finally, we checked whether the above conclusion holds in other cells. We used all available potential TFBR and HME regions in the HeLa-S3 cell line from the ENCODE project [13]. Consistently, we found that 97.45% of the potential TFBRs overlap with the covering HME regions in HeLa-S3 (Supplementary Table S1), with p-values of overlapping smaller than b10−5 for all TFs. In addition, adding additional types of HME regions does not increase this percentage much. This observation from the HeLa-S3 cell suggests that it is very likely to pinpoint the TFBRs for all TFs under a specific condition with ChIP-seq experiments only on the six HMs. 2.2. H3k4me2 reliably defines TFBRs The above analyses suggest, in order to identify TFBRs under a given condition, one can perform the six HM-based ChIP-seq experiments under this condition. However, the cost of these experiments is relatively high. Moreover, the number of the covering HME regions is so large that the identification of the TFBSs from these regions may be difficult. In addition, a large number of the potential TFBRs do not contain TFBSs and are thus false TFBRs [22,26]. We thus further investigated whether one type of HME regions already contained the majority of the TFBRs. To obtain the TFBRs, for each TF-based ChIP-seq experiment, we used the known or predicted motif PWMs to scan the potential TFBRs with different score cutoffs (see the Methods section). The potential TFBRs with highly scored DNA segments (scores larger than the cutoff) were defined as TFBRs. Note that we may have missed certain TFBRs, due to the lack of knowledge about cofactor motifs. However, with many ChIP-seq experiments, we hope to minimize the number of missed TFBRs. We considered all TFs with ChIP-seq experiments in the

224

Y. Wang et al. / Genomics 103 (2014) 222–228

Fig. 1. Overlap percentages of TFBRs and HME regions under different cutoffs. The overlap percentages between TFBRs and the six covering HME regions were calculated with the overlap cutoffs defined as 1 bp, and 10%, 20%,.., 100% of the length of TFBRs in GM12878, K562 and HeLa-S3 cells.

three cells, GM12878, K562, and HeLa-S3. For TFs without known motifs in the TRANSFAC database [27], we computationally defined their motifs using a recently developed tool Dreme [7] (Supplementary Table S3). For each TF, using three different score cutoffs, we defined TFBSs and TFBRs. Depending on the TF and the corresponding ChIPseq experiments, the number of the defined TFBRs varies dramatically (Supplementary Table S3). With the TFBRs, we calculated the percentages of the TFBRs overlapping with a specific type of HME regions for each TF and obtained an average percentage across all TFs. We found that despite the much shorter length of the H3K4me2 regions, the H3K4me2 regions overlap with the maximal percentages of the TFBRs under various score cutoffs in GM12878 and HeLa-S3 (Table 1, Supplementary Table S4). Although the H3K4me2 regions achieved the second maximal percentage of overlap with the TFBRs under various score cutoffs in K562, these percentages are close to the maximal percentages achieved by the H2AZ. Moreover, the number of the H2AZ regions is much larger than that of the H3K4me2 regions, and the length of the H2AZ regions is much longer than that of the H3K4me2 regions. Therefore, we focused on the H3K4me2 regions to study their overlap with the TFBRs. Under the score cutoff of 0.85, on average, 87.87%, 85.45%, and 89.66% of the TFBRs of a TF overlap with the H3K4me2 regions in GM12878, K562, and HeLa-S3, respectively. For all TFs, the overlap of the TFBRs and the H3K4me2 regions in the three cell lines was significant. Because current methods may not accurately define the boundary of HME regions [28,29], we further extended each HME region to 500 bp on both sides. We found that the H3K4me2 regions are still the best type of HME regions to narrow down the TFBRs. We also calculated the percentage of the TFBRs overlapping with each type of HME regions for each chromosome and averaged the percentages across all chromosomes. We had a consistent observation that the H3K4me2 regions usually overlap with the maximal percentage of the TFBRs under all cutoffs in all cells (Supplementary Table S4). Therefore, we concluded that ~90% of the TFBRs overlap with the H3K4me2 regions. This conclusion implies that for almost all TFs, no matter whether their TF-specific antibodies are available, one may be able to identify the majority of their TFBRs with the H3K4me2-based ChIP-seq experiment. We further investigated the location of the TFBSs relative to the H3K4me2 regions. We found that more than 96% of the TFBSs were in the H3K4me2 regions in the three cell lines, if the TFBRs overlap with the H3K4me2 regions. In addition, the TFBSs tend to locate in the middle of the H3K4me2 regions, where the highest level of H3K4me2 modification occurs (Fig. 2). This observation implies that one possible approach

to discover motifs of all active TFs under a condition is to perform the H3K4me2-based ChIP-seq experiments and then computationally identify TF binding motifs in the middle of the H3K4me2 regions. Focusing on the central H3K4me2 regions will certainly miss many TFBSs. However, since most TFBSs of a TF occur in the middle H3K4me2 regions, it is likely that the motifs of active TFs could be identified. We also compared the H3K4me2 regions with OCRs in terms of the percentages of overlapped the TFBRs. OCRs have been widely used to identify the TFBRs [30–32]. There are 195,014 (79,675), 239,971 (70,379), and 223,512 (81,013) OCRs (H3K4me2 regions) in GM12878, K562, and HeLa-S3 cells, respectively. We found that 88.46% (88.98%), 89.04% (89.98%), and 90.58% (88.61%) of the TFBRs overlap with OCRs (H3K4me2 regions) in the three cell lines, respectively. That is, the percentages of the TFBRs overlapping with OCRs are similar to those with the H3K4me2 regions, despite the much smaller number of the H3K4me2 regions. However, the percentages of the H3K4me2 regions overlapping the TFBRs were much larger than that of OCRs (Fig. 3). For instance, 56.22%, 67.10%, and 56.34% of the H3K4me2 regions, compared with 31.60%, 33.67%, 25.61% of OCRs, overlap with the TFBRs in GM12878, K562, and HeLa-S3, respectively. Since the average length of OCRs is about 300 bp while cis-regulatory regions have an average length of 600 bp or longer [9,33], we further extended the OCRs to 600, 800, and 1000 bp, respectively. We found that although the sum of lengths of the extended OCRs is much larger than that of the H3K4me2 regions, the percentages of the H3K4me2 regions overlapping the TFBRs were much larger than that of the extended OCRs (Supplementary Table S5). Therefore, the H3K4me2 regions may be more specific to the TFBSs and TFBRs than OCRs. 2.3. Non-overlapping TFBR and HME regions have relative weaker signals compared with overlapped ones We observed that a fraction of the above covering HME regions do not overlap with the TFBRs. Similarly, a small portion of the above TFBRs does not overlap with the covering HME regions or the H3K4me2 regions. Therefore, we investigated the differences between the overlapped regions and the non-overlapped regions. We compared the q-values of enrichment of HM signals in the covering HME regions that overlap with the TFBRs with those that do not overlap with the TFBRs. Here a q-value is the minimum false discovery rate at which an observed enrichment of HM signals in a HME region that is deemed significant. The q-values were defined by the ENCODE project [13]. The statistical significance of the HM signal difference is measured by the p-values from the Wilcoxon rank-sum test [34]. For each TF in the three cell lines, we found that the overlapped regions always have stronger HM signals (smaller q-values) than the nonoverlapping regions (Supplementary Table S6). The largest p-value for all TFs in the three cell lines is 4.36E − 12, which demonstrates that the majority of non-overlapping regions have weaker HM signals and thus may be false HME regions. This also implies that TF binding prefers to occur in covering HME regions with stronger HM signals than in covering HME regions with weaker HM signals. We also compared the q-values of enrichment of TF binding in the TFBRs that overlap with the covering HME regions and those in the TFBRs that do not overlap (Supplementary Table S6). For 115 out of 149 TF-based ChIP-seq experiments in the three cell lines, the overlapping the TFBRs have significantly more TF binding signals than the non-overlapping TFBRs (Wilcoxon rank-sum test p-value b 0.05 [34]). For the remaining 35 experiments, there are two cases. One is the 31 ChIP-seq experiments in which the number of non-overlapping regions is too small (b 20) for effective comparison of the non-overlapping regions with the overlapping regions. The other is the 3 ChIP-seq experiments for TFs GATA1 and ZNF274 in K562 and GTF3C2 in HeLa-S3. The three exceptions may be caused by the quality of the antibodies, because there is another ChIP-seq experiment for GATA1 using a different antibody, which has a smaller p-value of 0.035.

Y. Wang et al. / Genomics 103 (2014) 222–228

225

Table 1 The overlapped percentage of TFBRs with H3K4me2 regions. Cutoffs

H3K4me2 regions Extended H3K4me2 regions

GM12878

K562

HeLa-S3

0.8

0.85

0.9

0.8

0.85

0.9

0.8

0.85

0.9

87.13% 88.29%

87.87% 88.98%

90.48% 91.41%

85.03% 86.16%

85.45% 86.54%

87.76% 89.02%

89.43% 90.51%

89.66% 90.71%

90.14% 91.15%

Similarly, we compared the q-values of enrichment of TF binding in the TFBRs that overlap with the H3K4me2 regions with those in the TFBRs that do not overlap (Supplementary Table S6). For 132 out of 149 TF-based ChIP-seq experiments in the three cell lines, the overlapping TFBRs have more TF binding signals than the non-overlapping TFBRs (Wilcoxon rank-sum test p-value b 0.05). For the remaining 17 experiments, 7 experiments have a small number of non-overlapping regions that is smaller than 20; and 10 experiments for the TFs GATA1, ZNF274, ZNF263, E2F4, STAT2 and MYC in K562, and SMARCC1, E2F6, ZNF274 in HeLa-S3 have large p-values. Again, we observed that three ChIP-seq experiments for the TFs GATA1, MYC, and STAT2 had a much smaller p-value (b0.014) when alternative antibodies were used. We further compared the TFBRs that overlap with the H3K4me2 regions with the TFBRs that do not overlap with the H3K4me2 regions. Consistently, we observed that almost in all ChIP-seq experiments, the overlapped TFBRs have much more TF binding signals than the nonoverlapped regions (Supplementary Table S7). 2.4. H3k4me2 regions together with H3K27ac or FAIRE OCRs significantly reduce false positive prediction of TFBRs Observing that the majority of the TFBRs overlap with the H3K4me2 regions, we attempted to discover the TFBRs containing the TFBSs of a TF directly in the H3K4me2 regions using the known TF PWMs. We studied three TFs, CHREBP, FOXA1, and SREBF1, whose TFBRs are experimentally determined in HepG2 cells [35–37]. We first studied whether the TFBRs defined by TF-based ChIP-seq experiments for the three TFs overlap with the H3K4me2 regions in HepG2. We found that 95.49% (1100 out of 1152) of CHREBP, 87.56% (7156 out of 8173) of FOXA1, and 91.74% (955 out of 1041) of SREBF1 potential TFBRs overlap with the H3K4me2 regions in HepG2. With the score cutoff of 0.85 to define the TFBSs and TFBRs, we found that 97.17% of CHREBP, 89.71% of FOXA1, and 93.47% of SREBF1 TFBRs are

inside the H3K4me2 regions. Therefore, consistent with the above three cell lines, the majority of the TFBRs overlap with the H3K4me2 regions in HepG2 cells. Since the number of the H3K4me2 regions is much larger than that of the TFBRs of a TF, we next attempted to predict the TFBRs of the three TFs in HepG2 from the H3K4me2 regions. We defined the H3K4me2 regions containing putative TFBSs of a TF as the predicted TFBRs. We then wanted to select the predicted H3K4me2 regions that overlap with the TFBRs of this TF by using the following eight features (Table 2): H3K4me1, OCRs from DNase I hypersensitive experiments (DNase I OCRs), OCRs from FAIRE experiments (FAIRE OCRs), OCRs from synthesized DNase I hypersensitive sites and FAIRE (synthesized OCRs), H3K27me3, H3K27ac, DNA methylation sites, and signal strength of the H3K4me2 regions. Here the synthesized OCRs are the combined high quality OCRs from the DNase I OCRs, the FAIRE OCRs, and others by the ENCODE project [13]. We found that H3K27ac together with the H3K4me2 regions can significantly reduce the false positive prediction of the TFBRs (Fig. 4). For instance, for the TF SREBF1, only 26,431 out of 51,245 of the predicted TFBRs (H3K4me2 regions containing the putative SREBF1 TFBSs) overlap with the H3K27ac regions, which showed higher specificity and sensitivity than other features (Supplementary Table S8). We also noticed that FAIRE OCRs together with the H3K4me2 regions can achieve similar specificity and sensitivity. Finally, we studied all TFs with TF-based ChIP-seq experiments in GM12878 and K562. On average, for a TF, the overlap required of the H3K4me2 regions and H3K27ac regions had the specificity of 35.46% and 35.36% and the high sensitivity of 91.64% and 91.98% in GM12878 and K562, respectively (Supplementary Table S8), Consistently, we observed that the overlap of the H3K4me2 regions and FAIRE OCRs can achieve the specificity 48.95% and 32.18%, and the sensitivity of 84.11%, 93.68% in GM12878 and K562, respectively. Combining these two features H3K27ac and FAIRE OCRs, we can have the specificity of 59.07% and 46.97%, and the sensitivity of 79.36% and 88.56% in GM12878 and K562, respectively. 3. Discussion

Fig. 2. TFBSs tend to locate in the middle of H3K4me2 regions. The percentages of TFBSs that fall into different subregions in H3K4me2 regions were shown, where each H3K4me2 region was normalized to have a unit length.

We systematically studied the overlap of the TFBR and HME regions in the three cell lines. We observed that more than 92% of the potential TFBRs overlap with the covering HME regions in different cell lines. The potential TFBRs that do not overlap with the covering HME regions have significantly weaker TF binding signals compared with the overlapped potential TFBRs. We also noticed that on average close to 90% of the TFBRs overlap with the H3K4me2 regions under various parameters in the three cell lines. The TFBSs tend to be located in the middle of the H3K4me2 regions. Finally, we showed that the H3K4me2 regions together with the H3K27ac regions or FAIRE OCRs can significantly reduce false positive predictions of the TFBRs. Our study shed light on the systematic studies of the TFBRs for all active TFs under a specific condition. Our study is different from that in ChromHMM [38] and Segway [39]. The ultimate goal of our study is to narrow down the TFBRs of a TF with only one histone modification. ChromHMM and Segway used combinations of multiple histone modifications to divide the genomes into different regions with different functional characteristics. With that said, ChromHMM and Segway do not aim to cover most of the TFBRs of a TF by one state. They also did not analyze the overlap of their predicted states with the TFBRs of all TFs with ChIP-seq data

226

Y. Wang et al. / Genomics 103 (2014) 222–228

Fig. 3. Comparison of H3K4me2 regions with OCRs. (A)–(C), the percentages of H3K4me2 regions, FAIRE OCRs, DNase I OCRs, and OCRs overlapping with TFBRs of a given number of TFs in GM12878, K562 and HeLa-S3 cells. OCRs are also called synthesized OCRs by the ENCODE project. The x-axis gives the number of TFs used for calculating the percentages. (D)–(F), for each TF in GM12878, K562 and HeLa-S3 cells, the percentages of TFBRs overlap with H3K4me2 regions, FAIRE OCRs, DNase I OCRs, and OCRs. The x-axis provides the TF index number used in calculating the percentages.

from the ENCODE project, as it is not their goal. The predicted TFBRs of a given TF in our study may be a subset of the combination of multiple chromatin states predicted by ChromHMM or Segway. For example, in cell lines GM12878 and K562, on average, about 96.61% and 94.45% TFBRs predicted from the H3K4me2 regions overlapped with predicted (active/weak/poised) promoters or enhancers (states 1–7). In addition to the factors mentioned in the Results section that may affect the overlap percentage of the potential TFBR and HME regions, we also checked how different definitions of HME regions affect the overlap percentages. Among the several available tools we tried, we managed to define HME regions by another tool, SICER [29]. The overlap percentage was reduced by 12.85% and 9.13% in the GM12878 and K562, respectively. However, the majority (more than 88% in K562) of the potential TFBRs still overlap with the covering HME regions. The above overlap percentages are robust in terms of different cell types. In addition to the above analyses, we studied all other ENCODE cell lines with at least 10 TF-based ChIP-seq datasets and the 11 HMbased ChIP-seq datasets. HepG2 and H1-hESC are the only two other cell lines. On average, the same covering HME regions overlap with more than 92% of the potential TFBRs of a TF in these two cell lines. The above overlap percentages may be underestimated. This is because current computational methods still cannot perfectly define the boundary of HME regions. We may thus need to extend the current HME regions to calculate the overlap percentages, as what was shown for the extended H3K4me2 regions. Moreover, variation exists among similar or replicated ChIP-seq experiments. For instance, a recent ChIP-seq experiment using the antibody against the TF CTCF only recovered 94.5% of the previously reported CTCF TFBRs in the same cell [40]. Thus, the overlap of 96.88% in K562 may already represent the 100%

overlap. In addition, the quality of the antibody used may decrease the overlap percentages. As mentioned above, for several TFs, the overlap percentage is significantly larger when using one antibody than another antibody. Finally, many non-overlapping regions may be incorrectly defined as the TFBR or HME regions, as we have shown that nonoverlapping regions have weaker signals compared with overlapping regions. Our comparison of OCRs with the H3K4me2 regions shed light on how to systematically identify the TFBRs under a given condition. OCRs are widely used to map the regulatory regions, which include both the TFBRs and others. Our analysis shows that the H3K4me2 regions are more specific to the TFBRs than OCRs. It is thus more appealing to carry out one ChIP-seq experiment using an antibody against H3K4me2 instead of two experiments to define the DNase I hypersensitive OCRs and FAIRE OCRs. However, it is also evident that the number of H3K4me2 regions is much larger than the number of the TFBRs of a TF. Computational methods [44,45] should thus be developed to identify true TFBRs by integrating the H3K4me2 signals and the genomic sequence information within the large number of H3K4me2 regions. 4. Methods 4.1. TF-based and HM-base ChIP-seq data We downloaded ChIP-seq data from the hg19 version of ENCODE production data at the University of California Santa Cruz (UCSC) genome browser [41]. Each peak dataset contains the peak location, the q-value, and the mapped reads of each peak in the dataset. The q-value of a peak here is the minimum false discovery rate if the

Y. Wang et al. / Genomics 103 (2014) 222–228

227

Fig. 4. Reduce false positive predicted TFBRs using H3K27ac regions. For each TF, the H3K4me2 regions containing its putative TFBSs were defined as predicted TFBRs. We then used the predicted TFBRs that overlap with H3K27ac regions as the final predicted TFBRs. TP: true positive; FP: false positive; FN: false negative; TN: true negative.

observed number of mapped read in this peak is claimed to be significant, which describes the statistical significance of the enrichment of a HM or a TF binding in the peak. The q-values were defined by the ENCODE project [13]. All available ChIP-seq peak data for HMs were downloaded from ENCODE/Broad Institute (hg19). There were 11 HM markers for each of the three cell lines, GM12878, K562 and HeLa-S3. These 11 HM markers are H3K9me3, H3K4me2, H3k27ac, H3K4me1, H3K9ac, H3K4me3, H2A.Z, H4K20me1, H3K27me3, H3K36me3, and H3K79me2. For the HepG2 cell, we downloaded the H3K4me2 ChIP-seq peaks from ENCODE/ Broad Institute (hg19) as well. We called genomic regions corresponding to these ChIP-seq peak HME regions. We also downloaded the mapped reads from these ChIP-seq experiments. We then defined an alternative set of ChIP-seq peaks by using the tool SICER with the default parameters [29]. All available ChIP-seq peak data for the TFs in the above three cell lines were downloaded from ENCODE/SYDH TFBS (hg19) at the UCSC genome browser. With different treatments and controls, there are in total 36, 70, 43 ChIP-seq peak datasets, corresponding to 35, 51, and 41 TFs, in GM12878, K562, and HeLa-S3, respectively. We called genomic regions corresponding to these ChIP-seq peak potential TFBRs. We also downloaded the mapped reads data from these ChIP-seq experiments. We defined two sets of alternative ChIP-seq peaks by using the tools CisGenome [25] and MACS [26] with the default parameters, respectively. For the three TFs in the HepG2 cells, CHREBP, FOXA1, and SREBF1, we downloaded their ChIP-seq data from the published papers [37,42,43]. 4.2. OCRs and other data We downloaded OCRs from the hg19 version of ENCODE production data at the UCSC genome browser [41]. We used the Duke DNase I hypersensitive sites as DNase I OCRs. We used the UNC FAIRE as FAIRE OCRs. We used the “Open Chrom Synth” as the synthesized OCRs. Similarly, we used DNA methylation data from the ENCODE project

generated by HudsonAlpha (hg19). We used all TF motifs defined in the TRANSFAC 9.2 database [27]. In total, 25 TFs have known motifs in TRANSFAC in GM12878, K562, and HeLa-S3 cells. 4.3. Identification of motifs and TFBSs For TFs with no known motifs, we applied a motif discovery tool DREME [7] to the top 100 TF ChIP-seq peaks. We used the top one motif predicted by DREME that was not a simple-repeat-like pattern as the defined motif of the corresponding TF. We used the predicted PWMs to represent motifs. To define TFBSs, we scanned DNA sequences corresponding to the TF ChIP-seq peaks (we called them the potential TFBRs) with the corresponding known or defined motif PWMs. For a motif PWM, say M = (mij)k × 4 where k is the motif length, we calculated the score of a   k 4 ∑i¼1 ∑ j¼1 mij Isi¼ j − min mij j   . Here DNA segment s1s2 … as follows: k ∑i¼1 max mij − min mij j

j

Isi¼ j is an indicator function, which is 1 when si is the j-th type of nucleotides and 0 when it is not. A DNA segment of length k is called a TFBS if its score is larger than a score cutoff. We used three different cutoffs, 0.80, 0.85, and 0.90, to define TFBSs in the potential TFBRs of a TF. A potential TFBR containing at least one TFBS is defined as a TFBR. 4.4. Calculation of overlapping p-value We calculated the significance of the overlapping percentage of two types of regions by the hypergeometric testing. This testing has four parameters, N, M, n, and m. N is the number of individuals in a population, M is the number of individuals with a certain feature, n is the number of chosen individuals, and m is the number of chosen individuals with the feature. For the overlap of the covering HME regions and the potential TFBRs of a TF, we used the sum of the number of all 11 types of HME regions and the number of potential TFBRs of all TFs as N, the number of the covering HME regions as n, the number of potential TFBRs of a TF

228

Y. Wang et al. / Genomics 103 (2014) 222–228

as M, and the number of the covering HME regions overlapping with the potential TFBRs of the TF under consideration as m. For the overlap of the H3K4me2 regions and the TFBRs of a TF, N and M are similar as above, n is the number of the H3K4me2 regions, and m is the H3K4me2 regions overlapping with the TFBRs of the TF under consideration. Supplementary data to this article can be found online at http://dx. doi.org/10.1016/j.ygeno.2014.02.002. Acknowledgments XL thanks Dr. Tim Osborne for the valuable comments and discussion. This work was supported by the National Science Foundation (grants 1149955, 1125676, and 1218275). Funding for the open access charge came from the National Science Foundation (grant 1149955). References [1] X. Zhou, E.K. O'Shea, Integrated approaches reveal determinants of genome-wide binding and function of the transcription factor Pho4, Mol. Cell 42 (2011) 826–836. [2] G.D. Stormo, DNA binding sites: representation and discovery, Bioinformatics 16 (2000) 16–23. [3] M.I. Arnone, E.H. Davidson, The hardwiring of development: organization and function of genomic regulatory systems, Development 124 (1997) 1851–1864. [4] D.S. Johnson, A. Mortazavi, R.M. Myers, et al., Genome-wide mapping of in vivo protein–DNA interactions, Science 316 (2007) 1497–1502. [5] G. Robertson, M. Hirst, M. Bainbridge, et al., Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing, Nat. Methods 4 (2007) 651–657. [6] V.X. Jin, J. Apostolos, N.S. Nagisetty, et al., W-ChIPMotifs: a web application tool for de novo motif discovery from ChIP-based high-throughput data, Bioinformatics 25 (2009) 3191–3193. [7] T.L. Bailey, DREME: motif discovery in transcription factor ChIP-seq data, Bioinformatics 27 (2011) 1653–1659. [8] M. Hu, J. Yu, J.M. Taylor, et al., On the detection and refinement of transcription factor binding sites using ChIP-Seq data, Nucleic Acids Res. 38 (2009) 2154–2167. [9] X. Cai, L. Hou, N. Su, et al., Systematic identification of conserved motif modules in the human genome, BMC Genomics 11 (2010) 567. [10] J. Ding, X. Cai, Y. Wang, et al., Chipmodule: systematic discovery of transcription factors and their cofactors from chip-seq data, Pac. Symp. Biocomput. (2013) 320–331. [11] I.V. Kulakovskiy, V.A. Boeva, A.V. Favorov, et al., Deep and wide digging for binding motifs in ChIP-Seq data, Bioinformatics 26 (2010) 2622–2623. [12] M. Thomas-Chollier, C. Herrmann, M. Defrance, et al., RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets, Nucleic Acids Res. 40 (2012) e31. [13] I. Dunham, A. Kundaje, S.F. Aldred, et al., An integrated encyclopedia of DNA elements in the human genome, Nature 489 (2012) 57–74. [14] J.A. Stamatoyannopoulos, What does our genome encode? Genome Res. 22 (2012) 1602–1611. [15] N.D. Heintzman, R.K. Stuart, G. Hon, et al., Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome, Nat. Genet. 39 (2007) 311–318. [16] I. Chepelev, G. Wei, D. Wangsa, et al., Characterization of genome-wide enhancer– promoter interactions reveals co-expression of interacting genes and modes of higher order chromatin organization, Cell Res. 22 (2012) 490–503. [17] A. Barski, S. Cuddapah, K. Cui, et al., High-resolution profiling of histone methylations in the human genome, Cell 129 (2007) 823–837. [18] M.P. Creyghton, A.W. Cheng, G.G. Welstead, et al., Histone H3K27ac separates active from poised enhancers and predicts developmental state, Proc. Natl. Acad. Sci. 107 (2010) 21931–21936. [19] A. Rada-Iglesias, R. Bajpai, T. Swigut, et al., A unique chromatin signature uncovers early developmental enhancers in humans, Nature 470 (2010) 279–283. [20] G. Wei, L. Wei, J. Zhu, et al., Global mapping of H3K4me3 and H3K27me3 reveals specificity and plasticity in lineage fate determination of differentiating CD4 bsup N+b/sup NT cells, Immunity 30 (2009) 155–167.

[21] A. Arvey, P. Agius, W.S. Noble, et al., Sequence and chromatin determinants of cell-type-specific transcription factor binding, Genome Res. 22 (2012) 1723–1734. [22] A. Kundaje, S. Kyriazopoulou-Panagiotopoulou, M. Libbrecht, et al., Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements, Genome Res. 22 (2012) 1735–1747. [23] J. Wang, J. Zhuang, S. Iyer, et al., Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors, Genome Res. 22 (2012) 1798–1812. [24] M.B. Gerstein, A. Kundaje, M. Hariharan, et al., Architecture of the human regulatory network derived from ENCODE data, Nature 489 (2012) 91–100. [25] H. Ji, H. Jiang, W. Ma, et al., An integrated software system for analyzing ChIP-chip and ChIP-seq data, Nat. Biotechnol. 26 (2008) 1293–1300. [26] Y. Zhang, T. Liu, C.A. Meyer, et al., Model-based analysis of ChIP-Seq (MACS), Genome Biol. 9 (2008) R137. [27] E. Wingender, P. Dietze, H. Karas, et al., TRANSFAC: a database on transcription factors and their DNA binding sites, Nucleic Acids Res. 24 (1996) 238–241. [28] Q. Song, A.D. Smith, Identifying dispersed epigenomic domains from ChIP-Seq data, Bioinformatics 27 (2011) 870–871. [29] C. Zang, D.E. Schones, C. Zeng, et al., A clustering approach for identification of enriched domains from histone modification ChIP-Seq data, Bioinformatics 25 (2009) 1952–1958. [30] P.G. Giresi, J. Kim, R.M. McDaniell, et al., FAIRE (formaldehyde-assisted isolation of regulatory elements) isolates active regulatory elements from human chromatin, Genome Res. 17 (2007) 877–885. [31] M.T. Maurano, R. Humbert, E. Rynes, et al., Systematic localization of common disease-associated variation in regulatory DNA, Science 337 (2012) 1190–1195. [32] L. Song, Z. Zhang, L.L. Grasfeder, et al., Open chromatin defined by DNaseI and FAIRE identifies regulatory elements that shape cell-type identity, Genome Res. 21 (2011) 1757–1767. [33] M. Blanchette, A.R. Bataille, X. Chen, et al., Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression, Genome Res. 16 (2006) 656–668. [34] F. Wilcoxon, Individual comparisons by ranking methods, Biom. Bull. 1 (1945) 80–83. [35] M.K. Bennett, Y.-K. Seo, S. Datta, et al., Selective binding of sterol regulatory element-binding protein isoforms and co-regulatory proteins to promoters for lipid metabolic genes in liver, J. Biol. Chem. 283 (2008) 15628–15637. [36] T.-I. Jeon, T.F. Osborne, SREBPs: metabolic integrators in physiology and metabolism, Trends Endocrinol. Metab. 23 (2012) 65–72. [37] Y.-S. Jeong, D. Kim, Y.S. Lee, et al., Integrated expression profiling and genome-wide analysis of ChREBP targets reveals the dual role for ChREBP in glucose-regulated gene expression, PLoS One 6 (2011) e22544. [38] J. Ernst, M. Kellis, ChromHMM: automating chromatin-state discovery and characterization, Nat. Methods 9 (2012) 215–216. [39] M.M. Hoffman, O.J. Buske, J. Wang, et al., Unsupervised pattern discovery in human chromatin structure through genomic segmentation, Nat. Methods 9 (2012) 473–476. [40] Y. Shen, F. Yue, D.F. McCleary, et al., A map of the cis-regulatory sequences in the mouse genome, Nature 488 (2012) 116–120. [41] W.J. Kent, C.W. Sugnet, T.S. Furey, et al., The human genome browser at UCSC, Genome Res. 12 (2002) 996–1006. [42] M. Motallebipour, A. Ameur, M.S. Reddy Bysani, et al., Differential binding and co-binding pattern of FOXA1 and FOXA3 and their relation to H3K4me3 in HepG2 cells revealed by ChIP-seq, Genome Biol. 10 (2009) R129. [43] B.D. Reed, A.E. Charos, A.M. Szekely, et al., Genome-wide occupancy of SREBF1 and its partners NFY and SP1 reveals novel functional roles and combinatorial regulation of distinct classes of genes, PLoS Genet. 4 (2008) e1000133. [44] J. Ding, H. Hu, X. Li, SIOMICS: a Novel Approach for Systematic Identification of Motifs in ChIP-seq Data. Nucleic Acids Research. (2013), http://dx.doi.org/10.1093/ nar/gkt1288. [45] R.P. Pique-Regi, J.F. Degner, A.A. Pai, D.G. Gaffney, Y. Gilad, J.K. Pritchard, Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data, Genome Research. 21 (3) (2011 Mar) 447-55.

H3K4me2 reliably defines transcription factor binding regions in different cells.

Histone modification (HM) patterns are widely applied to identify transcription factor binding regions (TFBRs). However, how frequently the TFBRs over...
956KB Sizes 0 Downloads 0 Views