Horm Mol Biol Clin Invest 2012;10(1):207–210 © 2012 by Walter de Gruyter • Berlin • Boston. DOI 10.1515/hmbci-2012-0012

Genomic and bioinformatics tools to understand the biology of signal transducers and activators of transcription

Keunsoo Kang and Lothar Hennighausen* Laboratory of Genetics and Physiology, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, USA

Abstract The signal transducer and activator of transcription (STAT) family is activated by cytokines and conveys biochemical signals to the genome through binding to specific regulatory sequences, called IFN-γ-activated sequence (GAS) motifs. As common GAS motifs (TTCnnnGAA) contain only six conserved nucleotides, the mammalian genome harbors hundreds of thousands of copies of this sequence. However, it is not possible to predict from sequence alone which specific GAS motifs are bound by STATs and are of functional significance. Here, we apply several layers of statistical, bioinformatics and experimental analyses to narrow down the number of GAS sites that might be of biological relevance. In particular, we determined the number of bona fide GAS motifs by utilizing publicly available genome-wide STAT5 ChIP-seq data sets. Fewer than 10% of GAS motifs within the mouse genome are recognized by STAT5 in vivo, and only a small portion of them are shared across different cell types. However, even bona fide STAT5 binding did not predict that the respective gene was under cytokine-STAT control. Therefore, additional bioinformatics, genomic and epigenetic parameters, such as patterns of histone modifications, are required to more reliably predict the behavior of cytokine-STAT regulatory networks. Keywords: GAS motif; STAT5.

Introduction The signal transducers and activators of transcription (STAT) family consists of seven transcription factors (TFs) called STAT1, STAT2, STAT3, STAT4, STAT5A, STAT5B and STAT6. STATs are activated by cytokines, and they activate genetic programs upon their translocation from the cytoplasm to the nucleus [1, 2]. STATs bind to palindromic DNA motifs called IFN-γ-activated sequences (GAS), which can induce transcription of neighboring genes.

*Corresponding author: Dr. Lothar Hennighausen, National Institutes of Health, 8 Center Drive, Bethesda, MD 20892, USA E-mail: [email protected] Received February 24, 2012; accepted March 1, 2012

Whereas STAT6 specifically binds a nucleotide consensus motif of TTCnnnnGAA (four-nucleotide gap), the GAS sites for other STATs consist of a 9-bp palindrome with a 3-bp gap (TTCnnnGAA) [3, 4]. Statistically, a GAS motif is expected once every 4096 bp, because six nucleotide positions are fixed, and a total of approximately 580,000 STAT binding motifs can be predicted to exist in the repeat-masked mouse genome. At this point, it is not clear how many GAS motifs are associated with regulatory regions in the genome or what percentage is recognized by STATs to control neighboring genes at any particular time point in any given cell. Here, we use genomic and ChIP-seq data to assess the capacity of STATs to functionally access GAS motifs in the genome.
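The "every 4096 bp" figure can be reproduced with a short calculation. The block below is a back-of-the-envelope sketch that assumes equal base frequencies and independence between positions (a simplification that real genomes do not satisfy); G denotes the genome length, and the value 2.7 × 10^9 is the full genome size quoted in the next section rather than the smaller repeat-masked length.

```latex
\[
P(\text{TTCnnnGAA at a given position})
  = \left(\tfrac{1}{4}\right)^{6} = \frac{1}{4096},
\qquad
E[\text{number of motifs}] \approx \frac{G}{4096}
  = \frac{2.7 \times 10^{9}}{4096} \approx 6.6 \times 10^{5}.
\]
```

Because the reverse complement of TTCnnnGAA again matches TTCnnnGAA, scanning one strand already covers both strands, so no factor of two is needed. The estimate is of the same order as the count reported for the repeat-masked genome.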

Genome-wide distribution of GAS motifs The mouse genome is composed of over 2.7 billion nucleotides, and more than 580,000 GAS motifs (TTCnnnGAA or TTCnnnnGAA) have been identified in the repeat-masked genome (Figure 1). This number increases to more than 8 million when one mismatch is allowed. Among the 4096 possible combinations of a 6-bp motif, GAS motifs with a 3-bp spacer are ranked 528th by frequency (Figure 1A, left panel). Similarly, GAS motifs with a 4-bp spacer are ranked 782nd out of 4096 (Figure 1, middle panel). By comparison, the strong binding motif of Yin Yang-1 (YY1), consisting of an 8-bp consensus sequence with one gap (CGCCATnTT), occurs < 15,000 times in the genome (Figure 1, right panel) [5]. Overall, GAS motifs are abundant genomic elements.

We next examined whether GAS motifs are distributed randomly throughout the genome or whether they are enriched in transcriptional units. Transcriptional control of genes is exerted to a large extent by common and cell-specific TFs and is further modulated by distinct histone modifications [6–8]. In general, TFs bind to promoter-proximal and enhancer sequences and control recruitment and function of the RNA polymerase-containing transcriptional machinery. Out of approximately 580,000 GAS sites (3-bp gap), 8% are located within the 10 kbp of promoter/upstream sequence of the approximately 30,000 genes (Figure 2, left panel). This frequency is statistically expected, as promoter sequences defined in this way account for approximately 10% of the genome. Based on sequence analyses, the expected number of GAS motifs per 1 kb of promoter regions is 0.36, which is slightly less than that predicted for gene bodies (0.41) and intergenic regions (0.42) (Figure 2, right panel). Therefore, there is no evidence that GAS motifs are significantly enriched in gene promoter sequences.
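The motif counts and per-kb densities described above can be illustrated with a minimal Python sketch. It is not the pipeline used for the figures, which the text does not describe. The function names `find_gas` and `motifs_per_kb` are hypothetical, and the input is assumed to be an uppercase sequence in which repeat-masked bases appear as `N` or in lowercase, so that they are skipped.

```python
import re

# Lookahead patterns, so that overlapping occurrences are counted as well.
# Only uppercase A/C/G/T match; masked bases (N or lowercase) never do.
GAS_3 = re.compile(r"(?=TTC[ACGT]{3}GAA)")   # 3-bp spacer (most STATs)
GAS_4 = re.compile(r"(?=TTC[ACGT]{4}GAA)")   # 4-bp spacer (STAT6)


def find_gas(seq, pattern=GAS_3):
    """Return 0-based start positions of all motif matches in seq."""
    return [m.start() for m in pattern.finditer(seq)]


def motifs_per_kb(seq, pattern=GAS_3):
    """Motif density per 1000 unmasked (uppercase A/C/G/T) bases."""
    unmasked = sum(seq.count(base) for base in "ACGT")
    if unmasked == 0:
        return 0.0
    return 1000.0 * len(find_gas(seq, pattern)) / unmasked


if __name__ == "__main__":
    toy = "AATTCCTGGAAGG" + "NNNN" + "TTCAAAAGAA"   # made-up sequence
    print(find_gas(toy), find_gas(toy, GAS_4))
```

Since both patterns are identical to their reverse complements, scanning the forward strand is sufficient. Allowing one mismatch, as mentioned in the text, would need a different matching strategy and is not covered here.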


Kang and Hennighausen: Genomic and bioformatics tools to understand STAT biology

Figure 1 Statistics of GAS sites in the mouse genome. (A) Statistics of GAS- (3nt gap or 4nt gap, ‘n’ indicates a single gap) and YY1-like motifs. GAS and YY1 motifs were ranked by frequency with possible combinations of 4096 GAS-like and 131,072 YY1-like sequences in the repeat-masked mouse genome. Asterisk and sharp marks (red) indicate motif sequence and rank, respectively.

Unique and overlapping STAT5 binding to GAS motifs in fibroblasts and T cells Chromatin immunoprecipitation combined with high-throughput sequencing (ChIP-seq) has been widely used to accurately determine the binding of specific proteins to chromatin in vivo [6, 7]. To calculate the number and percentage of GAS motifs that are recognized by STAT5 in different cell types in vivo, two ChIP-seq data sets were downloaded from the Gene Expression Omnibus database (GEO, http://www.ncbi.nlm.nih.gov/geo/) [9, 10] and processed with the MACS program [11]. A total of 18,241 and 58,245 STAT5 peaks were identified in T cells (TH1) (+IL-2, interleukin-2) [9] and mouse embryonic fibroblasts (MEFs) [+growth hormone (GH)] [10], respectively. Out of 582,685 GAS sites, 9986 (2%) and 27,266 (5%) GAS motifs were bound by STAT5 in TH1 cells and MEFs, respectively (Figure 3, left panel). This demonstrates that only a small percentage of GAS motifs are recognized by STAT5 at the given time point. It can be speculated that the majority of GAS sites in the genome are physically inaccessible to STATs. STAT5 peaks that contain at least one GAS motif can be regarded as bona fide STAT5 peaks. Whereas 8% of GAS sites are found in promoter sequences, 15% of STAT5 peaks were associated with GAS motifs in these regions (Figure 3, middle panel), which points to preferential STAT5 binding to putative regulatory sequences. Notably, the enrichment for binding to GAS sites in promoter sequences was more prominent in T cells than in MEFs. Because STATs transmit signals from the membrane to the genome, their target genes (or sites) may vary according to cell context [1, 2, 12]. This was confirmed by a comparison of bona fide STAT5 peaks between TH1 cells and MEFs (Figure 3, right panel). Whereas 22% of bona fide STAT5 peaks within promoter regions were shared between both cell types, only 7% of bona fide peaks in intergenic regions were common to both cell types. Features shared across species boundaries can indicate biological significance. The enrichment of STAT5-bound GAS sites within promoters and their shared occupancy in different cell types suggest that these elements contribute to the regulation of the respective genes.
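The intersection of GAS motif positions with ChIP-seq peaks can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure. The function names are hypothetical, coordinates are taken to be on a single chromosome with half-open peak intervals, and a motif is counted as bound only if it lies entirely within a peak (the text does not state the overlap criterion that was used).

```python
from bisect import bisect_left, bisect_right


def intersect_motifs_with_peaks(motif_starts, peaks, motif_len=9):
    """Return (bound_motifs, bona_fide_peaks) for one chromosome.

    motif_starts: iterable of 0-based motif start positions
    peaks:        list of (start, end) half-open peak intervals
    A motif is "bound" if it lies entirely inside at least one peak;
    a peak is "bona fide" if it contains at least one motif.
    """
    starts = sorted(motif_starts)
    bound = set()
    bona_fide = []
    for p_start, p_end in peaks:
        lo = bisect_left(starts, p_start)             # first motif starting in peak
        hi = bisect_right(starts, p_end - motif_len)  # last motif ending in peak
        if hi > lo:
            bona_fide.append((p_start, p_end))
            bound.update(starts[lo:hi])               # set avoids double counting
    return bound, bona_fide


def percent(part, whole):
    """Percentage of part in whole."""
    return 100.0 * part / whole


if __name__ == "__main__":
    # Counts quoted in the text: bound GAS motifs out of 582,685 sites.
    print(round(percent(9986, 582685), 1))    # TH1 cells
    print(round(percent(27266, 582685), 1))   # MEFs
```

Applied to the counts quoted above, `percent` gives about 1.7% and 4.7%, which round to the reported 2% and 5%.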

STAT5 target genes regulated by growth hormone in mouse embryonic fibroblasts Although transcription factor binding to promoter sequences presumably regulates expression of the respective genes, this assumption has not been rigorously tested. We recently performed a study aimed at correlating genome-wide STAT5 binding to cytokine-induced gene expression [10].

Figure 2 Genomic distribution of GAS motifs. The distribution of GAS (3nt gap) motifs is estimated according to gene annotation: promoter (–10 kb to the transcription start site), gene body and intergenic (left). The average number of GAS motifs was calculated per 1 kb of non-overlapping windows in each gene annotation (right).


[Figure 3 panel labels. Left: GAS motifs coinciding with STAT5 peaks (n=582,685), y-axis in %; 9986 for STAT5 (TH1, +IL-2) and 27,266 for OE-STAT5 (MEF, +GH). Middle: Distribution of bona fide STAT5 peaks (n=7858; n=21,240) across promoter, gene body and intergenic regions, y-axis 0–100%. Right: Shared and unique bona fide STAT5 peaks in promoter and intergenic regions (unique to STAT5 (TH1, +IL-2), overlap, unique to OE-STAT5 (MEF, +GH)), x-axis: number of peaks.]

Figure 3 Comparison of STAT5 binding sites. Two different STAT5 ChIP-seq data sets (STAT5A in T helper cells treated with IL-2, and over-expressed STAT5A in mouse embryonic fibroblasts treated with growth hormone) were downloaded from the GEO with the accession numbers GSE27158 and GSE34986 and analyzed with MACS (version 1.4.1) [9–11]. The bar graph represents the percentage of the GAS motifs that coincided with STAT5 binding (left panel). The distribution of bona fide STAT5 peaks, defined as peaks that contain at least one GAS motif, is shown (middle panel). The bona fide peaks in each cell context were intersected and divided into three groups (right panel).

Notably, STAT5 binding to promoter sequences was not a good predictor that the corresponding gene was under cytokine-STAT5 control [10]. In particular, exposure of cells overexpressing STAT5 to growth hormone resulted in expression changes of only a small fraction of genes whose promoters were bound by STAT5 (Figure 4A). Although the majority of in vivo STAT5 binding to promoter sequences appears to be non-productive and does not result in cytokine-induced transcriptional changes, as measured with Affymetrix arrays, a direct correlation between STAT5 binding and gene expression was obtained for specific genes (Figure 4B). Strong STAT5 binding to two tandem GAS sites in the Cish promoter and to one tandem GAS site in the Socs2 promoter correlated with GH-induced expression of both genes [10].
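The comparison between GH-induced expression and promoter binding can be sketched in a few lines. The example below is illustrative only. It uses the log2 ratio threshold of 1.5 given in the Figure 4 legend, but the function names, gene labels and expression values are all hypothetical and are not taken from the data sets discussed here.

```python
import math


def gh_induced_genes(expr_plus_gh, expr_minus_gh, threshold=1.5):
    """Genes whose log2(+GH / -GH) expression ratio exceeds the threshold."""
    induced = []
    for gene, plus in expr_plus_gh.items():
        minus = expr_minus_gh.get(gene)
        if minus is None or plus <= 0 or minus <= 0:
            continue  # skip genes without a usable pair of signals
        if math.log2(plus / minus) > threshold:
            induced.append(gene)
    return sorted(induced)


def fraction_with_promoter_peak(genes, genes_with_peak):
    """Fraction of genes that carry a bona fide STAT5 peak in the promoter."""
    if not genes:
        return 0.0
    return sum(1 for g in genes if g in genes_with_peak) / len(genes)


if __name__ == "__main__":
    # Made-up signals for three placeholder genes.
    plus = {"geneA": 800.0, "geneB": 640.0, "geneC": 1000.0}
    minus = {"geneA": 100.0, "geneB": 160.0, "geneC": 1000.0}
    induced = gh_induced_genes(plus, minus)
    print(induced, fraction_with_promoter_peak(induced, {"geneA"}))
```

The same two steps run in the opposite direction (start from genes with promoter peaks and ask how many are induced) would address the question of how often binding is productive.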

[Figure 4 panel labels. (A) GH-induced genes (n=126): 36% and 64%, divided into "No bona fide STAT5 peaks in promoter" and "Bona fide STAT5 peaks in promoter". (B) Cish and Socs2 loci: tracks for OE-STAT5 (MEF +GH) and OE-STAT5 (MEF -GH), y-axis 0 to 150, 1-kb scale bars, gene models and GAS motifs marked; motif sequences as printed: TTCctgGAAagTTCttgGA(N85)TTCtagGAAgatgaggcTTCcggGAA and TTCcagGAActTTCcagGAA.]

Figure 4 Examples of the genes regulated by GH through STAT5. (A) Expression of a total of 126 genes in MEFs was induced by GH (log2 ratio of expression signals > 1.5, OE-STAT5 with GH to OE-STAT5 without GH). These genes were correlated with the bona fide STAT5 peaks in their respective promoter region. (B) In MEFs over-expressing STAT5A, the expression levels of Cish and Socs2 were increased more than 4-fold by GH [10]. Upon induction by GH, STAT5 specifically binds to the promoter regions of the Cish and Socs2 genes by recognizing GAS motifs. The promoter regions of both genes harbor more than two strong GAS motifs in proximity.

Conclusion For any given TF, hundreds of thousands of potential binding sites exist in the mammalian genome, which creates a challenge of identifying those of biological significance. Using genome-wide analyses and bioinformatics tools, it is possible to enrich for those sites that might be of functional relevance. For example, more than 580,000 binding sites for the TF STAT5 can be found in the mouse genome, many of which might bear no relevance. Here, we use statistical and bioinformatics tools to more clearly define those GAS sites in the mouse genome that might be of functional relevance (Figure 5). In particular, we analyzed publicly available genome-wide STAT5 ChIP-seq data from two distinct cell types with the goal of identifying GAS sites of biological significance. Approximately 10% of promoter-bound GAS sites were recognized by STAT5 in each individual cell type, and STAT5 occupancy shared between the two cell types was obtained on 2% of the GASs. However, even binding across different cell types did not predict that the respective gene was under cytokine-STAT control. Thus, predicting whether any given gene is under cytokine control through STATs will require the identification of additional biological markers, such as specific histone modifications and polymerase loading. We have not yet reached the point where an elementary student in Mijo with high-speed Internet access can interrogate publicly available genomic data to predict the behavior of any given gene. However, before long, legions of de novo online geneticists will add unprecedented power to unraveling genetic mysteries.

Figure 5 A combination of bioinformatics and genomic approaches followed by an integrative analysis will be needed to further identify those subsets of GAS motifs that mediate biological cues induced by STAT5.

Acknowledgments This work was supported by the Intramural program of the National Institutes of Health (NIDDK).

References
1. Darnell JE. STATs and gene regulation. Science 1997;277:1630–5.
2. Hennighausen L, Robinson GW. Interpretation of cytokine signaling through the transcription factors STAT5A and STAT5B. Genes Dev 2008;22:711–21.
3. Kraus J, Borner C, Hollt V. Distinct palindromic extensions of the 5′-TTC...GAA-3′ motif allow STAT6 binding in vivo. FASEB J 2003;17:304–6.
4. Seidel HM, Milocco LH, Lamb P, Darnell JE, Stein RB, Rosen J. Spacing of palindromic half sites as a determinant of selective STAT (signal transducers and activators of transcription) DNA binding and transcriptional activity. Proc Natl Acad Sci USA 1995;92:3041–5.
5. Kim JD, Hinz AK, Bergmann A, Huang JM, Ovcharenko I, Stubbs L, Kim J. Identification of clustered YY1 binding sites in imprinting control regions. Genome Res 2006;16:901–11.
6. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K. High-resolution profiling of histone methylations in the human genome. Cell 2007;129:823–37.
7. Kang K, Kim J, Chung JH, Lee D. Decoding the genome with an integrative analysis tool: combinatorial CRM Decoder. Nucleic Acids Res 2011;39:e116.
8. Negre N, Brown CD, Ma L, Bristow CA, Miller SW, Wagner U, Kheradpour P, Eaton ML, Loriaux P, Sealfon R, Li Z, Ishii H, Spokony RF, Chen J, Hwang L, Cheng C, Auburn RP, Davis MB, Domanus M, Shah PK, Morrison CA, Zieba J, Suchy S, Senderowicz L, Victorsen A, Bild NA, Grundstad AJ, Hanley D, MacAlpine DM, Mannervik M, Venken K, Bellen H, White R, Gerstein M, Russell S, Grossman RL, Ren B, Posakony JW, Kellis M, White KP. A cis-regulatory map of the Drosophila genome. Nature 2011;471:527–31.
9. Liao W, Lin JX, Wang L, Li P, Leonard WJ. Modulation of cytokine receptors by IL-2 broadly regulates differentiation into helper T cell lineages. Nat Immunol 2011;12:551–9.
10. Zhu BM, Kang K, Yu JH, Chen W, Smith HE, Lee D, Sun HW, Wei L, Hennighausen L. Genome-wide analyses reveal the extent of opportunistic STAT5 binding that does not yield transcriptional activation of neighboring genes. Nucleic Acids Res 2012. DOI:10.1093/nar/gks056.
11. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS. Model-based analysis of ChIP-Seq (MACS). Genome Biol 2008;9:R137.
12. Hennighausen L, Robinson GW. Information networks in the mammary gland. Nat Rev Mol Cell Biol 2005;6:715–25.

