Author's Accepted Manuscript An intricate network of conserved DNA upstream motifs and associated transcription factors regulate the expression of Uromodulin gene Rajneesh Srivastava, Radmila Micanovic, Tarek M. El-Achkar, S.C. Janga
PII: DOI: Reference:
S0022-5347(14)00353-X 10.1016/j.juro.2014.02.095 JURO 11148
To appear in: The Journal of Urology Accepted Date: 21 February 2014 Please cite this article as: Srivastava R, Micanovic R, El-Achkar TM, Janga SC, An intricate network of conserved DNA upstream motifs and associated transcription factors regulate the expression of Uromodulin gene, The Journal of Urology® (2014), doi: 10.1016/j.juro.2014.02.095. DISCLAIMER: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our subscribers we are providing this early version of the article. The paper will be copy edited and typeset, and proof will be reviewed before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to The Journal pertain. All press releases and the articles they feature are under strict embargo until uncorrected proof of the article becomes available online. We will provide journalists and editors with full-text copies of the articles in question prior to the embargo date so that stories can be adequately researched and written. The standard embargo time is 12:01 AM ET on that date.
ACCEPTED MANUSCRIPT
MANUSCRIPT
RI PT
An intricate network of conserved DNA upstream motifs and associated transcription factors regulate the expression of Uromodulin gene Rajneesh Srivastava1, Radmila Micanovic2, Tarek M. El-Achkar2,*, Sarath Chandra
SC
Janga1,3,4,* 1
M AN U
School of Informatics, Indiana University Purdue University, 719 Indiana Ave Ste 319, Walker Plaza Building,
Indianapolis, Indiana 46202 2
Division of Nephrology, Department of Medicine, School of Medicine, Indiana University, 950 West Walnut Street,
Indianapolis, Indiana, 46202 and the Roudebush VA Medical Center, Indianapolis 3
TE D
Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 5021 Health
Information and Translational Sciences (HITS), 410 West 10th Street, Indianapolis, Indiana, 46202 4
EP
Department of Medical and Molecular Genetics, Indiana University School of Medicine, Medical Research and Library
Building, 975 West Walnut Street, Indianapolis, Indiana, 46202
AC C
* Correspondence can be addressed to : Sarath Chandra Janga (
[email protected]) or Tarek M. El-Achkar (
[email protected])
Running Head:
A biocomputational study of UMOD upstream regulatory sequences.
Key words: Gene regulation, Expression, Transcription factors, Acute Kidney Injury, Uromodulin
1
ACCEPTED MANUSCRIPT
MANUSCRIPT
Abstract
RI PT
Purpose Uromodulin is a kidney-specific glycoprotein whose expression can modulate the homeostasis of the kidney. However the set of sequence-specific transcription factors (TFs) which regulate the Uromodulin gene (UMOD) and their upstream binding locations are not well characterized. Our study aims to build a high resolution map of its transcriptional regulation.
SC
Methods
M AN U
We applied in silico phylogenetic foot-printing on the upstream regulatory regions of a diverse set of human UMOD orthologs, for the identification of conserved binding motifs (BMo) and corresponding position specific weight matrices. We further analyzed the predicted BMo by motif comparison, which identified TFs likely to bind these discovered motifs. Predicted TFs were then integrated with experimentally known protein-protein interactions available from public databases and tissue-specific expression resources to delineate the important regulators controlling the expression of UMOD.
TE D
Results
AC C
EP
Our analysis allowed the identification of a reliable set of BMo in the upstream regulatory regions of UMOD to build a high confidence compendium of TFs such as GATA3, HNF1B, SP1, SMAD3, RUNX2 and KLF4 that could bind these motifs. ENCODE DNase I hypersensitivity sites in UMOD upstream region of mouse kidney confirmed some of these BMo to be opened to binding by predicted TFs. TF-TF network revealed several highly connected TFs such as SP1, SP3, TP53, POU2F1, RARB, RARA and RXRA as well as the likely protein complexes formed between them. Expression levels of these TFs in kidney suggest their central role in controlling UMOD expression.
Conclusions
Our findings will form a roadmap for understanding the regulation of Uromodulin expression in health and disease. (Words: 248)
2
ACCEPTED MANUSCRIPT
MANUSCRIPT
Introduction
M AN U
SC
RI PT
Uromodulin, also called THP (Tamm–Horsfall protein), is the most abundant protein excreted in the urine under physiological conditions. It is highly produced in the kidney and secreted into the urine via proteolysis of its GPI anchored domain1. It has also been reported in blood as a secretory product in circulation 2. Although the biological function of Uromodulin had been elusive for many years, the last decade witnessed significant advancements in understanding the function of this protein in health and disease2, 3. In fact, uromodulin is now thought to facilitate electrolyte transport across thick ascending limb4, modulate the inflammatory response during kidney injury 5, inhibit stone formation 6 and protect the bladder from invasive ascending infections 7. Interestingly, the correlation of rate of uromodulin excretion with disease models is not well established 3, although recent studies suggest that uromodulin expression and excretion increases in injury states and may serve as biomarker for developing chronic kidney disease 8. Because of its protective function in vivo during injury, we previously suggested that this increase in excretion is reactive 3, but the factors that control uromodulin expression are not well studied. Indeed, understanding the complex regulatory mechanisms that regulate the Uromodulin gene is essential to advancing this field.
TE D
Several studies have reported that any variation or mutation occurring in this gene directly or indirectly is linked to kidney disorders like glomerulocystic kidney disease, medullary cystic kidney disease type 2 (ADMCKD2), familial juvenile hyperuricemic nephropathy disease (FJHN) etc. with the autosomal dominant tubulointerstitial kidney diseases collectively known as uromodulin-associated kidney diseases (UAKD) 9. Many SNPs on UMOD are linked to chronic kidney disease (CKD) by Genome-wide association studies in the general population2.
EP
Transcription factors (TFs) are known to bind specifically to gene’s promoters’ at regulatory positions (binding motifs) and thus contribute to cell identity, physiology and cell development. Various in vitro 10, in vivo 11 and in silico 12 approaches have been developed so far for the regulatory motif discovery of genes. Typically, potential TF binds to its high affinity binding site however little is known about the tissue specific binding patterns (binding patterns are represented as weight matrices) of most TFs in higher eukaryotes13.
AC C
In this study, we used the upstream regulatory regions of human UMOD orthologs from a diverse set of 8 primates and 7 rodents to perform phylogenetic foot-printing 14 by employing the MEME-SUITE of tools (http://meme.nbcr.net/meme/intro.html) , which allowed the identification of high confident conserved binding motifs and corresponding position specific weight matrices. We also tested the feasibility (i.e TF binding tendency) of BMo in open chromatin region of mouse, using ENCODE DNAse Hypersensitive sites in UMOD upstream region. We further analyzed the predicted binding motifs by a motif comparison tool from MEME-SUITE, TOMTOM- which compares discovered motifs with currently annotated motifs, to identify transcription factors, which have a high tendency and specificity to bind to these discovered motifs. Predicted TFs were integrated with existing protein-protein interaction databases like BioGRID (http://thebiogrid.org/) and tissue-specific expression information available in public databases for specific TFs were utilized to delineate the important regulators and the network of interactors controlling the expression of UMOD in kidney.
3
ACCEPTED MANUSCRIPT
MANUSCRIPT
Materials and Methods Summary:
Results and Discussion
M AN U
SC
RI PT
We analyzed the RNA-seq data for all available transcripts of UMOD gene released in Human Body Map (HBM) project to profile their expression across tissues. Human-UMOD orthologs and their upstream regulatory regions were extracted (fasta sequences) from ENSEMBL. These UMOD sequences from human and its orthologs (8 Primates, 7 Rodents) were analyzed using MEME-SUITE (http://meme.nbcr.net/meme/intro.html). Novel regulatory motifs were predicted by using in-silico phylogenetic footprinting to discover consensus sequences in upstream regions using MEME. These consensus sequences were further analyzed using TOMTOM tool which enables the comparison of predicted motifs with PWM’s of TFs for overlap. Further, protein interactions network was constructed between the potential TFs by utilizing the available physical interactions in BioGRID and tissue-specific expression information from three nephron segments available for specific TFs in a rat transcriptomic database (http://helixweb.nih.gov/ESBL/Database/Transcriptomic/index.html) and in HBM for human equivalents across tissues, were utilized to delineate the important regulators and the network of interactors controlling the expression of UMOD in kidney. Complete description of methods is available from http://www.iupui.edu/~sysbio/UMOD/.
TE D
Uromodulin is one of the most abundant proteins in urine. Although a total of 15 transcripts are currently annotated for UMOD gene in the Ensembl Database (Table 1), major transcript forms of UMOD are expressed exclusively in the kidney and hence UMOD is likely to exhibit a specific cis-regulatory signature not prevalent in other non-kidney specific genes.
EP
Identification of potential binding motifs by phylogenetic footprinting the regulatory regions of UMOD across primates and rodents
AC C
Since UMOD was found to be tissue-specifically expressed, we postulated that it’s cis-regulatory signature might be governed by an intricate interplay between TFs whose network might control its expression in a tissue-specific manner. Hence, to uncover the set of binding sites and the TFs controlling the UMOD gene, in this study we performed motif discovery based on phylogenetic alignments of orthologous sequences from a diverse set of eight primates and seven rodents using the human UMOD gene as a reference (see Materials and Methods). Phylogenetic footprinting is a method for the discovery of regulatory elements in a set of orthologous regulatory regions from multiple species. It does so by identifying the best conserved motifs in those orthologous regions 14. Briefly, 2kb upstream sequence of UMOD gene for human and its orthologs were analyzed by MEME (http://meme.nbcr.net/meme/intro.html), an expectation maximization-based motiffinding algorithm, to identify the potential binding sites conserved across the species. Based on the alignments, position-specific weight matrices (PWMs) 15 representing each of the 10 most 4
ACCEPTED MANUSCRIPT
MANUSCRIPT
Distribution of binding motifs for UMOD across species
RI PT
significant motifs enriched across the analyzed sequences were identified. Motif logos 16 corresponding to each of these 10 significantly conserved ones along with the number of occurrences of the motifs across the 16 sequences are shown in Figure 1. As evident from Figure 1, we found that all of these motifs exhibited a frequency of at least eight occurrences among the 16 sequences analyzed, with Motifs 7, 9 and 10 exhibiting the highest number of instances.
AC C
EP
TE D
M AN U
SC
In all cellular systems, DNA-binding transcription factors mediate the activation or repression of gene expression by binding specific regulatory sequences associated with a given target gene. Genes of many eukaryotes display a more complex architecture of associated regulatory elements, which include proximal promoter elements with binding sites for basal transcription factors, and several distal or upstream elements with binding sites for a host of specific transcription factors 17. Several elegant studies on developmentally regulated 18 and immuneresponse genes 19, 20 have revealed an important role for combinatorial interactions between different transcription factors (TFs) in establishing the complex temporal and spatial patterns of gene expression. Hence, increasing evidence now suggests the importance of not only knowing the binding location of a eukaryotic TF 21 but also the complex combinatorial interplay between them, dominant in eukaryotic transcriptional networks 22. Therefore, we have first mapped the locations of the discovered conserved and novel motif sequences across multiple species. Relative positions of the discovered binding sites in the 2kb upstream regulatory sequences across the species organized by phylogenetic distance along with the combined significance of motif co-occurrence is shown as a block diagram (Figure 2a). We discovered a total of five binding motifs in Human-UMOD 2kb upstream sequence - most of them dispersed on the chromosome compared to other species (Figure 2a), possibly suggesting evolutionary divergence of the cis-regulatory signature in humans and other close relatives. In particular, we found that as the evolutionary distance of the species with respect to human increased, the extent of conservation and clustering of the binding sites increased, indicating either the gain of the binding site clusters or the lack of these signals in the 2kb region of some of these species. Our results suggest the possibility of altered wiring of the transcriptional regulatory network controlling UMOD gene across primates and rodents either due to its altered functionality in kidney or due to increased complexity of the genome in some cases. To further address the functional importance of each of the motifs conserved in humans compared to those which were identified in other species but not in humans, we performed functional enrichment of the genes (in the whole genome) containing each of these motifs using GOMO 23 (see Table 2). This analysis indicated that Motifs 2 and 6 associated with the functional theme G-protein coupled receptors and associated signaling were not found in humans while the Motifs 3, 9 and 10 discovered in humans and are well conserved were found to be enriched in genes annotated with signaling, translational control and immune response associated processes, suggesting that posttranscriptional regulatory control and immune response related binding motifs might be highly preserved across the species in UMOD upstream regions.
DHS profile confirming the feasibility of predicted binding motifs.
5
ACCEPTED MANUSCRIPT
MANUSCRIPT
RI PT
DNase I hypersensitive sites (DHSs) are DNAse I enzyme sensitive region of chromatin, where chromatin has lost its condensed structure due to cleavage and is accessible to binding proteins such as TFs 24. We used the DHS data available for adult mouse kidney, generated from University of Washington as part of the ENCODE project (http://genome.ucsc.edu/cgibin/hgTrackUi?hgsid=356659263&c=chr7&g=wgEncodeUwDnase). Our analysis strongly suggested motifs 1, 2, 6 and 8 and associated TFs in 2 kb upstream region of mouse-UMOD to be active and open for transcription factor binding as shown in Figure 2b. It might be noteworthy to mention that our prediction of TF-TF combination (see below) is likely to be feasible, as we found a cluster of 4 motifs coupled with DHS signal.
SC
Prediction of transcriptional regulatory apparatus targeting discovered motifs of UMOD upstream region
AC C
EP
TE D
M AN U
In order to further dissect the regulatory factors that bind the discovered novel regulatory protein binding sites by our phylogenetic foot-printing analysis, we used a motif comparison tool, Tomtom (http://meme.nbcr.net/meme/cgi-bin/tomtom.cgi) , which aligns and compares the already reported PWMs for well-studied TFs available from motif databases with our discovered motifs (see Materials and Methods). Since several TFs have been reported to have overlapping specificity to binding regions 25 we used a p-value threshold as well as multiple motif databases to identify a robust set of potential regulatory factors for each discovered motif. In particular, we used JASPAR_CORE_2009 , Uniprobe_Mouse and Jolma2013 motif databases (See Materials and Methods) for our study and all possible TFs predicted to bind to the discovered motifs are shown in Figure 3a. Best predicted and highly aligned PWMs of TFs for all 10 motifs include GATA 3, HNF 1, SP1, SMAD3 and STAT3. In addition to these key TFs as well as some others which have been documented to play a central role in kidney processes, we also identified several significant and reliable list of transcription factors that could potentially bind to these 10 discovered motifs of UMOD (Figure 3a). Figure 3b shows a subset of these TFs which exhibited statistically significant alignment with the discovered motifs or those with literature evidence in support of their functional role in controlling biological processes that might be directly or indirectly associated to UMOD expression in normal/diseased state. Some of the listed TFs have no functional evidence in human but have been documented in other species. A subset of these listed TFs with functional evidence from literature is discussed at http://www.iupui.edu/~sysbio/UMOD/.
An investigation of all possible regulatory motifs and associated TFs beyond 2 kb upstream sequence of UMOD Transcription factors’ binding specificity depends on the similarity of a target motif sequence to its consensus15, 21. However, the transcriptional regulatory region of a particular gene may not be restricted to the immediate 2kb upstream sequence but rather its cis-regulatory code might be embedded in regions much further upstream24. Indeed, recent genomic analysis suggests that majority of the TF binding sites occur beyond the 2kb region of the gene start 24, 26 with most of them acting as proximal binding sites with respect to TSS (Transcription Start Site) to define the assembly of transcription pre-initiation complex or some at distal site of TSS to define the rate of 6
ACCEPTED MANUSCRIPT
MANUSCRIPT
AC C
EP
TE D
M AN U
SC
RI PT
transcription 27. Despite, 2kb upstream region provides significant information and coverage about the regulatory motifs contributing to the transcriptional control, are not sufficient to encompass all possible motifs involved in regulation of UMOD gene expression. Thus, we extended our footprinting analysis to include the 5 kb upstream sequence of UMOD gene from Human and its orthologs to discover an extended set of 20 most significant novel regulatory binding motifs across the species and their corresponding TFs using the MEME suite of tools (see Materials and Methods). Further DHS signal analysis in this region was performed and active BMo – 1, 2, 3 and 9 were identified. It is important to note that the motif IDs for the 2kb region (Figures 1-3) do not necessarily correlate with the motif IDs annotated for the 5kb region as the motif numbering is arbitrary. It is also worth mentioning that some of the motifs not detected in the 2kb region either due to our stringent thresholds or due to their occurrence in fewer species (with in 2kb) might be identified in this extended analysis. Block diagram in Figure 4a shows the distribution of the 20 motifs across 12 different species analyzed. It is evident from the block diagram that the cluster of binding sites formed by the motifs 3, 1, 2, 9, 7, 12 and 4 which is prevalent in most species is nearly absent in the human genome. We also found that several motif clusters such as those surrounded by motif 5 although are conserved have drifted across the region between species. TF and Transcription Factor Binding Site (TFBS) interaction exist in a co-evolutionary relationship within the eukaryotes28. So we wanted to learn if the discovered motifs can be grouped based on their frequency of occurrence across species to identify potential co-regulatory relationships between motifs. Figure 4b shows hierarchical clustering of the normalized motif occurrence profiles by selecting a representative TF for each motif, identified based on the Tomtom analysis (see Materials and Methods). This analysis suggested that developmental factors like FOX family exhibited a similar abundance profile as the HNF family, while the binding sites of TFs, SMAD3 and Gata3 co-occurred across the studied genomes. Overall this analysis provides higher order evolutionary relationships between motifs across the species based on their abundance, suggesting different motifs might be selectively enriched in various species. Based on Tomtom analysis, we identified a set of predicted TFs which can bind to these 20 discovered motifs. In order to prioritize these associated TFs and to know potential protein complexes that might be responsible for regulation, we integrated the currently available human protein interaction network available from the BioGRID database (http://theBioGRID.org/) to construct a network of physical associations between TFs predicted to be binding to the UMOD regulatory regions (see Materials and Methods). This resulted in a network of 64 TFs with 112 associations, with TFs like SP1, SMAD3, TP53, SP3, RXRA, RARA and SPI1 exhibiting high degree of associations (Figure 4c). This TF-TF interaction network was analyzed and hierarchically clustered using Cluster 3.0 (http://bonsai.hgc.jp/~mdehoon/software/cluster/) using uncentered correlation as distance metric and complete linkage as clustering method to identify potential protein complexes (see Materials and Methods). This resulted in 5 major groups of physically associated TFs - KLF4 (ESRRG, EGR1, ESRRA, ESRRB, GATA1, GATA3, NFYA, NR2E1, HNF4A, SMAD3), POU2F1 (CREB1, HOXB13, RARA), RUNX3 (RUNX2, STAT1), SP3 (SP1, ZBTB7B) and HNF1B (HNF1A, STAT3) were found to form functionally important TF complexes likely contributing to the UMOD gene regulation. TFs in the TF-TF protein interaction network were further investigated in rat transcriptomic database (http://helixweb.nih.gov/ESBL/Database/Transcriptomic/index.html) and HBM (http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-513/) to dissect their expression 7
ACCEPTED MANUSCRIPT
MANUSCRIPT
RI PT
profile across major nephron segments (Figure 4d) and across 16 human tissues respectively. We found that majority of the TFs including SP1, TP53, POU2F1, RARA, RARB and RXRA which had relatively high degree of associations were found to be expressed in thick ascending limb (localized umod synthesis zone) and/or other segments of nephron. Most of these identified TFs were not only active in TAL segment hence are likely to form a dense network of physical associations with other TFs to regulate the expression of UMOD as well as other non-segment specific genes. Some TFs like RUNX2 and POU2F1 (Oct1) which were not identified in the 2kb analysis, were discovered exclusively in this analysis.
Conclusions
AC C
EP
TE D
M AN U
SC
UMOD is the most abundant protein in the kidney and has been implicated in a number of kidney disorders however our understanding of its cis-regulatory map across mammals has been limited to at best a handful of TFs. In this study we use a cross-genomic approach to mine the conserved set of binding sites and predicted the associated TFs which are likely responsible for binding these locations to control the expression of UMOD. Our approach not only revealed several novel binding sites to provide insights into their evolutionary dynamics across primates and rodents but also provided a compendium of TFs expressed in the human kidney which are responsible for controlling UMOD’s expression thus providing a roadmap for characterizing the regulatory architecture of its promoter regions. Current motif comparison tools like Tomtom although are powerful in discovering PWMs most similar to the query sequences thereby aiding the discovery of potential TFs likely binding to the identified motifs, a common problem with these approaches is the inability to distinguish between members of a particular TF’s DNA-binding domain family which might significantly share the binding sites 29, 30. While it is possible to argue that these algorithms might result in false positives, increasing evidence from large-scale analysis suggests that most of the transcription factors with similar binding sequences tend to regulate genes with similar biological functions 25, 29 indicating that several of the TFs with very similar binding affinities might be competing to bind to the target sites to result in the final transcriptional output. Therefore, in an attempt to identify a high confidence list of TFs, we integrated the predicted list of TFs from the 5kb region of UMOD with publicly available curated set of protein-protein interactions to build a TF-TF protein interaction network responsible for controlling UMOD expression. This allowed us to uncover several highly connected TFs such as SP1, TP3, POU2F1, STAT3 and RARA as well as the likely protein complexes formed between them. The significant expression of these TFs in segments of rat nephrons as well as in the human kidney further suggested their central role in controlling UMOD expression.
Disclosure
All the authors declared no competing interests.
Acknowledgements 8
ACCEPTED MANUSCRIPT
MANUSCRIPT
AC C
EP
TE D
M AN U
SC
RI PT
SCJ acknowledges support from the School of Informatics at IUPUI in the form of startup funds. TME-A is supported by a Veterans Affairs merit review award. The authors would also like to thank members of El-Achkar and Janga labs for providing feedback in the course of this study. The authors report no financial or other conflict of interest relevant to the subject of this article.
9
ACCEPTED MANUSCRIPT
MANUSCRIPT
References
6. 7.
8. 9. 10.
11. 12. 13. 14. 15. 16. 17. 18. 19.
20.
RI PT
SC
5.
M AN U
4.
TE D
3.
EP
2.
Santambrogio, S., Cattaneo, A., Bernascone, I. et al.: Urinary uromodulin carries an intact ZP domain generated by a conserved C-terminal proteolytic cleavage. Biochem Biophys Res Commun, 370: 410, 2008 Rampoldi, L., Scolari, F., Amoroso, A. et al.: The rediscovery of uromodulin (Tamm-Horsfall protein): from tubulointerstitial nephropathy to chronic kidney disease. Kidney Int, 80: 338, 2011 El-Achkar, T. M., Wu, X. R.: Uromodulin in kidney injury: an instigator, bystander, or protector? Am J Kidney Dis, 59: 452, 2012 Mutig, K., Kahl, T., Saritas, T. et al.: Activation of the bumetanide-sensitive Na+,K+,2Clcotransporter (NKCC2) is facilitated by Tamm-Horsfall protein in a chloride-sensitive manner. J Biol Chem, 286: 30200, 2011 El-Achkar, T. M., McCracken, R., Rauchman, M. et al.: Tamm-Horsfall protein-deficient thick ascending limbs promote injury to neighboring S3 segments in an MIP-2-dependent mechanism. Am J Physiol Renal Physiol, 300: F999, 2011 Mo, L., Huang, H. Y., Zhu, X. H. et al.: Tamm-Horsfall protein is a critical renal defense factor protecting against calcium oxalate crystal formation. Kidney Int, 66: 1159, 2004 Mo, L., Zhu, X. H., Huang, H. Y. et al.: Ablation of the Tamm-Horsfall protein gene increases susceptibility of mice to bladder colonization by type 1-fimbriated Escherichia coli. Am J Physiol Renal Physiol, 286: F795, 2004 Kottgen, A., Hwang, S. J., Larson, M. G. et al.: Uromodulin levels associate with a common UMOD variant and risk for incident CKD. J Am Soc Nephrol, 21: 337, 2010 Lhotta, K.: Uromodulin and chronic kidney disease. Kidney Blood Press Res, 33: 393, 2010 Yip, K. Y., Cheng, C., Bhardwaj, N. et al.: Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol, 13: R48, 2012 Hesselberth, J. R., Chen, X., Zhang, Z. et al.: Global mapping of protein-DNA interactions in vivo by digital genomic footprinting. Nat Methods, 6: 283, 2009 Klepper, K., Drablos, F.: MotifLab: a tools and data integration workbench for motif discovery and regulatory sequence analysis. BMC Bioinformatics, 14: 9, 2013 Wang, J., Lu, J., Gu, G. et al.: In vitro DNA-binding profile of transcription factors: methods and new insights. J Endocrinol, 210: 15, 2011 Blanchette, M., Tompa, M.: Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Res, 12: 739, 2002 Stormo, G. D.: DNA binding sites: representation and discovery. Bioinformatics, 16: 16, 2000 Crooks, G. E., Hon, G., Chandonia, J. M. et al.: WebLogo: a sequence logo generator. Genome Res, 14: 1188, 2004 Levine, M., Tjian, R.: Transcription regulation and animal diversity. Nature, 424: 147, 2003 Davidson, E. H., Rast, J. P., Oliveri, P. et al.: A genomic regulatory network for development. Science, 295: 1669, 2002 Wathelet, M. G., Lin, C. H., Parekh, B. S. et al.: Virus infection induces the assembly of coordinately activated transcription factors on the IFN-beta enhancer in vivo. Mol Cell, 1: 507, 1998 Britten, R. J., Davidson, E. H.: Gene regulation for higher cells: a theory. Science, 165: 349, 1969
AC C
1.
10
ACCEPTED MANUSCRIPT
MANUSCRIPT
27. 28.
29. 30.
RI PT
26.
SC
25.
M AN U
24.
TE D
23.
EP
22.
Bulyk, M. L.: Computational prediction of transcription-factor binding site locations. Genome Biol, 5: 201, 2003 Kim, J., Choi, M., Kim, J. R. et al.: The co-regulation mechanism of transcription factors in the human gene regulatory network. Nucleic Acids Res, 40: 8849, 2012 Buske, F. A., Boden, M., Bauer, D. C. et al.: Assigning roles to DNA regulatory motifs using comparative genomics. Bioinformatics, 26: 860, 2010 Thurman, R. E., Rynes, E., Humbert, R. et al.: The accessible chromatin landscape of the human genome. Nature, 489: 75, 2012 Jolma, A., Yan, J., Whitington, T. et al.: DNA-binding specificities of human transcription factors. Cell, 152: 327, 2013 Neph, S., Vierstra, J., Stergachis, A. B. et al.: An expansive human regulatory lexicon encoded in transcription factor footprints. Nature, 489: 83, 2012 Wasserman, W. W., Sandelin, A.: Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet, 5: 276, 2004 Raviscioni, M., Gu, P., Sattar, M. et al.: Correlated evolutionary pressure at interacting transcription factors and DNA response elements can guide the rational engineering of DNA binding specificity. J Mol Biol, 350: 402, 2005 Itzkovitz, S., Tlusty, T., Alon, U.: Coding limits on the number of transcription factors. BMC Genomics, 7: 239, 2006 Hallikas, O., Palin, K., Sinjushina, N. et al.: Genome-wide prediction of mammalian enhancers based on analysis of transcription-factor binding affinity. Cell, 124: 47, 2006
AC C
21.
11
ACCEPTED MANUSCRIPT
MANUSCRIPT
Figure legends
RI PT
Figure 1: Identification of potential binding motifs by phylogenetic footprinting of 2kbupstream regulatory regions of UMOD gene. Ten phylogenetically conserved and statistically significant (indicated by e-value) novel motifs with the number of sites contributing to their identification were shown for UMOD 2kb upstream. These motifs were displayed as sequence LOGOs representing position weight matrices of each possible letter code occuring at particular position of motif and its height representing the probability of the letter at that position multiplied by the total information content of the stack in bits.
M AN U
SC
Figure 2: Block diagram showing occurrence of conserved motifs. (a) Location of ten motifs identified and their distribution in 2 kb upstream sequences across human-UMOD & its 15 other primate/rodent orthologous species are shown in the block diagram. The combined best matches of a sequence to a group of motifs were shown by combined p-value. Sequence strand specified as “+” (input sequence was read from left to right) and “-” (input sequence was read on its complementary strand from right to left) with respect to the occurrence of motifs. Coordinates of each motif across species is shown as a sequence scale below the diagram. (b) DNase I hypersensitive region was shown in 2kb upstream region of mouse UMOD using ENCODE project coupled with UCSC browser visualization tool. An overlap of DHS signal was found and shown as blue band over motif 1, 2, 6 and 8 near ~0.25 kb UMOD transcription start site (TSS) in the mouse block diagram.
TE D
Figure 3: TOMTOM analysis results for conserved motifs. A) Transcription factors predicted for 10 consensus sequences (as query motif) by TOMTOM analysis B) Selected set of motif alignments for each of the 10 significant motifs with the matched TF’s PWM (top) and query motif (bottom). Binding specificity of TF (1st mammalian TF hit found) was shown for all 10 regulatory motifs.
AC C
EP
Figure 4: An investigation of all possible regulatory motifs and associated TFs beyond 2 kb upstream sequence of UMOD. a) Distribution of 20 binding motifs across human-UMOD & its 15 primate, rodent orthologous species were shown as a block diagram for 5 kb upstream region. b) Occurrence of each motif across the species were grouped and represented as clustered heatmap. A representative TF name is shown for each of the 20 motifs. c) Protein interaction network between TFs constructed for all possible predicted transcription factors using BioGRID database with TFs identified to have BMo’s in DHS regions shown in red asterisk “*”. d) Normalized signal intensity (with > 0.2 mean expression level) of the mapped TFs in the nephron segments: Thick Ascending Limb(TAL), Proximal Collecting Tubules (PCT), Inner Medullary Collecting Duct (IMCD) of rat kidney, each derived from 3 pairs of publicly available Rat 230 2.0 Affymetrix Expression Arrays are shown in the heat map.
12
AC C
EP
TE D
M AN U
SC
RI PT
ACCEPTED MANUSCRIPT
(a) Name
Combined p-value
Human
6.51e-10
ACCEPTED MANUSCRIPT
Motif Location
2.03e-55
Gibbon
3.03e-102
Monkey
3.09e-97
Gorilla
3.87e-106
Callithrix
5.00e-96
Bush baby
1.13e-31
SC
Orangutan
RI PT
Chimpanzee 1.47e-11
3.90e-31
Rabbit
4.48e-32
Guinea pig
1.77e-30
Squirrel
2.25e-84
Mouse
1.07e-58
Rat
3.17e-59
TE D
Tree shrew
M AN U
Mouse lemur 1.72e-47
(b)
AC C
EP
Kangaroo rat 6.76e-65
chr7 (qF2) qF5 F4
Scale chr7:
Mouse
qF2 7qF1
7qE3
7qE1
7qD3
D2 qD1
7qC
7qB5 7qB4
7qB3
B2 7qB1
7qA3
A2
7qA1
mm9
1 kb 126,624,500
126,624,000
126,623,500
126,623,000
Ensembl mouse - UMOD 2 kb upstream region
Umod Kidney
7qF3
DNaseI Hypersensitivity by Digital DNaseI from ENCODE/University of Washington
1.07e-58 -2 kb
-1.5 kb
-1.0 kb
-0.5 kb
0 kb
AC C
EP
TE D
M AN U
SC
RI PT
ACCEPTED MANUSCRIPT
(a)
EP high
Occurence
AC C
low
Human Orangutan Gibbon Monkey Gorilla Callithrix Bush Baby Rabit Squirrel Mouse Rat
IM CD
PT
L TA
(d)
UMOD EGR1 ELF3 ESRRA GABPA GATA3 HNF1A HNF1B HNF4A IRF3 IRF6 MSX1 NFIC NR2F6 NR4A2 POU2F1 RARA RARB RARG RXRA SOX13 SOX4 SP1 SP4 STAT1 STAT3 TP53 ZBTB7B N o 0. Sig 0 n 0. 0 al 2 0. 5 5 0. 0 7 1. 5 00
(c)
HNF (1) NR4A2 (2) RFX3 (3) ATHB-5 (4) HMG-I/Y (5) OC (6) FOXK1 (7) ESRRB (8) HAT5 (9) MSN4 (10) CUP9 (11) SOX5 (12) SMAD3 (13) ATF4 (14) GATA3 (15) MNB1A (16) THRB (17) NR2F6 (18) HNF4A (19) RREB1 (20)
(b)
TE D
M AN U
SC
RI PT
ACCEPTED MANUSCRIPT
Signal intensity
TAL - Thick Ascending Limb PT - Proximal Tubule IMCD - Inner Medullary Collecting Duct
ACCEPTED MANUSCRIPT
Key of Definitions for Abbreviations UMOD - Uromodulin TSS - Transcription Start Site
RI PT
PWM - Position Weight Matrix BMo - Binding Motifs TFBS - Transcription Factor Binding Site
SC
TFs - Transcription factors
TOMTOM - A tool for comparing a DNA motif to a database of known motifs.
DHS - DNase I hypersensitive site
M AN U
GOMO - Gene Ontology for Motifs
MEME-SUITE - Motif-based sequence analysis tools, open source hub of bioinformatics tools. ENSEMBL - A database, includes Ensembl project produces genome databases for vertebrates and other eukaryotic species, and makes this information freely available online.
TE D
HBM - Human Body Map Database, contains RNASeq data from Illumina’s Human BodyMap 2.0 project freely available online. BioGRID - An online interaction repository for interaction datasets compiled through comprehensive curation efforts, available online.
EP
Rat Transcriptomic database - http://helixweb.nih.gov/ESBL/Database/Transcriptomic/index.html
AC C
ENCODE - Encyclopedia of DNA Elements. This consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). Data are free available to use.