Hum Genet DOI 10.1007/s00439-015-1544-7
Analysis of human upstream open reading frames and impact on gene expression
Yuhua Ye¹ · Yidan Liang¹ · Qiuxia Yu¹ · Lingling Hu¹ · Haoli Li¹ · Zhenhai Zhang²,³,⁴ · Xiangmin Xu¹
Received: 18 December 2014 / Accepted: 16 March 2015 © Springer-Verlag Berlin Heidelberg 2015
Abstract
The upstream open reading frame (uORF) is a post-transcriptional regulatory element in the 5′ untranslated region (5′UTR) that modulates the translation level of the main open reading frame (mORF). Earlier studies showed that disturbed uORF-mediated translational control can result in drastic changes in mORF translation levels, leading to genetic disorders. To date, there has been no systematic investigation of the relationship between variations in patients and uORF status. Here, taking advantage of several datasets, together with gene ontology (GO) annotations and sequence feature analysis, we have examined uORF impacts in human transcripts. GO annotations indicate that uORF-containing genes are enriched in certain categories such as oncogenes and transcription factors. Sequence feature analysis reveals that the uORF is a factor in determining the translation initiation site (TIS) in human transcripts. We show that genes with uORFs have lower protein expression levels than genes without uORFs in multiple human tissues. Moreover, by examining three disease variation databases, we identified, from a total of 3,740,225 variations, uORF-altering mutations that are strongly suspected to be associated with changed levels of gene expression. In experimental validation, we found four mutations with significant effects on protein expression but with only modest changes in transcription levels. These findings will provide researchers studying related diseases with new insights into the importance of known mutations.
Electronic supplementary material The online version of this article (doi:10.1007/s00439-015-1544-7) contains supplementary material, which is available to authorized users.
* Xiangmin Xu
Zhenhai Zhang
¹ Department of Medical Genetics, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China
² State Key Laboratory of Organ Failure Research, Southern Medical University, Guangzhou, Guangdong 510515, China
³ National Clinical Research Center for Kidney Disease, Southern Medical University, Guangzhou, Guangdong 510515, China
⁴ Division of Nephrology, Nanfang Hospital, Southern Medical University, Guangzhou, Guangdong 510515, China
Introduction
Post-transcriptional regulation in eukaryotes is one of the key mechanisms underlying cellular regulation of gene expression (Somers et al. 2013). To date, various regulatory elements of eukaryotic mature mRNAs have been found in the 5′UTR, such as the internal ribosome entry site (IRES), the hairpin and the upstream open reading frame (uORF) (Pichon et al. 2012). The IRES mediates cap-independent translation initiation in eukaryotes, whereas hairpins induce translation repression (Holmes et al. 2014; Jiménez-González et al. 2014). The uORF is another regulatory element, characterized by an ORF whose start codon lies in the 5′UTR, upstream of the translation initiation site (TIS) of the main ORF (Calvo et al. 2009; Wethmar 2014). Earlier studies found that uORFs are present in 30–49% of human and rodent transcripts (Sachs et al. 2006; Wethmar et al. 2010). After recognizing its start codon, a scanning ribosome is likely to translate the uORF and then dissociate from the mRNA, resulting in poor translation of the main open reading frame (mORF) (Morris and Geballe 2000; Barbosa et al. 2013).
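The definition above can be made concrete with a minimal sketch. It is not the in-house pipeline used in this study. It assumes a canonical ATG start codon, the three standard stop codons, and that the in-frame stop may lie either within the 5′UTR or downstream of the TIS (an overlapping uORF). The names `find_uorfs` and `UORF` are hypothetical and introduced only for illustration.

```python
from typing import List, NamedTuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


class UORF(NamedTuple):
    start: int            # 0-based index of the A of the upstream ATG
    end: int              # index one past the last base of the stop codon
    overlaps_morf: bool   # True if the uORF extends beyond the annotated TIS


def find_uorfs(transcript: str, tis: int) -> List[UORF]:
    """Return uORFs whose ATG lies entirely within the 5'UTR.

    `tis` is the 0-based index of the A of the main ORF start codon
    (an assumed input convention for this sketch).
    """
    seq = transcript.upper().replace("U", "T")
    uorfs = []
    # The upstream ATG must end before the TIS, so start <= tis - 3.
    for start in range(0, tis - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by the upstream ATG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                uorfs.append(UORF(start, pos + 3, pos + 3 > tis))
                break
    return uorfs


if __name__ == "__main__":
    # Toy transcript: upstream ATG at index 2, main ATG at index 17.
    print(find_uorfs("GGATGAAATAAGCCACCATGGCTTAA", 17))
```

An upstream ATG with no in-frame stop codon anywhere in the transcript is not reported by this sketch; whether such cases should count as uORFs depends on the definition adopted.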
Therefore, uORF-altering mutations can lead to dramatic changes in mORF expression and result in genetic disorders (Spriggs et al. 2010). Wen et al. (2009) identified a range of defects in the 5′UTR of the gene encoding the human hairless homolog (HR), including the elimination of the start codon of the second uORF, which causes genetic hair loss. A single-nucleotide change in the HBB gene introduces a novel uORF that decreases its translation efficiency (Oner et al. 1991). Calvo et al. (2009) integrated four independent tandem mass spectrometry studies and revealed that, in mouse, genes with uORFs had significantly lower protein expression levels than genes without uORFs.
The objective of this study is to illustrate the prevalence and functions of uORFs and, for the first time, to demonstrate that the uORF is a factor in determining the TIS in human transcripts. Although high-throughput sequencing technology has made the mutation spectrum of any patient accessible, little is known about the relationship between patient variations and alterations in uORF status. We scanned and analyzed three disease variant databases and obtained a list of uORF-altering mutations that are most likely to be associated with disease phenotypes.
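To illustrate what "uORF-altering" can mean for a single-nucleotide variant, the following sketch compares the upstream ATG positions in a 5′UTR before and after a substitution. It is an assumption-laden simplification, not the screening procedure applied to the variant databases. It considers only gain or loss of an upstream ATG, and ignores changes to stop codons, reading frame and Kozak context. The function names and the 0-based coordinate convention are hypothetical.

```python
from typing import Set


def atg_positions(utr5: str) -> Set[int]:
    """0-based start positions of every ATG in the 5'UTR sequence."""
    seq = utr5.upper()
    return {i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"}


def classify_snv(utr5: str, pos: int, alt: str) -> str:
    """Classify a substitution at 0-based `pos` by its effect on upstream ATGs."""
    seq = utr5.upper()
    alt = alt.upper()
    if not 0 <= pos < len(seq) or len(alt) != 1:
        raise ValueError("pos must lie in the 5'UTR and alt must be one base")
    mutated = seq[:pos] + alt + seq[pos + 1:]
    before, after = atg_positions(seq), atg_positions(mutated)
    gained, lost = after - before, before - after
    if gained and lost:
        return "uATG gained and lost"
    if gained:
        return "uATG gained"
    if lost:
        return "uATG lost"
    return "no uATG change"


if __name__ == "__main__":
    print(classify_snv("GGACGAAA", 3, "T"))  # toy 5'UTR, C>T at index 3
```

A fuller classification would also need the downstream sequence, since a new upstream ATG only forms a uORF under a given definition once its in-frame stop codon is located.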
Materials and methods
mRNA sequence consolidation and uORF analyses
We combined the following publicly available datasets: (1) RefGene¹; (2) GenBank human RNA collection²; (3) hg19 human genome³; (4) Homo sapiens gene ontology (GO) annotation file⁴; (5) protein expression matrix in 30 tissues⁵; (6) ClinVar archive of human variations and phenotypes⁶; (7) COSMIC⁷ (Catalogue of Somatic Mutations in Cancer) (Forbes et al. 2014); (8) TCGA⁸ (The Cancer Genome Atlas) somatic mutation data; (9) RNA-Seq data of normal colon tissue from TCGA; and (10) AnimalTFDB⁹ (Animal Transcription Factor Database). All the datasets were processed with in-house scripts written in Perl¹⁰ (v5.16.3) and Python¹¹ (v2.7.6), and statistical analyses were performed in R¹² (v3.0.2).
¹ UCSC human genome annotation database (hg19, GRCh37), http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/.
² Human mRNA collection, ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/.
³ UCSC human genome annotation database (hg19, GRCh37), http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/.
⁴ Gene Ontology Consortium, http://geneontology.org/.
⁵ Human Proteome Map, http://www.humanproteomemap.org/.
First, entries starting with NM (reviewed mRNA) and with a 5′UTR were extracted from RefGene. The extracted records that were also present in the GenBank human RNA collection were selected, and their genomic sequences were retrieved from hg19. In all, 36,019 entries were selected by this procedure. Then, to ensure the integrity of the cross-referenced data, these NM entries with nucleotide sequences were globally aligned to their corresponding GenBank sequences via ClustalW2 (Larkin et al. 2007; Goujon et al. 2010). We found that some entries with the same ID in RefGene and GenBank had totally different sequences. Therefore, we filtered the ClustalW2 results by the following thresholds: (1) Match Bases/RefGene sequence length >0.99; (2) 0.99