Analysis of human upstream open reading frames and impact on gene expression.

Hum Genet DOI 10.1007/s00439-015-1544-7

ORIGINAL INVESTIGATION

Yuhua Ye¹ · Yidan Liang¹ · Qiuxia Yu¹ · Lingling Hu¹ · Haoli Li¹ · Zhenhai Zhang²,³,⁴ · Xiangmin Xu¹

Received: 18 December 2014 / Accepted: 16 March 2015 © Springer-Verlag Berlin Heidelberg 2015

Abstract

The upstream open reading frame (uORF) is a post-transcriptional regulatory element in the 5′ untranslated region (5′UTR) that modulates the translation level of the main open reading frame (mORF). Earlier studies showed that disturbed uORF-mediated translation control can result in drastic changes in the translation level of the mORF, leading to genetic disorders. To date, there has been no systematic investigation into the relationship between variations in patients and uORF status. Here, taking advantage of several datasets, including gene ontology (GO) annotations and sequence feature analysis, we have examined uORF impacts in human transcripts. GO annotations indicate that uORF-containing genes are enriched in certain categories, such as oncogenes and transcription factors. Sequence feature analysis reveals that the uORF is a factor in determining the translation initiation site (TIS) in human transcripts. We show that genes with uORFs have lower protein expression levels than genes without uORFs in multiple human tissues. Moreover, by examining three disease variation databases, we identified uORF-altering mutations from a total of 3,740,225 variations; these mutations are highly suspected to be associated with changed levels of gene expression. In experimental validation, we found four mutations with significant effects on protein expression but only modest changes in transcription levels. These findings will provide researchers working on related diseases with new insights into the importance of known mutations.

Electronic supplementary material The online version of this article (doi:10.1007/s00439-015-1544-7) contains supplementary material, which is available to authorized users.

* Xiangmin Xu
Zhenhai Zhang

1 Department of Medical Genetics, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China
2 State Key Laboratory of Organ Failure Research, Southern Medical University, Guangzhou, Guangdong 510515, China
3 National Clinical Research Center for Kidney Disease, Southern Medical University, Guangzhou, Guangdong 510515, China
4 Division of Nephrology, Nanfang Hospital, Southern Medical University, Guangzhou, Guangdong 510515, China

Introduction

Post-transcriptional regulation in eukaryotes is one of the key mechanisms underlying cellular regulation of gene expression (Somers et al. 2013). To date, various regulatory elements of eukaryotic mature mRNAs have been found in the 5′UTR, such as the internal ribosome entry site (IRES), the hairpin and the upstream open reading frame (uORF) (Pichon et al. 2012). The IRES mediates cap-independent translation initiation in eukaryotes, whereas hairpins induce translation repression (Holmes et al. 2014; Jiménez-González et al. 2014). The uORF is another regulatory element, characterized by an ORF whose start codon lies in the 5′UTR, upstream of the translation initiation site (TIS) (Calvo et al. 2009; Wethmar 2014). Earlier studies found that uORFs are present in 30–49 % of human and rodent transcripts (Sachs et al. 2006; Wethmar et al. 2010). After recognizing its start codon, a scanning ribosome is likely to translate the uORF and then dissociate from the mRNA, resulting in poor translation of the main open reading frame (mORF) (Morris and Geballe 2000; Barbosa et al. 2013).


Therefore, uORF-altering mutations can lead to dramatic changes in mORF expression and result in genetic disorders (Spriggs et al. 2010). Wen et al. (2009) identified a range of defects in the 5′UTR of the gene encoding the human hairless homolog (HR), including elimination of the start codon of the second uORF, which causes genetic hair loss. A single nucleotide change in the HBB gene introduces a novel uORF that decreases its translation efficiency (Oner et al. 1991). Calvo et al. (2009) integrated four independent tandem mass spectrometry studies and revealed that, in mouse, genes with uORFs had significantly lower protein expression levels than genes without uORFs. The objective of this study is to illustrate the prevalence and functions of uORFs and, for the first time, prove that the uORF is a factor in determining the TIS in a human transcript. Although high-throughput sequencing technology has enabled us to access the mutation spectrum of any patient, little is known about the relationship between patients' variations and alterations in uORF status. We scanned and analyzed three disease variant databases and obtained a list of uORF-altering mutations that are most likely to be associated with disease phenotypes.
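The definition used above (an ORF whose start codon lies in the 5′UTR, upstream of the TIS) and the two kinds of uORF-altering change mentioned (loss of an upstream start codon, as in HR, and creation of a new one, as in HBB) can be illustrated with a minimal Python sketch. It is not the authors' in-house pipeline: the function names `find_uorfs` and `uatg_effect`, the restriction to ATG start codons, the three category labels and the toy sequence are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(mrna, cds_start):
    """List ATG-initiated ORFs whose start codon lies entirely in the 5'UTR.

    mrna      -- transcript sequence in the DNA alphabet, 5'->3'
    cds_start -- 0-based index of the annotated main-ORF start codon (TIS)

    Returns (start, end, kind) tuples, where end is the index just past the
    first in-frame stop codon, or None if the transcript ends first.
    The 'kind' labels are an illustrative assumption, not the paper's terms.
    """
    mrna = mrna.upper()
    uorfs = []
    # The whole start codon must sit upstream of the TIS.
    for start in range(max(cds_start - 2, 0)):
        if mrna[start:start + 3] != "ATG":
            continue
        end = None
        # Walk codon by codon in the frame set by this upstream ATG.
        for pos in range(start + 3, len(mrna) - 2, 3):
            if mrna[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                break
        if end is None:
            kind = "no in-frame stop"
        elif end <= cds_start:
            kind = "contained in 5'UTR"
        else:
            kind = "overlaps mORF"
        uorfs.append((start, end, kind))
    return uorfs


def uatg_effect(mrna, cds_start, pos, alt):
    """Report upstream ATGs created or lost by one substitution at index pos.

    Only start codons are compared; changes to uORF stop codons are ignored.
    """
    mutated = mrna[:pos] + alt.upper() + mrna[pos + 1:]
    ref_starts = {s for s, _, _ in find_uorfs(mrna, cds_start)}
    alt_starts = {s for s, _, _ in find_uorfs(mutated, cds_start)}
    return {
        "created": sorted(alt_starts - ref_starts),
        "lost": sorted(ref_starts - alt_starts),
    }


# Toy transcript (hypothetical): GG + uORF "ATG AAA TAG" + CC + mORF "ATG GCC TAA"
toy = "GGATGAAATAGCCATGGCCTAA"
print(find_uorfs(toy, 13))            # [(2, 11, "contained in 5'UTR")]
print(uatg_effect(toy, 13, 4, "C"))   # {'created': [], 'lost': [2]}
```

In the toy transcript, changing the G of the upstream ATG to C removes the only uORF, while a substitution that turns ATA into ATG would be reported under `created`. A real analysis would also need to handle genomic coordinates, strand and splicing, none of which are covered here.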

Materials and methods

mRNA sequence consolidating and uORF analyses

We combined the following publicly available datasets: (1) RefGene [1]; (2) GenBank human RNA collection [2]; (3) hg19 human genome [3]; (4) Homo sapiens gene ontology (GO) annotation file [4]; (5) protein expression matrix in 30 tissues [5]; (6) ClinVar archive of human variations and phenotypes [6]; (7) COSMIC [7] (Catalogue of Somatic Mutations in Cancer) (Forbes et al. 2014); (8) TCGA [8] (The Cancer Genome Atlas) somatic mutation data; (9) RNA-Seq data of normal colon tissue from TCGA; and (10) AnimalTFDB [9] (Animal Transcription Factor Database). All the datasets were processed with in-house scripts written in Perl [10] (v5.16.3) and Python [11] (v2.7.6), and statistical analyses were performed in R [12] (v3.0.2).

[1] UCSC human genome annotation database (hg19, GRCh37), http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/.
[2] Human mRNA collection, ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/.
[3] UCSC human genome annotation database (hg19, GRCh37), http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/.
[4] Gene Ontology Consortium, http://geneontology.org/.
[5] Human Proteome Map, http://www.humanproteomemap.org/.
[6] ClinVar, ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/.

Firstly, entries starting with NM (reviewed mRNA) and with a 5′UTR were extracted from RefGene. The extracted records that were also present in the GenBank human RNA collection were selected, and their genomic sequences were acquired from hg19. In total, 36,019 entries were selected by this procedure. Then, to ensure the integrity of the cross-referenced data, these NM entries with nucleotide sequences were globally aligned to their corresponding GenBank sequences via ClustalW2 (Larkin et al. 2007; Goujon et al. 2010). We found that some entries with the same ID in RefGene and GenBank had totally different sequences. Therefore, we filtered the ClustalW2 results by the following thresholds: (1) Match Bases/RefGene sequence length >0.99; (2) 0.99
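The first selection steps described above can be sketched in Python as follows. This is only a sketch under stated assumptions, not the authors' in-house scripts: it assumes the tab-separated column order of the UCSC `refGene.txt` table (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, ...), it assumes that "with a 5′UTR" means a non-empty gap between the transcript start and the coding start on the transcribed strand, and the function names and example rows are hypothetical. Only threshold (1) is implemented, because the second threshold is cut off in the text.

```python
def has_five_prime_utr(strand, tx_start, tx_end, cds_start, cds_end):
    """True if a coding transcript has a non-empty 5'UTR (genomic coordinates)."""
    if cds_start >= cds_end:          # no coding region annotated
        return False
    if strand == "+":
        return cds_start > tx_start   # UTR lies before the coding start
    return cds_end < tx_end           # minus strand: 5' end is at tx_end


def select_nm_with_utr(lines):
    """Keep names of NM entries that have a 5'UTR.

    Assumes refGene-style rows: bin, name, chrom, strand,
    txStart, txEnd, cdsStart, cdsEnd, ... (tab-separated).
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        name, strand = fields[1], fields[3]
        tx_start, tx_end, cds_start, cds_end = (int(x) for x in fields[4:8])
        if name.startswith("NM_") and has_five_prime_utr(
            strand, tx_start, tx_end, cds_start, cds_end
        ):
            kept.append(name)
    return kept


def passes_match_threshold(match_bases, refgene_len, threshold=0.99):
    """Threshold (1): matched bases / RefGene sequence length > 0.99."""
    return refgene_len > 0 and match_bases / refgene_len > threshold


# Hypothetical rows, for illustration only (not real accessions).
example_rows = [
    "0\tNM_TEST1\tchr1\t+\t100\t900\t150\t800",   # plus strand, has 5'UTR
    "0\tNM_TEST2\tchr1\t+\t100\t900\t100\t800",   # coding starts at tx start
    "0\tNR_TEST3\tchr1\t+\t100\t900\t900\t900",   # not an NM entry
    "0\tNM_TEST4\tchr1\t-\t100\t900\t150\t850",   # minus strand, has 5'UTR
]
print(select_nm_with_utr(example_rows))      # ['NM_TEST1', 'NM_TEST4']
print(passes_match_threshold(995, 1000))     # True
```

The match count passed to `passes_match_threshold` would come from parsing the ClustalW2 alignment, which is not shown here.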
