This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TNB.2016.2516950, IEEE Transactions on NanoBioscience

Genome-wide Search for Translated Upstream Open Reading Frames in Arabidopsis thaliana Qiwen Hu, Catharina Merchante, Anna N. Stepanova, Jose M. Alonso and Steffen Heber  Abstract—Upstream open reading frames (uORFs) are open reading frames that occur within the 5’ UTR of an mRNA. uORFs have been found in many organisms. They play an important role in gene regulation, cell development, and in various metabolic processes. It is believed that translated uORFs reduce the translational efficiency of the main coding region. However, only few uORFs are experimentally characterized. In this paper, we use ribosome footprinting together with a semi-supervised approach based on stacking classification models to identify translated uORFs in Arabidopsis thaliana. Our approach identified 5360 potentially translated uORFs in 2051 genes. GO terms enriched in genes with translated uORFs include catalytic activity, binding, transferase activity, phosphotransferase activity, kinase activity and transcription regulator activity. The reported uORFs occur with a higher frequency in multi-isoform genes, and some uORFs are affected by alternative transcript start sites or alternative splicing events. Association rule mining revealed sequence features associated with the translation status of the uORFs. We hypothesize that uORF translation is a complex process that might be regulated by multiple factors. The identified uORFs are available online at: https://www.dropbox.com/sh/zdutupedxafhly8/AABFsdNR5zDfio zB7B4igFcja?dl=0. This paper is the extended version of our research presented at ISBRA 2015. Index Terms—uORF, translation, ribosome footprinting, semi-supervised learning, stacking, classification, Arabidopsis thaliana.

I. INTRODUCTION

U

PSTREAM open reading frames (uORFs) are open reading frames that appear in the 5’ untranslated region (UTR) of an mRNA. uORFs have been found in many organisms. uORFs have been linked to diseases, and some uORFs play an important role in cell development and gene regulation. It is believed that translated uORFs attenuate translation of the downstream main open reading frame (main ORF) [1-4]. It is estimated that there are more than 20,000 uORFs in Arabidopsis genome, but only few uORFs are experimentally characterized and in most cases it is unknown what biological function they have, and if they are translated [5-7]. Experimental identification of functional uORFs is Q. Hu and S. Heber are with Bioinformatics Research Center, North Carolina State University, 27606 Raleigh, US (e-mail: [email protected], [email protected]). C. Merchante is with Departamento de Biología Moleculary Bioquímica, Universidad de Málaga, 29071 Málaga, Spain (e-mail: [email protected]). AN. Stepanova and JM. Alonso are with Department of Plant and Microbial Biology, North Carolina State University, 27606 Raleigh, US (e-mail: [email protected], [email protected])

time-consuming, and so far, only few uORFs have been directly investigated through forward genetic analysis at the whole plant level [8, 9]. A genome-wide identification of translated uORFs via Mass Spectrometry has been challenging due to the short length of the encoded proteins [10]. Several studies have used evolutionary conservation to predicted functional uORFs [11, 12]. For example, in [12], the authors have developed a BLAST-based algorithm to identify conserved uORFs across various plant species. Their approach resulted in 18 novel uORFs in Arabidopsis thaliana. Unfortunately, conserved uORFs only account for a small part of the uORFs in the Arabidopsis genome - currently, the TAIR database lists only about 70 conserved uORFs which account for less than 1% of total uORFs [10]. The biological function and translation status of most uORFs is still unknown. Recently, ribosome footprinting (RF) has been developed to investigate translation via deep sequencing of ribosome protected mRNA fragments (ribosome footprints) which provide evidence for translated regions [13]. Several studies have used RF information to identify translated open reading frames and translation initiation sites (TISs). For example, Fritsch and colleagues analyzed the coverage of ribosome footprints upstream of annotated TISs and trained a neural network to detect novel TISs. This experiment identified 2994 novel uORFs in the human genome [14]. A similar study in mouse used a support vector machine to learn the distribution of ribosome footprints near TISs and identified 13,454 candidate TISs [15]. In this paper, we used a semi-supervised approach that uses RF data with additional gene information to learn features of translated uORFs in Arabidopsis thaliana. Semi-supervised learning is a machine learning technique that uses unlabeled data in conjunction with few labeled data points for training. If the data follows certain smoothness, cluster, or manifold assumptions, semi-supervised learning can more precisely capture the characteristics of data than a purely supervised learning approach, and as a result it can produce very accurate predictions using only few labeled data points, for a thorough review see [16]. Semi-supervised learning has been applied in many areas such as text mining, disease classification and pattern recognition [17-19]. The approach is most useful in scenarios where labeled instances are expensive and hard to obtain, while unlabeled data is relative easy to collect, similar to the situation in our data. We combine semi-supervised learning with a stacking-based classification model that combines RF data with additional gene information to identify translated uORFs. Stacking combines the predictions of several base-level classifiers by a meta-level classifier to improve

1536-1241 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TNB.2016.2516950, IEEE Transactions on NanoBioscience

predictive accuracy [20, 21]. Our algorithm discovered 5360 translated uORFs that occur in 2051 genes. Analysis of the predicted uORFs shows that the enriched GO-terms of the uORF-containing genes include catalytic activity, binding, transferase activity, phosphotransferase activity, kinase activity and transcription regulator activity and that uORFs are prevalent in multi-isoform genes. Analysis of translation efficiency shows that genes that contain translated uORFs tend to have lower translation efficiency in the main ORF. Association rule mining revealed groups of sequence features that are associated with the translation status of uORFs. Our results suggest that uORF translation is a complex process that may be influenced by multiple factors. This paper is the extended version of our research presented at ISBRA 2015 [22]. II. MATERIAL AND METHODS We used Ribo-seq and accompanying RNA-seq data (NCBI SRA accession number SRP056795) from a study of gene-specific translation regulation mediated by the hormone-signaling molecule EIN2 [23] and developed a semi-supervised learning approach combined with stacked classification to identify translated uORFs in Arabidopsis. Our approach is outlined in the following, a detailed description of the individual steps and semi-supervised learning algorithm is provided in parts A-F. We start with aligning ribosome footprints and the corresponding mRNA reads to the genome sequence of Arabidopsis thaliana. The aligned reads were assigned to predicted uORFs and annotated main coding regions. For each uORF, 12 features were extracted for the subsequent analyses. Then, k-means clustering was used to construct a training dataset. Subsequently, five different base-level classifiers were trained, and a stacking approach was used to combine the results of the base classifiers. Finally, association rule mining was applied to explore sequence features that associate with the translation status of uORFs. A. Data Preparation Ribosome footprints and RNA-seq data were generated from Arabidopsis seedlings. We analyzed over 90 million reads from two biological wildtype replicates. First, we performed quality control and removed adaptor sequences and low quality reads using the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/). The resulting reads were aligned to the genome sequence of Arabidopsis thaliana using Tophat [24]. Reads that mapped to multiple genomic positions, as well as reads with length smaller than 25bp, or larger than 40bp, were discarded. The remaining reads were assigned to transcript regions using custom Perl scripts. Genome and transcript sequences, as well as gene annotation were downloaded from The Arabidopsis Information Resource (TAIR, version 10, http://www.arabidopsis.org/). We generated an exhaustive list of uORFs, where each uORF corresponds to a sequence of start codon (ATG), followed by one or more amino acid coding codons, and a stop codon (TAG, TAA or TGA) in frame. All generated uORFs start in the

5’UTR, but they might extend beyond this transcript region. We excluded uORFs that consist of only a start and a stop codon (but no additional codons) from our analysis, because such uORFs are not considered to be functional [25]. This resulted in 29629 uORFs in 7831 genes. B. Feature Extraction In general, translation of an open reading frame can be divided into 3 stages: initiation, elongation, and termination. Each stage corresponds to a particular ribosome moving pattern [26]. During translation initiation, a ribosome scans through mRNA and starts translation when a translation start site is recognized. The speed of the ribosome slows down, and as a consequence, an accumulation of ribosome footprints before translation start sites of translated open reading frames can be observed [15, 27]. In the elongation step, the ribosome translates the codons three nucleotides at a time. For a well translated region, the amino acid is continuously synthesized by ribosome resulting in an appearance of ribosome footprints in this region without a large break. During termination, the ribosome dissociates at the stop codon. To identify translated uORFs, we generate several features measuring ribosome footprint distribution patterns that are characteristic for the different stages of translation (Figure 1). Our rational is that translated uORFs should display these features accordingly. Open Reading Frame

... Initiation

Den_left Dp De Ds Den_utr

Elongation

Var Den Den_max Den_min l

Distribution of ribosome footprints

Termination

Features Den_right

Fig. 1. Translation stages and features characteristic for the different translation stages.

In addition, several features of functional uORFs are known to impact the translation of downstream main ORFs, for example the length of uORF and the distance between the uORF and its main ORF [28]. We include these features in our model. The following describes the individual features in detail. Denoted u(1), u(2), …, u(l) the sequence positions of an uORF with length l in a transcript t. We assume that the main ORF starts at positions in t. For i=1, ..., l we denote the number of ribosome footprints mapping to u(i) by c(u(i)). We computed 12 features for each possible uORF: 1) Distance from uORF start to the start of main ORF: Ds = s u(1). 2) Distance from uORF end to main ORF: De = s - u(l). 3) Length of uORF: l. 4) Distance from uORF start to the nearest peak of the ribosome density curve: Dp. Assume the number of ribosome footprints aligned to the positions of a gene with length n were counted as (c(1), c(2), …, c(n)). We use a kernel smoother to estimate the

1536-1241 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TNB.2016.2516950, IEEE Transactions on NanoBioscience

ribosome density curve: 𝑛

𝑓𝑘 (𝑥) =

1 𝐾ℎ (𝑥 − 𝑐(𝑖)) ∑ , 𝑛ℎ ℎ 𝑖=1

where Kh is the kernel function and h is the smoothing parameter (bandwidth). We used the R function density with kernel function Gaussian and bandwidth 5. Peaks p(1), p(2), …, p(k) of the density curve indicate positions where ribosome footprints accumulate. We have Dp = min( | p(i)-u(1) | ), i = 1, 2, …, k. 5) Ribosome footprint density of a uORF: Den, with ∑𝑢(𝑙) 𝑖=𝑢(1) 𝑐(𝑖) 𝐷𝑒𝑛 = . 𝑙 6) Maximum local ribosome footprint density of uORF: Den_max. The local ribosome footprint density is calculated using a sliding window of size 3 along the uORF region. Den_max is the maximum value of the resulting local ribosome densities. 7) Minimum local ribosome footprint density of uORF: Den_min. The value is computed analogously to Den_max. 8) Ribosome footprint density for the region left of uORF: Den_left. The ribosome footprint density upstream of uORF indicates ribosome loading before start codon. We chose 15 base pairs as it is about the half length of ribosome in Arabidopsis. We have ∑𝑢(1) 𝑖=𝑢(1)−15 𝑐(𝑖) 𝐷𝑒𝑛_𝑙𝑒𝑓𝑡 = . 15 9) Ribosome footprint density for the region right of uORF: Den_right, with ∑𝑢(𝑙)+15 𝑖=𝑢(𝑙) 𝑐(𝑖) 𝐷𝑒𝑛_𝑟𝑖𝑔ℎ𝑡 = . 15 10) Variance of ribosome footprints distribution along uORF region: Var, with 𝑢(𝑙)

1 𝑉𝑎𝑟 = ∑ (𝑐(𝑖) − 𝜇)2 , 𝑙 𝑖=𝑢(1)

where µ is the mean value of c(i) in the uORF region. 11) Ribosome density of UTR region: Den_utr. Assuming the UTR region extends from position a to position b on a transcript, we have 𝑏

𝐷𝑒𝑛_𝑢𝑡𝑟 = ∑ 𝑖=𝑎

𝑐(𝑖) . 𝑏−𝑎+1

12) Ribosome density of CDS: Den_cds. The value is computed analogously to Den_utr. C. Semi-supervised Learning: Clustering based Classification Translated transcript regions show a characteristic ribosome footprint distribution pattern. Unfortunately, there were only 8 experimentally verified translated uORFs that are well expressed in our samples. Since these experimentally characterized uORFs showed the expected ribosome footprint distribution pattern and occurred in a distinct cluster of selected, high-quality uORFs we used a clustering based semi-supervised approach, similar to the text classification algorithm described in [17], to identify translated uORFs in our

data. Our approach consists of two steps: 1. Clustering step: we cluster selected unlabeled data points under the guidance of labeled data to obtain an expanded training set. The details of the approach are described in part D of the Material and Methods Section, and part A of the Results Section. 2. Classification step: we use the generated training data to train a stacked classifier. The details of the approach are described in part E of the Material and Methods Section. The algorithm is given below. Algorithm: Clustering-based Classification Input: labeled and unlabeled uORF datasets Output: translation status of uORFs Clustering step: 1: Combine labeled data with selected unlabeled data points 2: Perform k-means clustering on combined data 3: Select clustering (k) based on average silhouette value 4: Assign class label based on clustering results 5: Select training data D Classification step: 6: Train base-level classifiers using D 7: Combine outputs of base-level classifiers with features of D 8: Train meta-level classifier MC using combined dataset 9: Stacked classification using base-level classifier and MC to determine the translation status of uORFs

D. Clustering and Training Set Construction We performed k-means clustering to identify groups of similar uORFs. The resulting clusters were characterized with respect to their translation behavior. A detailed description is given in Section III part A. The training set is constructed based on our clustering results. The positive class (translated uORFs) is chosen from cluster 2.1 (see Fig.3). The uORFs in this cluster show the characteristic distribution of ribosome footprints in a canonically translated uORF. There are 76 uORFs in this cluster. The negative class is randomly selected from clusters 1 and 2.2. The uORFs in these clusters exhibit a small ribosome footprint density, or ribosome footprints accumulate far from the translation start site. We selected 76 records from both classes, resulting in a training set with 152 records. E. Stacking Classification Our classification model is developed based on the positive and negative classes described above. We use five base-level classification algorithms: a k-nearest neighbor classifier, a support vector machine (SVM), a decision tree, a Naïve Bayes classifier, and a neural network. We have chosen these five algorithms because they used the different classification strategies. Each classifier was tuned and the model with lowest error rate was used. The performance of the different classifiers was evaluated by a leave-one-out cross validation. Stacking is a method that combines the predictions of several base-level classifiers by a meta-level classifier in order to improve predictive accuracy. More precisely, following the description in [20, 21]: given a training data set D and learning algorithms L1, L2, …, Lt, stacking first uses D to learn a set of

1536-1241 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TNB.2016.2516950, IEEE Transactions on NanoBioscience

base-level classifiers BC1, BC2, …, BCt, with BCi = Li(D) for i=1, ..., t. Subsequently, using the predictions of BC1, BC2, …, BCt, a meta-level classifier MC is learned. Correspondingly, during the prediction step, data features are first used by the base-level classifiers. Then, the resulting base-level predictions are used by the meta-level classifier to generate the final prediction. It has been shown that stacking can combine the expertise of different base-level classifiers while reducing their bias [20, 21]. Here, we apply stacking with five different base-level classifiers to the features from the uORF regions in the training set. First, we use our training set to train the base-level classifier. We record the results of the base-level classifiers and use them together with the extracted features of the training set to generate a meta-level k-nearest neighbor classifier. We use leave-one-out cross validation to evaluate the performance of our approach. Suppose our training dataset D consists of N records D(1), ..., D(N). For each record D(i), i=1, ..., N, we train the base-level and meta-level classifiers using D - D(i), and we evaluate their performance using D(i).

constrain the left-hand-side (X) of r to sequence features, and the right hand side (Y) to the uORF translation status. Three parameters were used to select relevant rules: support, confidence, and lift. Support of rule r:X→Y corresponds to the

F. Association Rule Mining Association rule mining is a powerful tool to discover interesting patterns - for a detailed description we refer the reader to [29]. We apply association rule mining to find groups of sequence features associated with the translation status of uORFs. We computed for each uORF the following sequence features, numerical features were discretized into 5 groups (small, medium_small, medium, medium_large, and large) using equal frequencies:  Length of 5'-utr/cds/3'-utr for the gene that contains the uORF (futr/cds/tutr).  Order of uORF, indicating the relative position of each individual uORF along the 5'-UTR (order).  Reading frame of uORF with respect to the main ORF (rf: 0/1/2).  Number of uORFs that occur in the gene (no_of_uorf).  Minimum free energy of the uORF region (mfe). The minimum free energy was computed using the ViennaRNA package [30].  Occurrence of A or G at position -3 before the start codon (a_g_3: 0 absence, 1 presence).  Occurrence of G at position +4 after the start codon (g_4: 0 absence, 1 presence).  Instability index for uORF peptides (ins: stable or instable) [31]. Peptides with instability index smaller than 40 are predicted as stable.  Potential protein interaction index for uORF peptide (pii: bp or nbp) [32]. Proteins with potential protein interaction index higher than 2.48 are considered as high binding potential (bp).  Codon adaptation index for the uORF region (cai) [33].

A. Cluster Analysis To identify groups of similar uORFs, we performed a cluster analysis using k-means clustering and Euclidean distance. We restricted our analysis to well-expressed genes that contain one single uORF in their 5’ UTRs to avoid ambiguities that might occur if different translated and untranslated uORFs appear in close proximity on the same transcript. To determine a suitable number of clusters k, we used the average silhouette value [34]. The silhouette value measures the fit of a data point within its cluster in comparison with neighboring clusters. Silhouette values are in the range of -1 to 1. A silhouette value close to 1 indicates that a data point is in an appropriate cluster, while a silhouette value close to -1 indicates that it might be erroneously assigned. Fig. 2 shows the average silhouette value for different numbers of clusters k. The average silhouette values for k=2, ..., 6 clusters are similar and clearly larger than average silhouette values for k>6.

For each uORF, we record the corresponding set of sequence features and the translation status (translated or untranslated) of the uORF as items in a transaction t. The resulting transactions form our transaction database T = {t1, t2, …, tm}. An association rule r:X→Y describes an implication between the items of X and Y, where X and Y are disjoint itemsets. In our analysis, we

proportion of transactions in T that contain X∪Y. Confidence measures the proportion of transactions that contain X ∪ Y among the transactions that contain X. Lift is calculated as the support count of X∪Y divided by the product of support count X and support count of Y. The lift value measures the departure from a random model; a lift value larger than 1 indicates a positive association between X and Y. We used the Apriori algorithm [27] implemented in the R package arules (http://cran.r-project.org/web/packages/arule/index.html,[26]) to find rules that meet our thresholds for support (0.02) and confidence (0.8). The resulting rules are sorted according to the lift value and listed in table V. III. RESULT

Fig. 2. Average silhouette value for different numbers of clusters k.

Fig. 3A shows the evolution of clusters fork=2, ..., 6. For k=2, we have 2 clusters: cluster 1 and cluster 2. For k=3, cluster 2 is further divided into cluster 2.1 and cluster 2.2. When k is

1536-1241 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TNB.2016.2516950, IEEE Transactions on NanoBioscience

further increased to 6, the clusters 2.1 and 2.2 remain unchanged, while cluster 1 is divided into multiple clusters. To identify translated uORFs, we have focused on ribosome footprint density in the uORF region (Den) and the distance from uORF start to the nearest peak of the density curve (Dp). These two features are the most important features that determine the translation status of uORFs [15, 27]. Fig. 3B shows a scatter plot of Den and Dp superimposed with the point density as background. The Figure shows three clusters that coincide with the cluster reported by 3-means clustering: cluster 1 (diamond-shaped points at the bottom) consists of uORFs for which only few ribosomal footprints have been detected, indicating that uORFs from this cluster are not translated. In contrast, the uORFs of the other 2 clusters show ample ribosomal footprints. Cluster 2.1 (circular-shaped points) shows small Dp and large Dens values while cluster 2.2 (triangular-shaped points) shows large Dp and smaller Dens values. During k-means clustering with k=4, ..., 6, only cluster 1 is subdivided. Since this cluster does not provide additional information for the translated uORFs, we decided to choose k=3 for our subsequent analyses. A

Cluster 1.1.4

Cluster 1.2

Cluster 1

Cluster 1.1.3

Cluster 1 Cluster 1.1

Cluster 1.1.2

Cluster 1.1.1

Cluster 0

Cluster 2.2

Cluster 2.2

Cluster 2.2

Cluster 2 Cluster 2.2

K=6

Cluster 2.1

K=4

K=3

K=2

K=1

Cluster 2.1

B

Fig. 3. Visualization of clusters generated by k-means clustering. A. Evolution of clusters according to different values of k. B. Scatter plot of ribosome density and distance between the start of uORF and nearest peak of ribosome density curve. Different shape of points indicates different clusters identified by 3-means clustering. Cluster 1: diamond-shaped points. Cluster 2.1: circular points. Cluster 2.2: triangular points. The background color indicates the

density of points. A darker color indicates a higher point density. The reads coverage plots above show uORFs from cluster 2.1 and cluster 2.2.

To learn the characteristics of the uORFs in the different clusters, we analyzed their features. There are no ribosome footprints in the uORFs from cluster 1, and we hypothesize that the uORFs in this group are not translated. The group accounts for about 65% of the total uORFs in our dataset. The uORFs from cluster 2.1 and 2.2 have a positive footprint density Den, but we observed significant differences in the variable Dp. Dp is significantly larger in cluster 2.2 indicating that fewer ribosomal footprints accumulate at the start codon of the corresponding uORFs. This is inconsistent with translated open reading frames [10,15]. In addition, we analyzed: 1) Experimentally verified uORFs: there are two experimentally verified uORFs whose genes are well expressed and translated in the dataset. Both uORFs belong to cluster 2.1. 2) GO terms of uORF-containing genes: we used AgriGO [35] to identify overrepresented GO terms with significance level 0.05. Genes in cluster 2.1 show terms such as biological regulation (GO:0065007), metabolic process (GO:0008152), and cellular process (GO:0009987). This is consistent with the GO term annotation of currently known uORF-containing genes [5, 6, 10, 36]. We did not find significantly overrepresented GO terms for cluster 1 and 2.2. 3) Local minimum coverage of ribosome footprints in a uORF region (Den_min): A Den_min value larger than 0 indicates the continuous translation of ribosomes in the region. Ideally, a well translated uORF should show Den_min=0 for only a small fraction of its length. We found a statistically significant difference of this value between cluster 2.1 and cluster 2.2. For cluster 2.1, about 20% uORFs have Den_min=0, whereas in cluster 2.2 we have 74%. 4) Ribosome footprint density in the first 6bp region immediately after the start codon: we checked the footprint density immediately after the start codon. Our analysis indicates that ~70% of uORFs in cluster 2.2 do not show any footprints in this region. In contrast, all uORFs in cluster 2.1 show a non-zero footprint density in this region. Based on these observations we chose the uORFs in cluster 2.1 as templates for translated uORFs and used them as positive class to train our classifiers. After inspecting representative examples from cluster 2.2 (see reads coverage plot in Fig. 2B), we hypothesize that some of the uORFs in this cluster might use a different, non-canonical translation start site. B. Performance Evaluation To compare the performance of our base-level classifiers, we compared the data points of the training set that are identified correctly by a specific algorithm using leave-one-out cross validation. We found a large overlap between algorithms, however, each classifier also detects a certain proportion of the data which is not detected by the other algorithms (Table I). The results suggest that each classifier has its own expertise for classifying uORFs correctly, thus, we apply stacking to combine the expertise for each classifier in order to get more accurate predictions.

1536-1241 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TNB.2016.2516950, IEEE Transactions on NanoBioscience

To further evaluate the performance of our classifiers, we performed leave-one-out cross validation and calculated accuracy, precision, recall and f-score. Receiver Operating Characteristic Curve (ROC curve) was also used to assess the overall performance [37]. The area under the curve (AUC) indicates the performance of a classifier. The larger the area, the better is the performance of a classifier. Accuracy, precision, recall and f-score are defined as follows. Denote TP, FN, TN and FP is the number of true positives, false negatives, true negatives and false positives. We have Accuracy = (TP+TN)/(TP+FN+TN+FP), Precision = TP/(TP+FP), Recall = TP/(TP+FN), F-score = 2TP/(2TP+FP+FN). To demonstrate the power of our stacking approach, we compared the performance of stacking with the performance of the individual base classifiers (Table II, Fig.4). The result shows stacking outperforms the underlying base-level classifiers for all values except for recall which indicates stacking has very good prediction accuracy. TABLE I CORRECTLY IDENTIFIED DATA POINTS BY BASE-LEVEL CLASSIFIERS SVM

DT

NB

NN

KNN

SVM

115

97%

78%

97%

95%

DT

82%

136

71%

90%

82%

NB

83%

91%

105

90%

82%

NN

83%

94%

72%

130

82%

KNN

91%

97%

75%

93%

115

SVM: support vector machine, DT: decision tree, NB: Naïve Bayes, NN: neural network, KNN: k-nearest neighbor. The numbers on the main diagonal in the table indicate the total number of correctly classified data points for the individual classifier. Off-diagonal elements denote the pairwise overlap of correctly classified data points between pairs of classifiers. TABLE II OVERALL CLASSIFIER PERFORMANCE

Algorithms

Accuracy

Precision

Recall

F-score

AUC

KNN

0.76

0.63

0.84

0.72

0.85

SVM

0.76

0.74

0.77

0.75

0.85

DT

0.89

0.84

0.94

0.89

0.86

NB

0.69

0.54

0.77

0.64

0.86

NN

0.86

0.87

0.85

0.86

0.92

Stacking with KNN

0.90

0.95

0.87

0.91

0.94

Fig. 4. ROC curves for different classifiers. SVM support vector machine, DT decision tree, NB Naïve Bayes, NN neural network, KNN k-nearest neighbor, ST stacking with KNN.

We used our model to identify translated uORFs, and compared our predictions to the results from previous studies. We collected experimentally characterized uORFs from current research papers [5, 6, 8, 38-43], untranslated uORFs according to the experiment, and downloaded the conserved uORFs annotated by TAIR. The results are shown in Table III. For the genes expressed and translated in our experiment, our method detects ~75% of the experimentally verified uORFs. Similarly, most of the conserved uORFs identified through conservation search among multiple-species are identified by our approach. These results suggest that conserved uORFs are often translated. TABLE III UORF FROM CURRENT STUDIES

Total

Expressed /translated in our experiment

uORF identified by stacking

TP

FP

EC

19

12

9

75.0%

8%

CD

69

45

33

73.3%

9%

Category of uORF

EC: Experimentally characterized; CD: conserved. True positive rate = TP/(TP+FN), False postive rate = FP/(FP+TN). TP: true positive; FP: False positive

C. Translated uORFs in Arabidopsis thaliana Our stacking approach identified 5360 translated uORFs. The identified uORFs occur in 2051 genes, which account for about 6% of the all annotated genes in Arabidopsis thaliana. Likely, this number is under-estimation since about 30% of uORF-containing genes were not transcribed in our experiment. Among the genes that contain translated uORFs, the majority (~58%) contain only one single uORF. However, there are few genes that include up to 30 uORFs. To further characterize genes that contain translated uORFs, we performed a GO-term enrichment analysis using AgriGO [35]. Enriched GOterms with significance level 0.05 include catalytic activity, binding,

1536-1241 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TNB.2016.2516950, IEEE Transactions on NanoBioscience

transferase activity, phosphotransferase activity, kinase activity and transcription regulator activity (Fig.5, A, B).

Remarkably, the majority of translated uORFs occur in multi-isoform genes. When comparing single-and multi-isoform genes with respect to the occurrence of translated uORFs, we found a significant difference (p-value < 2.2e-16, Fisher’s exact test): only 3.4% of the single-isoform genes contain translated uORFs, whereas about 19% of the multi-isoform genes contain translated uORFs (Fig 5C). About 15% of these uORFs do not occur in all transcripts generated by the multi-isoform gene. See Table IV and Fig.5C for a detailed breakdown. TABLE IV TRANSLATED UORFS IDENTIFIED IN ARABIDOPSIS

Category

Translated uORFs

Gene with translated uORFs

Percentage of all genes

All

5360

2051

6.10%

3783

1121

3.34%

580

293

0.87%

1577

930

2.77%

In multi-isoform genes Affected by AS events In single-isoform genes

There are 27717 single-isoform genes and 5885 multi-isoform genes in the Arabidopsis genome.

Ribosome footprinting provides us with a way to measure translation efficiency (TE) of the main ORF, thus we were able to investigate the role of uORFs in the regulation of gene translation. TE is calculated as the ribosome footprint density divided by its corresponding RNA-seq read density, and it indicates how well a transcript is translated [13]. We compared the TE between genes with translated uORFs and genes without translated uORFs, and we found a significant reduction of TE for transcripts with translated uORFs (p-value = 1.593e-14) (Fig. 5D).

D

Fig. 5. Characteristics of genes that contain translated uORFs. A. Distribution of number of uORFs per gene. B. Enriched GO-terms for genes containing translated uORFs. Background represents the GO-terms for all gene models in TAIR 10. C. Proportion of gene containing translated uORFs. D. Box plot of translation efficiency for genes with translated uORF and genes without translated uORF

D. Features associated with translated uORFs To further investigate the mechanisms involved in uORF translation we searched for features that are associated with the translation status of uORFs. We extracted a set of sequence features from the vicinity of translated/untranslated uORF regions and performed association rule mining to identify features associated with the translation status of the assessed uORFs. The rules that meet support (0.02) and confidence (0.8) thresholds are shown in Table V. We found that translated uORFs tend to be present in the genes with long coding sequences, 5' UTR and 3' UTR and have a smaller number of uORFs in their 5'-UTR region as compared to genes with untranslated uORFs. It has been suggested that the sequence context near the translation start site influences the translation of a coding region [44]. In higher plants, the most conserved sequence patterns near the translation start site of highly expressed genes are the presence of A/G at position -3, and the presence of G at position +4 [45]. However, in general, sequence conservation in the vicinity of the translation start site differs among species. In our analysis, we found rules that associate the presence of A

1536-1241 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TNB.2016.2516950, IEEE Transactions on NanoBioscience

or G at positions -3 and 4 with respect to the start codon with translated uORFs. On the other hand, our analysis produced rules that associate the absence of A or G at the corresponding sequence positions with untranslated uORFs. We also found rules that include features other than the sequence context of the start codon. For example, the peptides encoded by translated uORFs tend to associate with high codon bias. Some features occur in rules that describe both translated and untranslated uORF. For example, there are rules that show the peptides in both translated and untranslated uORFs have non-binding potential, but a uORF with non-binding potential peptides and a short UTR tends to be untranslated. These results indicate that uORF translation is a complex process that may require the interaction with multiple sequence features. TABLE V RULES FOR TRANSLATED/UNTRANSLATED UORFS Rules

Support

Confidence

Lift

{order=2,pi=nbp,tutr=medium_large}→ TR

0.02

0.87

1.74

{order=2,tutr=medium_large}→ TR

0.03

0.86

1.73

{pi=nbp,futr=long,tutr=medium_large,mfe=medi um}→TR

0.02

0.82

1.64

{order= 2,pi=nbp,cai=medium_large} →TR

0.02

0.81

1.62

{a_g_3=1,g_4=1,futr=medium_large}→TR

0.02

0.81

1.62

{rf=0,cds=large,no_of_uorf=1} →TR

0.02

0.81

1.62

{pi=nbp,futr=large,tutr=medium_large}→TR

0.03

0.80

1.60

{tutr=medium_large,no_of_uorf=2,mfe=medium } →TR

0.02

0.80

1.60

{ins=stable,cds=small,tutr=small} →UN

0.02

1.00

2.00

{g_4=0,cds=small,no_of_uorf=[5,52]} →UN

0.03

0.97

1.95

{cds=small,no_of_uorf=[5,52]}→UN

0.03

0.95

1.91

{cds=small,tutr=small,mfe=medium}→UN

0.02

0.94

1.89

{g_4=0,futr=large,cds=short} →UN

0.02

0.92

1.83

{futr=large,cds=small} →UN

0.03

0.89

1.78

0.02

0.88

1.76

0.03

0.85

1.71

{rf=2,order= 1,pi=nbp,tutr=small,mfe=medium} →UN {order=1,a_g_3=0,g_4=0,pi=nbp,ins=stable,tutr =small} →UN

The sequence features (items) are described in Part F of Material and Methods. UN: untranslated uORF, TR: translated uORF. All reported rules are statistically significant and have adjusted p-values smaller than 1.21E-03. P-values are calculated using a chi-square test [46] and adjusted using Bonferroni correction [47].

IV. CONCLUSION In this paper, we described a semi-supervised approach to identify translated uORFs using ribosome footprinting data in combination with additional genomic features. We have chosen this approach because due to the extremely small number of experimentally characterized uORFs a simple classification approach likely would result in unreliable predictions. Since we have observed that the distribution of ribosome footprints follows characteristic patterns during the different stages of translation we expect that incorporating unlabeled data points

that follow these distribution patterns, and cluster with the experimentally validated uORFs, will help learning the true characteristics of translated uORFs, and likely improve our results. Using this approach, we identified 5360 translated uORFs in 2051 genes, which account for 6% of the annotated genes in Arabidopsis thaliana. Previously, it has been shown that genes encoding transcription factors are over-represented among the genes containing conserved uORFs [48]. Our study confirms these results. In addition, our analysis highlights other regulatory functions, such as catalytic activity, transferase activity and kinase activity enriched in genes that contain translated uORFs. Several studies have suggested that uORFs may control expression of downstream main ORFs via different mechanisms [1, 2]. Often, a uORFs attenuates the translation of the main CDS due to a reduction of ribosome re-initiation at the downstream translation start site [1, 2]. The presence of a uORF might also create a premature termination codon, and thus trigger nonsense-mediated decay [49, 50]. In addition, various sequence features that affect translation status and translation speed, often in combination with uORFs, have been reported [28, 51, 52]. Our analysis is consistent with these observations. We also found a statistically significant reduction of TE in genes with translated uORFs. In addition, association rule mining revealed sequence features associated with the translation status of uORFs. We hypothesize that the corresponding association rules describe the translation behavior of uORFs. A further result of our study is the observation that translated uORFs occur with a higher frequency in multi-isoform genes than in single-isoform genes. Several uORFs occur only in a subset of the transcripts generated from the corresponding gene locus. We hypothesize, that in some cases alternative transcription start sites, or alternative splicing events, might regulate presence and absence of translated uORFs. Together, these results suggest that uORF translation and, consequently, the translation of the corresponding downstream main ORFs is a complex process that may be regulated by multiple factors. In summary, this study provides a novel bioinformatics approach to identify potentially translated uORFs in the Arabidopsis genome using Ribo-seq data. Our analysis provides a list of candidate uORFs for future functional analyses. We expect that additional translated uORFs will be found in the future as more ribosome footprinting datasets become available for Arabidopsis and additional uORFs get experimentally validated. V. SUPPLEMENTAL MATERIALS All the predicted translated uORFs are available online at: https://www.dropbox.com/sh/zdutupedxafhly8/AABFsdNR5z DfiozB7B4igFcja?dl=0. ACKNOWLEDGMENT This material is based upon work supported by the National Science Foundation under grant No. IOS-1444561.

1536-1241 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TNB.2016.2516950, IEEE Transactions on NanoBioscience

REFERENCES [1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16] [17]

[18]

D. R. Morris and A. P. Geballe, "Upstream open reading frames as regulators of mRNA translation," Mol Cell Biol, vol. 20, pp. 8635-42, Dec 2000. S. E. Calvo, D. J. Pagliarini, and V. K. Mootha, "Upstream open reading frames cause widespread reduction of protein expression and are polymorphic among humans," Proc Natl Acad Sci U S A, vol. 106, pp. 7507-12, May 5 2009. S. Jeon and J. Kim, "Upstream open reading frames regulate the cell cycle-dependent expression of the RNA helicase Rok1 in Saccharomyces cerevisiae," FEBS Lett, vol. 584, pp. 4593-8, Nov 19 2010. H. S. Kwon, D. K. Lee, J. J. Lee, H. J. Edenberg, Y. H. Ahn, and M. W. Hur, "Posttranscriptional regulation of human ADH5/FDH and Myf6 gene expression by upstream AUG codons," Arch Biochem Biophys, vol. 386, pp. 163-71, Feb 15 2001. A. Imai, Y. Hanzawa, M. Komura, K. T. Yamamoto, Y. Komeda, and T. Takahashi, "The dwarf phenotype of the Arabidopsis acl5 mutant is suppressed by a mutation in an upstream ORF of a bHLH gene," Development, vol. 133, pp. 3575-85, Sep 2006. F. Alatorre-Cobos, A. Cruz-Ramirez, C. A. Hayden, C. A. Perez-Torres, A. L. Chauvin, E. Ibarra-Laclette, et al., "Translational regulation of Arabidopsis XIPOTL1 is modulated by phosphocholine levels via the phylogenetically conserved upstream open reading frame 30," J Exp Bot, vol. 63, pp. 5203-21, Sep 2012. I. Ebina, M. Takemoto-Tsutsumi, S. Watanabe, H. Koyama, Y. Endo, K. Kimata, et al., "Identification of novel Arabidopsis thaliana upstream open reading frames that control expression of the main coding sequences in a peptide sequence-dependent manner," Nucleic Acids Res, vol. 43, pp. 1562-76, Feb 18 2015. C. Hanfrey, M. Franceschetti, M. J. Mayer, C. Illingworth, and A. J. Michael, "Abrogation of upstream open reading frame-mediated translational control of a plant S-adenosylmethionine decarboxylase results in polyamine disruption and growth perturbations," J Biol Chem, vol. 277, pp. 44131-9, Nov 15 2002. Selpi, C. H. Bryant, G. J. Kemp, J. Sarv, E. Kristiansson, and P. Sunnerhagen, "Predicting functional upstream open reading frames in Saccharomyces cerevisiae," BMC Bioinformatics, vol. 10, p. 451, 2009. A. G. von Arnim, Q. Jia, and J. N. Vaughn, "Regulation of plant translation by upstream open reading frames," Plant Sci, vol. 214, pp. 1-12, Jan 2014. M. Cvijovic, D. Dalevi, E. Bilsland, G. J. Kemp, and P. Sunnerhagen, "Identification of putative regulatory upstream ORFs in the yeast genome using heuristics and evolutionary conservation," BMC Bioinformatics, vol. 8, p. 295, 2007. H. Takahashi, A. Takahashi, S. Naito, and H. Onouchi, "BAIUCAS: a novel BLAST-based algorithm for the identification of upstream open reading frames with conserved amino acid sequences and its application to the Arabidopsis thaliana genome," Bioinformatics, vol. 28, pp. 2231-41, Sep 1 2012. N. T. Ingolia, S. Ghaemmaghami, J. R. Newman, and J. S. Weissman, "Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling," Science, vol. 324, pp. 218-23, Apr 10 2009. C. Fritsch, A. Herrmann, M. Nothnagel, K. Szafranski, K. Huse, F. Schumann, et al., "Genome-wide search for novel human uORFs and N-terminal protein extensions using ribosomal footprinting," Genome Res, vol. 22, pp. 2208-18, Nov 2012. N. T. Ingolia, L. F. Lareau, and J. S. Weissman, "Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes," Cell, vol. 147, pp. 789-802, Nov 11 2011. X. Zhu, "Semi-supervised learning literature survey," Computer Sciences, University of Wisconsin-Madison 2008. Z. Hua-Jun, W. Xuan-Hui, C. Zheng, L. Hongjun, and M. Wei-Ying, "CBC: clustering based text classification requiring minimal labeled data," in Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, 2003, pp. 443-450. R. Dara, S. C. Kremer, and D. A. Stacey, "Clustering unlabeled data with SOMs improves classification of labeled real-world data," in Neural Networks, 2002. IJCNN '02. Proceedings of the 2002 International Joint Conference on, 2002, pp. 2237-2242.

[19]

[20]

[21] [22]

[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]

[31] [32]

[33]

[34]

[35]

[36]

[37] [38]

[39]

[40]

B. Longstaff, S. Reddy, and D. Estrin, "Improving activity classification for health applications on mobile devices using active and semi-supervised learning," in Pervasive Computing Technologies for Healthcare (PervasiveHealth), 2010 4th International Conference on-NO PERMISSIONS, 2010, pp. 1-7. B. Z. Saso Dzeroski, "Is Combining Classifiers with Stacking Better than Selecting the Best One," Machine Learning, vol. 54, pp. 255-273, 2004. D. H. Wolpert, "Stacked generalization," Neural Networks, vol. 5, pp. 241-259, 1992. Q. Hu, C. Merchante, A. Stepanova, J. Alonso, and S. Heber, "A Stacking-Based Approach to Identify Translated Upstream Open Reading Frames in Arabidopsis Thaliana," in Bioinformatics Research and Applications. vol. 9096, R. Harrison, Y. Li, and I. Măndoiu, Eds., ed: Springer International Publishing, 2015, pp. 138-149. C. Merchante, J. Brumos, J. Yun, Q. Hu, Kristina R. Spencer, P. Enríquez, et al., "Gene-Specific Translation Regulation Mediated by the Hormone-Signaling Molecule EIN2," Cell, vol. 163, pp. 684-697. C. Trapnell, L. Pachter, and S. L. Salzberg, "TopHat: discovering splice junctions with RNA-Seq," Bioinformatics, vol. 25, pp. 1105-11, May 1 2009. S. J. Andrews and J. A. Rothnagel, "Emerging evidence for functional peptides encoded by short open reading frames," Nat Rev Genet, vol. 15, pp. 193-204, 03//print 2014. T. M. Schmeing and V. Ramakrishnan, "What recent ribosome structures have revealed about the mechanism of translation," Nature, vol. 461, pp. 1234-1242, 10/29/print 2009. P. Juntawong, T. Girke, J. Bazin, and J. Bailey-Serres, "Translational dynamics revealed by genome-wide profiling of ribosome footprints in Arabidopsis," Proc Natl Acad Sci U S A, vol. 111, pp. E203-12, Jan 7 2014. C. Vilela and J. E. McCarthy, "Regulation of fungal gene expression via short open reading frames in the mRNA 5'untranslated region," Mol Microbiol, vol. 49, pp. 859-67, Aug 2003. R. I. Agrawal, T.; Swami, A., "Mining association rules between sets of items in large databases," Proceedings of the 1993 ACM SIGMOD international conference on Management of data SIGMOD '93, p. 207, 1993. R. Lorenz, S. H. Bernhart, C. Honer Zu Siederdissen, H. Tafer, C. Flamm, P. F. Stadler, et al., "ViennaRNA Package 2.0," Algorithms Mol Biol, vol. 6, p. 26, 2011. H. G. Boman, "Antibacterial peptides: basic facts and emerging concepts," J Intern Med, vol. 254, pp. 197-215, Sep 2003. K. Guruprasad, B. V. Reddy, and M. W. Pandit, "Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence," Protein Eng, vol. 4, pp. 155-61, Dec 1990. P. Rice, I. Longden, and A. Bleasby, "EMBOSS: the European Molecular Biology Open Software Suite," Trends Genet, vol. 16, pp. 276-7, Jun 2000. P. J. Rousseeuw, "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis," Computational and Applied Mathematics pp. 53-65, 1987. Z. Du, X. Zhou, Y. Ling, Z. Zhang, and Z. Su, "agriGO: a GO analysis toolkit for the agricultural community," Nucleic Acids Res, vol. 38, pp. W64-70, Jul 2010. T. Tabuchi, T. Okada, T. Azuma, T. Nanmori, and T. Yasuda, "Posttranscriptional regulation by the upstream open reading frame of the phosphoethanolamine N-methyltransferase gene," Biosci Biotechnol Biochem, vol. 70, pp. 2330-4, Sep 2006. T. Fawcett, "An introduction to ROC analysis," Pattern Recogn. Lett., vol. 27, pp. 861-874, 2006. T. Nyiko, B. Sonkoly, Z. Merai, A. H. Benkovics, and D. Silhavy, "Plant upstream ORFs can trigger nonsense-mediated mRNA decay in a size-dependent manner," Plant Mol Biol, vol. 71, pp. 367-78, Nov 2009. T. Nishimura, T. Wada, K. T. Yamamoto, and K. Okada, "The Arabidopsis STV1 protein, responsible for translation reinitiation, is required for auxin-mediated gynoecium patterning," Plant Cell, vol. 17, pp. 2940-53, Nov 2005. O. David-Assael, H. Saul, V. Saul, T. Mizrachy-Dagri, I. Berezin, E. Brook, et al., "Expression of AtMHX, an Arabidopsis vacuolar

1536-1241 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TNB.2016.2516950, IEEE Transactions on NanoBioscience

[41]

[42]

[43]

[44] [45]

[46]

[47]

[48]

[49]

[50]

[51]

[52]

metal transporter, is repressed by the 5' untranslated region of its gene," J Exp Bot, vol. 56, pp. 1039-47, Mar 2005. X. Zhu, S. K. Thalor, Y. Takahashi, T. Berberich, and T. Kusano, "An inhibitory effect of the sequence-conserved upstream open-reading frame on the translation of the main open-reading frame of HsfB1 transcripts in Arabidopsis," Plant Cell Environ, vol. 35, pp. 2014-30, Nov 2012. J. Puyaubert, L. Denis, and C. Alban, "Dual targeting of Arabidopsis holocarboxylase synthetase1: a small upstream open reading frame regulates translation initiation and protein targeting," Plant Physiol, vol. 146, pp. 478-91, Feb 2008. A. Wiese, N. Elzinga, B. Wobbes, and S. Smeekens, "A conserved upstream open reading frame mediates sucrose-induced repression of translation," Plant Cell, vol. 16, pp. 1717-29, Jul 2004. M. Kozak, "Determinants of translational fidelity and efficiency in vertebrate mRNAs," Biochimie, vol. 76, pp. 815-21, 1994. C. Joshi, H. Zhou, X. Huang, and V. Chiang, "Context sequences of translation initiation codon in plants," Plant Molecular Biology, vol. 35, pp. 993-1001, 1997/12/01 1997. S. A. Alvarez, "Chi-Squared Computation for Association Rules: Preliminary Results," echnical Report BC-CS-2003-01, Computer Science Department, Boston College, 2003. B. Hochberg, "Controlling the false discovery rate: A practical and powerful approach to multiple testing," Journal of the Royal Statistical Society, vol. Serie B, pp. 800-803, 1995. C. A. Hayden and R. A. Jorgensen, "Identification of novel conserved peptide uORF homology groups in Arabidopsis and rice reveals ancient eukaryotic origin of select groups and preferential association with transcription factor-encoding genes," BMC Biol, vol. 5, p. 32, 2007. C. Barbosa, I. Peixeiro, and L. Romao, "Gene expression regulation by upstream open reading frames and human disease," PLoS Genet, vol. 9, p. e1003529, 2013. J. Rehwinkel, J. Raes, and E. Izaurralde, "Nonsense-mediated mRNA decay: Target genes and functional diversification of effectors," Trends Biochem Sci, vol. 31, pp. 639-46, Nov 2006. P. Shu, H. Dai, W. Gao, and E. Goldman, "Inhibition of translation by consecutive rare leucine codons in E. coli: absence of effect of varying mRNA stability," Gene Expr, vol. 13, pp. 97-106, 2006. M. Kozak, "Constraints on reinitiation of translation in mammals," Nucleic Acids Res, vol. 29, pp. 5226-32, Dec 15 2001.

1536-1241 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Genome-Wide Search for Translated Upstream Open Reading Frames in Arabidopsis Thaliana.

Upstream open reading frames (uORFs) are open reading frames that occur within the 5' UTR of an mRNA. uORFs have been found in many organisms. They pl...
1MB Sizes 0 Downloads 4 Views