Articles

Detecting actively translated open reading frames in ribosome profiling data

© 2015 Nature America, Inc. All rights reserved.

Lorenzo Calviello1, Neelanjan Mukherjee1,4, Emanuel Wyler1,4, Henrik Zauber1, Antje Hirsekorn1, Matthias Selbach1, Markus Landthaler1, Benedikt Obermayer1 & Uwe Ohler1–3 RNA-sequencing protocols can quantify gene expression regulation from transcription to protein synthesis. Ribosome profiling (Ribo-seq) maps the positions of translating ribosomes over the entire transcriptome. We have developed RiboTaper (available at https://ohlerlab.mdc-berlin.de/ software/), a rigorous statistical approach that identifies translated regions on the basis of the characteristic threenucleotide periodicity of Ribo-seq data. We used RiboTaper with deep Ribo-seq data from HEK293 cells to derive an extensive map of translation that covered open reading frame (ORF) annotations for more than 11,000 protein-coding genes. We also found distinct ribosomal signatures for several hundred upstream ORFs and ORFs in annotated noncoding genes (ncORFs). Mass spectrometry data confirmed that RiboTaper achieved excellent coverage of the cellular proteome. Although dozens of novel peptide products were validated in this manner, few of the currently annotated long noncoding RNAs appeared to encode stable polypeptides. RiboTaper is a powerful method for comprehensive de novo identification of actively used ORFs from Ribo-seq data.

Ribo-seq is a method used to globally investigate protein synthesis1 by sequencing RNA fragments protected by engaged ribosomes. Mapping these ribosomal footprints to the transcriptome produces a detailed quantitative picture of translation across thousands of genes2. In the few years since its first description, Ribo-seq has been used to monitor translation in different organisms3 and biological conditions4. Applications cover a wide range of subjects, including the identification of alternative start codons5 and the definition of short peptides6,7, as well as quantitative aspects such as translation rates and codon optimality8. With the number of annotated noncoding RNAs increasing, a central question is whether some of these transcripts contain short translated ORFs missed by annotation efforts. To address this, a number of studies proposed ad hoc Ribo-seq–based global metrics that aimed at defining the translational status of such noncoding transcripts9–11. The mere presence of Ribo-seq reads in regions of the transcriptome does not, however, imply the presence

of actively elongating ribosomes. The mode of Ribo-seq readlength distributions typically falls at ~29–30 nt, which is the known fragment size protected by 80S ribosomes12. Consequently, this subset of ribosomal footprints displays a striking bias toward the translated frame1,6,9, which can be used to infer, for each read, the position of the peptidyl-site (P-site) compartment of the translating ribosome9. This subcodon resolution holds great promise for discriminating between ribosomal coverage and a periodic footprint profile (PFP) (i.e., a consistent, 3-nt periodic codon-by-codon alignment pattern across a transcript). Yet this property has only recently been exploited in summary statistics to identify small translated ORFs6,13 or to explore translation on multiple frames14. Despite the recent development of Ribo-seq analysis tools15–17, a statistically rigorous method that uses this property to identify translated regions has not been proposed. Here we present a computational approach, based on spectral analysis, that can be used to comprehensively identify the set of PFPs in a given Ribo-seq sample. Our method, called RiboTaper, exploits the subcodon resolution of Ribo-seq reads to call high-confidence translated loci and reconstructs the full set of ORFs in annotated coding and noncoding transcripts. We quantify RiboTaper’s performance relative to that of alternative approaches, show that a limited number of noncoding RNAs contain actively translated ORFs, analyze evolutionary signatures in different ORF classes and cross-reference the identified ORFs with proteome-wide experimental evidence. RESULTS The RiboTaper strategy to test Ribo-seq subcodon profiles The first step in this method is to define P-site positions for the majority of mapped reads according to their aggregate profiles over annotated start codons6,9 (Supplementary Fig. 1). From these, RiboTaper creates data tracks for every annotated exonic region and uses the multitaper approach18 to test for significant PFPs in exonic P-site profiles (Fig. 1a). Finally, exonic profiles are merged according to the annotated transcript structures to detect translation on de novo identified ORFs (Online Methods and Supplementary Software).

1Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine, Berlin, Germany. 2Department of Biology, Humboldt University, Berlin,

Germany. 3Department of Computer Science, Humboldt University, Berlin, Germany. 4These authors contributed equally to this work. Correspondence should be addressed to U.O. ([email protected]). Received 21 July; accepted 28 October; published online 14 December 2015; doi:10.1038/nmeth.3688

nature methods  |  ADVANCE ONLINE PUBLICATION  |  

Articles a

Data-track creation

P-sites

P-sites

15 10 5 0 0

0

MAPKAPK5–AS1 lincRNA exon

6 5 4 3 2 1 0

50 100 150 200 nt

0

50 100 150 200 nt

Apply multitaper method

0.2 Fn(x)

0.1 0 –0.1 –0.2 0

50

100 nt

150

200

0.28 0.32 0.36 Frequency (Hz)

b 98

100

0.28 0.32 0.36 Frequency (Hz)

40

95 90

78

60

97

10 8 6 4 2 0 0.28 0.32 0.36 Frequency (Hz)

c

98 96

90

80

F-value

10 8 6 4 2 0

CCDS exons

10 8 6 4 2 0

F-value

F-value

Significance values at 3-nt periodicity

Multitaper: RNA-seq

15,000

5,000 0 0

59

20

High

117–138

th

ng

75–97

Med

t) (n

Low

io

ss

re

p Ex

M)

PK

R n(

CCDS exons

0 165–213 Le

  |  ADVANCE ONLINE PUBLICATION  |  nature methods

7 6 5 4 3 2 1 0

50 100 150 200 nt

In the multitaper approach, a set of orthogonal window functions (tapers) is applied to a discrete signal before its Fourier transform is computed. The transformed windowed outputs are averaged to obtain a smoothed spectrum amenable to a statistical test for detecting significant frequencies19,20 (Online Methods). In this way, the multitaper method tests directly for the presence of PFPs, in contrast to current approaches that look at the enrichment of Ribo-seq reads over RNA-seq (translation efficiency 1) or a tendency for reads to align to one frame. RiboTaper detects translation across a wide expression range To assess the performance and utility of RiboTaper, we generated a deep ribosome-profiling data set for HEK293 cells (Online Methods). We obtained >29 Mio reads, of which >25 Mio aligned to the genome (Supplementary Table 1). Profiles at consensus coding sequence (CCDS) exons (Online Methods) displayed a striking frame preference, in excellent agreement with annotated transcript types and regions (Supplementary Fig. 2). To evaluate the sensitivity of the multitaper approach, we applied it to profiles from CCDS exons of different lengths and Ribo-seq coverage (Supplementary Figs. 2–4). As expected, the test achieved higher sensitivity on longer exons and benefitted from higher coverage (Fig. 1b). Twenty-four tapers exhibited the best combination of sensitivity and specificity (Supplementary Fig. 5). We then benchmarked the multitaper approach against a χ2 significance test as a baseline. The χ2 test uses a null hypothesis of a uniform frame P-site distribution and corresponds to the basic assumption behind the ORFscore6. However, sequencing protocols are affected by different sources of variability that cause a nonuniform distribution of reads21. Further, insufficient depth may also lead to sampling biases and spurious enrichments. To evaluate specificity on an appropriate negative sample (i.e., regions without 3-nt periodic reads), we applied the tests to RNA-seq data from annotated CCDS exons. At a significance level of 0.05, the multitaper test had slightly lower sensitivity than the χ2 test (87% versus 94% of positive exons; Supplementary Figs. 2–4). However, when applied to RNA-seq profiles, the χ2 test showed a worrisome number of positive calls compared with the multitaper test (16% versus 3.5%; Fig. 1c and Supplementary Figs. 2–4). We observed strongly skewed P values for the χ2 test but a desirable near-uniform distribution of the multitaper P values for the RNA-seq data (Fig. 1c). This high specificity is critical in studies of translation outside of protein-coding regions and directly pertains to Riboseq data: integrated in our analysis pipeline, the χ2 test detected 67 translated ORFs in small nucleolar RNAs and small nuclear RNAs—transcripts unlikely to generate peptides—whereas the

DEPDC7 coding exon

P-sites

ATP6V1H 5′ UTR exon

CCDS exons (%)

© 2015 Nature America, Inc. All rights reserved.

Figure 1 | The RiboTaper workflow. (a) Data tracks (P-site tracks shown) are created for all annotated exons. P-site positions are color-coded according to the frame; the multitaper method estimates the significance of each frequency component in the different tracks. Three examples of length- and coverage-matched exons from different transcript classes are shown. (b) Effects of ORF length and expression on the percentage of expressed CCDS exons with significant 3-nt periodicity (exact value on each bar). Low, medium (Med) and high bins correspond to 2.34−3.84, 5.91−9.14 and 15.08−29.77 reads per kilobase of exon per million reads (RPKM), respectively. (c) Histograms of P values for all CCDS exons tested by applying the multitaper (top) and χ2 (bottom) tests to RNA-seq data as a negative control. P < 0.05 shown in red.

0.4 0.8 P value χ2: RNA-seq

15,000

5,000 0 0

0.4 0.8 P value

multitaper detected only 3. A principled statistical framework eliminates the need for heuristic parameters or ad hoc cutoffs, a crucial step when score-based metrics are being used with different data sets (Supplementary Fig. 6). RiboTaper defines known and novel translated ORFs We next created transcript tracks for de novo translated ORF identification using 3-nt periodicity and frame definition according to the P-site positions (Fig. 2a and Online Methods). We classified ORFs on the basis of annotation categories and genomic position relative to known coding regions (Fig. 2b,c, Supplementary Fig. 7 and Online Methods). This led to ~21,000 translated ORFs in ~14,000 expressed genes, in coding and noncoding transcripts, across a wide range of expression values (Supplementary Fig. 8).

Articles

60

0.8

0

0

p 0.8

0

0

St o

AT G

0 0.28

0.32 0.36 Frequency (Hz)

0.8

0

AT G St op

0

b

F-value

Total P-sites in frame (%)

AT G

P-sites

0.32 0.36 Frequency (Hz)

10

p

P-sites

60

Upstream P-sites in frame (%) 0.8 60 0

0 0.28

Total P-sites in frame (%)

AT G

0

10

St o

Upstream P-sites in frame (%) 0.8

F-value

Total P-sites in frame (%)

AT G

P-sites

De novo ORF identification ENST00000219400.3

F-value

a

AT G

Figure 2 | De novo ORF reconstruction and examples of detected ORFs. (a) For a given transcript, we anchored the analysis on the stop codon and used the most upstream in-frame ATG with more than five P-site positions (>50% in-frame) between it and its closest downstream ATG (green dashed box). (b) Examples of translated ORFs for two uORFs and one CCDS ORF for MIEF1. (c) A ncORF in the CTD-2162K18.5 gene (HGCN symbol ZNF790-AS1). For a given transcript, each plot depicts the RNA-seq coverage (gray shading), the potential ORFs of the three possible frames (red, green and blue bars) and the ORFs with significant PFPs (colored according to the annotation category). P-sites positions are color-coded according to the frame. All of the ORFs detected were supported by peptides from mass spectrometry data except for the most 5′ uORF in MIEF1.

10 0 0.28

0.32 0.36 Frequency (Hz)

MIEF1 protein-coding gene ENST00000325301.2 104

uORF

P-sites

The vast majority of ORFs in proteincoding genes overlapped known CCDS 10 coding regions; 369 non-CCDS proteincoding genes were identified as harboring 5 translated ORFs. We detected >600 0 genes with translated upstream ORFs 0 (uORFs) and 54 genes with downstream ORFs (dORFs; Online Methods). We also identified ORFs in 504 noncoding genes Frame 0 Frame 1 (ncORFs), mainly belonging to pseudogene, Frame 2 antisense and long intergenic noncoding c RNA (lincRNA) biotypes (Fig. 3a). RiboTaper-identified ORFs exhibited 8 a protein-coding gene–like distribution 6 of Ribo-seq read lengths, as quantified by the FLOSS score11, across all ORF 4 categories (Supplementary Fig. 9). The reconstructed ORF coordinates agreed 2 with translation-initiation sites defined by quantitative translation-initiation 0 sequencing22 (QTI-seq) or the reference 0 annotation (Supplementary Fig. 10). Both QTI-seq and RiboTaper detected Frame 0 149 upstream initiation sites (mostly corFrame 1 Frame 2 responding to uORF start codons) and 52 internal start codons. Approximately 1,000 QTI-seq ATG start-codon candidates did not overlap with either annotated or RiboTaper-defined start codons. More than 99% of CCDS ORFs identified by RiboTaper in our HEK293 data were also found in the cycloheximide-treated Ribo-seq sample studied by Gao et al.22 (Fig. 3b). However, agreement dropped to 68% for lincRNA-antisense ORFs and 47% for uORF-containing genes, possibly because of their relatively short length and low expression levels (Fig. 3b). To demonstrate the general applicability of RiboTaper, we identified thousands of coding ORFs and ncORFs in Ribo-seq data from zebrafish embryo6 (Supplementary Fig. 4, Supplementary Tables 2 and 3 and Supplementary Data). Among the identified ncORFs was the recently discovered ORF in the lincRNA toddler23,

CCDS ORF RNA coverage

200

400

600 800 Nucleotides

1,000

1,200

1,400

CTD-2162K18.5 antisense gene ENST00000589556.1

35 ncORF antisense

P-sites

RNA coverage

© 2015 Nature America, Inc. All rights reserved.

20

0 100

200

300

400

Nucleotides

which encodes a small polypeptide morphogen essential for zebrafish embryonic development (Supplementary Fig. 11). Distinct evolutionary features in different ORF categories To gain insight into the identified ORFs in categories outside of annotated coding regions, we analyzed evolutionary conservation patterns of ORFs in different categories. CCDS ORFs and non-CCDS coding ORFs exhibited the highest nucleotide conservation 24, followed by pseudogenes and processed transcripts (Supplementary Fig. 12a,b). Unlike in bona fide coding regions, we observed elevated nucleotide conservation only around start and stop codons for uORFs and processed transcripts (Fig. 4a and Supplementary Fig. 12a).

nature methods  |  ADVANCE ONLINE PUBLICATION  |  

10

C D S C

C ncORFs

  |  ADVANCE ONLINE PUBLICATION  |  nature methods

154

is

nc

en

R N A Pr tra oce s ns s cr ed ip t

se

e og

Li

ud Ps e

An t

N A Pr tra oce s ns s e cr d ip t

en is

R Li

en

An t

nc

Gao et al.

502

8,807

0.10 0.01

Genes with a uORF This study

2,745

LincRNA and antisense genes with an ncORF Gao et al.

This study

121

173

72

34

1,000 100

is

st u G on dy ao ly et o a O nly l. ve rla p

10

Th

ORF P-sites (sum)

st u G on dy ao ly et o a O nly l. ve rla p

10

is

is st u G on dy ao ly et o a O nly l. ve rla p

10

100

Th

1,000

ORF P-sites (sum)

90 Gao et al.

Th

is st u G on dy ao ly et o a O nly l. ve rla p

Th

Ribo-seq as a proxy to define the cellular proteome The ORFs identified by RiboTaper covered a wide range of expression values, to the extent that the coverage obtained might exceed that of deep mass spectrometry data sets with respect to defining the cellular proteome. To evaluate this, we created a custom database from the set of identified ORFs to match the spectra of a recent HEK293 tandem mass spectrometry data set27. The RiboTaper peptide set corresponded to ~59% of the peptides in UniProt (human entries, October 2014) and 2% of non-UniProt candidates. The RiboTaper set matched >90,000 peptide sequences belonging to >8,000 genes (Fig. 4c), similar to the results of the full UniProt search. More than 3,900 peptide sequences were found only by our custom search, and not from the UniProt database (1% false discovery rate (FDR); Supplementary Table 4). However, our search missed a similar number of peptides. RiboTaper-only peptides

Genes with a CCDS ORF This study

ORF P-sites (sum)

b Translated genes Next we assessed whether sequence conservation reflects evolutionary selecThis study tion on the encoded protein sequence, 3,256 and whether this selection persists in the human population. We quantified coding 9,202 potential by means of hexamer sequence 173 Gao statistics25 and tested the preference for et al. synonymous versus nonsynonymous sin1,000 gle-nucleotide polymorphisms (SNPs) in 10 the human population using appropriate length- and conservation-matched controls (Supplementary Fig. 12b and Online Methods). For CCDS and non-CCDS coding ORFs, as well as processedtranscript ncORFs, nucleotide conservation was accompanied by good hexamer scores and depletion of nonsynonymous SNPs (dN/dS; Fig. 4b). uORFs also showed conservation on the nucleotide level, but they showed no significant enrichment of synonymous substitutions (dN/dS), suggesting potential regulatory rather than protein-coding roles, as their position but not their sequence is conserved. Additional ncORF categories showed low nucleotide conservation (except for pseudogenes), a positive trend for hexamer scores and no depletion for nonsynonymous SNPs. Analysis of codonsubstitution patterns across vertebrate species26 led to a similar outcome (Supplementary Fig. 12b). Taken together, the ncORFs detected by RiboTaper did not necessarily have conservation and selection patterns similar to those of protein-coding regions.

se

e

10

1.00

en

100

ud

LincRNA: 79 genes

Processed transcript: 56 genes

1,000

Ps e

Antisense: 114 genes

Other: 8 genes

og

Pseudogene: 245 genes

ORF P-sites (sum)

© 2015 Nature America, Inc. All rights reserved.

10.00

C D S

uORFs: 656 genes

10.00 1.00 0.10 0.01

R Fs

100

dO

10,000

1,000

O R Fs co No di nng C O CD R S Fs uO R Fs

10,000

P-sites per codon

dORFs: 54 genes Non-CCDS coding ORFs: 369 genes

P-sites per codon

Protein-coding ORFs

CCDS ORFs: 11,552 genes

O R Fs co No di nng C O CD R S Fs uO R Fs dO R Fs

a

ORF length (nt)

Figure 3 | Comparative analysis of translated ORFs from coding and noncoding regions. (a) Number of genes, ORF length and Ribo-seq coverage of protein-coding ORFs and ncORFs in the different subcategories. (b) Overlap (top) and coverage (bottom) of ORFs identified in the data set reported by Gao et al.22 compared to our data set, split by ORF category. For all box plots, in each box, the central black line indicates the median, and the upper and lower edges denote the 25th and 75th percentiles. Whiskers represent 1.5 times the interquartile range from the box hinge. Outliers are plotted as individual data points.

ORF length (nt)

Articles

matched more spectra than UniProt-only peptides did, despite being shorter and having lower matching scores (Supplementary Fig. 13a–c). We found little evidence for expression or translation for most of the UniProt-only peptides (Fig. 4d), suggesting that they may derive from erroneous calls or stable peptides from unstable RNAs. RiboTaper ORFs with peptide support mapped to CCDS genes, with few exceptions (Fig. 4e). We identified peptides belonging to a uORF in MIEF1 (ref. 28) (Fig. 2b) and from dORFs and ncORFs located in conserved and nonconserved genomic regions (Fig. 2c and Supplementary Figs. 14–17). In total, 228 identified peptide sequences were not annotated in UniProt, and 157 were still detected using a merged database encompassing UniProt and RiboTaper entries (Supplementary Table 4). In most cases, the newly identified peptides mapped uniquely to their respective ORFs (Fig. 4e and Supplementary Table 4). DISCUSSION Ribo-seq reads of specific lengths can display a precise subcodon pattern, which allows for accurate identification of the translated frame. However, certain experimental protocols can have a marked influence on the subcodon profiles2 (Supplementary Fig. 1). The Ribo-seq protocol is far from being standardized29, and codon biases and ribosome stalling30 pose challenges to the quantitative understanding of translation at each locus. The RiboTaper method proposed here is robust and tailored to exploit the periodic subcodon pattern to identify PFPs associated with active translation. Despite the identification of many PFPs in noncoding transcripts10,11, evolutionary-conservation analysis suggests that the act of translation, rather than the production of specific

Articles dN/dS ratio

ni

oT

ib

R

C C D S co No O di n- RF ng C s O CD R S F uO s Ps d RF eu O s do RF g s An ene tis s e Li nse Pr ncR o tra ce NA ns ss cr ed ip O t th er C C D S co No O di n- R ng C Fs O CD R S uOFs R Ps d Fs eu O do RF g s An en tis es e L n Pr inc se tra oceRN ns ss A cr ed ip Ot th er

Translated genes with peptide evidence

© 2015 Nature America, Inc. All rights reserved.

U

ap

er

s Pr ear ch ot o se ar nly c h R ib o n oT O ly ap ve er rla U ni sea p Pr ot rch se on ar l ch y on O ly ve rla p

TPM (gene level)

C C D S co No OR di n- F s ng C O CD R S uOFs R F Ps dO s eu R do F s ge An ne ti s s en Li s e nc Pr RN tra oce A ns s s c r ed ip t

–2 –25 –10 –15 0 –5 0 5 10 15 20 25

Average phastCons score

–2 –25 –10 –15 0 –5 0 5 10 15 20 25

Average phastCons score

Figure 4 | Conserved and nonconserved a 1.0 CCDS ORFs b 1.00 Pseudogene 1.0 RiboTaper-identified ORFs define the Non-CCDS Processed *** ** *** coding ORFs 0.75 transcript cellular proteome. (a) Local nucleotide 0.8 0.8 uORFs Antisense 0.50 dORFs LincRNA conservation (computed by phastCons) in 0.6 0.6 0.25 a ±25-nt window around start-codon 0 0.4 0.4 positions. (b) Comparison of length- and 0.2 0.2 conservation-matched control ORFs to ORFs in different categories on the basis of dN/dS ratio. **P < 0.01, ***P < 0.001 by Distance from start codon (nt) Distance from start codon (nt) χ2 test, using as expected frequencies the c d values from the ORF control set (Online Ribo-seq RNA-seq 1,000 Detected peptide sequences In silico digested peptide Detected genes Methods). CCDS ORFs and non-CCDS coding sequences with peptide evidence 100 ORFs 30% of the Ribo-seq coverage was supported by multimapping reads only (filtering was disabled for the custom peptidedatabase creation). nature methods

© 2015 Nature America, Inc. All rights reserved.

Multitaper method. In digital-signal processing, the quantitative estimation of periodic components in a finite signal (the power spectral density (PSD)) is an area of intense study with application to diverse fields of scientific research. A switch from the original representation of a signal (the time domain) to its spectrum of fixed periodic components (the frequency domain) is achieved via the Fourier transform. In the frequency domain, a vector of co­efficients represents the contribution of each frequency component in shaping the original signal. The raw output of the Fourier transform (the periodogram) typically suffers from high variability, and as a result it represents a poor estimate of the PSD. Moreover, the limited amount of available realizations of the same signal (i.e., the lack of replicates) poses a real challenge in the estimation of robust coefficients for different frequencies. Applying a smoothing window (taper) to the signal before calculating its Fourier transform reduces such variability but generally creates a biased estimate of the PSD. The choice of taper functions is fundamental, and many different solutions have been proposed in recent decades to find the ‘optimal’ window function. The multitaper method, originally proposed by Thomson18, offers a promising, nonparametric solution to the PSD-estimation problem. Its central idea is to apply a set of multiple window functions to the signal and average the spectral estimates of the ensemble of tapered signals. Because the window functions used in the multitaper method are a set of orthonormal functions, the resulting spectra are independent and can be averaged. Specifically, discrete prolate spheroidal sequences (also known as Slepian sequences) have been shown to maximize the information content of a finite signal at a given frequency resolution. The use of Slepian sequences in the multitaper method allows for an optimal solution for reducing the variance in the power spectrum. The use of multiple orthonormal window functions on the same signal also makes it possible to test the robustness of the estimated spectral coefficients. The amount of variance captured by the estimated coefficient at each frequency bin can be compared with a null hypothesis of white noise, leading to a reliable statistical test to determine significant frequency components20. RiboTaper. The original multitaper algorithm from Thomson is implemented in R in the package “multitaper”42. We added a stretch of zeros to the input sample to reach a minimum length of 50 nt. We ran the multitaper with 24 tapers, with the time-window parameter set to 12. Moreover, sequences shorter than 500 nt were zero-padded to 1,024 data points before the discrete Fourier transform was computed, to obtain adequate frequency resolution in the spectrum. We extracted F-values from the frequency bin closest to 3-nt periodicity. We calculated P values from the F-statistic20 by using 2 and 2k − 2 degrees of freedom, where k is the number of tapers (24 in this study). ORFs and exons with fewer than six P-sites or shorter than 6 nt were ignored.

and stop codons. ORFs were then scored with PhyloCSF in the “mle” (default) mode, using the “29mammals” parameter set on the 46vertebrate alignment to the human genome (hg19) after alignment filtering steps as described by Bazzini et al.6. We additionally used the hexamer score from the CPAT tool to assess the coding potential of different ORFs, using the available trained model for the human genome. For each category, the scores were compared against a control set of ORFs matching the length and conservation of the category of interest. For CCDS ORFs and non-CCDS coding ORFs, we selected ORFs shorter than 300 nt as meaningful matching controls (Supplementary Fig. 10). SNPs were downloaded as .gvf files from Ensembl (v75, 1000 Genomes phase 1). We removed SNPs in reverse orientation, SNPs falling into genomic repeats (using the RepeatMasker track from the UCSC genome browser, March 22, 2015) and rare SNPs with a derived allele frequency of

Detecting actively translated open reading frames in ribosome profiling data.

RNA-sequencing protocols can quantify gene expression regulation from transcription to protein synthesis. Ribosome profiling (Ribo-seq) maps the posit...
565B Sizes 1 Downloads 15 Views