Gene 539 (2014) 203–208
Upstream open reading frames and Kozak regions of assembled transcriptome sequences from the spider Cupiennius salei. Selection or chance?
Andrew S. French ⁎, Audrey W. Li, Shannon Meisner, Päivi H. Torkkeli
Department of Physiology and Biophysics, Dalhousie University, Halifax, Nova Scotia, Canada
Article info
Article history: Received 6 November 2013; received in revised form 24 January 2014; accepted 31 January 2014; available online 12 February 2014
Keywords: Arthropod; mRNA; Translation; Transcriptome; Sequence structure; Evolution
Abstract
We assembled a new set of mRNA sequences from the leg hypodermis transcriptome of the wandering spider, Cupiennius salei. Each sequence was assembled to exhaustion in the 5′ direction to detect all upstream open reading frames (uORFs) both in-frame and out-of-frame with the main open reading frame (mORF). We also counted nucleotide probabilities before and after the START codon of the mORF to establish the optimum Kozak consensus sequence. More than 80% of 5′ sequences had uORFs before the mORF with a range of 1–16 uORFs. Kozak consensus strengths of uORFs were significantly weaker than mORFs. Random scrambling of 5′ nucleotide positions did not give significantly different numbers, sizes, or Kozak consensus strengths of uORFs. Random simulations of 5′ sequences using either equal or experimental distributions of nucleotides gave similar numbers of uORFs, with similar sizes and Kozak consensus strengths to experimental data. Abundance of mRNA for each gene was estimated by counting matching Illumina reads to assembled genes. Abundance was negatively correlated with numbers of uORFs, but not with 5′ length. Our data are compatible with a random model of 5′ mRNA sequence structure. © 2014 Elsevier B.V. All rights reserved.
1. Introduction
The established model of protein translation from messenger RNA, utilizing START and STOP codons to define a main open reading frame (mORF), has expanded to include a range of features within the entire transcribed gene that are thought to modify translation, including evolutionarily conserved sequences within both the 5′ and 3′ ‘untranslated’ regions (Bicknell et al., 2012; Kozak, 2006, 2007; Matsui et al., 2007; Wethmar et al., 2010). However, evidence for some of these hypothetical mechanisms in eukaryotes has been challenged (Kozak, 2006, 2007), and more random, or stochastic, mechanisms of 5′ evolution have been suggested (Chen et al., 2011; Lynch et al., 2005; Reuter et al., 2008). It is well established that the nucleotides immediately surrounding the START codon (AUG) are not random, and certain combinations are statistically frequent in a wide range of genes and species (Cavener and Ray, 1991; Kozak, 1987a). These Kozak consensus sequences have been linked to the strength of initiation of translation (Kozak, 1987b; Wethmar et al., 2010).
Abbreviations: BLAST, Basic local alignment search tool; CDS, Coding DNA sequence; mORF, Main open reading frame; uORF, Upstream open reading frame.
⁎ Corresponding author at: Department of Physiology and Biophysics, Dalhousie University, P.O. BOX 15000, Halifax, Nova Scotia B3H 4R2, Canada. Tel.: +1 902 494 1302; fax: +1 902 494 2050. E-mail address: [email protected] (A.S. French).
0378-1119/$ – see front matter © 2014 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.gene.2014.01.079
In addition to the mORF, many transcribed mRNA sequences contain additional complete open reading frames in either the 5′ or 3′ regions or both, that were initially thought not to be translated. Null, or stochastic models have been suggested to account for the existence of upstream, or 5′, open reading frames (uORFs), with some support from experimental data (Chen et al., 2011; Lynch et al., 2005; Reuter et al., 2008). Since the translational mechanism starts near the 5′ end of the sequence, it must encounter any uORFs before the mORF. A number of mechanisms have been proposed by which uORFs may affect translation of the mORF (Bicknell et al., 2012), generally suggesting inhibitory functions, and some of these could be involved in human diseases (Wethmar et al., 2010). One line of evidence that such uORFs are selected by evolution is based on the idea that their numbers are lower than would be expected by purely random positioning of nucleotides in the 5′ region, even for the same probabilistic distributions of nucleotides (Iacono et al., 2005), or that their numbers are differentially determined by their protein function (Bicknell et al., 2012). Genomic studies of arachnids have lagged behind other arthropods, with complete genomes available for only two ticks and one mite species, and none of the true spiders. The wandering spider, Cupiennius salei, is a widely used experimental species, particularly for physiological (Barth, 2002; French et al., 2002), evolutionary and developmental studies (McGregor et al., 2008; Wolff and Hilbrant, 2011). We used next generation sequencing to assemble a set of mRNA sequences from the transcriptome of the leg hypodermis of adult C. salei, a structure rich in sensory neurons, but also containing glia, muscle and
other tissues. In addition to the protein coding sequence (mORF), each gene was assembled to exhaustion in the 5′ direction, to catalog the numbers, sizes and Kozak consensus regions of all uORFs. Based on these data we also conducted several simulations to test the hypothesis that the numbers or properties of uORFs in this species are different from those that would be expected from random collections of nucleotides in the 5′ regions. We found no evidence for such differences from random.
2. Materials and methods
2.1. RNA preparation, sequencing and assembly
Details of transcriptome preparation, sequencing and assembly have been described previously (French, 2012). Briefly, tropical wandering spiders, C. salei, were maintained in a laboratory colony at room temperature (22 ± 2 °C) and a 13:11 h light:dark cycle. Fifty-six legs from seven adult female sibling spiders were autotomized following a protocol approved by the Dalhousie University Committee on Laboratory Animals. Total RNA (8 μg) was extracted from the combined hypodermal tissues of legs using a Qiagen RNeasy plus mini kit and following the manufacturer's instructions. Separation of mRNA, construction of cDNA library and Illumina processing were performed by McGill University and Génome Québec Innovation Centre, Montréal, Québec. Initial cDNA reads (220 million pairs of 100 nucleotide length) were groomed by requiring at least 80 contiguous nucleotides with Phred score > 19 to give a final database of 110 million pairs of reads. A similar, but separate, procedure was used to construct a transcriptome of protocerebral (brain) tissue pooled from two adult female spiders, using a Qiagen RNeasy plus midi kit to extract 10 μg of total RNA. Initial sequences of interest were identified by searching the database at low stringency versus coding regions from published genomes of arachnids (Metaseiulus occidentalis, Ixodes scapularis, Rhipicephalus sanguineus) and the insect Drosophila melanogaster.
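The read grooming criterion described above (at least 80 contiguous nucleotides with Phred score > 19) can be illustrated with a short sketch. This is a minimal Python example and not the software used in the study; the function names, the Phred+33 quality encoding, and the rule that both reads of a pair must pass are assumptions made for illustration.

```python
def decode_phred33(qual_string):
    """Convert a FASTQ quality string to Phred scores (assumes Phred+33 encoding)."""
    return [ord(ch) - 33 for ch in qual_string]


def longest_good_run(quals, min_phred=20):
    """Length of the longest run of contiguous bases with Phred >= min_phred.

    For integer Phred values, a score > 19 is the same as a score >= 20.
    """
    best = run = 0
    for q in quals:
        run = run + 1 if q >= min_phred else 0
        best = max(best, run)
    return best


def keep_read(quals, min_run=80, min_phred=20):
    """True if the read has at least min_run contiguous high-quality bases."""
    return longest_good_run(quals, min_phred) >= min_run


def keep_pair(quals_1, quals_2):
    """Assumption: a read pair is kept only if both mates pass the filter."""
    return keep_read(quals_1) and keep_read(quals_2)
```

Under these assumptions, a 100-nucleotide read with a single low-quality base at position 80 would be rejected, because neither flanking run reaches 80 nucleotides.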
All putatively matching reads were compared to the non-redundant protein database using BLASTX (http://blast.ncbi.nlm.nih.gov). Initial target genes were primarily ion channels, chemical transmitter receptors and intracellular second messenger components, but all sequences that were reliably identified as protein coding by a BLASTX search (i.e. matching the same or similar protein in multiple species) were included for further analysis. Identified reads were extended by the transcriptome walking algorithm (French, 2012) using an initial minimum overlap of 60 nucleotides. Minimum overlap was increased, in steps of 10 up to a maximum of 90 nucleotides, when required to eliminate multiple possible extension pathways in a few cases of highly redundant sequence structure. Overlap was reduced in steps of 10 to a minimum of 40 nucleotides when no overlaps were found in a complete pass through the database. Walking was always continued to exhaustion in the 5′ direction. In the 3′ direction, walking was always continued to include the complete protein coding sequence (mORF or CDS) and the STOP codon. In many cases the complete sequence was continued to the poly-A tail. Further analysis of the Kozak consensus region around the main START codon, and any shorter upstream open reading frames (uORFs), was performed by visual inspection and by custom-written software. All data analysis and simulation programming was performed in the C++ language using Microsoft Visual Studio. Multiple threaded code was used for searching and for walking, using up to 16 parallel processing cores.
2.2. Statistical analysis
Tests for significant differences in means between pairs of distributions of fitted parameters were made using the non-parametric Mann–Whitney test. Statistical significance in the figures is indicated by asterisks: *p ≤ 0.05, **p ≤ 0.01, ***p ≤ 0.001. Tests for non-random probability of nucleotides in defined positions used both χ-squared and G-tests.
3. Results
3.1. General properties of the transcribed genes
A total of 151 transcribed mRNA sequences were used for further analysis. All sequences are available in the GenBank database as GAKT01000001–GAKT01000151 (http://www.ncbi.nlm.nih.gov). All gene identifications are putative because translated protein functions have not yet been demonstrated. Main open reading frame (mORF or CDS) length varied from 336 (translation factor Sui1) to 25,341 (twitchin) nucleotides with a mean length of 2575 (Fig. 1). Putative protein functions were divided arbitrarily into five major groups to illustrate the range of genes characterized. The groups included in the initial search, such as ion channels and neurotransmitter receptors, were well represented, but others, including major housekeeping genes, constituted more than half the total count (Fig. 1). We also recorded the phyla of the closest matching proteins using the standard BLAST search algorithm with default settings. Arthropoda (118/151)
Fig. 1. General properties of the 151 transcribed sequences used in the study. (A) Distribution of mORF (CDS) lengths. The longest two sequences (twitchin and ryanodine receptor) are indicated, as well as the mean length (2575 nucleotides). (B) Putative protein functions were pooled into five major categories to illustrate the range of putative gene functions. (C) mRNA abundance was estimated by counting the number of reads that matched the main reading frame. Values are shown relative to the abundance of actin. The only gene with greater abundance than actin was eukaryotic initiation factor (EIF).
and Chordata (16/151) were the largest representatives. The remaining 17 included Mollusca, Hemichordata and Nematoda. Relative abundance of transcribed mRNA was estimated by searching the groomed data for matches to the mORF of each gene (Fig. 1). Between 40 and 230 million reads were tested for each gene, depending on the abundance. At least 90/100 identical nucleotide matches were required to score each read as derived from the gene. Total counts were normalized for mORF length, and then expressed as abundance relative to the putative actin gene. The number of reads matching the complete actin transcript of 1733 nucleotides was 159,746 in a count of 40 million groomed reads, so actin accounted for ~0.4% of all reads. Abundance ratios varied from 0.000203 (a glutamate-activated chloride channel) to 1.12 (eukaryotic initiation factor, the only gene more abundant than actin) with most relative abundance values in the range 0.001–0.01.
3.2. Validation of low abundance sequences
To verify the accuracy of low abundance assemblies we searched a separate database of 89 million paired reads obtained by Illumina sequencing of mRNA from C. salei brain (protocerebrum) tissue. These data were from a different pair of animals, and processed completely separately. Of the twelve lowest abundance hypodermis genes (0.000203–0.000713 relative to actin), ten had matching genes in the brain data, including the Kozak and 5′ regions, with an abundance range in the brain of 0.000112–0.000998. The other two genes, a putative TRP channel and a putative kinase, were completely absent from the brain data. The abundance estimation process used for all genes in the hypodermis data also provided a semi-independent check on assembly by ensuring that the matching reads to each assembled sequence overlapped as expected.
3.3. The Kozak consensus region
We explored all sequences to exhaustion at the 5′ end.
The numbers of nucleotides found before the START (AUG) codon varied from 7 to 1221, mean = 259 (Fig. 2, inset). To characterize the Kozak consensus region (Kozak, 1987a) we counted nucleotide frequencies at positions from six before the AUG (positions −6 to −1) to one following the AUG (position +4). Both χ-squared and G-tests (Rayson et al., 2004) were used to test for non-random distributions of nucleotides at each of the seven positions. The most consistent feature was ‘A’ at position −3 in 87% of sequences (p < 0.0001, Fig. 2). Other strongly significant features were ‘A’ or ‘G’ at position +4 in 60% (p < 0.0001, indicated as ‘R’ in Fig. 2), and ‘U’ at position −6 in 44% (p < 0.0001). Weaker deviations from randomness were seen at positions −4 (p < 0.001) and −1 (p < 0.01), with more ‘A’ than expected in both cases.
3.4. Upstream open reading frames
Upstream open reading frames (uORFs) were defined as complete coding regions between START and STOP codons in-frame or out-of-frame of the main reading frame (mORF) (Iacono et al., 2005). Of the 151 sequences, 89 had in-frame uORFs. Numbers of uORFs before the mORF normally varied from one to four, but one sequence had six in-frame uORFs. Since each sequence can theoretically be translated from three different frame origins, and the mechanism of this decision is not yet clear (Kozak, 2007), we also inspected each sequence for out-of-frame uORFs, and found that 73 and 75 sequences contained uORFs starting one and two nucleotides early, respectively. Total numbers of these out-of-frame uORFs were similar to those in-frame. Again, most sequences had 1–4 uORFs, but one had five and another had eight. Combining in-frame and out-of-frame numbers, only 28 of the 151 transcribed sequences lacked any uORFs. We recorded the numbers and lengths of all the uORFs in all possible frame origins for the first four positions preceding the mORF, as well as the lengths of the mORFs. We also drew up a table of hypothetical Kozak consensus initiation strengths based on the probabilities of nucleotide occurrence at the most significant two positions, −3 and +4 (Table 1). This table was used to score the strength of the Kozak consensus regions of the uORFs, as well as all the mORFs (Fig. 3). Probability of finding at least one uORF before the mORF was slightly higher in-frame (59%) than in either of the two possible out-of-frame configurations (49% each).
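The uORF definition used here (a complete START-to-STOP reading frame upstream of the mORF, in any of the three frames) can be illustrated with a short sketch. This is a minimal Python example and not the custom C++ software used in the study. It assumes that the mORF AUG begins immediately after the supplied 5′ sequence, that a uORF must terminate within the 5′ region, and that a second AUG inside an open uORF in the same frame is not counted separately; the function name and return format are hypothetical.

```python
STOPS = {"UAA", "UAG", "UGA"}


def find_uorfs(five_prime):
    """Return (offset, start, end) for each complete uORF in a 5' sequence.

    offset is the frame relative to the mORF AUG, which is assumed to begin
    immediately after five_prime: 0 = in-frame, 1 or 2 = the uORF starts one
    or two nucleotides early. start and end are 0-based, end-exclusive.
    """
    n = len(five_prime)
    found = []
    for offset in range(3):
        # Codon boundaries in this frame are counted back from the mORF START.
        first = (n - offset) % 3
        start = None
        for i in range(first, n - 2, 3):
            codon = five_prime[i:i + 3]
            if start is None and codon == "AUG":
                start = i
            elif start is not None and codon in STOPS:
                found.append((offset, start, i + 3))
                start = None
    return found
```

For example, `find_uorfs("GGAUGAAAUAAC")` returns `[(1, 2, 11)]`: a single 9-nucleotide uORF (AUG AAA UAA) whose frame is shifted by one nucleotide relative to an mORF that starts right after the 12-nucleotide 5′ region.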
While all parameters varied considerably, their mean values were quite consistent, with 1–4 uORFs in each case, uORFs of about 40 nucleotides mean length, separated by 50–150 nucleotides when there was more than one. Kozak strengths for uORFs were always significantly lower than those of the mORF for at least the first two positions before the mORF. Earlier positions also had lower strength, but small numbers made it difficult to estimate significance. Both Kozak consensus sequences and uORFs have been linked to the effectiveness of translation to protein (Kozak, 1978, 1999; Wethmar et al., 2010). Data regarding protein concentration were unavailable, but we were able to estimate the abundance of each mRNA in the tissue from the number of reads obtained by the Illumina process (Fig. 1). Combining these data with the uORF and mORF data we produced a correlation matrix (Table 2) to search for relationships between the various parameters. Numbers of uORFs in all possible starting frames were positively correlated with total lengths of the 5′ regions before the mORFs, and with
Fig. 2. Nucleotide densities around the START codon (AUG) of the mORF. Letter ‘R’ indicates ‘A’ or ‘G’. Asterisks indicate statistically significant deviations from random, as calculated by χ-squared or G-tests. 87% of sequences had ‘A’ in the −3 position, but ‘A’ was also significantly over-represented in the −4, −2 and −1 positions. ‘A’ or ‘G’ were in the +4 position in 60% of sequences. ‘U’ was statistically more common in the −6 position. Inset shows the distribution of 5′ lengths before the START codon.
Table 1 Estimated initiation strengths for Cupiennius salei Kozak consensus region.
Position −3   A A A A C C C C G G G G U U U U
Position +4   A C G U A C G U A C G U A C G U
Strength      3 2 3 2 2 1 2 1 2 1 2 1 2 1 2 1
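The strengths in Table 1 follow a simple additive pattern: a baseline of 1, plus 1 for ‘A’ at position −3, plus 1 for ‘A’ or ‘G’ at position +4. The sketch below is a minimal Python illustration of that pattern, not the scoring code used in the study. The function name is hypothetical, and treating a position that falls outside the sequence as non-matching is an assumption.

```python
def kozak_strength(seq, aug_index):
    """Score the AUG starting at aug_index according to the pattern of Table 1.

    With the A of AUG numbered +1 and the preceding base -1, position -3 is
    seq[aug_index - 3] and position +4 is seq[aug_index + 3].
    Assumption: positions outside the sequence count as non-matching.
    """
    minus3 = seq[aug_index - 3] if aug_index >= 3 else None
    plus4 = seq[aug_index + 3] if aug_index + 3 < len(seq) else None
    return 1 + int(minus3 == "A") + int(plus4 in ("A", "G"))
```

Scoring all 16 combinations of the two positions with this function reproduces the strength row of Table 1, which is a useful consistency check on the reconstruction.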
Fig. 3. Statistical description of the uORFs in Cupiennius mRNA. Data are shown for the in-frame uORFs, and the two possible out-of-frame uORFs. The upper diagram gives a key to the values. No uORFs were detected in 28 of the 151 sequences (18%).
each other. Mean length of the 5′ region for sequences with uORFs (mean = 293, n = 123) was also significantly greater than mean 5′ length (mean = 112, n = 28) for sequences without any uORFs (p < 0.001 by t-test or Mann–Whitney). Kozak consensus strengths of mORFs were not significantly correlated with any other measured parameters. Relative abundance of mRNA sequences was significantly negatively correlated with the numbers of uORFs in the same frame as the mORF, and with total uORFs (Table 2). To explore this further we plotted the average values of the logarithms of abundances versus the total numbers of uORFs (Fig. 4). The regression line had a slope of −0.05 log10 units per uORF, indicating that each additional uORF, regardless of frame alignment, is associated with a further 11% reduction in mRNA abundance (10^−0.05 ≈ 0.89).
3.5. Simulations of the 5′ region
Two approaches were used to compare the experimental data to randomly originating 5′ regions. In the first, we counted the total numbers of uORFs in all frames for each gene, and then randomly shuffled the nucleotides in each 5′ sequence before re-discovering uORFs and counting, as before. In each case we performed 100,000 randomly positioned swaps of nucleotide pairs. uORF counts were pooled into groups around a series of 5′ lengths (Fig. 5). We also computed the average uORF length for each condition, giving 39 and 43 nucleotides respectively for experimental and shuffled data. Similarly, we computed the Kozak consensus strengths of uORFs, giving 1.71 and 1.69 respectively.
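The shuffling step of the first approach can be sketched as repeated swaps of randomly chosen nucleotide pairs, which randomizes positions while leaving the nucleotide composition of each 5′ sequence unchanged. This is a minimal Python illustration and not the study's C++ implementation; the function name, the optional seeded generator, and the choice to allow a position to be swapped with itself are assumptions.

```python
import random


def shuffle_by_swaps(seq, n_swaps=100_000, rng=None):
    """Shuffle a sequence by swapping n_swaps randomly chosen pairs of positions.

    The nucleotide composition is preserved; only positions change.
    rng is an optional random.Random instance for reproducible runs.
    """
    rng = rng or random.Random()
    bases = list(seq)
    if len(bases) < 2:
        return seq
    for _ in range(n_swaps):
        i = rng.randrange(len(bases))
        j = rng.randrange(len(bases))
        bases[i], bases[j] = bases[j], bases[i]
    return "".join(bases)
```

Because composition is preserved, any difference in uORF counts between original and shuffled sequences would reflect nucleotide order and not base frequencies.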
The second approach used completely simulated 5′ sequences, repeated 1000 times at each length to give average parameters for uORFs, using a series of nucleotide lengths comparable to the transcriptome data. Simulations were performed using uniformly distributed random nucleotides (25% probability of any nucleotide in any position), and also with the same average distributions seen in the experimental 5′ regions (‘A’ 28.6%, ‘C’ 19.7%, ‘G’ 21.1%, ‘U’ 30.5%). Simulated sequences were then examined for uORFs, and the numbers (in-frame and out-of-frame), lengths, and Kozak consensus strengths were calculated, using Table 1. Simulated and experimental values for the total numbers of uORFs in the 5′ regions were closely comparable (Fig. 5). For uniform distributions, lengths of simulated uORFs had a range of 30–64 nucleotides with a mean of 53 nucleotides, and mean simulated Kozak consensus strength was 1.70. For the experimental distributions, lengths of simulated uORFs had a range of 27–49 nucleotides with a mean of 43 nucleotides, and mean simulated Kozak consensus strength was 1.77. Numbers of uORFs from the simulations were plotted as joined lines with the experimental and scrambled experimental data (Fig. 5).
4. Discussion
4.1. Kozak consensus sequence
The optimal configuration for vertebrate starting sequences has ‘A’ or ‘G’ at the −3 position and ‘G’ at the +4 position (Kozak, 1987a,
Table 2 Correlation matrix (values of r) for Kozak and ORF parameters.
Parameter    5′ length   mORF     Kozak    AUG        XAUG       XXAUG      Total
XAUG         0.671***    0.028    −0.038   0.49***
XXAUG        0.583***    0.01     0.048    0.487***   0.329***
Total        0.836***    −0.009   0.099    0.837***   0.788***   0.742***
Abundance    −0.003      −0.134   0.078    −0.214**   −0.138     −0.055     −0.175*
mORF = main open reading frame. uORF = upstream open reading frame. Kozak = Kozak consensus sequence initiation strength (from Table 1) of the mORF. AUG, XAUG, XXAUG, Total = numbers of uORFs in each frame, and their sum. Abundance = log10 (abundance/actin abundance). Statistical significance: * p ≤ 0.05, ** p ≤ 0.01, *** p ≤ 0.001.
Fig. 4. mRNA abundance was negatively correlated with the total number of uORFs. Abundance was estimated by counting the number of reads matching the mORF, and is shown as the logarithm of the abundance relative to actin (mean ± s.e.). Total upstream reading frames (uORFs) included those in all three frames relative to the mORF. Values for 5–6, 7–8, 9–10, and 11–16 uORFs were combined to give counts of at least 5 values in each case.
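Because abundance is expressed as log10 of the ratio to actin, a regression slope on this scale converts to a multiplicative change per additional uORF. The short Python sketch below shows the conversion for the slope of −0.05 reported in the text; the function name is hypothetical.

```python
def percent_change_per_uorf(log10_slope):
    """Percentage change in abundance per additional uORF for a log10-scale slope."""
    return 100.0 * (10.0 ** log10_slope - 1.0)


# A slope of -0.05 gives 10**-0.05, about 0.891, i.e. roughly an 11% reduction.
change = percent_change_per_uorf(-0.05)
```

On the same scale, ten additional uORFs would correspond to a factor of 10^−0.5, or roughly a threefold lower abundance, if the linear trend held across the whole range.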
1987b; Wethmar et al., 2010). Mild variations on this pattern have been listed for Drosophila (Cavener, 1987), and a range of other invertebrates, plants and protozoa (Cavener and Ray, 1991). The increased representation of ‘A’ at −3 and ‘G’ at +4 in Cupiennius agrees with these other groups, although the probability of finding ‘A’ at both positions was higher than suggested for some groups, and particularly strong at position −3. The significantly increased probability of ‘U’ at position −6 has not been reported in vertebrates or other invertebrates (Cavener and Ray, 1991; Kozak, 1987a). It would be interesting to learn if this is consistent in other spiders or arachnids generally.
4.2. Upstream open reading frames
Detailed frequencies of uORFs in mRNA sequences are not yet widely established, with estimates of 10–50% in the case of humans (Calvo et al., 2009; Iacono et al., 2005; Matsui et al., 2007). One difficulty in making such estimates in large automated searches is the correct
identification of the main reading frame (mORF) and ensuring that the 5′ region is complete (Bicknell et al., 2012; Casadei et al., 2003; Iacono et al., 2005). Here, we assembled each sequence to exhaustion in the 5′ direction, and compared the translated protein from the mORF to published proteins from other species, in order to ensure that we identified the correct START codon of the mORF. The total numbers of uORFs that we found were higher than previous estimates (Calvo et al., 2009; Iacono et al., 2005; Matsui et al., 2007) but the mean lengths of uORFs were similar to values reported for human and mouse (Calvo et al., 2009). There is evidence that uORFs influence translation of the mORF (Calvo et al., 2009; Wethmar et al., 2010), and a range of mechanisms have been proposed, including interference with the initiation of ribosomal function, exhaustion of essential initiation factors, premature mRNA degradation, and actual translation of short peptides that in turn modify translation efficiency (Bicknell et al., 2012; Iacono et al., 2005; Kozak, 2007; Matsui et al., 2007; Merianda et al., 2013; Vuppalanchi et al., 2012; Wethmar et al., 2010). However, the level of evolutionary conservation of the lengths of 5′ regions, and of the numbers, sizes or Kozak consensus strengths of uORFs, is not yet clear. Automated analysis of large sets of human and mouse genes suggested that numbers of uORFs were significantly lower than would be expected by chance, and that in-frame uORFs were particularly strongly suppressed, indicating that uORFs are deterministically controlled (Iacono et al., 2005). Comparison to chance in that study was made by randomly shuffling the positions of nucleotides within the 5′ region. Our data for the numbers, sizes and Kozak consensus strengths of uORFs were closely approximated by the shuffled and random simulations (Figs. 3 and 5), and we actually found more uORFs in-frame with the mORF than in either of the out-of-frame configurations (Fig. 3). While random shuffling did increase the numbers of uORFs, the shuffled data and both simulations were within one standard deviation of the experimental data, indicating that the observed numbers of uORFs were not different from chance. Strongly similar Kozak consensus values of uORFs for experimental, shuffled and simulated data support the chance hypothesis. In contrast, Kozak consensus values of the mORFs were significantly stronger than for the corresponding uORFs (Fig. 3), which supports the importance of the consensus sequence for normal protein translation. If uORFs were being translated as part of some regulatory process, it might be expected that their Kozak consensus strength would be greater than chance. This was not found, but might not be expected if only a small subset of uORFs were being translated. We also found that mRNA abundance was negatively correlated with the number of uORFs (Fig. 4), but not with the length of the 5′ region. While correlation does not mean causality, this supports the idea that there are specific inhibitory effects of uORFs on transcription, nuclear export, or more rapid degradation of mRNA containing uORFs, as suggested previously (Bicknell et al., 2012; Calvo et al., 2009; Matsui et al., 2007).
4.3. Accuracy and generality of the data
Fig. 5. Four separate tests of the relationship between numbers of uORFs and 5′ upstream region length. Plotted points show pooled values around each 5′ length (mean ± s.d.): open circles are original experimental data, and filled squares the same data after random nucleotide shuffling. Superimposed lines join mean values obtained by two random simulations of the same lengths of the 5′ region, using uniformly distributed nucleotides, and using nucleotide frequencies of the experimental data (A:C:G:U = 28.6%:19.7%:21.1%:30.5%).
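The two random simulations summarized in Fig. 5 can be sketched as drawing each nucleotide independently from a fixed frequency table and counting complete uORFs in all three frames. This is a minimal Python illustration and not the study's C++ code. The nucleotide frequencies and the 1000 repeats per length come from the text; the function names, the seeding, and the rule that a second AUG inside an open uORF of the same frame is not counted separately are assumptions.

```python
import random

UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}
EXPERIMENTAL = {"A": 0.286, "C": 0.197, "G": 0.211, "U": 0.305}
STOPS = {"UAA", "UAG", "UGA"}


def random_five_prime(length, freqs, rng):
    """Draw a random sequence with independent positions and the given frequencies."""
    bases = list(freqs)
    weights = [freqs[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))


def count_uorfs(seq):
    """Count complete START-to-STOP reading frames over all three frames.

    Frames are numbered from the sequence start here; the total over the
    three frames does not depend on how the frames are labelled.
    """
    total = 0
    for frame in range(3):
        in_orf = False
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if not in_orf and codon == "AUG":
                in_orf = True
            elif in_orf and codon in STOPS:
                total += 1
                in_orf = False
    return total


def mean_uorfs(length, freqs, repeats=1000, seed=0):
    """Average uORF count over `repeats` simulated 5' regions of one length."""
    rng = random.Random(seed)
    counts = (count_uorfs(random_five_prime(length, freqs, rng)) for _ in range(repeats))
    return sum(counts) / repeats
```

Calling `mean_uorfs` over a series of lengths with `UNIFORM` and with `EXPERIMENTAL` would give two curves of the kind drawn as joined lines in Fig. 5.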
Next generation sequencing methods can provide increasingly accurate and exhaustive sequence data, but assembly of the relatively short nucleotide fragments to complete genes presents challenges that have not yet been completely overcome by automated assembly (Earl et al., 2011; Zhao et al., 2011). This limitation was illustrated by our finding that up to 90 nucleotide overlaps were required in some cases to prevent incorrect assembly. Our approach required extensive human intervention, which limited the numbers of genes processed, but we believe that it ensured greater accuracy, particularly in the 5′ regions being studied, which were usually less well numerically represented in the original data. A disadvantage of our approach was the need to search for initial starting sequences. However, we used such low-stringency initial searches of untranslated nucleotides that each search produced a
majority of reads from genes that were unrelated to the search. This gave an approximately random selection of genes from the initial data, as illustrated in Fig. 1. Nevertheless, the numbers and types of genes that were analyzed here cannot be considered to be a truly random, or representative, sample of all the genes in the Cupiennius hypodermis transcriptome, and the following conclusions must be treated with appropriate caution.
5. Conclusions
The spider mRNA sequences contained relatively large numbers of in-frame and out-of-frame uORFs (82% contained at least one), but their numbers, sizes, and Kozak consensus strengths could be reproduced by shuffled experimental data, or random simulations. This does not challenge models of translational regulation by specific uORFs, and we did not consider any detailed interactions between specific uORFs, or their translated peptides, and the translational machinery. Nevertheless, our data are more consistent with the null hypothesis (Lynch et al., 2005; Reuter et al., 2008) that most uORFs arise from random mutations, rather than specific selection of uORFs.
Conflict of interest
The authors declare that they have no competing interests.
Acknowledgments
This work was supported by the Canadian Institutes of Health Research. Sequencing was performed by The McGill University and Génome Québec Innovation Centre.
References
Barth, F.G., 2002. A spider's world. Senses and Behavior. Springer-Verlag, Berlin Heidelberg, New York.
Bicknell, A.A., Cenik, C., Chua, H.N., Roth, F.P., Moore, M.J., 2012. Introns in UTRs: why we should stop ignoring them. Bioessays 34, 1025–1034.
Calvo, S.E., Pagliarini, D.J., Mootha, V.K., 2009. Upstream open reading frames cause widespread reduction of protein expression and are polymorphic among humans. Proc. Natl. Acad. Sci. U. S. A. 106, 7507–7512.
Casadei, R., Strippoli, P., D'Addabbo, P., Canaider, S., Lenzi, L., Vitale, L., Giannone, S., Frabetti, F., Facchin, F., Carinci, P., Zannotti, M., 2003. mRNA 5′ region sequence incompleteness: a potential source of systematic errors in translation initiation codon assignment in human mRNAs. Gene 321, 185–193.
Cavener, D.R., 1987. Comparison of the consensus sequence flanking translational start sites in Drosophila and vertebrates. Nucleic Acids Res. 15, 1353–1361.
Cavener, D.R., Ray, S.C., 1991. Eukaryotic start and stop translation sites. Nucleic Acids Res. 19, 3185–3192.
Chen, C.H., Lin, H.Y., Pan, C.L., Chen, F.C., 2011. The genomic features that affect the lengths of 5′ untranslated regions in multicellular eukaryotes. BMC Bioinforma. 12, 53–60.
Earl, D., Bradnam, K., St John, J., Darling, A., Lin, D., et al., 2011. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 21, 2224–2241.
French, A.S., 2012. Transcriptome walking: a laboratory-oriented GUI-based approach to mRNA identification from deep-sequenced data. BMC Res. Notes 5, 673–680.
French, A.S., Torkkeli, P.H., Seyfarth, E.-A., 2002. From stress and strain to spikes: mechanotransduction in spider slit sensilla. J. Comp. Physiol. A. 188, 739–752.
Iacono, M., Mignone, F., Pesole, G., 2005. uAUG and uORFs in human and rodent 5′ untranslated mRNAs. Gene 349, 97–105.
Kozak, M., 1978. How do eucaryotic ribosomes select initiation regions in messenger RNA? Cell 15, 1109–1123.
Kozak, M., 1987a. An analysis of 5′-noncoding sequences from 699 vertebrate messenger RNAs. Nucleic Acids Res. 15, 8125–8148.
Kozak, M., 1987b. At least six nucleotides preceding the AUG initiator codon enhance translation in mammalian cells. J. Mol. Biol. 196, 947–950.
Kozak, M., 1999. Initiation of translation in prokaryotes and eukaryotes. Gene 234, 187–208.
Kozak, M., 2006. Rethinking some mechanisms invoked to explain translational regulation in eukaryotes. Gene 382, 1–11.
Kozak, M., 2007. Some thoughts about translational regulation: forward and backward glances. J. Cell. Biochem. 102, 280–290.
Lynch, M., Scofield, D.G., Hong, X., 2005. The evolution of transcription-initiation sites. Mol. Biol. Evol. 22, 1137–1146.
Matsui, M., Yachie, N., Okada, Y., Saito, R., Tomita, M., 2007. Bioinformatic analysis of post-transcriptional regulation by uORF in human and mouse. FEBS Lett. 581, 4184–4188.
McGregor, A.P., Hilbrant, M., Pechmann, M., Schwager, E.E., Prpic, N.-M., Damen, W.G.M., 2008. Cupiennius salei and Achaearanea tepidariorum: spider models for investigating evolution and development. Bioessays 30, 487–498.
Merianda, T.T., Gomes, C., Yoo, S., Vuppalanchi, D., Twiss, J.L., 2013. Axonal localization of neuritin/CPG15 mRNA in neuronal populations through distinct 5′ and 3′ UTR elements. J. Neurosci. 33, 13735–13742.
Rayson, P., Berridge, D., Francis, B., 2004. Extending the Cochran rule for the comparison of word frequencies between corpora. 7th International Conference on Statistical Analysis of Textual Data, 2, pp. 926–936.
Reuter, M., Engelstadter, J., Fontanillas, P., Hurst, L.D., 2008. A test of the null model for 5′ UTR evolution based on GC content. Mol. Biol. Evol. 25, 801–804.
Vuppalanchi, D., et al., 2012. Lysophosphatidic acid differentially regulates axonal mRNA translation through 5′ UTR elements. Mol. Cell. Neurosci. 50, 136–146.
Wethmar, K., Smink, J.J., Leutz, A., 2010. Upstream open reading frames: molecular switches in (patho)physiology. Bioessays 32, 885–893.
Wolff, C., Hilbrant, M., 2011. The embryonic development of the Central American wandering spider Cupiennius salei. Front. Zool. 8, 15–48.
Zhao, Q.Y., Wang, Y., Kong, Y.M., Luo, D., Li, X., Hao, P., 2011. Optimizing de novo transcriptome assembly from short-read RNA-Seq data: a comparative study. BMC Bioinforma. 12 (Suppl. 14), S2.