Accepted Manuscript

Title: Menzerath-Altmann law in mammalian exons reflects the dynamics of gene structure evolution

Author: Christoforos Nikolaou

PII: S1476-9271(14)00097-8
DOI: http://dx.doi.org/doi:10.1016/j.compbiolchem.2014.08.018
Reference: CBAC 6355

To appear in: Computational Biology and Chemistry

Please cite this article as: Nikolaou, Christoforos, Menzerath-Altmann law in mammalian exons reflects the dynamics of gene structure evolution. Computational Biology and Chemistry http://dx.doi.org/10.1016/j.compbiolchem.2014.08.018

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Menzerath-Altmann law in mammalian exons reflects the dynamics of gene structure evolution


Christoforos Nikolaou 1

Computational Genomics Group, Department of Biology, University of Crete, 71409, Herakleion, Crete, Greece

Abstract

Genomic sequences exhibit self-organization properties at various hierarchical levels. One such level is the gene structure of higher eukaryotes, with its complex exon/intron arrangement. Exon sizes and exon numbers in genes have been shown to conform to a law derived from statistical linguistics and formulated by Menzerath and Altmann, according to which the mean size of the constituents of an entity is inversely related to the number of these constituents. We herein perform a detailed analysis of this property in the complete exon set of the mouse genome, in relation to the sequence conservation of each exon and the transcriptional complexity of each gene locus. We show that extensive linear fits, representative of accordance with the Menzerath-Altmann law, are restricted to a particular subset of genes that are formed by exons under low or intermediate sequence constraints and have a small number of alternative transcripts. Based on this observation we propose the hypothesis that the Menzerath-Altmann law in mammalian genes is predominantly due to genes that are more versatile in function and thus more prone to undergo changes in their structure. To this end we demonstrate one test case where gene categories of different functionality also show differences in the extent of conformity to the Menzerath-Altmann law.

1 Correspondence to: [email protected]

Introduction

Self-organization in biological systems has been a widely discussed topic since the seminal work of Eigen and Schuster (Eigen and Schuster, 1977). At the genomic level, reflections of self-organizing principles have been reported in the distributions of nucleotides (Buldyrev et al., 1993; Li and Kaneko, 1992; Peng et al., 1992) and of coding versus non-coding segments (Almirantis and Provata, 2001; Mantegna et al., 1995; Provata and Almirantis, 1997). The recent advent of next-generation sequencing technology and the development of methodologies to interrogate the three-dimensional structure of chromosomes within the eukaryotic nucleus have provided additional experimental evidence on the large-scale organization of eukaryotic genomes (Dixon et al., 2012). Interactions between chromosomes, the structural constituents of the genomes, may be seen in this sense as forming a so-called “fractal globule” (Lieberman-Aiden et al., 2009).

At the linear level, there have been attempts to discern the dynamics behind the organization of complete genomes in chromosomes through the lens of the statistical relationship described by the law of Menzerath and Altmann (Altmann, 1980), derived from statistical linguistics. According to this law, in structured systems, the number of the constituents of an entity follows an inverse relationship with the mean constituent size. The law of Menzerath-Altmann (henceforth "MA law") was initially postulated by Menzerath for the structure of natural languages, where words may be seen as the entities and the syllables or phonemes as their constituents. It was later formalized by Altmann in the following general equation:

$$f(x) = \alpha x^{\beta} e^{-\gamma x}$$

where f(x) is the mean size of the constituent (e.g. mean length of syllables), x is the size of the entity (e.g. the length of the word in number of syllables) and α, β, γ are parameters. The law has been shown to hold for a great number of natural languages (Altmann, 1980) as well as for musical texts (Boroda and Altmann, 1991). Ferrer-i-Cancho and Forns (Ferrer-i-Cancho and Forns, 2010) were among the first to apply this relationship to genomic sequences, in an attempt to link the number and mean size of eukaryotic chromosomes with the total size of the genome. Their approach was later criticized by Solé (Solé, 2010) under the main argument that such an inverse relationship should be expected from the very definition of genomes and chromosomes. At the same time, the extent of non-coding space in the great majority of eukaryotic genomes renders the biological interpretation of the relationship between chromosomes and total genome size, in terms of linguistic complexity, rather problematic.
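The role of the parameters can be illustrated with a short numerical sketch. It assumes the commonly quoted form of Altmann's formula, f(x) = α x^β e^(−γx); the function names and the parameter values are hypothetical and serve only as an illustration. Taking logarithms gives log f(x) = log α + β log x − γx, so when γ = 0 the law reduces to a straight line of slope β in a double-logarithmic plot, which is the kind of linear fit examined later in the text.

```python
import math

def ma_law(x, alpha, beta, gamma):
    """Mean constituent size under the assumed form alpha * x**beta * exp(-gamma * x)."""
    return alpha * x ** beta * math.exp(-gamma * x)

def loglog_slope(x1, x2, alpha, beta, gamma):
    """Slope of log10 f(x) against log10 x between two entity sizes x1 and x2."""
    dy = math.log10(ma_law(x2, alpha, beta, gamma)) - math.log10(ma_law(x1, alpha, beta, gamma))
    dx = math.log10(x2) - math.log10(x1)
    return dy / dx

# Hypothetical parameter values, chosen only for illustration.
ALPHA, BETA = 500.0, -0.3
print(loglog_slope(2, 20, ALPHA, BETA, 0.0))   # equals BETA when gamma is 0
print(loglog_slope(2, 20, ALPHA, BETA, 0.01))  # steeper than BETA when gamma > 0
```

A negative β with γ close to zero therefore corresponds to the inverse relationship between the number of constituents and their mean size.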

Linguistic analogies in the genomic context are, nonetheless, abundant, and a recent, very interesting work by Li (Li, 2011) re-introduced the MA law in genomics at the level of genes and exons. In the framework of information encoded by a genome, genes are much closer to the equivalent of words. They come in large numbers (varying from 6000 to 30000 or more depending on the organism) and they are predominantly made up of well-defined constituents, the exons, which may thus be seen as the equivalents of syllables. More importantly, exons form only a small part of the total length of an unprocessed gene, with large “chunks” of DNA, the introns, intervening between them. In this sense they appear to incorporate a significant part of genomic “dark matter”, which has also been suggested to occur in natural language (Ferrer-i-Cancho et al., 2012). Inside the nucleus of all eukaryotes, genes are transcribed into messenger RNA (mRNA) from quite lengthy pieces of genomic DNA that contain both exons and introns. It is from these “transcript” mRNA molecules that introns are removed and exons concatenated through the process of mRNA splicing. The final products are “mature” transcripts, which are entirely formed by the exonic constituents. In the aforementioned work, Li was able to demonstrate a clear inverse relationship between the number of exons and their mean size that was independent of the overall size of the corresponding genes (i.e. the entity). Thus, at the level of genes/exons there is little doubt that the MA law holds.

Using this work as a starting point, we wanted to investigate further the dynamics that may be shaping the distribution of exon numbers and sizes in eukaryotic genomes. Exon turnover, the evolutionary process by which new exons are “born” and old exons are rendered obsolete, is a dynamic process that lies at the core of genome evolution in eukaryotes (Rogozin et al., 2005). The birth of new exons occurs mostly through the duplication of existing ones, while processes of de novo exaptation have also been reported (Lev-Maor et al., 2003). At the same time, exons may become obsolete and gradually "erode" through the accumulation of deleterious mutations. Genes may thus "choose" from a vast, dynamic repertoire of exons in order to form viable, functional transcripts, and they do so in various ways through the process of alternative splicing, during which only a subset of the available exons within a gene's boundaries is incorporated in the mature mRNA molecule and subsequently translated into protein. Alternative splicing is the rule in higher eukaryotes (Keren et al., 2010) and is the most likely explanation for the apparent discrepancy between the number of encoded genes and the organismal complexity at the physiological level, the so-called G-value paradox (Taft et al., 2007). Considering the amount of transcriptional complexity in eukaryotic genomes, a number of questions arise regarding the observed distributions of exon numbers/mean sizes in conformity with the MA law.

To what extent is the accordance with the law independent of the conservation of exons, their inclusion rates in mature transcripts and the functional role of the encoded proteins? In this light, this work constitutes an attempt to further explore the concordance of gene/exon structures with the MA law given: a) the transcriptional complexity conferred by the process of alternative splicing, b) the existence of constraints at the level of exon sequence conservation and c) the extent of difference in conformity to the MA law by genes belonging to different functional categories. Last but not least, we will try to incorporate our observations in a plausible model for the evolution of exon structure in the genes of higher eukaryotes.

Sequences and Methods

Sequence Data

As the aforementioned study of Li (Li, 2011) was performed on the human genome, we chose to perform our analysis on a different mammalian genome, that of the mouse (Mus musculus). This provided us with the opportunity to confirm the accordance with the MA law for an organism other than Homo sapiens. All mouse exons (mm9) were downloaded from Ensembl under the Ensembl annotation scheme (Ensembl release 67). The complete dataset consisted of 689492 exons, belonging to 97639 different transcripts of a total of 34590 genes. These numbers refer only to protein-coding genes. A first summary of the data was sufficient to show the differences in the scales of the different entities. Exon sizes extended over 3 orders of magnitude (min=21, max=2692 nucleotides, nts), while genes and transcripts ranged from a few hundred to millions of nts. We used the standard genome annotation as provided by Ensembl to assign alternative transcripts to genes and to divide exons into constitutive (defined as the exons that form part of all possible alternative transcripts of the same gene) and alternative (those forming part of only a subset of a gene’s transcripts). In this sense our analysis was entirely based on transcripts instead of genes, which are for that matter to be seen as much more complex entities from which a great number of transcripts may originate (Clark et al., 2011). Exons were enumerated according to the order in which they appear in a given transcript, and exon counts were deduced from transcript instead of gene structures. We considered as constitutive every exon that was present in all possible transcripts of the same gene at 100% of its length and as alternative all exons not fulfilling this prerequisite. Exons from single-exon transcripts amounted to almost 2% of the dataset. We distinguished between initial, terminal and internal exons, the latter being exons that formed part of a transcript with a minimum of three exons. The results of the summary of the dataset characteristics are shown in Table 1.
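The classification rules described above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used for the analysis: the input layout (a dictionary of genes, each mapping transcript identifiers to ordered lists of exon identifiers) and the function names are assumptions. An exon identifier is taken to stand for the full exon coordinates, so that identity of identifiers corresponds to presence at 100% of the exon length.

```python
def classify_exons(transcripts_by_gene):
    """Label each exon 'constitutive' if it occurs in every transcript of its gene,
    otherwise 'alternative'.

    transcripts_by_gene: dict gene_id -> dict transcript_id -> ordered list of exon ids
    (hypothetical layout; an exon id stands for the full exon coordinates).
    """
    status = {}
    for gene, transcripts in transcripts_by_gene.items():
        exon_sets = [set(exons) for exons in transcripts.values()]
        if not exon_sets:
            continue  # gene without annotated transcripts
        shared = set.intersection(*exon_sets)
        for exons in exon_sets:
            for exon in exons:
                status[(gene, exon)] = "constitutive" if exon in shared else "alternative"
    return status


def exon_position(rank, n_exons):
    """Position of the exon with 1-based rank in a transcript of n_exons exons."""
    if n_exons == 1:
        return "single"
    if rank == 1:
        return "initial"
    if rank == n_exons:
        return "terminal"
    return "internal"  # only possible when n_exons >= 3


genes = {"geneA": {"tx1": ["e1", "e2", "e3"], "tx2": ["e1", "e3"]}}
print(classify_exons(genes))
print(exon_position(2, 3))
```

Under these rules every exon of a gene with a single annotated transcript is constitutive.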

Sequence Conservation

Sequence conservation was calculated for each exon in the dataset based on PhastCons values, a measure of sequence similarity on the basis of multiple genome alignments (Siepel et al., 2005). For the purposes of this study we obtained raw PhastCons scores for the entire mouse genome based on a whole-genome alignment of 30 vertebrates (Vert30), from primates to bony fish. The 30 species were: Human, Chimp, Orangutan, Rhesus, Mouse, Rat, Guinea Pig, Rabbit, Marmoset, Bushbaby, Tree Shrew, Shrew, Hedgehog, Dog, Cat, Horse, Cow, Armadillo, Elephant, Tenrec, Opossum, Platypus, Chicken, Lizard, Frog, Tetraodon, Fugu, Stickleback, Medaka and Zebrafish. PhastCons is a hidden Markov model-based method that estimates the probability that each nucleotide belongs to a conserved element, based on a multiple alignment. PhastCons values therefore range between 0 and 1. In order to obtain one value representing the sequence constraint for each genomic segment, we aggregated PhastCons values for each exon in the dataset and took the mean value per segment as its average conservation. The final dataset included ~98000 transcripts, each of which was linked to a defined number of exons according to the annotation, while each of these exons carried its size, mean conservation, constitutive/alternative status and its rank in the transcript as internal values.
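The aggregation step can be sketched as follows. The sketch assumes that the per-base scores of one chromosome are already held in memory as a Python list indexed by 0-based position, with `None` marking bases that carry no score; both this representation and the function name are hypothetical and say nothing about how the scores were actually stored or parsed.

```python
def mean_conservation(scores, start, end):
    """Mean per-base score over the half-open interval [start, end).

    scores: list of per-base values in [0, 1], or None where no score exists
    (hypothetical in-memory layout).
    Returns None when no base in the interval carries a score.
    """
    window = [s for s in scores[start:end] if s is not None]
    if not window:
        return None
    return sum(window) / len(window)


# Toy example: a six-base stretch with one unscored position.
chrom_scores = [0.0, 0.5, None, 1.0, 1.0, 0.2]
print(mean_conservation(chrom_scores, 1, 5))
```

Ignoring unscored bases is a design choice of this sketch; treating them as zero would lower the mean of partially aligned exons.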

Gene ontology annotations

Gene ontology (GO) annotations were obtained from the Gene Ontology Database (www.geneontology.org) (Bourne, 2009) as a single obo file. The most general, all-encompassing hierarchical terms were selected in order to stratify genes according to basic functional categories such as metabolism, development and transcription factor functions.


Calculations

All calculations on the dataset described above were performed at log10 scale in order to achieve a clearer representation of the results. Exon counts (number of exons per transcript) and exon sizes were log-transformed and binned according to their numerical value; that is, all exons that formed part of a transcript with N exons were pooled, and a mean value and standard deviation were calculated for the distribution of values of this subset. This gave rise to less fuzzy representations that, nonetheless, retained the information on the variance in the form of error bars.
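The pooling procedure can be sketched as below. The input is assumed to be a list of (exon count of the parent transcript, exon size in nts) pairs, and the function name is hypothetical. The sketch averages the log-transformed sizes; whether the mean was taken before or after the log transformation is not stated in the text, so this ordering is an assumption.

```python
import math
from collections import defaultdict
from statistics import mean, stdev

def bin_by_exon_count(exons):
    """Pool exons by the exon count N of their transcript.

    exons: iterable of (n_exons_in_transcript, exon_size_nt) pairs.
    Returns {N: (log10(N), mean of log10 sizes, sample SD of log10 sizes)}.
    """
    pooled = defaultdict(list)
    for n_exons, size in exons:
        pooled[n_exons].append(math.log10(size))
    summary = {}
    for n_exons, log_sizes in sorted(pooled.items()):
        # The sample SD is undefined for a single value; report 0.0 in that case.
        sd = stdev(log_sizes) if len(log_sizes) > 1 else 0.0
        summary[n_exons] = (math.log10(n_exons), mean(log_sizes), sd)
    return summary


toy = [(1, 1000), (2, 100), (2, 10000)]
for n, (log_n, m, sd) in bin_by_exon_count(toy).items():
    print(n, round(log_n, 3), round(m, 3), round(sd, 3))
```

Each returned triple corresponds to one point of a double-logarithmic plot, with the standard deviation available for the error bar.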


Results and Discussion

Average exon conservation increases with the number of exons in transcripts

By first comparing the mean conservation at the level of exons with the number of exons per transcript, we were able to observe a clear positive correlation between the two quantities. Thus, the greater the number of exons in a transcript, the greater the mean conservation of its exons. In Figure 1, the average conservation for all exons belonging to transcripts with N exons is plotted against log10(N). Points represent the mean and bars are based on the standard error of the mean. This positive relationship between exon conservation values and the exon count (number of exons) in the transcript cannot be simply attributed to the effect of external and singular exons (i.e. exons that are part of either one- or two-exon transcripts), which are generally not constrained, as the positive trend in Figure 1 extends for transcripts of up to 10 exons. This finding hints at a general tendency for more complex transcripts (carrying a great number of exons) to be overall more conserved, which is not counter-intuitive. Conservation at the protein level has in the past been linked to the size of introns in the Drosophila genome (Marais et al., 2005), which is most likely a reflection of the Hill-Robertson effect (Hill and Robertson, 1966), according to which an increased number of recombination events is expected to occur in highly conserved genes. If this is a general property of eukaryotes, one would expect high recombination rates to increase the degree of "compartmentalization" and thus lead to more and smaller exons, in accordance with what is expected under the MA law, an interesting hypothesis to which we will return later on. Another interesting feature of this analysis is that conservation values tend to be higher for alternative compared to constitutive exons, especially for transcripts with low exon counts. This is suggestive of a tendency for alternative exons to be more conserved, to which we will also come back later on.

Average exon conservation decreases with exon size

An indirect consequence of this observation would also be an inverse relationship between conservation and the size of the exons. This means that if transcripts are "broken down" into smaller constituents in more constrained sequences, then the mean conservation would tend to decrease with exon size. In Figure 2 such an inverse relationship is shown to hold. In fact, it holds for both internal and external exons, which means that it is largely independent of the limiting factors related to external exons. First and last exons are generally longer than internal ones as they include 5' and 3' untranslated regions, and this is the reason for the range of mean conservation being smaller in Figure 2b than in Figure 2a. Nevertheless, the trend is very similar in both plots, an indication that the relationship between exon count and exon size, manifest in the MA law, is also reflected at the level of sequence conservation.

Average exon conservation increases with the number of alternative transcripts per gene

Next, we wanted to investigate whether sequence constraint may be related to transcript complexity. Genes with a great number of exons tend to have a greater number of alternative transcripts, as would be expected purely because of an increased "repertoire" of exons to "choose" from. In this sense, given the inverse relationship of the MA law, genes with a great number of transcripts (and thus a greater number of exons) will tend to have smaller exons. A direct consequence stemming from the observations above (see Figures 1, 2) would be that they will also carry exons with higher conservation. Indeed, this seems to be true, as one can observe in Figure 3 (bottom left), where a clear positive tendency is shown between the number of alternative transcripts and the mean conservation of the exons in each transcript. There is thus a combined positive relationship between exon sequence conservation and complexity at both the genic and the transcript level. To better demonstrate this, Figure 3 (right) shows a three-way relationship between exon counts per transcript, transcript counts per gene locus and mean exon conservation. One may clearly see how conservation increases with the complexity of both the transcript (in terms of number of exons) and the gene (in terms of number of alternative transcripts).

Menzerath-Altmann law and transcript complexity

Having already observed a correlation between exon sequence conservation and the distribution of exon sizes in transcripts, we analyzed how transcriptional complexity may affect the conformity to the MA law exhibited by the complete set of mouse genes (Figure 4a). To do this we plotted the relationship between mean exon size and exon count for three different groups of genes, depending on the number of alternative transcripts they gave rise to. Subsets of genes with few (1-3), intermediate (5-7) and many (>9) alternative transcripts were analyzed separately. Mean exon sizes were plotted against exon counts in double-logarithmic (log10) scales and linear fits were calculated in all three cases (Figure 4). Although linear relationships were observed in all cases, the extent of the linear fit, the absolute value of the slope and the coefficient of determination (R²) were significantly higher for the case of low alternative transcript counts (Figure 4a), in striking contrast to the other two categories. In order to do away with the impact of outliers due to the log-transformation, we also calculated the correlation for each dataset using a non-parametric test (Spearman’s rank ρ coefficient). The values obtained were: R(Complete Dataset) = -0.13, R(1-3 Transcripts) = -0.27, R(5-7 Transcripts) = -0.11 and R(>9 Transcripts) = -0.01, thus reflecting the rather obvious qualitative differences in the linear fits of Figure 4. These findings suggest that the observed overall linear relationship between exon sizes and exon counts in accordance with the MA law (Figure 4a) should be predominantly attributed to genes with low transcriptional complexity.

Such genes, linked to a small number of possible alternative transcripts, are moreover expected to carry fewer, bigger, less conserved exons, and exons that tend, by definition, to be constitutive. In fact, the observed greater extent of linearity in the exon size distributions may be broken down into distinct cases for exons as well, if one simply analyzes constitutive versus alternative exons. Even though constitutive exons represent less than 20% of the total, they produce better and more extended linear fits (Figure 5b) when compared to the remaining ~80% that correspond to alternative exons (Figure 5c). A simple inspection of Figure 5 suffices to see that when analyzing the complete set of exons one mostly observes the significantly milder effect of the alternative exons. Overall, the data presented herein suggest that more extensive linearity and steeper slopes are primarily due to the contribution of constitutive exons that belong to genes with a small number of transcripts.
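The two statistics used in this section, a least-squares line in double-logarithmic scale and a rank correlation, can be sketched with the Python standard library alone. The functions below are a generic illustration under that assumption, not the code used to produce Figures 4 and 5; all names are hypothetical, and the toy data are synthetic values that follow an exact power law.

```python
import math

def _sums(xs, ys):
    """Centered sums of squares and cross-products, plus the two means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return mx, my, sxx, syy, sxy

def linear_fit(xs, ys):
    """Ordinary least squares y = slope * x + intercept; returns (slope, intercept, R^2)."""
    mx, my, sxx, syy, sxy = _sums(xs, ys)
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, my - slope * mx, r_squared

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's coefficient, computed as the Pearson correlation of the ranks."""
    _, _, sxx, syy, sxy = _sums(average_ranks(xs), average_ranks(ys))
    return sxy / math.sqrt(sxx * syy)

# Synthetic example: mean exon size falling as N**-0.5.
counts = [1, 2, 4, 8, 16]
sizes = [1000 * n ** -0.5 for n in counts]
print(linear_fit([math.log10(n) for n in counts], [math.log10(s) for s in sizes]))
print(spearman_rho(counts, sizes))
```

Because ranks are unchanged by a monotonic transformation, the rank correlation is the same whether it is computed on raw or log-transformed values, which is why it is insensitive to the outliers mentioned above.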


Menzerath-Altmann law and exon conservation

The results presented thus far prompted us to perform a more detailed analysis at the level of exon conservation. We thus wanted to examine whether, and to what extent, the sequence constraints in eukaryotic exons may be linked to the MA law-like behaviour. To do this, we performed an analysis of exon sizes versus exon counts, this time taking into account the sequence conservation of each exon before creating artificial transcripts in the following way: starting from the full transcript annotations, we created artificial transcript structures by setting specific filters on the conservation of each exon. For instance, if a transcript carried 5 exons and only 3 out of those passed the conservation filter set for the particular instance, then the transcript's exon count was reduced to 3 and only the sizes of these 3 exons were passed on to subsequent calculations. By carefully adjusting conservation filters we created subsets of exons belonging to artificial transcripts that fulfilled specific criteria in terms of mean exon conservation. We employed this analysis to examine whether transcripts created in this way would show differences in the extent of conformity to the MA law.

The results, shown in Figure 6, are suggestive of a very distinct behaviour between low and high constraint exons in these artificial transcripts. Low-constraint exons (Figure 6c), representing 12% of total exons, with mean PhastCons values smaller than or equal to 0.2, showed an extensive linear relationship in accordance with the behaviour expected under the MA law (Spearman R=-0.11). It should be noted here that a PhastCons value of 0.2 is considered an approximate threshold for the existence of sequence constraint. In this sense, these 12% of total exons are practically unconstrained, or only marginally constrained, in terms of sequence conservation. On the other hand, highly constrained exons (PhastCons>0.99), lying at the other end of the spectrum and representing the top 10% of conserved exons, showed practically no correlation between exon size and exon count (Figure 6b, Spearman R=0.03). The overall behaviour when one takes into account all the exons, regardless of conservation, lies somewhere in between these two extremes (see Figure 6a). As the results presented in Figures 6b and 6c only account for ~20% of the total dataset, we split the complete set of exons into two sets of high and low constraint (top 50% and bottom 50% of exons ranked according to mean conservation value). The resulting fits (Figure 7), even though not as striking as the ones shown in Figure 6, remain highly indicative of a strong negative relationship between sequence constraint and conformity with the MA law (Spearman R(low constraint)=0.13; Spearman R(high constraint)=-0.02).

Keeping in mind that the results presented in Figures 6 and 7 refer to artificial transcripts, we may suggest that MA law-like behaviour would be preferentially attributed to exons with low sequence constraint. Even more so, a process through which new, and thus low-constraint, exons would appear in a transcript would be sufficient to produce strong linear relationships in accordance with the MA law. Highly constrained exons on the other hand, when isolated from the less conserved ones, tend to show no correlation between mean size and exon count.
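The construction of artificial transcripts described above can be sketched as a simple filter. The data layout (each transcript as a list of exon size and mean conservation pairs), the function name and the toy values are assumptions made for illustration; only the thresholds 0.2 and 0.99 are taken from the text.

```python
def artificial_transcripts(transcripts, keep):
    """Rebuild transcripts from the exons that pass a conservation filter.

    transcripts: dict transcript_id -> list of (exon_size_nt, mean_phastcons)
    keep: predicate applied to the mean conservation of each exon
    Returns dict transcript_id -> list of retained exon sizes; the length of
    each list is the reduced exon count. Transcripts left without exons are dropped.
    """
    rebuilt = {}
    for transcript_id, exons in transcripts.items():
        sizes = [size for size, conservation in exons if keep(conservation)]
        if sizes:
            rebuilt[transcript_id] = sizes
    return rebuilt


# Toy transcript with 5 exons, 3 of which fall at or below the 0.2 threshold.
toy = {"tx1": [(120, 0.10), (90, 0.95), (200, 0.15), (75, 1.00), (300, 0.05)]}
low = artificial_transcripts(toy, lambda c: c <= 0.2)
high = artificial_transcripts(toy, lambda c: c > 0.99)
print(len(low["tx1"]), low["tx1"])    # exon count reduced from 5 to 3
print(len(high["tx1"]), high["tx1"])
```

The retained sizes and reduced counts can then be pooled and fitted in the same way as for the original transcripts.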


Conclusions

Eukaryotic genes have specific characteristics that point to a highly dynamic nature. They are long, predominantly occupied by non-coding space (introns) and constituted by exons that are small in size, unequal in terms of sequence constraint (which reflects their age and importance) and differentially employed in various combinations through alternative splicing. They also undergo a number of molecular processes such as self-replication, recombination, random insertions, excisions and duplications that shape their relative constitution in evolutionary time under the constant constraint of selection. The statistical properties of the distributions of mean exon size and number per transcript are therefore inevitably linked to the underlying dynamics governing their evolution.

In this work we have attempted to investigate the extent to which sequence constraints (manifest at the level of exon conservation) and transcriptional complexity (linked to the alternative employment of different exon subsets) may account for the statistical properties exhibited by eukaryotic genes as conformity to the Menzerath-Altmann law (Li, 2011). The presented results point towards a complex relationship between exon turnover and exon size distributions. In particular, our analyses suggest that constitutive exons with low sequence constraint, being part of genes with a small number of alternative transcripts, form the best "substrate" for the emergence of MA law-like behaviour. Taken together, these observations prompt us to propose a simple evolutionary model that may account for the variable extents of linearity in the exon size distributions of different categories of transcripts (Figure 8).

Such a model would start with the valid assumption that new exons (that is, newly employed exons) are under low sequence constraint. By not being subject to particular constraints, these exons are more likely to be "interrupted" through the processes of recombination and retroposition (insertion of DNA fragments through reverse transcription). It may, therefore, be envisaged that up to some point, properties related to the MA law may emerge through this gradual fragmentation. Such interruptions are, nonetheless, expected to occur less frequently in genes whose exons are more conserved and thus less prone to them. As exons are put under increased constraints they become more protected from further fragmentation. At the same time, genes acquire an expanded number of exons and are thus provided with a greater exon "repertoire", which they employ in various ways through the production of alternative transcripts. As the process of alternative splicing, that is the differential combination of exons in various ways, requires increased control, alternative exons are further protected from mutations due to the additional constraints imposed by the need for a more intricate transcriptional regulation. The overall increased complexity at the transcript level imposes additional constraints and further decreases the possibility of exon fragmentation, thus leading to a "frozen" gene structure and bringing any further "shattering" to a halt. In the opposite way, as exons become obsolete they become less constrained and thus more prone to insertions, which may have the opposite effect, that is "inflating" the exon size with a probability that is inversely related to their sequence conservation. A subset of the exons that are not subject to constraint may now increase in size through the incorporation of neutral insertions, while others may completely “disintegrate”. This simultaneous increase in mean size and reduction in number may thus assist in partially restoring the concordance with the MA law.

The proposed hypothesis is that extensive linearity in the distributions of exon counts as functions of exon sizes is to be predominantly expected among transcripts that are either in the process of fixing their structure through selection or in the process of "erosion". Strongly constrained genes, with complex structures and conserved exons, show little if any such linear relationship. A possible consequence of this would be that genes with more versatile functions, and thus more prone to be under mild constraints, will conform to the MA law to a greater extent than genes with more constrained functionality. An example of this is shown in Figure 9, where genes annotated as transcription factors (6% of total transcripts) were compared to metabolic genes (2% of total transcripts). Transcription factors are a much more versatile functional category of genes compared to enzymes of the metabolic pathways, which are extremely conserved in all eukaryotes. In this sense, the extent of linearity in the plot of the transcription factors being greater than that of the metabolic genes is in agreement with the proposed model. Genes that have not reached a "steady state" are more likely to show signs of self-organization in the form of MA law-like statistical properties.

The aim of this work was a detailed investigation of transcript complexity and sequence constraint in the context of the MA law in eukaryotic exons. Further analysis performed at the level of exon dynamics, distinguishing between "young" and "old" exons and incorporating pseudoexons in the process of erosion as well as recently exapted exons, will provide further insight into the relationship between sequence constraint and statistical properties in the distributions of gene components and effectively validate or disprove the proposed model.


Refrences

Ac ce pt e

d

M

an

us

cr

ip t

Almirantis, Y., Provata, A., 2001. An evolutionary model for the origin of non-randomness, long-range order and fractality in the genome. BioEssays 23, 647–656.
Altmann, G., 1980. Prolegomena to Menzerath’s law. Glottometrica 2, 1–10.
Boroda, G.A.G., 1991. Menzerath’s law in musical texts. Musikometrica 3, 1–3.
Bourne, P.E. (Ed.), 2009. The Gene Ontology’s Reference Genome Project: a unified framework for functional annotation across species. PLoS Comput. Biol. 5, e1000431.
Buldyrev, S.V., Goldberger, A.L., Havlin, S., Peng, C.-K., Simons, M., Stanley, H.E., 1993. Generalized Lévy-walk model for DNA nucleotide sequences. Phys. Rev. E 47, 4514–4523.
Clark, M.B., Amaral, P.P., Schlesinger, F.J., Dinger, M.E., Taft, R.J., Rinn, J.L., Ponting, C.P., Stadler, P.F., Morris, K.V., Morillon, A., Rozowsky, J.S., Gerstein, M.B., Wahlestedt, C., Hayashizaki, Y., Carninci, P., Gingeras, T.R., Mattick, J.S., 2011. The reality of pervasive transcription. PLoS Biol. 9, e1000625; discussion e1001102.
Dixon, J.R., Selvaraj, S., Yue, F., Kim, A., Li, Y., Shen, Y., Hu, M., Liu, J.S., Ren, B., 2012. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380.
Eigen, M., Schuster, P., 1977. The hypercycle. A principle of natural self organization. Part A: Emergence of the hypercycle. Naturwissenschaften 64, 541–565.
Ferrer-i-Cancho, R., Forns, N., 2010. The self-organization of genomes. Complexity 15, 34–36.
Ferrer-i-Cancho, R., Forns, N., Hernandez-Fernandez, A., Bel-Enguix, G., Baixeries, J., 2012. The challenges of statistical patterns of language: The case of Menzerath’s law in genomes. Complexity 18, 11–17.
Hill, W.G., Robertson, A., 1966. The effect of linkage on limits to artificial selection. Genet. Res. 8, 269–294.
Keren, H., Lev-Maor, G., Ast, G., 2010. Alternative splicing and evolution: diversification, exon definition and function. Nat. Rev. Genet. 11, 345–355.
Lev-Maor, G., Sorek, R., Shomron, N., Ast, G., 2003. The birth of an alternatively spliced exon: 3’ splice-site selection in Alu exons. Science 300, 1288–1291.
Li, W., 2011. Menzerath’s Law at the Gene-Exon Level in the Human Genome. Complexity 17, 49–53.
Li, W., Kaneko, K., 1992. Long-range correlation and partial 1/f^α spectrum in a noncoding DNA sequence. Europhys. Lett. 17, 655–660.
Lieberman-Aiden, E., van Berkum, N.L., Williams, L., Imakaev, M., Ragoczy, T., Telling, A., Amit, I., Lajoie, B.R., Sabo, P.J., Dorschner, M.O., Sandstrom, R., Bernstein, B., Bender, M.A., Groudine, M., Gnirke, A., Stamatoyannopoulos, J., Mirny, L.A., Lander, E.S., Dekker, J., 2009. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293.
Mantegna, R.N., Buldyrev, S.V., Goldberger, A.L., Havlin, S., Peng, C.K., Simons, M., Stanley, H.E., 1995. Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics. Phys. Rev. E 52, 2939.


Marais, G., Nouvellet, P., Keightley, P.D., Charlesworth, B., 2005. Intron size and exon evolution in Drosophila. Genetics 170, 481–485.
Peng, C.-K., Buldyrev, S.V., Goldberger, A.L., Havlin, S., Sciortino, F., Simons, M., Stanley, H.E., 1992. Long-range correlations in nucleotide sequences. Nature 356, 168–170.
Provata, A., Almirantis, Y., 1997. Scaling properties of coding and non-coding DNA sequences. Phys. A Stat. Mech. its Appl. 247, 482–496.
Rogozin, I.B., Sverdlov, A.V., Babenko, V.N., Koonin, E.V., 2005. Analysis of evolution of exon-intron structure of eukaryotic genes. Brief. Bioinform. 6, 118–134.
Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A.S., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W., Richards, S., Weinstock, G.M., Wilson, R.K., Gibbs, R.A., Kent, W.J., Miller, W., Haussler, D., 2005. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050.
Solé, R., 2010. Genome size, self-organization and DNA’s dark matter. Complexity 16, 20–23.
Taft, R.J., Pheasant, M., Mattick, J.S., 2007. The relationship between non-protein-coding DNA and eukaryotic complexity. BioEssays 29, 288–299.


Figure 1

Mean exon conservation is positively correlated with the number of exons in the transcript. Log10(exon count) plotted against average PhastCons score for constitutive (left) and alternative (right) exons. The range of conservation values shows a tendency for higher values as the number of exons increases. Overall conservation values are higher for alternative (~82% of total) as compared to constitutive (~18% of total) exons. Points represent mean conservation values of all exons belonging to transcripts with the same exon count. Error bars are standard errors of the mean.

Figure 2

Mean exon conservation is inversely related to exon size. Log10(exon size) plotted against average PhastCons score for internal (left) and external (right) exons. An inverse relationship is evident in particular for the internal exons (~75% of total exons), while it is less pronounced for external exons (first and last exons in a transcript, ~25% of total exons). Points represent mean conservation values of all exons belonging to the same length bin (bins of 10 nt). Error bars are standard errors of the mean.


Figure 3

Exon conservation increases with both transcript and genic complexity. Mean exon conservation shows a positive relationship not only with the number of exons per transcript (top left) but also with the number of alternative transcripts from the same gene locus (bottom left). Points represent mean conservation values of all exons belonging to genes with the same number of transcripts. Error bars are standard errors of the mean. Mean exon conservation shows a positive relationship for both more complex transcripts and more complex gene loci. A three-way relationship of exon counts per transcript and transcript counts per gene locus (right) shows that exon conservation increases with the complexity of the gene structure at both levels.


Figure 4

Concordance with Menzerath-Altmann law is higher for less complex gene loci. Double logarithmic plots of exon counts per transcript vs mean exon size for all gene loci (a) as compared to three distinct subsets of genes with low (b), intermediate (c) and high (d) transcript numbers. Points represent mean exon size of all exons with the same exon count in their given transcript. The extent of linear fits was set to be equal for all cases to allow direct comparison. Coefficients of determination decreased steadily with the number of alternative transcripts (R² = 0.92, 0.93, 0.72 and 0.06 for panels a, b, c and d, respectively). Error bars correspond to standard error of the mean. Linear fits were obtained for the data range that gave the greatest extent of linearity. Values of the slope and the mean standard error of linear regression are shown on each plot.

Figure 5

Constitutive exons show increased concordance with Menzerath-Altmann Law. Double logarithmic plots of exon counts per transcript vs mean exon size for a) all, b) constitutive and c) alternative exons. Points represent mean exon size of all exons with the same exon count in their given transcript. Error bars correspond to standard error of the mean. Linear fits were obtained for the data range that gave the greatest extent of linearity. Values of the slope and the mean standard error of linear regression are shown on each plot.
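The double logarithmic fits described in the captions of Figures 4 and 5 can be illustrated with a minimal sketch. It assumes the simplified power-law form of the law, in which mean exon size y relates to exon count x as log10(y) = log10(a) + b·log10(x), so that concordance appears as a straight line with negative slope b and a high coefficient of determination. The function names and the toy transcripts are hypothetical and are not the author's code or data. Unlike the fits in the paper, the sketch uses all points rather than selecting the range with the greatest extent of linearity.

```python
import math
from collections import defaultdict


def mean_exon_size_by_count(transcripts):
    """Pool exons of transcripts sharing an exon count.

    `transcripts` is a list of lists of exon sizes (hypothetical input format).
    Returns {exon_count: (mean_size, standard_error_of_mean)}.
    """
    pooled = defaultdict(list)
    for exon_sizes in transcripts:
        pooled[len(exon_sizes)].extend(exon_sizes)
    summary = {}
    for count, values in pooled.items():
        n = len(values)
        mean = sum(values) / n
        # Sample variance; defined as 0.0 when only one exon is available.
        var = sum((v - mean) ** 2 for v in values) / (n - 1) if n > 1 else 0.0
        summary[count] = (mean, math.sqrt(var / n))
    return summary


def loglog_fit(points):
    """Least-squares fit of log10(y) on log10(x).

    Returns (slope, intercept, r_squared).
    """
    xs = [math.log10(x) for x, _ in points]
    ys = [math.log10(y) for _, y in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


# Toy data (assumption): mean exon size halves each time the exon count doubles.
toy = [[1000], [500, 500], [250] * 4, [125] * 8]
summary = mean_exon_size_by_count(toy)
points = sorted((count, mean) for count, (mean, _) in summary.items())
slope, intercept, r_squared = loglog_fit(points)
```

On this idealized input the fitted slope is close to -1 and the coefficient of determination is close to 1. Real exon sets would be expected to deviate from such a perfect line.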


Figure 6

Concordance with Menzerath-Altmann law is inversely related with the mean exon conservation of a transcript. Double logarithmic plots of exon counts per transcript vs mean exon size for artificial transcripts that have been constructed after filtering of exons based on their average conservation. 6a, all exons; 6b, top 10% conserved exons; 6c, bottom 12% conserved exons. Points represent mean exon size of all exons with the same exon count in their given transcript. Error bars correspond to standard error of the mean.


Figure 7

Concordance with Menzerath-Altmann law is inversely related with the mean exon conservation of a transcript. Double logarithmic plots of exon counts per transcript vs mean exon size for artificial transcripts that have been constructed after filtering of exons based on their average conservation. 7a, all exons; 7b, top 50% conserved exons; 7c, bottom 50% conserved exons. Note the marked qualitative differences between panels b and c, which together represent the total dataset. Points represent mean exon size of all exons with the same exon count in their given transcript. Error bars correspond to standard error of the mean.

Figure 8

A simple evolutionary model that accounts for the observations regarding concordance with Menzerath-Altmann law. As new exons arise they are under no constraint and extended in size. Gradual fragmentation permitted by low constraint is sufficient to produce size distributions with high concordance with the law of Menzerath-Altmann. Further fragmentation increases the possibility for the employment of alternative splicing which in turn increases the constraints on the exon sequence due to increased demands for splicing regulation. Complex gene loci, with multiple alternative transcripts contain fragmented, smaller exons with increased conservation but overall low concordance with Menzerath-Altmann law.


Figure 9

Different functional categories of genes show different concordance with Menzerath-Altmann law. Double logarithmic plots of exon counts per transcript vs mean exon size for a) all transcripts, b) transcripts annotated as transcription factors and c) transcripts annotated as metabolic enzymes based on Gene Ontology Annotation. Points represent mean exon size of all exons with the same exon count in their given transcript. Error bars correspond to standard error of the mean. Linear fits were obtained for the data range that gave the greatest extent of linearity. Values of the slope and the mean standard error of linear regression are shown on each plot.


Table 1

Detailed summary of the complete dataset in terms of annotated exons and transcripts.

Average numbers
  Transcripts per gene                  2.82
  Exons per transcript                  7.06
  Constitutive exons per gene           3.88
  Alternative exons per gene            17.63

Dataset breakdown
  Constitutive exons                    18.8%
  Alternative exons                     81.2%
  Initial exons                         12.2%
  Terminal exons                        12.2%
  External exons (initial+terminal)     24.4%
  Internal exons                        73.6%
