Journal of Bioinformatics and Computational Biology Vol. 13, No. 1 (2015) 1540009 (24 pages) # .c Imperial College Press DOI: 10.1142/S0219720015400090

Sequence-based prediction of transcription upregulation by auxin in plants

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by VIRGINIA POLYTECHNIC INSTITUTE on 03/13/15. For personal use only.

Petr M. Ponomarenko Children's Hospital Los Angeles 4640 Hollywood Blvd, Los Angeles, CA 90027, USA [email protected] Mikhail P. Ponomarenko Institute of Cytology and Genetics 10 Lavrentyev Ave., Novosibirsk, 630090, Russia and Novosibirsk State University, 2 Pirogov Str., Novosibirsk 630090, Russia [email protected] Received 3 August 2014 Revised 30 November 2014 Accepted 23 December 2014 Published 30 January 2015 Auxin is one of the main regulators of growth and development in plants. Prediction of auxin response based on gene sequence is of high importance. We found the TGTCNC consensus of 111 known natural and arti¯cially mutated auxin response elements (AuxREs) with measured auxin-caused relative increase in genes' transcription levels, so-called either \a response to auxin" or \an auxin response." This consensus was identical to the most cited AuxRE motif. Also, we found several DNA sequence features that correlate with auxin-caused increase in genes' transcription levels, namely: number of matches with TGTCNC, homology score based on nucleotide frequencies at the consensus positions, abundances of ¯ve trinucleotides and ¯ve B-helical DNA features around these known AuxREs. We combined these correlations using a four-step empirical model of auxin response based on a gene's sequence with four steps, namely: (1) search for AuxREs with no auxin; (2) stop at the found AuxRE; (3) repression of the basal transcription of the gene having this AuxRE; and (4) manifold increase of this gene's transcription in response to auxin. Independently measured increases in transcription levels in response to auxin for 70 Arabidopsis genes were found to signi¯cantly correlate with predictions of this equation (r ¼ 0:44, p < 0:001) as well as with TATA-binding protein (TBP)'s a±nity to promoters of these genes and with nucleosome packing of these promoters (both, p < 0:025). Finally, we improved our equation for prediction of a gene's transcription increase in response to auxin by taking into account TBP-binding and nucleosome packing (r ¼ 0:53, p < 10 6 ). Fisher's F -test validated the signi¯cant impact of both TBP/promoter-a±nity and promoter nucleosome on auxin response in addition to those of AuxRE, F ¼ 4:07, p < 0:025. It means that both TATA-box and nucleosome should be taken into account to recognize transcription factor binding sites upon DNA sequences: in the case of the TATA-less nucleosome-rich

1540009-1

P. M. Ponomarenko & M. P. Ponomarenko promoters, recognition scores must be higher than in the case of the TATA-containing nucleosome-free promoters at the same transcription activity. Keywords: Auxin; auxin responsive element; primary auxin response; plant; prediction in silico.

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by VIRGINIA POLYTECHNIC INSTITUTE on 03/13/15. For personal use only.

1. Introduction Auxin (Aux, synonym: indole-3-acetic acid, IAA) is a key player among all plant hormones that regulate growth and division of cells,1 lateral root development,2 meristem maintenance3 and gravitropism.4 Early/primary auxin-response genes are the most studied in the auxin pathway and they are characterized with increase5 in transcription of primarily6Aux/IAA, SAUR and GH3 gene families 30 min after treatment with auxin. Auxin response transcription factors (ARFs) and co-repressors from Aux/IAA family form dimers that search DNA for auxin response elements (AuxREs) using steric intermolecular interaction between surfaces of the protein and DNA contacting with one another as Emil Fischer's \key-lock" principle7 when auxin concentration is sub-threshold8 (Fig. 1(a)). If ARF–Aux/IAA dimer ¯nds AuxRE site on DNA, then ARF binds DNA (Fig. 1(b)) and co-repressor Aux/IAA represses basal transcription of corresponding gene9 (Fig. 1(c)). When auxin concentration becomes above-threshold, then Aux/IAA co-repressor dissociates from DNA, allowing transcription that leads to transcription increasing many folds10 (Fig. 1(d)). DNA sequence of AuxRE11 was reported for the ¯rst time in the IAA4/5 gene promoter of Pisum sativum as contextual invariants kGTnCCCAT and mACATGGnmr TGTyym (nomenclature12), which became common e.g. cgTGTCGC,13 TGTCCC,14 TGTCys15 and yGTCGs.16 Canonical AuxRE motif, TGTCTC, was found in experiment with ARF1 binding to promoter with single-nucleotide substitutions.17 Its generalization to other transcription factors from the ARF family18 was reported to be TGTCnn.19 Signi¯cant excess of AuxRE-like sequences in proximal promoter regions, of 250 bp,20 300 bp21,22 and 500 bp20,23,24 in length, was published for genes that increase transcription in response to auxin. Function of these AuxRE-like sequences is yet to be identi¯ed experimentally. Search for invariants in promoters of co-expressed genes with unknown sites of corresponding transcription factors25

Fig. 1. The four-step mechanism of primary auxin response8: (a), (b), (c) at the sub-threshold auxin concentration, (d) when the auxin concentration is above-threshold. PIC is transcription preinitiation complex. 1540009-2

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by VIRGINIA POLYTECHNIC INSTITUTE on 03/13/15. For personal use only.

Sequence-based prediction of transcription upregulation by auxin in plants

resulted in an extended AuxRE motif TGTstsbc.20 The most-cited AuxRE motif is TGTCNC.26 This uncertainty in AuxRE sequence may be due to ARF of the ARF– Aux/IAA dimer binding sensitivity to AuxRE on DNA according to the known 3Dstructure.16 The co-repressor Aux/IAA can be responsible for the speci¯city of this dimer binding to DNA according to transcriptome analysis.5 There are many experiments published on site-speci¯c mutagenesis in well-identi¯ed AuxREs9,17,22,27,28 and on reporter design based on these experiments.9 The great diversity of AuxREs and ARF–AUX/IAA dimers can be the reason why the dataset of experimentally proven AuxREs is yet to be published. In our previous paper,29 we solved an ill-posed inverse problem for experimental data9,22,27 on mutagenesis within multiple AuxREs that contained 69 natural and arti¯cially mutated AuxREs with measured auxin-dependent increases in the genes' transcription levels. In this work, we added experimental data17,28 on auxin response of 42 other natural and arti¯cially mutated AuxREs, increasing their number to 111 for analysis. For the \Key-Lock" step in auxin response (Fig. 1(a)), we found consensus30 of these AuxREs. For the \Stop" step (Fig. 1(b)), we estimated the linearly additive score of nucleotide frequencies, so-called \homology score,"31 at this consensus in terms of the theory32 of statistical mechanics of regulatory proteins binding to DNA. For the \Repression" and \Activation" steps (Figs. 1(c) and 1(d)) we found negative and positive correlations respectively between genes' transcription increases in response to auxin and contextual features around AuxRE using Activity33 tools described in our previous paper.29 Taking all these ¯ndings together we derived an empirical equation34 of four-step auxin response (Fig. 1). As an independent control, using this equation, we predicted the best motif for AuxRE, TGTCtCws. It was signi¯cantly matched with TGTstsbc,20 the most frequent motif of promoters for the RNAs enriched in the auxin response transcriptome5 (p < 0:05, Fisher's exact test). We found that the measured increases of transcription levels in response to auxin for 70 Arabidopsis genes with unknown AuxREs correlate signi¯cantly with our predictions34 as well as with two other predictions, namely, TATA-binding protein (TBP)'s a±nity34 to these genes' promoters35 and their nucleosome score.36 Fisher's F -test validated the signi¯cant impact of these phenomena on increase of transcription levels in response to auxin, taking into account AuxRE's impact, F ¼ 4:07, p < 0:025. It means that TATA-box and nucleosome should be taken into account to recognize transcription factor binding sites on DNA sequences: in the case of the TATA-less and nucleosome-rich promoters, recognition scores must be higher than in the case of the TATA-containing nucleosome-free promoters. This is the main novelty of this work. 2. Materials and Methods 2.1. Datasets Table 1 shows six datasets: I, V and VI are from our previously published dataset,29 while II, III and IV are from other papers.17,28 Overall we analyzed 111 natural and 1540009-3

P. M. Ponomarenko & M. P. Ponomarenko

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by VIRGINIA POLYTECHNIC INSTITUTE on 03/13/15. For personal use only.

Table 1. Six datasets comprising 111 natural and mutant AuxREs with known auxin responses (’-values).

Note: All sets contain one more non-AuxRE sequence \tgtgtgagtagttcccagataagggaattagg" of ’-value norm, lnð’Þ  0.

1540009-4

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by VIRGINIA POLYTECHNIC INSTITUTE on 03/13/15. For personal use only.

Sequence-based prediction of transcription upregulation by auxin in plants

(a)

(b)

(c)

(d)

Fig. 2. Standard representation logos43 of TGTCNC consensus of all 111 AuxREs with known relative changes of transcription levels in response to auxin in natural logarithm numbers (’-values) of the plant gene (Table 1) at the signi¯cance p < 0:05 of Student's t-test according to the de¯nition30 (a); signi¯cant (p < 10 11 ) correlation between known auxin response ’-values and number of matches to the TGTCNC consensus (-values) (b); the logos43 representation of fðsi Þ-values in percent of frequency estimates for nucleotide s at ith position within the framework of the consensus of AuxREs (c); signi¯cant (p < 10 12 ) correlation between known ’-values of auxin response and homology scores31 (-values) calculated using these frequencies estimates of all 111 AuxREs shown in Table 1(d). Broken curves depict 95% con¯dence intervals for the linear regression (solid lines) built using STATISTICA (Statsoft TM , Tulsa, USA).

arti¯cially mutated AuxREs' DNA sequences of length 32 bp as de¯ned by an arti¯cial mutagenesis design.9,17,22,27,28 Each AuxRE was characterized by experimentally measured relative increases in transcription level after auxin treatment (’-value) that varied from 0.005 to 1.90 of natural logarithm units (ln). These datasets are shown in Table 1 and in Figs. 2(b), 2(d) and 3(a) (the y-axis). We added a known non-AuxRE sequence \tgtgtgagtagttcccagataagggaattaggc" to each dataset, in order to later normalize experimental data9,17,22,27,28 using it. For each dataset of size N we assumed  ¼ 3N  1 degrees of freedom37 in our analysis, because all reported experimental values were calculated as the average of four independent measurements (Table 2). To exemplify how our results obtained from the analysis of these 111 AuxRE could be used in practice, we took an independent dataset from the Genevestigator39 database. This dataset included 500 bp proximal35 promoters of 70 genes of the primary auxin response of three families GH3 (1, 3–7, 12, 13, 16, 19), Aux/IAA (1–7, 9, 11–14, 17–20, 26, 29–34) and SAUR (1, 7–10, 12, 15, 16, 18, 23, 25, 27, 28, 30, 1540009-5

P. M. Ponomarenko & M. P. Ponomarenko

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by VIRGINIA POLYTECHNIC INSTITUTE on 03/13/15. For personal use only.

(a)

(b)

(c)

Fig. 3. Signi¯cant (p < 10 14 ) correlation between measured auxin responses (’-values) and predicted with Eq. (4) for all 111 AuxREs in our dataset (Table 1) taken together (a); several examples of the putative candidate AuxREs from 100 putative AuxREs selected from 1 million random 32 bp DNAs sequences based on highest predicted increases of transcription in response to auxin with Eq. (4) (b); standard logos43 representation of the consensus30 of that 100 AuxREs candidates (c). Broken curves and solid lines are as in Fig. 2.

33–36, 41, 44–48, 50–52, 54, 62–68, 70, 71, 76) in Arabodopsis thaliana from TAIR database.38 Their relative increases in transcription levels in response to auxin (’-values) ranged from 0.07 ln to 1.74 ln after 1 h of treatment with 1 M IAA as shown in Fig. 4, y-axis.

(a)

(b)

(c)

(d)

Fig. 4. Signi¯cant correlations of the experimental ’-values of the 70 genes of the families GH3, Aux/IAA and SAUR of the primary response to auxin in Arabidopsis taken from Genevestigator39 database (a) to the maximal ’-values predicted by Eq. (4) upon 500 bp proximal promoters35 of these genes (p < 0:001); (b) to the maximal a±nity rates34 of TBP binding to [70, 20] region of these promoters (p < 0:025, the region where all the known TATA boxes reside); (c) to nucleosomal DNA36 of these promoters (p < 0:025); and to the same ’-values predicted by Eq. (5), p < 10 6 (d). Broken curves and solid lines, see the legend to Fig. 2. 1540009-6

Sequence-based prediction of transcription upregulation by auxin in plants

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by VIRGINIA POLYTECHNIC INSTITUTE on 03/13/15. For personal use only.

2.2. DNA sequence analysis To predict relative changes in transcription levels after treatment with auxin (’-values), we applied an approach40 in which the obligatory features that are common for all AuxREs can determine the basic level of the transcription relative changes, while the facultative features that are speci¯c for each AuxRE modulate these changes near the basic level. Based on this, we evaluated the information content and the frequency of each nucleotide at each of 32 positions for all the 111 AuxREs (Table 1) weighted using their measured transcription-level increase in response to auxin (’-values) using Boltzmann's approximation41 as recommended42 (Figs. 2(a) and 2(c), respectively, the y-axis). For all 111 AuxREs with experimentally measured relative increases in transcription level in response to auxin (’-values) in our dataset (Table 1), consensus30 and position-frequency matrix31 (PFM) were calculated. Their commonly accepted representation logos43 are shown in Figs. 2(a) and 2(c) (each letter's height corresponds to its y-axis value). For all the 111 AuxREs (Table 1), signi¯cant correlations of relative increase in transcription in response to auxin (’-values) with the number of matches with the consensus30 () and with homology score31 () are shown in Figs. 2(b) and 2(d), respectively. As an estimate of the prediction reliability for the number of matches with the consensus30 () and with homology score31 (), we have made a comprehensive cross-validation test using each of the six datasets I–VI (Table 2). We analyzed each of the ¯ve datasets I–V (dataset VI was too small) independently using our tool Activity,33 as it done in our previous work29 in the case of dataset V. Since Activity33 was already described in detail in our previous work,29 here we brie°y recall only its fundamentals. Activity uses for \input" a dataset of DNA sequences with known corresponding values of their \activity"; in that work we used their measured relative increases in transcription level after auxin treatment (see Table 1). Its \output" is one of a million of the mean  ½a;b -values of DNA helix Table 2. Cross-validation test results for consensus and PFM using datasets I–VI. Consensus30 TGTCNC (Fig. 2(a)), , number of matches Dataset Degrees a of freedom,37 I II III IV V VI All

40 40 61 28 151 19 334

Linear correlation, r

Signi¯cance, p

0.75 0.69 0.47 0.65 0.71 0.92 0.61

< 10 8 < 10 6 < 10 4 < 10 3 < 10 24 < 10 8 < 10 36

PFM (Fig. 2(c)), , homology score31 Linear correlation, r Signi¯cance, p 0.79 0.64 0.47 0.59 0.67 0.92 0.62

< 10 9 < 10 5 < 10 4 < 10 3 < 10 20 < 10 8 < 10 36

a All experimental data was obtained as mean values of at least four independent experimental measurements.

1540009-7

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by VIRGINIA POLYTECHNIC INSTITUTE on 03/13/15. For personal use only.

P. M. Ponomarenko & M. P. Ponomarenko

feature44  at region [a; b] and the abundance part -values of oligonucleotides  within some part of the sequences that ¯ts best to \activity" values in terms of Pearson's, Spearman's, Kendall's, Fisher's and  2 correlations tested using multiple bootstraptests. Activity33 characterizes both the selected  ½a;b - and F -features using ð ½a;b Þand ðpart Þ-values of a prioritization ranging from 0 to 1, where 1 corresponds to exact match to the provided \activity" values. Using the Bonferroni correction for the binomial test, we have con¯gured Activity33 so that the statistical signi¯cance of its results corresponded to p < 10 20 . For this reason, any result of Activity33 must be additionally validated using some independent experimental data. Thus, each of the  ½a;b - and -features found by Activity33 upon any single dataset from among datasets I–V, was then completely cross-validated on each of the remaining datasets (Table 3). 2.3. Sequence-based prediction of transcription up-regulation by auxin in plants According to earlier recommendations,34 the four-step model of transcription upregulation by auxin in plants (Fig. 1) was modeled by the following empirical equation: ’ ¼ 1 KKEY=LOCK þ 2 KSTOP þ 3 KREPRESSION þ 4 KACTIVATION :

ð1Þ

Tables 4 and 5 represent the details of our heuristic evaluation of this equation's terms using the contextual features of DNA around AuxRE. We have estimated the KKEY=LOCK term (Fig. 1(a)) using the number of matches with the consensus30 AuxRE (). Analogously, we have estimated the KSTOP term (Fig. 1(b)) using the simplest linearly additive score of nucleotide frequencies at the AuxRE consensus positions, so-called \homology score"31 of PFM (). In the same way, we estimated the KREPRESSION term (Fig. 1(c)) using the above described  ½a;b - and F -values of the signi¯cant negative correlation with the ’-values of the experimentally measured auxin response as well as we estimated the KACTIVATION term (Fig. 1(c)) using the positive correlations of the same matter as one can see in Table 4. As recommended,34 the stoichiometric coe±cients  for each term were estimated by bringing these terms to the scale of transcription relative changes in response to auxin (’-values) using Pearson's simple regression and averaging these regression using Lomonosov–Lavoisier's principle of conservation of mass. Figures 3(a) and 4(a) correspond to the application of Eq. (1) to datasets I–VI, to all 111 known AuxREs (Table 1) and to 70 genes in Arabidopsis. Tables 3 and 4 show the complete cross-validation test of the application of Eq. (1) to each dataset from datasets I–VI using the remaining datasets separately for validation. Analogously, we modeled experimental observations45 of the \transcriptional bursts" in response to auxin when frequency, amplitude and duration of these bursts correspond to the ARF, TBP and histone octamer a±nity rates to bind to the 1540009-8

Prioritization, 

Dataset I

Dataset III

Dataset IV

Dataset V

Linear correlation coe±cient, r (statistical signi¯cance, p) Dataset II

1540009-9

0.61 (< 10 2 ) 0.91 (< 10 7 ) 0.87 (< 10 6 )

0.62 (< 10 2 )

0.75 (< 10 3 ) 0.93 (< 10 8 )

Dataset VI

a The test results obtained using the training dataset are boldfaced; empty cells display insigni¯cant correlations (p > 0:05); VVT, SNW, YVD, HYR and TSD trinucleotides    nomenclature IUPAC-IUB CBN12 of the 16 letter code; B-helical DNA features44: , inclination angle; , tip angle; !, twist angle; ", slide; , minor groove size.

Five positive correlations correspond to the auxin-caused transcription increase as Step IV of auxin response 0.46 0.94 (< 10 17 ) 0.47 (< 10 8 ) [YVD]center ½16;1 0.34 0.80 (

Sequence-based prediction of transcription upregulation by auxin in plants.

Auxin is one of the main regulators of growth and development in plants. Prediction of auxin response based on gene sequence is of high importance. We...
2MB Sizes 0 Downloads 7 Views