Accepted Article

Received Date : 06-May-2014 Revised Date : 16-Sep-2014 Accepted Date : 17-Sep-2014 Article type

: Original Article

Characterization of Stress-responsive lncRNAs in Arabidopsis thaliana by Integrating Expression, Epigenetic and Structural Features

Chao Dia,1, Jiapei Yuana,1, Yue Wua, Jingrui Lia, Huixin Linb, Long Hua, Ting Zhangb, Yijun Qia, Mark B. Gersteinc, Yan Guob and Zhi John Lua,* a

MOE Key Laboratory of Bioinformatics, Center for Plant Biology, School of Life

Sciences, Tsinghua University, Beijing 100084, China b

State Key Laboratory of Plant Physiology and Biochemistry, College of Biological

Sciences, China Agricultural University, Beijing 100193, China c

Department of Biophysics and Biochemistry, Yale University, New Haven, CT

06520, USA

1

These authors contributed equally to this work.

This article has been accepted for publication and undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process which may lead to differences between this version and the Version of Record. Please cite this article as an 'Accepted Article', doi: 10.1111/tpj.12679 This article is protected by copyright. All rights reserved.

Accepted Article

* For correspondence: Zhi Lu, Phone: +86-10-62789217, Email: [email protected]

Running title: Stress-responsive lncRNAs in Arabidopsis

Keywords: poly(A)+, poly(A)-, lncRNA, stress, structure, epigenetics, Arabidopsis.

Accession number for sequence data: The RNA-seq data from this study have been deposited in NCBI GEO under accession number GSE49325.

SUMMARY Recently, in addition to poly(A)+ long noncoding RNAs (lncRNAs), many lncRNAs without poly(A) tails, have been characterized in mammals. However, the non-polyA lncRNAs and their conserved motifs, especially those associated with environmental stresses, have not been fully investigated in plant genomes. We performed poly(A)RNA-seq for seedlings of Arabidopsis thaliana under four stress conditions, and predicted lncRNA transcripts. We classified the lncRNAs into three confidence levels according to their expression patterns, epigenetic signatures and RNA secondary structures. Then, we further classified the lncRNAs to poly(A)+ and poly(A)-

This article is protected by copyright. All rights reserved.

Accepted Article

transcripts. Compared with poly(A)+ lncRNAs and coding genes, we found that poly(A)- lncRNAs tend to have shorter transcripts and lower expression levels, and they show significant expression specificity in response to stresses. In addition, their differential expression is significantly enriched in drought condition and depleted in heat condition. Overall, we identified 245 poly(A)+ and 58 poly(A)- lncRNAs that are differentially expressed under various stress stimuli. The differential expression was validated by qRT-PCR, and the signaling pathways involved were supported by specific binding of transcription factors (TFs), phytochrome-interacting factor 4 (PIF4) and PIF5. Moreover, we found many conserved sequence and structural motifs of lncRNAs from different functional groups (e.g., a UUC motif responding to salt and a AU-rich stem loop responding to cold), indicated that the conserved elements might be responsible for the stress-responsive functions of lncRNAs.

INTRODUCTION Plants possess an elaborate response system to external stimuli (e.g., stress) (Hirayama and Shinozaki 2010, Mittler 2006). Such stress responsive signal transduction pathways have been well studied, and many of the key components of these pathways are protein-coding genes (Hirayama and Shinozaki 2010). However, the stress-responsive noncoding RNAs (ncRNAs), especially the long ncRNAs (lncRNAs), are not well characterized in plants (Ben Amor et al., 2009, Sunkar and Zhu 2004). Long ncRNAs (usually larger than 200 nt), a class of more recently

This article is protected by copyright. All rights reserved.

Accepted Article

identified ncRNAs, have been demonstrated as having essential regulatory roles in various biological processes within several species (Guttman and Rinn 2012, Lee 2012). Several lncRNAs have recently been functionally characterized in plant stress responsive pathways (e.g., COLDAIR, COOLAIR, At4/IPS1, npc48, and npc536) (Ben Amor et al., 2009, Ding et al., 2012, Franco-Zorrilla et al., 2007, Heo and Sung 2011, Shin et al., 2006). However, to date, comprehensive surveys of lncRNAs response to environmental stress are lacking. Comparison of RNA secondary structure in vivo and in vitro reveals that RNAs might undergo conformational changes in response to stress conditions (Ding et al., 2013), and the structural rearrangement is thought to be regulated by RBPs (RNA binding proteins) (Kwok et al., 2013, Rouskin et al., 2014). Thus, discovery of the potential RBP binding motifs (both sequence and structure) is very important for understanding the mechanism of lncRNAs (Rabani et al., 2008, Ray et al., 2013).

With the development of high-throughput sequencing technologies, many lncRNA transcripts have been identified in different species by transcriptome reassembly (Cabili et al., 2011, Nam and Bartel 2012, Pauli et al., 2012, Young et al., 2012). The reconstruction of these transcriptomes is based on large amounts of deep sequencing data, from multiple tissues, or cell lines. In 2012, an array-based strategy was developed for long intergenic ncRNA (lincRNA) prediction in Arabidopsis, with some of the candidates also supported by RNA-seq data (Liu et al., 2012). In addition, most of the RNA-seq only sequenced polyadenylated RNAs by This article is protected by copyright. All rights reserved.

Accepted Article

oligo(dT) selection. However, many of the lncRNA transcripts lack poly(A) tails (Djebali et al., 2012, Yang et al., 2011).

Moreover, it is difficult to identify all

functional lncRNAs via sequencing, or array data alone, as many lncRNAs are often lowly expressed or are only expressed in specific conditions (e.g., stress conditions).

In addition to expression patterns, chromatin signatures have been used to identify lincRNAs in human and mouse genomes. For instance, H3K4me3 and H3K36me3 peaks were used together as transcripts markers because these two histone modifications were enriched in the promoters and gene bodies separately (Guttman et al., 2009, Khalil et al., 2009). Although this method only produces a small portion of lincRNAs (Marques and Ponting 2009), it does indicate an association between lincRNAs and chromatin signatures. In addition, many functional lncRNAs are coordinated with specific chromatin signatures to establish epigenetic regulations (Guttman and Rinn 2012, Lee 2012). On the other hand, many known ncRNAs such as tRNAs, snoRNAs, snRNAs and rRNAs have conserved structures, and computational methods for their identification are widely used in genome annotation (Curwen et al., 2004, Gruber et al., 2010, Swarbreck et al., 2008). Thousands of ncRNAs have been predicted by structure-based methods, such as INFERNAL, Evofold, RNAz, REAPR and others (Gruber et al., 2010, Nawrocki et al., 2009, Pedersen et al., 2006, Will et al., 2013). In a previous study, structural features have been combined with expression patterns in an effort to predict conserved canonical ncRNAs in C. elegans (Gerstein et al., 2010, Lu et al., This article is protected by copyright. All rights reserved.

Accepted Article

2011). However, these methods usually rely on genomic sequence alignments, and some ncRNAs, especially the long ncRNAs, are not conserved between species, or do not have stable secondary structures (this is especially true for long ncRNA transcripts), thus they cannot be identified by these methods.

In eukaryote cells, most of the mature RNAs are polyadenylated in the 3’ end by adding multiple adenines. The length of poly(A) tails are variant form dozens of nucleotides to as long as 300 nucleotides, and it is important for nuclear export, translation, and stability of mRNAs (Elkon et al., 2013, Marzluff et al., 2008). Recent studies have shown that a significant fraction of the transcriptome are not polyadenylated or only have short poly(A) tails in mammalian cells, these transcripts are usually called the non-polyadenylated RNAs (poly(A)- RNA) (Djebali et al., 2012, Livyatan et al., 2013, Yang et al., 2011). In metazoan, the replication-dependent histone mRNAs are poly(A)- RNAs, their transcription ends at a stem-loop structure which is crucial for the histone synthesis and other regulation process. However, the histone mRNAs are polyadenylated in lower eukaryotes and plants, the molecular basis of their regulation in not known (Marzluff et al., 2008). As far as we know, there is no poly(A)- transcriptome analysis in plants.

We developed a comprehensive bioinformatics framework to identify and characterize lncRNAs in the Arabidopsis genome (Figure 1). We carried out poly(A)This article is protected by copyright. All rights reserved.

Accepted Article

RNA-seq and total RNA-seq for various of samples. By assembling sequencing reads and applying our previous method incRNA (Gerstein et al., 2010, Lu et al., 2011), we integrated expression patterns, epigenetic signatures, and RNA secondary structures, and then classified predicted lncRNAs into three confidence levels according to their preferable features. We further classified lncRNA candidates into poly(A)+ and poly(A)- transcripts, and validated them by RT-PCR. We found that poly(A)- lncRNAs tend to have shorter transcripts and lower expression levels. Also, poly(A)- lncRNAs tend to exhibit higher degrees of transcriptional specificity than poly(A)+ lncRNAs and coding genes under stress conditions. Overall, we identified 245 poly(A)+ and 58 poly(A)- lncRNAs that are differentially expressed under stress stimuli, and most of the selected candidates were validated by qRT-PCR. We found additional evidence supporting these stress associations (i.e., PIF regulation of targeted lncRNAs). We built a co-expression network and screened stress-related conserved motifs from lncRNAs in different functional groups. In summary, we sought to provide a genome-wide, well-characterized set of stress-responsive lncRNA candidates (including both poly(A)+ and poly(A)- lncRNAs) in Arabidopsis for future research.

This article is protected by copyright. All rights reserved.

Accepted Article

RESULTS Identification of lncRNAs by high-throughput sequencing We developed a pipeline for lncRNA identification in Arabidopsis, which was initiated by high-throughput RNA sequencing and followed by multiple stages of filtering (Figure 1). First, we sequenced nine non-poly(A) (poly(A)-) RNA samples from Arabidopsis seedlings under four stress conditions (heat, cold, drought, and salt), and a total RNA sample from mixed tissues (i.e., inflorescence, stem and leaf) (see Supplemental Methods). In total, we obtained ~180 M 101 nt reads (rRNA reads removed), and ~150 M of them were mapped to the Arabidopsis genome (Table S1). Subsequently, we assembled 47,462 transcripts by using Cufflinks and Cuffmerge (Trapnell et al., 2010) (see Supplemental Methods); ~83% coding transcripts and ~71% annotated lncRNAs in TAIR10 were successfully assembled, which indicates that our RNA-seq data has a high coverage rate for known transcripts (Figure S1).

To identify lncRNA candidates from the assembled transcripts, we introduced five filter steps (Figure 1): [1] we filtered 36,499 transcripts that overlap with TAIR10 annotated coding genes or noncoding RNAs, [2] we filtered 64 transcripts that can be aligned to the known canonical ncRNAs annotated in miRBase, Plant snoRNAdb, and tRNAdb (Brown et al., 2003, Burge et al., 2013, Juhling et al., 2009, Kozomara and Griffiths-Jones 2011) for further purification, [3] we used INFERNAL to scan the Arabidopsis genome for homologs of structured ncRNAs in Rfam, and filtered 42 This article is protected by copyright. All rights reserved.

Accepted Article

overlapped transcripts, [4] we filtered ncRNA transcripts with high coding potential (CPC score > 0) (Kong et al., 2007), and [5] we also filtered transcripts that aligned to mitochondria or chloroplast sequences using the BLASTn program. We identified total 955 lncRNA candidates (>200 nt) (Appendix 1) (see Supplemental Methods). The identification of lncRNAs by RNA-seq and stringent filters provided a confident set of lncRNA candidates for characterization.

Characterization of lncRNAs by integrating multiple supporting features Assembling transcripts from RNA-seq data is a direct method for ncRNA detection. However, it is hard to successfully distinguish ncRNAs from coding sequences (CDSs) and untranslated regions (UTRs) using expression data only (Lu et al., 2011). In addition to expression patterns, RNA secondary structure, DNA conservation and epigenetic signatures (e.g., H3K4me3 and H3K36me3 histone modifications) have been used to identify noncoding RNAs in human and mouse (Guttman et al., 2009, Khalil et al., 2009, Will et al., 2013). In Arabidopsis, we observed different structure and conservation intensities and epigenetic modification signal distributions in canonical ncRNAs (rRNA, tRNA, snoRNA, miRNA, snRNA and siRNA) and annotated lncRNAs, compared with coding genes (Figure S2). For example, both canonical ncRNAs and annotated lncRNAs have stronger H3K27me1 signals than do coding genes; however, they tend to have weaker H3K4me3 modifications than do coding genes around the transcription start site (TSS) (Figure

This article is protected by copyright. All rights reserved.

Accepted Article

S2b).

Accordingly, we applied our supervised machine learning method, incRNA

(Gerstein et al., 2010, Lu et al., 2011), to characterize the lncRNAs by integrating 25 supporting features, including expression pattern, RNA structural and epigenetic features. To collect these features, we investigated 147 published datasets (including both expression and epigenetic data), sequenced ten RNA samples, and predicted seven sequence and structural scores (i.e., GC content, DNA conservation, protein conservation, RNA structure conservation, RNA structure stability, coding potential and homologs in Rfam) (Figure 2a and Table S2). We used Random Forest for a multi-class classification (CDS, unexpressed intergenic regions, lncRNA and Canonical ncRNAs). The ncRNAs annotated in TAIR10 (both canonical ncRNAs and lncRNAs) were used as the positive set, and the CDS and unexpressed intergenic regions (see Supplemental methods) were used as the negative set. As in Lu et al., 2011, we did 10 fold cross validation in the training set (2/3 of the data), and left an independent set (1/3 of the data) for further validation. We have made a supplemental website for the training and testing data sets, and the prediction model/tool. The web link is: http://plant.ncrnalab.org/stress. Performance of the model is evaluated by the predicting bins (100 nt) belonging to 4 classes. For the ROC curve, we used the bins being lncRNA or canonical ncRNA as positives, and the bins being unexpressed intergenic region or CDS as negatives. The model was able to predict the known ncRNAs with high accuracy (AUC of ROC is larger than 0.9; Figure S3a). We also tested the model by lncRNAs from different genomic locations: TE overlapping regions, antisense regions, intergenic regions, and cis

This article is protected by copyright. All rights reserved.

Accepted Article

regions (Supplemental methods). As expected, we found that antisense lncRNAs are the hardest type to be predicted. This is probably due to the undifferentiated feature values (e.g., sequence conservation and histone modification) between the complimentary strands. For each type of lncRNAs, we used different sub-sets of features to evaluate the model performance (Figure S3b-e). The model performs best when all features included. We found the expression features performed better than the epigenetic features. The sequence and structure features were the worst. The comparison further demonstrated that none of the individual type of features could accurately distinguish ncRNAs from other genomic elements. We then calculated the ncRNA probabilities on every 100 nt sequences (called bins) across the Arabidopsis genome, overlapped bins with annotated lncRNAs and coding transcripts, and assigned the averaged bin score to each. Using this approach, we were able to separate lncRNAs from coding transcripts with high accuracy (Figure S3f).

As described above, we assembled 995 lncRNA transcripts from the RNA-seq data based on multiple filters (Figure 1). Then we classified them into two confidence levels: 366 of them were labeled as level 1 lncRNAs because they were also confirmed by the incRNA model (FDR=0.05); the other 629 lncRNA transcripts were labeled as level 2 lncRNAs (Figure 1). In addition to the transcribed regions from our RNA-seq samples, we obtained 4,396 lncRNA loci (labeled as level 3) by merging the ncRNA bins predicted by the incRNA model with a very stringent cutoff This article is protected by copyright. All rights reserved.

Accepted Article

(FDR=0.001) in the intergenic regions. They were predicted as lncRNAs by nonexpression features, such as sequence, structure and epigenetic features. Besides, from the saturation plot of expression data, we observe that the level 3 lncRNA coverage is increasing with the numbers of RNA-seq samples, which suggests that more level 3 lncRNAs would be detected if more experimental samples were added (Figure S4). In addition, we compared the predicted intergenic lncRNAs with the lincRNAs reported by Liu et al., 2012. Although the biological samples and prediction methods are very different, we still found 91 (38.9%, 91/234) predicted lincRNAs (lncRNAs located in the intergenic regions) overlapped with 101 RepTAS predictions (Table S3).

Using the signal patterns of 25 features, we clearly separate different levels of predicted lncRNAs (Figure 2b; Figure S5). Because of the large standard deviations for feature scores across the whole genome (Figure S5), we carried out MannWhitney test for the normalized feature scores between two populations (the lncRNAs and the controls) in each feature. The lncRNA scores are significantly higher than controls in most of the 25 features (Figure 2b), which further indicates they are real transcripts, rather than just genomic noise. We noted that level 1 lncRNAs tended to be highly expressed, and had higher histone modification intensities for most of the active markers (e.g., H3K4me3 and H3K9Ac). In contrast, level 2 lncRNAs exhibited higher epigenetic modification levels in the repressive markers, such as DNA methylations, H3K27m1, H3K27me3 and H3K9me2 histone This article is protected by copyright. All rights reserved.

Accepted Article

modifications, and also higher DNA conservation and structural feature scores. These observations suggest that level 1 and level 2 lncRNAs might associate with different epigenetic marks. In addition, level 3 lncRNAs had significantly higher signals in structure stability, poly(A)+ RNA-seq and CHH DNA methylation compared to the controls. Furthermore, we divided both annotated and predicted lncRNAs into several subgroups according to their genomic locations (see Supplemental Methods) (Figure S6a), and investigated the feature signal distributions of these subtypes. We found different types of lncRNAs tend to have specific feature patterns (Figure 2c and Figure S6b). For example, TE or smRNA precursor lncRNAs have very high smRNA-seq signals, they are highly affected by repressive epigenetic marks, such as H3K9me2, H3K27me1 and DNA methylations. This is consistent with previous studies about small RNAs (Liu et al., 2010, Reinhart and Bartel 2002). In summary, we classified three groups of lncRNAs associated with different features. The characterization of different groups of lncRNAs by preferable features may help us to better reveal their molecular functions.

Classification of poly(A)+ and poly(A)- lncRNAs We compared the nine poly(A)- RNA-seq samples we sequenced with those of 12 public poly(A)+ RNA-seq samples (SRP000935) (Filichkin et al., 2010) in the same stressed conditions and classified the lncRNAs into poly(A)+ and poly(A)- transcripts as previous described in (Yang et al., 2011) (Figure 3a; see Methods). Most of the

This article is protected by copyright. All rights reserved.

Accepted Article

annotated lncRNAs were derived from poly(A)+ expressed sequence tag (EST) or cDNA libraries (Lamesch et al., 2012, Swarbreck et al., 2008), so we found more poly(A)- transcripts in the predicted lncRNAs (level 1 and level 2) than in the annotated lncRNAs (20.5% vs 5.8%). We also noticed that ~9.8% of coding genes were defined as poly(A)-, because they may have shorter poly(A) tails that were not enriched by oligo(dT) when constructing the cDNA libraries. To confirm the prediction of polyadenylation, we performed RT-PCR to validate poly(A)+ or poly(A)lncRNA candidates, also referred to the methods in (Yang et al., 2011). We randomly selected nine poly(A)+ lncRNA candidates and nine poly(A)- lncRNA candidates. All the predicted poly(A)+ lncRNAs were successfully enriched in poly(A)+ libraries. Among the nine poly(A)- lncRNA candidates, two were not detected by RT-PCR, six out of seven were successfully enriched in poly(A)- libraries (Figure 3b). These data confirmed our definitions of poly(A)+ or poly(A)- lncRNAs.

We then compared the sequence length, exon number and expression level between poly(A)+ and poly(A)- lncRNAs. Not surprisingly, lncRNAs were shorter than coding transcripts, had fewer exons and lower expression levels (Figure 3c and 3d; Figure S7), consistent with that observed in previous studies (Cabili et al., 2011, Liu et al., 2012, Nam and Bartel 2012, Pauli et al., 2012). Moreover, we found poly(A)- lncRNAs tended to be shorter and have fewer exons than poly(A)+ lncRNAs, with lower expression levels compared to poly(A)+ lncRNAs, measured by the median expression values among multiple datasets (Figure 3c and 3d). For This article is protected by copyright. All rights reserved.

Accepted Article

cases where RPKM=0, we added a small number (i.e., 0.001) to them, and then took the log2 transformation. The low expression of poly(A)- lncRNAs was also observed in our total RNA-seq data (Mann-Whitney test, p < 0.05). We performed the same analysis only on the level 1 lncRNAs, and observed the same trends (Figure S8). We found that poly(A)+ lncRNAs have significantly higher signals than do poly(A)lncRNAs in most situations with the exception of the poly(A)- RNA-seq feature and DNA methylation features (Figure S9). Poly(A)+ lncRNAs, compared with poly(A)lncRNAs, tend to have higher histone modification levels (Mann-Whitney test, pvalue < 0.05) for the active marks (i.e., H3K18Ac, H3K9Ac, H3K4me2, H3K4me3, H3K36me2 and H3K36me3). However, we did not observe significant differences between poly(A)+ and poly(A)- lncRNAs (Mann-Whitney test, p-value > 0.05) for the repressive marks (i.e., H3K27m1, H3K27me3 and H3K9me2). The low expression levels of poly(A)- lncRNAs are more associated with DNA methylation signals (Figure S9). In addition, we compared their sequence conservation across multiple plant species, and found that the conservation of predicted lncRNAs (both poly(A)+ and poly(A)-) occurred between coding genes and unexpressed intergenic sequences (Mann-Whitney test, p-value < 0.05) (Figure S10). This conservation is consistent with that found in previous studies of mammal lncRNAs (Cabili et al., 2011, Chodroff et al., 2010, Guttman et al., 2009, Marques and Ponting 2009, Ulitsky et al., 2011).

Expression pattern of lncRNAs responding to stresses This article is protected by copyright. All rights reserved.

Accepted Article

Previous studies have shown that a couple of lncRNAs have essential roles in the development and stress signaling transduction pathways of plants (Ben Amor et al., 2009, Shin et al., 2006, Swiezewski et al., 2009, Wang et al., 2014). Here, we systematically assayed the stress associated expression patterns of the lncRNAs, which were just identified in the Arabidopsis genome mentioned above (Figure 4a and Figure S11). We found that poly(A)- lncRNAs tended to be more specifically expressed, compared to either annotated poly(A)+ lncRNAs or predicted poly(A)+ lncRNAs (Figure 4a and Figure S11). To confirm these results, we calculated the stress specificity scores, similar to tissue specificity scores as defined in Cabili et al., 2011 (see Supplemental Methods), and found that poly(A)- lncRNAs do have higher stress specificity scores than do poly(A)+ lncRNAs, and all lncRNAs have higher stress specificity scores than do coding genes (Figure 4b). Since poly(A)- lncRNAs, poly(A)+ lncRNAs and coding genes have biased expression levels, we also performed the comparison using only transcripts with similar expression levels (log2(rpkm) between -1 and 3, Figure S12a). The results were still affirmative (Figure S12b). These results indicated that many poly(A)- lncRNAs might be selectively activated upon a specific stress stimulus, whereas the coding genes tend to be less specific and have synergetic responses to these stresses.

We further classified the differentially expressed lncRNAs by the stress conditions noted above (Figure 4c and 4d). The numbers of differential expressed lncRNAs in multiple stress conditions are shown in Figure 4d. In total, we found that 303 lncRNAs (including 245 poly (A)+ and 58 poly(A)- lncRNAs) were differentially expressed under at least one of the stress conditions (DEGseq MA-plot based This article is protected by copyright. All rights reserved.

Accepted Article

method, p-value = 2) (details are listed in Appendix 2). Compared with mRNAs, more lncRNAs tended to be differentially expressed under high light or drought, but fewer lncRNAs in responded to heat (χ² test, p-value < 0.05). (Figure 4c; Table S4). All these observations indicate that lncRNAs, especially poly(A)- lncRNAs, might have very different expression patterns compared to coding genes in response to stress.

To validate the differentially expressed lncRNA candidates detected by RNA-seq data, we tested them by qRT-PCR in a different biological replicate. Overall, 12 of the 15 drought-responsive candidates, 13 of the 15 salt-responsive candidates, 13 of the 18 cold-responsive candidates and 15 of the 18 heat responsive candidates were successfully validated (Figure S13).

Transcription factor targeting of stress-responsive lncRNAs We found that many stress-responsive lncRNAs we identified are supported by the binding signals of stress related transcription factors, such as PIF4 and PIF5. By calling genome-wide binding peaks of PIF4 and PIF5 using their ChIP-seq data (GSE35315 and GSE35059) (Oh et al., 2012), we identified 107 lncRNAs that were potential targets (see targeting definition in Supplemental methods). Among which, 15 poly(A)+ lncRNAs were differentially expressed under high light stress. Several lncRNA examples targeted by PIF4 and also up-regulated by high light stimulus are This article is protected by copyright. All rights reserved.

Accepted Article

shown in Figure 5a. We found that the lncRNAs and nearby coding genes may response differently to stresses (Figure 5b). For example, lnc-173 and its upstream gene AT3G43190 (SUCROSE SYNTHASE 4) are transcribed in opposite directions. Both of them are targeted by PIF4. Lnc-173’s expression is significantly induced in high light, but not in other stress conditions. However, AT3G43190 is only responsive to heat. This indicates that they are involved in different biological pathways. In another case, lnc-225 and AT4G27950 (CYTOKININ RESPONSE FACTOR 4) are both induced by the same stimuli. But they might use different promoters (two PIF4 binding peaks) under high light.

To further investigate the regulatory relationships between PIF4 and the targeted lncRNAs, we carried out high light treatment (Filichkin et al., 2010) for wild-type seedlings, pif4 mutant and pifq mutant (quadruple pif mutant: PIF1, PIF3, PIF4 and PIF5). We first verified that the pif4 and pifq mutants, and the decay of PIF genes in wild-type under high light (Figure S14). Then, the expression of lncRNAs and marker genes were validated by qRT-PCR (Figure 5b). We used GER3 and NPQ4 as marker genes, because they were reported to be negatively regulated by PIF genes in multiple datasets (Oh et al., 2012). In the normal condition of this study, we found that the marker genes and two lncRNAs (lnc-119 cannot be detected by qRT-PCR) are up-regulated in the mutants. This supports that the two lncRNAs are probably targeted and inhibited by PIFs. When PIFs were disrupted by the mutations, the inhibited targets were activated (Figure 5b). In high light condition, the response of This article is protected by copyright. All rights reserved.

Accepted Article

the lncRNAs is abolished in pifq mutant for lnc-225 and NPQ4, but not for lnc-173 and GER3. The high light response of lnc-173 and GER3 is probably rescued by other factors in pifq mutant. These data further demonstrate that multiple lines of evidence support our identification of stress-responsive lncRNAs. The connections with specific TF regulators may further indicate different regulatory mechanisms other than well-characterized signaling pathways.

Function prediction for stress-responsive lncRNAs To predict the potential biological functions of the lncRNAs, we associated them with coding genes by constructing a co-expression network (Guttman et al., 2009, Liao et al., 2011). We also added the sequence pattern search and RNA secondary structure prediction to the co-expressed modules to associate these conserved motifs with biological functions.

We combined poly(A)+ and poly(A)- RNA-seq data under stress conditions to build the co-expression network (see Supplemental Methods). In the co-expression network, there were totals of 9602 expressed coding genes and 319 lncRNAs (196 poly(A)+ lncRNAs, 42 poly(A)- lncRNA and 81 bimorphic lncRNAs) that were linked by 83,973 edges, among which 80,654 edges connected coding gene pairs, 103 edges connected lncRNA pairs, and 3,216 edges connected coding genes with lncRNAs (Figure 6a). Among 3,319 lncRNA related pairs, 713 pairs were within the This article is protected by copyright. All rights reserved.

Accepted Article

same chromosome, and only 8 lncRNAs were located within 10 kb of their neighbors. This suggested that these lncRNAs are involved in cis-regulation with their associated genes and most of the lncRNAs we identified might involve transregulation (De and Babu 2010, Williams and Bowles 2004).

We selected 184 lncRNAs connected by at least three neighbors. Of these, 1,567 neighbors were coding genes annotated by 618 Gene Ontology (GO) terms. And we found 338 GO terms significantly enriched in these co-expressed coding genes (FDR < 0.05). We selected 25 enriched GO terms related to stress response (Appendix 3), such as response to salt stress (GO:0009651, FDR = 3.4e-12), response to red or far red light (GO:0009639, FDR = 0.041), and response to cold (GO:0009409, FDR = 2.9e-18) et al.(Figure 6b). Coding gene and lncRNA pair connections in these three GO terms are shown in Figure 6a. In total, we found 99 lncRNAs associated with 218 coding genes in these 25 stress related GO term groups, including stress types (e.g., response to reactive oxygen species) other than the five conditions (i.e., cold, heat, salt, drought, and high light) in which we had the RNAs sequenced (Figure 6c; Appendix 3). There were also other categories of GO terms enriched, such as hormone related GO terms (e.g., response to auxin stimulus, FDR = 4.4e-14), metabolism related GO terms (e.g., oxidation reduction, FDR = 1.2e-11) and cell death related GO terms (e.g., apoptosis, FDR = 1.1e-3) (Figure 6c; Appendix 3). These biological processes are closely related to stress

This article is protected by copyright. All rights reserved.

Accepted Article

responses and intersect with each other, forming a complex network between lncRNAs and coding genes.

Conserved sequence motifs and structural motifs enriched within stressresponsive lncRNAs RNAs bound and regulated by the same proteins usually contain conserved motifs which can be either primary sequences or secondary structures (Ray et al., 2013). Here, we predicted and characterized the conserved motifs of lncRNAs in the same functional groups. From the above co-expression network, we classified 25 groups of lncRNAs associated with different stresses by co-expression analysis. In each group of lncRNAs, we further investigated the conserved sequence motifs using MEME (Bailey et al., 2009) (width = 4~12, allow 0 or 1 motif per sequence, E-value < 0.05). In summary, we found 16 groups including 95 lncRNAs that have at least one enriched stress-related sequence motif, and these motifs are predominantly enriched by A/G or U/C (Figure S15). For example, a UCC motif was found to respond to salt stress (Figure 6b, upper left). In addition, we found that these motifs were evolutionary conserved across plant species, compared to random sampled motifs in the genome (Mann Whitney test, p < 0.05; see details in Supplemental Methods). These conserved motifs could be recognized by other regulatory factors (e.g., RNA binding proteins) that cause the RNA molecules to rearrange their structures in response to stress conditions (Kwok et al., 2013, Ray et al., 2013, Rouskin et al., 2014). For this reason, they might play critical roles in the stress responses. In addition to sequence motifs, we also predicted conserved structural motifs in the lncRNAs grouped by the co-expression network using RNApromo (Rabani et al., This article is protected by copyright. All rights reserved.

Accepted Article

2008) (p-value < 0.001, maximum number of motifs = 5). All 25 groups, including 80 lncRNAs, had at least one enriched stress-related structural motif (Appendix 1). For example, a AU-rich stem-loop was found to respond to cold stress (Figure 6b, bottom right). To further verify the enriched motifs, we simulated background sequences as in (Zhong et al., 2010). For sequence motifs, we used MAST (Bailey et al., 2009) to search them in the background sequences. The enrichments is calculated by comparing the number of matched motifs in lncRNAs to that in the background sequences (Supplemental Methods). We found all sequence motifs in the 16 groups were significantly enriched (Fisher’s exact test, p-value

Characterization of stress-responsive lncRNAs in Arabidopsis thaliana by integrating expression, epigenetic and structural features.

Recently, in addition to poly(A)+ long non-coding RNAs (lncRNAs), many lncRNAs without poly(A) tails, have been characterized in mammals. However, the...
1MB Sizes 0 Downloads 5 Views