Genetica DOI 10.1007/s10709-014-9783-4

Technical challenges in metatranscriptomic studies applied to the bacterial communities of freshwater ecosystems Noe´mie Pascault • Valentin Loux • Sandra Derozier • Ve´ronique Martin • Didier Debroas • Selma Maloufi • Jean-Franc¸ois Humbert • Julie Leloup

Received: 5 May 2014 / Accepted: 9 August 2014  Springer International Publishing Switzerland 2014

Abstract Metatranscriptome analysis relates to the transcriptome of microbial communities directly sampled in the environment. Accessing the mRNA pool in natural bacterial communities presents some technical challenges such as the RNA extraction, rRNA depletion, and the choice of the high-throughput sequencing technique. The lack of technical details in scientific articles is a major problem to correctly obtained mRNA from a microbial community and thus the corresponding sequencing data. In our study, we present the methodological procedure that was developed in order to access to the metatranscriptome of the microbial communities during two cyanobacterial blooms successively occurring in a freshwater eutrophic lake. Each procedure step was detailed and discussed with regard to the choices and difficulties encountered and to the recent literature. Finally, the two major limits for metatranscriptomic approaches targeting bacterial communities

N. Pascault  J.-F. Humbert  J. Leloup (&) iEES Paris, UMR 7618 UPMC-CNRS-INRA-IRD-Paris 7-UPEC, 4 place Jussieu, 75005 Paris, France e-mail: [email protected] V. Loux  S. Derozier  V. Martin UMR MIG, INRA, Domaine de Vilvert, Jouy-en-Josas, France D. Debroas Laboratoire ‘‘Microorganismes: Ge´nome et Environnement’’, Clermont Universite´, Universite´ Blaise Pascal, BP 10448, 63000 Clermont-Ferrand, France D. Debroas UMR 6023, LMGE, CNRS, 63171 Aubiere, France S. Maloufi UMR 7245 CNRS-MNHN Mole´cules de Communication et Adaptation des Microorganismes, Case 39, 12 rue Buffon, 75231 Paris Cedex 05, France

from natural environments were (i) the removal of rRNA in order to increase the putative mRNA reads number after sequencing, and (ii) for most of the bacterial communities living in natural environments, the lack of reference genomes in databases that leads to the non-assignation of numerous reads. Once these challenges overcome, we managed to access putative mRNA of dominant species, i.e. cyanobacteria (from 6 to 72 % of mRNA assigned), and of the surrounding bacteria (from 1 to 5 % of mRNA assigned). Keywords Metatranscriptome  Messenger RNA  High-throughput sequencing  Illumina HiSeq 2000  Natural bacterial communities

Introduction In order to proliferate in natural ecosystems, organisms must rapidly respond to a wide variety of changing environmental conditions. This short-term response can be estimated by studying the population of mRNA (i.e. transcriptome), which is an active and responsive fraction. Two technological platforms have been developed to perform transcriptome analyses: microarrays and RNA-Seq (whole transcriptome shotgun sequencing using next generation sequencing technology). These platforms can contribute to the comprehensive re-annotation of bacterial genomes and to the quantification of gene expression at a high resolution (Filiatrault 2011; Ma¨der et al. 2011). One interesting advantage of RNA-Seq approach over microarrays being that sequencing does not require knowledge of genomic sequence (Ma¨der et al. 2011). Nowadays, due to the breakthrough of next generation of sequencing (NGS) development, environmental ‘‘omic’’

123

Genetica

approaches have moved to the next logical step: metatranscriptomic. This approach relates to the transcriptome of a microbial assemblage directly sampled in the environment without any prior information on which genes the community might be expressing (Poretsky et al. 2005). In this way, this is an extended version of the former transcriptomic approach. Knowing that the majority of microbial communities in sample environment are unknown and highly diversified, microarrays are not well adapted to perform metatranscriptome studies on them. On the other hand, the real benefit with sequencing approaches is to study the community without a priori, especially when no specific metabolisms are targeted. Herein, we decided to exclude metatranscriptome study carried out with microarrays in the present study. From 2008 to January 2014, we have listed all the articles dealing with a metatranscriptome approach present in Web of ScienceTM database (metatranscriptom*, Table 1). Aquatic environments were preferentially studied (39 publications) than soil environment (15). The discrepancy between aquatic and soil ecosystems could be explain by differences in the inherent characteristics of these habitats and by the fact that environmental heterogeneity is probably greater in soil than in aquatic ecosystem. One explanation could also reside in the physicochemical structure of soil aggregates leading in differences in extraction efficiency, the presence of humic acids (limiting the extraction efficiency and inhibiting PCR), and a higher RNA adsorption to soil aggregates during extraction (Carvalhais and Schenk 2013). Within the aquatic ecosystems, phytoplankton/cyanobacteria bloom appeared to be a choice model (15), certainly due to the fact that blooms constitute hotspot of microbial cellular activities, ensuring an high production of RNA (Hewson et al. 2009). The first ‘‘metatranscriptome’’ publications have proven to be very useful for investigating the functional diversity of a community and providing information on active processes within the dominant microorganisms (e.g. Bailly et al. 2007; Frias-Lopez et al. 2008; Hewson et al. 2009). FriasLopez et al. (2008) have recovered the expressed genes associated with key metabolic pathways in open ocean microbial species (photosynthesis, carbon fixation, nitrogen acquisition) as well as hypothetical proteins. In a marine ecosystem, Hewson et al. (2009) have investigated the gene expression of bacterial consortia associated with Trichodesmium blooms (marine N2 fixing cyanobacterium) during a dial cycle and showed that cyanobacterial transcripts were mostly represented by photosynthesis, N2 fixation and S-metabolism genes, whereas co-occurring bacterial transcripts were mostly involved in genetic information storage and processing. In addition to the experimental studies, Stewart (2013) and Carvalhais et al. (2012), have reviewed the different

123

Table 1 Summary of the metatranscriptomic studies published from January 2008 to January 2014. Searching was performed in January 2014 using Web of Science v.5.13.1 Keywords

Number of publications in January 2014

Metatranscriptom* AND water

39

Metatranscriptom* AND soil

15

Metatranscriptom* AND (phytoplancton OR cyanobacteria)

15

Metatranscriptom* AND Illumina

7

All the publications were manually checked to exclude inappropriate publications

methodologies used in metatranscriptomic approaches for the marine plankton and soil environments respectively. They underlined a common methodological outline for all the metatranscriptomic studies: sample collection and RNA preservation, RNA extraction, rRNA depletion, RNA amplification, cDNA synthesis and high-throughput sequencing and detailed each step with the dedicated custom and/or commercially methods. Even if some metatranscriptome studies have been realised (Table 1), the nature of the raw material (solid, liquid, soil, sediment, water…) brings some challenges in each new study prior analysis: sampling RNA to reach substantial quantity of RNA (activity of communities), avoiding degradation and modification (Carvalhais and Schenk 2013). First, the RNA extraction step could introduce bias, as its efficiency could be variable from one bacterial species to another (Carvalhais et al. 2012). Secondly, the RNA instability is one of the major concerns, mRNA half-lives being shorter than the rRNA ones in the minute range. This instability could also depend on environments and organisms, with a differential degradation of mRNA between species. It is known for example that eukaryote mRNAs are more stable than prokaryote mRNAs (Rauhut and Klug 1999; Stewart 2013). At last, in order to obtain enough material, most of the studies presented above have performed an mRNA amplification, that could mostly lead to preferential amplification of the dominant transcripts (Hewson et al. 2010), and consequently limit the interpretation of data in term of presence/absence of transcripts. Concurrently to these methodological constraints, sequencing technology has to be chosen: 454 technology gives long fragments (max 1 kb) but low output (up to 1 million reads) while Illumina technology gives shorter fragments (up to 300 bp) but higher output (up to 4 billion reads) (Liu et al. 2012; Lee et al. 2013). In parallel, 454 technology is well-known for reading errors in homopolymers stretches (more than 6) that is not observed with Illumina technology (Liu et al. 2012). If the majority of the metatranscriptome data have been obtained up to now by

Activated sludge

Small intestine microbiota

Kimchi (traditional Korean food) Marine water

Activated sludge

Mouse intestine

Muskoxen rumen

Cai et al. (2013)

Leimena et al. (2013)

Jung et al. (2013)

Yu and Zhang (2012)

Xiong et al. (2012)

Qi et al. (2011)

n.c. non communicated

Gifford et al. (2012)

Environment

Publications

No depletion

rRNA depletion

No depletion

Eukaryotic and prokaryotic rRNA depletion

16S and 23S rRNA depletion

16S and 23S rRNA depletion

No depletion

rRNA technical depletion

no

no

no

yes

no

no

no

mRNA ampification

Illumina GAIIx

Illumina GAIIx

Illumina

Illumina GAIIx

Illumina GAIIx

Illumina Hiseq 2000

Illumina Hiseq 2000

Sequencing technology

n.c.

85 %

\20

n.c.

n.c.

*94 %

\20

\20

*63 %

\10

n.c.

n.c

\20

n.c.

Reads after bioinformatic cleaning (%)

Bioinformatic trimming quality score

18 %

72 %

71 %

62 %

*61 %

*84 %

n.c.

rRNA after bioinformatic cleaning (%)

59 %

26 % (without mouse transcripts)

n.c.

34 %

*34 %

85 %

n.c

Assignation of putative mRNA (%)

Table 2 Current state of the literature dealing with the key words Metatranscriptom* AND Illumina (January 2014; Web of Science v.5.13.1)

nr/in-house database named as NRMO

nt and protein non redundant database

M5NR protein database

RefSeq

Complete genomes of the six representative LAB strains

NCBI genome database ? house bacterial genomes isolated from ileostomy effluent

Composite arsenic metabolism protein database (GenBank)

Database

Genetica

123

Genetica

using 454-technology (Stewart 2013), it is understood that Illumina sequencing will soon replace pyrosequencing as the choice method for low-cost and high-quality sequence generation, knowing that with both technologies, the preparation protocol for the RNA will remain the same. The few metatranscriptomic studies published with Illumina are listed in Table 2. Facing the different methodological constraints discussed above, we will present in this paper the methodologies applied to study the metabolisms of microbial communities during cyanobacterial blooms in a freshwater eutrophic lake. Two successive blooms were sampled during the summer season: one dominated by Anabaena sp. and one by Microcystis sp. Total RNA was extracted and followed a specific procedure in order to be sequenced with HiSeq 2000 Illumina technology and technical data will be discussed at each procedure step with regard to the literature.

Materials and methods Study sites, sampling and fraction separation The sampling site is a recreation lake located near the city of Champs-sur-Marne (Seine-et-Marne, Iˆle-de-France, France, 48510 47.0N, 02350 53.9E). This aquatic ecosystem has a surface area of 0.1 km2 with an average depth of 2.70 m, and since 2006, it knows several episodes of cyanobacterial blooms (C. Bernard and B. Marie, personal communication). In 2012, two distinct cyanobacterial species have bloomed during the seasonal phytoplanktonic proliferation: Anabaena sp. in July and Microcystis sp. in September. During each of them, two series of raw water were collected during 3 days with differential filtrations strategy to separate the free-living bacteria (AB-F for Anabaena bloom and MB-F for Microcystis bloom) and the cyanobacterial-attached bacteria (AB-A for Anabaena bloom and MB-A for Microcystis bloom) fractions. One series of raw water sampling corresponds to three differential filtrations per day, either 9 differential filtrations for 3 days. Differential filtration protocols were applied according to the cell size and the shape of the colonies of each cyanobacterial species. For Anabaena sp., water samples were first filtered through a 20 lm phytoplankton net and then 50 mL of filtrate were filtered through a 1.2 lm polycarbonate filter (Isopore Membrane Millipore, Dustcher, France) to concentrate Anabaena-attached bacterial fraction (AB-A). Then, the filtrate was passed through a 0.2 lm polycarbonate filter (Nuclepore Polycarbonate Whatman, Fischer, France) in order to concentrate the freeliving bacterial fraction (AB-F). For Microcystis sp., water

123

samples were also filtered through a 20 lm phytoplankton net, and then concentrated samples were directly store in cryotubes to compose the Microcystis-attached bacterial fraction (MB-A). In a second time, like for Anabaena, the filtrate was passed through a 0.2 lm polycarbonate filter (Nuclepore Polycarbonate Whatman, Fischer, France) to concentrate the other free-living bacterial fraction (MB-F) (Table 3). To minimize damages in the metatranscriptome of the communities, the whole filtration process in less than 10 min and all samples (filter or liquid) were directly frozen in liquid nitrogen and store at -80 C until RNA extraction.

RNA extraction RNA was extracted with FastRNA Pro Soil-Direct Kit (MP Biomedicals) following the manufacturer’s instructions with some adaptations to filter and water samples. Briefly one filter (AB-A, AB-F and MB-F) or 500 lL of MB-A fraction was placed in Lysing Matrix E tube complemented by 1 mL of lysis solution. The bead beating (FastPrep Instrument) was processed three times at 6.5 speed during 30 s; the tubes were stored in ice during 5 min between each process. The supernatants were then treated with 750 lL of Phenol:Chloroform (1:1, pH = 4, MP Biomedicals), and then with 200 lL of Inhibitor Removal Solution. RNA were precipitated with cold isopropanol (100 %) overnight, purified with RNAMATRIX-bound and eluted in DEPC-H2O. For each sample of the MB-A fraction, corresponding to one differential filtration, this extraction protocol was realized twice in order to have enough material. DNA contamination was removed by treating samples with DNaseI (Ambion). RNA quality and RNA integrity number (RIN) were checked with Agilent RNA 6000 Pico Kit on Agilent 2100 Bioanalyzer (Agilent Technologies) following the manufacturer’s instructions. The bioanalyzer software calculated RNA Integrity Number (RIN) (Schroeder et al. 2006). RNA quantity was measured with QubitTM RNA Assay Kits on the Qubit 2.0 Fluorometer.

RNA treatment Different steps were processed on RNA samples to optimize the bacterial mRNA content. First, in order to remove eukaryotic RNA, a capture hybridization procedure with MICROBEnrichTM kit (Ambion, Life technologies) was performed following the manufacturer’s instructions. The principle is based on magnetic beads, made up with oligonucleotides that hybridize and then capture the 18S rRNA, 28S rRNA and polyadenylated 30 ends of eukaryotic

Genetica Table 3 Quality and quantity survey of the RNA pool during the enrichment procedure Samples

Anabaena

Microcystis

RIN

Attached fraction

Filter

Free-living fraction

Filter

Attached fraction

Liquid

Free-living fraction

Filter

Extraction (ng)

Eukaryotic and prokaryotic rRNA removal (ng)

Prokaryotic rRNA removal (ng)

Removal efficiency (%)

Expected mRNA (%)

Repeat 1

5.7

8,720

4,356

2,688

69.2

3.24

Repeat 2

4.1

6,504

2,430

1,668

70.4

3.37

Repeat 1 Repeat 2

3.8 3.0

2, 024 2,912

903.6 1,461.6

600 1,131

74.4 61.2

3.90 2.57

Repeat 1

6.1

8,400

2,811.6

1,836

78.1

4.58

Repeat 2

7.0

9,600

3,236.4

2,283

68.2

3.14

Repeat 1

3.4

2,168

921.6

690

76.2

4.20

Repeat 2

3.3

2,072

914.4

765

63.1

2.71

RIN (RNA integrity number) was obtained with Agilent RNA 6000 Pico Kit on Agilent 2100 Bioanalyzer (Agilent Technologies) and RNA quantities were measured with QubitTM RNA Assay Kits on the Qubit 2.0 Fluorometer (in ng)

mRNAs. In a second step, 16S and 23S rRNAs were removed by using the MICROBExpressTM kit, based on the same principle as MICROBEnrichTM kit (Ambion, Life technologies). As these two kits are optimized to be successively used, RNA quality was checked after both processes with Agilent RNA 6000 Pico Kit on Agilent 2100 Bioanalyzer (Agilent Technologies) following the manufacturer’s instructions and RNA quantities were measured with QubitTM RNA Assay Kits on the Qubit 2.0 Fluorometer (Invitrogen). Third, a second round of bacterial ribosomal RNA removal was performed with mRNAONLYTM Prokaryotic mRNA Isolation Kit following the manufacturer’s instructions (Epicentre). In contrast with the previous kits, this one is based on 50 -PhosphateDependent Exonuclease, a processive 50 ? 30 exonuclease that digests RNA having a 50 monophosphate. At the end, the putative mRNAs were purified by phenol extraction and precipitated by ethanol. Then RNA qualities and quantities were checked and measured as described above.

(http://migale.jouy.inra.fr/galaxy/). In a first time, all the reads were controlled with FastQC for the assessment of the sequencing quality (Simon Andrews, Babraham Bioinformatics, 2011) (www.bioinformatics.babraham.ac.uk/ projects/). In a second step, the reads were cleaned with Sickle Tool (Joshi and Fass 2011; https://github.com/ najoshi/sickle) with the following criteria: quality threshold of 20; length threshold of 40 and discard of any read with any number of Ns activated. Then, the remaining rRNA reads were screened with SortMeRNA (Kopylova et al. 2012) against rRNA database SILVA Small subunit reference database (16S/18S) and SILVA Large subunit reference database (23S/28S). All the cleaned reads were fast assigned with KRAKEN (taxonomic sequence classification system) (http://ccb.jhu.edu/software/kraken/) (Wood and Salzberg 2014) against RefSeq microbial databank (release 62), allowing us to get a global idea of the composition of the pools of mRNA. The results of taxonomic assignation of putative mRNA were visualized with a web browser allowing building zoomable pie charts (Krona: Ondov et al. 2011).

High-throughput sequencing At least 600 ng of rRNA depleted/mRNA enriched total RNA for each series (2 AB-A, 2 AB-F, 2 MB-A, 2 MB-F) have been used for the construction of random primed cDNA libraries and were then sequenced (100pb single read) in two lanes of an Illumina HiSeq 2000 (GATC Biotech, Mulhouse, France).

Nucleotide read accession numbers The nucleotide reads have been deposited to the Sequence Read Archive (http://trace.ncbi.nlm.nih.gov/Traces/sra/) under experiments: SRX528308, SRX528794, SRX528795, SRX528797, SRX528799, SRX528800, SRX528801, SRX528802.

Reads analysis Results–discussion The first reads analyses were realized under Galaxy webbased platform (Goecks et al. 2010; Blankenberg et al. 2010; Giardine et al. 2005) with a local Galaxy instance

In most of the studies dealing with metatranscriptome, materials and methods described the procedure used to

123

Genetica

obtain mRNA and the corresponding reads, without explaining the results obtained at each procedure step. Herein, we present the methodological procedure to access the metatranscriptome of two distinct cyanobacterial blooms containing thousand of different bacterial cells, with a special attention to define and discuss each step. Acquisition and preservation of RNA Bacterial mRNAs are highly sensitive as they can rapidly be digested in the presence of RNase enzymes and their half-lives are around a few minutes (Rauhut and Klug 1999; Schroeder et al. 2006). Then, special cares were taken with the sampling duration: less than 10 min between lake sampling and freezing in liquid nitrogen. In order to sample enough RNA material (a min of 2 lg in this study), we have opted for multiple filtrations of small quantities not to exceed the time set (9 differential filtrations during 3 days per series). The sampling strategy will allow minimizing an artefact picture corresponding to a limited snapshot of the activities occurring during the bloom. The filtered volumes are listed in the Table 3, with a maximum of 735 mL. In comparison to bloom sampling in the Ocean (Hewson et al. 2009: 3 L; Gifford et al. 2012: 6–8 L), these volumes are quite low due to the high bacterial abundances; above 1 9 1010 cyanobacteria cells L-1 for our samples (data not shown), in comparison to 4.2 9 109 cells L-1 in southern coastal waters (Gifford et al. 2010) and 1.13 9 104 trichomes L-1 in Trichodesmium bloom occurring in open Pacific Ocean (Hewson et al. 2009). After the sampling procedure, the next challenge in the metatranscriptome analysis concerns the RNA extraction. In our study, the extraction method was optimized according (i) to the cyanobacterial species and (ii) to the type of sample (i.e. filter or liquid). Its principle is based on a combination of mechanic and chemical lyses, with a very effective action of bead-beating to break the refractory components (i.e. aggregates, cell walls, and extracellular polymeric substances) and the polycarbonate filters. The extracted RNA quantity ranged from 2,024 to 9,600 ng per sample (Table 3). Whatever the sample type (filter vs. liquid), the RNA amounts were of the same order of magnitude. Thus, the RNA quality and quantity were determined with the Bioanalyzer (Agilent) that provides electropherograms (Fig. 1). For all samples, the higher peaks corresponded to the ribosomal RNA (Fig. 1a). The first eluted peak between 1,000 and 2,000 nt size, encompassed the pro- and eukaryotic rRNA small subunits and the two peaks, above 2,000 nt size, corresponded to the large subunits of the pro- and eukaryotic rRNA. In the literature, only very few articles displayed their electropherogram data, despite the interest of this information to

123

check quality and composition of the RNA samples (Bailly et al. 2007; Stewart 2013). The rRNA content analysis can be performed by either amplicon sequencing (an insight in the identity of the present cells that can be PCR-detected) or metatranscriptome analysis (an insight in the identity and the activity level of the living cells), while the only way to analyse environmental mRNA without any a priori is the metatranscriptome analysis. Thus, one challenge of this approach is to remove rRNAs, as mRNAs only represents about 1 to 5 % of the RNA pool (He et al. 2010). Herein, we have performed three different treatments commonly used in metatranscriptomic studies for removing the eukaryotic rRNAs (in one step), and the bacterial rRNAs (in two steps) (He et al. 2010; Stewart 2013). After the removal of eukaryotic RNAs, which is combined to the first step of the removal of prokaryotic rRNAs, the resulting RNA quality was visualized in an electropherogram (Fig. 1b) and compared to the previous one (Fig. 1a). Based on the fluorescence emission, a diminution (around 60 %, see Table 3) of the rRNA peaks was observed, explaining the need for a further rRNA removal step. After this second step (Fig. 1c), the three major peaks (eukaryotic and prokaryotic rRNAs) have disappeared, and a broad scattered hump between 200 to 2,000 nt encompassed the remaining ribosomal RNA, some possible broken rRNA residues and the targeted mRNA. Based on the RNA quantity measure (Table 3), these RNA treatments allowed eliminating about 70 % of total RNA; a similar RNA removal rate was also obtained by Leimena et al. (2013). The RIN values (RNA integrity number,—an indicator of the degradation state of the RNA pool) were ranging from 3.3 to 7. These values were below 7 and might indicate that the rRNA removal was not completely efficient or might have degraded the mRNA pool, thus we could expect that the most ‘‘fragile’’ mRNA might have been degraded during treatment, principally the free-living fractions that were showing the lowest RIN values (Table 3) (He et al. 2010; Stewart 2013). Moreover, due to the fact that our samples were isolated in a bloom of cyanobacteria, which is considered as a hot-spot of microbial activities, there is probably a high turnover of the rRNA pool. Thus degraded rRNA might have polluted the samples since their collection on field. As previously stated, 1 to 5 % of the prokaryotic total RNA pool is mRNA, if we consider that mRNA was not affected by the removal treatments and that eukaryotic RNA removal was effective, we have estimated the percentage of mRNA remaining in the samples after enrichment, thus we could expect that 2.7 to 4.5 % of the total reads would be putative prokaryotic mRNA (Table 3 and Table 4). Although all the rRNAs were not fully removed and the RIN values were lower than recommended (He et al. 2010),

Genetica

Fig. 1 Quality electropherograms (Agilent technology) of total RNA from Microcystis-Attached bacterial fraction during the enrichment procedure: a after RNA extraction, b after eukaryotic RNA and

prokaryotic rRNA removal (combination of MICROBEnrichTM and MICROBExpressTM kits), and c after mRNA enrichment (mRNAONLYTM Prokaryotic mRNA Isolation Kit)

Table 4 Comparative results after bio-informatics cleaning steps (FastQC, Sickle and SortMeRNA softwares) Samples

Anabaena

Attached fraction Free-living fraction

Microcystis

Attached fraction Free-living fraction

FastQC

Sickle

SortMeRNA

Total reads Nb

Removed reads (%)

Eukaryotic rRNA (%)

Prokaryotic rRNA (%)

mRNA putative reads

mRNA putative reads (%)

Repeat 1

60,788,880

1.9

63.78

32.42

2,264,565

3.8

Repeat 2

34,796,052

1.9

65.39

30.97

1,243,313

3.6

Repeat 1

37,421,192

1.8

66.41

26.49

2,606,753

7.1

Repeat 2

47,882,993

2.0

67.22

24.94

3,675,259

7.8

Repeat 1

33,274,737

2.2

74.33

19.59

1,978,216

6.1

Repeat 2

33,292,662

2.0

73.1

22.42

1,460,542

4.5

Repeat 1 Repeat 2

44,765,879 42,023,487

1.9 1.9

66.73 65.11

25.98 29.27

3,204,145 2,317,465

7.3 5.6

no further treatments were performed in order to minimize the mRNA loss and damages. For each sample, at least 600 ng was obtained, as required by the NGS-manufacturer’s instructions (a minimum of 500 ng of rRNA depleted/ mRNA enriched total RNA). Thus, it was unnecessary to perform the additional in vitro transcription to synthesize aRNA (generating multiple copies of antisense aRNA from the ds cDNA templates), commonly used in some studies but generating possible biases by preferential amplification (Stewart 2013). Consequently, the libraries preparation for Illumina sequencing, incorporating the reverse transcription to synthesize first strand cDNA and the second strand cDNA synthesis, was directly performed on the mRNA pools. Acquisition of reads After sequencing, from 33,274,737 to 60,788,880 reads of 101 bp were obtained per sample (Table 4). The quality was checked with FastQC that provides a summary of the quality metrics, including a graph with quality scores across all reads positions. As expected with Illumina sequencing (Liu et al. 2012), the FastQC analysis showed an overall high read quality but also a low decrease quality at the read tail (Fig. 2a). To reduce this source of bias, all

reads with an average quality score below 20 per base were deleted from the dataset, conducting to the removal of 1.8 to 2.2 % of the reads per sample (Table 4). In comparison to the quality score obtained in other studies (Table 2), our data displayed a higher quality, most of them presenting a quality score per base [28 (Fig. 2b). After screening the quality score, the next essential step resides in the removal of the remaining rRNA reads. As reported in several studies (Table 2) and previously visualized in our electropherograms and the RIN values (Fig. 1 a–c and Table 3), rRNAs could constitute a large noise background in the dataset. The software SortMeRNA (under Galaxy instance) was used to remove these rRNA reads from the dataset (Kopylova et al. 2012). The Table 4 summarized the read number at each bioinformatic step. About 2 millions of mRNA putative reads (3.6 to 7.8 %) were obtained for each sample, which is consistent with the expected mRNA proportion (Table 3). In parallel, rRNA reads still constituted 92–96 % of the datasets, which is consistent with our estimation of the rRNA removal efficiency based on the electropherogram visualisation (Tables 3 and 4). The remaining eukaryotic rRNA pool still constituted between 63 and 74 % of rRNA total (Table 4). Thus the efficiency of the eukaryotic RNA removal must

123

Genetica

Fig. 2 Per base read qualities determined by FastQC obtained for Anabaena-attached bacterial fraction before a and after b the bioinformatic cleaning step

have been weak, and the remaining eukaryotic RNA pool must have been underestimated when determined via the fluorescence emission in the electropherograms (Agilent 2100 Bioanalyszer, see above), similar drawback was also mentioned in Stewart (2013). The contamination by rRNAs is a major problem commonly encountered when studying metatranscriptome and transcriptome. As reported in Table 2, after technical rRNA depletion, the authors have obtained from 18 to 84 % of rRNA still remaining in their dataset. This proportion seemed to be environment-dependent. In our study, the abundant and very active populations of cyanobacteria during a bloom might explain this high proportion of rRNA (Hewson et al. 2010). As proposed by several authors, a way to overcome this problem consists in increasing the sequencing depth instead of technical removal rRNA (Croucher and Thomson 2010). This solution can increase the quantity of non-useful data and the associated storage cost, which at a long term could be more expensive than the sequencing itself (Stein 2010). For example, in our study, without RNA treatments, we can estimate that three times more of raw sequencing data should have been needed to obtain the same number of putative mRNA reads, i.e. 246 GB (82 GB 9 3). Consequently, in our study, the best option remained the improvement of rRNA technical removal. Thus, the balance between deeper sequencing and technical removal of rRNAs must be estimated according to the state of microbial community activity. After rRNAs removal from the dataset, 1,243,313 to 3,675,259 reads were obtained in each sample (Table 4). To get an overview of the taxonomic assignation of the putative mRNA, they were analysed with Kraken, a fastsoftware that gives an approximate assignation by exact alignments of k-mers against RefSeq bacterial genomes

123

reference database (http://www.ncbi.nlm.nih.gov/books/ NBK21091/). From 11 to 73 % of all the mRNA reads could be assigned using RefSeq microbial database (Fig. 3). The highest percentages were obtained for the Microcystis sp. bloom fractions, MB-A (attached fraction) and MB-F (free fraction). In these fractions, most of the assigned mRNA reads were affiliated to Cyanobacteria phylum. In the same way, in the AB-A fraction that corresponds to the attached fraction of the Anabaena sp. bloom, among the 43 % of the assigned reads, cyanobacterial reads were largely dominant (Fig. 3). These observations are consistent when considering that both cyanobacterial genera were highly predominant during the sampling period (C. Bernard and S. Maloufi, personal communication). In addition, the substantial representation of cyanobacterial genomes in Refseq database probably contributes to the high number of assigned reads in these ‘‘cyanobacteria-attached’’ fractions. On the other hand, in the free-living fractions that were expected to contain a few cyanobacterial cells (cyanobacterial cells not retained by the 1.2 lm filter or free cyanobacterial RNAs), the percentage of annotated reads was lower (Fig. 3). These findings highlight one limit when using RefSeq database for metatranscriptome analyses: the lack of sequenced and annotated reference genomes for most of the bacteria living in natural environments (24,302 reference microbial genomes on RefSeq vs. 6.105–6.106 estimated microbial species on Earth). Moreover, this major problem when studying metatranscriptome is independent of the chosen NGS methodology (454 vs. Illumina technology). In the literature (Table 2), several authors have also highlighted large variations (from 26 to 85 %) in the transcripts percentage that could be assigned. These variations are dependent of the targeted environment. For example, when working on the small intestine microbiota that is well

Genetica

Fig. 3 Taxonomic assignation of the putative mRNA reads for (AB-A) Anabaena-attached bacterial fraction, (AB-F) Anabaena free-living bacterial fraction, (MB-A) Microcystis-attached bacterial fraction and (MB-F) Microcystis free-living bacterial fraction

described, around 85 % of the putative mRNAs could be assigned (Leimena et al. 2013). Alike in our study, the presence of a few dominant organisms with available genomes arises the percentage of assigned transcripts. A recent perspective offers to use single-cell genome sequencing to obtain the genome of a dominant uncultivated bacterium directly from the environment, in combination with metatranscriptomic to improve the mRNA putative assignation (Embree et al. 2013). Another alternative strategy could also consist in targeting specific metabolism and building a corresponding protein database (see for example Cai et al. 2013 with arsenic metabolism, Table 2). In conclusion, the major limits for the metatranscriptomic analysis of aquatic ecosystems were (i) the removal of rRNA (mostly eukaryotic ones) in order to increase the proportion of prokaryotic mRNA reads, and (ii) the lack of reference genomes of most bacterial

populations in databases leading to the non-assignation of numerous reads. We have also observed that many of the technical challenges experienced during the implementation of a metatranscriptomic approach (from the sampling to the assignation of the reads) are environment dependent, and consequently will conduct to the selection of the available technical alternatives, such as the removal of eukaryotic and prokaryotic RNAs without damaging the mRNAs and the choice of the sequencing technology. In our case, a minimum of 2 ng starting RNA material was sufficient for the library construction before sequencing, and fortunately no mRNA amplification was necessary. Although the RIN values appeared to be quite low, we managed to access to putative mRNA (from 3 to 8 % of the total dataset) of dominant species, i.e. cyanobacteria (from 6 to 72 % of mRNA assigned) and of the surrounding bacteria (from 1 to 5 % of mRNA assigned).

123

Genetica Acknowledgments This study was funded by the PHYCOCYANO project in the Jeunes Chercheurs—Jeunes Chercheuses program of the French ANR (Agence Nationale de la Recherche; ANR-11-JSV7-01401). C. Bernard and B. Marie (National Museum of Natural History, UMR MCAM 7245 MNHN-CNRS) are greatly thanked for their collaboration, and input to the data collection. We also would like to thank L. Albaric and A. Roullot for access to the Champ-sur-Marne recreational area. And we are grateful to the INRA MIGALE bioinformatics platform (http://migale.jouy.inra.fr) for providing computational resources. The anonymous reviewers are also thanked for the useful comments improving the quality of the manuscript. Conflict of interest of interest.

The authors declare that they have no conflict

References Bailly J, Fraissinet-Tachet L, Verner M-C, Debaud J-C, Lemaire M, We´solowski-Louvel M, Marmeisse R (2007) Soil eukaryotic functional diversity, a metatranscriptomic approach. ISME J 1:632–642 Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J (2010) Galaxy: a web-based genome analysis tool for experimentalists. Curr Protocols Mol Biol 19:Unit 19.10.1-21 Cai L, Yu K, Yang Y, Chen B-W, Li X-D, Zhang T (2013) Metagenomic exploration reveals high levels of microbial arsenic metabolism genes in activated sludge and coastal sediments. Appl Microbiol Biotechnol 97:9579–9588 Carvalhais LC, Schenk PM (2013) Methods in ENZYMOLOGY, volume 531 microbial metagenomics, metatranscriptomics, and metaproteomics. In: DeLong EF (ed) Sample processing and cDNA preparation for microbial metatranscriptomics in complex soil communities. Academic Press, Burlington, pp 251–267 Carvalhais LC, Dennis PG, Tyson GW, Schenk PM (2012) Application of metatranscriptomics to soil environments. J Microbiol Methods 91:246–251 Croucher NJ, Thomson NR (2010) Studying bacterial transcriptomes using RNA-seq. Curr Opin Microbiol 13:619–624 Embree M, Nagarajan H, Movahedi N, Chitsaz H, Zengler K (2013) Single-cell genome and metatranscriptome sequencing reveal metabolic interactions of an alkane-degrading methanogenic community. ISME J 8:757–767 Filiatrault MJ (2011) Progress in prokaryotic transcriptomics. Curr Opi Microbiol 14:579–586 Frias-Lopez J, Shi Y, Tyson GW, Coleman ML, Schuster SC, Chisholm SW, DeLong EF (2008) Microbial community gene expression in ocean surface waters. PNAS 105:3805–3810 Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A (2005) Galaxy: a platform for interactive largescale genome analysis. Genome Res 15:1451–1455 Gifford SM, Sharma S, Rinta-Kanto JM, Moran MA (2010) Quantitative analysis of a deeply sequenced marine microbial metatranscriptome. ISME J 5:461–472 Gifford SM, Sharma S, Booth M, Moran MA (2012) Expression patterns reveal niche diversification in a marine microbial assemblage. ISME J 7:281–298 Goecks J, Nekrutenko A, Taylor J, The Galaxy Team (2010) Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11:R86

123

He S, Wurtzel O, Singh K, Froula JL, Yilmaz S, Tringe SG, Wang Z, Chen F, Lindquist EA, Sorek R, Hugenholtz P (2010) Validation of two ribosomal RNA removal methods for microbial metatranscriptomics. Nat Methods 7:807–812 Hewson I, Poretsky RS, Dyhrman ST, Zielinski B, White AE, Tripp HJ, Montoya JP, Zehr JP (2009) Microbial community gene expression within colonies of the diazotroph, Trichodesmium, from the Southwest Pacific Ocean. ISME J 3:1286–1300 Hewson I, Poretsky RS, Tripp HJ, Montoya JP, Zehr JP (2010) Spatial patterns and light-driven variation of microbial population gene expression in surface waters of the oligotrophic open ocean. Env Microbiol 12:1940–1956 Joshi NA, Fass JN (2011) Sickle: a sliding-window, adaptive, qualitybased trimming tool for FastQ files (Version 1.21) [Software]. https://github.com/najoshi/sickle Jung JY, Lee SH, Jin HM, Hahn Y, Madsen EL, Jeon CO (2013) Metatranscriptomic analysis of lactic acid bacterial gene expression during kimchi fermentation. Int J Food Microbiol 163:171–179 Kopylova E, Noe´ L, Touzet H (2012) SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics 28:3211–3217 Lee C-Y, Chiu Y-C, Wang L-B, Kuo Y-L, Chuang EY, Lai L-C, Tsai M-H (2013) Common applications of next-generation sequencing technologies in genomic research. Transl Cancer Res 2:33–45 Leimena MM, Ramiro-Garcia J, Davids M, van den Bogert B, Smidt H, Smid EJ, Boekhorst J, Zoetendal EG, Schaap PJ, Kleerebezem M (2013) A comprehensive metatranscriptome analysis pipeline and its validation using human small intestine microbiota datasets. BMC Genom 14:530 Liu L, Li Y, Li S, Hu N, He Y, Pong R, Lin D, Lu L, Law M (2012) Comparison of next-generation sequencing systems. J Biomed Biotechnol, Article ID 251364, 11 pp Ma¨der U, Nicolas P, Richard H, Bessie`res P, Aymerich S (2011) Comprehensive identification and quantification of microbial transcriptomes by genome-wide unbiased methods. Current Opin Biotech 22:32–41 Ondov BD, Bergman NH, Phillippy AM (2011) Interactive metagenomic visualization in a web browser. BMC Bioinformatics 12:385 Poretsky RS, Bano N, Buchan A, LeCleir G, Kleikemper J, Pickering M, Pate WM, Moran MA, Hollibaugh JT (2005) Analysis of microbial gene transcripts in environmental samples. Appl Env Microbiol 71:4121–4126 Qi M, Wang P, O’Toole N, Barboza PS, Ungerfeld E, Leigh MB, Selinger LB, Butler G, Tsang A, McAllister TA, Forster RJ (2011) Snapshot of the eukaryotic gene expression in muskoxen rumen–a metatranscriptomic approach. PLoS One 6:e20521 Rauhut R, Klug G (1999) mRNA degradation in bacteria. FEMS Microbiol Rev 23:353–370 Schroeder A, Mueller O, Stocker S, Salowsky R, Leiber M, Gassmann M, Lighfoot S, Menzel W, Granzow M, Ragg T (2006) The RIN: an RNA integrity number for assigning integrity values to RNA measurements. BMC Mol Biol 7:3 Stein LD (2010) The case for cloud computing in genome informatics. Genome Biol 11:207 Stewart FJ (2013) Methods in ENZYMOLOGY, volume 531 microbial metagenomics, metatranscriptomics, and metaproteomics. In: DeLong EF (ed) Preparation of microbial community cDNA for metatranscriptomic analysis in marine plankton. Academic Press, Burlington, pp 187–218 The NCBI handbook [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; 2002 Oct. Chapter 18, the reference sequence (RefSeq) project. http://www.ncbi.nlm.nih.gov/books/NBK21091/

Genetica Wood DE, Salzberg SL (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 15:R46 Xiong X, Frank DN, Robertson CE, Hung SS, Markle J, Canty AJ, McCoy KD, Macpherson AJ, Poussier P, Danska JS, Parkinson J (2012) Generation and analysis of a mouse intestinal

metatranscriptome through Illumina based RNA-sequencing. PLoS One 7:e36009 Yu K, Zhang T (2012) Metagenomic and Metatranscriptomic analysis of microbial community structure and gene expression of activated sludge. PLoS One 7:e38183

123

Technical challenges in metatranscriptomic studies applied to the bacterial communities of freshwater ecosystems.

Metatranscriptome analysis relates to the transcriptome of microbial communities directly sampled in the environment. Accessing the mRNA pool in natur...
720KB Sizes 0 Downloads 6 Views