AEM Accepted Manuscript Posted Online 24 April 2015 Appl. Environ. Microbiol. doi:10.1128/AEM.00111-15 Copyright © 2015, American Society for Microbiology. All Rights Reserved.
1
Evaluation of the Ion Torrent PGM for gene-targeted
2
studies using amplicons of the nitrogenase gene, nifH
3 4
Bangzhou Zhanga,b, C. Ryan Pentona, Chao Xuea,c, Qiong Wanga, Tianling Zhengb#,
5
James M. Tiedjea#
6 7
Running title: Evaluation of PGM for gene-targeted studies using nifH
8 9
Center for Microbial Ecology, Michigan State University, East Lansing, Michigan, USAa;
10
State Key Lab of Marine Environmental Science and Key Lab of the MOE for Coastal
11
and Wetland Ecosystems, School of Life Sciences, Xiamen University, Xiamen, Chinab;
12
Jiangsu Collaborative Innovation Center for Solid Organic Waste Utilization and National
13
Engineering Research Center for Organic-based Fertilizers, Department of Plant
14
Nutrition, Nanjing Agricultural University, Nanjing, Chinac
15 16 17
#
Address correspondence to James M. Tiedje and Tianling Zheng:
[email protected],
[email protected] 18
1
19
ABSTRACT
20
The Ion Torrent Personal Genome Machine (PGM), which employs semiconductor
21
technology to measure pH changes in polymerization events, has recently upgraded its
22
sequencing chips and kits. The quality of PGM sequences has not been re-assessed, and
23
results have not been compared in the context of a gene-targeted microbial ecology study.
24
To address this, we compared sequence profiles across available PGM chips and
25
chemistries and with 454 pyrosequence data by determining error type and rate and
26
diazotrophic community structure. The PGM was then used to assess differences in
27
nifH-harboring bacterial community structure among four corn-based cropping systems.
28
Using our suggested filters from mock community analyses, the overall error rates were
29
0.62%, 0.36%, and 0.39% per base for chips 318 and 314 with the 400 bp kit and 318
30
with the Hi-Q chemistry, respectively. Compared with the 400 bp kit, the Hi-Q kit
31
reduced indel rates by 28-59% and produced 1-7 times more acceptable reads for
32
downstream analyses. The PGM produced higher frameshift rates than pyrosequencing
33
that were corrected by the RDP’s FrameBot tool. Significant differences among platforms
34
were identified, although diversity indices and overall site-based conclusions remained
35
similar. For the cropping system analyses, a total of 6,182 unique NifH OTUs at 5%
36
amino acid dissimilarity were obtained. Current crop type as well as crop rotation history
37
significantly influenced the composition of the detected soil diazotrophic community.
38
2
39
INTRODUCTION
40
Next-generation sequencing (NGS) technologies have evolved rapidly in the years
41
since pyrosequencing was initially launched in 2005 (1), followed by light imaging
42
Illumina (2, 3) HiSeq and MiSeq and by post-light sequencing Ion Torrent (4, 5) Personal
43
Genome Machine (PGM) and Proton. Sequencing quality (error type and rate) and length
44
has suffered with NGS platforms (3, 6, 7) compared to Sanger sequencing, but the greatly
45
improved sequence yields and reduced cost have resulted in NGS dominating use. The
46
higher error rate has been problematic for microbial ecology, but by parallel sequencing
47
of mock communities of known composition, the type and extent of insertion, deletion or
48
substitution errors can be characterized. This information can then be used to determine
49
the proper parameters and filters for parsing raw sequence data to minimize the impact of
50
these errors on biological conclusions (8, 9).
51
To date, sequencing platform comparisons or case studies employing the Ion Torrent
52
PGM were mainly focused on microbial whole genome sequencing (6, 7, 10, 11), or on
53
bacterial community structure using 16S rRNA gene multiplexing tags (12-14). A few
54
studies applied PGM in medical diagnosis using marker genes (15-17). No studies have
55
utilized eco-functional genes to evaluated Ion Torrent PGM error profiles with 400 bp or
56
the recently released Hi-Q kits or directly comparing the PGM data against another
57
sequencing platform.
58
In this study, we determined the error types and rates and the sequencing output of the
59
Ion Torrent PGM using chip 314 with Sequencing 400 Kit and chip 318 with both 3
60
Sequencing 400 and Hi-Q Sequencing Kits. Mock communities were analyzed to guide
61
the use of appropriate filters for removal of low-quality reads. Sequencing error profiles
62
of the Ion Torrent PGM and Roche 454 platforms were directly compared using the same
63
samples to determine differences in community structure, diversity indices,
64
presence/absence, and relative abundances of major OTUs. Lastly, we utilized the Ion
65
Torrent PGM chip 318 with Hi-Q Sequencing Kit to sequence nifH amplicons from bulk
66
soil samples as a case study to characterize the nitrogen-fixing bacterial communities
67
among four cropping systems that varied in crop type, rotation, and cover crop.
68 69
MATERIALS AND METHODS
70
Community DNA. Genomic DNA samples consisted of a defined community (mock)
71
containing
72
Desulfitobacterium hafniense DCB-2, and Burkholderia vietamensis G4, six Oklahoma
73
prairie soil samples from 0 to 15 cm sequenced by 454, and 12 bulk soil samples from
74
four different cropping systems. The Oklahoma grassland samples were from warming
75
treatments (T) and control (C) sites from a tallgrass prairie at the Great Plains Apiaries
76
site (34°58’54”N, 97°31’14”W) where experimental warming has been underway for
77
over a decade (18-20). The DNA of the Oklahoma samples was extracted using a
78
mechanical lysis freeze-griding method (21). The 12 bulk soil samples (G1-G4, 3 plots
79
each) were collected at Great Lakes Bioenergy Research Center intensive site at
80
Arlington, Wisconsin (43°18’9.47’’N, 89°20’43.32’’W) in July 2013. The cropping
nifH-harboring
strains
Polaromonas
4
naphthalenivorans
CJ2,
81
system G1 was continuous corn; G2 was corn+cover crop annually, while G3 and G4
82
were soybean + cover crop and corn + cover crop, which were in a corn-soybean rotation
83
since 2012. Previously G2, G3, and G4 were in a corn-soybean-canola rotation from
84
2008-2011 (See Table S1 for the cropping history and Table S2 for soil properties). Six
85
cores (2.5 cm diameter × 25 cm depth (0-25 cm)) collected from the same plot were
86
sieved (4 mm) and mixed together into one sample. Well-mixed soil (0.3 g) was used for
87
DNA extraction with the Power Soil DNA Isolation Kit (MOBIO, CA, USA) according
88
to the manufacturer’s protocol. The DNA was quantified with a Nanodrop ND-1000
89
spectrophotometer (Nanodrop Technology) and stored at -20ºC for the following
90
application.
91 92
NifH library generation and sequencing. The nifH gene was amplified using the Poly
93
primers (22) in triplicate PCR reactions using fusion primers (Table S3). PCR mixtures
94
for both PGM and pyrosequencing contained 1 × green buffer (PROMEGA, WI, USA),
95
1.8 mM MgCl2 (PROMEGA), 0.5mM dNTP (PROMEGA), 500 nM each primer (IDT,
96
IA, USA),
97
(PROMEGA), and
98
for both PGM and pyrosequencing were performed as follows: 3 min at 95°C, 30 cycles
99
of 45 s at 94°C, 45 s at 61°C (62°C for 454), and 1 min at 72°, followed by final
100
extension for 7 min at 72°C. PCR products were checked on a 1.6 % agarose gel using
101
SYBR Safe gel stain (Invitrogen, CA, USA) and extracted from the gel using the
0.1 mg/mL BSA (NEB, Massachusetts, USA ), 2.5 U Taq polymerase 1 ng/µL template DNA in a 20 µL reaction volume. Amplifications
5
102
QIAquick Gel Extraction Kit (QIAGEN, MD, USA). The eluted DNA was purified again
103
using the Qiagen PCR Purification Kit (QIAGEN). The mock community was amplified
104
for Ion Torrent sequencing using two barcoded primers to produce M-1 and M-2 samples.
105 106
For PGM and pyrosequencing, each purified sample was quantified by the Qubit
107
Fluorometer (Invitrogen). Equivalent DNA of each sample was pooled, and the final
108
concentration was adjusted to more than 10 ng/µl. For PGM sequencing, the pooled
109
library was sent to the Research Technology Support Facility (RTSF) at Michigan State
110
University (East Lansing) for sequencing on an Ion Torrent PGM machine (Life
111
Technologies). Briefly, the library was diluted to 26pM and set up on the OneTouch2
112
instruments using the Ion PGM Template OT2 400 Kit following the manufacturer’s
113
instructions. These templated beads were then purified on the OneTouch ES system.
114
Following enrichment the beads were then loaded onto the PGM chip and sequenced
115
following manufacturer’s instructions. Sequencing data was processed by the Torrent
116
Suite Software V4.0. To compare the differences between PGM chips and sequencing kits,
117
mock (M-1 and M-2) and Oklahoma amplicons were sequenced on 314 v2 chip with Ion
118
PGM Sequencing 400 Kit (400bp kit) first, and then were sequenced on 318 v2 chip with
119
both the 400bp kit and the Ion PGM Hi-Q Sequencing Kit (Hi-Q kit) along with the
120
Arlington crop rotation samples. The Hi-Q kit sequencing was completed in the
121
University of Wisconsin Biotechnology Center with the same procedures as RTSF, and
122
processed by the synchronized Torrent Suite Software V4.4. The mock community was 6
123
also sequenced in additional chip 318 runs with 400bp kit (M-R2, M-R3, and M-R4) at
124
RTSF. Pyrosequencing was performed at the Utah State University Center for Integrated
125
Biosystems using the 454 Titanium platform following the manufacturer’s instructions .
126 127
Mock community analysis. To assess the sequencing error type and rate in concert with
128
the read Q (quality) score, the Defined Community Analysis Tool in the FunGene
129
pipeline was adopted to analyze the mock sample reads. This tool compares the
130
sequencing reads with the known corresponding region (reference sequences) from the
131
mock community organisms (9). To this end, filters for passing the raw reads through
132
initial quality filtering were set to: Forward primer max edit distance 2, Reverse primer
133
max edit distance 1, Max number of N's 0, Min sequence length 300 (excluding primers).
134
In order to examine the relationship of error rates at each read Q score cutoff, Minimum
135
Read Q Score of 0 was applied to allow reads with any read Q score to pass. Usearch (23)
136
and BLAST(24) were used to detect the small numbers of chimeric and contaminant
137
reads, which were then excluded when the ‘summary file’ was generated by this tool. The
138
numbers of total substitutions and indels, read Q score distributions, overall error rates
139
(total errors divided by total bases for each read Q score cutoff), percentages of reads
140
remaining and reads with specific errors that varied by Q score were summarized. Results
141
from this tool further optimized the analysis parameters, such as the Q score quality filter,
142
for the environmental reads.
143 7
144
Data processing. For both PGM and pyrosequencing, all reads obtained from soil DNA
145
were
146
(http://fungene.cme.msu.edu/FunGenePipeline/) using same quality filtering parameters
147
as above, except with a minimum read Q score of 22 to filter reads of low quality. The
148
resulting filtered reads with primers removed were subjected to Usearch (23) for chimera
149
check using de novo mode. After the removal of chimeras, all samples were translated
150
and
151
frameshift-corrected protein sequences were aligned using FunGene HMMER3 Aligner,
152
and were then clustered by RDP mcClust using complete linkage algorithm. The cluster
153
file was converted to an R-formatted operational taxonomic unit (OTU) table by the RDP
154
cluster file formatter tool for use by R (version 2.12.0; http://www.r-project.org/) for
155
ordination analysis. Sequences were randomly resampled to 1,814 amino acid sequences
156
per sample for platform comparisons and to 22,616 sequences for the cropping system
157
study.
trimmed
by
the
frameshift-corrected
Initial
using
Process
FrameBot
tool
in
(25)
the
with
FunGene
default
pipeline
settings.
(9)
The
158 159
Platform Comparison of Oklahoma Samples. OTUs were generated at 0% amino acid
160
dissimilarity (OTU0) and at 5% dissimilarity (OTU0,05). OTU abundances were
161
Hellinger-transformed (square root of relative abundance), and Bray-Curtis and
162
Euclidean dissimilarity matrices (with dummy variable +1) were constructed using
163
PRIMER-v6
164
was used to construct cluster dendograms with similarity profile (SIMPROF) testing (26).
(Primer-E LTD, Plymouth, United Kindom). Complete linkage clustering
8
165
Permutational Multivariate Analysis of Variance (PERMANOVA) (27) was used to test
166
for significant differences in community structure. Shannon diversity (H’), Margalef’s
167
Richness (d), Pielou’s Evenness (J’), and the number of individuals (Sab) were tested for
168
significant differences by one-way Analysis of Variance (ANOVA) among sequencing
169
platforms using Minitab 16 (Minitab Inc., PA, USA) with a Tukey correction for multiple
170
comparisons. Representative sequences were generated as the minimum sum of square
171
distances of each OTU0.05. These representative sequences were assigned to a closest
172
match using BLASTp with FunGene NifH database consisting of 675 hand-curated
173
sequences where the extracted protein corresponded to the amplicon generated by the
174
primers (modified Zehr-set) (25).
175 176
Nitrogen-fixing communities in soils of different cropping systems. All sequences of
177
the 12 soil samples were clustered at OTU0.05 level (See Fig S1 for the rarefaction curve),
178
and then were resampled to 22,616 sequences per sample for calculation of diversity
179
indices (H’ and J’) and downstream analyses. Significant differences in diversity indices
180
were tested by one-way ANOVA using SPSS 18 (SPSS, IL, USA). A heatmap of the 30
181
most abundant OTUs0.05 (Fig S2) was produced using labdsv package (version 1.6-1,
182
http://CRAN.R-project.org/package=labdsv).
183
Hellinger-transformed and subjected to redundancy analysis (RDA) for ordination under
184
constraint of the four cropping systems using the Vegan package (version 2.0-10,
185
http://CRAN.R-project.org/package=vegan). The significance of the cropping system 9
OTU0.05
abundance
data
were
186
influence was assessed using PERMANOVA and differences between within-sample
187
variance were tested using the PERMutational DISPersion test (PERMDISP). Reads
188
assignment into the taxa and NifH clusters was completed using FrameBot when
189
conducting frameshift correction. Soil properties were fitted to RDA model using the
190
Vegan package.
191 192
Data availability. Raw sequences from PGM and 454 were deposited in NCBI Sequence
193
Read Archive with accession number PRJNA266566 and ENA-SRA with accession
194
number PRJEB8005 (sample accessions ERS629288-ERS629293), respectively.
195 196
RESULTS
197 198
PGM sequencing and error type. Sequencing resulted in 327,994 raw reads from the
199
PGM 314 chip, while the 318 chip produced an average of 5,588,636 (±450,701) raw
200
reads per run, an approximate 16-fold increase in yield. Read length distribution was
201
consistent among different runs, as the relative percentage of reads by read length in each
202
run showed a single and highly similar peak at the full nifH amplicon size (360 bp) (Fig.
203
1A). Full-length reads accounted for 45% of all reads from chip 314, an average of 26%
204
(±8%) of all reads from chip 318 with 400 bp kit, and 60% of all reads from chip 318
205
with Hi-Q kit (Table 1).
206
Mock community analyses with the Defined Community Analysis Tool showed that 10
207
the errors per full-length read (Fig. 1B) and percentage of reads with errors (Table 1)
208
varied according to chip type, independent runs, and sequencing chemistry. The indels
209
and substitutions per full-length read were higher in chip 318 runs with 400 bp kit
210
(average 2.60 (±0.43) and 0.58 (±0.1), respectively), while much lower in chip 314 (1.22
211
and 0.35) and chip 318 with the Hi-Q kit (1.35 and 0.52) (Fig. 1B). Thus, the Hi-Q kit
212
reduced the indel rate by 28%-59% (mean 48%) compared to the 318 chip-400bp kit runs.
213
In combination with the higher percentage of reads with indels (2.2-3.7 times of reads
214
with substitutions) (Table 1), this error rate result also indicates that indels are the
215
primary source of PGM sequencing errors. Replicate mock communities (M-1 and M-2)
216
exhibited similar error patterns across chips and sequencing chemistry. Using the 400bp
217
kit, chip 318 produced more errors than chip 314, even in the best mock run (M-R3, third
218
run of chip 318), while chip 318 with the Hi-Q kit reduced the error rate and approached
219
the performance of chip 314 with the 400 bp kit.
220 221
Error rate and Q score cutoff. In order to determine the error rate and filtering
222
parameters (e.g. minimum read Q score) for the environmental samples, mock
223
communities among different runs were further analyzed (Fig. 2). For all mock samples,
224
the read Q score distributions from chip 314 peaked at range of 24-25, while they peaked
225
at range of 20-22 in chip 318 runs (Fig. 2A). The Hi-Q sequencing kit displayed the best
226
read Q score distribution in chip 318 runs, with M-R4-318 representing an average
227
sequencing output of chip 318 runs with the 400bp kit (Fig. 2A). The two replicates (M-1 11
228
and M-2) showed a consistent distribution pattern in their individual runs. Both the
229
overall error rate and percentage of reads remaining decreased with increasing read Q
230
score cutoffs (Fig. 2B). The percentage of reads with the specified number of allowed
231
errors decreased sharply with read Q score cutoff, taking the Hi-Q kit run as an example
232
(Fig 2C, see Fig S3 for the M-R4-318 run) which showed an overall error rate of 0.40%
233
per base and 65% of the full length reads remained when the read Q score was set to 22
234
(Table 1, Fig. 2B).
235
error), 78% for E1, 62% for E2, 46 % for E3, 33% for E4, 21% for E5, and much less for
236
reads with more errors (Fig. 2C,). This indicates that reads containing more errors,
237
especially those with more than three errors (1% of nifH amplicon length), were mostly
238
excluded from analysis at a read Q score of 22. As such, a read Q score of 22 was chosen
239
for processing the environmental DNA samples. Overall, the Hi-Q kit allowed the
240
retention of 1-7 times more acceptable reads, compared to the 318 chip-400 bp kit runs
241
(Table 1).
The percentage of remaining reads was 91% for E0 (reads with 0
242 243
Sequencing platform comparisons of nifH results. To further evaluate PGM
244
sequencing, six soil samples that were amplified for nifH and sequenced using the 454
245
platform were re-amplified and sequenced on the PGM using chips 314 and 318 with 400
246
bp kits and chip 318 with the Hi-Q kit. After initial quality processing, the number of
247
sequences per sample ranged from 4831 to 8175 (454), 1997 to 2967 (PGM 314), 2382 to
248
4068 (PGM 318), and 16270 to 23868 (Hi-Q). The percentage of chimeras differed 12
249
significantly (ANOVA, F=27.74, P