Evaluation of the Ion Torrent Personal Genome Machine for Gene-Targeted Studies Using Amplicons of the Nitrogenase Gene nifH.

AEM Accepted Manuscript Posted Online 24 April 2015 Appl. Environ. Microbiol. doi:10.1128/AEM.00111-15 Copyright © 2015, American Society for Microbiology. All Rights Reserved.

1

Evaluation of the Ion Torrent PGM for gene-targeted

2

studies using amplicons of the nitrogenase gene, nifH

3 4

Bangzhou Zhanga,b, C. Ryan Pentona, Chao Xuea,c, Qiong Wanga, Tianling Zhengb#,

5

James M. Tiedjea#

6 7

Running title: Evaluation of PGM for gene-targeted studies using nifH

8 9

Center for Microbial Ecology, Michigan State University, East Lansing, Michigan, USAa;

10

State Key Lab of Marine Environmental Science and Key Lab of the MOE for Coastal

11

and Wetland Ecosystems, School of Life Sciences, Xiamen University, Xiamen, Chinab;

12

Jiangsu Collaborative Innovation Center for Solid Organic Waste Utilization and National

13

Engineering Research Center for Organic-based Fertilizers, Department of Plant

14

Nutrition, Nanjing Agricultural University, Nanjing, Chinac

15 16 17

#

Address correspondence to James M. Tiedje and Tianling Zheng: [email protected],

[email protected]

18

1

19

ABSTRACT

20

The Ion Torrent Personal Genome Machine (PGM), which employs semiconductor

21

technology to measure pH changes in polymerization events, has recently upgraded its

22

sequencing chips and kits. The quality of PGM sequences has not been re-assessed, and

23

results have not been compared in the context of a gene-targeted microbial ecology study.

24

To address this, we compared sequence profiles across available PGM chips and

25

chemistries and with 454 pyrosequence data by determining error type and rate and

26

diazotrophic community structure. The PGM was then used to assess differences in

27

nifH-harboring bacterial community structure among four corn-based cropping systems.

28

Using our suggested filters from mock community analyses, the overall error rates were

29

0.62%, 0.36%, and 0.39% per base for chips 318 and 314 with the 400 bp kit and 318

30

with the Hi-Q chemistry, respectively. Compared with the 400 bp kit, the Hi-Q kit

31

reduced indel rates by 28-59% and produced 1-7 times more acceptable reads for

32

downstream analyses. The PGM produced higher frameshift rates than pyrosequencing

33

that were corrected by the RDP’s FrameBot tool. Significant differences among platforms

34

were identified, although diversity indices and overall site-based conclusions remained

35

similar. For the cropping system analyses, a total of 6,182 unique NifH OTUs at 5%

36

amino acid dissimilarity were obtained. Current crop type as well as crop rotation history

37

significantly influenced the composition of the detected soil diazotrophic community.

38

2

39

INTRODUCTION

40

Next-generation sequencing (NGS) technologies have evolved rapidly in the years

41

since pyrosequencing was initially launched in 2005 (1), followed by light imaging

42

Illumina (2, 3) HiSeq and MiSeq and by post-light sequencing Ion Torrent (4, 5) Personal

43

Genome Machine (PGM) and Proton. Sequencing quality (error type and rate) and length

44

has suffered with NGS platforms (3, 6, 7) compared to Sanger sequencing, but the greatly

45

improved sequence yields and reduced cost have resulted in NGS dominating use. The

46

higher error rate has been problematic for microbial ecology, but by parallel sequencing

47

of mock communities of known composition, the type and extent of insertion, deletion or

48

substitution errors can be characterized. This information can then be used to determine

49

the proper parameters and filters for parsing raw sequence data to minimize the impact of

50

these errors on biological conclusions (8, 9).

51

To date, sequencing platform comparisons or case studies employing the Ion Torrent

52

PGM were mainly focused on microbial whole genome sequencing (6, 7, 10, 11), or on

53

bacterial community structure using 16S rRNA gene multiplexing tags (12-14). A few

54

studies applied PGM in medical diagnosis using marker genes (15-17). No studies have

55

utilized eco-functional genes to evaluated Ion Torrent PGM error profiles with 400 bp or

56

the recently released Hi-Q kits or directly comparing the PGM data against another

57

sequencing platform.

58

In this study, we determined the error types and rates and the sequencing output of the

59

Ion Torrent PGM using chip 314 with Sequencing 400 Kit and chip 318 with both 3

60

Sequencing 400 and Hi-Q Sequencing Kits. Mock communities were analyzed to guide

61

the use of appropriate filters for removal of low-quality reads. Sequencing error profiles

62

of the Ion Torrent PGM and Roche 454 platforms were directly compared using the same

63

samples to determine differences in community structure, diversity indices,

64

presence/absence, and relative abundances of major OTUs. Lastly, we utilized the Ion

65

Torrent PGM chip 318 with Hi-Q Sequencing Kit to sequence nifH amplicons from bulk

66

soil samples as a case study to characterize the nitrogen-fixing bacterial communities

67

among four cropping systems that varied in crop type, rotation, and cover crop.

68 69

MATERIALS AND METHODS

70

Community DNA. Genomic DNA samples consisted of a defined community (mock)

71

containing

72

Desulfitobacterium hafniense DCB-2, and Burkholderia vietamensis G4, six Oklahoma

73

prairie soil samples from 0 to 15 cm sequenced by 454, and 12 bulk soil samples from

74

four different cropping systems. The Oklahoma grassland samples were from warming

75

treatments (T) and control (C) sites from a tallgrass prairie at the Great Plains Apiaries

76

site (34°58’54”N, 97°31’14”W) where experimental warming has been underway for

77

over a decade (18-20). The DNA of the Oklahoma samples was extracted using a

78

mechanical lysis freeze-griding method (21). The 12 bulk soil samples (G1-G4, 3 plots

79

each) were collected at Great Lakes Bioenergy Research Center intensive site at

80

Arlington, Wisconsin (43°18’9.47’’N, 89°20’43.32’’W) in July 2013. The cropping

nifH-harboring

strains

Polaromonas

4

naphthalenivorans

CJ2,

81

system G1 was continuous corn; G2 was corn+cover crop annually, while G3 and G4

82

were soybean + cover crop and corn + cover crop, which were in a corn-soybean rotation

83

since 2012. Previously G2, G3, and G4 were in a corn-soybean-canola rotation from

84

2008-2011 (See Table S1 for the cropping history and Table S2 for soil properties). Six

85

cores (2.5 cm diameter × 25 cm depth (0-25 cm)) collected from the same plot were

86

sieved (4 mm) and mixed together into one sample. Well-mixed soil (0.3 g) was used for

87

DNA extraction with the Power Soil DNA Isolation Kit (MOBIO, CA, USA) according

88

to the manufacturer’s protocol. The DNA was quantified with a Nanodrop ND-1000

89

spectrophotometer (Nanodrop Technology) and stored at -20ºC for the following

90

application.

91 92

NifH library generation and sequencing. The nifH gene was amplified using the Poly

93

primers (22) in triplicate PCR reactions using fusion primers (Table S3). PCR mixtures

94

for both PGM and pyrosequencing contained 1 × green buffer (PROMEGA, WI, USA),

95

1.8 mM MgCl2 (PROMEGA), 0.5mM dNTP (PROMEGA), 500 nM each primer (IDT,

96

IA, USA),

97

(PROMEGA), and

98

for both PGM and pyrosequencing were performed as follows: 3 min at 95°C, 30 cycles

99

of 45 s at 94°C, 45 s at 61°C (62°C for 454), and 1 min at 72°, followed by final

100

extension for 7 min at 72°C. PCR products were checked on a 1.6 % agarose gel using

101

SYBR Safe gel stain (Invitrogen, CA, USA) and extracted from the gel using the

0.1 mg/mL BSA (NEB, Massachusetts, USA ), 2.5 U Taq polymerase 1 ng/µL template DNA in a 20 µL reaction volume. Amplifications

5

102

QIAquick Gel Extraction Kit (QIAGEN, MD, USA). The eluted DNA was purified again

103

using the Qiagen PCR Purification Kit (QIAGEN). The mock community was amplified

104

for Ion Torrent sequencing using two barcoded primers to produce M-1 and M-2 samples.

105 106

For PGM and pyrosequencing, each purified sample was quantified by the Qubit

107

Fluorometer (Invitrogen). Equivalent DNA of each sample was pooled, and the final

108

concentration was adjusted to more than 10 ng/µl. For PGM sequencing, the pooled

109

library was sent to the Research Technology Support Facility (RTSF) at Michigan State

110

University (East Lansing) for sequencing on an Ion Torrent PGM machine (Life

111

Technologies). Briefly, the library was diluted to 26pM and set up on the OneTouch2

112

instruments using the Ion PGM Template OT2 400 Kit following the manufacturer’s

113

instructions. These templated beads were then purified on the OneTouch ES system.

114

Following enrichment the beads were then loaded onto the PGM chip and sequenced

115

following manufacturer’s instructions. Sequencing data was processed by the Torrent

116

Suite Software V4.0. To compare the differences between PGM chips and sequencing kits,

117

mock (M-1 and M-2) and Oklahoma amplicons were sequenced on 314 v2 chip with Ion

118

PGM Sequencing 400 Kit (400bp kit) first, and then were sequenced on 318 v2 chip with

119

both the 400bp kit and the Ion PGM Hi-Q Sequencing Kit (Hi-Q kit) along with the

120

Arlington crop rotation samples. The Hi-Q kit sequencing was completed in the

121

University of Wisconsin Biotechnology Center with the same procedures as RTSF, and

122

processed by the synchronized Torrent Suite Software V4.4. The mock community was 6

123

also sequenced in additional chip 318 runs with 400bp kit (M-R2, M-R3, and M-R4) at

124

RTSF. Pyrosequencing was performed at the Utah State University Center for Integrated

125

Biosystems using the 454 Titanium platform following the manufacturer’s instructions .

126 127

Mock community analysis. To assess the sequencing error type and rate in concert with

128

the read Q (quality) score, the Defined Community Analysis Tool in the FunGene

129

pipeline was adopted to analyze the mock sample reads. This tool compares the

130

sequencing reads with the known corresponding region (reference sequences) from the

131

mock community organisms (9). To this end, filters for passing the raw reads through

132

initial quality filtering were set to: Forward primer max edit distance 2, Reverse primer

133

max edit distance 1, Max number of N's 0, Min sequence length 300 (excluding primers).

134

In order to examine the relationship of error rates at each read Q score cutoff, Minimum

135

Read Q Score of 0 was applied to allow reads with any read Q score to pass. Usearch (23)

136

and BLAST(24) were used to detect the small numbers of chimeric and contaminant

137

reads, which were then excluded when the ‘summary file’ was generated by this tool. The

138

numbers of total substitutions and indels, read Q score distributions, overall error rates

139

(total errors divided by total bases for each read Q score cutoff), percentages of reads

140

remaining and reads with specific errors that varied by Q score were summarized. Results

141

from this tool further optimized the analysis parameters, such as the Q score quality filter,

142

for the environmental reads.

143 7

144

Data processing. For both PGM and pyrosequencing, all reads obtained from soil DNA

145

were

146

(http://fungene.cme.msu.edu/FunGenePipeline/) using same quality filtering parameters

147

as above, except with a minimum read Q score of 22 to filter reads of low quality. The

148

resulting filtered reads with primers removed were subjected to Usearch (23) for chimera

149

check using de novo mode. After the removal of chimeras, all samples were translated

150

and

151

frameshift-corrected protein sequences were aligned using FunGene HMMER3 Aligner,

152

and were then clustered by RDP mcClust using complete linkage algorithm. The cluster

153

file was converted to an R-formatted operational taxonomic unit (OTU) table by the RDP

154

cluster file formatter tool for use by R (version 2.12.0; http://www.r-project.org/) for

155

ordination analysis. Sequences were randomly resampled to 1,814 amino acid sequences

156

per sample for platform comparisons and to 22,616 sequences for the cropping system

157

study.

trimmed

by

the

frameshift-corrected

Initial

using

Process

FrameBot

tool

in

(25)

the

with

FunGene

default

pipeline

settings.

(9)

The

158 159

Platform Comparison of Oklahoma Samples. OTUs were generated at 0% amino acid

160

dissimilarity (OTU0) and at 5% dissimilarity (OTU0,05). OTU abundances were

161

Hellinger-transformed (square root of relative abundance), and Bray-Curtis and

162

Euclidean dissimilarity matrices (with dummy variable +1) were constructed using

163

PRIMER-v6

164

was used to construct cluster dendograms with similarity profile (SIMPROF) testing (26).

(Primer-E LTD, Plymouth, United Kindom). Complete linkage clustering

8

165

Permutational Multivariate Analysis of Variance (PERMANOVA) (27) was used to test

166

for significant differences in community structure. Shannon diversity (H’), Margalef’s

167

Richness (d), Pielou’s Evenness (J’), and the number of individuals (Sab) were tested for

168

significant differences by one-way Analysis of Variance (ANOVA) among sequencing

169

platforms using Minitab 16 (Minitab Inc., PA, USA) with a Tukey correction for multiple

170

comparisons. Representative sequences were generated as the minimum sum of square

171

distances of each OTU0.05. These representative sequences were assigned to a closest

172

match using BLASTp with FunGene NifH database consisting of 675 hand-curated

173

sequences where the extracted protein corresponded to the amplicon generated by the

174

primers (modified Zehr-set) (25).

175 176

Nitrogen-fixing communities in soils of different cropping systems. All sequences of

177

the 12 soil samples were clustered at OTU0.05 level (See Fig S1 for the rarefaction curve),

178

and then were resampled to 22,616 sequences per sample for calculation of diversity

179

indices (H’ and J’) and downstream analyses. Significant differences in diversity indices

180

were tested by one-way ANOVA using SPSS 18 (SPSS, IL, USA). A heatmap of the 30

181

most abundant OTUs0.05 (Fig S2) was produced using labdsv package (version 1.6-1,

182

http://CRAN.R-project.org/package=labdsv).

183

Hellinger-transformed and subjected to redundancy analysis (RDA) for ordination under

184

constraint of the four cropping systems using the Vegan package (version 2.0-10,

185

http://CRAN.R-project.org/package=vegan). The significance of the cropping system 9

OTU0.05

abundance

data

were

186

influence was assessed using PERMANOVA and differences between within-sample

187

variance were tested using the PERMutational DISPersion test (PERMDISP). Reads

188

assignment into the taxa and NifH clusters was completed using FrameBot when

189

conducting frameshift correction. Soil properties were fitted to RDA model using the

190

Vegan package.

191 192

Data availability. Raw sequences from PGM and 454 were deposited in NCBI Sequence

193

Read Archive with accession number PRJNA266566 and ENA-SRA with accession

194

number PRJEB8005 (sample accessions ERS629288-ERS629293), respectively.

195 196

RESULTS

197 198

PGM sequencing and error type. Sequencing resulted in 327,994 raw reads from the

199

PGM 314 chip, while the 318 chip produced an average of 5,588,636 (±450,701) raw

200

reads per run, an approximate 16-fold increase in yield. Read length distribution was

201

consistent among different runs, as the relative percentage of reads by read length in each

202

run showed a single and highly similar peak at the full nifH amplicon size (360 bp) (Fig.

203

1A). Full-length reads accounted for 45% of all reads from chip 314, an average of 26%

204

(±8%) of all reads from chip 318 with 400 bp kit, and 60% of all reads from chip 318

205

with Hi-Q kit (Table 1).

206

Mock community analyses with the Defined Community Analysis Tool showed that 10

207

the errors per full-length read (Fig. 1B) and percentage of reads with errors (Table 1)

208

varied according to chip type, independent runs, and sequencing chemistry. The indels

209

and substitutions per full-length read were higher in chip 318 runs with 400 bp kit

210

(average 2.60 (±0.43) and 0.58 (±0.1), respectively), while much lower in chip 314 (1.22

211

and 0.35) and chip 318 with the Hi-Q kit (1.35 and 0.52) (Fig. 1B). Thus, the Hi-Q kit

212

reduced the indel rate by 28%-59% (mean 48%) compared to the 318 chip-400bp kit runs.

213

In combination with the higher percentage of reads with indels (2.2-3.7 times of reads

214

with substitutions) (Table 1), this error rate result also indicates that indels are the

215

primary source of PGM sequencing errors. Replicate mock communities (M-1 and M-2)

216

exhibited similar error patterns across chips and sequencing chemistry. Using the 400bp

217

kit, chip 318 produced more errors than chip 314, even in the best mock run (M-R3, third

218

run of chip 318), while chip 318 with the Hi-Q kit reduced the error rate and approached

219

the performance of chip 314 with the 400 bp kit.

220 221

Error rate and Q score cutoff. In order to determine the error rate and filtering

222

parameters (e.g. minimum read Q score) for the environmental samples, mock

223

communities among different runs were further analyzed (Fig. 2). For all mock samples,

224

the read Q score distributions from chip 314 peaked at range of 24-25, while they peaked

225

at range of 20-22 in chip 318 runs (Fig. 2A). The Hi-Q sequencing kit displayed the best

226

read Q score distribution in chip 318 runs, with M-R4-318 representing an average

227

sequencing output of chip 318 runs with the 400bp kit (Fig. 2A). The two replicates (M-1 11

228

and M-2) showed a consistent distribution pattern in their individual runs. Both the

229

overall error rate and percentage of reads remaining decreased with increasing read Q

230

score cutoffs (Fig. 2B). The percentage of reads with the specified number of allowed

231

errors decreased sharply with read Q score cutoff, taking the Hi-Q kit run as an example

232

(Fig 2C, see Fig S3 for the M-R4-318 run) which showed an overall error rate of 0.40%

233

per base and 65% of the full length reads remained when the read Q score was set to 22

234

(Table 1, Fig. 2B).

235

error), 78% for E1, 62% for E2, 46 % for E3, 33% for E4, 21% for E5, and much less for

236

reads with more errors (Fig. 2C,). This indicates that reads containing more errors,

237

especially those with more than three errors (1% of nifH amplicon length), were mostly

238

excluded from analysis at a read Q score of 22. As such, a read Q score of 22 was chosen

239

for processing the environmental DNA samples. Overall, the Hi-Q kit allowed the

240

retention of 1-7 times more acceptable reads, compared to the 318 chip-400 bp kit runs

241

(Table 1).

The percentage of remaining reads was 91% for E0 (reads with 0

242 243

Sequencing platform comparisons of nifH results. To further evaluate PGM

244

sequencing, six soil samples that were amplified for nifH and sequenced using the 454

245

platform were re-amplified and sequenced on the PGM using chips 314 and 318 with 400

246

bp kits and chip 318 with the Hi-Q kit. After initial quality processing, the number of

247

sequences per sample ranged from 4831 to 8175 (454), 1997 to 2967 (PGM 314), 2382 to

248

4068 (PGM 318), and 16270 to 23868 (Hi-Q). The percentage of chimeras differed 12

249

significantly (ANOVA, F=27.74, P

Simultaneous Whole Mitochondrial Genome Sequencing with Short Overlapping Amplicons Suitable for Degraded DNA Using the Ion Torrent Personal Genome Machine.

High-throughput sequencing using the Ion Torrent personal genome machine for clinical evaluation of somatic hypermutation status in chronic lymphocytic leukemia.

The complete mitochondrial genome of the invasive Ponto-Caspian goby Ponticola kessleri obtained from high-throughput sequencing using the Ion Torrent Personal Genome Machine.

WITHDRAWN: Erratum to "Evaluation of next generation mtGenome sequencing using the Ion Torrent Personal Genome Machine (PGM)" [Forensic Sci. Int.: Genet. 7 (2013) 543-549].

IonGAP: integrative bacterial genome analysis for Ion Torrent sequence data.

Underlying Data for Sequencing the Mitochondrial Genome with the Massively Parallel Sequencing Platform Ion Torrent™ PGM™.

GHRH, PRP-PACAP and GHRHR Target Sequencing via an Ion Torrent Personal Genome Machine Reveals an Association with Growth in Orange-Spotted Grouper (Epinephelus coioides).

Complete Genome Sequences of Hepatitis B Virus from North India Using Ion Torrent.

Ion Torrent sequencing as a tool for mutation discovery in the flax (Linum usitatissimum L.) genome.

Less is more: extreme genome complexity reduction with ddRAD using Ion Torrent semiconductor technology.

Improved transposon-based library preparation for the Ion Torrent platform.

Ion Torrent sequencing for conducting genome-wide scans for mutation mapping analysis.

A comparison of Illumina and Ion Torrent sequencing platforms in the context of differential gene expression.

Strategies for complete mitochondrial genome sequencing on Ion Torrent PGM™ platform in forensic sciences.

Peak detection method evaluation for ion mobility spectrometry by using machine learning approaches.

Evaluation and optimisation of indel detection workflows for ion torrent sequencing of the BRCA1 and BRCA2 genes.

A comprehensive aligned nifH gene database: a multipurpose tool for studies of nitrogen-fixing bacteria.

Pipeline for amplifying and analyzing amplicons of the V1-V3 region of the 16S rRNA gene.

Ion torrent next-generation sequencing reveals the complete mitochondrial genome of endangered mahseer Tor khudree (Sykes, 1839).

A Personal history of the early development of the flowing afterglow technique for ion-molecule reaction studies.

Genetic mutation analysis of human gastric adenocarcinomas using ion torrent sequencing platform.

Kinetic studies of Bacillus polymyxa nitrogenase.

Evaluation of the Machine Performance Check application for TrueBeam Linac.

Comparison of the performance of Ion Torrent chips in noninvasive prenatal trisomy detection.