Forensic Science International: Genetics 22 (2016) 11–21

Contents lists available at ScienceDirect

Forensic Science International: Genetics journal homepage: www.elsevier.com/locate/fsig

Research paper

Strategies for complete mitochondrial genome sequencing on Ion Torrent PGMTM platform in forensic sciences Yishu Zhoua , Fei Guob,c , Jiao Yub , Feng Liub , Jinling Zhaob , Hongying Shenb , Bin Zhaob , Fei Jiab , Zhu Sunb , He Songa , Xianhua Jianga,b,* a b c

China Medical University School of Forensic Medicine, No. 77, Puhe Road, Shenyang North New Area, Shenyang, Liaoning 110122, PR China Criminal Science and Technology Institute of Liaoning Province, No. 2, Qishan Middle Road, Huanggu District, Shenyang, Liaoning 110032, PR China Department of Forensic Medicine, National Police University of China, No. 83, Tawan Street, Huanggu District, Shenyang, Liaoning 110854, PR China

A R T I C L E I N F O

A B S T R A C T

Article history: Received 21 April 2015 Received in revised form 30 December 2015 Accepted 8 January 2016 Available online 15 January 2016

Next generation sequencing (NGS) is a time saving and cost-efficient method to detect the complete mitochondrial genome (mtGenome) compared to Sanger sequencing. In this study we focused on developing strategies for mtGenome sequencing on the Ion Torrent PGMTM platform and NGS data analysis. With our experience, 4, 15 and 30 samples could be loaded onto Ion 314TM, Ion 316TM and Ion 318TM chips respectively at a pooling concentration of 26 pM, achieving to sufficient average coverage of 1500  and well strand balance of 1.05. Data processing software is essential to NGS mega data analysis. The in-house Perl scripts were developed for primary data analysis to screen out uncertain positions and samples from variant call format (VCF) reports and for pedigree study to perform pairwise comparisons. The Integrative Genomic Viewer (IGV) and the NextGENe software were introduced to secondary data analysis. The mthap and EMMA were employed for haplogroup assignment. The dataset was reviewed and approved by the EMPOP as the final version, which showed 2.66% error rate generated from the Torrent Variant Caller (TVC). Across the mtGenome, 4022 variants were found at 725 nucleotide positions, where ratio of transitions to transversions was estimated at 20.89:1 and 22.18% of variants was concentrated at hypervariable segments I and II (HVS-I and HVS-II). Totally, 107 complete mtGenome haplotypes were observed from 107 Northern Chinese Han and assigned to 88 haplogroups. The random match probability (RMP) of complete mtGenome was calculated as 0.009345794, decreasing 26.19% by comparison to that of HVS-I only, and the haplotype diversity (HD) was evaluated as 1, increasing 0.33% by comparison to that of HVS-I only. Principal component analysis (PCA) showed that our population was clustered to East and Southeast Asians. The strategies in this study are suitable for complete mtGenome sequencing on Ion Torrent PGMTM platform and Northern Chinese Han (EMP00670) is the first complete mtGenome dataset contributed to the EMPOP from East Asian. ã 2016 Elsevier Ireland Ltd. All rights reserved.

Keyword: Next generation sequencing (NGS) Ion Torrent PGMTM Mitochondrial genome (mtGenome) Northern Chinese Han Forensic sciences

1. Introduction The human mitochondrial genome (mtGenome) is an approximately 16,569 base pairs (bp) long extra-chromosomal genome presenting in the subcellular organelle. The 15,447 bp long coding region (CodR) is economical construction, which encodes 37 genes correlate to oxidative phosphorylation. The 1,122 bp long noncoding control region (CR) has proven to be informatics, which

* Corresponding author at: China Medical University School of Forensic Medicine, No. 77, Puhe Road, Shenyang North New Area, Shenyang, Liaoning 110122, PR China. Tel.: +86 24 31939433; fax: +86 24 31939433. E-mail address: [email protected] (X. Jiang). http://dx.doi.org/10.1016/j.fsigen.2016.01.004 1872-4973/ ã 2016 Elsevier Ireland Ltd. All rights reserved.

comprises an origin of replication along with three hypervariable segments (HVS) [1]. Compared with nuclear DNA, mitochondrial DNA (mtDNA) possesses some unique characteristics including simplistic inheritance and great copy number per cell [2]. It is considered to be inherited strictly from our mothers and to be transmitted without recombination across many generations. While specimens like highly decomposed remains, bones or hair shafts are hard to obtain full short tandem repeat (STR) profiles, mtDNA typing can provide an alternative clue to investigation by comparison with known maternal relatives [3]. Thus, it is commonly used in human evolutionary studies [4] and forensic practice [5]. Previous studies often restrict to sequence the HVS-I, II and III of CR [6,7] as well as some specific single nucleotide polymorphisms (SNPs) of CodR [8]. Such partial information may limit the

12

Y. Zhou et al. / Forensic Science International: Genetics 22 (2016) 11–21

polymorphism information content of this genetic marker and hinder its application in practical forensic casework. Undoubtedly, the complete mtGenome sequencing will provide new insights beyond current capabilities. However, most available mtGenomes are generated with conventional Sanger sequencing technology, which is a time-consuming and expensive task particularly when large mtGenome databases are established or high quality data are required with redundant sequence coverage. Fortunately, next generation sequencing (NGS) technology has the potential to dramatically increase sample throughput, workflow efficiency and detection resolution, therefore facilitating to obtain reliable and accurate complete mtGenome information. In this study, we have developed strategies for complete mtGenome sequencing on the Ion Torrent Personal GenomeTM Machine (PGMTM) platform and investigated mtGenome features of the Northern Chinese Han population to evaluate the application in forensic sciences. 2. Materials and methods 2.1. Samples and DNA extraction A total of 125 peripheral blood samples of Northern Han were collected from Shenyang City, Liaoning Province, Northeast China after informed consent, including 107 samples from unrelated individuals and 18 samples from five maternal genealogies (four generations in three families and three generations in two families). DNA was extracted on the EZ11 Biorobot workstation using the EZ11 DNA Investigator Extraction Kit (QIAGEN, Hilden, Germany). Purified DNA was stored at 20  C. 2.2. PCR amplification The mtGenome was amplified in two independent reactions using primers described in [9], namely L644, H8982, L8789, and H877 (Table S1). Two amplicons, fragment A (8.3 kilo base pairs, kbp) and fragment B (8.6 kbp), were partially overlapped at positions 622–898 and 8764–9003. Long range PCRs were carried out using the SequalPrepTM Long PCR Kit (Thermo Fisher Scientific, MA, USA) according to the manufacturer’s recommendations with two modifications: (1) the total DNA input was 30–50 ng per reaction. Fragment A was dealt with 2 ml enhancer A and 2 ml enhancer B; fragment B was done with 1 ml enhancer A. (2) amplification was performed on the GeneAmp1 9700 System thermal cycler (Thermo Fisher) with the thermal cycling condition for both reactions: denaturation for 2 min at 94  C, amplification for 32 cycles of 10 s at 94  C, 45 s at 60  C, 9 min at 68  C, final extension for 5 min at 72  C, and hold at 4  C. PCR products (5 ml) were qualified and quantified by 1.0% agarose gel electrophoresis with GoodView Nucleic Acid Stain (SBS Genetech, Beijing, China) to confirm successful amplification. 2.3. Library construction Library construction included three steps as follows: fragmentation, adapter ligation and size selection. (1) Fragmentation: fragments A and B were quantified on the Qubit1 2.0 fluorometer using the Qubit1 dsDNA BR Assay kit (Thermo Fisher) and were pooled with each of 50 ng. Then enzymatic shearing was performed using the Ion ShearTM Plus Reagent (Thermo Fisher) following the manufacturer’s instruction. In order to yield fragments with a median size of 200 bp, shearing condition was adjusted to 8 min at 37  C; (2) Adapter ligation: both ends of each fragment were ligated with the Ion P1 Adapters and the Ion Xpress Barcode Adapters according to the manufacturer’s recommendations; (3) Size selection: it was conducted on the E-Gel1 iBaseTM Power System with E-Gel1 SizeSelectTM 2% agarose gels (Thermo

Fisher) according to the manufacturer’s recommendations. The selected libraries were amplified for 8 cycles with the reagents supplied in Ion Plus Fragment Kit (Thermo Fisher). Purification was an essential step after each process (fragmentation, adapter ligation and libraries amplification), using the Agencourt AMPure1 XP PCR Purification system following the manufacturer’s recommendations. 2.4. Template preparation and PGMTM sequencing Quantification and qualification for the amplified libraries were examined on the Agilent 2100 Bioanalyzer System with the Agilent High Sensitivity DNA Kit (Agilent Technologies, CA, USA) according to the manufacturer’s recommendations. All barcoded libraries were pooled in equimolar amounts of 26 pM to ensure equal representation of each barcoded library in the sequencing run. Pooled libraries were then subjected to emulsion PCR on the Ion OneTouchTM 2 instrument (Thermo Fisher) with the Ion PGMTM Template OT2 200 Kit (Thermo Fisher), and followed by templatepositive Ion Sphere Particles (ISPs) enrichment on the Ion OneTouchTM ES instrument (Thermo Fisher) according to the manufacturer’s recommendations. NGS was performed on PGMTM platform (Thermo Fisher) with three size chips (Ion 314TM Chip, Ion 316TM Chip and Ion 318TM Chip) using the Ion PGMTM Sequencing 200 v2 Kit. All preparations were conducted to the protocol released by the manufacturer: chlorite cleaning and 18.2 MV water cleaning, initialization the PGMTM system and chip check. Enriched template-positive ISPs were mixed with control ISPs, sequencing primers and sequencing polymerase in a consecutive way, and then loaded onto an Ion chip carefully. 2.5. Data analysis All raw reads were mapped with the revised Cambridge Reference Sequence (rCRS, NC_012920.1) [10] by the Ion Torrent Suite Software (TSS) version 4.0 based on the TMAP SmithWaterman alignment algorithm. Variants were called using the Torrent Variant Caller (TVC) plugin v.4.0 with somatic-low stringency and filed into variant call format (VCF) reports. The in-house Perl scripts (Boxes 1–3) were compiled for primary data analysis. Base by base investigation was performed using the Integrative Genomic Viewer (IGV) package v.2.3.34 [11] and the NextGENe version 2.3.4.5 (SoftGenetics, PA, USA), where binary alignment map (BAM) and binary alignment index (BAI) files were visualized based on the Torrent mapping alignment program (TMAP) using the IGV, and FASTQ-converted-FASTA files were re-aligned based on the Burrow–Wheeler transform (BWT) alignment algorithm using the NextGENe. Parameters of NextGENe were described in [12]. The preliminary haplogroup assignment was chosen from the highest ranked haplogroup from mthap (http://dna.jameslick.com/mthap) and EMMA [13] in the European DNA Profiling Group mtDNA Population (EMPOP) database (www.empop.org) [14]. The final haplogroup assignment was determined according to rCRS-oriented version of the mtDNA tree Build 16 (released 19 Feb., 2014) [15]. Homopolymers in rCRS and each sample were detected using mreps 2.6 software [16]. Profiles were compared among maternal genealogies using Perl script (Box 4) for pedigree study. 2.6. Sanger sequencing Sanger sequencing was employed to confirm variants in discordance between IGV and NextGENe. Primers for Sanger sequencing were listed in Table S1, where five primers were described in [17] and one primer was self-designed in this study.

Y. Zhou et al. / Forensic Science International: Genetics 22 (2016) 11–21

Sequencing was performed on the Applied Biosystems1 3130xl Genetic Analyzer (Thermo Fisher) using the BigDye1 Terminator v3.1 Cycle Sequencing Kit (Thermo Fisher) following the manufacturer’s recommendations. Raw data was analyzed using the Sequencing Analysis Software v5.3.1 (Thermo Fisher).

13

The random match probability (RMP) was defined as RMP = pi2, where pi is the frequency of the ith haplotype. The haplotype diversity (HD) was calculated as HD = (1–RMP) (n/(n  1)), where n is the sample size. Based on haplogroup frequencies from populations, principal component analysis (PCA) was measured and visualized using R version 3.0.2. Proportion test was performed using the OpenStat [18], and analysis of variance (ANOVA), Fisher’s exact test and Chi-square test were done using SPSS version 20.0.0 software (IBM, NY, USA).

600 Mbp and 1100 Mbp (0.5 million (M), 3 M and 5.5 M reads  200 bp/read), which could load with 4, 15 and 30 samples at a pooling concentration of 26 pM within 30%–35% of low quality data and polyclonal ISPs, respectively. Most of population samples were sequenced on the Ion 316TM chips in this study. Take one of running summaries for example: 85% ISP density was obtained with 362 Mbp raw data, 34% low quality data, 35% polyclonal ISPs and 43% usable data. Interestingly, the percentage of low quality seemed quite high, but it decreased to 8% when we reanalyzed this run from Signal Processing using TSS version 4.2. We referred the Torrent Suite Software User Documentation, which noted all reads with off-scale (pinned) signal were counted as low quality ISPs for purposes of the ISP summary in version 4.0 but now as empty wells in version 4.2 or above. According to the equation of % Low Quality = Low quality ISPs/Clonal ISPs, the result would become low with the numerator decreased.

3. Results and discussion

3.2. Data analysis

3.1. PGMTM sequencing summary

Data analysis on the complete mtGenome can be seen as a challenging part in practice. The strategy was developed in this study (Fig. 1) mainly including primary data analysis (align variants from VCF reports; sort alignment by positions and by samples; collect positions by samples for investigation) and secondary data analysis (base by base review; haplogroup assignment; EMPOP evaluation).

2.7. Statistics P

The processes of complete mtGenome sequencing by NGS are basically shared among current platforms, including DNA extraction, long range PCR amplification, library construction, clonal amplification, sequencing and raw data analysis. Although manufacturers are trying to shorten the library preperation time relatively by seeking to streamline workflows, it takes longer (about 4 days) from sample to profile with our experience when factors are considered as follows: pipetting time, working hours and multiplexing number of samples. Ion 314TM, Ion 316TM and Ion 318TM chips for the PGMTM platform had the theoretical maximum throughputs of 100 Mbp,

3.2.1. Primary data analysis 3.2.1.1. Step 1: align variants from VCF reports. Information on VCF position, VCF reference, VCF variant, frequency (freq) and coverage (cov) was extracted from VCF reports by Box 1. Also, it calculated

Fig. 1. Workflow of data analysis. Data analysis mainly includes two parts presenting the strategies herein: primary data analysis (align variants from VCF reports; sort alignment by positions and by samples; collect positions by samples for investigation); secondary data analysis (base by base review; haplogroup assignment; EMPOP evaluation).

14

Y. Zhou et al. / Forensic Science International: Genetics 22 (2016) 11–21

the times of variants observed at each position from all samples and the amount of variants and the average cov for each sample, and finally exported the alignment report in an .xls file (Table S2). In this file, positions followed by the reference and variant from all samples were vertically arranged from the lowest to the highest, and variant details of each sample, including variant, freq and cov, were horizontally arranged every three columns according to file names (e.g., N001–N108) from the smallest to the biggest. Herein, sample N081 was present the same variants as sample N073, possibly double-sampling from the same individual. It was removed after confirmation by STR profiles. 3.2.1.2. Step 2: sort alignment by positions and by samples. According to minimum variant frequency and sufficient coverage for detection, the alignment report from Step 1 was divided into four categories by positions (Table S3) with Box 2 and by samples (Table S4) with Box 3: I. freq  15% and cov  100 ; II. freq  15% and cov < 100 ; III. freq < 15% and cov  100 ; IV. freq < 15% and cov < 100  By positions, the rule for confirming/rejecting variants was mainly followed as Category I, confirmed (88.75%); Category II, documented and further investigated (7.98%); Category III and IV, rejected (3.27%). A total of 31 complicated positions were also documented and needed further investigation including different variants reported at the same positions (e.g., 195T ! C and 195T ! A) and multiple nucleotides reported at the reference (e.g., 12236GC ! G) and variant (e.g., 2150T ! TA). By samples, the qualification level was measured with % Category I (the amount of variants in Category I divided by total variants in each sample), where 71 good samples were convinced (90% Category I) and 36 poor samples needed to be documented and further investigated (

Strategies for complete mitochondrial genome sequencing on Ion Torrent PGM™ platform in forensic sciences.

Next generation sequencing (NGS) is a time saving and cost-efficient method to detect the complete mitochondrial genome (mtGenome) compared to Sanger ...
566B Sizes 0 Downloads 10 Views

Recommend Documents