Molecular Ecology Resources (2014) 14, 437–446

doi: 10.1111/1755-0998.12194

A DNA mini-barcode for land plants

DAMON P. LITTLE
Cullman Program for Molecular Systematics, The New York Botanical Garden, 2900 Southern Boulevard, Bronx, NY 10458, USA

Abstract

Small portions of the barcode region – mini-barcodes – may be used in place of full-length barcodes to overcome DNA degradation for samples with poor DNA preservation. 591,491,286 rbcL mini-barcode primer combinations were electronically evaluated for PCR universality, and two novel highly universal sets of priming sites were identified. Novel and published rbcL mini-barcode primers were evaluated for PCR amplification [determined with a validated electronic simulation (n = 2765) and empirically (n = 188)], Sanger sequence quality [determined empirically (n = 188)], and taxonomic discrimination [determined empirically (n = 30 472)]. PCR amplification for all mini-barcodes, as estimated by validated electronic simulation, was successful for 90.2–99.8% of species. Overall Sanger sequence quality for mini-barcodes was very low – the best mini-barcode tested produced sequences of adequate quality (B20 ≥ 0.5) for 74.5% of samples. The majority of mini-barcodes provide correct identifications of families in excess of 70.1% of the time. Discriminatory power noticeably decreased at lower taxonomic levels. At the species level, the discriminatory power of the best mini-barcode was less than 38.2%. For samples believed to contain DNA from only one species, an investigator should attempt to sequence, in decreasing order of utility and probability of success, mini-barcodes F (rbcL1/rbcLB), D (F52/R193) and K (F517/R604). For samples believed to contain DNA from more than one species, an investigator should amplify and sequence mini-barcode D (F52/R193).

Keywords: DNA barcoding, rbcL, mini-barcode

Received 25 July 2013; revision received 11 October 2013; accepted 18 October 2013

Correspondence: Damon P. Little, Fax: 1.718.817.8101; E-mail: [email protected]

© 2013 John Wiley & Sons Ltd

Introduction

DNA barcoding is designed to provide specimen identifications to nonspecialists via the DNA sequence of a standardized genomic region. Portions of two plastid protein coding genes – matK and rbcL – have been sanctioned by the Consortium for the Barcode of Life as the land plant barcode (CBOL Plant Working Group 2009). The 5′ half of rbcL, which is used as a barcode, has been termed rbcLa to distinguish it from the complete rbcL coding region (Kress & Erickson 2007).

It can sometimes be difficult to obtain DNA barcode identifications from samples that were not handled in a way specifically intended to preserve DNA. The mode and cause of DNA degradation in nonliving plant tissues is not yet fully understood (Bauer et al. 2003; Staats et al. 2011), but substantial research has focused on the extent of degradation in plant tissues subjected to various types of processing (Bryan et al. 1998; Hellebrand et al. 1998; Hupfer et al. 1998; Straub et al. 1999; Bauer et al. 2003; Duggan et al. 2003; Sandberg et al. 2003; Tilley 2004; Chen et al. 2005; Murray et al. 2007; Gryson et al. 2008; Bergerová et al. 2010; Costa et al. 2010; Bergerová et al. 2011; Särkinen et al. 2012; Fernandes et al. 2013). These empirical experiments have repeatedly shown that PCR amplification success greatly increases with a decrease in amplicon size. From these studies, the median regularly recoverable plant DNA fragment is 190 bp (IQR = 118–286 bp). Given that the median matK barcode is 889 bp (IQR = 880–889) and that rbcLa is uniformly 654 bp (Hollingsworth et al. 2011), one would not expect full-length barcodes to reliably PCR amplify from degraded samples.

To overcome DNA degradation, small portions of the barcode region – mini-barcodes – may be used in place of full-length barcodes (Meusnier et al. 2008). Due to their reduced size, mini-barcodes are presumably PCR-amplified at a higher rate than full-length barcodes, but taxonomic discrimination is often curtailed due to the reduced number of nucleotides (Meusnier et al. 2008; Little 2011).

Although several sets of rbcLa mini-barcode primers have been published, neither PCR universality nor taxonomic discrimination has been systematically evaluated for these primer sets (Poinar et al. 1998; Hofreiter et al. 2000; Palmieri et al. 2009; Murphy et al. 2011; Tables 1 and 2; Fig. 1). It is essential to know the limitations of PCR universality and taxonomic discrimination for mini-barcodes – particularly when more than one species contributed to a sample (e.g. an environmental sample). A species that is not PCR-amplified with a given primer set cannot be detected in a DNA sample and thus will not contribute to characterization of that sample. Likewise, taxa with identical sequences cannot be differentiated and thus may distort the characterization of the sample.

The selection of a mini-barcode mirrors that of full-length barcodes: quantitative analyses of PCR universality, sequence quality and taxonomic discrimination are used to make an informed decision (CBOL Plant Working Group 2009; Hollingsworth et al. 2011). Unlike the selection of full-length barcodes, standardization of mini-barcodes is not as important because while full-length barcodes can be used both as reference sequences and query sequences, mini-barcode sequences are generally used only as query sequences. Thus, different mini-barcodes may be used to answer different research questions without compromising data reuse – the synergy at the core of DNA barcode standardization.

This research was designed to (i) identify a set of (near) universal rbcLa mini-barcode primers, (ii) determine the level of PCR universality of novel and published rbcLa mini-barcode primer sets using an electronic PCR simulation, (iii) measure sequence quality for the most universal rbcLa mini-barcode primer sets and (iv) quantify taxonomic discriminatory power of novel and published rbcLa mini-barcodes.

Table 1. Novel and published rbcLa mini-barcode primers. Position numbers are in reference to the first nucleotide of the rbcL start codon. For CoDeHOP primers (F52, F517, R193 and R604), the consensus clamp (all but the final three bases) and the most common 3′ degenerate core are included in the table. Other 3′ degenerate cores are reported as footnotes

Primer name  Sequence (5′–3′)                3′ position  Orientation  Reference
19bR         CTTCTTCAGGTGGAACTCCAG           137          Reverse      Hofreiter et al. (2000)
F52*         GTTGGATTCAAAGCTGGTGTTA          52           Forward      This paper
F517†        GGWCGTCCMCTATTGGGATGTA          517          Forward      This paper
h1aF         GGCAGCATTCCGAGTAACTCCTC         133          Forward      Poinar et al. (1998)
h1aR         GAGGAGTTACTCGGAATGCTGCC         111          Reverse      Poinar et al. (1998)
h2aR         CGTCCTTTGTAACGATCAAG            229          Reverse      Poinar et al. (1998)
R193‡        CVGTCCAMACAGTWGTCCATGT          193          Reverse      This paper
R604§        CTGRGAGTTMACGTTTTCATCATC        604          Reverse      This paper
rbcL1        TTGGCAGCATTYCGAGTAACTCC         131          Forward      Palmieri et al. (2009)
rbcL2        TGGCAGCATTYCGAGTAACTC           130          Forward      Palmieri et al. (2009)
rbcL19       AGATTCCGCAGCCACTGCAGCCCCTGCTTC  154          Reverse      Poinar et al. (1998)
rbcLA        CCTTTRTAACGATCAAGRC             227          Reverse      Palmieri et al. (2009)
rbcLB        AACCYTCTTCAAAAAGGTC             316          Reverse      Palmieri et al. (2009)
rbcLF2       TGTTTACTTCCATTGTGGGTAATG        370          Forward      Murphy et al. (2011)
rbcLR3a      TTCGGTTTAATAGTACAGCCCAAT        507          Reverse      Murphy et al. (2011)
rbcLZ1       ATGTCACCACAAACAGAGACTAAAGCAAGT  30           Forward      Poinar et al. (1998)
Z1aF         ATGTCACCACCAACAGAGACTAAAGC      26           Forward      Hofreiter et al. (2000)

*AGA, ATA, CGA, CTA, CTC, GTA, TAA, TCA, TGA, TTC, TTG, TTT
†ATA, CTA, GAA, GAC, GAT, GCA, GGA, GTC, GTG, GTT, TCA, TTA, TTT
‡AGT, CGC, CGT, CTT, GGG, GGT, TAT, TCT, TGC, TGG, TTT
§AAC, AAT, ACC, ACT, AGC, ATG, ATT, CTC, GTC, TTC

Table 2. Novel and published rbcLa mini-barcodes. Primers correspond to those in Table 1. Unorthodox combinations of primers are omitted

Mini-barcode  Amplicon size (bp)  Sequence excluding primers (bp)  F primer  R primer
A             157                 110                              Z1aF      19bR
B             183                 123                              rbcLZ1    rbcL19
C             248                 202                              Z1aF      h2aR
D             184                 140                              F52       R193
E             137                 95                               rbcL1     rbcLA
F             226                 184                              rbcL1     rbcLB
G             136                 96                               rbcL2     rbcLA
H             225                 185                              rbcL2     rbcLB
I             138                 95                               h1aF      h2aR
J             184                 136                              rbcLF2    rbcLR3a
K             132                 86                               F517      R604
L             133                 80                               rbcLZ1    h1aR
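The amplicon sizes in Table 2 appear to follow arithmetically from the primer lengths and 3′ positions in Table 1. The short Python sketch below illustrates that relationship for mini-barcodes D, F and K. It assumes that the "3′ position" column gives the rbcL coordinate of the 3′-terminal base of each primer; the dictionary and function names are illustrative and are not taken from the study.

```python
# Primer sequences and 3' positions copied from Table 1.
# Assumption: "3' position" is the rbcL coordinate of the primer's 3'-terminal base.
PRIMERS = {
    "F52":   ("GTTGGATTCAAAGCTGGTGTTA", 52),
    "R193":  ("CVGTCCAMACAGTWGTCCATGT", 193),
    "rbcL1": ("TTGGCAGCATTYCGAGTAACTCC", 131),
    "rbcLB": ("AACCYTCTTCAAAAAGGTC", 316),
    "F517":  ("GGWCGTCCMCTATTGGGATGTA", 517),
    "R604":  ("CTGRGAGTTMACGTTTTCATCATC", 604),
}


def amplicon_sizes(forward, reverse):
    """Return (amplicon size, sequence excluding primers) in bp."""
    fwd_seq, fwd_3p = PRIMERS[forward]
    rev_seq, rev_3p = PRIMERS[reverse]
    first = fwd_3p - len(fwd_seq) + 1   # 5' end of the forward primer
    last = rev_3p + len(rev_seq) - 1    # 5' end of the reverse primer on the template
    amplicon = last - first + 1
    inner = rev_3p - fwd_3p - 1         # bases strictly between the two 3' ends
    return amplicon, inner


for name, pair in {"D": ("F52", "R193"),
                   "F": ("rbcL1", "rbcLB"),
                   "K": ("F517", "R604")}.items():
    print(name, amplicon_sizes(*pair))
```

Under this reading of the position column, the computed values can be compared directly with the corresponding rows of Table 2.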

Materials and methods

Novel primer design

All possible pairs of 18, 20, 22, 24, 26 and 28 bp primers capable of producing a 100–200 bp amplicon were extracted from the rbcLa region of 134 publicly available fully sequenced Embryophyta plastid genomes (all finished genomes available on 2 December 2010 that contained a functional rbcLa region; Appendix S1, Supporting information). Annealing temperature, secondary structure, etc., of each primer pair were evaluated with PRIMER3 1.1.4 (Koressaar & Remm 2007). Primers with annealing temperatures between 55 °C and 63 °C and no evidence of strong secondary structure were retained. All possible combinations of PRIMER3-approved primers were electronically evaluated for PCR universality using RE-PCR 2.3.12 (Schuler 1997). The primers were preliminarily tested, allowing as many as two indels and five mismatches per primer (word size 7, no discontiguous words), on 111 rbcLa sequences (in order to reduce bias, the data set was decreased from 134 by arbitrarily retaining only one species per genus; Appendix S1, Supporting information).

Sequence diversity at the top three pairs of priming sites was surveyed by locating the sites within all available rbcLa sequences and creating an alignment for each site. Sequences were retrieved from GenBank using a recursive BLAST 2.2.21 (Altschul et al. 1990) search: each of the 134 rbcLa sequences (Appendix S1, Supporting information) was queried against the Embryophyta sequences deposited in GenBank; the top 1000 sequences retrieved were in turn each queried against GenBank; this process was repeated until no new sequences were returned. The sequence at each priming site was extracted, from as many of these putative rbcLa sequences as possible, by searching for regions with a matching score greater than 75% of the priming site’s length (indel weight of two and substitution weight of one) using the TRE-AGREP 0.8.0 (Laurikari 2009) implementation of the AGREP algorithm (Wu & Manber 1992). For each site, the set of extracted sequences was filtered to remove duplicates. Alignments of each site were created using MUSCLE 3.8.31 (Edgar 2004).

For each priming site, Consensus-Degenerate Hybrid Oligonucleotide Primers (CoDeHOP; Rose et al. 1998) were designed. A 50% consensus, calculated with SEAVIEW 4.2.6 (Galtier et al. 1996), was used for the ‘consensus clamp’ portion of the primer. The ‘degenerate core’ was composed of all unique variants of the 3′ antepenultimate, penultimate and ultimate bases.
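The CoDeHOP construction described above can be illustrated with a minimal Python sketch. It assumes an ungapped alignment of priming-site sequences written in primer orientation and containing only A, C, G and T. The consensus rule used here (add bases in order of frequency until the threshold is reached, then report the matching IUPAC code) is an assumption made for illustration and may differ from the rule SEAVIEW applies; all function names are hypothetical.

```python
from collections import Counter

# IUPAC codes for every non-empty set of bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G",
    frozenset("T"): "T", frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W", frozenset("GT"): "K",
    frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def consensus_base(column, threshold=0.5):
    """Smallest set of most frequent bases covering `threshold` of a column."""
    counts = Counter(column)
    total = sum(counts.values())
    chosen, covered = set(), 0
    for base, n in counts.most_common():
        chosen.add(base)
        covered += n
        if covered / total >= threshold:
            break
    return IUPAC[frozenset(chosen)]


def codehop(site_seqs, core_len=3, threshold=0.5):
    """Return (consensus clamp, list of unique 3' cores, most common first)."""
    length = len(site_seqs[0])
    if any(len(s) != length for s in site_seqs):
        raise ValueError("sequences must be aligned to the same length")
    clamp_columns = zip(*(s[:-core_len] for s in site_seqs))
    clamp = "".join(consensus_base(col, threshold) for col in clamp_columns)
    cores = [c for c, _ in Counter(s[-core_len:] for s in site_seqs).most_common()]
    return clamp, cores


# Hypothetical six-base priming-site variants, for illustration only.
clamp, cores = codehop(["ACGGTA", "ACGGTA", "ACGCTA", "ATGGAA"])
primers = [clamp + core for core in cores]
```

Each element of `primers` pairs the shared clamp with one 3′ core, which corresponds to the way Table 1 reports a single clamp plus a footnoted list of alternative cores.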

Electronic PCR simulation

The universality of novel and published rbcLa mini-barcode primers was estimated using RE-PCR. Putative rbcLa sequences, retrieved from GenBank via recursive BLAST search, were filtered to remove those containing ambiguous bases, those not identified to species and those containing stop codons. The remaining nucleotide sequences were aligned as amino acids using TRANSLATORX 1.1 (Abascal et al. 2010) and KALIGN 2.04 (Lassmann & Sonnhammer 2005). Electronic PCR using word size 3, discontiguous word count 2, no indels and requiring at least a 70% match for each primer was conducted on sequences that included all novel and published mini-barcode priming sites (Appendix S2, Supporting information). The electronic PCR settings were selected to mimic, as best as possible with RE-PCR, empirically determined properties of Taq polymerase (Sommer & Tautz 1989; Kwok et al. 1990; Sarkar et al. 1990; Huang et al. 1992; Ayyadevara et al. 2000).

Universality at the taxonomic level of order, family, genus and species was considered present if a given mini-barcode primer set electronically amplified all sequences belonging to the taxon in question and did not produce more than one amplicon per sequence. Statistical differences in estimated universality among mini-barcodes were quantified using the binomial distribution, with each mini-barcode and taxon combination considered an independent test. Confidence intervals (Wilson 1927) were calculated in R 2.15.2 (R Development Core Team 2012) using the package HMISC 3.10-1 (Harrell 2012). Differences in universality among mini-barcodes were quantified using Scheffé’s (1953) test with the binomial distribution, at P = 0.05, as implemented in the R package AGRICOLAE 1.1-3 (de Mendiburu 2012).
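As an illustration of the universality criterion and the Wilson (1927) interval, the Python sketch below scores a taxon as universal only when every one of its sequences yields exactly one simulated amplicon, and then computes a Wilson score interval for the resulting proportion. It is a hand-written approximation for exposition, not the HMISC code used in the study; the function names and the example records are hypothetical.

```python
from math import sqrt


def taxon_universality(records):
    """records: iterable of (taxon, n_amplicons) pairs, one per sequence.

    A taxon is universal only if every sequence gave exactly one amplicon.
    Returns (number of universal taxa, total number of taxa).
    """
    universal = {}
    for taxon, n_amplicons in records:
        universal[taxon] = universal.get(taxon, True) and n_amplicons == 1
    return sum(universal.values()), len(universal)


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (default ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical simulation output: (family, amplicons produced by one sequence).
example = [("Fabaceae", 1), ("Fabaceae", 1), ("Poaceae", 1),
           ("Poaceae", 0), ("Pinaceae", 2)]
successes, total = taxon_universality(example)
low, high = wilson_interval(successes, total)
```

In this sketch a sequence that produces two amplicons counts against its taxon just as a failed amplification does, mirroring the rule stated above.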

Discriminatory power

The ability of novel and published rbcLa mini-barcodes to distinguish among orders, families, genera and species was evaluated. Mini-barcode sequences, excluding priming sites, were extracted from the nucleotide multiple sequence alignment (Appendix S3, Supporting information). Discriminatory power was assayed using BRONX 2.0 (Little 2011). GenBank species identifications were retained; however, a comprehensive set of familial and ordinal classifications (Duff et al. 2007; Goffinet et al. 2008; Crandall-Stotler et al. 2009; Christenhusz et al. 2011a, b; Reveal & Chase 2011) was used in place of the GenBank classification (Appendix S3, Supporting information). Each mini-barcode sequence was queried against a reference database containing sequences of the corresponding mini-barcode. Identifications at the taxonomic level of order, family, genus and species were considered correct if BRONX did not return sequences belonging to any other taxon at the taxonomic level in question. The binomial distribution, with each taxon considered an independent test, was used to compute 95% confidence intervals. Differences in discrimination between mini-barcodes were quantified using Scheffé’s test at P = 0.05 (binomial distribution). Correlation between discriminatory power and the length of the sequence (excluding primers) was quantified using Spearman’s (1904) test corrected for multiple comparisons using the method of Benjamini & Hochberg (1995).
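The scoring rule above can be approximated with a short Python sketch. It replaces the BRONX search with exact sequence matching, which is a deliberate simplification and an assumption of this example: a taxon is counted as discriminated only if none of its sequences is shared with another taxon at the same rank. The function name and the records are hypothetical.

```python
from collections import defaultdict


def discrimination_rate(records):
    """records: list of (sequence, taxon) pairs at a single taxonomic rank.

    Exact matching stands in for the BRONX query (a simplification).
    Returns the fraction of taxa whose sequences are not shared with
    any other taxon.
    """
    taxa_by_sequence = defaultdict(set)
    for sequence, taxon in records:
        taxa_by_sequence[sequence].add(taxon)
    all_taxa = {taxon for _, taxon in records}
    confused = {taxon
                for taxa in taxa_by_sequence.values() if len(taxa) > 1
                for taxon in taxa}
    return (len(all_taxa) - len(confused)) / len(all_taxa)


# Hypothetical mini-barcode sequences labelled to species and to genus.
species_records = [("AAC", "sp1"), ("AAC", "sp2"), ("AGC", "sp3"), ("ATC", "sp3")]
genus_of = {"sp1": "G1", "sp2": "G1", "sp3": "G2"}
genus_records = [(seq, genus_of[sp]) for seq, sp in species_records]

species_rate = discrimination_rate(species_records)
genus_rate = discrimination_rate(genus_records)
```

The toy data show the expected pattern: two species that share a sequence cannot be told apart, yet they still resolve correctly at the genus level.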


PCR primer evaluation

Physical testing was conducted on five sets of primers characterized by high rates of predicted PCR amplification and/or discriminatory power. DNA was extracted from 188 arbitrarily selected silica-preserved samples (representing 46 orders, 103 families, 181 genera and 188 species; Appendix S4, Supporting information) using the Qiagen DNeasy96 Kit following the manufacturer’s protocol.

The polymerase chain reaction (PCR) was used to amplify rbcLa and rbcLa mini-barcodes in a 15 µL volume containing: 20 mM Tris pH 8.8, 10 mM KCl, 10 mM (NH4)2SO4, 2 mM MgSO4, 0.1% (v/v) Triton X-100, 5% (w/v) sucrose, 0.025% (w/v) cresol red, 0.025 µg/µL BSA, 0.2 mM dNTPs, 0.5 units Taq polymerase, 0.5 µL genomic DNA and primers (0.5 µM of each primer for reactions containing one primer pair; for CoDeHOP mixtures 0.5 µM of each consensus clamp without the degenerate core along with 0.05 µM of each consensus clamp/degenerate core variant). Primers a_f (5′-ATGTCACCACAAACAGAGACTAAAGC-3′; Levin et al. 2003) and ajf634R (5′-GAAACGGTCTCTCCAACGCAT-3′; Fazekas et al. 2008) were used for rbcLa. Mini-barcodes D, F and K were amplified using pairs of primers (Tables 1 and 2; for CoDeHOP primers the consensus clamp and the most common 3′ degenerate core were used). Mini-barcodes D and K were also amplified using CoDeHOP mixtures – designated D* and K* hereafter (Tables 1 and 2).

For rbcLa, the reaction mixture was incubated for 2.5 min at 95 °C, cycled 35 times (0.5 min at 95 °C, 0.5 min at 58 °C, 0.5 min at 72 °C) and then incubated at 72 °C for 10 min. For rbcLa mini-barcodes, the reaction mixture was incubated for 2.5 min at 95 °C, cycled 10 times (0.5 min at 95 °C, 0.5 min at the annealing temperature, 0.5 min at 72 °C), cycled 25 times (0.5 min at 88 °C, 0.5 min at the annealing temperature, 0.5 min at 72 °C) and then incubated at 60 °C for 10 min (an 88 °C melting temperature was used for the last 25 cycles to conserve Taq activity; Yap & McGee 1991). The annealing temperature for mini-barcodes D, D* and F was 52 °C. For mini-barcodes K and K*, the annealing temperature was 50 °C.

Reactions were evaluated by electrophoresis using 1.2% agarose gels buffered with sodium borate (Brody & Kern 2004) and stained with ethidium bromide. ExoSAP-IT (USB) was used to neutralize unused primers and dNTPs in successful reactions. PCR products were bidirectionally sequenced (Sanger et al. 1977) with BigDye v3.1 (1 µL PCR product in each 5 µL reaction volume) and a 3730 sequencer (Life Technologies) at the High-Throughput Genomics Unit (University of Washington). The amplification primers were used for sequencing reactions containing one primer pair. The consensus clamp without degenerate core was used for sequencing CoDeHOP mixtures.

Base calls and quality values (QV) were calculated using KB 1.4 (Life Technologies). The 5′ end of each sequencing read was trimmed to the first window of 20 nucleotides with two or fewer low quality positions (QV < 20). The 3′ end of each sequence was trimmed by locating and removing the priming sequence using TRE-AGREP. Contigs were assembled and conflicts were resolved by retaining the nucleotide with the highest QV. An index of contig quality, B (Little 2010), was calculated with a c value of 607 bp for rbcLa, 140 bp for mini-barcodes D and D*, 184 bp for mini-barcode F and 86 bp for mini-barcodes K and K*. An x value of 30 was used for rbcLa and a value of 20 was used for all mini-barcodes. Contigs of rbcLa were screened to remove those with B30 values less than 0.75 (bidirectional by definition), those without a full-length open reading frame and those that BRONX identified as being the result of laboratory contamination. Contigs of rbcLa mini-barcodes were screened to remove those without bidirectional reads, those with sequences that did not exactly match the corresponding portion of the rbcLa sequence generated from the same sample and those with B20 values less than 0.5 (mini-barcodes were held to a lower sequence quality standard than full-length barcodes because mini-barcodes are usually used as queries rather than references). Differences in PCR amplification and sequencing success among mini-barcodes were quantified using Scheffé’s test at P = 0.05 (binomial distribution).
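The 5′ trimming rule described above can be written as a sliding-window scan. The Python sketch below is one reading of that rule, under the assumption that the read is kept from the first base of the first qualifying window; the function and parameter names are hypothetical and do not come from the software used in the study.

```python
def trim_5prime_start(qvs, window=20, min_qv=20, max_low=2):
    """Index of the first base to keep, or None if no window qualifies.

    A window qualifies when it holds at most `max_low` positions with
    a quality value below `min_qv`.
    """
    low = [q < min_qv for q in qvs]
    low_in_window = sum(low[:window])
    for start in range(len(qvs) - window + 1):
        if start > 0:
            # Slide by one base: add the entering flag, drop the leaving one.
            low_in_window += low[start + window - 1] - low[start - 1]
        if low_in_window <= max_low:
            return start
    return None


# Hypothetical read: ten poor-quality bases followed by thirty good ones.
qualities = [5] * 10 + [30] * 30
keep_from = trim_5prime_start(qualities)
trimmed = qualities[keep_from:] if keep_from is not None else []
```

Note that, read literally, the rule can leave up to two low quality positions at the start of the retained sequence, because the window only needs to meet the threshold rather than be free of low quality calls.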

Electronic PCR validation

The success or failure of PCR amplification, as evaluated by gel electrophoresis, for mini-barcodes D, F and K was compared to electronic PCR simulations on the 188 rbcLa sequences (Appendix S4, Supporting information) generated from the samples used for PCR. Simulations spanned RE-PCR parameter space: word size varied from 2 to 12 nucleotides, discontiguous word count varied from 1 (completely contiguous) to 3, allowed primer/template mismatch varied from 0 to 50%, and the number of indels allowed varied from 0 to 4. For each mini-barcode, the RE-PCR simulation parameters with the lowest deviation from observed PCR results were compared using McNemar’s (1947) test. Results were corrected for multiple comparisons using the method of Benjamini & Hochberg (1995).
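To make the mismatch parameter concrete, the following Python sketch implements a much simplified electronic PCR: primers are matched to the template without indels, IUPAC degenerate bases are honoured, and a site is accepted when the fraction of matching positions reaches a threshold. It is not RE-PCR and ignores word size and discontiguous words entirely; the function names, the size limit and the default threshold of 0.70 are assumptions chosen to echo the settings described in the methods.

```python
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "K": "GT", "M": "AC", "B": "CGT",
         "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def revcomp(seq):
    """Reverse complement, including IUPAC ambiguity codes."""
    return seq.translate(COMPLEMENT)[::-1]


def sites(primer, template, min_match=0.70):
    """Template offsets where the primer matches without indels."""
    hits = []
    for i in range(len(template) - len(primer) + 1):
        window = template[i:i + len(primer)]
        matches = sum(t in IUPAC[p] for p, t in zip(primer, window))
        if matches / len(primer) >= min_match:
            hits.append(i)
    return hits


def epcr(fwd, rev, template, min_match=0.70, max_size=1000):
    """Sizes of all products predicted for one primer pair on one template."""
    products = []
    for f in sites(fwd, template, min_match):
        for r in sites(revcomp(rev), template, min_match):
            size = r + len(rev) - f
            if len(fwd) + len(rev) <= size <= max_size:
                products.append(size)
    return products


def amplifies(fwd, rev, template, **kwargs):
    """Success means exactly one product, as in the universality criterion."""
    return len(epcr(fwd, rev, template, **kwargs)) == 1
```

A validation in the spirit of the one described above would call `amplifies` for each sample over a grid of `min_match` values and count disagreements with the gel results; that loop is omitted here because the observed data are not part of this text.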

Results

Novel primer design

A total of 24 018 forward and 24 627 reverse PRIMER3-approved rbcLa mini-barcode primers were identified. Therefore, 591,491,286 primer combinations were electronically evaluated for PCR universality. Examination of the top three pairs of priming sites revealed overlapping nucleotide positions. Thus, electronic PCR identified two promising sets of priming sites that flank rbcLa nucleotide positions 53–192 and 518–603 (Fig. 1). In addition, electronic PCR identified priming sites corresponding to rbcLa mini-barcode L (ranked 28th in electronic amplification frequency; Table 2). Nucleotide conservation of rbcLa is generally high, but there are concentrated areas of poorly conserved nucleotide positions (e.g. positions 250–300; Fig. 1).

A total of 42 191 putative rbcLa sequences were retrieved from GenBank via recursive BLAST search. There were 3296 sequence variants in the priming site preceding position 53, 1587 sequence variants in the priming site following position 192, 4632 sequence variants in the priming site preceding position 518 and 3429 sequence variants in the priming site following position 603. These variants ultimately resulted in a consensus clamp without polymorphic bases and 13 degenerate core variants for F52, 3 (15.8%) polymorphic consensus clamp bases and 12 degenerate core variants for R193, 2 (10.5%) polymorphic consensus clamp bases and 14 degenerate core variants for F517, and 2 (9.5%) polymorphic consensus clamp bases and 11 degenerate core variants for R604 (Table 1).

Electronic PCR simulation

A total of 2765 unambiguous rbcLa sequences included all novel and published mini-barcode priming sites. These sequences represented 103 orders, 251 families, 1029 genera and 2233 species.

Mini-barcode CoDeHOP mixtures D* and K* performed statistically worse (P = 0.05), in simulated PCR amplification, than any of the other mini-barcodes including pairs of the most common 3′ degenerate cores (compare D/D* and K/K* in Fig. 2). This discrepancy is the result of CoDeHOP mixtures producing multiple amplicons of different sizes from the same template. Simulated reactions that produced multiple products were considered failures, and thus, the rate of electronic amplification for CoDeHOP mixtures is zero. Other than the profound failure of CoDeHOP mixtures D* and K*, there were no statistically significant differences (P = 0.05) in electronic amplification rates detected with Scheffé’s test.

Mini-barcode K consistently had the highest rates of electronic PCR amplification at all taxonomic levels – ranging from 97.1% (order level) to 99.8% (species level). At the level of order and family, mini-barcode D had the second highest rate of simulated PCR amplification. The relative rank of mini-barcode D fell to third place at the level of genus and eighth place at the level of species. Mini-barcode C had the second highest rate of simulated PCR amplification at the levels of genus and species. The rate of simulated amplification for mini-barcode F was below the median rank at the level of order, family and genus.

In combination, the three mini-barcodes with the highest rates of electronic amplification (C, D and K) were able to amplify all 103 orders. For 12 orders, electronic amplification failed for two of the three mini-barcodes: Arecales (Arecaceae), Asparagales (Iridaceae and Orchidaceae), Caryophyllales (Nyctaginaceae), Fabales (Fabaceae), Funariales (Funariaceae), Jungermanniales (Calypogeiaceae and Trichocoleaceae), Lycopodiales (Lycopodiaceae), Metzgeriales (Aneuraceae, Metzgeriaceae and Mizutaniaceae), Osmundales (Osmundaceae), Polypodiales (Aspleniaceae, Davalliaceae, Polypodiaceae and Tectariaceae), Proteales (Nelumbonaceae) and Selaginellales (Selaginellaceae).

Fig. 1. Novel and published rbcLa mini-barcodes (Tables 1 and 2). The frequency of the modal base was calculated as a measure of nucleotide conservation from 111 fully sequenced plastid genomes (Appendix S1, Supporting information). Positions are numbered in reference to the first nucleotide of the rbcL start codon. Axes: rbcLa nucleotide position (x) and conservation (y).

Fig. 2. Electronic PCR amplification versus discriminatory power of rbcLa mini-barcodes for (a) orders, (b) families, (c) genera and (d) species. Mini-barcode names correspond to those in Table 2. CoDeHOP primer mixtures are indicated by an asterisk. Error bars indicate 95% confidence intervals. Indistinguishable mini-barcodes are plotted as single points. Axes: proportion electronically amplified (x) and ordinal, familial, generic or specific discrimination success (y).

Discriminatory power

There were a total of 30 472 unambiguous rbcLa sequences deposited in GenBank that included the sequenced region for all novel and published mini-barcodes. These sequences represent 116 orders, 394 families, 2656 genera and 5825 species. Using these sequences, most mini-barcodes were able to consistently provide correct identifications at higher taxonomic levels in excess of 68% of the time (Fig. 2). Discriminatory power noticeably decreased at lower taxonomic levels. At the species level, the discriminatory power of the best mini-barcode was 38.2%. No statistically significant differences in discriminatory power were detected among the mini-barcodes examined (P = 0.05). Mini-barcodes C, F and H consistently ranked in the top three at all taxonomic levels. Mini-barcode D was always fifth ranked. Mini-barcode K has low discriminatory power – it was perpetually the second (family, genus and species) or third (order level) lowest ranked mini-barcode. Mini-barcode L consistently had the lowest discriminatory power.

Discriminatory power was positively correlated with the length of the sequence (excluding primers) at all taxonomic levels (order P = 2.173 × 10⁻⁴, family P = 4.660 × 10⁻⁵, genus P = 4.660 × 10⁻⁵, species P = 7.950 × 10⁻⁸).
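The multiple-comparison correction named in the methods (Benjamini & Hochberg 1995) can be sketched in a few lines of Python. The function below is a generic step-up adjustment written for illustration, not the code used in the study, and the input P values are hypothetical. One property worth noting is that the adjustment enforces monotonicity, so two tests with different raw P values can receive identical adjusted values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, carrying the running minimum
    # so that adjusted values never increase as raw values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P values for four tests.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
```

With these illustrative inputs, the second and third tests end up with the same adjusted value, as do the first and fourth.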

PCR primer evaluation

PCR amplification, as evaluated by gel electrophoresis, for mini-barcodes D, F and K was very successful – amplifying 97.9–98.9% of the samples examined. Amplification using CoDeHOP mixture D* was also largely successful (91.5%). The frequency of PCR amplification for CoDeHOP mixture K* was, however, significantly worse (P = 0.05) than all other mini-barcodes – it was able to amplify only 35.6% of samples. Statistical differences could not be discerned among the other mini-barcodes.

The sequence quality of mini-barcode K and CoDeHOP mixture K* was significantly worse (P = 0.05) than that of all other mini-barcodes – none of the 188 samples produced sequences with B20 ≥ 0.5 using either primer set. Unambiguous statistical differences between the other mini-barcodes were not apparent: 30.9% of samples produced adequate quality sequences for mini-barcode D, 53.2% for CoDeHOP mixture D* and 74.5% for mini-barcode F.

Electronic PCR validation

The electronic simulations of mini-barcode D amplification that deviated the least from the observed PCR result differed for four samples (2% error), optimal simulation parameters for mini-barcode F differed from observed results by four samples (2% error) and the best predictions for mini-barcode K differed from observed results by two samples (1% error). The observed pattern of PCR amplification was not significantly different from the pattern predicted by simulation when using optimal parameters (P = 0.26–0.62).

There were 21 optimal combinations of simulation parameters for mini-barcode D, two combinations for mini-barcode F and 18 combinations for mini-barcode K. Two combinations of word size (2 or 3 nucleotides) and discontiguous word count (2) were common to the optimal simulation parameters for all three mini-barcodes. The primer/template mismatch parameter common to the optimal setting for all three mini-barcodes was 27–32% of the length of the shortest primer in a pair. There were no indel parameters common to all three mini-barcodes: optimal values were 0, 2 or 3 – with 0 occurring in two-thirds of the optimal parameter combinations common to all three mini-barcodes.

Discussion

The selection of optimal mini-barcodes should follow the same three criteria employed in the selection of full-length barcodes: (i) PCR universality (the ability to regularly retrieve the marker from a sample), (ii) sequence quality (the ability to accurately read the sequence from the PCR products generated) and (iii) taxonomic discrimination (the ability to distinguish between taxa using the sequences; CBOL Plant Working Group 2009; Hollingsworth et al. 2011).

PCR universality

Although the relationship between simulated and real PCR is imprecisely known, the validation experiment demonstrates that it is possible to closely mimic real PCR using RE-PCR (1–2% error). Optimal RE-PCR settings, identified in the validation experiment, were used to calculate mini-barcode PCR universality. Thus, the in silico estimates reported here are expected to closely parallel in vitro results. The top performing mini-barcodes (C, D and K) are expected to amplify almost all template DNA samples – failing to amplify all species in less than 5% of families.

The rate of electronic PCR amplification for mini-barcodes A, B, C and L may be artificially inflated due to the inadvertent inclusion of synthetic oligo sequence alongside genomic sequence in GenBank records. The PCR amplification of rbcL and rbcLa almost always relies upon a forward primer that begins at the rbcL start codon and includes the first eight to ten codons – just like the forward primers for mini-barcodes A, B, C and L. The sequences that include these priming sites are highly unusual – representing about 9% of the 30 472 sequences in GenBank that include the sequenced region of all mini-barcodes. Although some of these sequences were generated using protocols that preserve the true genomic sequence, it is likely that a substantial number of them were simply improperly trimmed and thus contain synthetic oligo sequence fused to genuine genomic sequence.

Sequence quality

Overall Sanger sequence quality for mini-barcodes was very low. Although mini-barcode F produced the highest quality sequences, these quality values are lower than those typically reported for rbcLa (Little 2010; Jeanson et al. 2011; Aubriot et al. 2013). The underlying cause(s) of poor sequence quality could not be determined.

The sequence quality values reported here are poor predictors of the quality values that can be expected from samples with DNA from multiple species (e.g. environmental samples) because such samples cannot be directly sequenced using the Sanger technique. Such samples must be sequenced directly using next-generation single molecule sequencing techniques, or individual sequence types must first be isolated by cloning or digital PCR (Vogelstein & Kinzler 1999) followed by Sanger sequencing. Higher sequence quality values are expected from next-generation sequencing and Sanger sequencing of cloned PCR products. Thus, the poor sequence quality values reported here should not influence the selection of an rbcLa mini-barcode for projects that will not use direct Sanger sequencing.

Sequences derived using CoDeHOP mixture D* had better average sequence quality than those of the same region generated using a single primer pair. Unfortunately, the PCR amplification rate of D* is 6.38 percentage points lower than that of D. Further optimization of the PCR conditions may boost the rate of D* amplification. Alternatively, amplifications of D produced using a single primer pair could be sequenced using the consensus clamp without the degenerate core – as D* products were sequenced.

Taxonomic discrimination

Among the GenBank rbcLa sequences used for discrimination calculations, some are likely misidentified. In addition, some are probably misclassified due to the inconsistent application of competing taxonomic concepts. Although the magnitude of these errors cannot be determined, they lead inexorably to a decrease in the estimate of discrimination success. The presence of identification and classification errors in the data set is not expected to change the relative rates of discrimination. Thus, it is possible to distinguish between mini-barcodes even though the exact magnitude of each mini-barcode’s discriminatory power cannot be known.

None of the rbcLa mini-barcodes examined are capable of distinguishing the majority of species from one another (Fig. 2D). This result is not unexpected given the low rate of species discrimination observed in rbcLa sequences generally (e.g. CBOL Plant Working Group 2009; Little 2011; Pang et al. 2012). Thus, one should not rely upon the species identifications provided by even the most discriminatory of rbcLa mini-barcodes without careful validation.

Although mini-barcodes C, F, H and J were able to distinguish among a majority of genera, the rate of correct identification is relatively low (54–59%; Fig. 2C). As with the identification of species using rbcLa mini-barcodes, the identification of genera requires careful validation before the results can be considered trustworthy. It is likely that additional information (e.g. geographical distribution) could be used to increase the rate of discrimination success to an acceptable level in some circumstances.

Mini-barcodes C, F and H have a relatively high (>85%) rate of discrimination success for families and orders (Fig. 2A,B). Unfortunately, the identifications provided by these mini-barcodes cannot be accepted uncritically because there is a substantial possibility of error – particularly if additional information cannot be used to limit the taxonomic scope of the query and thus increase the rate of discrimination success. Careful validation of the combination of taxa and sequences used in a study is advisable.

Identifications provided by mini-barcodes with poor discriminatory power are tenuous at best and cannot be used to draw strong conclusions. For example, Poinar et al. (1998) used mini-barcode B to identify 13 sequence types to family. In the evaluation reported here, mini-barcode B was able to consistently distinguish between 70% of families (Fig. 2B). Of the four sequence types that had an exact match in the reference database used by Poinar et al., only two belong to families that can be consistently identified using mini-barcode B. Unfortunately, Poinar et al. used all of these identifications to make inferences about the diet of an extinct ground sloth.

The rate of taxonomic discrimination is positively correlated with the length of the mini-barcode. Unfortunately, long mini-barcodes are the least likely to be amplified from samples with degraded DNA. If resources permit, an investigator may wish to attempt PCR amplification of mini-barcodes in decreasing order of amplicon size, thereby maximizing the chance of producing a correct identification.

Conclusions

When it is impossible to obtain full-length barcode sequences via Sanger sequencing from a sample due to DNA degradation, mini-barcodes are the best option. Although PCR amplification of a mini-barcode is more likely to be successful, the taxonomic discriminatory power of a mini-barcode is impaired relative to that of a full-length barcode.

For samples believed to contain DNA from only one species, an investigator should first attempt to amplify mini-barcode F (it has relatively high PCR universality, sequence quality and taxonomic discrimination, but it is relatively large and therefore less likely to amplify). If amplification or sequencing of mini-barcode F fails, the investigator should next attempt to amplify mini-barcode D using either the CoDeHOP mixture or a single primer pair (D has higher estimated PCR universality, acceptable sequence quality and respectable discriminatory power; it is smaller than mini-barcode F and therefore more likely to amplify from a degraded DNA sample). If amplification or sequencing of mini-barcode D fails, mini-barcode K is the next best option (it has the highest estimated PCR universality but, unfortunately, poor sequence quality and low discriminatory power; it is the smallest mini-barcode and therefore the most likely to amplify).

For samples believed to contain DNA from more than one species, an investigator should amplify and sequence mini-barcode D using either the CoDeHOP mixture or a single primer pair. Mini-barcode D has the highest rate of taxonomic discrimination among the mini-barcodes that are less than 200 bp in length. Therefore, it offers the best compromise between estimated PCR universality, probability of amplification success, sequence quality and taxonomic discrimination.
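The recommended order of attempts can be written down as a small decision procedure. The following Python sketch is illustrative only: the function names and the `attempt` callback (which stands in for the laboratory steps of PCR amplification and sequencing) are assumptions introduced here and are not part of the study. Only the mini-barcode letters and their order are taken from the recommendations above.

```python
from typing import Callable, Optional

# Orders of preference taken from the recommendations above.
# The letters are the mini-barcode names used in this article.
SINGLE_SPECIES_ORDER = ["F", "D", "K"]
MIXED_SAMPLE_ORDER = ["D"]


def candidate_order(mixed_sample: bool) -> list[str]:
    """Return the mini-barcodes to attempt, most preferred first."""
    return list(MIXED_SAMPLE_ORDER if mixed_sample else SINGLE_SPECIES_ORDER)


def first_usable(mixed_sample: bool, attempt: Callable[[str], bool]) -> Optional[str]:
    """Try each mini-barcode in turn and stop at the first success.

    `attempt` is a hypothetical callback: it should return True when both
    amplification and sequencing of the named mini-barcode succeeded.
    """
    for name in candidate_order(mixed_sample):
        if attempt(name):
            return name
    return None  # every candidate failed
```

For instance, for a single-species sample in which only mini-barcode F fails, `first_usable(False, lambda name: name != "F")` falls through to the next candidate, D.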

Acknowledgements

I thank Daniel Atha, Elisa Suganuma, Leah Reilly and Rolando Rojas for providing excellent technical assistance. Funding from the Alfred P. Sloan Foundation (2010-6-02) is gratefully acknowledged.

References

Abascal F, Zardoya R, Telford MJ (2010) TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Research, 38, W7–W13.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology, 215, 403–410.
Aubriot X, Lowry PP, Cruaud C, Couloux A, Haevermans T (2013) DNA barcoding in a biodiversity hot spot: potential value for the identification of Malagasy Euphorbia L. listed in CITES Appendices I and II. Molecular Ecology Resources, 13, 57–65.
Ayyadevara S, Thaden JJ, Reis RJS (2000) Discrimination of primer 3′-nucleotide mismatch by Taq DNA polymerase during polymerase chain reaction. Analytical Biochemistry, 284, 11–18.
Bauer T, Weller P, Hammes WP, Hertel C (2003) The effect of processing parameters on DNA degradation in food. European Food Research and Technology, 217, 338–343.
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57, 289–300.
Bergerová E, Hrnčírová Z, Stankovská M, Lopašovská M, Siekel P (2010) Effect of thermal treatment on the amplification and quantification of transgenic and nontransgenic soybean and maize DNA. Food Analytical Methods, 3, 211–218.
Bergerová E, Godálová Z, Siekel P (2011) Combined effects of temperature, pressure and low pH on the amplification of DNA of plant derived foods. Czech Journal of Food Sciences, 29, 337–345.
Brody JR, Kern SE (2004) Sodium boric acid: a Tris-free, cooler conductive medium for DNA electrophoresis. BioTechniques, 36, 214–216.
Bryan GJ, Dixon A, Gale MD, Wiseman G (1998) A PCR based method for the detection of hexaploid bread wheat adulteration of durum wheat and pasta. Journal of Cereal Science, 28, 135–145.
CBOL Plant Working Group (2009) A DNA barcode for land plants. Proceedings of the National Academy of Sciences, 106, 12794–12797.
Chen Y, Wang Y, Ge Y, Xu B (2005) Degradation of endogenous and exogenous genes of Roundup-Ready soybean during food processing. Journal of Agricultural and Food Chemistry, 53, 10239–10243.
Christenhusz MJM, Reveal JL, Farjon A, Gardner MF, Mill RR, Chase MW (2011a) A new classification and linear sequence of extant gymnosperms. Phytotaxa, 19, 55–70.
Christenhusz MJM, Zhang XC, Schneider H (2011b) A linear sequence of extant families and genera of lycophytes and ferns. Phytotaxa, 19, 7–54.
Costa J, Mafra I, Amaral JS, Oliveira MBPP (2010) Monitoring genetically modified soybean along the industrial soybean oil extraction and refining processes by polymerase chain reaction techniques. Food Research International, 43, 301–306.
Crandall-Stotler B, Stotler RE, Long DG (2009) Phylogeny and classification of the Marchantiophyta. Edinburgh Journal of Botany, 66, 155–198.
Duff RJ, Villarreal JC, Cargill DC, Renzaglia KS (2007) Progress and challenges toward developing a phylogeny and classification of the hornworts. The Bryologist, 110, 214–243.
Duggan PS, Chambers PA, Heritage J, Forbes JM (2003) Fate of genetically modified maize DNA in the oral cavity and rumen of sheep. British Journal of Nutrition, 89, 159–166.
Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32, 1792–1797.
Fazekas AJ, Burgess KS, Kesanakurti PR et al. (2008) Multiple multilocus DNA barcodes from the plastid genome discriminate plant species equally well. PLoS One, 3, e2802.
Fernandes TJR, Oliveira MBPP, Mafra I (2013) Tracing transgenic maize as affected by bread making process and raw material for the production of a traditional maize bread, broa. Food Chemistry, 138, 687–692.
Galtier N, Gouy M, Gautier C (1996) SeaView and Phylo_win, two graphic tools for sequence alignment and molecular phylogeny. Computer Applications in the Biosciences, 12, 543–548.
Goffinet B, Buck WR, Shaw AJ (2008) Morphology and classification of the Bryophyta. In: Bryophyte Biology, 2nd edn. (eds Goffinet B, Shaw AJ), pp. 55–138. Cambridge University Press, New York.
Gryson N, Messens K, Dewettinck K (2008) PCR detection of soy ingredients in bread. European Food Research and Technology, 227, 345–351.
Harrell FE (2012) The Hmisc Package version 3.10-1. http://cran.r-project.org/. Accessed 5 January 2013.
Hellebrand M, Nagy M, Mörsel JT (1998) Determination of DNA traces in rapeseed oil. Zeitschrift für Lebensmitteluntersuchung und -Forschung A, 206, 237–242.
Hofreiter M, Poinar HN, Spaulding WG et al. (2000) A molecular analysis of ground sloth diet through the last glaciation. Molecular Ecology, 9, 1975–1984.
Hollingsworth PM, Graham SW, Little DP (2011) Choosing and using a plant DNA barcode. PLoS One, 6, e19254.
Huang MM, Arnheim N, Goodman MF (1992) Extension of base mispairs by Taq DNA polymerase: implications for single nucleotide discrimination in PCR. Nucleic Acids Research, 20, 4567–4573.
Hupfer C, Hotzel H, Sachse K, Engel KH (1998) Detection of the genetic modification in heat-treated products of Bt maize by polymerase chain reaction. Zeitschrift für Lebensmitteluntersuchung und -Forschung A, 206, 203–207.
Jeanson ML, Labat JN, Little DP (2011) DNA barcoding: a new tool for palm taxonomists? Annals of Botany, 108, 1445–1451.
Koressaar T, Remm M (2007) Enhancements and modifications of primer design program Primer3. Bioinformatics, 23, 1289–1291.
Kress WJ, Erickson DL (2007) A two-locus global DNA barcode for land plants: the coding rbcL gene complements the noncoding trnH–psbA spacer region. PLoS One, 2, e508.
Kwok S, Kellogg DE, McKinney N et al. (1990) Effects of primer template mismatches on the polymerase chain reaction: human immunodeficiency virus type 1 model studies. Nucleic Acids Research, 18, 999–1005.
Lassmann T, Sonnhammer EL (2005) Kalign – an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics, 6, 298.
Laurikari V (2009) TRE: The Free and Portable Approximate Regex Matching Library. http://laurikari.net/tre/. Accessed 13 May 2013.
Levin RA, Wagner WL, Hoch PC et al. (2003) Family-level relationships of Onagraceae based on chloroplast rbcL and ndhF data. American Journal of Botany, 90, 107–115.
Little DP (2010) A unified index of sequence quality and contig overlap for DNA barcoding. Bioinformatics, 26, 2780–2781.
Little DP (2011) DNA barcode sequence identification incorporating taxonomic hierarchy and within taxon variability. PLoS One, 6, e20552.
McNemar Q (1947) Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12, 153–157.
de Mendiburu F (2012) Agricolae version 1.1-3. http://cran.r-project.org/. Accessed 5 January 2013.
Meusnier I, Singer G, Landry JF, Hickey D, Hebert P, Hajibabaei M (2008) A universal DNA mini-barcode for biodiversity analysis. BMC Genomics, 9, 214.
Murphy TM, Ben-Yehuda N, Taylor RE, Southon JR (2011) Hemp in ancient rope and fabric from the Christmas Cave in Israel: talmudic background and DNA sequence identification. Journal of Archaeological Science, 38, 2579–2588.
Murray SR, Butler RC, Hardacre AK, Timmerman-Vaughan GM (2007) Use of quantitative real-time PCR to estimate maize endogenous DNA degradation after cooking and extrusion or in food products. Journal of Agricultural and Food Chemistry, 55, 2231–2239.
Palmieri L, Bozza E, Giongo L (2009) Soft fruit traceability in food matrices using real-time PCR. Nutrients, 1, 316–328.
Pang X, Liu C, Shi L et al. (2012) Utility of the trnH–psbA intergenic spacer region and its combinations as plant DNA barcodes: a meta-analysis. PLoS One, 7, e48833.
Poinar HN, Hofreiter M, Spaulding WG et al. (1998) Molecular coproscopy: dung and diet of the extinct ground sloth Nothrotheriops shastensis. Science, 281, 402–406.
R Development Core Team (2012) R: A Language and Environment for Statistical Computing (version 2.15.2). R Foundation for Statistical Computing, Vienna, Austria.
Reveal JL, Chase MW (2011) APG III: bibliographical information and synonymy of Magnoliidae. Phytotaxa, 19, 71–134.
Rose TM, Schultz ER, Henikoff JG, Pietrokovski S, McCallum CM, Henikoff S (1998) Consensus-degenerate hybrid oligonucleotide primers for amplification of distantly related sequences. Nucleic Acids Research, 26, 1628–1635.
Sandberg M, Lundberg L, Ferm M, Malmheden Yman I (2003) Real time PCR for the detection and discrimination of cereal contamination in gluten free foods. European Food Research and Technology, 217, 344–349.
Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, 74, 5463–5467.
Sarkar G, Cassady J, Bottema CDK, Sommer SS (1990) Characterization of polymerase chain reaction amplification of specific alleles. Analytical Biochemistry, 186, 64–68.
Särkinen T, Staats M, Richardson JE, Cowan RS, Bakker FT (2012) How to open the treasure chest? Optimising DNA extraction from herbarium specimens. PLoS One, 7, e43808.
Scheffé H (1953) A method for judging all contrasts in the analysis of variance. Biometrika, 40, 87–104.
Schuler GD (1997) Sequence mapping by electronic PCR. Genome Research, 7, 541–550.
Sommer R, Tautz D (1989) Minimal homology requirements for PCR primers. Nucleic Acids Research, 17, 6749.
Spearman C (1904) The proof and measurement of association between two things. The American Journal of Psychology, 15, 72–101.
Staats M, Cuenca A, Richardson JE et al. (2011) DNA damage in plant herbarium tissue. PLoS One, 6, e28448.
Straub JA, Hertel C, Hammes WP (1999) Limits of a PCR based detection method for genetically modified soya beans in wheat bread production. Zeitschrift für Lebensmitteluntersuchung und -Forschung A, 208, 77–82.
Tilley M (2004) PCR amplification of wheat sequences from DNA extracted during milling and baking. Cereal Chemistry, 81, 44–47.
Vogelstein B, Kinzler KW (1999) Digital PCR. Proceedings of the National Academy of Sciences of the United States of America, 96, 9236–9241.
Wilson EB (1927) Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22, 209–212.
Wu S, Manber U (1992) Fast text searching: allowing errors. Communications of the ACM, 35, 83–91.
Yap EPH, McGee JO (1991) Short PCR product yields improved by lower denaturation temperatures. Nucleic Acids Research, 19, 1713.

D.P.L. designed the study, analyzed the data and wrote the manuscript.

Data accessibility

Newly generated rbcLa sequences have been deposited in GenBank (Accessions: KF724189–KF724376).

Supporting Information

Additional Supporting Information may be found in the online version of this article:

Appendix S1 The rbcLa sequences of fully sequenced plastid genomes used to generate and evaluate mini-barcode primers as well as to calculate nucleotide conservation.
Appendix S2 The rbcLa sequences used to electronically evaluate primer PCR universality.
Appendix S3 The rbcLa sequences used to evaluate discriminatory power of mini-barcodes. Mini-barcode names correspond to those in Table 2.
Appendix S4 Specimens used to test rbcLa mini-barcode primers.
