Molecular Ecology Resources (2014) 14, 871–881

doi: 10.1111/1755-0998.12235

VIP Barcoding: composition vector-based software for rapid species identification based on DNA barcoding

LONG FAN,* JEROME H. L. HUI,* ZU GUO YU†‡ and KA HOU CHU*

*School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
†School of Mathematics and Computational Science, Xiangtan University, Hunan 411105, China
‡School of Mathematical Sciences, Queensland University of Technology, Brisbane Q4001, Qld, Australia

Abstract

Species identification based on short sequences of DNA markers, that is, DNA barcoding, has emerged as an integral part of modern taxonomy. However, software for the analysis of large and multilocus barcoding data sets is scarce. The Basic Local Alignment Search Tool (BLAST) is currently the fastest tool capable of handling large databases (e.g. >5000 sequences), but its accuracy is a concern, and it has been criticized for its local optimization. More accurate software, in turn, requires sequence alignment or complex calculations, which are time-consuming for large data sets during data preprocessing or the search stage. Therefore, it is imperative to develop a practical program for both accurate and scalable species identification for DNA barcoding. In this context, we present VIP Barcoding: user-friendly software with a graphical user interface for rapid DNA barcoding. It adopts a hybrid, two-stage algorithm. First, an alignment-free composition vector (CV) method is used to reduce the search space by screening a reference database. The alignment-based K2P distance nearest-neighbour method is then employed to analyse the smaller data set generated in the first stage. In comparison with other software, we demonstrate that VIP Barcoding has (i) higher accuracy than Blastn and several alignment-free methods and (ii) higher scalability than alignment-based distance methods and character-based methods. These results suggest that this platform is able to deal with both large-scale and multilocus barcoding data with accuracy and can contribute to DNA barcoding for modern taxonomy. VIP Barcoding is free and available at http://msl.sls.cuhk.edu.hk/vipbarcoding/.

Keywords: DNA barcoding, sequence analysis, software

Received 9 August 2013; revision received 22 January 2014; accepted 24 January 2014
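The two-stage idea described in the abstract (alignment-free screening, then K2P nearest-neighbour search on the shortlist) can be illustrated with a minimal Python sketch. This is not the VIP Barcoding implementation: the first stage here uses plain k-mer frequency vectors compared by cosine similarity as a simplified stand-in for the composition vector method, and the second stage assumes the sequences are already aligned and of equal length. The function names (`kmer_vector`, `cosine`, `k2p_distance`, `identify`) and the parameters `k` and `shortlist` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

PURINES = {"A", "G"}  # the remaining bases, C and T, are pyrimidines


def kmer_vector(seq, k=3):
    """Relative frequencies of overlapping k-mers (simplified composition vector)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(key, 0.0) for key, x in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def k2p_distance(a, b):
    """Kimura 2-parameter distance between two pre-aligned sequences.

    d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q)), where P and Q are the
    proportions of transitions and transversions among comparable sites.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    diffs = [(x, y) for x, y in pairs if x != y]
    # A transition stays within purines or within pyrimidines.
    transitions = sum(1 for x, y in diffs if (x in PURINES) == (y in PURINES))
    p = transitions / len(pairs)
    q = (len(diffs) - transitions) / len(pairs)
    if 1 - 2 * q <= 0 or 1 - 2 * p - q <= 0:
        return math.inf  # sequences too divergent for the K2P correction
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def identify(query, references, k=3, shortlist=2):
    """Stage 1: rank references by k-mer similarity and keep a shortlist.
    Stage 2: return the (species, distance) pair with the smallest K2P distance.
    """
    qv = kmer_vector(query, k)
    ranked = sorted(references, key=lambda r: cosine(qv, kmer_vector(r[1], k)), reverse=True)
    candidates = ranked[:shortlist]
    return min(((sp, k2p_distance(query, seq)) for sp, seq in candidates), key=lambda t: t[1])


# Toy reference library with made-up sequences (not real barcodes).
references = [
    ("sp_A", "ACGTACGTACGTACGTACGT"),
    ("sp_B", "ACGTACGTACGAACGTACGT"),
    ("sp_C", "TTTTTGGGGGCCCCCAAAAA"),
]
best_species, best_distance = identify("ACGTACGTACGTACGTACGT", references)
```

The design point the sketch tries to convey is that the cheap first stage touches every reference sequence, while the more expensive distance calculation runs only on the shortlist, which is what makes the hybrid approach scale to large reference libraries.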

Correspondence: Ka Hou Chu, Fax: +852-2603-5391; E-mail: [email protected]

© 2014 John Wiley & Sons Ltd

Introduction

The traditional species identification approach based on morphology is limited by its high demands on time, expertise and labour. Furthermore, morphological differences within a species at different life stages, and partial specimens, complicate precise assignment (Valentini et al. 2009). As a complementary tool for species identification, DNA barcoding was proposed by Hebert et al. in 2003 and may help to improve this situation (Hajibabaei et al. 2007). In the past decade, this molecular approach has been successfully applied in studies of pest monitoring, conservation biology, biodiversity estimation and other associated fields (Savolainen et al. 2005; Hajibabaei et al. 2007; Valentini et al. 2009; Collins et al. 2012a; Taylor & Harris 2012; van Velzen et al. 2012). Yet, DNA barcoding encounters several problems which need to be addressed (Taylor & Harris 2012), such as the identification of new markers for improving resolution, the construction of comprehensive reference databases and the capability to analyse high-throughput data.

Among the aforementioned problems, data analysis is the step that directly determines the final results for species identification in DNA barcoding. To date, various approaches have been developed for matching a query sequence against a reference barcode library. In general, these approaches can be classified into the following four categories (Little & Stevenson 2007; Little 2011; van Velzen et al. 2012): (i) tree-based methods [also called clustering or phylogenetic methods, for example, neighbour joining (NJ), maximum likelihood (ML), ATIM (Little & Stevenson 2007), SAP (Munch et al. 2008) and CAOS (Sarkar et al. 2008)]; (ii) similarity-based methods [e.g. 1NN (one nearest-neighbour or best match method; Meier et al. 2006; Austerlitz et al. 2009), BLAST (Altschul et al. 1990), TaxI (Steinke et al. 2005), BRONX (Little 2011) and jMOTU (Jones et al. 2011)]; (iii) character-based diagnostics [also called diagnostic methods, e.g. BLOG (Bertolazzi et al. 2009) and DNA-BAR (DasGupta et al. 2005)]; and (iv) statistical methods (Matz & Nielsen 2005; Nielsen & Matz 2006; Abdo & Golding 2007).

These data analysis methods for DNA barcoding have been compared previously (Meier et al. 2006; Little & Stevenson 2007; Ross et al. 2008; Austerlitz et al. 2009; Virgilio et al. 2010; Little 2011; Reid et al. 2011; Zou et al. 2011; van Velzen et al. 2012), and the conclusions are summarized as follows: (i) none of the methods consistently outperforms the others for all data sets; (ii) character-based diagnostics such as DNA-BAR and BLOG usually have the highest accuracy; (iii) the nearest-neighbour method (or best match method) based on the Kimura 2-parameter (K2P) distance [i.e. K2P-1NN (Meier et al. 2006; Austerlitz et al. 2009)] is more reliable than most of the other methods except character-based diagnostics; and (iv) tree-based methods perform the worst with the lowest accuracy, even when compared to BLAST. These methodological comparisons reveal two existing problems. First, apart from BLAST, most of the methods have difficulty in coping with large data sets (e.g. >5000 sequences; van Velzen et al. 2012), and that is why the tested data sets used in these comparisons are usually small (
