RESEARCH ARTICLE

Family-Based Benchmarking of Copy Number Variation Detection Software

Marcel Elie Nutsua1, Annegret Fischer1, Almut Nebel1, Sylvia Hofmann1, Stefan Schreiber1, Michael Krawczak2, Michael Nothnagel2,3*

1 Institute of Clinical Molecular Biology, Christian-Albrechts University, Kiel, Germany; 2 Institute of Medical Informatics and Statistics, Christian-Albrechts University, Kiel, Germany; 3 Cologne Center for Genomics, University of Cologne, Cologne, Germany

* [email protected]

Abstract

OPEN ACCESS

Citation: Nutsua ME, Fischer A, Nebel A, Hofmann S, Schreiber S, Krawczak M, et al. (2015) Family-Based Benchmarking of Copy Number Variation Detection Software. PLoS ONE 10(7): e0133465. doi:10.1371/journal.pone.0133465

Editor: Michael Edward Zwick, Emory University School of Medicine, UNITED STATES

Received: January 27, 2015; Accepted: June 29, 2015; Published: July 21, 2015

Copyright: © 2015 Nutsua et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: All data used in this study were generated using the Affymetrix Genome-Wide Human SNP Array 6.0 Sample Data Set and are available on DVD on demand from Affymetrix Inc. (http://www.affymetrix.com/support/technical/byproduct.affx?product=genomewidesnp_6).

Funding: This study was supported by the German Ministry of Education and Research (BMBF) through the DFG Cluster of Excellence EXC 306 “Inflammation at Interfaces”. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

The analysis of structural variants, in particular of copy-number variations (CNVs), has proven valuable in unraveling the genetic basis of human diseases. Hence, a large number of algorithms have been developed for the detection of CNVs in SNP array signal intensity data. Using the European and African HapMap trio data, we undertook a comparative evaluation of six commonly used CNV detection software tools, namely Affymetrix Power Tools (APT), QuantiSNP, PennCNV, GLAD, R-gada and VEGA, and assessed their level of pairwise prediction concordance. The tool-specific CNV prediction accuracy was assessed in silico by way of intra-familial validation. Software tools differed greatly in terms of the number and length of the CNVs predicted as well as the number of markers included in a CNV. All software tools predicted substantially more deletions than duplications. Intra-familial validation revealed consistently low levels of prediction accuracy as measured by the proportion of validated CNVs (34-60%). Moreover, up to 20% of apparent family-based validations were found to be due to chance alone. Software using Hidden Markov models (HMM) showed a trend to predict fewer CNVs than segmentation-based algorithms albeit with greater validity. PennCNV yielded the highest prediction accuracy (60.9%). Finally, the pairwise concordance of CNV prediction was found to vary widely with the software tools involved. We recommend HMM-based software, in particular PennCNV, rather than segmentation-based algorithms when validity is the primary concern of CNV detection. QuantiSNP may be used as an additional tool to detect sets of CNVs not detectable by the other tools. Our study also reemphasizes the need for laboratory-based validation, such as qPCR, of CNVs predicted in silico.

Introduction

The term ‘copy number variation’ (CNV) refers to the occurrence of moderately sized stretches of DNA (>1 kb) that exhibit inter-individual differences in the number of times they occur in a genome [1,2]. Scientific interest in human CNVs has been stirred partly by the fact that only a minor proportion of the heritability of common complex diseases is explained by

PLOS ONE | DOI:10.1371/journal.pone.0133465 July 21, 2015

1 / 17

Benchmarking of CNV Detection Software

Competing Interests: The authors have declared that no competing interests exist.

disease-associated single-nucleotide polymorphisms (SNPs) [3–5]. Other forms of genetic variation, including CNVs, are therefore likely to play an important role in the etiology of these diseases [3]. Correspondingly, CNVs have been implicated in various common disorders, including Crohn disease [6], rheumatoid arthritis and diabetes [7], psoriasis [8], intellectual disability [9], obesity [10], myocardial infarction [11], schizophrenia [12] and autism [13]. While CNV detection in the past was based solely upon aCGH or SNP array signal intensity data [14], technological progress in DNA sequencing today allows direct detection of CNVs [15], for example, in exomes [16]. However, genome-wide SNP array data still form an important basis of CNV detection, not least due to their ample availability from past genome-wide association studies (GWAS). A re-assessment of phenotypic associations of CNVs in these large legacy sample collections is warranted and to be expected in the coming years.

A variety of software tools have been developed for the detection of CNVs in SNP array data. Depending upon the underlying mathematical model, these tools can be divided broadly into two classes, namely those implementing a Hidden Markov model (HMM) and those using a segmentation algorithm. In a nutshell, HMM-based approaches aim at predicting covert copy number (CN) states along a Markov chain, whereas segmentation algorithms split chromosomes into segments and try to sensibly assign a CN state to each segment. The interpretation of the derived CN states also differs between algorithms because a ‘state’ refers either to a nominal class or to a numerical genotype. Thus, a copy number class merely indicates the type of variation, i.e. whether there is a gain or loss of genetic material, whereas a copy number genotype specifies the number of copies present in a diploid genome.
All available HMM algorithms predict up to six different copy number genotypes, whilst all segmentation algorithms predict the copy number class as one of three different types.

Different approaches have been taken in the past to benchmark CNV detection software [17–21]. Using early Affymetrix 100K SNP data, Baross et al. (2007) [17] noted substantial false-positive prediction rates with the software tools CNAG (Copy Number Analyzer for GeneChip) [22], dChip (DNA-Chip Analyzer) [23] and GLAD [24]. The same authors also reported a high variability of these tools in terms of the number of CNVs predicted. Winchester et al. (2009) [18] assessed the accuracy of CNV prediction for five other software tools, using data from the more recent Affymetrix Genome-Wide Human SNP Array 6.0 and Illumina 1M-Duo BeadChip chips. They compared their SNP-based results to those of previously published sequencing studies [1,25,26], but only in single HapMap samples. In any case, the Winchester et al. study revealed that a large number of predicted CNVs could not be confirmed by any previous publication (up to 80%, depending upon the software used), and that predictions differed greatly both between software tools and between confirmation studies. In the same vein, Zhang et al. (2011) [19] applied Birdsuite [27], Partek (Partek Inc, St. Louis, MO), HelixTree (Golden Helix, Inc) and PennCNV [28] to three different data sets and observed a positive correlation between the number of markers included in a CNV and the ‘recovery rate’, defined by the authors as the proportion of previously published, validated CNVs that were also detected in their own study. Interestingly, the recovery rate was found to be negatively correlated with CNV population frequency. The same study also revealed a low consistency of the CNVs predicted in eight samples previously analyzed by Kidd et al. (2008) [25] and Conrad et al. (2010) [2]. More recently, Eckel-Passow et al. (2011) [20] reported substantial variability of the pairwise concordance of CNV predictions by PennCNV [28], Affymetrix Power Tools (APT) [29], Aroma.Affymetrix [30] and CRLMM (Corrected Robust Linear Model with Maximum Likelihood Distance) [31]. An in-depth assessment of PennCNV and CRLMM revealed a median concordance of 52% for deletions and of 48% for duplications. More deletions than duplications were predicted by both tools, and the empirical false-positive prediction rates were as high as 26% for CRLMM and 24% for PennCNV. Pinto et al. (2011) [21] analyzed six samples


on 11 different microarrays and predicted CNVs using as many different software tools, including PennCNV and QuantiSNP. The data generated by each microarray platform were analyzed with one to five of these tools. The experiments were performed in triplicate for each sample, and the authors observed an inter-software concordance of <50% and a reproducibility in replicate experiments of <70%.

None of the above studies used family data for CNV validation but instead relied upon experimental validation of a very limited set of CNVs, DNA sequencing information, or a concordant prediction made by different algorithms. Moreover, none of the studies paid any attention to population differences in CNV prediction, despite previous reports that such differences do exist [1,32–34]. A general conclusion has been that more than one software tool should be used synergistically to increase specificity, and that CNVs should be validated experimentally by more reliable methods such as qPCR. However, although many of the currently available software tools were included in at least one of the studies, no systematic comparison has yet been undertaken of the main characteristics of the CNVs predicted by a given algorithm, including their length, marker density and inter-marker distance.

We therefore assessed in detail the performance of six commonly used software tools for CNV detection in Affymetrix SNP array data. The tools of interest included the HMM-based algorithms APT [29], QuantiSNP [35] and PennCNV [28] in addition to the segmentation-based algorithms R-gada [36], GLAD [24] and VEGA [37]. The SNP genotyping of APT is based on the Birdseed algorithm of the well-known Birdsuite software package and can be seen as an extension of the Birdsuite approach. The Birdsuite software package was therefore not included in our comparison. We used publicly available SNP array signal intensity data from the International HapMap project [38–42] for CNV detection and a trio design for validation.
Our results may guide future choices of CNV software for particular applications and should also inform the interpretation of the results obtained.

Materials and Methods

Proband Data

We used signal intensity data of 60 trios (180 individuals) from the Affymetrix Sample Data Set, which were part of the HapMap Phase 1 and 2 public release 21a (released on 1st November 2007). Half of the trios were African (Yoruba in Ibadan, Nigeria; YRI) whereas the other half was of European ancestry (Utah Residents with Northern and Western European Ancestry; CEU). All samples had been genotyped with the Affymetrix Genome-Wide Human SNP Array 6.0, which contains probes for 906,600 SNPs and an additional 945,826 CNV probes [43]. NetAffx annotation files (release 31, UCSC hg19) were used to map the markers on the chip to the human genome. The average genotyping call rate in the complete public release 21a data set (270 samples) was 99.83% (technical documentation), and the concordance with HapMap genotypes (release 21a) was 99.84%. Since there are no general CNV-specific quality control measures, we used all samples and applied software-specific default quality control where available (see below). We used the 60 unrelated offspring samples for CNV detection and the 120 parental samples for subsequent validation. Our analysis was confined to autosomes.

CNV definition

In contrast to previous publications [1,2], we defined as a CNV any stretch of DNA that either has additional copies (duplication, gain) or is missing (deletion, loss) compared to a reference genome, without restricting the CNV predictions by their size. We used the NetAffx Annotation File (release 31), containing marker positions according to the UCSC genome assembly version hg19, to annotate the predicted CNVs. The two copy number (CN) classes of ‘gain’ and ‘loss’


were subdivided into CN states according to the number of chromosomes (1 or 2) affected by the respective gain or loss.

SNP array data

Depending upon the technology, SNP arrays contain multiple probes for each of the two alleles (A and B) of a SNP. All probes specific to one allele are collectively called a ‘channel’. The intensity of each of the two channel signals (denoted RA and RB) reflects the amount of genetic material hybridized. The signal ratio allows inference, not only of the SNP genotype, but also of the relative amount of genomic material present at the target locus. The Affymetrix Human SNP Array 6.0 contains six to eight probes per SNP, corresponding to three to four probes per channel. The array also contains a large number of probes for regions that may contain CNVs, but not SNPs, and these probes directly measure the total amount of genetic material present [44,45].

Inference of the copy number status at a given locus is made by comparing the sum of the observed channel signals, Robs = RA + RB, to its expectation Rexp. The definition of Rexp varies between CNV detection algorithms. However, all algorithms use the marker-specific Log R Ratio, LRR = log2(Robs) − log2(Rexp), as the basic input to infer a CN state, although some also rely upon the B allele fraction (BAF). If a CNV is present, the BAF is notably different from 0 (genotype AA), 0.5 (AB) and 1 (BB). The BAF is derived from a transformation of the sample-specific channel intensity ratios, calculated for the i-th sample as θi = (2/π) · arctan(RAi / RBi). For each marker, this yields three genotype-specific median θ values, taken over all samples, namely θAA, θAB and θBB. The BAF of an observed θ value is then calculated as

    BAF = 0                                      if θ < θAA
    BAF = 0.5 · (θ − θAA) / (θAB − θAA)          if θAA < θ < θAB
    BAF = 0.5 + 0.5 · (θ − θAB) / (θBB − θAB)    if θAB < θ < θBB
    BAF = 1                                      if θ > θBB
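The LRR and BAF transformations described above can be sketched in Python. This is an illustration only, not the implementation used by any of the tools; the function names and the genotype-cluster median values passed in are hypothetical.

```python
import math

def lrr(r_a, r_b, r_exp):
    """Log R Ratio: log2 of the observed total intensity over its expectation."""
    return math.log2(r_a + r_b) - math.log2(r_exp)

def theta(r_a, r_b):
    """Transformed channel-intensity ratio, theta = (2/pi) * arctan(RA/RB)."""
    return (2.0 / math.pi) * math.atan(r_a / r_b)

def baf(t, t_aa, t_ab, t_bb):
    """B allele fraction of an observed theta value, by linear interpolation
    between the marker's genotype-specific median thetas (t_aa, t_ab, t_bb)."""
    if t < t_aa:
        return 0.0
    if t < t_ab:
        return 0.5 * (t - t_aa) / (t_ab - t_aa)
    if t < t_bb:
        return 0.5 + 0.5 * (t - t_ab) / (t_bb - t_ab)
    return 1.0
```

For a heterozygous marker with equal channel intensities and Rexp matching Robs, LRR is 0 and theta falls at the AB cluster median, giving a BAF of 0.5.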

Software tools

We studied six commonly used software tools for CNV detection in Affymetrix SNP array data, namely APT [29], QuantiSNP [35] and PennCNV [28], implementing an HMM algorithm, and the segmentation-based tools R-gada [36], GLAD [24] and VEGA [37]. All programs were run with their default options unless stated otherwise, which also includes the default quality control measures of the respective software (see S1 File for the commands used). CNVs were defined separately for each sample, ignoring familial relationships.

APT. The Affymetrix Power Tools (APT) [29] equate Rexp to the median sum of the sample-specific channel signals, taken over all markers, or use a pre-computed reference [46] to obtain marker-specific LRR values. A Hidden Markov model is then fitted to the sequence of LRR values along the genome to assign hidden copy number states. We used the program apt-copynumber-workflow of the APT bundle (version 1.14.2) with default settings in single-sample mode and option --text-output set to true. The pre-computed Copy Number Analysis HapMap Reference File (Release 31) was used as a reference and the NetAffx Annotation File (Release 31) was used for alignment (publicly available at the Affymetrix website http://www.affymetrix.com/support/index.affx).

PennCNV. CNV analysis of Affymetrix data with PennCNV [28] follows the PennCNV-Affy protocol (http://www.openbioinformatics.org/penncnv/), according to which LRR and BAF are inferred from canonical genotype clusters [47] by means of linear interpolation. The genotype


clusters are generated from genotype calls that had been obtained with the APT software (see above). The sequences of LRR and BAF values are used in an HMM algorithm to infer the hidden copy number states. The PennCNV 2011Jun16 version was included in our study. First, apt-probeset-genotype and apt-probeset-summarize of APT (version 1.14.2) were used for genotype calling and allele-specific signal extraction, as laid down in the protocol. Second, canonical genotype clusters [47] were generated using generate_affy_geno_cluster.pl, which is part of the PennCNV-Affy tool. The clusters were then used to calculate LRR and BAF values via linear interpolation with normalize_affy_geno_cluster.pl. The sequence of LRR and BAF values was analyzed using detect_cnv.pl with default parameters.

QuantiSNP. QuantiSNP [35] relies on pre-computed LRR and BAF values (e.g. from PennCNV) that are subjected to its own HMM algorithm to infer hidden copy number states. QuantiSNP (version 2) was applied according to the instructions given on the QuantiSNP project webpage (https://sites.google.com/site/quantisnp/). Signal files created with PennCNV were used as input.

R-gada. The segmentation algorithm implemented in R-gada [36] uses LRR values pre-computed along the genome and tries to find discontinuities by sparse Bayesian learning. For each resulting segment, the average LRR of all markers falling into the segment is compared to the median LRR of the respective chromosome. Based upon the outcome, a CN class is assigned to the segment. R-gada (version 0.8-5) was run using LRR values calculated with the APT program apt-copynumber-workflow.

GLAD. The GLAD software [24] was developed for the analysis of aCGH data. However, since the program uses signal intensities for segmentation, it can also be applied to SNP array data. GLAD uses pre-computed LRR values and applies the Adaptive Weights Smoothing algorithm to find discontinuities along the genome. CN classes are then assigned depending upon the difference between the segment-specific median LRR and the median LRR closest to zero. GLAD (version 2.20.0) was run using R 2.15 and APT-derived LRR values (see above).

VEGA. The segmentation algorithm implemented in the VEGA software [37] is based upon the Mumford and Shah model [37] and uses pre-computed LRR values as well. After segmentation, CN states are assigned to the resulting segments depending upon whether the mean LRR is smaller or larger than zero. VEGA (version 1.7.0) was run using R 2.15 and APT-derived LRR values (see above).

Standardization of output

While PennCNV, QuantiSNP, R-gada and VEGA report a list of segments and their respective CN state, APT and GLAD output a list of markers and their CN states. To allow comparison between tools, we converted all output to lists of segments where not provided by the software itself. We also summarized CN genotypes into CN classes, with three possible states per segment (‘normal’, ‘gain’ or ‘loss’). Since some algorithms (e.g., PennCNV) do not support sex-chromosomal analyses by default, we considered only autosomal CNVs. All autosomal CNV predictions, including outliers <1 kb, were retained for the benchmark.
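The two standardization steps, collapsing CN genotypes into CN classes and converting per-marker output into segment lists, can be illustrated with a minimal Python sketch. The tuple layout and function names are our own, not part of any of the tools.

```python
from itertools import groupby

def genotype_to_class(cn):
    """Collapse a diploid copy number genotype into one of three CN classes."""
    if cn < 2:
        return "loss"
    if cn == 2:
        return "normal"
    return "gain"

def markers_to_segments(calls):
    """Convert a sorted list of per-marker calls (chrom, pos, cn_genotype)
    into segments (chrom, start, end, cn_class) by merging consecutive
    markers that share both chromosome and CN class."""
    labeled = [(chrom, genotype_to_class(cn), pos) for chrom, pos, cn in calls]
    segments = []
    for (chrom, cls), group in groupby(labeled, key=lambda t: (t[0], t[1])):
        positions = [pos for _, _, pos in group]
        segments.append((chrom, positions[0], positions[-1], cls))
    return segments
```

Filtering out the ‘normal’ segments from the result then leaves the list of predicted CNVs.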

CNV benchmarking

We evaluated the six software tools in terms of both the characteristics of the predicted CNVs (i.e. their number, length and type) and the validity of the predictions made. We also compared the marker density within those CNVs that were detected in the 60 unrelated children from the trios. The presence or absence of a CNV in the parents was used to validate each prediction in the offspring, since the overwhelming majority (up to 99%) of all CNVs in a genome is inherited [44,48] and will thus also be present in one of the parental genomes. While de-novo CNVs do


exist, they play only a minor role. More specifically, a CNV detected in an offspring was considered validated if one or more segments of the same CN class (i.e. ‘gain’ or ‘loss’) that covered >90% of the offspring CNV were found in at least one parent. We also applied other thresholds for the required overlap. All analyses were repeated separately in the African (YRI) and European (CEU) samples to recognize possible population differences in terms of CNV detection. To assess the likelihood of a CNV being validated by chance alone under our family approach, we randomly reassigned parents to offspring and repeated this procedure ten times. This analysis was carried out twice, once drawing parents from a joint pool of CEU and YRI trios and once considering CEU and YRI trios separately.

The possible influence of the underlying mathematical model on CNV prediction was assessed by comparing the medians of the outcome variables of interest for the three HMM-based programs (APT, PennCNV, QuantiSNP) to those for the three segmentation-based programs (GLAD, R-gada, VEGA). Again, to evaluate possible population differences, we repeated our analyses separately for the CEU and YRI samples.

Finally, we investigated the inter-software concordance in terms of CNV prediction by considering the proportion of CNVs predicted by one tool that were also found by another tool, using only validated CNV predictions in this approach. A CNV predicted by one tool was considered verified by another tool if >90% of the CNV was assigned the same CN state by the second tool. We also considered other thresholds for the necessary overlap. Additionally, we generated a sample-specific set of CNVs concordantly called by at least three algorithms and compared each tool to this call set.

External verification data based on technologies other than SNP genotyping were obtained from the Database of Genomic Variants (DGV, http://dgv.tcag.ca, build 37, release 2013-07-23). We used a sequencing-based set of variants containing the results of 12 studies [26,49–58] (“DGV sequencing”) and employed a 90% verification threshold. All statistical analyses were performed with R 2.15.2. Outcome differences between software tools were tested for statistical significance using a pairwise Wilcoxon signed-rank test.
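The intra-familial validation rule (>90% of the offspring CNV covered by same-class parental segments) can be sketched as follows. The interval representation and names are hypothetical; segments are assumed to lie on the same chromosome, with half-open coordinates.

```python
def validated(child, parental, threshold=0.9):
    """child: (start, end, cn_class); parental: list of segments with the
    same layout from one parent.  Returns True if the union of same-class
    parental segments covers more than `threshold` of the child CNV,
    mirroring the >90% overlap rule used for intra-familial validation."""
    c_start, c_end, c_class = child
    # clip same-class parental segments to the child interval
    pieces = sorted(
        (max(c_start, p_start), min(c_end, p_end))
        for p_start, p_end, p_class in parental
        if p_class == c_class and min(c_end, p_end) > max(c_start, p_start)
    )
    # merge overlapping pieces and sum the covered length
    covered, cur_start, cur_end = 0, None, None
    for s, e in pieces:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (c_end - c_start) > threshold
```

A child CNV then counts as validated if this check succeeds for at least one of the two parents.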

Results

CNV prediction

The spectra of CNVs predicted by the different programs varied widely, both in terms of their number and length and of the marker density within CNVs. In the offspring of the 60 European (CEU) and African (YRI) HapMap trios, the median CNV number per sample ranged from 75, predicted by PennCNV, to 211 for R-gada (Fig 1 and Table 1). Segmentation algorithms predicted significantly more CNVs than HMM algorithms (median: 182 vs. 98, Wilcoxon signed-rank test p = 1.6×10^-11) and showed a (non-significant) trend towards a higher inter-software variability in CNV number (median absolute deviation 42.3 vs. 19.3, p = 0.12 from 10,000 permutations of class labels). All software except PennCNV predicted fewer CNVs in Europeans (CEU) than in Africans (YRI, p
