Computers in Biology and Medicine 61 (2015) 48–55


Allele frequency calibration for SNP based genotyping of DNA pools: A regression based local–global error fusion method

Ashfaqur Rahman a,*, Andrew Hellicar a, Daniel Smith a, John M Henshall b

a Digital Productivity Flagship, CSIRO, Hobart, Tasmania, Australia
b Agriculture Flagship, CSIRO, Armidale, NSW, Australia

Article info

Article history: Received 29 July 2014; Accepted 17 March 2015

Keywords: Allele frequency calibration; DNA pooling; SNP genotyping; Microarray; Machine learning

Abstract

Background: The costs associated with developing high density microarray technologies are prohibitive for genotyping animals when a single animal has low economic value (e.g. prawns). DNA pooling addresses this issue by combining multiple DNA samples prior to genotyping. Instead of genotyping the DNA sample of each individual, a mixture of DNA samples from the individuals (i.e. the pool) is genotyped only once, which greatly reduces the cost of genotyping. Pooled samples, however, are subject to greater genotyping inaccuracies than individual samples, and wrong genotypes lead to wrong biological conclusions. It is therefore necessary to calibrate the resulting genotypes (allele frequencies).

Methods: We present a regression based approach to translate raw array output to allele frequency. During training, a few pools and the individuals that constitute the pools are genotyped. Given the genotypes of the individuals that constitute a pool, we compute the true allele frequency. We then train a regression algorithm to produce a mapping from the raw array outputs to the true allele frequency. We test the algorithm using pool samples withheld from the training set. During prediction, we use this map to genotype pools with no prior knowledge of the individuals constituting the pools.

Results and discussion: After data quality control, the available dataset comprises 912 pool samples. We estimate allele frequency using three approaches: the raw data, a commonly used piecewise linear transformation, and the proposed local–global learner fusion method. The resulting RMS errors for the three approaches are 0.135, 0.120, and 0.080 respectively.

Crown Copyright © 2015 Published by Elsevier Ltd. All rights reserved.

1. Introduction

Genotyping refers to the process of determining the differences in the genetic make-up (i.e. genotype) of the individuals of a species. SNPs (Single Nucleotide Polymorphisms) are one of the most common forms of genetic variation, and SNP genotyping refers to the measurement of genetic variation at SNPs. A SNP is a DNA sequence variation at a specific locus (a specific location on a gene) where a single nucleotide (A, T, C or G) in the genome differs between members of a species. In humans, these genetic variations are commonly utilized in DNA fingerprinting (used in forensic science to identify individuals by their DNA profile), to provide genetic linkage to a disease to assist in drug development, or to identify an individual's susceptibility to the disease.

Whilst there are a number of technology platforms available for SNP genotyping studies, the choice of platform is largely dependent upon the available budget, the number of

* Corresponding author. E-mail address: [email protected] (A. Rahman).

http://dx.doi.org/10.1016/j.compbiomed.2015.03.020
0010-4825/Crown Copyright © 2015 Published by Elsevier Ltd. All rights reserved.

samples being genotyped and the required coverage of the gene or genome for the study at hand. The real time PCR technology, TaqMan (PE Applied Biosystems), genotypes a single SNP per assay. Furthermore, the assay requires a highly specialized, labor intensive design of allele-specific oligonucleotide (ASO) probes for each SNP in order to produce optimal hybridization between alleles [14]. Hence, TaqMan is only attractive in studies with a very small number of polymorphic markers, given that cost and labor scale linearly with the number of SNPs. At the opposite end of the spectrum, the Illumina and Affymetrix multiplex platforms genotype a large number of SNPs per assay. The Illumina [5] and Affymetrix [8] platforms are comprised of an array of beads or chips, respectively, with a set of SNP specific oligonucleotide probes placed upon each element. Such platforms are suited to genome wide association studies, particularly for humans, given that a chip such as the Illumina Human Omni5 Beadarray can genotype over four million polymorphisms in the human genome. One of the main issues with high density array technologies, however, is the level of prior knowledge that is required to design an array for species where gene or genome coverage of polymorphic markers has yet to be determined. In this case, a low to medium density SNP platform with multiplexing


is a more cost-effective compromise. The Sequenom iPLEX platform [4] genotypes between tens and hundreds of SNPs per assay and is a low cost technology for custom, low density studies. The platform utilizes a reaction in which primers adjacent to SNP loci are extended with a terminator nucleotide (ddNTP) from a set of ddNTPs of different masses. The platform then uses MALDI-TOF (matrix assisted laser desorption and ionization-time of flight) mass spectrometry to detect alleles by exploiting differences in the mass of the reaction products. The Sequenom iPLEX platform is utilized in this paper to investigate calibration strategies for low cost genotyping technology.

In addition to the coverage aspects of the study design, the number of samples to genotype is of critical importance. In many SNP based association studies, sufficient samples need to be genotyped to obtain the statistical power necessary to achieve meaningful results. This is often an expensive and labor intensive exercise. DNA pooling is a practical attempt to improve the efficiency of SNP genotyping by combining multiple DNA samples prior to genotyping [13]. Each pool of N samples is genotyped as a single sample, reducing the cost of assays by up to a factor of N. For genotyping individuals, platforms commonly use algorithms to convert their continuous intensity measurement of a SNP into a discrete call of one of three possible pairs of alleles. Such calls are not informative for DNA pools, given that the intensity measurement represents one of $2N + 1$ possible allele calls. Consequently, when DNA pools are used, the output intensities must be acquired to compute a quantitative genotype for each SNP. This quantitative genotype is known as the allele frequency. Despite the improvement in cost and efficiency, a major shortcoming of pooled DNA samples is that allele frequency estimates are more sensitive to measurement error than the calls of individual samples [13,10,2].
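To make the $2N + 1$ figure concrete, the following minimal sketch enumerates the allele frequencies a pool can take. It assumes diploid individuals, so that a pool of N individuals carries 2N allele copies and the B allele count of the pool ranges from 0 to 2N. The function names are illustrative only and do not come from the paper.

```python
from fractions import Fraction
from typing import List, Sequence


def possible_pool_frequencies(n_individuals: int) -> List[Fraction]:
    """All B allele frequencies a pool of diploid individuals can take.

    Assumption: each individual contributes two allele copies, so the
    B allele count of the pool ranges over 0, 1, ..., 2N.
    """
    total_alleles = 2 * n_individuals
    return [Fraction(count, total_alleles) for count in range(total_alleles + 1)]


def pool_frequency(b_allele_counts: Sequence[int]) -> Fraction:
    """B allele frequency of a pool from per-individual B allele counts (0, 1 or 2)."""
    if not b_allele_counts:
        raise ValueError("a pool needs at least one individual")
    if any(count not in (0, 1, 2) for count in b_allele_counts):
        raise ValueError("each B allele count must be 0, 1 or 2")
    return Fraction(sum(b_allele_counts), 2 * len(b_allele_counts))


levels = possible_pool_frequencies(20)
print(len(levels))                   # number of distinct frequency levels for N = 20
print(float(levels[1] - levels[0]))  # spacing between neighbouring levels
print(pool_frequency([0, 1, 2, 2]))  # a pool of four individuals
```

Under this assumption, an individual (N = 1) has three possible levels, whereas a pool of 20 individuals has 41 levels spaced 0.025 apart, which is why a small shift in the measured intensities can move a pool to a different level.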
The sources of error associated with DNA pooling include pool construction, biochemical reactions and allele frequency estimation [1]. Pool construction error is associated with the process of trying to obtain equal concentrations of DNA from the samples and then mixing these in equal volumes to form the pool. In the biochemical reaction associated with PCR amplification of target DNA, one allele might be amplified more efficiently than the other. This differential amplification is a major source of error, as it causes the signal that represents the more efficiently amplified allele to be higher than its expected value, thereby inflating its estimate in the pooled DNA sample. Finally, there is an analytical error associated with attempting to model the effects of differential amplification in allele frequency estimation. Whilst individual samples are subject to the same measurement errors as pooled samples, this noise is not significant enough to change the allele call of individual samples. The


continuity of allele frequency estimates, however, makes the pooled samples far more susceptible to being distorted by noise.

A number of works in the current literature are aimed at solving this problem. In [15] the coefficient of preferential amplification/hybridization (CPA) is used to quantify the degree of bias. The ratio of average peak intensities between two alleles was used as the bias factor. It was found that bias introduced through preferential hybridization was adequately modeled using lognormal distributions. This results in reduced error of allele frequency estimation for the human genome. A general linear model that accounts for the nested structure of the data was used in [9] for SNP genotyping. The method in [9] does not require the CPA to be known, and thus avoids the need for individual SNP genotyping to determine the allelic ratio of hybridization. This allows scaling up to arrays with many thousands of SNPs. Piecewise linear interpolation of pooled alleles was used in [11] to correct for the bias in pooled DNA data of the human genome.

In this paper, a novel calibration method is proposed to reduce the measurement error associated with a low cost SNP genotyping approach. More precisely, local and global errors are treated differently by the proposed method. Local errors relate to issues that are specific to each SNP, for instance differential amplification of alleles during PCR and signal to noise ratio issues associated with allele detection using mass spectrometry. Global errors are common to all of the SNPs in the assay. Biochemical artifacts are independent of any specific SNP and are one kind of global error [11]. We designed two separate regression algorithms to calibrate allele frequency estimates: (i) one for a specific SNP to deal with local errors, and (ii) one across all the remaining SNPs to deal with global errors.
The estimates from this stage are combined into a single allele frequency estimate by a fusion predictor. Experimental results demonstrate the effectiveness of the proposed method in producing accurate estimates of the allele frequencies. Such solutions will be useful for new applications or studies where little prior genomic knowledge is available and where the study budget is limited.

2. Proposed calibration method

Given a DNA sample, the genotyping process (e.g. on the iPLEX platform) generates an (x, y) pair (Fig. 1) for each SNP, where x and y are the average peak intensities of the two alleles. The (x, y) pair for an individual (Fig. 1(a)) can fall into any of three clusters. The (x, y) pair for a pooled DNA sample (Fig. 1(b)) can fall into any of $2N + 1$ clusters if the pool is constructed from N individuals. Because of the PCR process and different errors, a change in (x, y) values is

Fig. 1. Genotyping of (a) individual and (b) pooled DNA samples.


likely to call a wrong allele frequency (a different cluster) with high probability for a pool (because of the small gap between clusters), and the biological conclusions that follow are likely to be wrong. We thus propose a method to correct the allele frequencies. Given the alleles of the individuals that constitute the pool, we compute the true allele frequency and train a regression algorithm to produce a mapping from the raw array outputs to the true allele frequency. We then use this map to genotype pools with no prior knowledge of the individuals constituting the pools. We detail the proposed method next.

Given a pair of intensities $(x, y)$ produced by the genotyping machine for a pool, the objective of the regression based calibration process is to produce an estimate $\hat{f}$ of the allele frequency. Given a pooled sample $k$ (where $1 \le k \le N_P$ and $N_P$ represents the number of pools), the genotyping machine produces an x- and y-intensity pair for each SNP. Let $(x_{ik}, y_{ik})$ represent the x- and y-intensity pair of pool $k$ for SNP $i$, where $1 \le i \le N_S$ and $N_S$ represents the number of SNPs. In order to produce the training set for a regression algorithm, the true allele frequency of the pools is also required. Let pool $k$ be composed of $N_I(k)$ individuals and let the B allele count of individual $j$ for SNP $i$ in pool $k$ be $b_{i,j}(k)$, where $1 \le j \le N_I(k)$ and $b_{i,j}(k) \in \{0, 1, 2\}$. The true allele frequency $f_{ik}$ of pool $k$ for SNP $i$ can be computed as:

$$f_{ik} = \frac{\sum_{j=1}^{N_I(k)} b_{i,j}(k)}{N_I(k)} \qquad (1)$$

A regression algorithm can be provided with a set of vectors $(x_{ik}, y_{ik}, f_{ik})$, where the x- and y-intensity pair acts as input and the true allele frequency $f_{ik}$ acts as target. Given a set of vectors $(x, y, f)$, a supervised learning algorithm $\Gamma$ learns a map

$$\Gamma(x, y) = f \qquad (2)$$

Fig. 2. [Diagram: the pools of every SNP in the SNP population (SNP #1 to SNP #n) are used to train a global calibration model and a local calibration model; at prediction time the x intensity and y intensity are passed to both models, and the global prediction and local prediction are combined by a fusion predictor into the allele frequency estimate.] Calibration of allele frequency estimates based on (a) global model: to compensate for the errors across all the SNPs, and (b) local model: to compensate for errors related to the PCR process across all the pools within the SNP. In this figure, the red and green color indicates the flow of information from the global learner and local learner respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

i.e. given a pair of intensities $(x, y)$, it produces an estimate $\hat{f}$ of the allele frequency.

When DNA samples are pooled together and go through the hybridization process, errors can be introduced from different sources. These errors can be grouped into two categories: (a) local errors, which affect a specific SNP (e.g. differential amplification), and (b) global errors, which are not tied to a specific SNP (e.g. biochemical artifacts). The calibration must therefore consider both of these error sources. We thus propose a two-step calibration process to estimate the allele frequency (Fig. 2). A local learner $\Gamma_L$ is trained on the intensity patterns of all the pools for a SNP $i$, and a global learner $\Gamma_G$ is trained on the intensity patterns of all the pools for all the remaining SNPs $j$, where $j \ne i$. As mentioned previously, $\Gamma_L$ and $\Gamma_G$ compensate for the local and global errors respectively. Given a pair of intensities $(x, y)$, $\Gamma_L$ and $\Gamma_G$ produce estimates $f_L$ and $f_G$ of the allele frequency respectively. A new training set is generated in which the sample vectors are $(f_L, f_G, f)$. A fusion predictor is trained on this new training set to fuse the local and global predictions into the final allele frequency estimate $\hat{f}$.
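The two-step process described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: a small pure-Python least-squares fit stands in for the regression algorithms, the data layout (a dictionary from SNP name to a list of (x, y, f) triples) and all function names are hypothetical, and training the fusion predictor on the pools of the target SNP is an assumption, since the text does not say which samples form the $(f_L, f_G, f)$ training set.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[float, float, float]  # (x intensity, y intensity, true allele frequency)


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small dense system a w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w


def fit_linear(inputs: Sequence[Sequence[float]], targets: Sequence[float],
               ridge: float = 1e-8) -> List[float]:
    """Least-squares fit of target ~ w0 + w . input, with a small ridge term."""
    rows = [[1.0, *row] for row in inputs]
    d = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(d)]
    return solve(gram, rhs)


def predict(w: Sequence[float], row: Sequence[float]) -> float:
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], row))


def fit_local_global(data: Dict[str, List[Sample]], snp: str):
    """Train the local learner, the global learner and the fusion predictor for one SNP."""
    local = data[snp]
    rest = [s for name, samples in data.items() if name != snp for s in samples]
    w_local = fit_linear([s[:2] for s in local], [s[2] for s in local])
    w_global = fit_linear([s[:2] for s in rest], [s[2] for s in rest])
    # Assumption: the fusion training set (f_L, f_G, f) is built from the target SNP's pools.
    stacked = [(predict(w_local, s[:2]), predict(w_global, s[:2])) for s in local]
    w_fusion = fit_linear(stacked, [s[2] for s in local])
    return w_local, w_global, w_fusion


def estimate(models, x: float, y: float) -> float:
    """Fuse the local and global predictions into one allele frequency estimate."""
    w_local, w_global, w_fusion = models
    return predict(w_fusion, (predict(w_local, (x, y)), predict(w_global, (x, y))))


# Synthetic, noise-free data: each SNP has its own linear intensity-to-frequency map.
points = [(0.1, 0.9), (0.3, 0.6), (0.5, 0.5), (0.7, 0.2), (0.9, 0.3), (0.4, 0.8)]
maps = {"snp1": (0.1, 0.6, -0.1), "snp2": (0.3, 0.4, -0.2), "snp3": (0.3, 0.5, -0.3)}
data = {name: [(x, y, a + b * x + c * y) for x, y in points]
        for name, (a, b, c) in maps.items()}
models = fit_local_global(data, "snp1")
print(round(estimate(models, 0.5, 0.5), 3))
```

In this toy data the local map of `snp1` is exactly linear, so the fusion predictor learns to rely on the local prediction; with real intensities the fusion weights would reflect how much each learner contributes.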

3. Experimental setup

Two genotyping experiments were conducted, each using the same Sequenom iPLEX panel, which generated results for 61 SNPs. In the first experiment, 1041 individuals and 22 pool samples were genotyped. The pool sizes varied between 18 and 26 individuals, all drawn from the 1041 individuals. A second experiment was conducted to allow a quality control preprocessing step on the data set of the first experiment. The need for quality control is a consequence of errors in the genotyping process for particular SNPs. The Sequenom iPLEX panel generates pairs of raw

measurements (i.e. x and y values) from the primer extension products of each SNP locus using MALDI-TOF spectrometry. The x and y values generated by the MALDI-TOF spectrometer are the peak intensities in the mass spectrum that correspond to the SNP alleles. Two types of error were present in the results: low signal levels and erroneous x and y values. The effect of these errors needs to be minimized before any assessment of calibration approaches can be conducted. The errors vary across the SNPs being assessed and from sample to sample. Therefore the simplest approach is to remove the most erroneous SNPs and samples from the test set. Low signal error could be removed by requiring the amplitude of the x and y signals to reach a minimum threshold, as in [18]. However, we note that both low signal levels and erroneous x and y values manifest themselves as either no call or an incorrect call by the Sequenom system. Therefore call accuracy is used to identify bad samples (samples that are not correctly called at many SNPs) and bad SNPs (SNPs that fail to call many samples). Call accuracy is quantified by comparing the call results from the first (large) experiment with those from the second, more stringent experiment. A subset of 78 individuals from the first experiment were genotyped under a strict experimental regime to ensure that minimal errors were committed during the genotyping process. The allele calls for these individuals were compared with the calls in the main data set. Fig. 3 shows the performance of each SNP in terms of the fraction of correct calls (calls identical to the call in the stringent experiment) and the fraction of calls that were missing in the main experiment but present in the stringent experiment (“no call to call”). Trend lines for the best performing SNPs are plotted. The fraction of “no call to call” results joins the trend line after the worst 13 SNPs.
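The per-SNP comparison against the stringent experiment can be sketched as below. The data layout (nested dictionaries keyed by sample and SNP, with `None` for a missing call), the function names, and the choice of denominator are assumptions made for illustration; the paper does not specify how the fractions in Fig. 3 were normalized.

```python
from typing import Dict, List, Optional, Tuple

# sample name -> SNP name -> genotype call; None means "no call"
Calls = Dict[str, Dict[str, Optional[str]]]


def snp_call_quality(main: Calls, strict: Calls) -> Dict[str, Tuple[float, float]]:
    """Per SNP: (fraction correct, fraction 'no call to call').

    Assumption: both fractions are taken over the samples that received a
    call for that SNP in the stringent experiment.
    """
    quality = {}
    snps = {snp for calls in strict.values() for snp in calls}
    for snp in sorted(snps):
        called = correct = missed = 0
        for sample, strict_calls in strict.items():
            reference = strict_calls.get(snp)
            if reference is None:
                continue  # no stringent call to compare against
            called += 1
            observed = main.get(sample, {}).get(snp)
            if observed is None:
                missed += 1  # called in the stringent run, not in the main run
            elif observed == reference:
                correct += 1
        if called:
            quality[snp] = (correct / called, missed / called)
    return quality


def worst_snps(quality: Dict[str, Tuple[float, float]], n_remove: int) -> List[str]:
    """Names of the n_remove SNPs with the lowest fraction of correct calls."""
    ranked = sorted(quality, key=lambda snp: quality[snp][0])
    return ranked[:n_remove]


strict = {"ind1": {"snpA": "AB", "snpB": "BB"}, "ind2": {"snpA": "AA", "snpB": "AB"}}
main = {"ind1": {"snpA": "AB", "snpB": None}, "ind2": {"snpA": "AA", "snpB": "AA"}}
quality = snp_call_quality(main, strict)
print(quality)
print(worst_snps(quality, 1))
```

Ordering the SNPs by the first value of each pair gives the horizontal axis used in Fig. 3.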
Therefore the worst 13 performing SNPs were identified as bad and removed from the first experiment data set leaving 48 SNPs. Note that we have used the data from the first experiment only in

Table 1 Configuration of the different regression algorithms.

| Regression algorithm | Parameter setting |
| --- | --- |
| Support Vector Regression (SVR) | Implementation: libSVM [3]; Algorithm: epsilon SVR |
| Multi Layer Perceptron (MLP) | Hidden Layers: 1; Learning Rate: 0.3; Momentum: 0.2; Epochs: 500 |
| Linear Regression (LR) | Attribute Selection Method: M5 method [6]; Ridge: $10^{-8}$ |

Fig. 3. [Plot: fraction of calls (vertical axis) against SNP index ordered in increasing number of correct calls (horizontal axis); series: correct, no call to call.] Quality control results. The fraction of calls for each SNP that were either identical to the quality control experiment (correct), or were called in the quality control experiment but not in the experiment used for analysis (no call to call). On the horizontal axis SNPs are ordered in increasing number of correct calls. Trend lines are superimposed for the region of better performing SNPs.

Fig. 4. [Plot: fraction of individuals with more than the given fraction of SNPs called (vertical axis) against fraction of SNPs called (horizontal axis).] Fraction of individual samples which achieve the fraction of SNPs called. Note that approximately 10% of individuals do not contain any calls (fraction of 0 SNPs called) and only 1% have calls for all SNPs. Linear trend lines for the worst performing and best performing individuals are included.

Fig. 5. Color code used to explain the results in Figs. 6, 7, and 9. Each color represents a bin centered around the allele frequency mentioned on the left. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

the regression algorithms. The data from the second experiment are used for quality control only. The performance of the individual samples is highlighted by plotting the fraction of individuals that achieved a given fraction of SNP calls (Fig. 4). Linear trend lines for the best and worst regions show a transition between the two regions at 70–90% of SNPs called. A threshold of 0.8 was selected for the required fraction of SNPs called, and individuals not achieving this call rate were culled, leaving 850 individuals. The same approach was applied to pools, resulting in 19 valid pool samples. Finally, a baseline x, y RMS error (0.065) was estimated from duplicate samples included in the first experiment, by comparing their x and y values over the remaining SNPs.

We used a regression algorithm as the local predictor and as the global predictor in Fig. 2. The outcomes of the local and global predictors are fused into a single prediction using another regression method. We have

implemented a number of regression algorithms, including Linear Regression, SVM, and MLP, to evaluate the effectiveness of the proposed allele frequency estimation method. The algorithms and their parameter settings are presented in Table 1. The same regression algorithm was used for the local, global, and fusion predictors. The WEKA and libSVM functions were called from MATLAB for implementation purposes.

We used a leave-one-out approach to test the results. Leave-one-out is a form of cross validation. In cross validation, a segment of the data is left out for testing and the remaining data are used for training, so that the training and test data are completely separate. To reduce bias, n-fold cross validation is performed: the data are divided into n folds, each fold becomes the test set in turn, and the remaining folds are concatenated to form the training set. The average accuracy over all n folds is reported. In leave-one-out, the number of folds equals the number of samples.
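The leave-one-out procedure can be sketched generically as below. The function and type names are illustrative, the `fit` and `predict` callables are placeholders for any regression algorithm, and pooling the squared errors of all held-out samples into one RMSE is an assumption about how the per-fold results are aggregated.

```python
import math
from typing import Any, Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (inputs, target)


def leave_one_out_rmse(samples: List[Sample],
                       fit: Callable[[List[Sample]], Any],
                       predict: Callable[[Any, Sequence[float]], float]) -> float:
    """Hold out each sample in turn, train on the rest, and pool the squared errors."""
    if len(samples) < 2:
        raise ValueError("leave-one-out needs at least two samples")
    squared_errors = []
    for held_out in range(len(samples)):
        train = samples[:held_out] + samples[held_out + 1:]
        model = fit(train)
        inputs, target = samples[held_out]
        squared_errors.append((predict(model, inputs) - target) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Placeholder learner: always predicts the mean target of its training set.
def fit_mean(train: List[Sample]) -> float:
    return sum(target for _, target in train) / len(train)


def predict_mean(model: float, inputs: Sequence[float]) -> float:
    return model


toy = [((0.1, 0.9), 1.0), ((0.5, 0.5), 2.0), ((0.9, 0.1), 3.0)]
print(leave_one_out_rmse(toy, fit_mean, predict_mean))
```

Because every sample is predicted by a model that never saw it, the resulting RMSE is an out-of-sample error estimate; the cost is one training run per sample.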

4. Results and discussion

In this section we present results on how accurately regression algorithms can predict allele frequencies. In order to explain


the results we have presented the distribution of intensity patterns and allele frequencies for different SNPs in Figs. 6, 7 and 9. The color code used in these figures is presented in Fig. 5. Each color in these figures represents a bin centered on a specific allele frequency. Fig. 6 presents the distribution of intensity patterns from the Sequenom iPLEX panel and the allele frequency of the pools used in this study for particular SNPs. Note that the distribution of allele frequencies varies between SNPs. The allele frequency variance of some SNPs (e.g. 0.6 for SNP # 11 in Fig. 6(b)) is greater than that of some other SNPs (0.3 for SNP # 1 in Fig. 6(a)). The x- and y-intensity patterns of the SNPs are also different. Pools having identical allele frequencies are aligned at different angles in the x–y plane for different SNPs. For example, pools in allele frequency bin 0.8 in SNP # 1 are aligned at an angle close to zero, whereas identical pools (i.e. allele frequency bin 0.8) are aligned at a higher angle for SNP # 11 and SNP # 20. This happens due to the influence of local errors. It is therefore necessary to train a regression algorithm locally to transform the x- and y-intensity patterns to allele frequency. Training regression algorithms locally, however, does not compensate for the global

errors. Additional measures are required to deal with global errors. In order to calibrate for the global error associated with assays, additional (x-intensity, y-intensity, allele frequency) records are required to increase coverage across the allele frequency space. The raw intensities of individual samples can be combined with those of pools for this purpose. Fig. 7 presents the distribution of intensity patterns and allele frequency of SNP # 1 and SNP # 11 when data from the pools and the individuals are mixed together. A number of observations can be made. There are gaps in the allele frequency space. The allele frequencies of individual samples represent the three genotype calls situated at 0, 0.5, and 1. Thus, other than the regions covered by the pools and the individuals in the SNP, the remaining space is empty. Moreover, the clusters belonging to the individuals dominate those of the pools in terms of number of samples. Similar observations can be made about the distribution of the x- and y-intensity patterns. We conducted experiments to evaluate the correctness of these observations. We trained regression algorithms on (a) pools only, and (b) a mixture of individuals and pools. Fig. 8 presents the regression errors for these two alternative scenarios. On average,

Fig. 6. Distribution of intensity patterns from Sequenom chip (the top plot) and allele frequency (the bottom plot) for different SNPs. The genotyping results obtained from pooled DNA samples are presented in this figure. Each allele frequency bin (bar in the bottom graphs and points in the top graphs) is represented by a separate color. The color codes used for the separate bins are presented in Fig. 5. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Fig. 7. Distribution of intensity patterns from Sequenom chip (the top plot) and allele frequency (the bottom plot) for SNP # 1 and SNP # 11. The genotyping results obtained from individual and pooled DNA samples are presented in this figure. Each allele frequency bin (bar in the bottom graphs and points in the top graphs) is represented by a separate color. The color codes used for the separate bins are presented in Fig. 5. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 8. Regression errors when genotyping records are used from (a) pool only and (b) pools and individuals.

the use of pool data alone produces a regression error that is 22.37% lower than when data from both the pools and individuals are used. The addition of individual records does not compensate for the global error and fails to improve the regression error. Fig. 9 presents the distribution of intensity patterns and the allele frequency spectrum obtained by mixing the pool genotypes across all of the SNPs. The allele frequency spectrum now covers the whole allele range, with fewer gaps in the x- and y-intensity patterns compared to Fig. 7. We computed the regression errors using the fusion of local and global calibration models (Fig. 2). The local–global model performs 5.56% better than the local model alone (i.e. the model trained only on the pools specific to a SNP). We have evaluated the effectiveness of the proposed fusion calibration model in Fig. 2 using a number of different regression methods. The best performance was achieved using Linear Regression (Table 2). We used a leave-one-out approach to test the results. 48 SNPs from the 19 pools gave a data set of 912 samples. To evaluate the effectiveness of the fusion predictor, we used each

sample in turn for testing and used the remaining samples for training. In this way we conducted 912 different tests. Linear Regression performs better than MLP and SVM. We conducted a one tailed sign test [16,17] to determine whether these improvements are significant. Let the two algorithms being compared be X and Y. The null hypothesis is that the two algorithms X and Y are equivalent. The test statistic is the number of test cases in which LR performs better than the alternative algorithm (MLP/SVM). Under the null hypothesis, the test statistic is approximately normally distributed with mean $N/2$ and standard deviation $\sqrt{N}/2$, where $N = 912$ is the number of test cases. As stated in [16], if “the number of wins for an algorithm is at least $N/2 + 1.96\sqrt{N}/2$, the algorithm is significantly better with $p < 0.05$.” For $N = 912$ this threshold is $456 + 1.96 \times 30.2/2 \approx 485.6$. In our case the test statistic is 563 and 557 when comparing LR to MLP and SVM respectively (Table 3). Therefore LR is significantly better than MLP and SVM with $p < 0.05$. The performance of the proposed method is compared with some of the existing results. The RMSE obtained using the proposed local–global learner fusion method is better than any of

the previously published results (Table 4). When no calibration is done, the errors remain as they are and the allele frequency estimation performance is poor. The piecewise linear transformation does not take into account the complex distribution of intensity patterns. We also conducted experiments by mixing individual data with pooled data to cover the allele frequency distribution. As mentioned previously, gaps exist in these data sets and they cannot cover the whole spectrum. As can be observed from Table 4, the RMSE with pool data only is 0.072, whereas the RMSE with the mixture of pool and individual data is 0.080.

Table 2 Performance of different regression algorithms with the fusion model in Fig. 2.

| Regression method | RMSE |
| --- | --- |
| LR | 0.072 |
| MLP | 0.089 |
| SVR | 0.087 |

Table 3 Summary of the performance between LR, MLP, and SVM as fusion classifier.

| Comparison | # wins for LR | # losses for LR | # total |
| --- | --- | --- | --- |
| LR vs. MLP | 563 | 349 | 912 |
| LR vs. SVM | 557 | 355 | 912 |

Table 4 Performance of the fusion model in Fig. 2 w.r.t. other methods in the literature.

| Regression method | RMSE |
| --- | --- |
| Local–global fusion method (Pool only) | 0.072 |
| Hierarchical learning (Pools and individuals) | 0.080 |
| No calibration | 0.135 |
| Piecewise linear transformation | 0.120 |

Fig. 9. Distribution of intensity patterns from Sequenom chip (the top plot) and allele frequency (the bottom plot) when data from all the pools are used. Each allele frequency bin (bar in the bottom graphs and points in the top graphs) is represented by a separate color. The color codes used for the separate bins are presented in Fig. 5. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

5. Conclusions

In this paper we have presented a regression based approach to calibrate the allele frequency estimates for SNP genotyping of pooled DNA samples. Errors in the genotyping process affect the estimates for pools more than those for individuals. We grouped the errors into two categories: local and global. Local errors are specific to individual SNPs, and global errors affect all the SNPs. We have designed two separate regression based allele frequency estimators to deal with these two different sources of error. The estimates generated by these two predictors are fused into a single estimate by a fusion predictor at stage two. The proposed calibration method achieves better allele frequency estimation accuracy than the well known estimation methods. In future work we aim to design a parametric approach in which each source of error can be expressed as a parameter. This will assist in understanding the influence of the different sources of error on the estimated allele frequency.

Conflict of interest statement

I worked in Monash University and Central Queensland University before. I am now working for CSIRO. I am also working in collaboration with some professors in University of New South Wales and Deakin University. Reviewers from these institutions should be avoided.

Acknowledgments

The authors would like to acknowledge Gold Coast Marine Aquaculture for their contribution towards the development of the Black Tiger Prawn SNP assay used in this study, and for the tissue samples used in evaluating the methods. We are grateful to Leanne Dierens and Melony Sellars who undertook sample collection and DNA extractions.

References

[1] B.J. Barratt, F. Payne, H.E. Rance, S. Nutland, J. Todd, D.G. Clayton, Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design, Ann. Hum. Genet. 73 (2009) 118–124.
[2] A. Earp, M. Rahmani, K. Chew, A. Brooks-Wilson, Estimates of array and pool-construction variance for planning efficient DNA-pooling genome wide association studies, BMC Med. Genomics 4 (2011) 81.
[3] R.-E. Fan, P.-H. Chen, C.-J. Lin, Working set selection using second order information for training SVM, J. Mach. Learn. Res. 6 (2005) 1889–1918.
[4] S. Gabriel, L. Ziuagra, D. Tabbaa, SNP genotyping using the Sequenom MassARRAY iPLEX platform, Curr. Protoc. Hum. Genet. 60 (2) (2009) 2.12.1–2.12.16, http://dx.doi.org/10.1002/0471142905.hg0212s60.
[5] K. Gunderson, F. Steemers, G. Lee, L. Mendoza, M. Chee, A genome wide scalable SNP genotyping assay using microarray technology, Nat. Genet. 37 (2005) 549–554.
[6] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. Witten, The WEKA data mining software: an update, SIGKDD Explor. 11 (1) (2009) 2009.
[8] G. Kennedy, H. Matsuzaki, S. Dong, W. Liu, J. Huang, G. Liu, X. Su, M. Chu, Large-scale genotyping of complex DNA, Nat. Biotechnol. 21 (2003) 233–237.
[9] S. Macgregor, P. Visscher, G. Montgomery, Analysis of pooled DNA samples on high density arrays without prior knowledge of differential hybridization rates, Nucleic Acids Res. 34 (2006) 7.
[10] N. Norton, N. Williams, H. Williams, G. Spurlock, G. Kirov, D. Morris, B. Hoogendorn, M. Owen, M. O'Donovan, Universal, robust, highly quantitative allele frequency measurement in DNA pools, Hum. Genet. 110 (2002) 471–478.
[11] D. Peiffer, J. Le, F. Steemers, W. Chang, T. Jenniges, F. Garcia, K. Haden, J. Li, C. Shaw, J. Belmont, S. Cheung, R. Shen, D. Barker, K. Gunderson, High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping, Genome Res. (2006) 1136–1148.
[13] P. Sham, J. Bader, I. Craig, M. O'Donovan, M. Owen, DNA pooling: a tool for large-scale association studies, Nat. Rev. Genet. 3 (2002) 862–871.
[14] A.-C. Syvanen, Assessing genetic variation: genotyping single nucleotide polymorphisms, Nat. Genet. 2 (2001) 930–942.
[15] H. Yang, Y. Liang, M. Huang, L. Li, C. Lin, J. Wu, Y. Chen, C. Fann, A genome-wide study of preferential amplification/hybridization in microarray-based pooled DNA experiments, Nucleic Acids Res. 34 (2006) 15.
[16] J. Demsar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[17] D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, Chapman & Hall/CRC, Florida, USA, 2002.
[18] Sequenom, Model based clustering of genotyping samples using a mixture of Gaussians approach, Sequenom Technical Note, 2007.
