Genomic relationships computed from either next-generation sequence or array SNP data.

J. Anim. Breed. Genet. ISSN 0931-2668

ORIGINAL ARTICLE

Genomic relationships computed from either next-generation sequence or array SNP data rez-Enciso1,2,3,4 M. Pe 1 2 3 4

Centre for Research in Agricultural Genomics (CRAG), Bellaterra, Spain noma de Barcelona, Bellaterra, Spain Veterinary School, Universitat Auto Institut Catala de Recerca i Estudis Avancats (ICREA), Barcelona, Spain Animal Breeding and Genomics Group, Wageningen University, Wageningen, The Netherlands

Summary

Keywords Coalescence; genomic selection; molecular relationship matrix; next-generation sequencing; SNP ascertainment. Correspondence rez-Enciso, Centre for Research in M. Pe Agricultural Genomics (CRAG), 08193 Bellaterra, Spain. Tel: +3493 563 66 00; Fax: +3493 563 66 01; E-mail: [email protected] Received: 2 October 2013; accepted: 2 December 2013

The use of sequence data in genomic prediction models is a topic of high interest, given the decreasing prices of current ‘next’-generation sequencing technologies (NGS) and the theoretical possibility of directly interrogating the genomes for all causal mutations. Here, we compare by simulation how well genetic relationships (G) could be estimated using either NGS or ascertained SNP arrays. DNA sequences were simulated using the coalescence according to two scenarios: a ‘cattle’ scenario that consisted of a bottleneck followed by a split in two breeds without migration, and a ‘pig’ model where Chinese introgression into international pig breeds was simulated. We found that introgression results in a large amount of variability across the genome and between individuals, both in differentiation and in diversity. In general, NGS data allowed the most accurate estimates of G, provided enough sequencing depth was available, because shallow NGS (49) may result in highly distorted estimates of G elements, especially if not standardized by allele frequency. However, high-density genotyping can also result in accurate estimates of G. Given that genotyping is much less noisy than NGS data, it is suggested that specific high-density arrays (~3M SNPs) that minimize the effects of ascertainment could be developed in the population of interest by sequencing the most influential animals and rely on those arrays for implementing genomic selection.

Introduction There is an increasing interest in implementing sequence data into genomic selection breeding programmes (Druet et al. 2014; Hickey 2013), a trend that is being fostered by dwindling sequencing technology (NGS) prices and by the popularization in the use of molecular information for genetic improvement. Having access to complete or almost complete sequence from whole populations is certainly very attractive for population genetics analyses because of the possibility of directly interrogating genomes for causal mutations, instead of relying on indirect signals © 2014 Blackwell Verlag GmbH

• J. Anim. Breed. Genet. 131 (2014) 85–96

based on linkage disequilibrium. For the estimates on the gains in genetic progress with NGS versus highdensity array genotyping, the evidence is, however, still limited. Some of the previous studies on this have employed a somewhat simplistic approach, either in generating polymorphism data or in the analyses of NGS data (Meuwissen & Goddard 2010). Recently, more realistic simulations by Druet et al. (2014) have suggested that sequence data will be particularly useful if causal mutations are at low frequency. How NGS data are generated and analysed is relevant because NGS is a highly error-prone technology, especially in detecting rare variants. For instance, the number and doi:10.1111/jbg.12074

rez-Enciso et al. M. Pe

NGS versus SNP-based genetic matrices

reliability of SNPs detected with NGS critically depend on parameters like average depth and the error pattern in sequencing (Nielsen et al. 2011). As we shall show here, additionally, the history of a population is also a determining factor in the behaviour of the different methodologies. In principle, barring for non-SNP variants like copy number variants (CNVs), the same algorithms and software can be used for genomic selection using either array-based SNPs or sequence-based SNPs. Extant software should nevertheless be optimized to deal with the increase in markers resulting from sequence data. Although variants exist to compute the molecular relationship matrix (G), such as those used in G-BLUP (VanRaden 2008), all are based in computing the allele difference counts between two individuals, summed over SNPs and divided by either the number of SNPs or by a function of allele frequency (p), typically the variance, 2p(1p) (Toro et al. 2011). The denominator is important though. With array-SNP data, the only information available is the number of SNPs and their frequencies whereas, with sequence data, the total length of DNA sequenced is also known. Two important concepts are needed to fully understand the differences between genotypes provided by either arrayed SNPs or sequence. The first one is SNP ascertainment bias and the second one, the Most Recent Common Ancestor (MRCA) between two sequences, a concept in turn that is closely related to nucleotide diversity h = 4 Ne l where Ne is effective population size and l, the mutation rate. Ascertainment bias is the criterion by which SNPs that are present in an array or genotyping protocol are chosen. This process often results in markers with high minimum allele frequency, that is, markers that are segregating in a large number of individuals and populations and are therefore highly ‘informative’. Standard population genetics theory (Nielsen & Slatkin 2013) tells us that rare variants (with very low minimum allele frequency, MAF) are the most frequent ones, so the effect of discarding these low informative but very common class of markers is potentially important. So far, all genomic selection programmes are based on arrays made up of ascertained SNPs. Despite this, the effect of ascertainment bias on the computation of the genomic relationship matrix has not been studied, to the best of our knowledge. A pair of DNA sequences, say the two haplotypes in a single individual, coalesce in their MRCA if they do have not undergone recombination. If recombination has occurred, as will be normally the case, each 86

stretch of DNA for the same two haplotypes will have a different MRCA (Nordborg & Tavare 2002). The number of nucleotide differences between a pair of sequences, for example, the number of SNPs found in a single individual, divided by total DNA length sequenced is directly proportional to the MRCA if no recombination has occurred and is also proportional to nucleotide diversity h (Nielsen & Slatkin 2013). If recombination has occurred, it will be proportional to average time to all MRCAs. If an element of the sequence-based genomic relationship matrix is computed as the number of allelic differences between the two individuals divided by the number of base pairs actually sequenced, then it can be interpreted as the average time to MRCA between the two individuals’ pairs of haplotypes. In turn, recall that time to MRCA is proportional to h. Therefore, a principal difference of having access to sequence data instead of only SNP arrays is the possibility of estimating DNA diversity. In this work, we study the effect of SNP ascertainment bias on molecular relationship matrices. Simulations are used to compare how well molecular relationship matrices G can be estimated from either NGS or array data. Therefore, we do not directly analyse how much can be gained in terms of genetic progress with say G-BLUP but, rather, we provide guidelines on the scenarios and experimental settings where NGS is to perform comparatively better than SNP arrays to estimate G, and on how SNP arrays could be designed to improve its performance. Nevertheless, the assumption is made that the true G (the matrix computed if complete sequence from all individuals were known without errors) is the best possible scenario for genomic prediction. This assumption is not discussed here, and may not be true, but it is reasonable starting point.

Methods Molecular relationships compared

A wide literature is available on different metrics to compute G (VanRaden 2008; Toro et al. 2011). Here, we compare two approaches to measure genetic similarity that use or not allele frequencies. In the first one (GL), we count the number of alleles shared between two individuals, summed over SNPs in the region and divided by total DNA length analysed (L): gLij ¼

X k ¼1;nsites

dðaik ; ajk Þ = L;

ð1Þ

where aik is genotype for i-th individual and k-th locus, and d is an indicative variable taking values 1, © 2014 Blackwell Verlag GmbH




0.5 and 0 for genotype pairs (AA, AA), (AB, –) and (AA, BB), respectively. As mentioned previously, the quantity gL has a clear evolutionary interpretation: average pairwise difference is proportional to the MRCA between any two sequences, and the average of GL elements is an estimator of nucleotide diversity. In NGS data, because coverage will never be complete, instead of L we used the number of base pairs that fulfilled the minimum depth and base and mapping quality requirements, which was computed for every pair of individuals. Then, with NGS data, L varies from one comparison to another. For SNP arrays, gL cannot be properly computed because the number of interrogated base pairs (L) has no meaning and so the denominator in (1) is the total number of SNPs. Despite its theoretical appeal from a population genetics point of view, a potential disadvantage of GL is that it gives equal weights to each SNP irrespective of its frequency (Yang et al. 2011). The second metrics used, also widely known and used in many G-BLUP programs, corrects for this dividing by the variance of genotype values: P k ¼ 1;nsp ðxik lk Þðxjk lk Þ P ; ð2Þ gFij ¼ k ¼ 1;nsp 2pk ð1 pk Þ Where xij is arbitrarily coded as 1, 0 and 1 for genotypes AA, AB and BB, respectively, p is the population frequency of allele B, and l = 2p is the population mean. We use Equation (2) above instead of averagP ing over loci gFij ¼ k ¼ 1;nsp ðxik lk Þðxjk lk Þ=2pk ð1 pk Þ because the former has been shown to have a better behaviour (Toro et al. 2011). A problem with Equation (2) is what frequency to use, a matter of additional complication if several populations are analysed. Here, we employed average frequency over all populations and individuals. In most cases presented here, this gave reasonable results. Note that GF and GL are equivalent, for an equal set of markers, if p is 0.5 across all markers.

The simplest scenario considered, called ‘cattle’, consists of a bottleneck occurred t2 = 2000 generations ago followed by a recent splitting t1 = 50 generations ago in two populations that become isolated after splitting. These two populations may represent dairy and beef cattle breeds, created after a bottleneck as a result of domestication. Following the literature (Gibbs et al. 2009), we assumed a nucleotide variability h = 0.0012 per bp in both breeds, and effective population size Ne = 300. The second scenario considered is a ‘pig’ scenario. In pigs, it is well known that current international breeds are the result of introgression of Chinese pigs into the ancestral local European pig, and that both Chinese and ancestral European pigs are highly divergent (Giuffra et al. 2000; Groenen et al. 2012). The extent of autosomal introgression into modern breeds is quite high and has been estimated as ~ 30% in two independent studies (Ojeda et al. 2011; Groenen et al. 2012). We simulate individuals from four populations: a local European pig without admixing with Chinese pigs (called ‘Iberian’, population 1), two international breeds, admixed with Chinese pigs (called ‘Large White’ and ‘Duroc’, populations 2 and 3) and a Chinese population (population 4). Chinese admixture has occurred between 25 and 30 generations ago, at a rate of 1 or 2% per generation, with a small migration also between international breeds; three bottlenecks has occurred approximately 30, 100 and 5000 generations ago, the latter time is when Chinese population diverged from the European clade. We employ coalescence simulations with software MaCS (Chen et al. 2009) to generate DNA data in stretches of 100 kb. The exact coalescence commands for both scenarios are in the Appendix 1. CATTLE

PIG

General setting

The main forces in shaping livestock genomes are bottlenecks, that is, a reduction in effective population size due to domestication and modern selection, admixture, that is, hybridization between different breeds, and artificial selection. Although bottlenecks and selection are usually considered as highly influential, here, we show that admixture can also have a dramatic influence. We consider two scenarios that may represent extreme cases in livestock (Figure 1).

© 2014 Blackwell Verlag GmbH


DC

BC

IB

DU LW

Breed

θ (x103)

Cattle

1.2

IB

0.9

DU

1.5

LW

1.5

CN

3.3

CN

Figure 1 Demographic scenarios considered: left, ‘cattle’ scenario; right, ‘pig’ scenario. In pigs, dotted arrows mean limited introgression from China into international breeds. Migration between international breeds was also allowed.

87



bioinformatics subtleties affect the final list and reliability of identified SNPs (Nielsen et al. 2011). Besides, compared with SNP arrays, NGS data are highly unbalanced, and there is no guarantee that all sites will be equally covered across samples. To carefully account for these effects, we simulate a typical NGS analysis pipeline as realistically as possible. The complete bioinformatics pipeline is detailed in Figure 2. First, we simulate polymorphisms with the coalescence using the models detailed in Figure 1, which are mapped onto a fasta file of 100 kb. Initially, the fasta file consisted of randomly chosen nucleotides. One of the simulated sequences is considered as the ‘reference’ sequence and fasta files for the simulated sequences are generated using this reference sequence and the coalescence simulations. In each demographic model and simulation replicate, a subset of pairs of sequences (diploid individuals) is selected to identify SNPs following the ascertainment process described previously. The rest of pairs of sequences are subject to genotyping using both the sparse and the high-density SNP sets; to achieve this, the ascertained SNP positions are interrogated in every genotyped individual. Array genotyping is assumed to be error-free and without missing data. In parallel, for the same genotyped individuals, 100 bp long pairedend reads and 500 2 bp libraries are simulated with software art_illumina (Huang et al. 2012) using the default error profile for Illumina’s technology. Depths are either 49 or 109, but most of results presented refer to 109 only. Reads are aligned with bwa (Li & Durbin 2009) against the reference sequence allowing

SNP ascertainment

The most popular criterion to select array-SNPs is ‘informativity’, that is, the requirement that the SNP segregates in different breeds and that its MAF exceeds a given threshold; this results in a bias towards intermediate allele frequency SNPs. From a genealogical perspective, high MAF SNPs are also older on average than low MAF SNPs. Here, we compare two different ascertainment criteria. In the first case, we required a MAF ≥ 0.15 for a SNP to be included in the genotyping panel, and we randomly selected 10 SNPs per 100 kb out of those that fulfilled this criterion; in the second case, we chose a milder allele frequency requisite MAF ≥ 0.05, and 1 of every 100 SNPs was randomly selected. The former criterion would result in SNP arrays of ~300k SNPs, given that, for a typical mammalian genome of 3 Gb, 10 SNPs per 100 kb is equivalent to 300k SNPs per genome. The latter strategy implies 10 times more SNPs, approximately, or 3M SNPs per genome. In the cattle simulation, we ascertained SNPs in 25 individuals form the first breed; in the pig simulation, SNPs were ascertained in 50 individuals, 25 from each of the ‘international’ breeds. Individuals in the SNP discovery panel are not present in the genotyped and sequenced panel. Simulating NGS errors and bioinformatics pipeline

It is important to realize that NGS-based sequence is not simply a dramatic increase in SNP density. NGS reads present a high error rate, and many

Coalescence simulation

mscript.pl macs, perl

Reference genome

Diploid DNA sequence

Diploid DNA sequence

art_illumina

perl, f95

NGS reads

Ascertained SNPs

b bwa

perl, f95

Individual bamfiles

Population SNP genotypes

Population vcffiles

samtools

f95 Molecular relationship matrices

100 kb windows simulated Figure 2 Bioinformatics pipeline.

88





for six mismatches. SNPs are called with samtools mpileup (Li et al. 2009) using all samples from all populations simultaneously, filtering by base and mapping quality 20, the minimum depth to call a SNP is set to three. We also explored minimum depth of five to call a SNP, but results were in general slightly worse. For a SNP to be considered, its minimum quality from samtools is set to 10. To compute relationships and allele frequencies, we consider the genotype likelihoods from samtools, that is, we weigh each genotype by its probability rather than taking as true genotype the most likely genotype. Multiallelic calls, indels and ambiguous genotypes, that is, the calls where two genotypes have the same likelihood, are discarded. Finally, we compute the different G matrices (Table 1) for all pairs of individuals in each of the 100 kb simulated windows. The correlation between the elements of the G matrix obtained from the ‘true’ sequence and with the different methods is taken as a measure of how well a given method performs. Correlations are obtained between and within populations. The whole-simulation process is replicated 200 times. We retain all elements of the molecular relationship matrices, but we randomly sample 10% in plotting to improve visualization.

assuming homogeneity throughout the genome. Certainly, the latter condition is never fulfilled because the genome varies widely in all its features: number of SNPs, recombination rate, evolutionary constraint and so on, but this only means that the variances in the Figure 3 are severely underestimated compared with the likely real ones. The most important aspect to notice in Figure 3 is that the history of the species, especially the presence of admixture, has a dramatic influence on the

Results and discussion Influence of the demographic scenario

First, we study the properties of the two demographic models from the coalescence simulations, as if true sequences were known. Figure 3 shows the distribution of DNA variability (h) and Fst in both ‘cattle’ and ‘pig’ scenarios. These values are obtained across simulation replicates, but they can also be interpreted as the distribution across 100 kb windows in the genome of a single individual, provided the windows are sufficiently apart to be considered as unlinked, and Table 1 List of metrics used in this study

Data

Average allele difference (Equation 1)

Frequency weighted score (Equation 2)

True sequence NGS SNP (1 SNP/10 kb) High-density SNP (~1 SNP/kb)

GTLa GNLb GSSc GHSc

GTF GNF GSF GHF

a

Divided by number of total sites. Divided by number of total callable sites, that is, sites with appropriate depth and mapping quality per pair of individuals. c Divided by number of segregating SNPs. b



Figure 3 Top: distribution of nucleotide diversity in cattle (cow) and pig scenarios (CN, China; IB, Iberian; LW, Large White); middle: distribution of Fst between pairs of populations (DU, Duroc); bottom, distribution of Fsts between Iberian and Large White versus Large White and Duroc.

89



distribution of nucleotide diversity and on differentiation between breeds. These two parameters are fundamental for many aspects of livestock genetics, including effective genetic management, repeatability of GWAS studies or estimating molecular relationships, the main subject of this paper. In the absence of admixture and in a relatively simple model that only includes bottlenecks (the ‘cattle’ model), both h and Fst follow smooth patterns. In practice, this means that any single window is equally representative of the whole genome and that a limited number of markers will suffice to characterize the genome. In contrast, when hybridization between genetically distant breeds occurs, as in the pig, the whole pattern becomes much more complex and unpredictable. Note in particular differentiation between two breeds (Fst). While ‘Iberian’ – ‘Chinese’ Fsts windows are consistently high, the Fst’s between ‘Large White’ and China is also high but shows a thick tail on the lower end of the distribution. This means that some windows – those that contain tracts of Chinese germplasm – are not too differentiated between these two breeds. The situation becomes even more complex when comparing Iberian and International breeds or between International breeds. Our pig model predicts that the profile is very flat in this case and, therefore, that differentiation along the genome will be extremely heterogeneous. It should be mentioned that Fst values are poorly correlated across comparisons; for instance, even if Fst’s patterns were very similar in

G_NL ( −0.0953 )

G_SS ( −0.0240 )

G_HS ( −0.0236 ) 6

0.995

4 0

1.0

0.5

0.9

G_HS ( −0.6455 )

1.0

Density

0.0

1.0

0.7

2.0

G_SS ( −0.7017 )

0.0 0.985

2

Density

3

Density

2

0.6

2.0

G_NL ( −1.9007 )

0 0.995

0.2

Density

400 Density

250 100 0

0 0.9970 0.9985 1.0000

G_TL ( −1.6506 )

0.985

1

1000

Density

0 0.999

200

0.997

Density

400

400 800 0

Density

4

G_TL ( −0.3185 )

IB-LW and LW-DU comparisons, the correlation between those Fst’s was mild, 0.48 (Figure 3c). In other words, our model predicts that rarely the same two windows in, say, Large White and Duroc will be simultaneously enriched for alleles of Asian origin. The effect of demography is also seen in the distribution of G matrix elements (Figure 4, left column). In no case, the G elements were normally distributed while the distribution was symmetrical in the no migration scenario (top row). In stark contrast, but unsurprisingly, migration induces a profound change in how G elements are distributed, consistent with what is observed in Fst (Figure 3). The bottom row in Figure 4 shows these distributions for the ‘Large White’ breed. The distribution is highly skewed to the left because, due to introgression, some pairs of individuals in some genome regions are far less related than the rest of non-admixed individuals or windows. This distribution is very complex and, certainly, its exact shape will depend on a number of factors like age and extent of admixture. For instance, if admixing is very recent and massive, like in any of the B. taurus 9 B. indicus synthetic breeds, the density will be multimodal and rather flat. If admixture occurred long time ago, recombination will have erased differences and the distribution would become symmetrical again, as in the cattle scenario of Figure 1. This is not the case of the pig, where introgression occurred quite recently and so we expect to see patterns similar to those presented in the bottom row of Figure 4.

0.0

0.4

0.8

0.0

0.4

0.8

Figure 4 Density of GL matrices’ elements obtained from Equation (1) in the cattle (top row) and pig (bottom row) inferred from the different data types: using the true sequence (G_NL), NGS data (G_NL), standard SNP density (G_SS) and high-density genotyping (G_HS). Elements plotted are those corresponding to within the cattle population not used in the SNP discovery process (top) and the pig ‘Large White’ breed (bottom). Numbers in parentheses are skewness coefficients. Note that x and y axes scales vary widely from plot to plot.

90





the same shape as the true one, although note that the mean and variances differ. SNP arrays resulted in highly distorted profiles in the presence of admixture (bottom row), again, that underestimates skewness but overestimates variance as compared to the true distribution. NGS data (GNL) allowed us to recover approximately the true pattern, both with and without admixture. In commercial arrays, the user has no control over the SNPs genotyped. It is possible, nevertheless, that genotyping technologies improve so that customdesigned high-density arrays may become affordable in the future. In this case, it makes sense to study how SNP ascertainment side effects can be minimized. To that end, we explore four SNP ascertainment strategies that differ in SNP density (10 and 100 per 100 kb) and in the minimum MAF required (0.15 versus 0.01 or 0.05). The correlations between true GL elements and those obtained from ascertained SNPs are in Table 2; for completeness, those with NGS data at low (49) or medium (109) depths are also shown. At equal SNP density, relaxing the MAF criterion results in better correlations, especially for those breeds not included in the ascertainment processes (Iberian and China). Increasing the SNP density also improves correlation, although with a likely increase in associated costs as well. We do not find a clear improvement if the number of individuals used to discover the SNPs were 150 instead of 50 (results not shown). For SNP

The effect of SNP ascertainment

SNP ascertainment is difficult to model because there are several, often unknown, reasons for a given SNP to be included in the array’s final panel. Its effects also depend on the genetic distance between the genotyped population and the breeds used to discover the SNPs, which may be large or even unknown. For all those reasons, the results presented here should be taken with some caution and are aimed at gaining insight on the, plausibly, most important consequences of ascertainment. One of the most wellknown effects of ascertainment is distorting the site frequency spectrum (SFS) of the markers relative to the true one if individuals were sequenced (Nielsen et al. 2004; Lachance & Tishkoff 2013). Figure 5 shows the observed SFS in the cattle population that is not used for the SNP discovery process. Clearly, while NGS allows us to recover closely the true spectrum, array-based SNP data are biased towards intermediate allele frequency alleles, as is usually observed in real data from large-scale genotyping studies. Note that even high-density (equivalent to a ~3 million SNPs array) results in a considerable shift towards common variants. Unsurprisingly, SNP ascertainment also affects the distribution of the G elements (Figure 4). In the case of no migration (top row), the effect of SNP ascertainment is not so dramatic because the distributions have

NGS

0

0

1000

1000

3000

3000

True

0.0

0.2

0.4

0.6

0.8

1.0

0.0

0.2


0.8

1.0

0.8

1.0

800 400 0

200 0

Figure 5 Site frequency spectra in the cattle population that was not used for SNP ascertainment. Average results of 200 replicates of 100 kb windows and a 25 individual population. In the Figure we show the unfolded spectrum, where a high frequency means a high frequency of the mutated (derived) allele. In practice, the ancestral allele is not necessarily known so only the minimum allele frequency (0< MAF 0.85) and were slightly better when standardizing by frequency (GNF), and this improved the linearity of the plots. With SNP data, in contrast, the relationship becomes unpredictable, often markedly non-linear and multimodal. In some extreme case, like for China, using array data to infer relationships was virtually useless because of the large genetic distance between ascertained (Large White and Duroc) and genotyped (China) populations. Still, array-based estimates

1.0

1.5

2.0

1.0 0.5 2.0

0.5

1.0

1.5

1.0 0.0

0.5

1.0

1.5

G_HF 0.91 3

G_SF 0.75

2 1

2

0

1 0

−1

−1

0.996

1.000

1.0 1.5 2.0 2.5 3.0 3.5

2.5 2.0 1.5 1.0

G_NF 0.99

0.5

1.0

1.0 1.5 2.0 2.5 3.0 3.5 4.0

0.0

0.5 0.996

1.000

0.5 1.0 1.5 2.0

G_HF 0.92

−0.5

−0.5 −1.0 −0.5

0.0

0.5

1.0

−1.0

−1.5 0.992

−0.5

1.0 1.5 2.0 2.5 3.0 3.5 4.0

0 0.988

1.5

G_SF 0.70

−1

0.985 0.992

1.0

G_NF 0.98

G_NF 0.99

1

0.995 0.990

0.6 0.988

0.5

1.0 1.5 2.0 2.5 3.0 3.5 4.0

1.5

0.998

G_NL 0.97

0.5 1.0 1.5 2.0

3.0

3.0 2.0 1.0 0.996

−0.5

0.0

G_HF 0.15

0.0 0.994

0.4 1.000

0.5 1.0 1.5 2.0

G_SF −0.06

2

0.998

0.8

0.8 0.6

0.996

−0.5

0.996 0.996

G_HS 0.76

0.4

0.992

0.998

0.992 0.994

0.2 0.988

0.994

G_NL 0.76

0.988

0.998

G_SS 0.54

1.0

0.990

1.000

0.996

0.998

G_HS 0.33

0.80 0.85 0.90 0.95 1.00

1.0 0.9 0.8 0.7 0.6 0.994

0.994

2.0

0.0

−0.5 0.0

0.992 0.990

1.5

LW

0.999

0.988

0.998

1.0

CHINA

0.997

G_NL 0.96

0.996

0.6 0.4 0.2 0.994

G_SS 0.17

0.5

G_NF 0.98

IB − LW

0.995

0.8

0.8 0.6 0.4 0.2 0.990

−0.5 0.0

−0.5 0.0 0.5 1.0 1.5 2.0

0.999

1.5

3

1.0

1.0

0.997

G_HS 0.89

1.0

0.5

2 1 −1

0.992

0.995

1.000

0.999

0.5

G_HF 0.87

0

0.7 0.6 0.5 0.997

G_SS 0.76

−0.5 0.0

IBERIAN

0.5

G_SF 0.60

0.996

0.8

0.8 0.6 0.4 0.995

−0.5 0.0

0.5 −0.5 0.0

1.5

0.9995

−0.5

−1 0.9985

G_NL 0.73

CATTLE

2

1.5

1.0

0.9975

0.5 1.0 1.5 2.0

0.9995

3

0.9985

G_HS 0.78

0.9

1.0

0.9975

0.9975

0.9995

1.000

0.9985

G_SS 0.63

G_NF 0.98

1 0

0.7 0.6

0.2 0.9975

G_HF 0.92 1.5

G_SF 0.70

0.9985

0.8

0.8 0.6 0.4

G_NL 0.97

0.9995

G_HS 0.73 0.9

1.0

G_SS 0.52

perform reasonably well within international breeds (q = 0.70–0.83), although patterns are often far from linear. We evaluate the influence of average depth across a limited range of options. We do not observe a marked improvement by increasing 10–209 in depth (results not presented). In contrast, shallow depth (49) does have a big impact for worse, but mainly in the nonstandardized G elements (GL). For instance, in ‘Iberian’ pig, correlation between true and NGS elements is 0.25 for GL elements, but it increases to 0.94 in GF. The reason for such a large – and consistent – discrepancy between GL and GF at shallow depth is not completely clear. It is likely, though, that this is due to the fact that only the SNPs identified are considered in the computation of GF, whereas all sites are employed in GL, irrespective of whether SNPs are identified or

−1.0 −0.5

0.0

0.5

1.0

−1.0 −0.5

0.0

0.5

1.0

Figure 7 Scatter plots of true (x-axis) versus estimated G (y-axis) elements across scenarios, populations and methods. Across rows, from top to bottom: (1) within the non-ascertained cattle population, (2) within Iberian, (3) within Large White, (4) within China, (5) between Iberian and Large White. Across columns, from left to right: GSS, GHS, GNL, GSF, GHF, and GNF. Numbers in the title are the correlations between true and estimated G elements.



93



not. At very shallow depth, many truly heterozygous individuals will not show up because reads pertain to only one allele type, simply because of sampling. This site will count as homozygous in GL but will not be considered in GF. Besides, these events will occur more frequently as MAF decreases so these markers will in any case be given less weight in GF. Will high-density genotyping, instead of NGS, suffice?

The answer to this question critically depends in part on issues that have not been specifically addressed here, like the reliability of massive imputation of haplotypes out of a few influential animals’ sequenced. Other aspects to be considered is that sequence data are far more complex to analyse and results in highly unbalanced data compared with chip genotyping; sequence data are also much slower and more expensive to acquire than array genotyping. From our simulations, we also observe that the demographic scenario has a key influence on the properties of the G matrices, so a preliminary study to infer key genetic parameters in the target population should be undertaken before setting a definitive sequence-based selection programme. Despite their limitations, relationships based on ascertained high-density SNPs are reasonably accurate (q ~ 0.80 in Large White pigs and q > 0.90 in cattle, Figure 7). The main practical limitation is to have access to such a high-density array, which is equivalent to a ca. 3 M SNPs in mammalian genomes (this number follows from assuming 100 SNPs per 100 kb window and a 3 Gb genome). Currently, the densest array in human is over 4 M SNPs but the densest livestock array is the 800k SNP cattle chip (Matukumalli et al. 2011). For that quantity of SNPs, our model predicts a correlation between estimated and true G elements of 0.57 and 0.83 for GL and GF, respectively, in the cattle scenario. These correlations are obtained when SNPs are ascertained with a minimum MAF ≥ 0.15, and are slightly higher if restriction on MAF was less strict, i.e., 0.05. Therefore, the use of high-density SNP arrays instead of sequence data for genomic selection will partly depend on whether standardization by allele frequency results in optimum genetic progress as compared to using average pairwise differences (Equation 1). Conclusions This work explores the effects of SNP ascertainment and of sequencing bioinformatics pipelines in estimating genetic relationships. It evaluates, using plausible 94

demographic scenarios, the impact of NGS and of SNP ascertainment on the estimation of genetic relationships, a fundamental tool for genomic selection. Several conclusions can be drawn from this study. First, the demographic scenario, specially the presence of admixture, has a dramatic influence on the G elements. While average pairwise differences (GL) have a clear asymmetrical distribution with admixture, weighting by SNP frequency (GF) normalizes – to an extent – this distribution. Second, NGS data result in the most accurate estimates of G, provided enough depth is obtained, although shallow NGS may result in highly distorted estimates of GL (although not so much for GF). Third, high-density genotyping can also result in accurate estimates of G; we recommend nevertheless that the distribution of G elements should be carefully inspected to detect unexpected patterns. It has been advocated (Larkin et al. 2012; Hayes et al. 2013) that implementation of NGS data in genomic selection schemes could involve sequencing a few highly influential animals and have the genomes of the rest of the population imputed using genotyping or very shallow sequencing data. Here, we provide an alternative strategy: to develop high-density specific arrays for the population of interest with a very mild restriction on MAF. These chips could be designed using, for example, the sequence of the founders or of influential animals and would later be used for massively genotyping the rest of the population. This approach should be cheaper and more reliable than sequencing followed by imputation because animals should be genotyped or sequenced anyhow and the development of custom arrays should be widely feasible in the future. A final important note: this strategy critically depends on GF (Equation 2) being better suited for genomic selection than GL (Equation 1). Acknowledgements I thank Andr es Legarra. Miguel Toro and Clara Dıaz, for suggestions and many discussions, and the editor and reviewer for constructive criticisms. I also thank Martien Groenen, Hendrik Jan Megens and students for a stimulating environment and discussions, Sebasti an Ramos-Onsins for his mstatpop program, and Bruno Nevado for his script to map MaCS output onto fasta files. This work was accomplished during a sabbatical stay in Wageningen University, kindly funded by Wageningen University Research (WUR), which provided a stimulating environment and numerous opportunities for invigorating bicycle rides. Work also funded by AGL2010-14822 grant (Spain). © 2014 Blackwell Verlag GmbH




References Chen G.K., Marjoram P., Wall J.D. (2009) Fast and flexible simulation of DNA sequence data. Genome Res., 19, 136– 142. Druet T., Macleod I.M., Hayes B.J. (2014) Toward genomic prediction from whole-genome sequence data: impact of sequencing design on genotype imputation and accuracy of predictions. Heredity, 112, 39–47. Gibbs R.A., Taylor J.F., Van Tassell C.P., Barendse W., Eversole K.A., Gill C.A., Green R.D., Hamernik D.L., Kappes S.M., Lien S., Matukumalli L.K., McEwan J.C., Nazareth L.V., Schnabel R.D., Weinstock G.M., Wheeler D.A., Ajmone-Marsan P., Boettcher P.J., Caetano A.R., Garcia J.F., Hanotte O., Mariani P., Skow L.C., Sonstegard T.S., Williams J.L., Diallo B., Hailemariam L., Martinez M.L., Morris C.A., Silva L.O., Spelman R.J., Mulatu W., Zhao K., Abbey C.A., Agaba M., Araujo F.R., Bunch R.J., Burton J., Gorni C., Olivier H., Harrison B.E., Luff B., Machado M.A., Mwakaya J., Plastow G., Sim W., Smith T., Thomas M.B., Valentini A., Williams P., Womack J., Woolliams J.A., Liu Y., Qin X., Worley K.C., Gao C., Jiang H., Moore S.S., Ren Y., Song X.Z., Bustamante C.D., Hernandez R.D., Muzny D.M., Patil S., San Lucas A., Fu Q., Kent M.P., Vega R., Matukumalli A., McWilliam S., Sclep G., Bryc K., Choi J., Gao H., Grefenstette J.J., Murdoch B., Stella A., Villa-Angulo R., Wright M., Aerts J., Jann O., Negrini R., Goddard M.E., Hayes B.J., Bradley D.G., Barbosa da Silva M., Lau L.P., Liu G.E., Lynn D.J., Panzitta F., Dodds K.G. (2009) Genome-wide survey of SNP variation uncovers the genetic structure of cattle breeds. Science, 324, 528–532. Giuffra E., Kijas J.M., Amarger V., Carlborg O., Jeon J.T., Andersson L. (2000) The origin of the domestic pig: independent domestication and subsequent introgression. Genetics, 154, 1785–1791. Groenen M.A., Archibald A.L., Uenishi H., Tuggle C.K., Takeuchi Y., Rothschild M.F., Rogel-Gaillard C., Park C., Milan D., Megens H.J., Li S., Larkin D.M., Kim H., Frantz L.A., Caccamo M., Ahn H., Aken B.L., Anselmo A., Anthon C., Auvil L., Badaoui B., Beattie C.W., Bendixen C., Berman D., Blecha F., Blomberg J., Bolund L., Bosse M., Botti S., Bujie Z., Bystrom M., Capitanu B., Carvalho-Silva D., Chardon P., Chen C., Cheng R., Choi S.H., Chow W., Clark R.C., Clee C., Crooijmans R.P., Dawson H.D., Dehais P., De Sapio F., Dibbits B., Drou N., Du Z.Q., Eversole K., Fadista J., Fairley S., Faraut T., Faulkner G.J., Fowler K.E., Fredholm M., Fritz E., Gilbert J.G., Giuffra E., Gorodkin J., Griffin D.K., Harrow J.L., Hayward A., Howe K., Hu Z.L., Humphray S.J., Hunt T., Hornshøj H., Jeon J.T., Jern P., Jones M., Jurka J., Kanamori H., Kapetanovic R., Kim J., Kim J.H., Kim K.W., Kim T.H., Larson G., Lee K., Lee K.T., Leggett R., Lewin H.A., Li Y., Liu W., Loveland J.E., Lu Y., Lunney J.K., Ma J., Madsen O., Mann K., Matthews L., McLaren



S., Morozumi T., Murtaugh M.P., Narayan J., Nguyen D.T., Ni P., Oh S.J., Onteru S., Panitz F., Park E.W., Park H.S., Pascal G., Paudel Y., Perez-Enciso M., RamirezGonzalez R., Reecy J.M., Rodriguez-Zas S., Rohrer G.A., Rund L., Sang Y., Schachtschneider K., Schraiber J.G., Schwartz J., Scobie L., Scott C., Searle S., Servin B., Southey B.R., Sperber G., Stadler P., Sweedler J.V., Tafer H., Thomsen B., Wali R., Wang J., Wang J., White S., Xu X., Yerle M., Zhang G., Zhang J., Zhang J., Zhao S., Rogers J., Churcher C., Schook L.B. (2012) Analyses of pig genomes provide insight into porcine demography and evolution. Nature, 491, 393–398. Hayes B.J., Lewin H.A., Goddard M.E. (2013) The future of livestock breeding: genomic selection for efficiency, reduced emissions intensity, and adaptation. Trends Genet., 29, 206–214. Hickey J.M. (2013) Sequencing millions of animals for genomic selection 2.0. J. Anim. Breed. Genet., 130, 331–332. Huang W., Li L., Myers J.R., Marth G.T. (2012) ART: a next-generation sequencing read simulator. Bioinformatics, 28, 593–594. Hudson R.R. (2002) Generating samples under a WrightFisher neutral model of genetic variation. Bioinformatics, 18, 337–338. Lachance J., Tishkoff S.A. (2013) SNP ascertainment bias in population genetic analyses: why it is important, and how to correct it. BioEssays, 35, 780–786. Larkin D.M., Daetwyler H.D., Hernandez A.G., Wright C.L., Hetrick L.A., Boucek L., Bachman S.L., Band M.R., Akraiko T.V., Cohen-Zinder M., Thimmapuram J., Macleod I.M., Harkins T.T., McCague J.E., Goddard M.E., Hayes B.J., Lewin H.A. (2012) Whole-genome resequencing of two elite sires for the detection of haplotypes under selection in dairy cattle. Proc. Natl Acad. Sci., 109, 7693–7698. Li H., Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078–2079. Matukumalli L.K., Schroeder S., DeNise S.K., Sonstegard T., Lawley C.T., Georges M., Coppieters W., Gietzen K., Medrano J.F., Rincon G., Lince A., Eggen A., Glaser L., Cam G., Van Tassell C. (2011) Analyzing LD blocks and CNV segments in cattle: Novel genomic features identified using the BovineHD BeadChip. (ed. by Pub. No. 370-2011-002 II, San Diego, CA). Meuwissen T., Goddard M. (2010) Accurate prediction of genetic values for complex traits by whole-genome resequencing. Genetics, 185, 623–631. Nielsen R., Slatkin M. (2013) An Introduction to Population Genetics. Sinauer, Sunderland, MA.

95



Nielsen R., Hubisz M.J., Clark A.G. (2004) Reconstituting the frequency spectrum of ascertained single-nucleotide polymorphism data. Genetics, 168, 2373–2382. Nielsen R., Paul J.S., Albrechtsen A., Song Y.S. (2011) Genotype and SNP calling from next-generation sequencing data. Nat. Rev. Genet., 12, 443–451. Nordborg M., Tavare S. (2002) Linkage disequilibrium: what history has to tell us. Trends Genet., 18, 83–90. Ojeda A., Ramos-Onsins S.E., Marletta D., Huang L.S., Folch J.M., Perez-Enciso M. (2011) Evolutionary study of a potential selection target region in the pig. Heredity (Edinb), 106, 330–338. Raineri E., Ferretti L., Esteve-Codina A., Nevado B., Heath S., Perez-Enciso M. (2012) SNP calling by sequencing pooled samples. BMC Bioinformatics, 13, 239. Toro M.A., Garcia-Cortes L.A., Legarra A. (2011) A note on the rationale for estimating genealogical coancestry from molecular markers. Genet. Sel. Evol., 43, 1–10. VanRaden P.M. (2008) Efficient methods to compute genomic predictions. J. Dairy Sci., 91, 4414–4423. Yang J., Manolio T.A., Pasquale L.R., Boerwinkle E., Caporaso N., Cunningham J.M., de Andrade M., Feenstra B., Feingold E., Hayes M.G., Hill W.G., Landi M.T., Alonso A., Lettre G., Lin P., Ling H., Lowe W., Mathias R.A., Melbye M., Pugh E., Cornelis M.C., Weir B.S., Goddard M.E., Visscher P.M. (2011) Genome partitioning of genetic variation for complex traits using common SNPs. Nat. Genet., 43, 519–525.

Appendix 1: Coalescence commands to generate sequence data For a thorough documentation on coalescence commands, the reader is referred to Hudson’s site http:// home.uchicago.edu/rhudson1/source/mksamples. html (Hudson 2002). The command used in the cattle model is (parameters are scaled in 4Ne units): macs 250 100000 -t 0.0012 -r 0.0012 -I 2 150 50 -ej 0.042 2 1 -eN 1.67 10

which is read as: a coalescence simulation of 125 diploid individuals (250 chromosomes) sequence of

96

length 100000 bp with population variability (h) and recombination rates per base pair equal to 0.0012 in 4Ne units (Ne is 300), 75 and 50 individuals are sampled from each of two populations (-I 2 150 50), the populations diverged t1 = 0.042 generations ago in 4Ne units (-ej 0.042 2 1) and the ancestral population suffered a bottleneck decreasing its effective population size 10-fold t2 = 1.67 generations ago (-eN 1.67 10). These parameters are adjusted to mimic, approximately, observed diversity levels in cattle (Gibbs et al. 2009). The complete command for the ‘pig’ scenario is, where scaling is again in 4Ne units, with Ne=125, derived from Ne = h/4l =0.0012/4 9 106: macs 300 100000 -t 0.0005 -r 0.0005 -I 4 50 100 100 50 -m 2 3 0.01 -m 3 2 0.01 -n 1 0.15 -n 2 0.15 -n 3 0.15 -n 4 1.5 -em 0.049 3 4 5 -em 0.05 2 4 10 -eM 0.06 0 -ej 0.07 2 1 -ej 0.0701 3 1 -en 0.09 1 5 -en 0.2 1 7 -en 0.21 4 10 -ej 10 1 4

The above command (-I 4 50 100 100 50) means that diploid sequence of 25 individuals from each of Iberian and Chinese populations are simulated, together with 50 from Duroc and Large White. Of those, 25 Duroc and 25 LW are later randomly selected to carry out SNP ascertainment, which is used to ‘genotype’ the rest of samples. Migration between China (population 4) and International breeds (populations 2 and 3) have occurred between generations 0.049 and 0.06 ago at a rate 5% and 10%, as well as limited symmetric migration between LW and DU (-m 2 3 0.01 -m 3 2 0.01) all European populations have merged 0.07 9 4Ne generations ago (-ej 0.07 2 1), whereas China and Europe diverged 10 9 4Ne generations ago (-ej 10 1 4). The properties, in terms of variability and Fst distributions, from these two models were explored with mstatpop (Ramos-Onsins et al., unpublished, available at http://bioinformatics.cragenomica.es/numgenomics//people/sebas/software/software.html), that process directly the output from MaCS software.



Robust demographic inference from genomic and SNP data.

A Method for Checking Genomic Integrity in Cultured Cell Lines from SNP Genotyping Data.

Comparison among three variant callers and assessment of the accuracy of imputation from SNP array data to whole-genome sequence level in chicken.

Scanning and Filling: Ultra-Dense SNP Genotyping Combining Genotyping-By-Sequencing, SNP Array and Whole-Genome Resequencing Data.

Re-evaluating data quality of dog mitochondrial, Y chromosomal, and autosomal SNPs genotyped by SNP array.

SNP Array in Hematopoietic Neoplasms: A Review.

Combined Analysis of SNP Array Data Identifies Novel CNV Candidates and Pathways in Ependymoma and Mesothelioma.

Analyzing copy number variation using SNP array data: protocols for calling CNV and association tests.

Deciphering clonality in aneuploid breast tumors using SNP array and sequencing data.

4Pipe4--A 454 data analysis pipeline for SNP detection in datasets with no reference sequence or strain information.

Performance comparison of SNP detection tools with illumina exome sequencing data--an assessment using both family pedigree information and sample-matched SNP array data.

Total genomic alteration as measured by SNP-array-based molecular karyotyping is predictive of overall survival in a cohort of MDS or AML patients treated with azacitidine.

A novel bioinformatics method for efficient knowledge discovery by BLSOM from big genomic sequence data.

Quantitative trait loci markers derived from whole genome sequence data increases the reliability of genomic prediction.

Inferring Strain Mixture within Clinical Plasmodium falciparum Isolates from Genomic Sequence Data.

Approaches to estimating inbreeding coefficients in clinical isolates of Plasmodium falciparum from genomic sequence data.

Assembly and comparative analysis of transposable elements from low coverage genomic sequence data in Asparagales.

Genome wide association and genomic prediction for growth traits in juvenile farmed Atlantic salmon using a high density SNP array.

Array comparative genomic hybridization (array CGH) for detection of genomic copy number variants.

LAMPLINK: detection of statistically significant SNP combinations from GWAS data.

Estimation of SNP heritability from dense genotype data.

Detecting SNP combinations discriminating human populations from HapMap data.

An updated felCat5 SNP manifest for the Illumina Feline 63k SNP genotyping array.

Minimal-assumption inference from population-genomic data.