J. Anim. Breed. Genet. ISSN 0931-2668, doi:10.1111/jbg.12072

EDITORIAL

Genetics, genomics, breeding – why scale matters

Many fundamental changes in science evolved from analyses of a given problem on largely different scales. One of the most prominent examples of this can be found in physics. For almost 400 years, Newton’s laws satisfactorily explained all phenomena on the scales relevant for humans. However, when considering the world on an extremely large (in cosmology) or extremely small (in particle physics) scale, Newton’s universe collapsed, and novel and hardly intuitive ideas such as quantum mechanics and the general theory of relativity took over. What is the lesson to be learned from these (and many other) examples? Analysing a complex system on very different scales may require different approaches and may lead to substantially different conclusions, and there may be no continuous changes, but rather qualitative jumps, especially when going to the extremes.

In animal breeding and genetics, largely different scales are pervasive as well, and genetic and genomic approaches over the last decades have been characterized by moving through increasingly refined scales in an exponentially accelerating way. Marker-based applications started with a few (in the order of 10¹) applicable protein polymorphisms in the 1950s; with novel techniques such as RFLPs in the 1980s, the number of polymorphisms was in the order of 10²; around the millennium, 10³ microsatellites were in use; and in the last 10 years, SNP-based approaches have provided marker sets in the order of 10⁴–10⁵. Sequencing technology will soon make the whole genomic variability, in the order of 10⁷ SNPs, available and will also provide access to other classes of polymorphisms such as small indels, microRNAs and methylation patterns. The increase in information density per individual coincided with a significant increase in the number of genotyped individuals. While a typical mapping study in the 1990s was based on a few hundred individuals genotyped for a few hundred microsatellites, present genomic studies rely on thousands, often tens of thousands, of individuals genotyped for tens of thousands up to a million SNPs.

Have we experienced a qualitative change of rules when moving from the microsatellite world to the SNP world? Of course we have!

In the microsatellite world, the dominating concept was linkage. Distances between markers were so large that the probability of a recombination within the given pedigree – often spanning just two or three generations – could not be ignored, leading to the situation that in different families different marker alleles were associated with the positive effect. As a consequence, positional resolution in linkage mapping typically spanned several marker intervals, often covering a sizeable part of a whole chromosome, and marker-assisted selection schemes on the whole failed because recombination destroyed established associations faster than they could be (re-)established and effectively exploited. When moving from the microsatellite world to the SNP world with the pilot applications of the 54k SNP array in cattle, marker intervals dropped from a few centiMorgans – roughly equivalent to a few per cent chance of recombination in each meiosis – to less than a tenth of a centiMorgan, making the probability of a recombination between neighbouring loci essentially negligible. With this, the concept of linkage was replaced by the concept of linkage disequilibrium, leading to locally stable marker associations across families. This proved to be highly useful both in the context of mapping causal variants for phenotypic variability and in the context of genome-based prediction. While linkage mapping was a highly demanding exercise (in regard to time, methodological complexity and required resources) in the microsatellite world, association mapping today comes as a by-product of genomic selection by applying relatively simple multiple regression methods to widely available data.

Prediction within a population based on marker scales of increasing density has important implications from a statistical point of view. The number p of markers regularly exceeds the sample size n, so that estimation problems fall in the class of ‘large p, small n’ problems. In the context of genomic prediction, p often exceeds n by several orders of magnitude. In principle, this is nothing new: in each animal model BLUP, the total number of individuals (including ancestors in the pedigree) is larger than the number of individuals with observations. Since Prof. Henderson’s groundbreaking contributions, animal breeders have known how to deal with this situation by applying mixed model methodology.
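The machinery behind this is worth recalling in standard textbook notation (the symbols below are generic and not taken from this editorial). For records y, fixed effects b and breeding values u with pedigree relationship matrix A, the animal model and Henderson's mixed model equations read

\[
\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{e}, \qquad \mathbf{u} \sim N(\mathbf{0}, \mathbf{A}\sigma_u^2), \qquad \mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2),
\]
\[
\begin{bmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z} \\ \mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z} + \lambda\mathbf{A}^{-1} \end{bmatrix}
\begin{bmatrix} \hat{\mathbf{b}} \\ \hat{\mathbf{u}} \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \end{bmatrix},
\qquad \lambda = \sigma_e^2/\sigma_u^2.
\]

The term λA⁻¹ is precisely the penalty that keeps the system solvable although the number of unknowns exceeds the number of observations.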

In more general terms, the strategy for solving the ‘large p, small n’ problem is to impose a penalty on the excess of predictors used. Penalties can be imposed uniformly, for example by regressing all predictors by the same amount towards zero, as carried out in the widely used random regression BLUP. Alternatively, penalization can differ with effect sizes, as implemented in Bayesian methods such as Bayes A, B, C or the Bayesian Lasso.

Basically, there are two different ways in which a finer marker scale may affect the accuracy of genomic prediction. The first option is that accuracy continuously increases with increasing marker density. In this case, the argument is that with the refinement of the scale, the probability of more and more QTLs being in useful LD with an observed marker increases, so that the proportion of genetic variance captured approaches 100 per cent. The alternative option is that only a limited number of markers will capture the relevant genetic variability and that adding more and more markers will just add noise to the system, in which case the accuracy of prediction would no longer increase beyond a certain marker density. Empirical studies have shown that the accuracy of genomic prediction increases approximately linearly with the logarithm of the marker density. However, this holds only up to a (population- and trait-specific) upper limit. The factors determining the position of this upper limit are not fully understood to date. Other studies have demonstrated that penalizing methods can improve the accuracy of prediction, especially if the genomic architecture underlying the studied trait involves large QTLs.

Another scale affecting the achievable accuracy is the average degree of relatedness in the population. In some applications, the genomic prediction model is trained on a sample which is hardly or not at all related to the set of individuals yet to be predicted. This is often the case in human genetics applications or when prediction is to be made across breeds within a livestock species. For the latter scenario, it was actually suggested that applying much higher marker densities would improve the accuracy of genomic predictions across populations, because only with high marker densities will haplotypes persist across populations. Empirical studies did not confirm this expectation, though – at least, moving from the 54k SNP array to the 777k array in cattle was not enough to cause a major improvement in prediction accuracy across populations. For the case of negligible relatedness and perfect LD between markers and QTLs, de los Campos et al. (2013, PLoS Genet., 9(7), e1003608) suggested a theoretical (and rather low) upper limit for the achievable accuracy, explaining the limited potential of genomic prediction in human genetics applications, where high marker densities and a low degree of relatedness prevail.
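To make the uniform-penalty idea discussed above concrete, the following toy sketch in Python fits a ridge-type SNP-BLUP in a setting with far more markers than individuals. It is a minimal sketch with simulated data only; the sample sizes, heritability, QTL number and variance ratio are assumptions made purely for illustration and are not taken from any study referred to in this editorial.

    # Toy sketch of uniform shrinkage ('SNP-BLUP'/ridge regression) with many more markers than individuals.
    # All data are simulated; marker number, sample sizes, heritability and QTL number are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(1)
    n_train, n_pred, p = 500, 100, 5000          # p markers >> n individuals
    n = n_train + n_pred

    maf = rng.uniform(0.05, 0.5, p)              # allele frequencies
    X = rng.binomial(2, maf, (n, p)).astype(float)
    X -= 2.0 * maf                               # centre genotype codes at twice the allele frequency

    beta = np.zeros(p)                           # sparse true architecture: 50 QTLs
    beta[rng.choice(p, 50, replace=False)] = rng.normal(0.0, 0.5, 50)
    g = X @ beta                                 # true genetic values
    h2 = 0.5
    y = g + rng.normal(0.0, np.sqrt(g.var() * (1 - h2) / h2), n)

    X_t, y_t = X[:n_train], y[:n_train]          # training set (phenotyped)
    X_v, g_v = X[n_train:], g[n_train:]          # individuals to be predicted

    # Uniform penalty: every marker effect is shrunk by the same lambda = sigma_e^2 / sigma_beta^2,
    # with the genetic variance spread equally over all markers: sigma_beta^2 = sigma_g^2 / (2*sum(maf*(1-maf))).
    lam = 2.0 * np.sum(maf * (1.0 - maf)) * (1 - h2) / h2

    # Closed-form ridge solution; the n x n system is much cheaper than the p x p one when p >> n:
    # beta_hat = X'(XX' + lambda*I_n)^{-1} y  is identical to  (X'X + lambda*I_p)^{-1} X'y.
    beta_hat = X_t.T @ np.linalg.solve(X_t @ X_t.T + lam * np.eye(n_train), y_t - y_t.mean())
    g_hat = X_v @ beta_hat

    print('accuracy cor(g, g_hat):', round(float(np.corrcoef(g_v, g_hat)[0, 1]), 3))

Shrinking all p marker effects by the same amount is what keeps the estimation well posed although p greatly exceeds n; the Bayesian alternatives mentioned above differ mainly in letting the amount of shrinkage depend on the (estimated) effect size.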

The other extreme is a high degree of relatedness between the training and the prediction set. This is the standard situation in many within-breed applications in livestock, when a forward prediction is made and the individuals in the prediction set are direct offspring of individuals in the training set. With genomic BLUP based on the genomic relationship matrix as defined by VanRaden (2008, J. Dairy Sci., 91, 4414), increasing the marker density basically improves the accuracy of the realized relationship coefficients and of the Mendelian sampling term. Obviously, there cannot be such a thing as ‘too many markers’ in this case, although there is of course a diminishing return when increasing the marker density. At high levels of relatedness, genomic prediction also appears to be rather robust and insensitive to the underlying genetic architecture; that is, GBLUP almost consistently provides accurate predictions and is generally hard to beat (a small numerical sketch of this setting is given below).

Technological progress has confronted animal breeding and genetics with a situation in which information is available on scales spanning several orders of magnitude. The change of paradigm in the so-called ‘genomic revolution’ was caused to a large extent by a change of the operative genomic scale, paralleled by the concomitant introduction of novel statistical concepts for mapping and prediction. The interdependency of scales, methods and underlying genetics is complex and not yet fully understood. But clearly, scale does matter. Choosing methods appropriate for the given scale can improve our understanding of the genetic mechanisms underlying the relevant traits and lead to applications that allow populations to be changed more efficiently in the desired direction.
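As the sketch referred to above, the following Python lines build a genomic relationship matrix following VanRaden (2008, method 1) and carry out a GBLUP prediction. Again, everything is simulated; the sample sizes, heritability and QTL proportion are assumptions chosen purely for illustration.

    # Toy sketch: GBLUP with VanRaden's (2008, method 1) genomic relationship matrix.
    # All data are simulated; sample sizes, heritability and QTL proportion are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(7)
    n_train, n_pred, p = 500, 100, 10000
    n = n_train + n_pred

    freq = rng.uniform(0.05, 0.5, p)                      # allele frequencies
    M = rng.binomial(2, freq, (n, p)).astype(float)       # genotypes coded 0/1/2

    # VanRaden (2008), method 1: G = ZZ' / (2 * sum(freq * (1 - freq))), with Z = M - 2 * freq
    Z = M - 2.0 * freq
    G = Z @ Z.T / (2.0 * np.sum(freq * (1.0 - freq)))

    # Simulate genetic values (about 1 per cent of the markers act as QTLs) and phenotypes with h2 = 0.4
    beta = rng.normal(0.0, 1.0, p) * (rng.random(p) < 0.01)
    g = Z @ beta
    h2 = 0.4
    y = g + rng.normal(0.0, np.sqrt(g.var() * (1 - h2) / h2), n)

    # GBLUP: phenotyped training individuals are used to predict genomic values of unphenotyped individuals
    t, v = np.arange(n_train), np.arange(n_train, n)
    lam = (1 - h2) / h2                                   # sigma_e^2 / sigma_g^2 (assumed known here)
    sol = np.linalg.solve(G[np.ix_(t, t)] + lam * np.eye(n_train), y[t] - y[t].mean())
    g_hat = G[np.ix_(v, t)] @ sol                         # BLUP of genomic values in the prediction set

    print('accuracy cor(g, g_hat):', round(float(np.corrcoef(g[v], g_hat)[0, 1]), 3))

Because the simulated individuals are unrelated, the accuracy obtained here will be modest; the sketch only illustrates the mechanics, namely that a denser marker panel lets G capture realized relationships and the Mendelian sampling term more precisely, which is exactly the argument made above.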

The authors acknowledge the support of the Deutsche Forschungsgemeinschaft (DFG) for the research training group ‘Scaling Problems in Statistics’ (RTG 1644).

H. Simianer and M. Erbe
Animal Breeding and Genetics Group, Georg-August-University Göttingen, Göttingen, Germany
E-mail: [email protected]
