REVIEW

doi: 10.1111/age.12295

Software solutions for the livestock genomics SNP array revolution

E. L. Nicolazzi*, S. Biffani†, F. Biscarini*, P. Orozco ter Wengel‡, A. Caprera*, N. Nazzicari* and A. Stella*,†

*Fondazione Parco Tecnologico Padano (PTP), Via Einstein, Cascina Codazza, Lodi 26900, Italy.
†Istituto di biologia e biotecnologia Agraria (IBBA-CNR), Consiglio Nazionale delle Ricerche, Via Einstein, Cascina Codazza, Lodi 26900, Italy.
‡School of Biosciences, Cardiff University, Museum Avenue, Cardiff CF10 3AX, UK.

Summary

Since the beginning of the genomic era, the number of available single nucleotide polymorphism (SNP) arrays has grown considerably. In the bovine species alone, 11 SNP chips not completely covered by intellectual property are currently available, and the number is growing. Genomic/genotype data are not standardized, and this hampers their exchange and integration. In addition, software used for the analysis of these data usually requires non-standard (i.e. case-specific) input files which, considering the large amount of data to be handled, require at least some programming skills to produce. In this work, we describe a software toolkit for SNP array data management, imputation, genome-wide association studies, population genetics and genomic selection. However, this toolkit does not solve the critical need for standardization of the genotypic data and software input files. It only highlights the chaotic situation each researcher has to face on a daily basis and gives some helpful advice on the currently available tools in order to navigate the SNP array data complexity.

Keywords genomic selection, genome-wide association studies, imputation, livestock species, management, population genetics, single nucleotide polymorphism

Address for correspondence: E. L. Nicolazzi, Bioinformatics and Statistical Genomics Group, Fondazione Parco Tecnologico Padano (PTP), via Einstein, Cascina Codazza, Lodi 26900, Italy. E-mail: [email protected]

Accepted for publication 7 March 2015

In the last two decades, many efforts were made to study the genetic architecture of livestock traits, as reviewed by Weller (2001), and to apply this knowledge to breeding (Nejati-Javaremi et al. 1997). Following the lessons learnt from a large number of previous studies, Meuwissen et al. (2001) developed an application for a yet unavailable technology to select individuals by obtaining accurate (genome-based) estimated breeding values (EBVs) at birth. Years later, several livestock genomes were published, starting with that of the chicken (International Chicken Genome Sequencing Consortium 2004) and later on the bovine (Bovine Genome Sequencing & Analysis Consortium 2009), pig (Groenen et al. 2012), goat (Dong et al. 2013) and sheep genome (Jiang et al. 2014), among others. In 2008, the first commercial high-density genome-wide SNP array became available in livestock (Illumina BovineSNP50 BeadChip; Matukumalli et al. 2009), making genome-based selection feasible. The success of the bovine SNP chip boosted interest in applying this technology to other livestock species, and commercial SNP arrays were produced for sheep (Kijas et al. 2009), pig (Ramos et al. 2009), horse (McCue et al. 2012), chicken (Kranis et al. 2013), goat (Tosser-Klopp et al. 2014), trout (Palti et al. 2014), salmon (Houston et al. 2014) and other species. With the exception of chicken and salmon, whose very first commercial SNP chips consisted of ~600 000 and ~6000 (Lien et al. 2011) SNPs respectively, the number of SNPs present in the first generations of the SNP arrays ranged from 50 000 to 60 000.

Further developments in SNP chip technology and market/research requirements drove the development of SNP arrays for several species with different marker densities. For cattle alone, there are currently 11 commercial SNP chips produced by three major companies (Illumina, Neogen-GeneSeek and Affymetrix), using two different genotyping technologies (provided by Illumina and Affymetrix). In addition, there is a constantly growing number of custom SNP chips protected by intellectual property (IP), meaning that they are not commercially available to third parties (or they require that access be granted prior to their use), developed by research consortia (e.g. EuroGenomics), private companies (e.g. Zoetis, Cobb-Vantress) or governmental research institutions (e.g. USDA). A list of currently available commercial SNP chips not completely protected by IP for the six major livestock species (cow, pig, horse, sheep, goat and chicken) is shown in Table 1.

© 2015 Stichting International Foundation for Animal Genetics, 46, 343–353


The increase in the number of SNP chips available, however, was not accompanied by an organized effort to standardize the genomic/genotype data, making comparison and cross-compatibility of SNP arrays difficult. Allele coding, SNP names and genomic coordinates (chromosome and base pair positions) are often difficult to retrieve and update, especially for 'old' SNP chips no longer commercially available. Currently, Affymetrix provides its genotypes as 0/1/2/–1, for homozygous AA, heterozygous, homozygous BB and missing call respectively. Using the Affymetrix array-specific annotation file, the user can recode these standardized allele calls to the forward strand of the genome assembly used as reference. On the other hand, Illumina provides three different allele coding conventions: FORWARD/REVERSE, TOP/BOT and A/B. Although both the Affymetrix genotype coding and the Illumina TOP/BOT and A/B allele coding will remain constant over time, FORWARD/REVERSE allele coding may change as a new, or updated, genome assembly is used as reference.

Standardization and consistency in SNP names is of great importance. However, except for Affymetrix, commercial companies do not provide 'reference SNP IDs' (RefSeq SNP or rs IDs) for their SNPs. This makes it difficult, if not impossible, for developers to link SNP chip information with the data stored in public databases. This situation hinders research on individuals genotyped with a single SNP chip and is a large obstacle for the integration of data obtained from several different chips. The need for integration and standardization of protocols becomes even more critical when genotypes produced by companies using different genotyping technologies need to be combined.

In addition to standardizing naming conventions, a standard for identifying the SNP position coordinates in the genome (i.e. chromosome and base pair positions) is essential. The coordinates must necessarily refer to a reference genome assembly (RGA) that, as discussed previously, is bound to be updated (or completely replaced) over time. The quality and 'stability' of RGAs differ among species. For some species, the first RGA has only just been released (goat; Dong et al. 2013), whereas others have undergone many revisions and improvements. The frequency with which the RGA changes depends on the quality of the sequence and the size of scientific investments in the species. In addition, more than one 'official' RGA may exist, as is the case with cattle: one from University of Maryland (UMD; Zimin et al. 2009) and a second from the International Bovine Genome Consortium (BTAU; Liu et al. 2009).

The first attempt to tackle these difficulties was an online tool called SNAT (Jiang et al. 2011), currently offline, which considered only a few bovine SNP chips and was focused on annotation rather than on integration of SNP chip data. A few years later, Nicolazzi et al. (2014) published the SNPchiMp v.1 web tool, the first tool that focused specifically on integration and standardization of bovine data. Since its first publication, this tool has almost doubled the number of handled bovine SNP chips and has been extended to all six major livestock species. The tool addresses the above issues: it includes all commercial SNP chips and provides a first attempt at standardization, integration and full disclosure of the data related to the SNP chips (including the annotation of SNPs to the different RGAs available). Moreover, it allows the retrieval of annotated genes within a user-defined range up- and downstream of a list of SNPs in nearly all species using the Ensembl BioMart data mining tool functionalities (Kinsella et al. 2011). Standardized, integrated (across SNP chips and with other available resources), user-friendly and fully disclosed information, such as in this web application, is now highly important and will become fundamental when large-scale whole-genome sequence (WGS) data become available.
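The Affymetrix numeric coding described above (0/1/2/–1) can be translated into allele pairs once the A and B alleles of each SNP are known. The following is a minimal Python sketch, assuming a hypothetical in-memory annotation dictionary; the SNP names, allele pairs and the placeholder used for missing calls are invented for illustration, and a real array annotation file has its own layout.

```python
# Hypothetical annotation: SNP name -> (allele A, allele B) on the forward strand.
# These entries are illustrative only and do not come from a real annotation file.
ANNOTATION = {"snp_1": ("A", "G"), "snp_2": ("C", "T")}

MISSING = ("0", "0")  # placeholder pair for a missing call (an assumption)


def decode_call(snp_name, call, annotation=ANNOTATION):
    """Translate a 0/1/2/-1 genotype call into a pair of alleles."""
    allele_a, allele_b = annotation[snp_name]
    if call == 0:      # homozygous AA
        return (allele_a, allele_a)
    if call == 1:      # heterozygous AB
        return (allele_a, allele_b)
    if call == 2:      # homozygous BB
        return (allele_b, allele_b)
    if call == -1:     # missing call
        return MISSING
    raise ValueError(f"unexpected call {call!r} for {snp_name}")
```

Keeping the annotation separate from the decoding logic means that the same calls can be re-decoded if the annotation changes with a new genome assembly.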

Table 1 Currently available commercial SNP chips (SNPs not protected by IP) in the six major livestock species (number of SNPs in parentheses).

Cow: 3k[a,1] (2900); LD v.1[a,1] (6909); LD v1.1 (6912); GGPLD v1[b,1] (8610); GGPLD v2[b] (19 721); GGPLD v3[b] (26 151); SNP50 v.1[a,1] (54 001); SNP50 v.2[a] (54 609); GGPHD[b] (76 879); AxiomBos1[c] (648 875); HD[a] (777 962)
Pig: GGPLD v.1[b] (10 241); SNP60 v.1[a,1] (62 163); SNP60 v.2[a] (61 565); SNP80[b] (68 528)
Horse: SNP50 v.1[a,1] (54 602); SNP70[b] (65 157)
Sheep: SNP50 v.1[a] (54 241); HD[a,2] (606 006)
Goat: SNP50 v.1[a,3] (53 347)
Chicken: Axiom Chicken[c] (580 961)

[a] Produced by Illumina Inc.
[b] Produced by Neogen-GeneSeek.
[c] Produced by Affymetrix Inc.
[1] Out of production.
[2] Custom chip developed by the International Sheep Genome Consortium in collaboration with FarmIQ.
[3] Custom chip developed by the International Goat Genome Consortium.


Software for SNP array data in livestock

Besides the difficulty of data integration and standardization, SNP chip data analysis is usually hampered by the multiple input/output data formats used by the growing number of available software packages. Although it is impossible to review all the software available for SNP analysis, the following sections review some of the software currently available for SNP array data management, imputation, genome-wide association studies (GWAS), population genetics and genomic selection.

SNP array data management

Several SNP array data management tools have been developed, although many were originally conceived not for this purpose but for data analysis. For example, PLINK (Purcell et al. 2007) was originally developed for SNP array data analysis in a genome-wide context where tens of thousands of markers were genotyped for multiple individuals. However, due to its speed and stability, PLINK has quickly become one of the standards for data management. PLINK allows, among other things, recoding of data sets into various formats, flipping DNA strands and quickly extracting sets of individuals and/or SNPs of interest. The current stable version of PLINK is 1.07. However, a new beta version 1.90 is available, which includes improvements in speed and memory usage and implements new features, such as the possibility of calculating genomic relationship matrices (https://www.cog-genomics.org/plink2). Recently, the open-source and freely available R statistical analysis software (R Core Team 2014) has also become a powerful resource for data management. Although specific R libraries have been developed for (Illumina) SNP array data storage and management (snpQC; Gondro et al. 2013a), in general, R libraries developed for data analysis can also be used for data management. Managing SNPs within homemade relational databases can be a good solution when data sets are small, whereas when complexity and dimensionality increase, specifically designed third-party software is advisable. Commercial solutions are often the most complete; among them, JMP GENOMICS (http://www.jmp.com/software/genomics/), Progeny lab (http://www.progenygenetics.com/) and BCPlatforms (http://bcplatforms.com/solutions) are able to handle several data formats, offering functionalities that often surpass those of mere data management.
Nevertheless, there are also open-access alternatives, which usually require at least basic knowledge of specific programming languages, for example SNPpy (Mitha et al. 2011), the genome browser GBrowse (Donlin 2009) or the Chado database (Mungall et al. 2007). Relatively recently, the Broad Institute developed a visualization tool named INTEGRATIVE GENOMICS VIEWER (IGV) as a user-friendly approach for displaying array-based and next-generation sequencing data (Robinson et al. 2011).
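As an illustration of the kind of data management task mentioned above (extracting a set of SNPs of interest), the following is a minimal Python sketch that works on PLINK-style text files. It assumes the usual .map layout (chromosome, SNP ID, genetic distance, base pair position) and .ped layout (six leading columns followed by two alleles per SNP); the function names and the toy records are hypothetical, and PLINK itself remains the more robust choice for real data.

```python
def read_map(map_lines):
    """Parse .map lines: chromosome, SNP ID, genetic distance, bp position."""
    snps = []
    for line in map_lines:
        chrom, snp_id, _cm, pos = line.split()
        snps.append((chrom, snp_id, int(pos)))
    return snps


def extract_snps(ped_lines, snps, wanted):
    """Yield .ped records reduced to the SNPs whose ID is in `wanted`."""
    keep = [i for i, (_c, snp_id, _p) in enumerate(snps) if snp_id in wanted]
    for line in ped_lines:
        fields = line.split()
        head, alleles = fields[:6], fields[6:]
        if len(alleles) != 2 * len(snps):
            raise ValueError("allele count does not match the .map file")
        kept = []
        for i in keep:
            kept.extend(alleles[2 * i: 2 * i + 2])  # both alleles of SNP i
        yield head + kept


# Toy data (invented for illustration).
map_lines = ["1 snp_a 0 1000", "1 snp_b 0 2000", "2 snp_c 0 500"]
ped_lines = ["fam1 id1 0 0 1 -9 A A A G C C",
             "fam1 id2 0 0 2 -9 A G G G 0 0"]
snps = read_map(map_lines)
subset = list(extract_snps(ped_lines, snps, {"snp_a", "snp_c"}))
```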

As for SNP array data format conversion, the speed at which software for specific analyses is published makes it impossible to build a comprehensive tool able to cope with all possible input/output formats. A few software packages cover the most common formats, for example PGDSPIDER (Lischer & Excoffier 2012) and FCGENE (Roshyara & Scholz 2014). PGDSPIDER is a conversion tool focused on population genetics and genomics and is able to handle many types of genotypic data (from restriction/amplified fragment length polymorphisms to whole-genome sequence data). It can handle a wide variety of formats (http://www.cmpg.unibe.ch/software/PGDSpider/#Input_and_output_formats) and can be run with the native Java graphical user interface or through a command line. However, the interface is poorly informative for debugging (e.g. when something is wrong in the input file provided) and, because it stores the whole input file in memory, it is not able to handle large data sets. On the other hand, FCGENE is more focused on imputation and GWAS. It was written specifically for human population analyses, but it is able to transform PLINK format files into the input files for a number of software packages also commonly used in animal genetics. Although the number of software packages handled is lower than in PGDSPIDER, FCGENE allows streamlining the quality control and imputation steps of any analysis without the need for programming. In any case, especially for young scientists, the use of Linux/Unix-based operating systems together with command-line bash scripting is strongly advised. This enables a series of functionalities for data management usually not available in Windows-based systems.
Furthermore, a good practice would be learning (even basic) programming in high-level languages, such as Python (https://www.python.org) or Ruby (https://www.ruby-lang.org; Aerts & Law 2009), in order to be able to manipulate data and, eventually, design SNP management procedures. There are many free courses for beginner and advanced users available online (https://www.coursera.org; https://www.edx.org) as well as organizations and initiatives that promote programming and provide open course materials (Software Carpentry: http://software-carpentry.org/index.html; GOBLET training portal: Corpas et al. 2015; Animal Breeding and Genetics Hub: http://www.netvibes.com/abg-hub#General).
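To give an idea of what basic programming can achieve for format conversion, the sketch below converts PLINK-style .ped records into counts of a chosen allele (0/1/2, with -1 for missing), one record at a time. Processing records lazily avoids holding the whole input in memory, which is the limitation noted above for large data sets. The function names, the choice of counted alleles and the toy records are assumptions made for illustration.

```python
def count_allele(pair, counted):
    """Return the number of copies of `counted` in an allele pair, or -1 if missing."""
    if "0" in pair:  # "0" is used here as the missing-allele code
        return -1
    return sum(1 for allele in pair if allele == counted)


def ped_to_counts(ped_lines, counted_alleles):
    """Lazily convert .ped-style records to (individual ID, list of counts)."""
    for line in ped_lines:
        fields = line.split()
        alleles = fields[6:]  # skip the six leading pedigree/phenotype columns
        pairs = zip(alleles[0::2], alleles[1::2])
        counts = [count_allele(p, c) for p, c in zip(pairs, counted_alleles)]
        yield fields[1], counts


# Toy records (invented for illustration); any iterable of lines works,
# including an open file handle, so the input is never loaded in full.
records = ["fam1 id1 0 0 1 -9 A A A G C C",
           "fam1 id2 0 0 2 -9 A G G G 0 0"]
converted = list(ped_to_counts(records, ["A", "G", "C"]))
```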

Imputation

Imputing missing alleles and genotypes is a preliminary step for a wide range of genetic analyses, because most models and software in genetics and genomics cannot handle missing data. Imputation is needed to fill in the blanks left by sporadic missing genotypes in SNP array genotyping data or to impute larger amounts of missing genotypes in animals genotyped with SNP chips at different densities. Imputation is a highly effective technique that can reach an accuracy close to 99% (Weigel et al. 2010; Zhang & Druet 2010). Imputation methods can be divided essentially into two groups: those based solely on linkage disequilibrium and allele frequencies, and those that also include pedigree information. BEAGLE (Browning & Browning 2007), MACH (Li et al. 2010) and IMPUTE2 (Howie et al. 2009) are popular choices for imputation of sporadic missing genotypes. They are designed mainly for human genetics and are pedigree-free tools. They can accommodate simple population structures (e.g. trio data) but not complete pedigree information. In general, they perform well in structured populations even when population structure is not explicitly considered (Johnston et al. 2011; Nicolazzi et al. 2013). On the contrary, FINDHAP (VanRaden et al. 2011), PEDIMPUTE (Nicolazzi et al. 2013), FIMPUTE (Sargolzaei et al. 2014) and ALPHAPHASE (Hickey et al. 2011) make use of pedigree information. These four tools were designed for animal breeding and genetics, with special regard to cattle populations. In terms of computation speed, deterministic methods (e.g. FINDHAP) tend to be faster than numerical methods (Johnston et al. 2011; Sargolzaei et al. 2014). The accuracy of imputation depends on a series of aspects, such as the imputation method used, the species considered, the density and type of SNP arrays considered, the number of samples available and the genetic structure and history (e.g. linkage disequilibrium level) of the population, among others. Recent evidence suggests that, in cattle, the RGA has only a marginal impact on imputation accuracy, irrespective of the method used (Milanesi et al. 2015). Each of these tools has its own input format and, although the FCGENE software converts PLINK format into MACH, IMPUTE or BEAGLE formats, a conversion tool that also includes livestock-specific tools could be of great use.
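Imputation accuracy figures such as the one quoted above are typically obtained by masking known genotypes, imputing them and comparing the result with the truth. The sketch below computes a simple concordance rate in Python. Concordance is only one of several accuracy measures used in the literature, and the function name, genotype coding (0/1/2) and toy values are assumptions for illustration.

```python
def imputation_concordance(true_genotypes, imputed_genotypes, masked):
    """Fraction of masked genotypes that were imputed correctly.

    `masked` lists the positions whose true genotype was hidden before imputation.
    """
    if not masked:
        raise ValueError("no masked positions to evaluate")
    correct = sum(1 for i in masked if true_genotypes[i] == imputed_genotypes[i])
    return correct / len(masked)


# Toy example (invented values): genotypes coded as 0/1/2 allele counts.
true_g = [0, 1, 2, 1, 0, 2]
imputed_g = [0, 1, 2, 2, 0, 2]
masked = [1, 3, 4, 5]  # only these positions were hidden and imputed
accuracy = imputation_concordance(true_g, imputed_g, masked)
```

Restricting the comparison to masked positions matters: including genotypes that were never missing would inflate the apparent accuracy.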

Genome-wide association studies

The implementation of high-density microarray technologies for SNP detection has turned GWAS into the gold standard method for the identification of loci underlying complex diseases, both in humans and non-humans (Bush & Moore 2012). A simplified scheme of a GWA study may include the following steps: (i) genotype calling from the raw chip data and basic quality control; (ii) methods to detect and correct for population stratification (e.g. principal component analysis); (iii) genotype imputation; (iv) testing for association between a single SNP and continuous or categorical phenotypes; (v) global significance analysis and correction for multiple testing; (vi) data presentation (e.g. Manhattan plots); and/or (vii) cross-replication and meta-analysis, integrating association data from multiple studies. All these tasks can be executed by ad hoc computer programs, which in the majority of cases are open source and multiplatform (Gondro et al. 2013b). The generalized GWA procedure above does not account for a number of key factors, such as the choice of the statistical model, which has a direct impact on the successive analyses. For instance, a Bayesian statistical model does not yield a statistical significance (as intended by classical frequentist statistics); thus, downstream software (and analyses) need to be adapted to account for this.

Over the last 10 years, a plethora of software has become available to researchers (Table 2), and PLINK has become the most widely used software for GWAS. However, PLINK is not readily suited for all livestock species (e.g. GWAS options cannot effectively account for population stratification or the particular chromosome number if the species of choice is not available). The GenABEL R library fills this gap (Aulchenko et al. 2007). GenABEL implements effective GWAS data storage, handling and accurate quality control and has functions for estimating kinship from a dense marker panel. Kinship estimates can then be used in linear mixed models to account for related individuals in the analysis of a quantitative or binary trait, although multiple random effects cannot be accommodated. Additionally, GenABEL has many features for analysis and visualization of GWAS data. GENOME-WIDE COMPLEX TRAIT ANALYSIS (GCTA; Yang et al. 2011) is another user-friendly and open-source GWAS software. Originally designed to estimate the proportion of phenotypic variance explained by SNPs for complex traits, it now also includes several options to analyse GWAS results or perform association analysis based on a mixed linear model (option --mlma; Yang et al. 2014a, b). Additionally, it allows for the input and management of data from other software such as PLINK or MACH (Li et al. 2010). Another useful open-source software for GWAS is GEMMA, which implements the Genome-wide Efficient Mixed Model Association algorithm developed by Zhou & Stephens (2012).
GEMMA fits linear, multivariate and Bayesian linear mixed models for estimating the proportion of variance in phenotypes explained by SNPs and for testing marker associations with multiple phenotypes simultaneously, while controlling for population stratification and estimating genetic correlations. Because GEMMA accepts PLINK input files, it can be easily integrated into a pipeline, reducing the amount of programming skills needed. A slightly different approach that is worth mentioning is the PREGSF90–POSTGSF90 (Aguilar et al. 2014) programs, which are interfaces that can be used to pre- and post-process genomic information obtained using the BLUPF90 family of programs (Misztal et al. 2002; http://nce.ads.uga.edu/). BLUPF90 is a collection of Fortran 90/95 software for mixed model computations in animal breeding. PREGSF90 can be used to pre-process genotype data before the implementation of the single-step methodology (Legarra et al. 2009; Misztal et al. 2009), whereas POSTGSF90 calculates SNP effects, as described in Wang et al. (2012a, b). Plots of SNP effects or variances explained can be produced using GNUPLOT, R plot or specific libraries (RegionPlot: Zhang et al. 2015), or SNPEVG (Wang et al. 2012a, b).
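Steps (v) and (vi) of the simplified GWA scheme above can be illustrated with a short Python sketch. It applies a Bonferroni correction, used here only as one simple and conservative option among several multiple-testing procedures, and returns the -log10(p) values that would be plotted on the vertical axis of a Manhattan plot. The SNP names and p-values are invented for illustration; in practice the p-values would come from one of the GWAS packages discussed above.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def significant_snps(p_values, alpha=0.05):
    """Return {SNP: -log10(p)} for SNPs passing the Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return {snp: -math.log10(p) for snp, p in p_values.items() if p < threshold}


# Toy p-values (invented for illustration).
p_values = {"snp_a": 1e-9, "snp_b": 0.03, "snp_c": 2e-3, "snp_d": 0.5}
hits = significant_snps(p_values)
```

With tens or hundreds of thousands of correlated SNPs, Bonferroni tends to be overly strict, which is one reason why the choice of the significance procedure deserves attention.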


Table 2 Characteristics of the software for SNP management and analysis cited.

Name | OS[1] | License[2] | Input | Link

SNP management[3]
PLINK 1.07 | W/L/M | F | Own | http://pngu.mgh.harvard.edu/~purcell/plink/
PLINK 1.90 | W/L/M | F | Own | https://www.cog-genomics.org/plink2
SNPQC | W/L/M | OS | Illumina raw | http://www-personal.une.edu.au/~cgondro2/snpQC.htm
JMP GENOMICS | W/L/M | L | Many | http://www.jmp.com/software/genomics/
GOLDEN HELIX SNP & VARIATION SUITE | W/L/M | L | Many | http://www.goldenhelix.com/SNP_Variation/index.html
PROGENY LAB | W/L | L | Many | http://www.progenygenetics.com/
BCPLATFORMS | L | L | Many | http://bcplatforms.com/solutions
SNPPY | L | OS | Affymetrix raw, Illumina raw | https://bitbucket.org/faheem/snppy
GBROWSE | L/M | OS | GFF3 | http://gmod.org/wiki/GBrowse
IGV | W/L/M | OS | Many | http://www.broadinstitute.org/igv/
PGDSPIDER | W/L/M | OS | Many | http://www.cmpg.unibe.ch/software/PGDSpider/
FCGENE | W/L | OS | Many | http://sourceforge.net/projects/fcgene/

Imputation
FIMPUTE | L/M | F | Own | http://www.aps.uoguelph.ca/~msargol/fimpute/
BEAGLE | L/W/M | F | Own, PLINK, VCF | http://faculty.washington.edu/browning/beagle/beagle.html
IMPUTE2 | L/W/M | F | Own | https://mathgen.stats.ox.ac.uk/impute/impute_v2.html
MACH | L/W/M | F | Own (~PLINK) | http://www.sph.umich.edu/csg/abecasis/MACH/tour/imputation.html
PEDIMPUTE | L/M | OS | Own | http://dekoppel.eu/pedimpute/
ALPHAPHASE | W/L/M | F | Own | https://sites.google.com/site/hickeyjohn/alphaphase
FINDHAP | L | OS | Own | http://aipl.arsusda.gov/software/findhap/

GWAS
GENABEL (R)[4] | W/L/M | OS | MACH, PLINK, Illumina raw, Affymetrix raw | http://www.genabel.org/manuals/GenABEL
GCTA | L | OS | PLINK, MACH | http://ctgg.qbi.uq.edu.au/software/gcta/index.html
GEMMA | W/L | OS | PLINK, BIMBAM | http://home.uchicago.edu/xz7/software/GEMMAmanual.pdf
SSGBLUP | W/L/M | OS | Own | http://nce.ads.uga.edu/wiki/doku.php?id=readme.pregsf90

Population genomics and signatures of selection
SWEED | L | OS | VCF | http://sco.h-its.org/exelixis/web/software/sweed/index.html
ARLEQUIN | W/L/M | F | Own | http://cmpg.unibe.ch/software/arlequin35/
SELSCAN | W/L/M | F | Own, ~PLINK | https://github.com/szpiech/selscan
VCFTOOLS | L/M | F | VCF | http://vcftools.sourceforge.net/
BAYESCAN | L/M | F | Own | http://cmpg.unibe.ch/software/BayeScan/index.html
ADMIXTURE | L/M | F | PLINK | https://www.genetics.ucla.edu/software/admixture/
FASTSTRUCTURE | W/L/M | F | Own, PLINK | http://pritchardlab.stanford.edu/structure.html
BAPS | W/L/M | F | Own, GENEPOP | http://www.helsinki.fi/bsg/software/BAPS/
LFMM | W/L/M | F | Own | http://membres-timc.imag.fr/Olivier.Francois/lfmm/index.htm
MATSAM | W | F | Own | http://www.econogene.eu/software/sam/
DIY-ABC | W/L/M | F | Own | http://www1.montpellier.inra.fr/CBGP/diyabc/
POPABC | W/M/L | F | Own | https://code.google.com/p/popabc/
ABCTOOLBOX | W/L | F | - | http://www.cmpg.iee.unibe.ch/content/softwares__services/computer_programs/abctoolbox/index_eng.html

Genomic predictions
GS3 | W/L/M | OS/LA | Own | http://snp.toulouse.inra.fr/~alegarra/
ASREML | W/L/M | L | Own | http://www.vsni.co.uk/software/asreml
GENSEL | L/M | L | Own | –
BGLR (R)[4] | W/L/M | OS | Own, PLINK | http://www.soph.uab.edu/ssg/software/bglr
RRBLUP (R)[4] | W/L/M | OS | Own | http://cran.r-project.org/web/packages/rrBLUP/
BLUPF90 SUITE | W/L/M | F/LA | Own | http://nce.ads.uga.edu/wiki/doku.php

[1] W, Windows; L, Linux; M, MacOS.
[2] OS, open source (which, in this list, implies that source code is available and it is free); F, free for all users (but source code not provided); LA, licensed but free for academia/non-profit users; L, licensed (requires a payment by all users).
[3] Most SNP management software presented here can be considered to be in multiple categories (e.g. PLINK, JMP GENOMICS, Golden Helix SNP & Variation Suite, all perform multiple analyses).
[4] R package.


Population genomics and signature of selection

Many of the tools used to manage SNP array data in the context of population genetic analyses are the same ones used for SNP array data management and GWAS analyses, for example PLINK and Golden Helix SNP & Variation suite (http://www.goldenhelix.com/SNP_Variation/). These packages are able to perform multiple population genomics and signature of selection analyses, including frequency-based analyses (interestingly, PLINK does not include capabilities to perform Fst calculations), runs of homozygosity and haplotype-based analyses. However, beyond these tools, population genetics is also populated by a large array of software tools, most of which are open source and freely available (see Table 2 for a selection of the most common tools). Lenstra et al. (2012) already reviewed a few of the tools available for these analyses (although focusing more on characterization of genetic diversity). Briefly, standard population genetics analyses start by identifying patterns of population structure. For that purpose, various approaches have been developed that attempt, in a Bayesian framework, to identify individuals that can be grouped into sets of samples with similar genetic background (e.g. minimizing linkage disequilibrium and maximizing Hardy–Weinberg equilibrium). The most common software tools traditionally used are STRUCTURE (Falush et al. 2003) and BAPS (Corander & Marttinen 2006). However, due to the large size of current data files (i.e. tens if not hundreds of samples for many thousands of SNPs), alternative less computer-intensive approaches have been developed, for example ADMIXTURE (Alexander et al. 2009) and FASTSTRUCTURE (Raj et al. 2014), the improved version of STRUCTURE. The usual population genetics analysis to detect signatures of selection uses groups of individuals arranged in single populations (each analysed separately) or in groups of populations (for pairwise comparisons).
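Runs of homozygosity, mentioned above among the analyses these packages perform, can be illustrated with a deliberately simplified Python sketch. It scans one individual's genotypes (coded 0/1/2, with -1 for missing) along a chromosome and reports stretches of consecutive homozygous SNPs of at least a minimum length. Real implementations typically use sliding windows, tolerate a few heterozygous or missing calls and impose a minimum physical length; none of that is modelled here, and the function name and thresholds are assumptions.

```python
def runs_of_homozygosity(genotypes, min_snps):
    """Return (start, end) index pairs of homozygous runs with >= min_snps SNPs.

    Heterozygous (1) and missing (-1) calls both end a run in this sketch.
    """
    runs, start = [], None
    for i, g in enumerate(genotypes):
        if g in (0, 2):               # homozygous call extends or opens a run
            if start is None:
                start = i
        else:                         # any other call closes the current run
            if start is not None and i - start >= min_snps:
                runs.append((start, i - 1))
            start = None
    if start is not None and len(genotypes) - start >= min_snps:
        runs.append((start, len(genotypes) - 1))
    return runs


# Toy genotypes for one individual (invented for illustration).
genotypes = [0, 2, 2, 1, 0, 0, 0, 0, -1, 2, 2]
roh = runs_of_homozygosity(genotypes, min_snps=3)
```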
Several software tools are available for the different types of analyses, as each of them captures different types of information. For example, ARLEQUIN (Excoffier & Lischer 2010) implements the standard Fst/Heterozygosity test that finds candidate SNPs under selection as outliers with respect to neutral coalescent simulations. However, as with most population genetics software tools, ARLEQUIN requires its own data format, and the final display of the results (if required) is performed via third-party tools such as R. Although more linked to management of the variant call format from WGS (i.e. including SNPs and other variant types such as insertions/deletions), VCFTOOLS (Danecek et al. 2011) allows a wide array of data analysis options (e.g. data filtering and data format conversion). VCFTOOLS estimates Fst between populations and Tajima's D (Tajima 1989) which, when measured in windows along the genome, can be used to identify regions with unusual patterns of differentiation or diversity. During the last decade, various approaches have been developed which make use of the distribution of linkage disequilibrium along the genome to identify haplotypes under selection. Such approaches are related to the EHH (extended haplotype homozygosity) statistic, developed by Sabeti et al. (2002). These types of approaches rely on the principle that during a selective sweep, the site under selection drags along neighbouring alleles to which it is linked, while it rises in frequency. Other such approaches are the integrated haplotype score (iHS; Voight et al. 2006) and the cross-population EHH (Szpiech & Hernandez 2014), which consider the pattern of linkage disequilibrium around sites under selection in the ancestral and derived alleles, or between two populations, respectively. The last three approaches are included in the recently published SELSCAN software suite (Szpiech & Hernandez 2014).

Because demographic changes can leave signatures that mimic those of natural selection, alternative approaches have been developed to avoid this demographic pitfall. For example, MATSAM (Joost et al. 2007), SAMBADA (Stucki et al. 2014) and LFMM (Frichot et al. 2013) use statistical approaches that identify the relationship between allelic variants in SNPs distributed across the genome in hundreds of individuals that are spread across a geographic area of interest. By taking into account the geographic positioning of the individuals and attempting to relate their genetic variation to environmental variables measured at the sampling site, it is possible to identify genetic variants under selection as those that show particularly strong associations with environmental variables. Alternatively, it is possible to improve the traditional population genetic approaches (e.g. Fst/H, BAYESCAN) by first estimating the demographic history of the populations of interest and then using that information to simulate the null distribution of genetic variation in the population against which the observed data are tested. That can be performed with approximate Bayesian computation approaches such as DIY-ABC (Cornuet et al. 2014), POPABC (Lopes et al. 2009) or ABCTOOLBOX (Wegmann et al. 2010).
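The window-based use of Fst described above can be sketched in a few lines of Python. The example uses the textbook definition Fst = (Ht - Hs) / Ht for a biallelic SNP in two equally weighted populations, which is a simplification: the tools cited here implement more refined estimators (e.g. with sample-size corrections), and averaging per-SNP values within a window is only one of several possible ways of summarizing a region. Function names, allele frequencies and the window size are assumptions for illustration.

```python
def wright_fst(p1, p2):
    """Fst for one biallelic SNP from the allele frequencies of two populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)           # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0                              # monomorphic overall; convention adopted here
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


def windowed_fst(freqs_pop1, freqs_pop2, window):
    """Mean per-SNP Fst in consecutive, non-overlapping windows of `window` SNPs."""
    values = [wright_fst(a, b) for a, b in zip(freqs_pop1, freqs_pop2)]
    means = []
    for i in range(0, len(values), window):
        chunk = values[i:i + window]
        means.append(sum(chunk) / len(chunk))
    return means


# Toy allele frequencies along a chromosome (invented for illustration).
pop1 = [0.5, 1.0, 0.2, 0.5]
pop2 = [0.5, 0.0, 0.8, 0.5]
per_window = windowed_fst(pop1, pop2, window=2)
```

Windows with unusually high values relative to the rest of the genome would then be the candidates for closer inspection.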

Genomic selection

After its initial visionary conception (Nejati-Javaremi et al. 1997; Meuwissen et al. 2001), the idea of genomic prediction and selection has found many applications in recent years. This was motivated both by the huge amount of data produced by high-throughput genotyping and sequencing technology and by advancements in bioinformatics capacity and processing. Currently, a large array of methods and tools to implement the concepts of genomic prediction and selection are available. Among the most popular methods used in animal breeding are GBLUP (genomic best linear unbiased predictions) and the long list of Bayesian approaches falling in the so-called Bayesian alphabet (Bayes A/B/C/Cπ/...; Gianola et al. 2009). In principle, GBLUP can be implemented in any software for statistical analysis, provided that the matrix of genomic relationship is used as a covariance matrix. A few examples of such software are in the restricted maximum likelihood (REML) family, such as DMU (Madsen et al. 2006), WOMBAT (Meyer 2007) and ASREML (Gilmour et al. 2009), which usually give ample flexibility as to fitting models. The already-cited BLUPF90 family of programs comes from the quantitative genetics era, being widely used to fit mixed models for the estimation of both variance components and breeding values (EBVs). Although the use of this software requires advanced statistical genetics knowledge and non-trivial preparation of data and parameter files, it allows for the fitting of a large number of models. In fact, it has been recently extended to fit GBLUP with the single-step approach (Misztal et al. 2009). GS3 is an open-source software for genomic predictions from GBLUP, Bayes Cπ and Bayesian Lasso models (http://snp.toulouse.inra.fr/~alegarra/). It supports a fairly general model that can take into account additive, dominance and polygenic infinitesimal effects in addition to environmental permanent effects. A drawback is the use of custom input/output data formats. The same drawback, plus the inability to handle missing genotypes, is present in the GENSEL software (Fernando & Garrick 2013), a suite for the estimation of genomic EBVs from Bayes A, Bayes B, Bayes C and Bayes Cπ models through Markov chain Monte Carlo implementation. GENSEL is run from the command line either interactively or in batch mode, and it is computationally efficient. R packages are also available for the estimation of genomic EBVs. BGLR (Bayesian generalized linear regression; Perez & de los Campos 2014) implements Gibbs sampling to solve a range of Bayesian regression models.
It offers a straightforward way of handling the p≫n problem through the choice of the regularization parameter to run GBLUP; Bayes A, B and C; Bayesian Lasso; and Bayesian Ridge Regression (i.e. SNP BLUP with p≫n). It also allows the fit of semi-parametric regression models through Bayesian Reproducing Kernel Hilbert Spaces (RKHS) regressions. BGLR is suited for both continuous and categorical traits and also can handle censored data (e.g. survival analysis). Alternatively, rrBLUP (Endelman 2011) is an R library for genomic selection that implements a (very quick) ridge regression model or the standard GBLUP.
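
As a purely illustrative aside, the genomic relationship matrix that GBLUP uses as a covariance structure can be sketched in a few lines. The Python code below assumes genotypes coded as 0/1/2 copies of an arbitrarily chosen allele, no missing calls, and allele frequencies estimated from the same sample. The scaling by 2Σp(1−p) is one common construction (often attributed to VanRaden) and is not necessarily the one used by any of the programs cited above. Function and variable names are hypothetical.

```python
# Minimal sketch of a genomic relationship matrix (G) from 0/1/2 genotypes.
# Assumptions: no missing genotypes; frequencies estimated from this sample.
# All names are illustrative and do not belong to any cited program.


def allele_frequencies(genotypes):
    """Frequency of the counted allele at each SNP (one column per SNP)."""
    n_animals = len(genotypes)
    n_snps = len(genotypes[0])
    return [sum(row[j] for row in genotypes) / (2.0 * n_animals)
            for j in range(n_snps)]


def genomic_relationship_matrix(genotypes):
    """G = Z Z' / (2 * sum_j p_j (1 - p_j)), where Z = M - 2p (centred genotypes)."""
    freqs = allele_frequencies(genotypes)
    scale = 2.0 * sum(p * (1.0 - p) for p in freqs)
    if scale == 0.0:
        raise ValueError("all SNPs are monomorphic; G is undefined")
    centred = [[g - 2.0 * p for g, p in zip(row, freqs)] for row in genotypes]
    n = len(centred)
    return [[sum(a * b for a, b in zip(centred[i], centred[k])) / scale
             for k in range(n)] for i in range(n)]


# Toy data: 4 animals (rows) x 5 SNPs (columns).
toy_genotypes = [
    [0, 1, 2, 1, 0],
    [1, 1, 2, 0, 0],
    [2, 0, 1, 1, 1],
    [1, 2, 0, 2, 1],
]
G = genomic_relationship_matrix(toy_genotypes)
for row in G:
    print(" ".join(f"{value:7.3f}" for value in row))
```

In practice such a matrix would be handed to mixed-model software as a user-supplied covariance structure rather than analysed directly, and dedicated packages also deal with missing genotypes and far larger dimensions.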

Discussion

One of the main issues related to SNP array data management is that any suggested solution should simultaneously consider past, present and, possibly, future scenarios. In livestock (or plant) species, where genomic information is still not consolidated, conserving or integrating past data and knowledge and planning for future challenges are key aspects that need to be carefully addressed. SNP array information (e.g. multiple IDs, allele codings, positions in the different RGAs and flanking sequences) should be readily accessible and fully conserved. Currently, the SNPchiMp tool handles the storage, integration and easy access of SNP array information for six major livestock species, but others are not included (e.g. farmed fish).

Integrating and standardizing SNP array data is relatively simple, with a few caveats, when handling chips from just one producer (e.g. Illumina or Affymetrix). The main difficulties are the conversion of all SNP arrays to a common allele format (which can be handled, for instance, using the ICONVERT software provided in the ‘tools’ section of SNPchiMp) and the need for cross-references among the IDs of the SNPs. However, when arrays from both producers are involved, the complexity increases, as allele coding and SNP IDs are not consistent between the two producers. For instance, ~25% of the SNPs with common reference SNP IDs (rsIDs) in Illumina BovineHD (777k) and Affymetrix Axiom Bos-1 (648k) have different forward-strand allele coding (which can be nearly completely normalized using bioinformatics tools). In addition, whereas Affymetrix provides the rsID for at least part of its SNPs, Illumina does not, which makes the integration of the data more difficult. The integration complexity increases sharply when SNP array and WGS data have to be handled simultaneously (or when two RGAs have to be interrogated). To our knowledge, there are no specific software tools available to solve these issues in livestock species. In human genetics, an integrated solution, although not 100% accurate, has been suggested only recently (GACT; Sulovari & Li 2014).
GACT uses artificial neural networks to consistently predict allele definitions (even from different assemblies) using training data from the 1000 Genomes Project, the dbSNP database and SNP array data. In livestock species, such large training data sets are currently available only for cattle (1000 Bull Genomes Project, http://www.1000bullgenomes.com), although they will probably be available soon for other species in the so-called Genomic Selection 2.0 era (Hickey 2013). Thus, such a tool is a promising starting point to solve most of the abovementioned issues in a simple way.

The first studies involving the simultaneous use of SNP array and WGS data have been reported recently in cattle (Daetwyler et al. 2014), pig (Yang et al. 2014a,b) and horse (Frischknecht et al. 2014). All of these studied imputation from lower to higher density, although the first two actually imputed WGS genotypes from SNP array positions (but not actual SNP array genotypes). In contrast, Frischknecht et al. (2014) and Baes et al. (2014) imputed actual SNP array–WGS data, reporting an average genotypic concordance higher than 98% in ~30 horse and ~50 cattle individuals respectively. Although good results were obtained, a large number of bioinformatics steps were needed to filter and normalize the data in both studies.

The software described in this review (data management, imputation, GWAS, selection signatures, population genetic structure and genomic selection) is only a fraction of the large body of software currently available for genetic/genomic analyses. To our knowledge, no software is able to streamline all the cited analyses together, although licensed solutions, such as JMP GENOMICS and Golden Helix SNP & Variation Suite, usually allow multiple streamlined options. On the other hand, many analyses can be performed simply by integrating currently available software in a workflow, which makes more sense in research, where new solutions are typically required (including streamlining software not originally coded to work together). In general, programming skills are required, at least for (relatively) ‘small’ operations, but not always. For instance, both Illumina and Affymetrix genotype calling software are currently able to provide their genotype calls in PLINK format. Using such files, QC and imputation can be performed using a combination of PLINK, FCGENE and any of the imputation methods the latter supports. The identification of population genetic patterns or a (basic) GWAS can easily follow using the outcome of the previous analyses, without the need for any programming skills (Fig. S1). We underline that the scheme provided is an oversimplified example of all the steps required before and after the statistical analysis. Descriptive analyses of the available data, outlier identification and all the post hoc analyses are not considered here. For these, at least basic programming skills are needed, especially considering the high dimensionality of the data involved. In fact, with even basic knowledge of R, one could perform simple descriptive analyses, a GWAS using GenABEL (Appendix S1) or a genomic evaluation using BGLR.
Programming skills are highly useful when dealing with genomic data. They not only increase the number of tools available to the researcher (who would be able to adapt data to any format needed by any software) but also allow standard procedures to be automated and new tools to be built.
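
As an illustration of the kind of ‘small’ operation mentioned above, the following Python sketch reads PLINK-style PED records, computes call rate and minor allele frequency per SNP, and recodes the SNPs that pass to 0/1/2 minor-allele counts. It assumes the usual PED layout (six leading columns, then two allele columns per SNP, with 0 marking a missing allele). The function names and the thresholds are hypothetical choices made for this example, and dedicated tools such as PLINK remain the appropriate choice for real data.

```python
# Sketch: basic QC and additive (0/1/2) recoding of PLINK-style PED records.
# Assumed layout: FID, IID, sire, dam, sex, phenotype, then two alleles per SNP.
# Function names and thresholds are illustrative, not taken from any cited tool.
from collections import Counter

MISSING = "0"


def parse_ped_line(line):
    """Split one PED record into (sample ID, list of (allele1, allele2) pairs)."""
    fields = line.split()
    alleles = fields[6:]
    if len(alleles) % 2:
        raise ValueError("odd number of allele columns")
    return fields[1], [tuple(alleles[i:i + 2]) for i in range(0, len(alleles), 2)]


def snp_summary(calls):
    """Return (call rate, minor allele, minor allele frequency) for one SNP."""
    observed = [pair for pair in calls if MISSING not in pair]
    counts = Counter(allele for pair in observed for allele in pair)
    call_rate = len(observed) / len(calls)
    if len(counts) < 2:  # monomorphic or entirely missing
        return call_rate, None, 0.0
    minor = min(sorted(counts), key=lambda a: counts[a])  # ties: alphabetical
    return call_rate, minor, counts[minor] / sum(counts.values())


def recode(calls, minor):
    """Minor-allele count per sample; None marks a missing genotype."""
    return [None if MISSING in pair else pair.count(minor) for pair in calls]


ped_lines = [
    "FAM1 A1 0 0 1 -9 A A G G 0 0",
    "FAM1 A2 0 0 2 -9 A C G G T T",
    "FAM1 A3 0 0 1 -9 C C G G T C",
    "FAM1 A4 0 0 2 -9 A A G G 0 0",
]
samples = [parse_ped_line(line) for line in ped_lines]
n_snps = len(samples[0][1])

kept = {}
for j in range(n_snps):
    calls = [pairs[j] for _, pairs in samples]
    call_rate, minor, maf = snp_summary(calls)
    if call_rate >= 0.90 and maf >= 0.05:  # example thresholds only
        kept[j] = recode(calls, minor)

print(kept)  # {0: [0, 1, 2, 0]}
```

Here the second SNP is dropped because it is monomorphic and the third because half of its genotypes are missing, so only the first SNP is recoded.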

Conclusion

SNP data management, standardization and integration are often underestimated problems. We highlight a few critical needs in SNP array data management and analysis. (i) Common, standardized and internationally recognized IDs for the SNPs in SNP arrays are needed. Public databases (e.g. dbSNP, Ensembl) use rsIDs as the SNP ID key, which seems the most reasonable choice. Illumina should provide such information by default (Affymetrix already does). (ii) Reference and alternative alleles for a given RGA should also be provided by the producers (instead of only forward/reverse strands) to facilitate integration between SNP arrays and WGS data. (iii) Software able to standardize and integrate the different allele coding systems and different RGAs, such as GACT in humans, should be developed for livestock as more and more sequence data become available. (iv) Input files for SNP array data management and analysis software should be standardized. Free conversion tools are currently available but can hardly keep up with the large amount of software periodically published. An effort from the whole animal genetics community is needed: programmers (i.e. adopting standard formats or at least allowing multiple ‘proteiform’ input formats), end-users (i.e. strongly requesting such standards), and editors and reviewers of scientific journals (i.e. requiring a standard input format before accepting software manuscripts, and requiring full RGA information and details of the software used, including version and options chosen, before accepting research manuscripts).
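
Points (ii) and (iii) can be made concrete with a toy example. The Python sketch below compares the allele pair reported for a SNP on an array with a reference pair and labels it as already consistent, consistent after strand flipping, strand-ambiguous (A/T or C/G SNPs, which read identically on both strands) or irreconcilable. It is a deliberately simplified illustration under stated assumptions (biallelic SNPs, alleles coded as A/C/G/T) and does not reproduce how ICONVERT or GACT operate; all names and SNP identifiers are hypothetical.

```python
# Sketch: strand-aware comparison of array alleles with reference alleles.
# Assumptions: biallelic SNPs, alleles coded A/C/G/T (other codes raise KeyError).
# Names and SNP identifiers are hypothetical.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flip(alleles):
    """Return the alleles as read on the opposite strand."""
    return tuple(COMPLEMENT[a] for a in alleles)


def harmonize(array_alleles, reference_alleles):
    """Return (status, alleles), with alleles on the reference strand if resolvable."""
    array_alleles = tuple(array_alleles)
    observed = set(array_alleles)
    flipped = flip(array_alleles)
    if observed == set(flipped):
        # A/T and C/G SNPs look the same on both strands: alleles alone cannot decide.
        return "ambiguous", array_alleles
    if observed == set(reference_alleles):
        return "same", array_alleles
    if set(flipped) == set(reference_alleles):
        return "flipped", flipped
    return "mismatch", array_alleles


toy_snps = {
    "snp_1": (("A", "G"), ("A", "G")),
    "snp_2": (("T", "C"), ("A", "G")),
    "snp_3": (("A", "T"), ("A", "T")),
    "snp_4": (("A", "C"), ("A", "G")),
}
for name, (array_pair, reference_pair) in toy_snps.items():
    status, alleles = harmonize(array_pair, reference_pair)
    print(name, status, "/".join(alleles))
```

Ambiguous SNPs are exactly the cases in which producer-supplied reference and alternative alleles (or flanking sequences) would be needed, because allele letters alone do not reveal the strand.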

Acknowledgements

The authors wish to acknowledge John L. Williams, who helped fine-tune the final version of this manuscript. Chandrasen Soans (Illumina Inc.) and Barry Simpson (Neogen-GeneSeek) provided key information about the history and commercialization of SNP chips. The authors also wish to thank Mick Watson and the three other, anonymous reviewers for their useful comments and suggestions. The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 289592 – Gene2Farm, the Marie Curie European Reintegration Grant ‘NEUTRADAPT’ and the ‘Nextgen’ project (http://nextgen.epfl.ch/). The authors declare that they have no competing interests.

References

Aerts J. & Law A. (2009) An introduction to scripting in Ruby for biologists. BMC Bioinformatics 10, 221.
Aguilar I., Misztal I., Tsuruta S., Legarra A. & Wang H. (2014) PREGSF90 – POSTGSF90: Computational Tools for the Implementation of Single-step Genomic Selection and Genome-wide Association with Ungenotyped Individuals in BLUPF90 Programs. In: Proceedings, 10th World Congress of Genetics Applied to Livestock Production, Vancouver, Canada, August 2014.
Alexander D.H., Novembre J. & Lange K. (2009) Fast model-based estimation of ancestry in unrelated individuals. Genome Research 19, 1655–64.
Aulchenko Y.S., Ripke S., Isaacs A. & van Duijn C.M. (2007) GenABEL: an R library for genome-wide association analysis. Bioinformatics 23, 1294–6.
Baes C.F., Dolezal M.A., Koltes J.E. et al. (2014) Evaluation of variant identification methods for whole genome sequencing data in dairy cattle. BMC Genomics 15, 948.
Browning S.R. & Browning B.L. (2007) Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. American Journal of Human Genetics 81, 1084–97.


Bush W.S. & Moore J.H. (2012) Chapter 11: Genome-wide association studies. PLoS Computational Biology 8, e1002822.
Corander J. & Marttinen P. (2006) Bayesian identification of admixture events using multi-locus molecular markers. Molecular Ecology 15, 2833–43.
Cornuet J.M., Pudlo P., Veyssiere J., Dehne-Garcia A., Gautier M., Leblois R., Marin J.M. & Estoup A. (2014) DIYABC v2.0: a software to make approximate Bayesian computation inferences about population history using single nucleotide polymorphism, DNA sequence and microsatellite data. Bioinformatics 30, 1187–9.
Corpas M., Jimenez R.C., Bongcam-Rudloff E. et al. (2015) The GOBLET training portal: a global repository of bioinformatics training materials, courses and trainers. Bioinformatics 31, 140–2.
Daetwyler H.D., Capitan A., Pausch H. et al. (2014) Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle. Nature Genetics 46, 858–67.
Danecek P., Auton A., Abecasis G. et al. & 1000 Genomes Project Analysis Group (2011) The variant call format and VCFTOOLS. Bioinformatics 27, 2156–8.
Dong Y., Xie M., Jiang Y. et al. (2013) A reference genome of the domestic goat (Capra hircus) generated by Illumina sequencing and whole genome mapping. Nature Biotechnology 31, 135–41.
Donlin M.J. (2009) Using the generic genome browser (GBrowse). Current Protocols in Bioinformatics 9, 9.
Endelman J.B. (2011) Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4, 250–5.
Excoffier L. & Lischer H.E.L. (2010) ARLEQUIN suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Molecular Ecology Resources 10, 564–7.
Falush D., Stephens M. & Pritchard J.K. (2003) Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164, 1567–87.
Fernando R.L. & Garrick D.J. (2013) Bayesian methods applied to GWAS. In: Genome-Wide Association Studies and Genomic Prediction (Ed. by C. Gondro, J.H.J. van der Werf & B. Hayes), pp. 237–74. Springer Series: Methods in Molecular Biology. Humana Press, Berlin.
Frichot E., Schoville S.D., Bouchard G. & François O. (2013) Testing for associations between loci and environmental gradients using latent factor mixed models. Molecular Biology and Evolution 30, 1687–99.
Frischknecht M., Neuditschko M., Jagannathan V., Drögemüller C., Tetens J., Thaller G., Leeb T. & Rieder S. (2014) Imputation of sequence level genotypes in the Franches-Montagnes horse breed. Genetics Selection Evolution 46, 63.
Gianola D., de los Campos G., Hill W.G., Manfredi E. & Fernando R. (2009) Additive genetic variability and the Bayesian alphabet. Genetics 183, 347–63.
Gilmour A.R., Gogel B., Cullis B. & Thompson R. (2009) ASREML User Guide Release 3.0. VSN International Ltd, Hemel Hempstead, HP1 1ES, UK.
Gondro C., Porto-Neto L.R. & Lee S.H. (2013a) snpQC - an R pipeline for quality control of Illumina SNP genotyping array data. Animal Genetics 45, 758–61.
Gondro C., Porto-Neto L.R. & Lee S.H. (2013b) R for genome-wide association studies. In: Genome-Wide Association Studies and Genomic Prediction (Ed. by C. Gondro, J.H.J. van der Werf & B. Hayes), pp. 1–17. Springer Series: Methods in Molecular Biology. Humana Press, Berlin.
Groenen M.A., Archibald A.L., Uenishi H. et al. (2012) Analyses of pig genomes provide insight into porcine demography and evolution. Nature 491, 393–8.
Hickey J.M. (2013) Sequencing millions of animals for genomic selection 2.0. Journal of Animal Breeding and Genetics 130, 331–2.
Hickey J.M., Kinghorn B.P., Tier B., Wilson J.F., Dunstan N. & van der Werf J.H.J. (2011) A combined long-range phasing and long haplotype imputation method to impute phase for SNP genotypes. Genetics Selection Evolution 43, 12.
Houston R.D., Taggart J.B., Cezard T. et al. (2014) Development and validation of a high density SNP genotyping array for Atlantic salmon (Salmo salar). BMC Genomics 15, 90.
Howie B.N., Donnelly P. & Marchini J. (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5, e1000529.
International Chicken Genome Sequencing Consortium (2004) A genetic variation map for chicken with 2.8 million single nucleotide polymorphisms. Nature 432, 717–22.
Jiang J., Jiang L., Zhou B., Fu W., Liu J.F. & Zhang Q. (2011) SNAT: a SNP annotation tool for bovine by integrating various sources of genomic information. BMC Genetics 12, 85.
Jiang Y., Xie M., Chen W. et al. (2014) The sheep genome illuminates biology of the rumen and lipid metabolism. Science 344, 1168–73.
Johnston J., Kistemaker G. & Sullivan P.G. (2011) Comparison of different imputation methods. Interbull Bulletin 44, 25–31.
Joost S., Bonin A., Bruford M.W., Despres L., Conord C., Erhardt G. & Taberlet P. (2007) A spatial analysis method (SAM) to detect candidate loci for selection: towards a landscape genomics approach to adaptation. Molecular Ecology 16, 3955–69.
Kijas J.W., Townley D., Dalrymple B.P. et al. & the International Sheep Genomics Consortium (2009) A genome wide survey of SNP variation reveals the genetic structure of sheep breeds. PLoS ONE 4, e4668.
Kinsella R.J., Kähäri A., Haider S. et al. (2011) Ensembl BioMart: a hub for data retrieval across taxonomic space. Database 23, bar030.
Kranis A., Gheyas A.A., Boschiero C. et al. (2013) Development of a high density 600K SNP genotyping array for chicken. BMC Genomics 14, 59.
Legarra A., Aguilar I. & Misztal I. (2009) A relationship matrix including full pedigree and genomic information. Journal of Dairy Science 92, 4656–63.
Lenstra J.A., Groeneveld L.F., Eding H. et al. (2012) Molecular tools and analytical approaches for the characterization of farm animal genetic diversity. Animal Genetics 43, 483–502.
Li Y., Willer C.J., Ding J., Scheet P. & Abecasis G.R. (2010) MACH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genetic Epidemiology 34, 816–34.
Lien S., Gidskehaug L., Moen T., Hayes B.J., Berg P.R., Davidson W.S., Omholt S.W. & Kent M.P. (2011) A dense SNP-based linkage map for Atlantic salmon (Salmo salar) reveals extended chromosome homeologies and striking differences in sex-specific recombination patterns. BMC Genomics 12, 615.
Lischer H.E.L. & Excoffier L. (2012) PGDSPIDER: an automated data conversion tool for connecting population genetics and genomics programs. Bioinformatics 28, 298–9.
Liu Y., Qin X., Song X.Z. et al. (2009) Bos taurus genome assembly. BMC Genomics 10, 180.


Lopes J.S., Balding D. & Beaumont M.A. (2009) POPABC: a program to infer historical demographic parameters. Bioinformatics 25, 2747–9.
Madsen P., Sorensen P., Su G., Damgaard L.H., Thomsen H. & Labouriau R. (2006) DMU – a package for analyzing multivariate mixed models. In: Proceedings of the 8th World Congress on Genetics Applied to Livestock Production, Belo Horizonte, Minas Gerais, Brazil, 2006.
Matukumalli L.K., Lawley C.T., Schnabel R.D. et al. (2009) Development and characterization of a high density SNP genotyping assay for cattle. PLoS ONE 4, e5350.
McCue M.E., Bannasch D.L., Petersen J.L. et al. (2012) A high density SNP array for the domestic horse and extant perissodactyla: utility for association mapping, genetic diversity, and phylogeny studies. PLoS Genetics 8, e1002451.
Meuwissen T.H.E., Hayes B.J. & Goddard M.E. (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–29.
Meyer K. (2007) WOMBAT – A tool for mixed model analyses in quantitative genetics by restricted maximum likelihood (REML). Journal of Zhejiang University Science B 8, 815–21.
Milanesi M., Vicario D., Stella A., Valentini A., Ajmone-Marsan P., Biffani S., Biscarini F., Jansen G. & Nicolazzi E.L. (2015) Imputation accuracy is robust to cattle reference genome updates. Animal Genetics 46, 69–72.
Misztal I., Tsuruta S., Strabel T., Auvray B., Druet T. & Lee D.H. (2002) BLUPF90 and related programs (BGF90). In: Proceedings of the 7th World Congress on Genetics Applied to Livestock Production, Montpellier, France, August 2002.
Misztal I., Legarra A. & Aguilar I. (2009) Computing procedures for genetic evaluation including phenotypic, full pedigree, and genomic information. Journal of Dairy Science 92, 4648–55.
Mitha F., Herodotou H., Borisov N., Jiang C., Yoder J. & Owzar K. (2011) SNPpy – database management for SNP data from genome wide association studies. PLoS ONE 6, e24982.
Mungall C.J., Emmert D.B. & The FlyBase Consortium (2007) A Chado case study: an ontology-based modular schema for representing genome-associated biological information. Bioinformatics 23, i337–46.
Nejati-Javaremi A., Smith C. & Gibson J.P. (1997) Effect of total allelic relationship on accuracy of evaluation and response to selection. Journal of Animal Science 75, 1738–45.
Nicolazzi E.L., Biffani S. & Jansen G. (2013) Short communication: imputing genotypes using PEDIMPUTE fast algorithm combining pedigree and population information. Journal of Dairy Science 96, 2649–53.
Nicolazzi E.L., Picciolini M., Strozzi F., Schnabel R.D., Lawley C., Pirani A., Brew F. & Stella A. (2014) SNPchiMp: a database to disentangle the SNPchip jungle in bovine livestock. BMC Genomics 15, 123.
Palti Y., Gao G., Liu S., Kent M.P., Lien S., Miller M.R., Rexroad C.E. & Moen T. (2014) The development and characterization of a 57K single nucleotide polymorphism array for rainbow trout. Molecular Ecology Resources 15, 662–72. doi: 10.1111/1755-0998.12337
Perez P. & de los Campos G. (2014) Genome-wide regression & prediction with the BGLR statistical package. Genetics 198, 483–95.
Purcell S., Neale B., Todd-Brown K. et al. (2007) PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics 81, 559–71.
R Core Team (2014) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Raj A., Stephens M. & Pritchard J. (2014) Variational inference of population structure in large SNP datasets. Genetics 197, 573–89.
Ramos A.M., Crooijmans R.P.M.A., Affara N.A. et al. (2009) Design of a high density SNP genotyping assay in the pig using SNPs identified and characterized by next generation sequencing technology. PLoS ONE 4, e6524.
Robinson J.T., Thorvaldsdottir H., Winckler W., Guttman M., Lander E.S., Getz G. & Mesirov J.P. (2011) Integrative genomics viewer. Nature Biotechnology 29, 24–6.
Roshyara N.R. & Scholz M. (2014) FCGENE: a versatile tool for processing and transforming SNP datasets. PLoS ONE 9, e97589.
Sabeti P., Reich D., Higgins J.M. et al. (2002) Detecting recent positive selection in the human genome from haplotype structure. Nature 419, 832–7.
Sargolzaei M., Chesnais J.P. & Schenkel F.S. (2014) A new approach for efficient genotype imputation using information from relatives. BMC Genomics 15, 478.
Stucki S., Orozco-terWengel P., Bruford M.W., Colli L., Masembe C., Negrini R., Taberlet P., Joost S. & the NEXTGEN Consortium (2014) High performance computation of landscape genomic models integrating local indices of spatial association. arXiv:1405.7658v1
Sulovari A. & Li D. (2014) GACT: a genome build and allele definition conversion tool for SNP imputation and meta-analysis in genetic association studies. BMC Genomics 15, 610.
Szpiech Z. & Hernandez R. (2014) SELSCAN: an efficient multithreaded program to perform EHH-based scans for positive selection. Molecular Biology and Evolution 31, 2824–7. doi: 10.1093/molbev/msu211
Tajima F. (1989) Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123, 585–95.
The Bovine Genome Sequencing and Analysis Consortium (2009) The genome sequence of taurine cattle: a window to ruminant biology and evolution. Science 324, 522.
Tosser-Klopp G., Bardou P., Bouchez O. et al. & The International Goat Genome Consortium (2014) Design and characterization of a 52K SNP chip for goats. PLoS ONE 9, e86227.
VanRaden P.M., O’Connell J.R., Wiggans G.R. & Weigel K.A. (2011) Genomic evaluations with many more genotypes. Genetics Selection Evolution 43, 10.
Voight B.F., Kudaravalli S., Wen X. & Pritchard J.K. (2006) A map of recent positive selection in the human genome. PLoS Biology 4, e72.
Wang H., Misztal I., Aguilar I., Legarra A. & Muir W.M. (2012a) Genome-wide association mapping including phenotypes from relatives without genotypes. Genetics Research 94, 73–83.
Wang S., Dvorkin D. & Da Y. (2012b) SNPEVG: a graphical tool for GWAS graphing with mouse clicks. BMC Bioinformatics 13, 319.
Wegmann D., Leuenberger C., Neuenschwander S. & Excoffier L. (2010) ABCTOOLBOX: a versatile toolkit for approximate Bayesian computations. BMC Bioinformatics 11, 116.


Weigel K.A., Van Tassell C.P., O’Connell J.R., VanRaden P.M. & Wiggans G.R. (2010) Prediction of unobserved single nucleotide polymorphism genotypes of Jersey cattle using reference panels and population-based imputation algorithms. Journal of Dairy Science 93, 2229–38.
Weller J.I. (2001) Quantitative Trait Loci Analysis in Animals. CABI Publishing, UK.
Yang J., Lee S.H., Goddard M.E. & Visscher P.M. (2011) GCTA: a tool for genome-wide complex trait analysis. American Journal of Human Genetics 88, 76–82.
Yang J., Zaitlen N.A., Goddard M.E., Visscher P.M. & Price A.L. (2014a) Mixed model association methods: advantages and pitfalls. Nature Genetics 46, 100–6.
Yang Y., Wang Q., Chen Q., Liao R., Zhang X., Yang H., Zheng Y., Zhang Z. & Pan Y. (2014b) A new genotype imputation method with tolerance to high missing rate and rare variants. PLoS ONE 9, e101025.
Zhang Z. & Druet T. (2010) Marker imputation with low-density marker panels in Dutch Holstein cattle. Journal of Dairy Science 93, 5487–94.
Zhang F., Zhang Z.Y., He Y.N., Chen H., Yang B., Deng W.J. & Huang L.S. (2015) RegionPlot: an R package for regional plot association results for pigs. Animal Genetics 46, 94–5.
Zhou X. & Stephens M. (2012) Genome-wide efficient mixed-model analysis for association studies. Nature Genetics 44, 821–4.
Zimin A.V., Delcher A.L., Florea L. et al. (2009) A whole-genome assembly of the domestic cow, Bos taurus. Genome Biology 10, R42.

Supporting information

Additional supporting information may be found in the online version of this article.

Appendix S1 Pseudo-code for a basic genome-wide association study analysis of quantitative traits using the GenABEL R package.
Figure S1 Workflow example for multiple analyses requiring minimum (or no) programming skills.
