GenoMatrix: A Software Package for Pedigree-Based and Genomic Prediction Analyses on Complex Traits.

Journal of Heredity, 2016, 372–379 doi:10.1093/jhered/esw020 Computer Notes Advance Access publication March 29, 2016

Computer Notes

GenoMatrix: A Software Package for Pedigree-Based and Genomic Prediction Analyses on Complex Traits Alireza Nazarian and Salvador Alejandro Gezan From the School of Forest Resources and Conservation, University of Florida, 363 Newins-Ziegler Hall P.O. Box 110410, Gainesville, FL 32611-0410 (Nazarian and Gezan). Address correspondence to Salvador A. Gezan at the address above, or e-mail: [email protected]. Received October 22, 2015; First decision December 22, 2015; Accepted March 22, 2016.

Corresponding editor: J. Perry Gustafson

Abstract Genomic and pedigree-based best linear unbiased prediction methodologies (G-BLUP and P-BLUP) have proven themselves efficient for partitioning the phenotypic variance of complex traits into its components, estimating the individuals’ genetic merits, and predicting unobserved (or yet-to-be observed) phenotypes in many species and fields of study. The GenoMatrix software, presented here, is a user-friendly package to facilitate the process of using genome-wide marker data and parentage information for G-BLUP and P-BLUP analyses on complex traits. It provides users with a collection of applications which help them on a set of tasks from performing quality control on data to constructing and manipulating the genomic and pedigree-based relationship matrices and obtaining their inverses. Such matrices will be then used in downstream analyses by other statistical packages. The package also enables users to obtain predicted values for unobserved individuals based on the genetic values of observed related individuals. GenoMatrix is available to the research community as a Windows 64bit executable and can be downloaded free of charge at: http://compbio.ufl.edu/software/genomatrix/. Subject areas: Bioinformatics and computational genetics, Quantitative genetics and Mendelian inheritance Key words: complex traits, G-BLUP analysis, genome-wide marker data, P-BLUP analysis, software

The genetic analysis of complex traits has received increasing attention in the fields of statistical genetics and computational biology over past decades. Such analyses have been commonly performed through 2 main approaches (i.e. gene discovery vs. phenotypic prediction) which aim at either discovering genetic variants underlying the traits of interest or predicting unobserved (or yet-to-be observed) phenotypes without the need of having explicit information about all contributing causal variants (Lee et al. 2008). The advent of high throughput technologies has fueled the 2 approaches by providing researchers with an unprecedented opportunity to make use of dense marker information through genome-wide association studies (GWAS) (Ziegler et al. 2008; Cantor et al. 2010; Moore et al. 2010)

or genomic prediction strategies (Meuwissen et al. 2001; Lee et al. 2008; VanRaden 2008). The premise of genomic prediction methodology [e.g. genomic selection (Meuwissen et al. 2001)] is that given the presence of linkage disequilibrium (LD) between a large number of markers, densely distributed across genome, and causal variants it would be possible to obtain relatively accurate prediction of phenotypic outcomes even if there are many unknown causal loci (Meuwissen et al. 2001; Hayes et al. 2009b; Goddard 2009). In recent years, a growing upsurge of interest has emerged in using a variant of genomic prediction, known as genomic best linear unbiased prediction (i.e. G-BLUP), for estimating the genetic variance components of complex traits and

© The American Genetic Association. 2016. All rights reserved. For permissions, please e-mail: [email protected]

372

Journal of Heredity, 2016, Vol. 107, No. 4

obtaining individuals’ genetic merit (VanRaden 2008). The traditional counterpart of G-BLUP, known as pedigree-based best linear unbiased prediction (i.e. P-BLUP) (Henderson 1963; Henderson 1975; Henderson 1976; Henderson 1984), has also been successfully implemented in many artificial selection programs for decades. The growing number of studies using BLUP methodologies in various species and fields of study ranging from animal sciences and plant breeding, to evolutionary biology and medical genomics (Lee et al. 2008; Hayes et al. 2009a; Wilson et al. 2010; Lee et al. 2010; Yang et al. 2010; Grattapaglia and Resende 2011; Su et al. 2012; de Los Campos et al. 2013; Ridge et al. 2013; Muñoz et al. 2014; Beaulieu et al. 2014) has proven them as invaluable methods for the genetic analysis of complex traits. Both G-BLUP and P-BLUP integrate genetic relationship information into a linear mixed model (LMM) framework, with the difference that while the former makes use of the observed relationship matrices constructed from genotyping information (Yang et al. 2010; Powell et al. 2010; Vitezica et al. 2013), the latter uses the expected relationship matrices calculated based on the parentage information (Schaeffer et al. 1989; Mrode 1996). The observed (i.e. marker-based or genomic) and expected relationship matrices are different in the sense that Mendelian sampling during gametogenesis deviates the observed proportion of identity-by-descent (IBD) between any pair of relatives from the expected IBD values (Hayes et al. 2009b; Goddard 2009). Also, in the case a pair of individuals have a common ancestor several generations ago, that has not been included in the pedigree information, the observed IBD values will represent their relationship more accurately than the expected ones (VanRaden et al. 2009). Therefore, the use of marker-based approach becomes of great importance especially with messy data in which pedigree information is unknown, incomplete or inaccurate. It has been also suggested that G-BLUP may outperform P-BLUP in estimating genetic parameters of interest when there is some shortrange and possibly long-range LD between causal and marker loci (Beaulieu et al. 2014). This is mainly because both LD and relationship information can be taken into account by G-BLUP analysis (Habier et al. 2007). Statistical packages, which implement BLUP methodology, often require users to provide relationship matrices or their inverses directly, especially in the case of analyzing genomic data or modeling non-additive effects. Therefore, undertaking a preparatory step is needed to generate these necessary input files. The GenoMatrix software, presented here, provides a tool to facilitate the use of G-BLUP and P-BLUP analyses on complex traits. Starting from input data, genotyping data or a pedigree file, users are assisted all the way from assessing data quality to obtaining pedigree- and marker-based additive, dominance, and epistatic relationship matrices of interest and their inverses as well as, if desired, merging or comparing them with each other. Such relationship matrices can later be included in downstream analyses by statistical packages such as ASReml and WOMBAT (Butler et al. 2007; Meyer 2007; Gilmour et al. 2015). Users are also provided with assistance in predicting genetic values for unobserved individuals. To our knowledge, there is no single user-friendly package available which can perform all the tasks carried out by the implemented software, although, there are a few other packages with some overlapping functionality (Bradbury et al. 2007; Endelman 2011; Yang et al. 2011; Wolak 2012; Gilmour et al. 2015). In addition, GenoMatrix takes advantage of MATLAB programming language (The MathWorks, Inc. 2015) which provides a powerful, fast, and robust framework for mathematical applications. In the following sections, the different modules of the implemented

373

software will be described, and an example of how to run it, using simulated datasets, will be presented.

Features and Functionality GenoMatrix consists of the following modules, which collectively generate a workflow to facilitate the use of parentage information and genome-wide marker data for genetic analyses of complex traits.

Pedigree Sort Module (PSM) When preparing a pedigree file, containing parentage information about individuals under consideration, it is common that some errors may appear in pedigree which must be corrected before further proceeding with generating pedigree-based relationship matrices. This module helps the user to sort the pedigree file in a way that no subject is duplicated, missing, or appears after its offspring. The sorting process can be performed manually, by referring to the report file generated by the module, or automatically. The sorted pedigree will be then used by downstream modules such as Pedigree-based Relationship Matrix Calculation and Pedigree Manipulation.

Pedigree-Based Relationship Matrix Calculation Module (PRM) Starting from an input pedigree file, this module calculates the pedigree-based additive and dominance relationship matrices according to the following formulas (Schaeffer et al. 1989; Mrode 1996):

{a} jk = 0.5 ( jm , km ) + ( j f , kf ) + ( jm , kf ) + ( j f , km )

{d} jk = ( jm , km ) ( j f , kf ) + ( jm , kf )( j f , km )

where {a}jk and {d}jk are the additive and dominance relationship coefficients of individuals j and k, respectively, and terms inside parentheses refers to the additive relationship coefficients between male and female parents of each pair of individuals. Such matrices contain the expected proportions of IBD loci that may exist across the entire genome of any pair of individuals (Falconer and Mackay 1996; Mrode 1996; Hayes et al. 2009b). The resulting output files can be used by the Epistatic Matrix Calculation, Inverse Matrix Calculation, and Pedigree Manipulation modules.

Quality Control Module (QCM) This module can be used to refine input genotyping data based on multiple user-specified quality control (QC) measures. Its application is not limited to G-BLUP analyses; instead it can be useful for quantitative trait loci (QTL) analyses and association studies as well. Markers are assumed to have 2 alleles, as seen for example in the case of single nucleotide polymorphisms (SNPs). The genotypes in the input file can be encoded in numeric (i.e. 0, 1, and 2), letters (i.e. AB), or standard DNA bases (i.e. ACTG) formats. The module provides the user with opportunity to specify multiple criteria such as the thresholds for minor-allele frequency (MAF) of markers, and missing values frequency of markers/subjects (i.e. calling rates), as well as the significance level for Hardy–Weinberg equilibrium (HWE) chi-square test on markers (Ziegler et al. 2008). Once the QC is performed, the genotyping information for subjects and markers which have passed all QC measures will be numerically encoded and saved into a file. This file along with 2 other files containing the identifier of QC-passed markers and subjects can be used by other


374

modules such as Genomic Relationship Matrix Calculation and Pedigree Manipulation. Also, multiple output files containing information about the markers and subjects which have not passed QC measures are generated.

Alternatively, the centered dominance relationship matrix (DG*) can be generated by the formula described by Su et al. (2012): DG* =

HH ′

∑ 2p q (1 − 2p q ) i

This module takes one or more genotyping input file(s) and calculates genomic (i.e. marker-based) additive and dominance relationship matrices. There are different methods of generating such matrices with small differences in their formulas and output (VanRaden 2008; Yang et al. 2010; Yang et al. 2011; Su et al. 2012; Poland et al. 2012; Vitezica et al. 2013). In this implementation, the normalized additive relationship matrix (AG) is obtained according to the formula suggested by Yang et al. (2010):

{aG }jk

if

j≠k

if

j=k

where {aG}jk is the genomic additive relationship coefficient corresponding to individuals j and k, M is the number of markers, gij and gik are the numerically encoded genotypes of individuals j and k at marker i, and pi is the frequency of allele with numeric value of 1 at marker i. Alternatively, the centered additive relationship matrix (AG*) is generated according to the following formula (VanRaden 2008; Endelman and Jannink 2012; Poland et al. 2012):

AG* =

ZZ ′ ∑ i 2pi qi

where Z is an N (number of individuals) by M (number of markers) matrix of genotypes in which the original genotypic values across different loci have been mean-centered by subtracting the mean of each locus (i.e. {z}ij = gij − 2pi) where gij is the genotype of individual j at marker i, and pi and qi are the allele frequencies corresponding to the alleles with numeric values of 1 and 0 at marker i, respectively. The AG and AG* matrices contain the observed proportions of IBD values of any pair of individuals across the genotyped loci (Hayes et al. 2009b). The normalized dominance relationship matrix (DG) is obtained according to the following formula (Vitezica et al. 2013):

DG =

HH ′

∑ (2p q ) i

i

i

2

where H is an N by M matrix in which the original genotypes with 0, 1, and 2 values have been re-encoded as:

{h}ij

i

i

i

where H is an N by M matrix of genotypes re-encoded as:

Genomic Relationship Matrix Calculation Module (GRM)

1 ( gij − 2pi ) ( gik − 2pi )  ∑i 2 pi (1 − pi ) M = gij2 − (1 + 2 pi ) gij + 2 pi2  1  1+ ∑ i 2 pi (1 − pi ) M 

i

 −2 pi2 if gij = 0  =  2 pi qi if gij = 1  2  −2qi if gij = 2

A statistical model containing additive and normalized dominance relationship matrices estimates breeding values (i.e. additive effects plus a portion of dominance effects) and dominance deviations which are orthogonal to each other. Such estimated effects are in line with the statistical definition of additive and dominance effects and can directly be compared to their counterparts obtained by a traditional pedigree-based model (Vitezica et al. 2013).

{h}ij

 −2 pi qi if gij = 0  =  1 − 2 pi qi if gij = 1  if gij = 2  −2 pi qi

A statistical model containing additive and centered dominance relationship matrices estimates pure additive and dominance effects which are non-orthogonal to each other and are more in line with the biological definition of such effects than their statistical definition. Therefore, the variance components and genetic parameters estimated by the models containing additive term plus either DG or DG* matrices may be different, particularly when the true dominance variance contributing to the trait of interest gets larger, or the allele frequencies across loci under consideration become more extreme (Vitezica et al. 2013). In addition to the relationship matrices of interest, the current module writes down a matrix containing the number of markers with non-missing values shared by any pair of individuals. The output files, generated by this module, can be used by other modules such as Epistatic Matrix Calculation, Inverse Matrix Calculation, and Pedigree Manipulation.

Epistatic Matrix Calculation Module (EMM) For any 2 additive (A) and/or dominance (D) relationship matrices, the epistatic matrix (E) of interest is calculated as their Hadamard product (i.e. ʘ) (Davis 1962; Liu and Trenkler 2008). The user may request additive-by-additive (A ʘ A), additive-by-dominance (A ʘ D), and dominance-by-dominance (D ʘ D) matrices. Note that the uploaded input matrices must be of the same sizes and follow the same order of individuals. If this is not the case, the Pedigree Manipulation module can be first used to align them and make them equivalent. The resulting output files of this module can be used by the Inverse Matrix Calculation and Pedigree Manipulation modules.

Inverse Matrix Calculation Module (IMM) This module calculates and writes down the inverse of any relationship matrix. Having such inverse matrices is required for estimating corresponding random effects by P-BLUP and G-BLUP methodologies (Henderson 1975; Henderson 1976; Henderson 1984; VanRaden 2008). In the case a matrix is singular or near-singular; the module provides the user with 3 options to overcome this issue by generating approximated inverses. The first option is to obtain a positive definite (PD) matrix (VPD) by non-iteratively replacing the negative eigenvalues of the relationship matrix of interest with small positive values according to the following formulas (Schaeffer 2010b): V PD = U δ U ′ where U is the matrix of eigenvectors of the original relationship matrix and δ is the diagonal matrix of eigenvalues, where all negative values (E(i)neg) have been replaced by new values (E(i)new) which are calculated as: 2 p * (s − E(i)neg ) E(i)new = w


375

where p is the absolute value of the largest E(i)neg, s is the sum of all the negative eigenvalues, and w = 100*s2+1. The second option takes advantage of an iterative algorithm to modify the negative and very small positive eigenvalues of the relationship matrix of interest through a weighted or unweighted bending procedure according to the formula suggested by Jorjani et al. (2003): Vn + 1 = Vn − [Vn − U nδ nU n′ ]  W where Vn and Vn+1 are the relationship matrix of interest at iterations n and n + 1, Un is the matrix of eigenvectors of Vn, and δn is the diagonal matrix of eigenvalues of Vn with values smaller than a userspecified constant (ε) being replaced by ε. W is a matrix of weights and ʘ refers to the Hadamard product of matrices. The iteration process continues until Vn+1 is a PD matrix with all eigenvalues equal or greater than ε. If W is not specified a matrix of ones (i.e. J = 1’1) is used resulting in un-weighted bending. The last option is to obtain a PD and/or invertible matrix (VPD) through blending the relationship matrix of interest (VNPD) with a second relationship matrix (VBlend) (VanRaden 2008; Aguilar et al. 2010; Rodríguez-Ramilo et al. 2014). VPD = p * VNPD + (1 − p)* VBlend where p and (1 − p) specify the proportion of contribution of VNPD and VBlend matrices to the resulting VPD matrix, respectively. For example, a genomic additive relationship matrix can be blended with its counterpart pedigree-based matrix. Starting from 99% contribution from the first matrix and 1% contribution from the second one, a wellconditioned matrix is sought as the linear combination of the 2 matrices through an iterative procedure which increments the contribution of the second matrix by 1% in each round of iteration. The blending process will continue until a well-conditioned matrix is obtained or until the contribution of the second matrix exceeds 50%.

Pedigree Manipulation Module (PMM) This module provides the user with multiple options to modify relationship matrices generated by other modules. For example, an adjusted (or corrected) genomic additive relationship matrix (AcG) can be obtained, according to the following formula, by shrinking original genomic relationship coefficients towards their counterpart expected (i.e. pedigree-based) values in order to reduce the sampling variance (Yang et al. 2010; Powell et al. 2010; Muñoz et al. 2014): 

{aGc }jk =  1 − 

1

 M  var (aG ) 

({a }

G jk

)

− {a}jk + {a}jk

where {a}jk and {aG}jk are the pedigree-based and genomic additive relationship coefficients corresponding to individuals j and k, M is the number of markers, var(aG) is the variance of the elements of the genomic matrix with the same relationship class (e.g. full-sibling, half-sibling) as that of individuals j and k. The same formula can be used to obtain an adjusted non-additive (e.g. dominance) relationship matrix based on the corresponding pedigree-based and genomic matrices. In addition, the pedigree-based and genomic relationship matrices can be merged to generate an extended matrix which contains all subjects present in both of the 2 relationship matrices. This might be of relevance, e.g. in the cases in which not all individuals are genotyped but all have phenotypic data or when obtaining predicted values for unobserved phenotypes is desired through simultaneously using both pedigree information and molecular data (Meuwissen

and Goddard 1999; Goddard 2009; Christensen and Lund 2010; Aguilar et al. 2010). The 2 uploaded matrices can also be aligned to generate new matrices of the same sizes and order of individuals. Such aligned matrices are used, e.g. when obtaining an epistatic relationship matrix is needed or when blending method is implemented to overcome the singularity issue in order to obtain the inverse of a relationship matrix. Also, when comparing expected and observed relationship coefficients is of interest, the 2 matrices must first be aligned. In this case, a file containing information regarding differences between the 2 matrices will be generated. Investigating this file will be helpful, e.g. to identify errors in assigning parentage information in the input pedigree file.

Genetic Prediction Module (GPM) This module enables the user to predict genetic responses [e.g. genotyping information, estimated breeding values (EBVs)] for unobserved (or yet-to-be observed) subjects given the availability of the response values for a portion of dataset (i.e. observed subjects) and a relationship matrix for the entire subjects under consideration. The unobserved responses can be predicted, by taking into account their conditional distribution given the observed responses, according to the following formula (Eaton 1983; Gengler et al. 2007):  µy  −1 E ( g x | g y ) = (1 GxyGyy ) g − µ 1  y y  where 1 is a vector of ones, gx and gy are the vectors of responses for subsets with unknown and known values (i.e. subsets x and y), µy is the mean of subset y, Gxy is the submatrix corresponding to the portion of the full relationship matrix (G) which contain the relationship coefficients between subjects in subsets x and y, and Gyy is the submatrix with the portion of G corresponding to the subjects in subset y. Alternatively, the unobserved responses can be predicted by solving the LMM equations system. Assuming the general structure for the fitted LMM as: g = µ1 + Za + e where g is the vector of responses, µ is the overall mean, Z is the incidence matrix for random effects, a ~ MVN(0, σ2aA) is the vector of additive random effects where σ2a is the additive variance and A is the numerator relationship matrix sorted in a way that all subjects with observed responses appear before other subjects, and e ~ MVN(0, Iσ2e) is the vector of residual errors with I is an identity matrix and σ2e is the error variance. The random effects for subsets x and y (ax = gx − µ1 and ay = gy − µ1) will be obtained by solving the following equation (Henderson 1984; Gengler et al. 2007):  µ 1′ Z     1′ g y   1′ 1    ay  =    Z ′1 Z ′ Z + kA −1     Z ′ g y   ax  where k is a shrinkage coefficient calculated as k = σ2e/σ2a. The application of this module can be of great relevance when working with large-scale genomic datasets. For example, when G-BLUP analyses are performed over datasets with thousands of subjects, it might be more time efficient to fit a reduced animal model (RAM) over only subjects with observed phenotypes than to fit a full animal model (Schaeffer 2010a). Once the RAM is converged, the resulting EBVs can then be used to predict the EBVs for other related


376

individuals that do not have observed phenotypes. Note that a pedigree-based relationship matrix is usually sparse enough to prevent the need of fitting RAMs when a P-BLUP analysis is performed. In such cases, fitting a full LMM is straightforward and recommended. However, in the case of G-BLUP analysis, since a genomic relationship matrix or an extended matrix, obtained by merging pedigreebased and genomic relationship matrices, contains many non-zero elements, solving and optimizing a full LMM equations system might become time consuming and computationally expensive. Another example of the module’s applications is to use the genotyping information available for a part of individuals under consideration (e.g. offspring generation) to predict genotypes of other individuals (e.g. parent generation) (Gengler et al. 2007). This can be helpful for obtaining allele frequencies in the unselected population. It has been suggested that genomic relationship matrices which are generated by using biased allele frequencies, calculated over an inbred population with a large levels of homozygosity, might be inaccurate and result in biased EBVs (VanRaden 2008; Simeone et al. 2011).

Examples of Use A number of simulated datasets have been generated which can be used to investigate the functionality of different modules of GenoMatrix. These datasets are described in the section A of Supplementary Materials. In addition, a detailed explanation about running the Pedigree Sort and Quality Control modules is provided in the sections B and C of Supplementary Materials. The user can also follow the instructions in the section D of Supplementary Materials to generate pedigree-based and genomic additive and dominance relationship matrices using the provided datasets. The following paragraphs assume that these matrices have been already generated. The section E of Supplementary Materials describes an application of the Genetic Prediction module for predicting the genotypes of parents, as unobserved individuals, based on the genotypes of their offspring as observed individuals.

Obtaining Epistatic Relationship Matrices Once the additive and dominance relationship matrices were available, the output files in dense matrix format were uploaded to the Epistatic Relationship Matrix Calculation module (EMM) to generate epistatic relationship matrices of interest. Note that the pedigree file which was used to generate pedigree-based matrices contains parentage information of individuals in both G1 (i.e. parents) and G2 (i.e. offspring) generations. However, the file which was used to generate genomic matrices contains genotyping information for only G2 individuals. Consequently, the size of pedigree-based matrices is different from the size of genomic ones (1000-by-1000 vs. 900-by-900). If an epistatic matrix between pedigree-based and genomic matrices is needed, the 2 relationship matrices must therefore be aligned first. This can be performed by uploading them to the Pedigree Manipulation module (PMM), and running this module after choosing the align option. The input files required for running the PMM are listed in the section A of Supplementary Materials. Including dominance and epistatic relationship matrices into the statistical analysis is important whenever partitioning genetic variance of complex traits into its additive and non-additive components is of interest. The P-BLUP analysis have mainly been implemented for capturing additive effects because, e.g. the lack of appropriate mating structure and study design may lead to the lack of statistical power to estimate non-additive (i.e. dominance and epistatic) effects

contributing to the genetic variance of complex traits (Tenesa and Haley 2013). However, a number of recent studies have suggested that G-BLUP may have more power to capture non-additive effects by relying on the observed relationship coefficients, and to some extent, by bypassing the need of appropriate study designs (Su et al. 2012; Vitezica et al. 2013; Muñoz et al. 2014; Nazarian and Gezan 2016).

Comparing Genomic and Pedigree-based Relationship Matrices The PMM, through choosing the generate differences option; can be used to compare the pedigree-based and genomic relationship matrices (i.e. APED and AHAT-Norm output files generated by the PRM and GRM, respectively) for each pair of individuals that are present in both matrices. The resulting RM-Differences output file may also be used to investigate the distribution of relationship coefficients and obtain their maximum, minimum and mean across a certain relationship level such as full-sib or half-sib relationships by exploring it in a spreadsheet. For example, Table 1 contains the maximum, minimum and mean of the coefficients of relationship across diagonal, full-sib and half-sib elements of the generated genomic normalized additive matrix for G2 individuals. Investigating the distribution of coefficients of relationship across diagonal elements of a genomic additive relationship matrix may provide suggestive information regarding population admixture, mislabeling of subjects, and errors in genotyping procedure. Detecting and resolving such issues will in turn increase the accuracy of predictions. For example, since marker effects are more or less specific to the population in which they are estimated, effects calculated in a population cannot accurately predict genetic merits of mislabeled subjects from another population (Simeone et al. 2011). In addition, investigating genomic coefficients across full-sib or half-sib families may help the user to find potential errors or inconsistencies in specifying parentage information in pedigree file. For example, it is straightforward to replace the parents of the last individual in the provided pedigree file (i.e. individual 1000 with parents 34/74) with 46/69. Individuals 46 and 69 are the parents of another full-sib family with offspring 155–160. The distribution of coefficients of relationship within this full-sib family, after introducing this error, can be investigated by generating a new pedigree-based additive matrix using the modified pedigree file, and then obtaining a new RM-Differences file. Table 2 summarizes the mean, minimum and maximum pedigree-based and genomic relationship coefficients existing between each pair of individuals 155–160 (i.e. real fullsiblings), and between each of these individuals and the individual 1000. Deliberately changing the parentage information for the individual 1000 made it a member of this family with expected (i.e. pedigree-based) relationship coefficients of 0.5 with others. However, the genomic coefficients between this individual and other offspring in this family became consistently different from what is expected from a full-sib relationship. Note that genomic coefficients for a certain Table 1. The minimum, maximum and mean of relationship coefficients for diagonal elements, full-sib, and half-sib relationships obtained from genomic normalized additive relationship matrix

Mean Minimum Maximum

Diagonal

Full-sibs

Half-sibs

0.997 0.945 1.188

0.489 0.057 0.863

0.239 −0.041 0.531


377

Table 2. The minimum, maximum, and mean of pedigree-based and genomic additive relationship coefficients between real fullsib members of the family 46/69 (i.e. individuals 155–160), and between individual 1000 and these members (i.e. erroneous full-sib relationships) Real full-sibs

Mean Minimum Maximum

IND 1000

Pedigree-based

Genomic

Pedigree-based

Genomic

0.5 0.5 0.5

0.452 0.268 0.723

0.5 0.5 0.5

−0.031 −0.055 −0.017

relationship level such as full-sib or half-sib relationships show some degree of variance around their expected value due to Mendelian segregation during gametogenesis (VanRaden 2007; Hayes et al. 2009b). However, if a particular individual (or individuals) consistently has lower or higher coefficients of relationship with other members of that family it might indicate that there is a pedigree error in specifying parentage information for that individual.

Obtaining Inverse of Relationship Matrices Once the additive, dominance and epistatic matrices are generated, their inverses can be obtained through the Inverse Matrix Calculation module (IMM). Having such inverse matrices is important since they are included in the process of estimating additive, dominance and epistatic effects by downstream P-BLUP or G-BLUP analyses (Henderson 1975; Henderson 1976; Henderson 1984; VanRaden 2008). However, not all statistical packages calculate these matrices internally. For example, ASReml-R package and WOMBAT (Butler et al. 2007; Meyer 2007) require the user to upload the genomic or non-additive inverse matrices directly. Moreover, it is quite common that such variance–covariance relationship matrices are ill-conditioned because they may have some negative eigenvalues or very small positive eigenvalues (Jorjani et al. 2003; Schaeffer 2010b). Such issues make a matrix unstable and singular or near-singular, making inversion difficult or impossible without modifying the matrix. The IMM considers the reciprocal condition number (RCN) of a matrix as the index of its stability and the accuracy of its inverse (Antia 2002). If a matrix is not singular or near-singular with respect to its RCN, constructing the inverse of the matrix is attempted, otherwise, the user is provided with alternative iterative and non-iterative bending as well as blending methods to circumvent the singularity issue. As an example, the IMM was run on the genomic normalized additive relationship matrix, using defaults values and choosing all possible alternative methods. Note that in this experiment pedigreebased additive matrix served as the second matrix for the blending procedure. Therefore, these 2 matrices which were not of the same sizes and order of individuals were first aligned by the PMM. The resulting report file generated by the IMM indicated that since the normalized additive matrix had an RCN of 2.601 × 10−7, which is greater than the RCN threshold’s default value of 1 × 10−12, an inverse matrix was constructed. Since the matrix was non-positive definite (NPD) with minimum eigenvalue of −8.391 × 10−6, alternative methods were implemented to obtain a positive definite (PD) matrix. The non-weighted iterative bending resulted in a PD matrix in the fourth iteration with an RCN of 3.1 × 10−6 and minimum eigenvalue of 1 × 10−4. When a weighted iterative bending was performed instead; by specifying a matrix of weights (e.g. Count-Markers file generated by the GRM), a PD matrix was obtained in the first iteration with an RCN of 2.511 × 10−4 and minimum eigenvalue of 3.748 × 10−2.

The resulting report file showed that the non-iterative bending was also able to generate a PD matrix which seems to be ill-conditioned and unstable with both RCN and minimum eigenvalue of less than 1 × 10−15. Note that although, in this case, the non-iterative bending procedure resulted in a badly conditioned matrix; however, in some other cases it might be much faster and outperforms the iterative bending. Also, blending of the genomic additive matrix with the aligned pedigree-based additive matrix resulted in a PD matrix with an RCN of 3.096 × 10−4 and minimum eigenvalue of 4.212 × 10−2. The contribution of genomic and pedigree-based matrices to this PD matrix was 99 and 1%, respectively. However, in some other cases obtaining such a matrix may require more contribution from the second matrix (here, pedigree based additive relationship). It has been suggested that the reliability of phenotypic predictions obtained by incorporating a blended matrix into LMMs may decrease with less contribution of the original matrix (here, genomic normalized additive relationship) (VanRaden 2008).

Comparison with Other Software Packages The performance of different modules of GenoMatrix (e.g. PRM, GRM, EMM, and IMM) was compared to that of other software packages with overlapping functionality. The files S1–S4 of the Supplementary Material containing the resulting relationship matrices, generated by different packages, can be downloaded along with provided sample datasets from GenoMatrix website. The pedigree-based additive, dominance and epistatic relationship matrices generated by the PRM and EMM were compared to those generated by nadiv R package (Wolak 2012). The file S1 (Supplementary Material) contains the resulting matrices in rowwise format. Investigating these matrices revealed that the matrices generated by both of these packages were identical. The genomic additive and dominance relationship matrices generated by the GRM were compared to those generated by TASSEL (Bradbury et al. 2007) and rr-BLUP R package (Endelman 2011). The file S2 (Supplementary Material) contains the normalized additive relationship matrices (AG) (Yang et al. 2010; Yang et al. 2011) resulted from GenoMatrix and TASSEL. The off-diagonal elements of the 2 AG matrices were generally the same with trivial differences of 5.06 × 10−7 or less. However, diagonal elements had greater differences with some elements having differences of 0.013 or less. The reason is that TASSEL follows the formula suggested by Yang et al. (2011) which treats diagonal and off-diagonal elements equally when calculating AG. However, GenoMatrix makes use of the formula suggested by Yang et al. (2010) which calculates diagonal and off-diagonal elements differently in order to obtain more accurate estimates of inbreeding values. The file S3 (Supplementary Material) contains the centered additive relationship matrices (AG*) (VanRaden 2007; Endelman and Jannink 2012; Poland et al. 2012) generated by GenoMatrix, TASSEL, and rr-BLUP. Comparing the 3 AG* matrices revealed that they were mostly the same with some values having differences of 5.16 × 10−7 or less. The file S4 (Supplementary Material) contains the centered dominance relationship matrices (DG*) (Su et al. 2012) generated by GenoMatrix and TASSEL. Once again, comparing the 2 matrices showed that they were mostly the same with some elements having negligible differences of 5.06 × 10−7 or less. Since obtaining the normalized dominance relationship matrix has not yet been implemented by TASSEL, a comparison of DG matrix was not possible. To investigate the performance of PD matrices generated by modifying eigenvalues of the non-positive definite AG matrix through


378

the IMM, a simulated phenotypic dataset (i.e. Phenotypes file) was used to fit several G-BLUP models. The general structure of the fitted LMMs was the same as the model which was previously explained in the “Genetic Prediction Module (GPM)” subsection of the “Features and Functionality” section. Three models were fitted and compared, each containing a different AG matrix. The model M1, which was considered as the standard model, was fitted by using the original AG matrix (i.e. as a grm file) and allowing the standalone ASReml v. 4.1 package (Gilmour et al. 2015) to make the inverse matrix internally. However, note that since this matrix is NPD, its inverse cannot be used directly which will become an important issue in cases in which statistical packages, e.g. ASReml-R (Butler et al. 2007), require users to provide the inverse of relationship matrices directly. Models M2 and M3 were fitted by using the inverse of AG generated by weighted iterative bending and blending methods, respectively. The hypothesis to be tested was that if the EBVs and their ranks are highly correlated across different models that contain different AG matrices, such matrices can then be treated as equivalent and be used interchangeably for data analysis. The matrix generated by non-iterative bending was ignored because it was ill-conditioned and very unstable according to its RCN and minimum eigenvalue. The EBVs were highly correlated among the fitted models with the Pearson product-moment correlation coefficients of ~1.00 for the entire dataset as well as for the top 5, 10, and 20% of individuals according to their EBVs estimated by the standard model (i.e. model M1). The Spearman’s rank correlation coefficients of the ranks of EBVs were also ~1.00, indicating that the rankings of individuals based on their EBVs were highly consistent across the 3 fitted models. More details are presented in Supplementary Tables S1–S4. To compare the computing times of GenoMatrix and TASSEL when they are employed to generate genomic relationship matrices using large datasets, a dataset of 100 000 markers and 2000 individuals was used to generate AG, AG*, DG*, and DG matrices. While TASSEL was faster than GenoMatrix in reading input and writing output files, the actual computing times for generating genomic matrices were comparable. Times for generating AG, AG*, and DG* matrices were about 33, 15, and 18 s with GenoMatrix, respectively. In contrast, it took about 45, 25, and 22 s for TASSEL to generate the same matrices. Also, the computing time was around 28 s for generating DG matrix by GenoMatrix. The performance of GenoMatrix has been tested on a typical desktop computer with Intel® Xeon® CPU at 3.07 GHz and 12 GB of random access memory (RAM). Dealing with extremely large datasets, the 2 most memory-consuming modules of GenoMatrix are the QCM and GRM. These 2 modules are able to handle multiple input files. Therefore, for analyzing a very large dataset (for example a dataset containing 1 million markers) it is recommended that, depending on the available RAM on the user’s machine, the user provides GenoMatrix with multiple genotyping files each containing genotyping information for all the individuals under consideration and a portion of markers. This way, GenoMatrix can easily be run on a desktop computer with typical amounts of RAM even for very large datasets.

Implementation and Availability GenoMatrix is distributed as a Windows 64bit user-friendly executable. It is available, royalty-free, to the research community and can be downloaded at: http://compbio.ufl.edu/software/genomatrix/. The downloadable package contains the compiled application, the user guide and several sample datasets for testing the software functionality.

Although GenoMatrix has been implemented using MATLAB programming language (The MathWorks, Inc. 2015), users do not need to have MATLAB installed on their machines to make use of it. However, installing a compatible MATLAB Compiler Runtime (MCR) is required. The MCR is a free standalone package with essential libraries to run compiled MATLAB applications. The compatible MCR can be downloaded directly from the MathWorks website at: http://www.mathworks.com/products/compiler/mcr/index. html.

Conclusions The GenoMatrix software provides a collection of applications, which by helping users to generate, modify and compare necessary relationship matrices, enable them to take advantage of P-BLUP and G-BLUP analyses more readily. Accessibility of the implemented package will facilitate future studies on the genetic architecture of complex traits especially for those scientists who are not expert in developing relevant necessary computer applications. The use of GenoMatrix is not limited to the artificial selection programs, instead, other fields of study such as evolutionary genetics and medical genomics will benefit as well due to the increasing interest in using BLUP methodology for the study of complex traits in various species.

Supplementary Material Supplementary material can be found at http://www.jhered.oxfordjournals.org/.

Funding This study was partially funded by the School of Forest Resources and Conservation and the Institute of Food and Agricultural Science at the University of Florida and by the Specialty Crop Block Grant from the Florida Department of Agriculture and Consumer Services.

Conflict of interest The authors declare no conflict of interest.

References Aguilar I, Misztal I, Johnson DL, Legarra A, Tsuruta S, Lawlor TJ. 2010. Hot topic: a unified approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluation of Holstein final score. J Dairy Sci. 93:743–752. Antia HM. 2002. Numerical Methods for Scientists and Engineers. Birkhäuser Verlag. Beaulieu J, Doerksen TK, MacKay J, Rainville A, Bousquet J. 2014. Genomic selection accuracies within and between environments and small breeding groups in white spruce. BMC Genomics. 15:1048. Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES. 2007. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics. 23:2633–2635. Butler DG, Cullis BR, Gilmour AR and Gogel B. 2007. ASReml-R Reference Manual Release 2.0, Release 2.00. Brisbane: The State of Queensland, Department of Primary Industries and Fisheries. Cantor RM, Lange K, Sinsheimer JS. 2010. Prioritizing GWAS results: A review of statistical methods and recommendations for their application. Am J Hum Genet. 86:6–22. Christensen OF, Lund MS. 2010. Genomic prediction when some animals are not genotyped. Genet Sel Evol GSE. 42:2.


Davis C. 1962. The norm of the Schur product operation. Numer Math. 4:343–344. de Los Campos G, Vazquez AI, Fernando R, Klimentidis YC, Sorensen D. 2013. Prediction of complex human traits using the genomic best linear unbiased predictor. PLoS Genet. 9:e1003608. Eaton ML. 1983. Multivariate statistics: a vector space approach, 1st ed. John Wiley & Sons Inc. Endelman JB. 2011. Ridge regression and other kernels for genomic selection with R Package rrBLUP. Plant Genome. 4:250–255. Endelman JB, Jannink J-L. 2012. Shrinkage estimation of the realized relationship matrix. G3 Bethesda Md. 2:1405–1413. Falconer DS, Mackay TFC. 1996. Introduction to quantitative genetics, 4 edn. Essex (UK): Longman Group Ltd. Gengler N, Mayeres P, Szydlowski M. 2007. A simple method to approximate gene content in large pedigree populations: application to the myostatin gene in dual-purpose Belgian Blue cattle. Anim Int J Anim Biosci. 1:21–28. Gilmour AR, Gogel BJ, Cullis BR, Welham SJ and Thompson R. 2015. ASReml User Guide Release 4.1 Functional Specifcation, Release 4.1. Hempstead (UK): VSN International Ltd. Goddard M. 2009. Genomic selection: prediction of accuracy and maximisation of long term response. Genetica. 136:245–257. Grattapaglia D and Resende MDV. 2011. Genomic selection in forest tree breeding. Tree Genet Genomes. 7:241–255. Habier D, Fernando RL, Dekkers JC. 2007. The impact of genetic relationship information on genome-assisted breeding values. Genetics. 177:2389– 2397. Hayes BJ, Bowman PJ, Chamberlain AJ, Goddard ME. 2009a. Invited review: Genomic selection in dairy cattle: progress and challenges. J Dairy Sci. 92:433–443. Hayes BJ, Visscher PM, Goddard ME. 2009b. Increased accuracy of artificial selection by using the realized relationship matrix. Genet Res (Camb). 91:47–60. Henderson CR. 1963. Selection index and expected genetic advance. Statistical Genetics and Plant Breeding. In: Hanson WD, Robinson HF, editors. National Academy of Sciences-National Research Council, Washington, p. 141–163 Henderson CR. 1984. Applications of linear models in animal breeding. Guelph: University of Guelph. Henderson CR. 1975. Best linear unbiased estimation and prediction under a selection model. Biometrics. 31:423–447. Henderson CR. 1976. A simple method for computing the inverse of a numerator relationship matrix used in prediction of breeding values. Biometrics. 32:69–83. Jorjani H, Klei L, Emanuelson U. 2003. A simple method for weighted bending of genetic (co)variance matrices. J Dairy Sci. 86:677–679. Lee SH, Goddard ME, Visscher PM, van der Werf JH. 2010. Using the realized relationship matrix to disentangle confounding factors for the estimation of genetic variance components of complex traits. Genet Sel Evol. 42:22. Lee SH, van der Werf JHJ, Hayes BJ, Goddard ME, Visscher PM. 2008. Predicting unobserved phenotypes for complex traits from whole-genome SNP data. PLoS Genet. 4:e1000231. Liu S, Trenkler G. 2008. Hadamard, Khatri-Rao, Kronecker and other matrix products. Int J Inf Syst Sci. 4:160–177. Meuwissen TH, Hayes BJ, Goddard ME. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 157:1819–1829. Meuwissen TH and Goddard ME. 1999. Marker assisted estimation of breeding values when marker information is missing on many animals. Genet Sel Evol. 31:375. Meyer K. 2007. WOMBAT: a tool for mixed model analyses in quantitative genetics by restricted maximum likelihood (REML). J Zhejiang Univ Sci B. 8:815–821. Moore JH, Asselbergs FW, Williams SM. 2010. Bioinformatics challenges for genome-wide association studies. Bioinformatics. 26:445–455. Mrode RA. 1996. Linear models for the prediction of animal breeding values, 1st ed. Wallingford (UK): CABI.

379

Muñoz PR, Resende MF Jr, Gezan SA, Resende MD, de Los Campos G, Kirst M, Huber D, Peter GF. 2014. Unraveling additive from nonadditive effects using genomic relationship matrices. Genetics. 198:1759–1768. Nazarian A, Gezan SA. 2016. Integrating nonadditive genomic relationship matrices into the study of genetic architecture of complex traits. J Hered. 107:153–162. Poland J, Endelman J, Dawson J, Rutkoski J, Wu S, Manes Y, Dreisigacker S, Crossa J, Sánchez-Villeda H, Sorrells M et al. 2012. Genomic selection in wheat breeding using genotyping-by-sequencing. Plant Genome. 5:103–113. Powell JE, Visscher PM, Goddard ME. 2010. Reconciling the analysis of IBD and IBS in complex trait studies. Nat Rev Genet. 11:800–805. Ridge PG, Mukherjee S, Crane PK, Kauwe JS; Alzheimer’s Disease Genetics Consortium. 2013. Alzheimer’s disease: analyzing the missing heritability. PLoS One. 8:e79771. Rodríguez-Ramilo ST, García-Cortés LA, González-Recio O. 2014. Combining genomic and genealogical information in a reproducing kernel Hilbert spaces regression model for genome-enabled predictions in dairy cattle. PLoS One. 9:e93424. Schaeffer LR. 2010a. Linear models and animal breeding. Guelph (ON): Centre for Genetic Improvement of Livestock, Department of Animal and Poultry Science, University of Guelph. Schaeffer LR. 2010b. Modification of negative eigenvalues to create positive definite matrices and approximation of standard errors of correlation estimates. Guelph (ON): Centre for Genetic Improvement of Livestock, Department of Animal and Poultry Science, University of Guelph. Schaeffer LR, Kennedy BW, Gibson JP. 1989. The inverse of the gametic relationship matrix. J Dairy Sci. 72:1266–1272. Simeone R, Misztal I, Aguilar I, Legarra A. 2011. Evaluation of the utility of diagonal elements of the genomic relationship matrix as a diagnostic tool to detect mislabelled genotyped animals in a broiler chicken population. J Anim Breed Genet Z Für Tierz Zücht. 128:386–393. Su G, Christensen OF, Ostersen T, Henryon M, Lund MS. 2012. Estimating additive and non-additive genetic variances and predicting genetic merits using genome-wide dense single nucleotide polymorphism markers. PLoS One. 7:e45293. Tenesa A, Haley CS. 2013. The heritability of human disease: estimation, uses and abuses. Nat Rev Genet. 14:139–149. The MathWorks, Inc. 2015. MATLAB and Statistics Toolbox Release 2015b. Natick (MA): The MathWorks, Inc. VanRaden PM. 2008. Efficient methods to compute genomic predictions. J Dairy Sci. 91:4414–4423. VanRaden PM. 2007. Genomic measures of relationship and inbreeding. Interbull Bull. 37:33–36. VanRaden PM, Van Tassell CP, Wiggans GR, Sonstegard TS, Schnabel RD, Taylor JF, Schenkel FS. 2009. Invited review: reliability of genomic predictions for North American Holstein bulls. J Dairy Sci. 92:16–24. Vitezica ZG, Varona L, Legarra A. 2013. On the additive and dominant variance and covariance of individuals within the genomic selection scope. Genetics. 195:1223–1230. Wilson AJ, Réale D, Clements MN, Morrissey MM, Postma E, Walling CA, Kruuk LE, Nussey DH. 2010. An ecologist’s guide to the animal model. J Anim Ecol. 79:13–26. Wolak ME. 2012. nadiv : an R package to create relatedness matrices for estimating non-additive genetic variances in animal models. Methods Ecol Evol. 3:792–796. Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, et al. 2010. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 42:565–569. Yang J, Lee SH, Goddard ME, Visscher PM. 2011. GCTA: a tool for genomewide complex trait analysis. Am J Hum Genet. 88:76–82. Ziegler A, König IR and Thompson JR. 2008. Biostatistical aspects of genomewide association studies. Biom J Biom Z. 50:8–28.

PHYLUCE is a software package for the analysis of conserved genomic loci.

Exploiting biological priors and sequence variants enhances QTL discovery and genomic prediction of complex traits.

MP3: a software tool for the prediction of pathogenic proteins in genomic and metagenomic data.

LS-VISM: A software package for analysis of biomolecular solvation.

QuantWorm: a comprehensive software package for Caenorhabditis elegans phenotypic assays.

Poly-omic prediction of complex traits: OmicKriging.

[SIGAPS: a software package for the evaluation of medical publications].

obitools: a unix-inspired software package for DNA metabarcoding.

MultiBLUP: improved SNP-based prediction for complex traits.

The critical need for computational methods and software for simulating complex genetic and genomic data.

Kernel-based whole-genome prediction of complex traits: a review.

gr-MRI: A software package for magnetic resonance imaging using software defined radios.

cgDNA: a software package for the prediction of sequence-dependent coarse-grain free energies of B-form DNA.

Genomic prediction of traits related to canine hip dysplasia.

Assessing models for genetic prediction of complex traits: a comparison of visualization and quantitative methods.

Multiple-trait QTL mapping and genomic prediction for wool traits in sheep.

Accuracy of genomic prediction for growth and carcass traits in Chinese triple-yellow chickens.

Genome-Assisted Prediction of Quantitative Traits Using the R Package sommer.

Regularized machine learning in the genetic prediction of complex traits.

GAMOLA2, a Comprehensive Software Package for the Annotation and Curation of Draft and Complete Microbial Genomes.

The Nature of Genetic Variation for Complex Traits Revealed by GWAS and Regional Heritability Mapping Analyses.

Genetic pleiotropy in complex traits and diseases: implications for genomic medicine.

ProtPOS: a python package for the prediction of protein preferred orientation on a surface.

sgRNAcas9: a software package for designing CRISPR sgRNA and evaluating potential off-target cleavage sites.