Imputing genotypes using regularized generalized linear regression models.

Stat. Appl. Genet. Mol. Biol. 2014; 13(5): 519–529

William W.L. Wong, Josh Griesman and Zeny Z. Feng*

Imputing genotypes using regularized generalized linear regression models Abstract: As genomic sequencing technologies continue to advance, researchers are furthering their understanding of the relationships between genetic variants and expressed traits. However, missing data can significantly limit the power of a genetic study. Here, the use of a regularized generalized linear model, denoted by GLMNET, is proposed to impute missing genotypes. The method aims to address certain limitations of earlier regression approaches in regards to genotype imputation, particularly the specification of the number of neighboring SNPs to be included for imputing the missing genotype. The performance of GLMNET-based method is compared to the conventional multinomial regression method and two phase-based methods: fastPHASE and BEAGLE. Two simulation scenarios are evaluated: a sparse-missing model, and a small-panel expansion model. The sparse-missing model simulates a scenario where SNPs were missing in a random fashion across the genome. In the small-panel expansion model, a set of individuals is only genotyped at a subset of the SNPs of the large panel. Each imputation method is tested in the context of two data-sets: Canadian Holstein cattle data and human HapMap CEU data. Results show that the proposed GLMNET method outperforms the other methods in the small panel expansion scenario and fastPHASE performs slightly better than the GLMNET method in the sparse-missing scenario. Keywords: elastic net; genotype imputation; regularized generalized linear models; single nucleotide polymorphism (SNP). DOI 10.1515/sagmb-2012-0044

1 Introduction In the wake of rapidly advancing genotyping techniques, genome-wide association studies (GWAS) are increasingly being used to identify relationships between genotypes and phenotypes (Hirschhorn and Daly, 2005). Following from the central dogma of biology, it is suggested that polymorphisms in DNA can explain the phenotypic variability found in expressed traits. The single nucleotide polymorphism (SNP) is the most common form of variation in DNA. Over 40 clinical conditions have been shown to be directly correlated with SNPs in humans (Cichon et al., 2009). Frayling (2007) identified 11 genomic locations that affect the risk of type 2 diabetes. Easton and Eeles (2008) discussed previous studies related to several cancer types, and noted that over 20 separate genes have been shown to be correlated with cancer. Modern advances allow for the genotyping of millions of SNPs, and, in doing so, the probability of identifying SNPs correlated with the disease increases (Li et al., 2009). However, with such large data sets, the quantity of missing data becomes substantial. A study will be significantly hindered if it is limited to only the intersection of genotyped SNPs for all subjects. The dimensions of the data set shrink significantly, and one is left with less statistical power when arriving at genes responsible for a given outcome (Li et al., 2009). Imputing the missing genotypes can help to solve this problem, and enable a more powerful study. *Corresponding author: Zeny Z. Feng, Department of Mathematics and Statistics, University of Guelph, 50 Stone Road East, Guelph N1G2W1, ON, Canada, e-mail: [email protected] William W.L. Wong: Toronto Health Economics and Technology Assessment Collaborative, Leslie Dan Faculty of Pharmacy, University of Toronto, 6th Floor, Room 658, 144 College Street, Toronto M5S 3M2, ON, Canada Josh Griesman: Royal College of Surgeons in Ireland, 123 St. Stephens Green, Dublin 2, Ireland

Brought to you by | University of Connecticut Authenticated Download Date | 5/20/15 4:38 AM

520 W.W.L. Wong et al.: Imputing genotypes using regularized GLMs Recently, genotype imputation has been used in another practical situation, where a “small” panel of SNPs is expanded to a “large” panel via imputation (Daetwyler et al., 2011). This is in contrast to the sparsetype imputation described in the previous paragraph. A “large” panel refers to a high-density SNP set and a “small” panel refers to a low-density SNP set that may be a subset of the large panel. A sample of individuals that were genotyped with a large panel can serve as a reference panel for the imputation of those who have only been genotyped with a small panel. This scenario happens in several practical situations. For example, one may encounter a situation where the technology at the time could only genotype a limited number of SNPs. As technologies improve, and the genotyping density increases, the original samples may not be available to be re-genotyped. A similar situation could arise if different samples have been genotyped using different chip technologies. On the other hand, the cost of genotyping a less dense SNP set is lower. So, using a less dense chip and complementing with an accurate imputation method for untyped SNPs becomes practical and cost effective when resources are limited. Imputing untyped SNPs and merging with the large panel sets enables an increase in sample size for genetic studies and thus increases the statistical power when identifying important genes that may have otherwise been missed. Generally, methods of genotype imputation are based on the concept that SNPs close together on a chromosome are often inherited together. The resulting correlations among SNPs are referred to as linkage disequilibrium (LD) or association in genetic literature (Ardlie et al., 2002). Many methods for imputing missing genotype have been suggested and tested. Here, two method classes are considered: phasing and regression. Phase-based methods consider haplotype structure and common descent patterns (Li et al., 2009). As humans and animals are diploid, a genotype is the combination of maternal and paternal alleles. Alleles close together on a chromosome are typically inherited together in a whole unit as a haplotype. Phase-based algorithms try to split genotypes at SNPs into haplotypic phases. Here, a “phase” is simply an inferred parental haplotype. Once phased, missing alleles can be estimated from neighboring haplotype alleles through their LD relationship, and the inferred alleles are then combined to impute the missing genotype. Many current and popular genotype imputation methods are phase based. fastPHASE (Scheet and Stephens, 2006) is a significant improvement on the previous PHASE in that it allows for the computationally feasible haplotyping of genome-wide data (Browning and Browning, 2011). BEAGLE (Browning and Browning, 2007) and IMPUTE2 (Howie et al., 2011) are significantly faster than fastPHASE and scale more effectively with increasing sample size. IMPUTE2 is particularily useful when the data can be stratified into clusters of similar haplotypes due to recent common ancestry. When a sample consists of correlated subjects, ChromoPhase (Daetwyler et al., 2011) incorporates relationship data among subjects for haplotype phasing. However, this method requires information of relationships among subjects for the imputation. Another approach is to use regression models to impute the missing genotypes by using flanking SNPs as covariates (Li et al., 2009). In Souverein et al. (2006), multinomial regression models were used to impute missing genotypes based on the fitted correlation among available SNPs. Yu and Schaid (2007) used linear regression models. Regression-based methods face a common problem in variable selection: it can be difficult to select which available SNPs should be included as covariates. One reason for this is that LD patterns are not homogenous across the genome (Ardlie et al., 2002); for example, lower LD would be expected among SNPs in recombination hotspots than in low recombination regions (high LD regions). Therefore, fewer SNPs may be useful as covariates in lower LD regions. In previous studies, the number of flanking SNPs used to impute the missing genotype has been chosen manually or with a fixed size across the whole genome. These limitations made regression methods less attractive and less accurate. To overcome this problem, the use of a regularized multinomial regression model for imputing missing genotypes is proposed in this paper. Friedman et al. (2010) proposed regularized generalized linear models that include l1 (the lasso) and l2 (ridge regression) penalties to efficiently select relevant covariates. The regularized model also allows the number of covariates p to greatly outnumber the sample size n, which is typical in genetic studies. With a wide enough window size, the regularized multinomial regression model allows for automatic selection of SNPs. For easy reference, we name our proposed imputation method the GLMNET method throughout the paper. This paper describes a simulation study based on two real data sets: Canadian Holstein cattle data and human HapMap CEU data. The GLMNET method is evaluated under two scenarios: a sparse-missing model


W.W.L. Wong et al.: Imputing genotypes using regularized GLMs 521

and a small-panel expansion model. The sparse-missing model simulates missing genotypes in random locations across all subjects and all SNPs. The small-panel expansion simulates a situation where a large portion of the SNPs are not genotyped for all individuals. In the sparse-missing scenario, Scheet and Stephens (2006) demonstrated that fastPHASE has significantly lower error rates than other competing phase-based imputation techniques. Its comparative efficacy was also confirmed by Yu and Schaid (2007). However, fastPhase is generally considered to be computationally intensive as this method takes into account all observed genotypes on a chromosome when imputing each missing genotype. In contrast, BEAGLE is more efficient as it focuses on the nearby observed genotypes when imputing each missing genotype. A later study by Calus et al. (2011) indicates that BEAGLE outperforms fastPHASE under some situations, for example, when the sample consists of multiple extended pedigrees. So, in this paper, the performance of our proposed method is compared with fastPHASE and BEAGLE. In addition, the proposed method is also compared with the conventional multinomial regression method that can be easily implemented using the “glm” function in R. The paper is organized as follows. In Section 2, we describe our method which centres around the regularized multinomial regression model. In Section 3, we detail the procedures and settings of the simulation study and the simulation results. The paper then concludes with discussions in Section 4.

2 Methods 2.1 Regularized generalized linear regression Friedman et al. (2010) suggested models that generalized the elastic net penalty to non-continuous responses. The elastic net model (Zou and Hastie, 2005) combines the l1 penalty (LASSO) and the l2 penalty (ridge). The elastic net penalty enables the model to handle situations where the number of covariates (p) is larger than the sample size (n). The elastic net is able to select large numbers of correlated predictors, whereas the LASSO typically only selects one variable to represent the group and does so indiscriminately (Zou and Hastie, 2005). In the context of genotype imputation, it is likely that a small number of samples are genotyped for a large number of SNPs, such that n is much smaller than p. Also, many of these SNPs are likely to be correlated due to linkage disequilibrium. By incorporating the elastic net penalty into the generalized linear regression model, one increases stability and predictive accuracy. The remaining sections briefly present, respectively, the LASSO, ridge regression, elastic net, and regularized generalized linear regression models, specifically the multinomial regression model.

2.1.1 LASSO Tibshirani (1996) developed the Least Absolute Shrinkage and Selection Operator (LASSO), a variable selection method that uses an l1 penalty to shrink the coefficients and to set some coefficients for unimportant covariates to zero. Consider a linear regression model in the form p

Y = β0 + ∑ β j X j + ε, ε~ N (0, σ 2 ), j=1

where Y ∈R is the response variable and X = (X1, …, Xp) is a vector that has p predictors. The LASSO attempts p to find a model that minimizes the sum of squares of residuals subject to a constraint ∑ j =1 | β j | ≤t , where t is the parameter that controls the amount of shrinkage applied to the coefficients and allows for the removal of irrelevant variables from the model by setting some β’s to zero. The LASSO estimate of ˆβ = ( ˆβ0 , ˆβ 1 ,..., ˆβ p ) is given by


522 W.W.L. Wong et al.: Imputing genotypes using regularized GLMs

2 n  p ˆβ = argmin   y − β − β x   , ∑ ∑   i j ij 0 β   j =1  i=1  

p

∑ | β |≤t.

s.t

j =1

(1)

j

p The value of t ranges from 0 to ∑ j =1 | β j |, where β j is the ordinary least squares (OLS) estimate of βj. A cross-validation technique is usually employed to select the value of t that minimizes the mean square prediction error.

2.1.2 Ridge regression The ridge regression model imposes an l2 penalty

∑

p j =1

β 2j ≤ t , which causes OLS coefficients to shrink (Zou and

Hastie, 2005). The ridge regression estimate of ˆβ = ( ˆβ0 , ˆβ 1 ,..., ˆβ p ) is given by

2 n  p ˆβ = arg min   y − β − β x   , ∑ ∑   i j ij 0 β   j=1  i=1  

p

∑β

s.t

j=1

2 j

≤t.

(2)

Models built with the ridge penalty are not sparse, i.e., unimportant covariates are not filtered out. Therefore, with a large p, ridge regression may lead to a model that is complicated and difficult to interpret.

2.1.3 Elastic net The elastic net (Zou and Hastie, 2005) uses a novel penalty that is partly LASSO (l1) and partly ridge (l2). The p p elastic net penalty is given by pα(β) = (1–α)|β|2+α|β|, where | β | 2 = ∑ j=1 β 2j and | β | = ∑ j =1 | β j |. The estimate of ˆβ = ( ˆβ , ˆβ ,..., ˆβ ) subject to the elastic net penalty is given by 0 1 p

2 n  p ˆβ = arg min   y − β − β x   , ∑ i 0 ∑ j ij  β   j =1  i=1  

s.t pα ( β ) ≤ t

(3)

α takes values between 0 and 1 and controls the weighting between the l1 and l2 penalties. The elastic net penalty allows for the removal of unimportant covariates through the l1 penalty and for the handling of collinearity among covariates via the l2 penalty. 2.1.4 Regularized multinomial regression models Where the elastic net is used to predict quantitative responses, Friedman et al. (2010) generalized the elastic net model to categorical responses. In generalized linear models, a multinomial regression model can be used when the response is categorical with more than 2 levels. The logistic link function that is extended from the logistic regression model for binary responses is used and has the form log

P( Y = l | x ) P( Y = K | x )

p

= β0 l + ∑ β jl x j , l = 1,..., K − 1, j=1

if the response Y has K levels. In Friedman et al. (2010) a more symmetric model is used instead: P( Y = l | x ) =

e

∑

β + 0l

K k =1

e

∑ pj= 1 β jl x j

β + 0k

∑ pj= 1 β jk x j


.

(4)

W.W.L. Wong et al.: Imputing genotypes using regularized GLMs 523

However, this symmetric model leads to multiple solution parameter estimates where each l = 1, …, K, (β0, βl) and (β0+c0, βl+c) give the same probabilities. The model is fit by the regularized maximum likelihood to solve this problem. Suppose we wish to maximize the penalized log-likelihood

p K  1 n  K   K β0 l +∑ pj= 1 β jl xij    − λ∑ pα ( β l )  , max  ∑ ∑yik ( β0 l + ∑ β jl xij ) − log  ∑e  β , β  n i=1  l=1  l=1   l=1 j=1   0 

(5)

where β0 = (β01, …, β0K)T, β = ( β T1 ,..., β TK )T such that each component βl = (β1l, …βpl)T for l = 1, …, K, and λ is the tuning parameter that determines the magnitude of the elastic net penalty. To solve the multiple solution problem, an improved estimate of the β’s can be obtained by solving K

and the solution of c is achieved by

T min∑ pα ( β l − c ), c = ( c1 ,...,c p ) , c

l=1

(6)

K

1  c j = argmin∑  (1− α )( β jl − t ) 2 + α | β jl − t |  . t  l= 1  2 The regularized multinomial regression method is implemented in R as part of the GLMNET function. The tuning parameter λ is chosen by cross-validation. The weight parameter α is set at 1, which is the default setting in the GLMNET function in R. Note that the default choice of α = 1 corresponds to the lasso penalty. When α = 0, the penalty corresponds to the ridge regression penalty that is known to shrink the coefficients of correlated predictors towards each other. However, in the application of SNP selection for imputing missing SNPs, it is better to utilize the strong association pattern among neighboring SNPs to predict those that are missing. This is particularly true when a tag SNP is missing; all SNPs represented by this tag SNP should be included for predicting the tag SNP, despite these SNPs being in somewhat high linkage disequilibrium.

2.2 GLMNET imputation procedure For each SNP i along the chromosome, a training set and test set were designated such that the test set contained subjects who had an unknown genotype at SNP i and the training set contained subjects that were fully genotyped at that SNP. A window of K SNPs on either side of SNP i were used to fit the model. Of these 2K SNPs, GLMNET automatically selected l ( ≤ K) SNPs that best explained the variation in SNP i. The fitted model was then used to predict the missing genotypes in the test set. Imputation for the sparse missing simulation required a two-step procedure. First, all missing genotypes were imputed using the observed mean genotype at each SNP. Then, each missing genotype was re-imputed using GLMNET, using the means as temporary covariates. This was done to ensure a full training panel for imputation. Had the missing genotypes not been initially filled, GLMNET would be limited to a small number of covariates, some of which may not have been informative. This step was also required in the imputation using the multinomial regression method. However, this extra step was not required for the small panel imputation, as the reference panel was consistent.

3 Simulation studies 3.1 Data description Two real data sets were utilized for the simulation experiment: Canadian Holstein cattle data and human HapMap CEU data. The Canadian Holstein cattle data contained information pertaining to 725 bulls born


524 W.W.L. Wong et al.: Imputing genotypes using regularized GLMs between 1965 and 2001. Each bull was genotyped using the llumina BovineSNP50KTM Bead chip for 38,417 SNPs that span 30 chromosomes. Because SNPs are biallelic, genotypes were coded as 0, 1, or 2, with 0 and 2 being the homozygous genotypes and 1 representing the heterozygous genotype. In our study, we only focused on imputation on the 29 autosomes. Bulls in the data set were related and some were inbred, resulting in complicated relationships. All methods used in this study – GLMNET, fastPHASE, BEAGLE, and the multinomial regression – require subjects to be independent. We were curious to see how each method would handle these dependencies. However, for a fair comparison of the performance of the proposed method with the other three methods, human data containing independent individuals were also used. The human data CEU data set was obtained from the International HapMap Project. The data represent 90 Utah residents from 30 trio families (father, mother, child) with Northern and Western European ancestry. To obtain data with independent subjects, the children were removed, leaving a total of 60 unrelated subjects (parents are assumed to be unrelated to one another). For the CEU data, each individual was genotyped at more than four million SNPs across the human genome. For the sake of illustration, we performed the simulation on chromosome 22, in which 56,011 SNPs were genotyped.

3.2 Simulation settings The data sets were pre-processed before use in our simulation study. This involved removing all SNPs that were not successfully genotyped for all individuals. This was done because it would have otherwise been impossible to calculate prediction accuracies if the true values were unknown. SNPs that were not considered to be polymorphic were removed, i.e., SNPs with any genotype frequency > 95% or with minor allele frequencies

Pediatric readmission classification using stacked regularized logistic regression models.

Modeling the frequency of opposing left-turn conflicts at signalized intersections using generalized linear regression models.

Markov Boundary Discovery with Ridge Regularized Linear Models.

Linear mixed function-on-function regression models.

Regularized matrix regression.

Bioprocess data mining using regularized regression and random forests.

Use of generalized linear mixed models for network meta-analysis.

Generalized linear models with coarsened covariates: a practical Bayesian approach.

Covariance analysis in generalized linear measurement error models.

Flexible analysis of digital PCR experiments using generalized linear mixed models.

A generalized analog implementation of piecewise linear neuron models using CCII building blocks.

Detection of genomic loci associated with environmental variables using generalized linear mixed models.

Estimate of influenza cases using generalized linear, additive and mixed models.

Measuring the individual benefit of a medical or behavioral treatment using generalized linear mixed-effects models.

Commensurate Priors for Incorporating Historical Information in Clinical Trials Using General and Generalized Linear Models.

Speed limit reduction in urban areas: a before-after study using Bayesian generalized mixed linear models.

The Dantzig Selector for Censored Linear Regression Models.

Modelling subject-specific childhood growth using linear mixed-effect models with cubic regression splines.

Genomic prediction based on data from three layer lines using non-linear regression models.

Censored linear regression models for irregularly observed longitudinal data using the multivariate- t distribution.

On constrained and regularized high-dimensional regression.

Discussion of 'Regularized Regression for Categorical Data'.

Least square regularized regression in sum space.

The use of generalized linear models and generalized estimating equations in bioarchaeological studies.