YGENO-08699; No. of pages: 7; 4C: Genomics xxx (2014) xxx–xxx

Contents lists available at ScienceDirect

Genomics journal homepage: www.elsevier.com/locate/ygeno

Detection of genomic loci associated with environmental variables using generalized linear mixed models Stéphane Lobréaux ⁎, Christelle Melodelima ⁎ Laboratoire d'Ecologie Alpine, UMR CNRS 5553, Université Joseph Fourier, BP53 38041, Grenoble, France

a r t i c l e

i n f o

Article history: Received 7 February 2014 Accepted 5 December 2014 Available online xxxx Keywords: GLMM Adaptation SNP Genome scan

a b s t r a c t We tested the use of Generalized Linear Mixed Models to detect associations between genetic loci and environmental variables, taking into account the population structure of sampled individuals. We used a simulation approach to generate datasets under demographically and selectively explicit models. These datasets were used to analyze and optimize GLMM capacity to detect the association between markers and selective coefficients as environmental data in terms of false and true positive rates. Different sampling strategies were tested, maximizing the number of populations sampled, sites sampled per population, or individuals sampled per site, and the effect of different selective intensities on the efficiency of the method was determined. Finally, we apply these models to an Arabidopsis thaliana SNP dataset from different accessions, looking for loci associated with spring minimal temperature. We identified 25 regions that exhibit unusual correlations with the climatic variable and contain genes with functions related to temperature stress. © 2014 Elsevier Inc. All rights reserved.

1. Introduction The recent evolution in sequencing technologies now allows affordable high-resolution genotyping of many individuals [1]. The discovery of genome-wide single-nucleotide polymorphisms (SNPs) may be based on full genome resequencing [2–4] or targeted sequencing of genomic regions such as exon or sequences adjacent to a restriction enzyme cleavage site [5,6]. The large genetic variation data sets produced with extensive genome coverage offer the possibility to analyze and compare polymorphisms between individuals. Genome-wide association studies (GWAS) have been used to identify the association between genetic loci and phenotypes such as traits or diseases from SNP data using statistical modeling [7,8]. Hundreds of genes implicated, for example, in human diseases [9,10] or important plant traits have been discovered [11]. Next-generation sequencing technologies (NGS) open up the possibility to detect many rare variants to improve such association mapping strategy [12]. Another approach is to scan for genomic loci associated with environmental variables, such as climatic data, for example, that could be involved in the adaptation to specific conditions of the species of interest [13]. There is a growing interest in identifying factors that influence adaptation in species [14]. Genetic variability evolves as a result of environmental factors imposing selective pressure on part of the genome, generating changes in allele frequency in such regions. Looking for loci associated with environmental variables

⁎ Corresponding authors. Fax: +33 476514279 15. E-mail addresses: [email protected] (S. Lobréaux), [email protected] (C. Melodelima).

requires statistical modeling as GWAS to find an allele distribution fitting for the chosen variable [13]. In both association study approaches, it is essential to take into account the population structure of the sampled individuals [15,16]. Indeed, hierarchical population structure can cause bias in loci detection, generating an excess of false positives [15]. The main problem is that without corrections for the effect of population structure, the underlying null distribution may be insufficient to account for demographic history. In GWAS, Generalized Linear Mixed Models have been shown to efficiently take into account the population structure [16]. Studies in maize, potato, and Arabidopsis revealed that such mixed models led to fewer false positives and were a powerful method [17–19]. GLMMs are an extension of a generalized linear models providing a more flexible approach for analyzing nonnormal data when random effects are present [20,21]. The most common types of random effects are variation among individuals, genotypes, species, and geographical regions or time periods. For instance, individuals within clusters (for example, at distances of only some meters) are genetically more similar than distant individuals between clusters. GLMMs can more efficiently account for population structure in comparison with previously used methods [22,23]. Using such models to detect the correlation between allele distribution and environmental data can be performed on samples collected in different sites, without acquiring population data such as allele frequency at the sampling sites. GLMMs are fast, while some Bayesian approaches, for example, require extensive computing time and are therefore more difficult to apply to very large genetic variability data sets [13,24]. In this study, we tested the use of GLMMs to detect associations between genetic loci and environmental variables. Then we applied these models to analyze an A. thaliana SNP data set from different accessions

http://dx.doi.org/10.1016/j.ygeno.2014.12.001 0888-7543/© 2014 Elsevier Inc. All rights reserved.

Please cite this article as: S. Lobréaux, C. Melodelima, Detection of genomic loci associated with environmental variables using generalized linear mixed models, Genomics (2014), http://dx.doi.org/10.1016/j.ygeno.2014.12.001

2

S. Lobréaux, C. Melodelima / Genomics xxx (2014) xxx–xxx

[2] and to identify 25 regions that exhibit unusual correlations with climatic variables and contain 18 genes with functions related to temperature stress. 2. Materials and methods 2.1. Simulations Genotype simulations were performed using the GenomePop2 program [25], allowing the simulation of chromosomes and their evolution under different scenarios. All simulations were performed under a biallelic model, with independent SNPs, with population sizes of 100 individuals, and with a mutation rate of 10−4. To reproduce a situation where different sampling sites are located in different geographical regions, clusters that mimic geographical regions were defined. Ten clusters were run in parallel. In each cluster, 20 populations were simulated, and a migration rate of 1% was applied in an island model. For each cluster, a 20-step gradient of selective coefficient was defined, and each step value was applied to one population in the cluster. The gradient names 0.01, 0.02, 0.03, 0.05, 0.1, and 0.2 corresponded to selective coefficients in a range, respectively, of −0.01 to 0.01, −0.02 to 0.02, −0.03 to 0.03, −0.05 to 0.05, −0.1 to 0.1, and −0.2 to 0.2. For SNPs to which selective pressure was applied, the initial allele frequency in the population was set to 0.5. For neutral SNPs, the initial frequency was set as a random value between 0.1 and 0.9 in each cluster. Each simulation run consisted of 1000 generations, and 25 runs were performed. Simulations were repeated eight times for each cluster with new initial allele frequency for neutral SNPs. A total of 2.105 SNPs under selective pressure and 2.106 neutral SNPs were simulated. In all subsequent analysis, at least 2000 SNPs under selective pressure and 20,000 neutral SNPs were sampled randomly and independently from the simulated data set. The selective coefficient applied to an individual was assimilated to the environmental variable. 2.2. Detection of potentially adaptive loci using generalized linear mixed models GLMMs were used to detect SNPs correlated to environmental variables. The general form of the model (in matrix notation) is Y ¼ Χβ þ Ζγ þ ε ;

ð1Þ

where Y is a column vector, the outcome variable; X is a matrix of the predictor variables; β is a vector of fixed-effects regression coefficients; Z is a design matrix for the random effects (the random complement to the fixed X); γ is a vector of random effect (the random complement to the fixed β); and ε is a vector of the residual, the part of Y that is not explained by the model Χβ + Ζγ. In GLMMs, environmental variables are introduced as fixed effect, while geographical proximity is modeled by random effect. In the model, the matrix Ζγ models the part of the genetic variation that cannot be explained by the environmental pressures. A GLMM model with a logit link and binomial error distribution was estimated between each SNP and the selected environmental variable. The likelihood ratio (LR) and the Wald tests were used to evaluate each model's performance [22]. For both tests, the null hypothesis corresponds to no correlation between a particular SNP and an environmental variable. The Wald test for GLMMs tests the null hypothesis of no effect by scaling parameter estimates by their estimated standard errors and comparing the resulting test statistic to zero [26]. The LR test determines the contribution of a single (random or fixed) factor by comparing the fit (measured as the deviance, that is, minus two times the log-LR) for models with and without the factor, namely, nested models. All GLMM models were calculated using the R package lme4.

2.3. Statistical analysis of false and true positive rates McNemar tests and p value corrections have been applied to test the effect of the different thresholds of the Wald and the LR tests on false and true positive rates. An analysis of variance has been applied to test the sampling effect (e.g., the number of sampled populations, sampled sites per population, and sampled individuals per site). Friedman and Wilcoxon paired tests have been used to test the effect of the different selection intensities on the rate of true positives. 2.4. Arabidopsis thaliana SNP dataset Eighty A. thaliana ecotypes sampled across Europe have been submitted for genomic DNA sequencing in the frame of the 1001 Genomes Project (http://1001genomes.org) by Cao et al. [2]. Sequence reads were mapped against the Columbia ecotype reference genome to detect genetic polymorphisms [2]. From this data set, we retained 78 ecotypes, removing two samples for which data quality was lower. SNPs were filtered according to the following criteria: only biallelic sites were retained, positions for which genotyping data were not available for more than two ecotypes which were discarded, and a minimal coverage threshold of 5 was applied. The resulting filtered data correspond to a high density of 1 SNP/125 bp. The 948,330 SNPs of the filtered data set were distributed across the five A. thaliana chromosomes as follows: chr1 (247,859 SNPs), chr2 (146,826), chr3 (179,642), chr4 (153,061), and chr5 (220,942). The climatic data were extracted from the WorldClim database (http://www.worldclim.org) with a spatial resolution of 30 arcsec. We used the R package raster for that purpose, according to the GPS coordinates of each sampling site (http://1001genomes.org/projects/ MPICao2010/index.html). The average minimal temperature was calculated over the period from April to June, corresponding to spring in which plants are under vegetation in all sampled sites. The detection of SNP markers potentially associated with minimal temperature was conducted as described above using GLMM. 3. Results and discussion 3.1. Simulations 3.1.1. Allele frequency and selection intensities We evaluated the effect of the selective intensities on allele frequencies of our simulation. For this purpose, we generated a plot of allele frequencies under the different intensities of selection (Fig. 1). A clear structuring of allele frequencies was observed in clusters when the gradient of selection intensity was strong (gradient = 0.05–0.2). For a gradient of selection of 0.03 and 0.02, the structuring of allele frequencies in the cluster was present, although it was lower than that under strong selection. Finally, with a weaker gradient of selection (0.01), the structuring was not discernible. 3.1.2. Influence of the Wald and the LR tests on true and false positives False positives are a major concern when using GLMMs in association studies. A strategy adopted by Joost et al. [22] was to jointly use the Wald and the LR test results to reach a higher stringency. In the frame of our study, we have investigated the influence of the two tests over the false positive and false negative rates obtained when using GLMMs. Results are presented for the gradient 0.05 which selection intensities values were assimilated to the environmental variable, but similar results were obtained for all selection intensities. For each test, four different thresholds 5.10−5, 5.10−4, 5.10−3, and 5.10−2 were used. The false positive rate was strongly influenced by the LR test threshold, but not by the Wald test threshold; differences of 14.7% (CI, [11.8; 17.57]) and 0.6% (CI, [0.3; 0.9]) between thresholds were obtained for the LR and the Wald tests, respectively. In large-scale genomic experiments involving thousands of statistical tests

Please cite this article as: S. Lobréaux, C. Melodelima, Detection of genomic loci associated with environmental variables using generalized linear mixed models, Genomics (2014), http://dx.doi.org/10.1016/j.ygeno.2014.12.001

S. Lobréaux, C. Melodelima / Genomics xxx (2014) xxx–xxx

3

Fig. 1. The effect of different gradient intensities on allele frequency in simulated data.

(thousand markers), many univariate models have to be run simultaneously in order to detect markers likely to be associated with environmental data. It is recognized that when one wishes to test several hypotheses at a common significance level α simultaneously, the probability of rejecting at least one of the hypotheses being tested that is in fact true is typically much higher than α (multiple hypotheses testing). Although the Bonferroni correction is known to be very conservative [27,28], this correction was applied [29] on LR p value to efficiently limit the number of false positives. To evaluate the calibration of p values, we used simulated data under a null hypothesis of no association with any environmental variable. Fig. 2A reports the empirical cumulative distribution function (ECDF) of p values for the LR and the Wald tests. p values are well calibrated when their ECDF is close to the uniform distribution, represented by the bisector line. Below the line, the test is conservative. For the Wald test, the ECDF was close to a uniform distribution, and p values without correction were correctly calibrated (Fig. 2A). On the other hand, the distribution of p values for the LR test showed a strong departure from the uniform distribution (Fig. 2A). In this case, the tests were too liberal and produced a large number of false positives. After Bonferroni correction, LR test p values were well calibrated (Fig. 2A). In detecting the true positives, the LR test has outperformed the Wald statistic, the percentage of true positives was always greater with the LR test than with the Wald test (difference average, 4%; CI, [2%; 15%]). The Wald test has often failed to reject the null hypothesis as suggested by Hauck and Donner [30]. However, as shown by Tu and Zhou [31], the size of the sample influences the performance of the Wald test. Thus, when the size of the sample was large, the performance of the Wald test was satisfactory (around 2% of the difference between the LR and the Wald tests).

3.1.3. Factors affecting the tests' performances 3.1.3.1. Sampling effect. The effect of sampling was investigated using different numbers of sampled clusters, sampled sites per cluster, and sampled individuals per site. For logistic regression, De Mita et al. [23] have shown that the sampling, including many populations with only one sample per population, provided the best results. Our simulations confirmed this result for GLMMs: a significant effect of the number of individuals per site (p value = 2.10−16) was observed on the false positive rate, which increases for the sampling incorporating 10 individuals per site (Fig. 3). Furthermore, the distributions of Wald's p values for tests based on 3 or 10 individuals showed a strong departure from the uniform distribution (Figs. 2B, C), these tests produced a large number of false positives. LR distribution (Figs. 2B, C) based on 3 or 10 individuals showed an excess of low and high p values. Using one individual, p values for the Wald and the LR tests were well calibrated (Fig. 2A). In contrast, the number of clusters and the number of sites per cluster had no effect on the false positive rate (p value = 0.4 and 0.15, respectively). Results obtained for true positive rates were opposite to results obtained for false positive rates. The number of individuals had a low influence on the true positive rate (Fig. 4A and Table 1), whereas a strong significant effect was observed according to the number of sites or the number of clusters (Fig. 4A and Table 1). The true positive rate increased with the number of sites or clusters. As shown in Table 1, these results were independent of the selection intensities for values in a range of 0.03 to 0.2. However, if the selection intensity was weak (i.e., values between 0.01 and 0.02), the number of individuals has an effect on the true positive rate. To limit false positives, it would be interesting to select only one individual per site and to increase the number of distant sites. This

Please cite this article as: S. Lobréaux, C. Melodelima, Detection of genomic loci associated with environmental variables using generalized linear mixed models, Genomics (2014), http://dx.doi.org/10.1016/j.ygeno.2014.12.001

4

S. Lobréaux, C. Melodelima / Genomics xxx (2014) xxx–xxx

Fig. 2. Empirical cumulative distribution function (ECDF) of p values for GLMMs, with 1, 3, or 10 individuals per site. The Wald and the LR tests were used.

strategy might sample individuals with a high rate of differentiation between clusters, where individuals within clusters might be redundant. A low true positive rate might be explained by a correlation between environmental variables and population structure. 3.1.3.2. Selection gradient effect on true positive rate. We evaluated the effect of the intensity of selection on the true positive rates (Fig. 4B). Thus, the power of this method to detect selected loci might be reduced. The selection intensity significantly affected the power of the method (p value = 2.10−16). GLMMs appeared to have the best power when using discriminant selection intensities (gradient = 0.02–0.05), with no significant difference between gradient 0.03 and gradient 0.05 (p value = 0.34). When a very low selection coefficient was applied (e.g., gradient 0.01), GLMMs hardly detected these loci that are very close to neutral SNPs when one or three individual genotypes are analyzed. With 10 individuals, the true positive rate reached 21.25% of the SNPs, however, in these conditions the measured false positive rate was 13%. GLMMs are correlative methods and their results depend strongly not only on the selection intensity but also on the correlation between the genotype and the environmental variable. In our simulations, the environmental variable was known without error, which might not be the case with real data. The choices of sampling strategy and the environmental variable are important and therefore need to be optimized.

3.2. Detection of A. thaliana genomic regions associated with the minimal temperature variable Simulation studies indicate that GLMMs could represent an efficient statistical tool to detect loci associated with environmental variables. In order to further test this hypothesis, an A. thaliana SNP data analysis was conducted. A data set of genetic variations as SNPs has been produced by Cao et al. [2] from A. thaliana ecotypes sampled across Europe. Eight different geographical regions were sampled, and the genotype for one individual from each sampling site has been reported. To evaluate the power of GLMM models and the number of false positives expected on this data set, we have simulated 25,000 loci for which configuration was close to the A. thaliana data: 8 clusters, 10 sites per cluster, and 1 individual per site. Different selective gradients have been tested: 0.1, 0.05, 0.03, and 0.02. The GLMM performances are shown in Table 2. For a level of selection in a range of 0.1 to 0.03, the true positive rate was very good: 89.3% to 98.9% (Table 2). We observed the loss of power caused by the decrease in selection strength to 0.02 (Table 2). The false positive rate expected was 0%. These results support the fact that this data set is well adapted for an accurate GLMM detection of loci associated with environmental variables. Temperature is a well-known climatic factor affecting plant growth and influencing plant distribution and productivity [32]. Temperature

Please cite this article as: S. Lobréaux, C. Melodelima, Detection of genomic loci associated with environmental variables using generalized linear mixed models, Genomics (2014), http://dx.doi.org/10.1016/j.ygeno.2014.12.001

S. Lobréaux, C. Melodelima / Genomics xxx (2014) xxx–xxx

5

Fig. 3. The influence of sampling design on the false positive rates obtained with GLMMs for simulation of 10 clusters, using different numbers of sites and different numbers of individuals per site.

A

B

Fig. 4. True positive rates obtained with GLMM analysis using simulated data sets. (A) True positive rates obtained with a gradient of 0.05 for 10 clusters, using different numbers of sites and individuals per site. (B) True positive rates obtained with different gradient intensities for 10 clusters and 6 sites, using 1, 3, or 10 individuals per site.

represents a major driving factor for plant adaptation, and therefore, we chose in our test to look for loci associated with minimal temperature using GLMMs. Climatic data extracted from the WorldClim database for each sampling sites are in a range of − 7 °C to 12 °C. Such a large Table 1 Test of sampling effect (number of sites and individuals) through variance analysis. (For instance g0.05 corresponds to selective coefficients in a range of −0.05 to 0.05). Effect (p value) Number of sites/clusters Number of individuals

g0.01 −4

9.10 2.10−16

g0.02 −7

6.10 1.10−5

g0.03 3.10 0.05

−5

g0.05 5.10 0.74

−6

g0.1 2.10 0.7

g0.2 −11

2.10 0.47

−16

range is favorable to detect loci associated with this climatic variable and potentially linked to adaptation in plants growing in contrasted climatic conditions. Table 2 False and true positive rates obtained with 8 clusters, 10 sites per cluster and 1 individual per site. False positive rate

0%

True positive rate: gradient = 0.1 True positive rate: gradient = 0.05 True positive rate: gradient = 0.03 True positive rate: gradient = 0.02

89.3% 98.9% 97.9% 27.6%

Please cite this article as: S. Lobréaux, C. Melodelima, Detection of genomic loci associated with environmental variables using generalized linear mixed models, Genomics (2014), http://dx.doi.org/10.1016/j.ygeno.2014.12.001

6

S. Lobréaux, C. Melodelima / Genomics xxx (2014) xxx–xxx

Fig. 5. Frequencies of SNPs associated with temperature along the A. thaliana chromosome 1 calculated over 20 kb sliding windows taking into account the total number of the SNPs in each window.

GLMM analysis was performed using a Wald threshold fixed to 0.05 and an LR threshold fixed to 0.05 after Bonferroni corrections of p values, which led to the identification of 14,287 A. thaliana SNPs significantly associated with spring minimal temperature. This corresponds to 1.5% of the total number of SNPs. Fig. 5 shows the distribution of the associated SNPs along chromosome 1 as an example, indicating that some genomic segments present a higher frequency of SNPs associated with the climatic variable used. Environmental selective pressure most likely affects genomic regions where SNPs are in linkage disequilibrium (LD). Cao et al. [2] and Long et al. [4] have reported a strong decrease of LD within 5 kb in A. thaliana genome. An additional analysis step was therefore performed using overlapping sliding windows along the chromosomes to detect genomic regions significantly associated with minimal

Table 3 Candidate genes in detected genomic regions. Gene ID

Gene

Comments

At1g05160 ENT-Kaurenoic acid oxydase I Cold-inducible gene during seed involved in gibberellin biosynthesis germination [36] At1g05170 Galactosyltransferase Up-regulated during cold acclimation [37] At1g13260 Ethylene response DNA binding Up-regulated in response to low factor 4 temperature [38] At1g30500 Nuclear Factor Y, subunit A7 Induced by heat stress [39] At1g30510 Root-type ferredoxin: NADP(H) Induced by cold and heat [40] oxidoreductase (AtRFNR2) At2g22570 Nicotinamidase I (NICI) Cold stress induced [39] At2g28420 Glyoxylase I 8 Expressed in response to abiotic stresses [41] At2g31730 bHLH DNA-binding factor Involved in cold stress response [42] At2g43790 AtMAPK6 MAP kinase 6 Up-regulated under cold stress [43] At4g01320 CAAX protease AtSte24 Salt stress induced [44] At4g02330 Pectin methyl transferase Regulated by chilling stress [45] (AtPME41) At4g02380 Late embryogenesis abundant Regulated by low temperatures like protein 5 (AtLEA5) [46] At4g10970 Unknown protein Interacts with cold shock domain protein 3 [47] At5g16320 Frigida like 1 Required for winter-annual habit [48] At5g19690 Staurosporin and temperature Osmotic stress response [49] sensitive 3-like A At5g20630 Germin 3 (AtGer3) Temperature stress response [50] At5g62960 Unknown Insertional mutant affected in cold tolerance [51] At5g63450 Cytochrome P450 (CYP91B1) Up-regulated in response to low temperatures [39]

temperature. This approach allowed to focus on the more relevant regions and should also help reduce the false positive rate. For each window of 5 kb, the frequency of SNPs significantly associated with the climatic variable used was calculated. Only windows with a frequency higher than 0.6 were retained as positive. Twenty-five regions were identified for their association to minimal temperature. They are located on four different chromosomes, and their size and location are provided in Supplementary Table 1. Each genomic region has been analyzed for its gene content according to TAIR10 annotations available from the TAIR website (www.arabidopsis.org). Crossing the gene list from these regions together with functional data available for A. thaliana genes highlighted some candidate genes in these regions (Table 3). Some genes have been reported as regulated by temperature stress or in cold acclimation in Arabidopsis. Other genes are linked to the osmotic stress response for which a cross talk with the cold response has been shown [33]. At temperatures below freezing, ice crystal formation leads to membrane damage and dehydration stress [34]. This phenomenon links water and cold stress responses, and indeed, proteins involved in the water stress response also accumulate during freezing damage [35]. This list of genes is built according to the actual functional knowledge about temperature stress response in A. thaliana. We cannot, of course, rule out that other genes in these regions could be relevant to temperature adaptation in this plant, but for which functional data linking them to this stress are not available so far. Among the SNPs identified in each candidate gene, some of them correspond to nonsynonymous mutations leading to different protein sequences. These case study results argue, therefore, in favor of the efficiency of GLMMs in the detection of genomic loci associated with an environmental variable. In addition to an appropriate sampling design and corrections for population structure, a high SNP density can help improve such correlative approach. NGS open the possibility to perform genomic sequencing and SNP discovery in non-model species in order to look for genomic loci associated with an environmental variable. However, in such species, less gene functional data are available in comparison with model species to help analyze candidate regions. A high SNP density will enable to estimate the LD in the studied species to calibrate optimal sliding window size and extract the most relevant genomic regions. Supplementary data to this article can be found online at http://dx. doi.org/10.1016/j.ygeno.2014.12.001.

Acknowledgments Part of the computations presented in this paper were performed using CIMENT infrastructure (https://ciment.ujf-grenoble.fr), which

Please cite this article as: S. Lobréaux, C. Melodelima, Detection of genomic loci associated with environmental variables using generalized linear mixed models, Genomics (2014), http://dx.doi.org/10.1016/j.ygeno.2014.12.001

S. Lobréaux, C. Melodelima / Genomics xxx (2014) xxx–xxx

is supported by the Rhône-Alpes region (GRANT CPER07-13 CIRA: http://www.ci-ra.org). References [1] D.C. Koboldt, K.M. Steinberg, D.E. Larson, R.K. Wilson, E.R. Mardis, The nextgeneration sequencing revolution and its impact on genomics, Cell 155 (2013) 27–38. [2] J. Cao, K. Schneeberger, S. Ossowski, T. Günther, S. Bender, J. Fitz, et al., Wholegenome sequencing of multiple Arabidopsis thaliana populations, Nat. Genet. 43 (2011) 956–963. [3] U. Ober, J.F. Ayroles, E.A. Stone, S. Richards, D. Zhu, R.A. Gibbs, et al., Using wholegenome sequence data to predict quantitative trait phenotypes in Drosophila melanogaster, PLoS Genet. 8 (2012) e1002685. [4] Q. Long, F.A. Rabanal, D. Meng, C.D. Huber, A. Farlow, A. Platzer, et al., Massive genomic variation and strong selection in Arabidopsis thaliana lines from Sweden, Nat. Genet. 45 (2013) 884–890. [5] N.A. Baird, P.D. Etter, T.S. Atwood, M.C. Currey, A.L. Shiver, Z.A. Lewis, et al., Rapid SNP discovery and genetic mapping using sequenced RAD markers, PLoS One 3 (2008) e3376. [6] R.J. Elshire, J.C. Glaubitz, Q. Sun, J.A. Poland, K. Kawamoto, E.S. Buckler, S.E. Mitchell, A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species, PLoS One 6 (2011) e19379. [7] W.S. Bush, J.H. Moore, Chapter 11: genome-wide association studies, PLoS Comput. Biol. 8 (2012) e1002822. [8] B. Hayes, Overview of statistical methods for genome-wide association studies (GWAS), Methods Mol. Biol. 1019 (2013) 149–169. [9] T.A. Manolio, L.D. Brooks, F.S. Collins, A HapMap harvest of insights into the genetics of common disease, J. Clin. Investig. 118 (2008) 1590–1605. [10] B.E. Stranger, E.A. Stahl, T. Raj, Progress and promise of genome-wide association studies for human complex trait genetics, Genetics 187 (2011) 367–383. [11] X. Huang, B. Han, Natural variations and genome-wide association studies in crop plants, Annu. Rev. Plant Biol. 65 (2014) 531–551. [12] M.J. Wagner, Rare-variant genome-wide association studies: a new frontier in genetic analysis of complex traits, Pharmacogenomics 14 (2013) 413–424. [13] S.D. Schoville, A. Bonin, O. Francois, S. Lobreaux, C. Melodelima, S. Manel, Adaptive genetic variation on the landscape: methods and cases, Annu. Rev. Ecol. Evol. Syst. 43 (2012) 23–43. [14] S.J. Franks, A.A. Hoffmann, Genetics of climate change adaptation, Annu. Rev. Genet. 46 (2012) 185–208. [15] L. Excoffier, T. Hofer, M. Foll, Detecting loci under selection in a hierarchically structured population, Heredity (Edinb.) 103 (2009) 285–298. [16] H.M. Kang, N.A. Zaitlen, C.M. Wade, A. Kirby, D. Heckerman, M.J. Daly, E. Eskin, Efficient control of population structure in model organism association mapping, Genetics 178 (2008) 1709–1723. [17] J. Yu, G. Pressoir, W.H. Briggs, I. Vroh Bi, M. Yamasaki, J.F. Doebley, et al., A unified mixed-model method for association mapping that accounts for multiple levels of relatedness, Nat. Genet. 38 (2006) 203–208. [18] M. Malosetti, C.G. van der Linden, B. Vosman, F.A. van Eeuwijk, A mixed-model approach to association mapping using pedigree information with an illustration of resistance to Phytophthora infestans in potato, Genetics 175 (2007) 879–889. [19] K. Zhao, M.J. Aranzana, S. Kim, C. Lister, C. Shindo, C. Tang, et al., An Arabidopsis example of association mapping in structured samples, PLoS Genet. 3 (2007) e4. [20] B.M. Bolker, M.E. Brooks, C.J. Clark, S.W. Geange, J.R. Poulsen, M.H. Stevens, J.S. White, Generalized linear mixed models: a practical guide for ecology and evolution, Trends Ecol. Evol. 24 (2009) 127–135. [21] X. Zhou, M. Stephens, Genome-wide efficient mixed-model analysis for association studies, Nat. Genet. 44 (2012) 821–824. [22] S. Joost, A. Bonin, M.W. Bruford, L. Després, C. Conord, G. Erhardt, P. Taberlet, A spatial analysis method (SAM) to detect candidate loci for selection: towards a landscape genomics approach to adaptation, Mol. Ecol. 16 (2007) 3955–3969. [23] S. De Mita, A.C. Thuillet, L. Gay, N. Ahmadi, S. Manel, J. Ronfort, Y. Vigouroux, Detecting selection along environmental gradients: analysis of eight methods and their effectiveness for outbreeding and selfing populations, Mol. Ecol. 22 (2013) 1383–1399. [24] G. Coop, D. Witonsky, A. Di Rienzo, J.K. Pritchard, Using environmental correlations to identify loci underlying local adaptation, Genetics 185 (2010) 1411–1423. [25] A. Carvajal-Rodriguez, GENOMEPOP: a program to simulate genomes in populations, BMC Bioinform. 9 (2008) 223.

7

[26] A. Agresti, Logistic Regression in Categorical Data Analysis, Wiley-Interscience, 2002. 165–210. [27] L.V. Garcia, Escaping the Bonferroni iron claw in ecological studies, Oikos 105 (2004) 657–663. [28] S.R. Narum, Beyond Bonferroni: less conservative analyses for conservative genetics, Conserv. Genet. 7 (2006) 783–787. [29] J.P. Shaffer, Multiple hypothesis-testing, Annu. Rev. Psychol. 46 (1995) 561–584. [30] W.W. Hauck, A. Donner, Wald's test as applied to hypothesis in logit analysis, J. Am. Stat. Assoc. 72 (1977) 851–853. [31] W. Tu, X.H. Zhou, A Wald test comparing medical costs based on log-normal distributions with zero valued costs, Stat. Med. 18 (1999) 2749–2761. [32] F.W. Went, The effect of temperature on plant growth, Annu. Rev. Plant Physiol. 4 (1953) 347–362. [33] K. Shinozaki, K. Yamaguchi-Shinozaki, Molecular responses to dehydration and low temperature: differences and cross-talk between two stress signaling pathways, Curr. Opin. Plant Biol. 3 (2000) 217–223. [34] P. Steponkus, Role of the plasma membrane in freezing injury and cold acclimation, Annu. Rev. Plant Physiol. (1984) 543–584. [35] T. Close, Dehydrins: a commonality in the response of plants to dehydration and low temperature, Physiol. Plant. (1997) 291–296. [36] Y. Yamauchi, M. Ogawa, A. Kuwahara, A. Hanada, Y. Kamiya, S. Yamaguchi, Activation of gibberellin biosynthesis and response pathways by low temperature during imbibition of Arabidopsis thaliana seeds, Plant Cell 16 (2004) 367–378. [37] Y. Oono, M. Seki, M. Satou, K. Iida, K. Akiyama, T. Sakurai, et al., Monitoring expression profiles of Arabidopsis genes during cold acclimation and deacclimation using DNA microarrays, Funct. Integr. Genomics 6 (2006) 212–234. [38] S.G. Fowler, D. Cook, M.F. Thomashow, Low temperature induction of Arabidopsis CBF1, 2, and 3 is gated by the circadian clock, Plant Physiol. 137 (2005) 961–968. [39] J. Kilian, D. Whitehead, J. Horak, D. Wanke, S. Weinl, O. Batistic, et al., The AtGenExpress global stress expression data set: protocols, evaluation and model data analysis of UV-B light, drought and cold stress responses, Plant J. 50 (2007) 347–363. [40] W.E. Finch-Savage, C.S. Cadman, P.E. Toorop, J.R. Lynn, H.W. Hilhorst, Seed dormancy release in Arabidopsis Cvi by dry after-ripening, low temperature, nitrate and light shows common quantitative patterns of gene expression directed by environmentally specific sensing, Plant J. 51 (2007) 60–78. [41] A. Mustafiz, A.K. Singh, A. Pareek, S.K. Sopory, S.L. Singla-Pareek, Genome-wide analysis of rice and Arabidopsis identifies two glyoxalase genes that are highly expressed in abiotic stresses, Funct. Integr. Genomics 11 (2011) 293–305. [42] J.A. Kreps, Y. Wu, H.S. Chang, T. Zhu, X. Wang, J.F. Harper, Transcriptome changes for Arabidopsis in response to salt, osmotic, and cold stress, Plant Physiol. 130 (2002) 2129–2141. [43] K. Ichimura, T. Mizoguchi, R. Yoshida, T. Yuasa, K. Shinozaki, Various abiotic stresses rapidly activate Arabidopsis MAP kinases ATMPK4 and ATMPK6, Plant J. 24 (2000) 655–665. [44] C. Papdi, E. Abrahám, M.P. Joseph, C. Popescu, C. Koncz, L. Szabados, Functional identification of Arabidopsis stress regulatory genes using the controlled cDNA overexpression system, Plant Physiol. 147 (2008) 528–542. [45] B.H. Lee, D.A. Henderson, J.K. Zhu, The Arabidopsis cold-responsive transcriptome and its regulation by ICE1, Plant Cell 17 (2005) 3155–3175. [46] S.J. Robinson, I.A. Parkin, Differential SAGE analysis in Arabidopsis uncovers increased transcriptome complexity in response to low temperature, BMC Genomics 9 (2008) 434. [47] M.H. Kim, Y. Sonoda, K. Sasaki, H. Kaminaka, R. Imai, Interactome analysis reveals versatile functions of Arabidopsis COLD SHOCK DOMAIN PROTEIN 3 in RNA processing within the nucleus and cytoplasm, Cell Stress Chaperones 18 (2013) 517–525. [48] R.J. Schmitz, L. Hong, S. Michaels, R.M. Amasino, FRIGIDA-ESSENTIAL 1 interacts genetically with FRIGIDA and FRIGIDA-LIKE 1 to promote the winter-annual habit of Arabidopsis thaliana, Development 132 (2005) 5471–5478. [49] H. Koiwa, F. Li, M.G. McCully, I. Mendoza, N. Koizumi, Y. Manabe, et al., The STT3a subunit isoform of the Arabidopsis oligosaccharyltransferase controls adaptive responses to salt/osmotic stress, Plant Cell 15 (2003) 2273–2284. [50] J. Larkindale, E. Vierling, Core genome responses involved in acclimation to high temperature, Plant Physiol. 146 (2008) 748–761. [51] S. Luhua, A. Hegie, N. Suzuki, E. Shulaev, X. Luo, D. Cenariu, et al., Linking genes of unknown function with abiotic stress responses by high-throughput phenotype screening, Physiol. Plant. 148 (2013) 322–333.

Please cite this article as: S. Lobréaux, C. Melodelima, Detection of genomic loci associated with environmental variables using generalized linear mixed models, Genomics (2014), http://dx.doi.org/10.1016/j.ygeno.2014.12.001

Detection of genomic loci associated with environmental variables using generalized linear mixed models.

We tested the use of Generalized Linear Mixed Models to detect associations between genetic loci and environmental variables, taking into account the ...
1MB Sizes 0 Downloads 9 Views