Article type: Original Article

ACCOUNTING FOR RATE VARIATION AMONG LINEAGES IN COMPARATIVE DEMOGRAPHIC ANALYSES

Running Title: Rate Variation and Demographic Inference

Andrew G. Hope1,2, Simon Y. W. Ho3,4, Jason L. Malaney5,6, Joseph A. Cook7,8, Sandra L. Talbot1,9

1

U. S. Geological Survey, Alaska Science Center, 4210 University Drive, Anchorage,

AK 99508, USA 2

E-mail: [email protected]

3

School of Biological Sciences, Edgeworth David Building A11, The University of

Sydney, Sydney, NSW 2006, Australia 4

E-mail: [email protected]

5

Department of Natural Resources and Environmental Science, University of

Nevada-Reno, Reno, NV 89557, USA 6

E-mail:[email protected]

7

Museum of Southwestern Biology and Department of Biology, MSC03 2020,

University of New Mexico, Albuquerque, NM 87131, USA 8

E-mail: [email protected]

9

E-mail: [email protected]

Data Archival Location: Gene sequences were submitted to GenBank and This article has been accepted for publication and undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process, which may lead to differences between this version and the Version of Record. Please cite this article as doi: 10.1111/evo.12469. This article is protected by copyright. All rights reserved.

1

accessions are listed in Supplementary Information.

KEY WORDS: Bayesian skyline plot, climate change, comparative phylogeography, mammal, substitution rate

Genetic analyses of contemporary populations can be used to estimate the demographic histories of species within an ecological community. Comparison of these demographic histories can shed light on community responses to past climatic events. However, species experience different rates of molecular evolution, and this presents a major obstacle to comparative demographic analyses. We address this problem by using a Bayesian relaxed-clock method to estimate the relative evolutionary rates of 22 small mammal taxa distributed across northwestern North America. We found that estimates of the relative molecular substitution rate for each taxon were consistent across the range of sampling schemes that we compared. Using three different reference rates, we rescaled the relative rates so that they could be used to estimate absolute evolutionary timescales. Accounting for rate variation among taxa led to temporal shifts in our skyline-plot estimates of demographic history, highlighting both uniform and idiosyncratic evolutionary responses to directional climate trends for distinct ecological subsets of the small mammal community. Our approach can be used in evolutionary analyses of populations from multiple species, including comparative demographic studies.

This article is protected by copyright. All rights reserved.

2

The climatic fluctuations of the late Quaternary significantly structured genetic diversity in contemporary populations (Hewitt 1996, 2000). To assess the variable impacts of Quaternary climate on species evolution using comparative phylogeographic approaches, genetic data from multiple taxa can be integrated to allow simultaneous comparisons within a consistent statistical framework (Riddle et al. 2008; Hickerson et al. 2010). Quantifying the spatiotemporal changes in effective population size and genetic diversity among taxa can provide critical insights for management of biodiversity in the face of current climate trends (Hope et al. 2013a; Prost et al. 2013). There has been considerable progress in methods for inferring the timing and extent of population-size changes using molecular data, including coalescent-based approaches such as skyline plots (Pybus et al. 2000; Drummond et al. 2005; Gill et al. 2013). Estimated demographic histories can then be incorporated into broader methodological frameworks and related to environment, ecology, and community structure for specific temporal periods (Galbreath et al. 2009; Lorenzen et al. 2011; Hope et al. 2013a; Pauls et al. 2013). Despite the growing use of comparative methods (Hickerson et al. 2010) and interest in community-level analysis for management and conservation (Mouquet et al. 2012), obstacles remain to accurate inference from comparative genetic datasets. Meaningful comparison of the timing of demographic changes in different species relies on accurate characterization of molecular rates. In particular, coalescent estimates of past population dynamics are only as reliable as the assumed substitution rates. For example, halving the substitution rate will approximately double the estimated coalescent timescale, prolonging the apparent duration of demographic changes. This can have a critical impact on interpretation of the spatiotemporal and evolutionary responses of species to a given climatic trend. For example, a rapid population expansion may result in a greater reduction in genetic diversity

This article is protected by copyright. All rights reserved.

3

than a more gradual expansion (Hewitt 1996). We present an approach that allows estimation of population histories while accounting for rate variation among small- to medium-sized mammal species in northwestern North America. In mammals, substitution rates vary dramatically among species. For one of the most widely studied genetic markers in mammals, the mitochondrial cytochrome b gene (Cytb), rate estimates among species have ranged from 0.5 substitutions/site/Myr for third codon positions (Nabholz et al. 2008). Obtaining accurate taxon-specific rates remains difficult, despite advances in our understanding of the factors that drive variation in mammalian evolutionary rates (e.g., Welch et al. 2008; Bromham 2011). For this reason, attempts to model rate variation have been descriptive in nature. These include relaxed molecular clocks (e.g. Thorne et al. 1998; Sanderson 2002; Drummond et al. 2006), which allow evolutionary timescales to be estimated in the presence of rate variation among lineages. However, such models might not be able to capture time-dependent variation in rates, such as an apparent decline in rates towards the root owing to the action of purifying selection (Ho et al. 2011a). If deep calibrations are used to infer demographic histories from different taxa, the timing and duration of population trends are prone to overestimation (Ho and Larson 2006). An emerging approach to estimating taxon-specific rates involves the analysis of ancient DNA (aDNA) sequences (Lorenzen et al. 2011; Prost et al. 2013). Sequences from dated samples can be used to calibrate the molecular clock, allowing the substitution rate to be estimated for the population of interest. Despite the decreasing cost and increasing efficiency of obtaining aDNA data, there remain a limited number of species for which there are sufficient aDNA sequences to allow reliable rate estimation (Ho et al. 2011b). This study presents a solution to an important problem: estimating and comparing the population responses of multiple, distantly related taxa within a community to late-

This article is protected by copyright. All rights reserved.

4

Quaternary climate, while taking into account rate heterogeneity among taxa. Phylogeographic studies focusing on Arctic species have revealed variable demographic responses to recent glacial cycles (e.g. Hope et al. 2013b). Species can persist and potentially adapt to changing environmental conditions, shift to follow preferred conditions, or be periodically extirpated from a region as conditions exceed taxon-specific tolerances (Bellard et al. 2012). Different evolutionary responses provide predictable demographic signatures among populations of species with distinct evolutionary origins (Palearctic versus Nearctic) and variable habitat associations (forest vs. tundra vs. generalist; Hewitt 1996). Therefore observed variation in the estimated timing, extent, and duration of demographic responses among species might reflect a genuine pattern, an artifact of sampling biases, stochastic genetic variation among taxa at a single locus, or a failure to account for rate heterogeneity among lineages. Previous comparative studies have employed a variety of approaches to handle data from multiple species and markers. Some have assumed a range of mutation rates that are incorporated into estimates of theta, particularly for taxa that are distantly related (e.g., Carstens et al. 2005), whereas others have assumed an average rate among species and among loci that is allowed to vary within bounds (Huang et al. 2011). Some authors have applied a single evolutionary rate that is shared across taxa (Hope et al. 2013a, b), particularly among related species (Kumar and Subramanian 2002). As a consequence, the contribution of among-lineage rate variability to estimated temporal responses remains unknown. A more appropriate approach might be to account explicitly for rate variation among lineages by using methods that involve taxon-specific rates. Temporal and spatial inference of fine-scale evolutionary dynamics is a small but critical aspect of most phylogeographic studies that remains methodologically unstandardized and highly variable, despite being of high importance for management and conservation application.

This article is protected by copyright. All rights reserved.

5

We evaluate the effects of accounting for rate variability among distantly related taxa on estimates of demographic and evolutionary responses among different ecological community components. To achieve this, we compiled a data set comprising Cytb sequences for small mammals representing boreal forest specialists, Arctic tundra/alpine specialists, and widespread habitat generalists. In addition to estimating relative rates of Cytb evolution among taxa, we assessed the robustness of our estimates of taxon-specific substitution rates to variation in the species pool, sample size, and sequence length, with the potential for these methods to be applied to much larger comparative datasets. Finally, we applied three reference substitution rates gleaned from the literature – two based on aDNA calibrations and one based on a biogeographic calibration – to estimate and compare timescales of demographic change.

Methods STUDY SYSTEM, SAMPLING, AND TAXA Our dataset included 22 small mammal taxa (20 species), representing three mammalian orders and a broad range of body sizes, ecological associations, and habitat specializations. The study system centers on Beringia, a biogeographic province that spans the northern continents from roughly the Lena River in Siberia, Russia, eastward to the MacKenzie Mountains in the Yukon Territory, Canada (Hultén 1937). The study taxa vary in their evolutionary origins (Palearctic, Nearctic, or endemic Beringian) and life-history associations (Table 1). For each species, previous phylogeographic assessments (listed by specimen accessions in supporting information) identified one or more discrete evolutionary lineages with a geographic extent at least partially within the study area. Geographic limits for each taxon were congruent with the extent of distinct mitochondrial lineages defined as wellsupported monophyletic clades from comprehensive species phylogenies. In this way we are

This article is protected by copyright. All rights reserved.

6

demonstrating a method of comparing the demographic history of a given genetic marker within many potentially distantly related but co-distributed lineages. For two species that had multiple lineages occurring within the study area, lineages were parapatric and specimens near zones of contact were assigned to a lineage only through analysis of mitochondrial sequences. Gene sequences were either obtained from GenBank or sequenced for the current study from museum-archived frozen (-80C) heart or liver tissue through standard salt extraction, polymerase chain reaction, and cycle sequencing methods (Hope et al. 2010). Novel sequences have been deposited in GenBank (Supporting Information). For sequenced samples, a single primer pair for Cytb was used (MSB05/MSB14; Hope et al. 2010). Automated sequencing of complementary strands was conducted using an Applied Biosystems 3110 DNA sequencer (Molecular Biological Facility, University of New Mexico). Sequences were edited and aligned in SEQUENCHER 4.8 (Genecodes, Ann Arbor, Michigan) and checked visually. Length of sequence varied among taxa from 451 bp (the longest common length among all taxa) to the full Cytb gene (1140-1143 bp) although sequence length was consistent within each taxon (Table S1). Sample size also varied among taxa (n=6-176).

ESTIMATING RELATIVE RATES AMONG TAXA To control for among-lineage rate variation, we performed Bayesian phylogenetic analyses using relaxed molecular clocks. To check for consistency in relative rate estimates among taxa from analyses based on different sample sizes and sequence lengths, we compiled and analyzed datasets under three sampling schemes (SS1-3). Each sampling scheme was repeated using all available sequence (≤1143 bp) and using the shortest common sequence length across all taxa (451 bp). To capture maximum genetic diversity within each taxon, the

This article is protected by copyright. All rights reserved.

7

two sequences exhibiting greatest uncorrected pairwise distance, as calculated in MEGA v5 (Tamura et al. 2011), were included in all analyses. SS1—For the first sampling scheme we included the full dataset of 944 sequences from 22 taxa. SS2—The second sampling scheme allowed us to evaluate the effects of using a small sample size for each taxon. For each taxon, we included the two most distant sequences plus eight additional sequences (except for one taxon with fewer samples available). This assured that we minimally included at least the range of total genetic diversity within a given taxon. By virtue of each taxon constituting a discrete monophyletic clade, we assume that the pair of most distant sequences does not consist of significant outliers from the mean pairwise divergence within a given taxon. We repeated this procedure three times, producing a total of four replicate datasets. Rates from replicate analyses were averaged for each taxon. SS3—The final sampling scheme allowed us to assess the impact of taxon sampling. We randomly divided the 22 taxa into two groups of 11 taxa. For each taxon we included the two most distant sequences and an additional 18 randomly chosen sequences (where available). The process was repeated three times to yield four data sets, each comprising 11 taxa and up to 20 samples per taxon. Rates were averaged across replicates. To facilitate repeatability, all methods for calculating taxon-specific rates are illustrated in a schematic, including necessary equations (Fig. 1). Bayesian phylogenetic analyses were performed in BEAST v1.7.0 (Drummond et al. 2012). We assigned the HKY substitution model and empirical base frequencies to each codon position to avoid additional parameterization from a more complex model. The HKY model of substitution was consistent with most intra-taxon best-model estimates using MRMODELTEST v2 (Nylander 2004) under the Akaike Information Criterion. To account for rate variation among lineages, we used an uncorrelated lognormal relaxed clock (Drummond et al. 2006). We specified a

This article is protected by copyright. All rights reserved.

8

constant-size coalescent prior for the tree. Default priors were used for other parameters, although uniform priors were changed to lognormal distributions (log[mean]=0, log[s.d.]=1). Posterior distributions were estimated using Markov chain Monte Carlo sampling, with samples drawn every 5,000 steps over a total of 50 million steps. Acceptable sampling and convergence to the stationary distribution were checked in TRACER v1.5 (Rambaut and Drummond 2007). Maximum-clade-credibility trees were identified using TREEANNOTATOR v1.7.0 (part of the BEAST software package), with node ages reported as posterior median values. A custom Python script was used to extract sample labels, branch lengths, and branch rates. These values were used to calculate the mean rate for each taxon, weighted by branch length (in units of time; Fig. 1B). To assess variability in relative rates estimated from all sampling schemes, rates were rescaled according to the mean rate calculated from each sampling scheme (taxon-specific rate divided by mean rate; Fig. 1C, D). We used analysis of variance (ANOVA) tests to evaluate differences in mean rates among sampling schemes (i.e., do different sampling schemes yield different estimates of rates?), and mean rates among taxa after correcting for differences in the mean rates among sampling schemes (i.e., do taxa exhibit different rates?).

CALCULATING EMPIRICAL RATES ACROSS TAXA To allow absolute timescales to be inferred in our analyses, we assigned three separate reference rate estimates from the literature. The lowest reference rate of 5.510-2 substitutions/site/Myr (herein RHope) was based on Cytb divergence of Sorex minutissimus from its sister taxon, the latter isolated on Honshu Island, Japan, with the timing of their split inferred using bathymetry data (Hope et al. 2010). While we acknowledge the potential for substitution rates to differ between sister taxa, this value provides a useful empirical reference point from which to investigate rate variation among taxa. Taxon-specific rates were

This article is protected by copyright. All rights reserved.

9

calculated by assigning the reference rate to S. minutissimus and then scaling the relative rates among taxa, inferred from the Bayesian phylogenetic analysis of all available data (Fig. 1E). The second reference rate was based on aDNA sequence divergence among collared lemmings (Dicrostonyx torquatus; Prost et al. 2010). As Prost et al. (2010) analyzed concatenated Cytb and control region sequences, we reanalyzed the data to estimate a rate based only on the Cytb sequence data (282 bp) but including the same samples. The bestfitting model of nucleotide substitution was estimated as HKY+I in MRMODELTEST v2. Following Prost et al. (2010), we used the Bayesian skyline plot (BSP) tree prior in BEAST with a strict clock (favored by the Bayes factor over an uncorrelated lognormal relaxed clock). Two independent MCMC analyses, each of 50 million steps, converged on a mean rate estimate of 5.610-2 substitutions/site/Myr (herein RProst). We assigned this rate to D. groenlandicus, which is the recognized sister taxon of D. torquatus, and scaled the rates of the remaining taxa. This reference rate is very similar to RHope, despite the two estimates being obtained using very different methods. Nevertheless, these two reference rates would only produce comparable scaled rates across study taxa if the relative substitution rate of S. minutissimus were the same as D. groenlandicus. This was found not to be the case, and so both reference rates were included in subsequent analyses. A third reference rate, based on aDNA divergence among common voles (M. arvalis; Martínková et al. 2013), provided an estimate of 3.210-1 substitutions/site/Myr (herein RMartínková). Although M. arvalis is not sister to any of the study taxa, we assigned this rate to the most closely related taxon, M. oeconomus, to investigate the effects of scaling rates using a relatively high reference rate. We inferred the population-size history of each taxon using Bayesian skyline plots (BSPs; Drummond et al. 2005). By using three different reference rates, we were able to

This article is protected by copyright. All rights reserved.

10

compare temporal differences in demographic responses among study taxa, and to interpret the timing and duration of population-size changes in the context of known climatic trends. Although BSPs assume panmixia and demographic inference might be misled if there is substantial population structure (Chikhi et al. 2010; Ho and Shapiro 2011), it is difficult to test the validity of this assumption for extended temporal periods and for wild populations. In the presence of population structure, demographic history is best estimated using a pooled sample of individuals from different subpopulations (Heller et al. 2013). Our BSP analyses were based on total available sampling, generally characterizing a pooled sampling regime for each taxon. We investigated this issue using two vole species and two shrew species with large total samples. BSPs inferred from five subsamples, each comprising half of the available sequences, produced similar demographic estimates. For each of the three reference rates, we produced two BSPs for each taxon: one using the scaled, taxon-specific rate (Fig. 1E) and the other using the average of all scaled taxonspecific rates (Fig. 1F), the latter assuming that all taxa share the same rate. All skyline plots from each reference rate were overlaid onto a single timescale. This was done separately for analyses based on taxon-specific substitution rates and those based on averaged rates. We also produced separate plots of overlaid skylines for taxa grouped by ecological habitat associations (boreal, tundra/alpine, or widespread). To evaluate differences in the timescales inferred using taxon-specific rates versus averaged rates, we identified the positive inflection points from each skyline plot (representing estimated time of maximum population growth). To do this, changes in median population size (Ne) based on 100 bins from each BSP (where

Ne = Nebinx – Nebinx+1) were plotted against time, and maximum values for Ne were retained as inflection points.

This article is protected by copyright. All rights reserved.

11

Results RELATIVE SUBSTITUTION RATES AMONG TAXA We assessed the effects of varying the number of taxa, sample size, and sequence length on estimates of relative substitution rates among taxa (Fig. 2). Considering all six sampling schemes (SS1-3 repeated using both long and short sequences), rate estimates tended to decrease as number of taxa, sample size per taxon, and sequence length increased (Fig. 2A). Mean rate estimates calculated from each sampling scheme varied significantly (one-way ANOVA: F=8.297, P

Accounting for rate variation among lineages in comparative demographic analyses.

Genetic analyses of contemporary populations can be used to estimate the demographic histories of species within an ecological community. Comparison o...
656KB Sizes 0 Downloads 3 Views