Letters

Trends in Plant Science May 2014, Vol. 19, No. 5

The multispecies coalescent model and land plant origins: a reply to Springer and Gatesy

Bojian Zhong1, Liang Liu2, and David Penny1

1 Institute of Fundamental Sciences, Massey University, Palmerston North, New Zealand
2 Department of Statistics and Institute of Bioinformatics, University of Georgia, Athens, GA 30606, USA

Springer and Gatesy [1] give a slightly different hypothesis of the origin of land plants but suggest both that their results ‘undermine major conclusions’ in our paper [2] and that the coalescent model is not reliable for deep phylogenetic questions such as the origin of land plants. There are three aspects to our response: (i) there are important points of agreement between the two studies (the biological aspect); (ii) the trees obtained by the two studies are very similar (the mathematical aspect); and (iii) there are differences between the methods used (the computational aspect). The computational aspect is more complex, and the issues here include coalescence itself, the programs, missing data, and recombination.

With respect to the biology, Springer and Gatesy ignore several important findings in our paper. Our main finding is that the Charales do not fit as the closest relatives of land plants [2], and Springer and Gatesy [1] put the Charales even outside the Klebsormidales, thus supporting our conclusion. As far as the Zygnematales and Coleochaetales are concerned, neither hypothesis (Zygnematales + Coleochaetales, or Zygnematales alone, as the sister group to land plants) can be excluded on the basis of our recent phylogenomic analyses using chloroplast genomes [3]. This is not really a concern here, because both hypotheses reject the ‘coenocytic’ lineages (i.e., Charales) as being the closest lineage to land plants. Similarly, the streptophyte algae are much closer to land plants than the chlorophyte algae, and land plants are monophyletic [2]. Of the other two differences within land plants [1], we note that Huperzia is a long branch and we agree that the phylogenetic position of the Gnetales is an interesting question [4,5], but it is not the issue here.
On the mathematical side, there are approximately 2.92 × 10^40 unrooted trees for 32 taxa, and we calculated a bound on the probability (p) of getting two trees with just four differences (shown in Figure 1 in [1]) of p < 10^-33 for n = 32 taxa (see p. 545, example 3 in [6]), indicating again that the phylogenetic trees shown in Figure 1 in our paper [2] are highly similar to the tree shown in Springer and Gatesy [1]. We believe that these lines of evidence do not support the conclusion of Springer and Gatesy that ‘These results undermine major conclusions regarding land plant origins’ in our paper [2] and instead support phylogenetic trees being both highly similar and ‘locally stable’ [7]: the differences are only variations around a single internal edge/branch of the tree.

Corresponding authors: Zhong, B. ([email protected]); Liu, L. ([email protected]). 1360-1385/$ – see front matter © 2014 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.tplants.2014.02.011

The coalescence process may occur in deep phylogeny [8]. When the deep internal branch is short (in coalescent units), ancestral polymorphism, caused by deep coalescence in the ancestral population, can be retained. The origin of land plants is a deep divergence and the internal branch is short (see Figure 1 in Zhong et al. [2] and Figure 1D in Springer and Gatesy [1]). The short branches at the land plant origins (as well as the expected larger population size in the early green algae) suggest that incomplete lineage sorting due to deep coalescence is likely to have occurred during the ancient rapid radiation. We have shown that the coalescent model can account for 68% of the gene tree variation observed for 184 gene trees [2], leaving less than one-third of gene tree heterogeneity to be explained by other biological processes (e.g., recombination, introgression, natural selection).

We have several comments regarding the programs used. The coalescent program STAR [9] is based on the ranks of the internal nodes of gene trees. Because the ranks depend on the number of taxa in the individual gene trees, having a large number of missing taxa in some gene trees may bias the ranks in those trees. We observed a considerable number of missing taxa in the plant data (also shown in Figure 1 in Springer and Gatesy [1]). Thus we chose the MP-EST method [10] rather than the STAR method, to reduce the effect of missing data on species tree estimation, because MP-EST is based on gene tree triples, which are comparable among all gene trees. Springer and Gatesy state that ‘Liu et al. (pp. 4–5) cautioned that for MP-EST, missing lineages “in some gene trees are allowed if lineages are missing randomly, but a lot of missing lineages may dramatically reduce the performance of the pseudo-likelihood approach”.’ However, ‘performance’, in the original MP-EST paper [10], refers to ‘precision’ rather than ‘accuracy’; that is, having many missing lineages may lead to low bootstrap support (or a large variance) of the MP-EST tree. The problem of large variance due to missing taxa can be overcome by sampling a large number of loci [10]. As demonstrated in our plant data analysis, the likelihood ratio test significantly rejected the null hypothesis [2], which implies that our plant data contain enough information to make inferences on the origin of land plants. In addition, a recent study [11] has shown that missing data have little impact on species tree estimation under the coalescent model if enough loci are sampled.
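
The point that gene tree triples remain comparable across gene trees with different taxon sets can be illustrated with a small sketch. This is not the MP-EST program and does not compute its pseudo-likelihood; it only tallies rooted triple topologies. The nested-tuple tree encoding, the function names, and the toy gene trees are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def leaves(tree):
    """Return the set of taxon names below a node (a name or a tuple of children)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def triple_topology(tree, a, b, c):
    """Return the sister pair in the rooted triple induced on taxa a, b, c.

    Walks down from the root until the three taxa are split between children.
    Returns None if they fall into three different children (unresolved).
    """
    wanted = {a, b, c}
    node = tree
    while True:
        parts = [wanted & leaves(child) for child in node]
        parts = [p for p in parts if p]
        if len(parts) == 1:
            # All three taxa sit in the same child: keep descending.
            node = next(ch for ch in node if wanted <= leaves(ch))
        elif len(parts) == 2:
            # One child holds two of the taxa: that pair is sister.
            return frozenset(max(parts, key=len))
        else:
            return None


def triple_counts(gene_trees):
    """Tally triple topologies, using each gene tree only for triples it contains."""
    counts = {}
    for tree in gene_trees:
        for trio in combinations(sorted(leaves(tree)), 3):
            sister = triple_topology(tree, *trio)
            if sister is not None:
                counts.setdefault(trio, Counter())[tuple(sorted(sister))] += 1
    return counts


# Hypothetical gene trees; the third one is missing taxon D.
g1 = ((("A", "B"), "C"), "D")
g2 = (("A", ("B", "C")), "D")
g3 = (("A", "B"), "C")
counts = triple_counts([g1, g2, g3])
```

In this toy input the gene tree lacking D still contributes to the triple (A, B, C), and simply contributes nothing to triples that involve D. Fewer informative gene trees per triple means a noisier estimate, which is consistent with the distinction drawn above between precision and accuracy.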

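The size of tree space quoted in the mathematical argument earlier in this reply can be checked directly. The sketch below assumes that the figure refers to fully resolved (binary) unrooted trees with labelled leaves, for which the standard count is (2n − 5)!! = 1 × 3 × 5 × ... × (2n − 5). The function name is an illustrative choice, and the probability bound taken from [6] is not reproduced here.

```python
def unrooted_binary_trees(n_taxa: int) -> int:
    """Number of unrooted binary trees on n_taxa labelled leaves: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


n_trees_32 = unrooted_binary_trees(32)
print(f"{n_trees_32:.2e}")
```

For 32 taxa this gives a value of about 2.92 × 10^40, in agreement with the figure used in the text.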
Springer and Gatesy claim that ‘to our knowledge the only simulations where MP-EST or STAR outperformed concatenation involve a single tree that is small (5 taxa), asymmetrical, and shallow (3.1 coalescent units from root to tip).’ In fact, many papers have demonstrated inconsistency of concatenation in the anomaly zone [9–13], which may occur for any species tree with five or more species. As an example, when a subtree within a large species tree is in the anomaly zone, concatenation will consistently estimate the incorrect tree. Moreover, the references cited by Springer and Gatesy (references 5–8 in [1]) do not support their overstated conclusion that ‘Simulations that have modeled ancient diversifications and larger sets of taxa have uniformly favored concatenation to “shortcut” coalescence methods (STAR, MP-EST, STEM, STEAC, MDC) even when data were simulated via coalescence models.’ In reference 5 in Springer and Gatesy [1], data were simulated under the concatenation model and the main conclusion of the paper is that ‘conventional simulation under the concatenation model cannot adequately resolve which method is most efficient to recover the true tree.’ References 6 and 7 in [1] demonstrated by simulation that MP-EST and STEM do not perform well (i.e., large variance or low bootstrap support) when the number of loci is small and the estimated gene trees are poorly supported. This is not the case for our plant data analysis, which involved 289 genes and produced MP-EST trees that are in general highly supported.
Springer and Gatesy claim that ‘The poor performance of coalescence methods presumably reflects their incorrect assumption that all conflict among gene trees is attributable to deep coalescence, whereas a multitude of other problems (long branches, mutational saturation, weak phylogenetic signal, model misspecification, poor taxon sampling) negatively impact reconstruction of accurate gene trees and provide more cogent explanations for incongruence.’ However, they exaggerate the effects of gene tree estimation error on the performance of coalescence methods. It is commonly observed that the problems of long branches, mutational saturation, weak phylogenetic signal, model misspecification, and poor taxon sampling can negatively impact the performance of all phylogenetic tree reconstruction methods, including concatenation and coalescent-based methods. The poor performance (or relatively large variance) of the MP-EST method is observed when the number of loci is small and individual gene sequence alignments have a low phylogenetic signal [14]. In fact, MP-EST is designed for, and performs well in, phylogenomic data analyses that typically include sequence alignments from hundreds of loci. The MP-EST trees estimated from empirical phylogenomic data are in general well supported, with high bootstrap support values [2,15,16]. For small data sets, it is recommended that the Bayesian coalescent method be adopted, because it is statistically more efficient than the coalescent methods that estimate species trees from gene tree topologies [10].

Springer and Gatesy indicate that ‘A consequence of highly inaccurate gene trees for coalescence analyses is a species tree with branches (in coalescent units) that are impossibly short’ and conclude that the ‘MP-EST tree with its stunted branch lengths provides a flawed underpinning for circular simulations, which erroneously suggest that the coalescence model accounts for 68% of the conflict among gene trees.’ Because MP-EST trees are based on gene trees estimated from sequence alignments, branch lengths of MP-EST trees represent the amount of variation among estimated gene trees, which is presumably caused by both coalescence and uncertainty in estimating gene trees. Consequently, MP-EST tends to underestimate the branch lengths of the underlying species tree, in order to accommodate the non-coalescent variation among estimated gene trees that is due to gene tree estimation error. Because the true gene trees are unknown, it is reasonable to evaluate the amount of observed variation among estimated gene trees that can be explained by the coalescent model by simulating gene trees from the MP-EST tree, in which branch lengths are estimated from the estimated gene trees.

Springer and Gatesy argue that our study (as well as others) treats coding sequences stripped of intervening introns as individual loci, thus neglecting potential recombination within genes. However, combining the fragmented sequences from the same gene into single protein-coding loci has been widely used in phylogenetic studies. Recently, Lanier and Knowles [17] showed that unrecognized intralocus recombination has only a minor effect and does not represent a major factor impacting the accuracy of species tree estimation. If recombination occurs on the long branches of the species tree, it will not produce significantly distinct histories for different exons, and treating exons as single genes does not cause problems for the coalescent methods. Bayzid and Warnow [14] suggest that concatenating sequences from a small number of loci can actually improve the performance of coalescent methods, even if the concatenated loci have distinct histories. This indicates that the coalescent model is robust against a small amount of recombination within genes.
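
The branch-length point made earlier in this reply (gene tree estimation error inflates the observed variation among gene trees and therefore shortens estimated branches) can be illustrated with the standard three-taxon coalescent result: for an internal branch of length t in coalescent units, a gene tree matches the species tree with probability 1 − (2/3)e^(−t). The error model below, in which a fraction of estimated gene trees is resolved at random, is a hypothetical simplification introduced only for this sketch and is not taken from the source; the function names are likewise illustrative.

```python
import math


def p_match(t: float) -> float:
    """Probability that a three-taxon gene tree matches the species tree."""
    return 1.0 - (2.0 / 3.0) * math.exp(-t)


def branch_from_p(p: float) -> float:
    """Invert p_match: internal branch length implied by a matching proportion."""
    if not 1.0 / 3.0 < p < 1.0:
        raise ValueError("p must lie strictly between 1/3 and 1")
    return -math.log(1.5 * (1.0 - p))


def observed_p(t: float, error_rate: float) -> float:
    """Matching proportion when a fraction of gene trees is resolved at random.

    Hypothetical error model: an erroneous gene tree picks one of the three
    rooted topologies uniformly, so it matches with probability 1/3.
    """
    return (1.0 - error_rate) * p_match(t) + error_rate / 3.0


true_t = 1.0
t_no_error = branch_from_p(observed_p(true_t, 0.0))
t_with_error = branch_from_p(observed_p(true_t, 0.3))
```

Under these assumptions the implied branch length equals −ln((1 − ε)e^(−t) + ε), which is smaller than t for any error rate ε between 0 and 1. The estimate therefore shrinks as estimation error grows, without any change in the underlying coalescent process.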
Nevertheless, Springer and Gatesy raise an interesting question about how independent genetic markers should be selected from genomes for building species trees. This topic certainly deserves more thorough phylogenetic study to understand the effects (negative and positive) of choosing different genetic markers on species tree estimation.

In summary, the multispecies coalescent model is an effective, reliable, and testable method for deep phylogenetic inferences such as the origin of land plants. As recognized in the widely cited words of George Box, ‘all models are wrong, but some are useful’ [18]. The coalescent model, like almost all other mathematical models, is an approximation to the real biological process, and we do not expect it to fit all aspects of real biological characteristics perfectly. We always welcome a constructive discussion to evaluate coalescent methods based on substantial research.

Acknowledgments

The authors thank Chris Tuffley for assistance with checking the calculation of the numbers of differences between trees.

References

1 Springer, M.S. and Gatesy, J. (2014) Land plant origins and coalescence confusion. Trends Plant Sci. 19, 267–269
2 Zhong, B. et al. (2013) Origin of land plants using the multispecies coalescent model. Trends Plant Sci. 18, 492–495
3 Zhong, B. et al. (2014) Streptophyte algae and the origin of land plants revisited using heterogeneous models with three new algal chloroplast genomes. Mol. Biol. Evol. 31, 177–183
4 Zhong, B. et al. (2010) The position of Gnetales among seed plants: overcoming pitfalls of chloroplast phylogenomics. Mol. Biol. Evol. 27, 2855–2863
5 Zhong, B. et al. (2011) Systematic error in seed plant phylogenomics. Genome Biol. Evol. 3, 1340–1348
6 Steel, M.A. (1988) Distribution of the symmetric difference metric on phylogenetic trees. SIAM J. Disc. Math. 1, 541–551
7 Cooper, A. and Penny, D. (1997) Mass survival of birds across the Cretaceous/Tertiary boundary. Science 275, 1109–1113
8 Oliver, J.C. (2013) Microevolutionary processes generate phylogenomic discordance at ancient divergences. Evolution 67, 1823–1830
9 Liu, L. et al. (2009) Coalescent methods for estimating phylogenetic trees. Mol. Phylogenet. Evol. 53, 320–328
10 Liu, L. et al. (2010) A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol. Biol. 10, 302
11 Hovmöller, R. et al. (2013) Effects of missing data on species tree estimation under the coalescent. Mol. Phylogenet. Evol. 69, 1057–1062
12 Kubatko, L.S. and Degnan, J.H. (2007) Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst. Biol. 56, 17–24
13 Liu, L. and Edwards, S.V. (2009) Phylogenetic analysis in the anomaly zone. Syst. Biol. 58, 452–460
14 Bayzid, M.S. and Warnow, T. (2013) Naïve binning improves phylogenomic analyses. Bioinformatics 29, 2277–2284
15 Song, S. et al. (2012) Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proc. Natl. Acad. Sci. U.S.A. 109, 14942–14947
16 Chiari, Y. et al. (2012) Phylogenomic analyses support the position of turtles as the sister group of birds and crocodiles (Archosauria). BMC Biol. 10, 65
17 Lanier, H.C. and Knowles, L.L. (2012) Is recombination a problem for species-tree analysis? Syst. Biol. 61, 691–701
18 Box, G.E.P. and Draper, N.R. (1987) Empirical Model Building and Response Surfaces, John Wiley & Sons
