STATISTICALTESTSOFPHYLOGENIES

[401

645

Acknowledgments The author is grateful for the encouragement and advice received from Dr. Russell Doolittle, Dr. Charles Langley, and Dr. Michael Waterman.

[40] Statistical

Tests of Molecular

&WEN-HSIUNG

Phylogenies

LI and MANOLOGOUY

Introduction Although phylogenetic reconstruction has long been recognized as a problem in statistical inference, i few methods have been developed for evaluating the confidence level of estimated phylogenies. This deficiency has occurred largely because of the complexity of the problem. This problem has become important because the rapid accumulation of molecular data has generated much interest in phylogenetic studies.2*3 In fact, it is no longer acceptable to throw sequences through any available tree-making method and to publish the results without some evaluation of the reliability of the inferred phylogeny.4 The purpose of this chapter is to provide a summary of methods for testing the significance of inferred phylogenies. We classify analytic methods into the character state approach and the distance matrix approach. We also review the bootstrap approach, which resamples the data to infer the variability of the estimate obtained by a tree-making method.5 Character

State Approach

In a pioneering work, Cavender6 studied the confidence limits that can be placed on a phylogeny inferred from a four-species data set using parsimony methods. The results were given in terms of the total number of characters. Felsenstein’ studied the same problem under the assumption of ’ A. W. F. Edwards and L. L. Cavalli-Sforza, in “Phenetic and Phylogenetic Classification” (V. H. Heywood and J. McNeill, eds.), p. 67 (Syst. Assoc. Publ. No. 6). Systematics Association, London, 1964. 2 M. Nei, “Molecular Evolutionary Genetics.” Columbia Univ. Press, New York, 1987. 3 J. Felsenstein, Annu. Rev. Genet. 22, 521 (1988). 4 D. Penny, Nature (London) 331, 111 (1988). 5 J. Felsenstein, Evolution 39, 783 (1985). 6 J. A. Cavender, Math. Biosci. 40,27 1 (1978). 7 J. Felsenstein, Syst.Zool. 34, 152 ( 1985). METHODS

IN ENZYMOLOGY,

VOL.

183

Copyright 0 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.

646

PHYLOGENETIC

TREES

I401

a constant rate of evolution (an evolutionary clock) and showed that the confidence limits are narrower than those obtained by Cavender. He considered inferences made by using parsimony methods, and the results were given in terms of the number of “phylogenetically informative” characters (see definition below). More recently, Lake8 proposed the evolutionary parsimony method. These two methods are reviewed below. Maximum

Parsimony

Consider the problem of inferring the branching order of three species with an outgroup (species 4). The three possible rooted trees are shown in Fig. la-c. In terms of parsimony a nucleotide site is informative (useful) for choosing among the three trees only if it is in the same state in two of the four species and in another state in the other two species. For example, if the site has the configuration AAGG among the four species, then it supports the first tree (Fig. la) because this tree requires (at least) only one nucleotide substitution to explain the configuration, whereas the other trees (Fig. 1b,c) each require two substitutions. As another example, if the configuration is AAGC, then it is not informative because each of the three trees requires two substitutions. Under the assumption of rate constancy, an informative site has a higher probability (say p) of supporting the true tree and a lower probability (say q) of supporting each of the two other trees. The probability that among n random informative sites the number (S) of sites supporting a particular incorrect tree is m or larger is given by Prob(S 2 m) = i$m &

qi(p + q)nmi

which is obtained by expanding [q + (p + q)]“. Since p + 2q = 1 and since q 5 p, Eq. (1) assumes the maximum value when p = q = 3, i.e., if the three species represent a true trichotomy (Fig. Id). Now suppose in a sequence data set with n informative sites the best supported tree is favored by m sites. Is this tree the true tree? Let us assume that this tree is incorrect and has by chance been supported by so many sites. The probability for a particular incorrect tree to be supported by m or more sites is given by Eq. (1). Therefore, the probability for one of the two incorrect trees to be supported by m or more sites is P = 2 Prob(S 2 m)

(2)

If P is smaller than (Y, then the observed tree is significant at the level of cy. In practice q is unknown, and it is necessary to use the least favorable * J. A. Lake,

Mol. Biol. Evol. 4, 167 (1987).

[401

STATISTICAL

TESTS

b

a

647

OF PHYLOGENIES

d

C

A A A A 1

2

3

4

1

3

2

4

2

3

14

12

3

4

1. a, b, and c represent the three possible rooted, bifurcating trees for three species with one outgroup, and d represents the case of trichotomy. FIG.

value, ). The above theory was developed by Felsenstein.’ He has also developed a statistic for testing whether the best supported topology is significantly better than the next best topology. The computation of this statistic is more complicated, but the powers of the two tests are about the same.’ This is not surprising because if the best supported tree is significant, it is likely to be significantly better than the next best tree (and also the worst tree), and vice versa. Equation (2) assumes that a true trichotomy is impossible so that only two of the three trees can be incorrect. In practice, however, the most difficult cases to resolve are those that are close to a true trichotomy; a good example is the branching order of humans, chimpanzees, and gorillas. Therefore, a more rigorous analysis is to take trichotomy as the null hypothesis. The probability for rejecting this hypothesis is P = 3 Prob(S 2 m)

(3) Under this hypothesis q = 4. We suggest that an inferred phylogeny should be taken with caution if the probability obtained from Eq. (2) is less than 0.05 whereas that obtained from Eq. (3) is larger than 0.05. Although the above results were derived under the assumption that one of the four species is known to be an outgroup, they apply equally well in the absence of this knowledge if one considers unrooted trees. Furthermore, the results should hold approximately even under nonconstant rates of evolution if the sequences are fairly closely related so that parallel and back mutations are rare. Table I shows the minimum m value required for Eq. (1) to be smaller than or equal to a given probability, assuming that n informative sites are observed and that q = p = +. It can be used to determine the significance level. For example, if n = 10 and m = 8, then the significance level is approximately 2 X 0.5% = 1% or 3 X 0.5% = 1.596, depending on whether Eq. (2) or Eq. (3) is used. If n is greater than 100, Eq. (1) can be easily

648

PHYLOGENETIC TABLE

MINIMIJM~VALUEREQUIREDFOR

1401

TREES I

Prob(S Z=~)IN EQ. (1) TO BE SMALLERTHANOR EQUALTOP 6 (%)

6 (W n

0.05

0.30

0.50

1.00

2.50

n

0.05

0.30

0.50

1.00

2.50

10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 32

9 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 20 21

9 9 10 10 11 11 12 12 13 13 14 14 15 15 16 16 16 17 17 18 18 19

8 9 9 10 10 I1 II 12 12 13 13 14 14 15 15 16 16 17 17 17 18 19

8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 15 16 16 17 17 18

7 8 8 9 9 10 10 11 I1 11 12 12 13 13 14 14 14 15 15 16 16 17

34 36 38 40 43 47 50 53 57 60 63 67 70 73 77 80 83 87 90 93 97 100

22 23 24 24 26 28 29 30 32 33 35 36 38 39 41 42 43 45 46 47 49 50

20 21 22 23 24 26 27 28 30 31 33 34 35 37 38 40 41 42 44 45 46 48

20 20 21 22 24 25 26 28 29 31 32 34 35 36 38 39 40 42 43 44 46 47

19 20 21 21 23 24 26 27 28 30 31 33 34 35 36 38 39 40 42 43 44 46

18 19 19 20 22 23 24 26 27 28 29 31 32 33 35 36 37 39 40 41 43 44

a It is assumed that q = p = 4. n, Number of informative sites observed.

computed by assuming that m/n follows a normal distribution equal to + and variance equal to 3 X j/in. Evolutionary

with mean

Parsimony

The evolutionary parsimony method also applies to only four lines of descent. Since it assumes that all nucleotides are equivalent and makes a distinction only between transitional and transversional substitutions, the representation of sequence data can be simplified as follows.8 Label the four DNA sequences under study by 1, 2, 3, and 4. We work along the sequences from the 5’ to the 3’ end and, at each position, recode the nucleotides in all four sequences in terms of their relationships to the nucleotide in sequence 1 at that position. All nucleotides that are the same as the one in sequence 1 are represented by “ 1,” those related by a

1401

STATISTICAL

TESTS

649

OF PHYLOGENIES

transitional mutation are represented by “2,” and the two possible transversions are represented by “3” and “4” in the order we come across them, if at all. With this notation, any configuration of four nucleotides can be represented by 1 of 36 four-dimensional vectors. For example, the two sets of nucleotides ATCG and GCCA are recoded as vectors 1342 and 1332, respectively. Let us assume that the tree root is known so that the three possible alternative trees can be represented by Fig. 1a-c, though the theory applies equally well without this knowledge. For a set of nucleotide data, let Niikl be the total score of the configuration ijkl over the length of the sequences. Lake8 defined the following three quantities: X= K

133

+

N1234

-

N,233

-

N1134

(4)

Y=

N,3,3

+

N,324

-

N1323

-

N1314

(5)

2 =

N133,

+

Nn42

-

N,332

- NM,

(6)

where X is the score associated with the tree in Fig. la, Y with the tree in Fig. lb, and 2 with the tree in Fig. lc. The reason for subtracting the last two terms in each score is that this would make the total score for an erroneous tree statistically 0. Lake* called, for the tree in Fig. la, A = N1,33 + Nt234 the parsimony-like term and B = Nl233 + N,,,, the background term, and proposed (A - B)2/(A + B) as a x2 with one degree of freedom for testing whether the X score is significantly greater than 0; if the number of sites is small, one should test the equality of A and B by the binomial test.9 He suggested that a tree is “proven” if its associated x2 value is significant whereas the other two x2 values are not. We note, however, since three independent x2 values are tested, the significant level is not (Y, but is approximately 1 - (1 - a)3 = 3a. Lake* has extended the theory to the case of multiple species in each lineage, but the mathematical treatment does not appear to be rigorous. The method of Lake may be more robust against unequal rates of evolution than the maximum parsimony method (see later). However, it utilizes a relatively small part (only transversional differences) of the sequence information and is likely to be less efficient than the maximum parsimony method. Distance

Matrix

Approach

Instead of testing the significance of an inferred phylogeny, it is simpler to test the significance of estimated intemodal distances. As explained 9 R. Holmquist,

M. M. Miyamoto,

and M. Goodman,

Mol.

Bid.

Evol.

5,217

(1988).

650

PHYLOGENETIC

TREES

1401

below, in the case of four taxa, significance of the internodal distance can be taken as significance of the inferred phylogeny. When the number of taxa under study is more than four, the two problems are no longer equivalent, and the requirement of all intemodal distances being significantly greater than 0 seems to be a too stringent test for the significance of the inferred branching order. A simple way to test the significance of internodal distances is to study their variances. Variances of Internodal

Distances

Mueller and AyalaiO proposed computing these variances by the jackknife method, while Nei et al.” derived analytic formulas for the case of a UPGMA tree, i.e., a tree estimated by the unweighted pair-group method.‘* The UPGMA method assumes a constant rate of evolution, but there is now strong evidence that this assumption is often violated.i3 It is therefore desirable to consider an approach that does not make this assumption. Lii4 proposed a two-step approach. The first step is to infer the branching order. One can use the transformed distance method,“-” the neighbor:joining method,i8 or any other method that does not assume rate constancy and that has been shown to be effective for obtaining the correct tree. The second step is to estimate the branch lengths by the least-squares method.L9,20 The variances of intemodal distances are then obtained from the equations derived from the least-squares method. The variances for the cases of four and five taxa are given below. Four Taxa. Suppose that the inferred tree topology is as shown in Fig. 2a; the root of the tree can be determined if one of the four taxa is an outgroup. The branch lengths estimated by the least-squares method are a = 342 + S(43 - 43 + 4, - h4)

(7)

b = d,* - a

(8)

lo L. D. Mueller and F. J. Ayala, Genet. Res. 40, 127 (1982). I’ M. Nei, J. C. Stephens, and N. Saitou, Mol. Biol. Evol. 2,66 (1985). I2 P. H. A. Sneath and R. R. Sokal, “Numerical Taxonomy.” Freeman, San Francisco, California, 1973. I3 W.-H. Li, M. Tanimura, and P. M. Sharp, J. Mol. Evol. 25, 330 (1987). I4 W.-H. Li, Mol. Biol. Evol. 6,424 (1989). Is J. S. Fanis, in “Major Patterns in Vertebrate Evolution” (M. K. Hecht, P. C. Goody and B. M. Hecht, eds.), p. 823. Plenum, New York, 1977. I6 L. C. Kiotz, N. Komar, R. L. Blanken, and R. M. Mitchell, Proc. Nat/. Acad. Sci. U.S.A. 76,4516 (1979). I7 W.-H. Li, Proc. Natl. Acad. Sci. U.S.A. 78, 1085 (1981). I* N. Saitou and M. Nei, Mol. Biol. Evol. 4,406 (I 987). r9 L. L. Cavalh-Sforza and A. W. F. Edwards, Am. J. Hum. Genet. 19,233 (1967). *OR. Chakraborty, Can. J. Genet. Cytol. 19,217 (1977).

[401

STATISTICAL

TESTS OF PHYLOGENIES

a

651

b

FIG. 2. Model trees used in the derivation of the mean and variance of the branch lengths.

c = S(dn+ 4, + 44 + 44) - Wn + 44)

(9)

d = f&, + Kdn + 43 - d,, - 4,)

(10)

e = d,, - d

(11)

and the variance of c is W

= i!d W,,) + W,,) + J’(d,J + W,,) + 2 W,d + 2 l/(46) + 4W,,) + 2W,,)

- f[ W,,) + JW,,) + Wd + W&J1 + $W,,) + $Vdd

+ 2 Wdl (12)

where V(d,) denotes the variance of the estimate of the distance between sequences i and j. For protein sequence data the mean and variance of the number (dJ of amino acid replacements per site between sequences i and j can be estimated by d,=-wln(1

-p/w)

(13)

J’V,) = 1.41 - PML(1- P/w)*I

(14) where w = $$, p is the proportion of different amino acids between the two sequences, and L is the number of residue sites compared. For a pair of extant sequences, Eqs. (13) and (14) are readily applicable. However, the sequences at nodes 5 and 6 do not exist, and thus variances such as V(d,,) and V(d,,) cannot be estimated directly from actual data. However, they can be estimated as follows,” using V(d16) as an example. From Eq. (13), p=

w(1 - pb-)

(15)

652

PHYLOGENETIC

I401

TREES

Since dr6 = u + c, p = w[ 1 - e-(a+C)‘W]; a + c can be obtained from Eqs. (7) and (8). Putting p into Eq. (14) one readily obtains V(d,,). Next, consider nucleotide sequence data. Under the assumption of random substitution among the four types of nucleotides, i.e., the one-parameter model, the mean and variance of the number of substitutions per nucleotide site between sequences i and j are also given by Eqs. (13) and (14) except that now w = $, p is the proportion of different nucleotides between the two sequences, and L is the number of nucleotide sites compared.2’,22 Under the two-parameter model,23 the formulas corresponding to Eqs. (13) and (14) are d,=A+B

(16)

J’(d& = [x2P + z2Q - (xP + zQ)~]/L

(17) where P and Q are, respectively, the proportions of transitional and transversional differences between sequences i and j, A = + In(x) - $ In(y) is the number of transitional substitutions per site, B = 3 In(y) is the number of transversional substitutions per site, x = l/( 1 - 2P - Q), y = l/( 1 - 2Q), and z = (x + y)/2. Note that, unlike Eq. ( 14), Eq. ( 17) involves two parameters, P and Q. The formulas corresponding to Eq. ( 15) are given by Q=j(l p=+[l

-emu)

(18)

-Q-

e-(u+B)]

(19)

Five Tuxu. Suppose that the inferred branching Fig. 2b. The branch lengths are then given by a = td,2 + CA3 b = d,, - a

- 43 + 44 - 6, + d,s- 4,l

c = 4-V,,

+

43)

+

itd,,

d =

+

43)

-

M,,

W,,

order is as shown in w (21)

+ -

&,

+

4s

+

4,)

-

K&

+

44

-

342

(23)

c

e = a(d34 + d,, - d,, - 4,) + Mu

(22)

+ 4s + 4, + 4,) - %s

(24)

f=d3.,-d-e

(25)

g=4, -f

(26)

If two of the five taxa, say, taxa 4 and 5, are known to be outgroups, then 21 T. H. Jukes and C. R. Cantor, in “Mammalian Protein Metabolism” p. 2 1. Academic Press, New York, 1969. 22 M. Kimura and T. Ohta, J. Mol. Evol. 2,87 (1972). 23 M. Kimura, J. Mol. Evol. 16, 111 (1980).

(H. N. Munro, ed.),

c401

STATISTICAL

TESTS

653

OF PHYLOGENIES

one needs to obtain only the variance of c: UC) = &I W,,)

+ w&J

+ 2 WM

+ 2 WA,) + 4 w&,)

+ 2 w63)l

+ 81Vd14)+ w&4) + WA,) + W2,) + 2 Wd + 4 w&&I) + 2 w&4) + 2 m&d

+ 2 m&J

- f[ W,J + wL5) + w,,)l - a w&8) + W74)+ w,,)l + id Jw3.4)+ w34 + 2 Wd If only one or no outgroup variance of e:

+ SUd,*)

(27) is known, then one needs also to obtain the

Ue) = iWk.L) + W&J + W&3)+ W&)1 + ihw@M)+ W24)+ w45) + mm + $w&5) - NW,,) + m&7) - w&3) - WA*) - cu - I/(&)1 -u w&7) + 2 w&3) - wh?) + 2 w4t4) + 2 v&)1 + 6[ I/(d,*) + W**) + 2 w6,) + W64) + w6,N

(28)

Computer programs for the above formulas are available on request by sending a floppy disk to the authors.

Test of Significance of Inferred Phylogeny In the case of three taxa with one or two outgroups the above results can be used to test the significance of an inferred phylogeny. Since in this case there is only one internal branch, i.e., branch c, testing the significance of the internal branch is equivalent to testing the significance of the inferred phylogeny. More explicitly, the null hypothesis is that the true phylogeny is a trichotomy, i.e., the three taxa diverged at the same time. This hypothesis is the same as the hypothesis of c = 0. Therefore, if the estimated c is significantly greater than 0, the null hypothesis of trichotomy is rejected and the inferred branching order can be taken as statistically significant. If one considers unrooted trees, the same argument applies to the case of four taxa with no outgroup because in this case there is also only one internal branch (Fig. 2a). The above formulas for the mean and variance of c were derived under the assumption that the inferred branching order of the three taxa was [( 1,2)3], which means that lineage 3 branched off earlier than did lineages 1 and 2. If, instead, the inferred branching pattern is [( 1,3)2], then the subscripts 2 and 3 in the above formulas should be exchanged, and if the inferred branching pattern is [(2,3)1], subscripts 1 and 3 should be ex-

654

PHYLOGENETIC

TREES

I401

changed. Under the null hypothesis of trichotomy, the three branching patterns [( 1,2)3], [( 1,3)2], and [(2,3)1] occur with equal probability. However, in each data set only one pattern can occur, and only one c can be positive and tested for significant deviation from 0. Regardless of which pattern occurs, the probability for c to assume a particular (nonnegative) value is the same. If the distribution of c is the same as the distribution of 1x1,where x is a standard normal random variate, then the standard statistical test based on the standard normal distribution can be applied. In particular, the estimated c value is significant at the 5% level if the ratio of mean to standard error is at least 2, and is significant at the 1% level if the ratio is at least 2.60. Obviously, the case of four taxa with no outgroup can be treated in the same manner, if one considers unrooted trees. A computer simulationi for the case of three taxa with one outgroup shows that the significance level defined by the above criteria is generally applicable, though the distribution of c is not strictly normal. When the number of taxa under study is more than four, the situation becomes complicated. For example, in the case of five taxa there are two internal branches (Fig. 2b), and the probability for (only) one of them to become by chance significantly greater than 0 at the level of CY= 5% is 2a = 10%. Thus, in this case one cannot reject the null hypothesis that all the internal branches have zero length, i.e., all the taxa diverged at the same time point and form a “star” phylogeny; of course, this null hypothesis can be rejected if cr 5 2.5%. On the other hand, the probability for both internal branches to be by chance significant at the level of cy = 5% is only approximately o2 = 0.0025 (it is approximate because the two intemodal distances are not estimated independently). Hence, the requirement of all internal branches being significant seems to be a too stringent test for the significance of the inferred topology. However, one cannot draw a conclusion about the significance of an inferred tree topology as long as one or more of the internodal distances are nonsignificant; of course, the uncertainty can be restricted to a subset of taxa. Numerical

Examples

The following examples assume that the rate of nucleotide substitution is constant over time and that the observed number of substitutions between each pair of sequences is equal to the expected value. Table II shows the case of three species with an outgroup (taxon 4) (Fig. 2a). In Table II, (Y denotes the proportion of transitional changes; (Y = -f if substitutions occur randomly. The standard error (SE.), which is the square root of V(c), is larger for a! = 3 than for (Y = -f. Since transitional

[401

STATISTICAL

TESTS

OF PHYLOGENIES

TABLE

655

II

STANDARDERROROFTHEESTIMATEOFLENGTH OFBRANCH CINFIG. 2aa c

a

S.E.

c/SE.

L’

0.010

f f 4 3 4 3

0.0046 0.0050 0.0039 0.0043 0.0032 0.0037

2.17 2.00 1.28 1.16 0.31 0.27

850 l,oC@ 2,500 3,000 41,000 54,000

0.005 0.001

a The branch lengths in Fig. 2a are u = b = 0.05, d = 0.05 + c, e = 0.15 - c. cr, Proportion of transitional substitutions. SE. was computed under the assumption that the number of nucleotide sites (L) studied is 1000. L’ is the number of nucleotide sites required for the ratio c/SE. to be 2 or larger (i.e., to be -5% significant). From W.-H. Li, Mol. Biol. Evol. 6, 424 ( 1989).

changes generally occur more often than transversional changes,24*25 the two-parameter model is more realistic than the one-parameter model; the former is applicable to all (x values whereas the latter only to (Y= 4. The ratio c/S.E. can be used to test whether c is significantly greater than 0. A ratio of 2 can be taken as significant as the 5% level. All the values in Table II were obtained for L = 1000. When c = 0.01, the ratio is 2 or larger if CY5 3. Thus, this case requires only a small amount of sequence data to resolve the branching order of the three species. When c is 0.005, then the ratio is considerably smaller than 2; e.g., the ratio is 1.28 for (Y= f. Equations (14) and (17) imply that V(c) is inversely proportional to L. Therefore, for the ratio to increase from 1.28 to 2 the L value should increase from 1000 to L’ = 1000 X (2/1.28)* = 2500, approximately. The other L’ values in Table II were obtained in the same manner. If c is 0.00 1, then the number of nucleotide sites to be studied is rather large, greater than 50,000. Saitou and Nei26 have considered this problem from a different angle. They studied the probability of obtaining the correct topology as 24W. M. Brown, E. M. Prager, A. Wang, and A. C. Wilson, J. Mol. Evol. 18,225, (1982). 25W.-H. Li, C.-I. Wu, and C.-C. Luo, J. Mol. Evol. 21,58 (1984). 26N. Saitou and M. Nei, J. Mol. Evol. 24, 189 (1986).

656

PHYLOGENETIC

TREES

1401

a function of the number of nucleotides studied under various tree-making methods. Table III shows the amount of reduction in V(c) when a second outgroup (taxon 5) is added (Fig. 2b); the V(c) value for the case of one outgroup is denoted by V,(c) and that for the case of two outgroups by V,(c). The reduction increases as V,(c) becomes larger. Since l’(c) is inversely proportional to L, a reduction in l’(c) can also be achieved by increasing L. Is it more advantageous to increase L or to add a second outgroup? The total number of nucleotides sequenced is 4L for the case of one outgroup and 5L for the case of two outgroups, the latter being 1.25 times the former. Therefore, if the same total number of nucleotides is to be sequenced, it is less advantageous to add a second outgroup than to increase L, if V,(c)/I$(c) is smaller than 1.25, whereas the reverse is true if the ratio is larger than 1.25. In Table III the ratio is smaller than or equal to 1.25 for the first six cases and is larger than 1.25 for the last six cases. Since the ratio tends to increase with Vi(c), in general it is more advantageous to increase L if V,(c) is relatively small but more advantageous to add a second outgroup if V,(c) is relatively large. In all the cases in Table III the distances from sequences 4 and 5 to the other three are the same, i.e., g =f in Fig. 2b, so that the fifth sequence is as good a reference as the fourth one. If the fifth is more distantly related to the other three than the fourth TABLE VARIANCE [V,(c)]

OF THE ESTIMATE

III

OF LENGTH

V,(c) a=b

d

0.02

0.025

0.05

0.10

0.055

0.110

f=g 0.05

0.10

0.15

e

c

a

0.025

0.005

+

0.029

0.001

:

0.045

0.005

3 + 3

0.049

0.001

0.040

0.010

0.045

0.005

4 3 + f

4 +

(X lo-‘)

0.072 0.079 0.028 0.033 0.153 0.188 0.104

0.135 0.445 0.577 0.376 0.501

OF BRANCH

c IN

FIG. 2b”

V,(c) (X10-y

0.066 0.07 1 0.023 0.027 0.126 0.150 0.079 0. loo

0.35 1 0.444 0.287 0.374

V,Wv,(4

1.09 1.11 1.22

I .22 1.21 1.25 1.32 1.35 1.27 1.30 1.31 1.34

a a, Proportion of transitional substitutions. V,(c) was obtained under the assumption that the second outgroup (taxon 5 in Fig. 2b) is not available. From W.-H. Li, Mol. Biol. Evol. 6,424 ( 1989).

[401

STATISTICAL

TESTS

OF PHYLOGENIES

657

sequence is, then the reduction in V(c) is expected to be smaller than those shown in Table III. Further, the effect will also be reduced if sequences 4 and 5 are closely related to each other. Bootstrap

Method

Like the jackknife, the bootstrap is a method of resampling the data under study to infer the variability of the estimate.27 This method was introduced into phylogenetic studies by Felsenstein.5 It is not a tree-making method but a statistical method that can be used to evaluate the confidence level of a phylogenetic estimate obtained from a data set by a tree-making method. Suppose that L nucleotides from each of the species under study are obtained and that a certain tree-making method, which can be either a cladistic or a distance matrix method, is used to analyze the data. We randomly resample L sites from the data table with replacement and then reconstruct a tree. This process is repeated a large number (r) of times. We then search for all subsets of species that occur on 95% or more of the r estimated phylogenies. Each of these subsets is considered to be supported in the sense that its alternatives are rejected. A computer program is available from J. Felsenstein. When the number of species involved is more than five, the bootstrap seems to be the best method now available, for it is rather difficult to treat such a case analytically. However, there are difficulties in the interpretation of results obtained by this approach.5 First, the confidence statements are not joint confidence statements: for two subsets each supported at the 95% level, we might have lower than 90% confidence in the statement that they are both present in the true tree. Second, there is the “multiple tests” problem: if there are 20 subsets, then on average one should show significance at the 5% level purely by chance. One way to overcome these difficulties is to use a high level of confidence, say, 99% or even higher. In the literature, the resampling process is usually repeated only 100 times. This is too low; at least several hundred times should be conducted, particularly when many species are involved. Of course, this can be very time consuming, especially if a computation-intense method such as the maximum parsimony method is used. In the case of four species, the above-mentioned difficulties do not occur because each tree contains only two subsets, so that when one of them is determined, so is the other. However, as just noted, the advantage of the bootstrap method is its ability to deal with more than four species. 27B. Efron and G. Gong, Am. Stat. 37,36 (1983).

658 Concluding

PHYLOGENETIC

TREES

1401

Remarks

Problem of Statistical Inconsistency When the degree of sequence divergence is small or moderate (say, 50% or lower), the maximum parsimony method, the transformed distance method, and the neighbor-joining method are likely to be statistically consistent in the sense that the probability for each of these methods to give the correct tree increases with the amount of data. However, when the degree of divergence is high and when the assumption of rate constancy is violated, these methods may not be consistent,28*29 and thus a highly significant tree may in fact be an erroneous one. (This problem can also occur even with equal rates of evolution.“) The evolutionary parsimony method was intended to overcome the effects of unequal rates of evolution. However, this method can also become statistically inconsistent if the transitional or transversional rates are unequal. Indeed, in an application of this method to the small subunit rRNA sequences from human, Drosophila, rice, and Physarum, we obtained a x2 of 4.8 for the tree with human and rice as a sister group, the x2 values for the other two trees being 0.6 and 0.3. The favored tree is obviously erroneous because humans are much more closely related to Drosophila than to rice. The bootstrap approach does not correct the statistical inconsistency problem that is inherent in the tree-making method used.5 Thus, regardless of which ap preach is used to test the significance of a phylogenetic estimate, one should be aware of the possibility of statistical inconsistency, particularly when highly divergent sequences are used. Heterogeneous Data Phylogenetic studies often use sequence data from different DNA regions. If all the regions studied have similar rates of nucleotide substitution, then all the data can be combined together into a single set. If considerable variation in rates exists, however, regions with different rates should be treated separately, particularly when the distance matrix approach is used because the internodal distances are dependent on the rate of evolution. The question then arises as to how to test the significance when the results from different data sets are combined. A simple test z* J. Felsenstein, Cyst. Zool. 27, 401 (1978). 29 W.-H. Li, K. H. Wolfe, J. Sourdis, and P. M. Sharp, Cold Spring Harbor Symp. Quant. Biof. 52,847 (1987). M D. Penny, M. D. Hendy, and I. M. Henderson, Cold Spring Harbor Symp. Quant. Biol. 52, 857 (1987).

1411

PARSIMONY

AFTER

PROGRESSIVE

ALIGNMENT

659

procedure is the inverse x2 method. 31 Suppose that there are k different data sets. Let pi be the significance level (probability) estimated from the ith data set. If the null hypothesis is true, then - 2 In (pi) has a x2 distribution with 2 degrees of freedom and P = -2 i

ln(p,)

i-l

has a x2 distribution

with 2k degrees of freedom.

Acknowledgments This study was supported by National Institutes of Health Grant GM30998. )r R. A. Fisher, “Statistical London, 1932.

Methods for Research Workers,”

[41] Nearest Neighbor Progressively Aligned By RUSSELL

4th Ed. Oliver and Boyd,

Procedure for Relating Amino Acid Sequences

F. DOOLITTLE

and DA-FEI FENG

Introduction Generally speaking, there are two quite different approaches to constructing phylogenies from sequence data: the one matrix based, and the other column-by-column character analysis. The latter procedure is a kind of parsimony approach. Matrix methods, on the other hand, strive for arrangements that most readily accommodate pairwise distances. Both methods have strengths and weaknesses. For example, it has been noted that the tendency for overall optimization inherent in the matrix approach systematically foreshortens the ancient (deeper) branches.’ Nonetheless, matrix methods are usually more sensitive than common ancestor schemes, and, in the case of protein sequences, are much more widely used. Although less popular, column-by-column methods have been used with protein sequences in the past, particularly in some early r M. 0. Dayhoff, C. M. Park, and P. J. McLaughlin, in “Atlas of Protein Sequence and Structure” (M. 0. Dayhoff, ed.), Vol. 5, pp. 7- 16. National Biomedical Research Foundation, Silver Spring, Maryland, 1972. METHODS

IN ENZYMOLOGY,

VOL.

183

Copyright 0 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.

Statistical tests of molecular phylogenies.

STATISTICALTESTSOFPHYLOGENIES [401 645 Acknowledgments The author is grateful for the encouragement and advice received from Dr. Russell Doolittle,...
840KB Sizes 0 Downloads 0 Views