Calculation of evolutionary trees from sequence data.

Proc. Natl. Acad. Sci. USA

Vol. 76, No. 9, pp. 4516-4520, September 1979 Evolution

Calculation of evolutionary trees from sequence data (different rates of evolution/nucleic acid and protein sequences/matrix method)

LYNN C. KLOTZ, NED KOMAR, ROGER L. BLANKEN, AND RALPH M. MITCHELL Department of Biochemistry and Molecular Biology, Harvard University, Cambridge, Massachusetts 02138

Communicated by Paul Doty, May 29, 1979

mutations are represented in the figure by the numbers next to the branches. For sequence data, the most common method for calculating evolutionary trees is the so-called matrix method (3, 4), which is a type of cluster analysis (5). This method relies solely on the sequence differences among pairs of present-day organisms, which may be approximately related to the mutation difference between the organisms (6, 7). The pairwise mutation differences are usually summarized in a difference matrix, D. As an illustration, the difference matrix for the organisms in Fig. la is given by A B C D E A 0 18 24 20 21 B 18 0 18 14 15 D = C 24 18 6 9 0 [1] D 20 14 6 0 5 E 21 15 9 5 0 This matrix contains exact values because it was obtained directly from the hypothetical tree in the figure. Even with exact data in the difference matrix, the matrix method, which uses as its only source of data the difference matrix D, may not determine the correct tree topology or root. This is illustrated in Table 1, in which a brief description of the matrix method is presented by using the difference matrix in Eq. 1 as an example. The resulting wrong topology and wrong root are shown in Fig. Id. The fact that the difference matrix alone cannot determine the correct root is well known (8, 9). The reason is illustrated in Fig. 1 a-c. The tree in Fig. lb is obtained by assuming that the common ancestor X to all the organisms A, B, C, D, and E is located half-way between A and the AB branch point. This tree has a different root from that in Fig. la in that it shows A branching from BCDE first, then B-branching from CDE, etc., whereas the tree in Fig. la shows AB branching from CDE first. However, both trees have exactly the same difference matrix D. Another example resulting in a different root is shown in Fig. ic, in which X is connected between E and the E-CD branch point. Again, this tree has the same difference matrix as those in Fig. 1 a and b. It is clear that any method for calculating evolutionary trees that uses the difference matrix alone may fail to calculate the correct root. A little thought shows that if one could take into account the actual number of mutations of each organism from the common ancestor, one would choose the correct root, Fig. la. We shall show how to do this

ABSTRACT Evolutionary trees are usually calculated from comparisons of protein or nucleic acid sequences. from present-day organisms by use of algorithms that use only the difference matrix, where the difference matrix is constructed from the sequence differences between pairs of sequences from the organisms. The difference matrix alone cannot.define uniquely the correct position of the ancestor of the present-day organisms (root of the tree). Furthermore, methods using the difference matrix alone often fail to give the correct pattern of tree branching (topology) when the different sequences evolve at different rates. Only for equal rates of evolution can the difference matrix (when used with the so-called matrix method) yield exactly the correct topology and root. In this paper we present a method for calculating evolutionary trees from sequence data that uses, along with the difference matrix, the rate of evolution of the various sequences from their common ancestor. It is proven analytically that this method uniquely determines both the correct tree topology and root in theory for unequal rates of sequence evolution. How one would estimate an ancestral sequence to be used in the method is discussed in particular fox the 5S RNA sequences from prokaryotes and eukaryotes and for ferredoxin sequences.

The recent proliferation of protein and nucleic acid sequences, due in part to the development of new sequencing techniques (1, 2), has led to a renewed interest in the evolution of these sequences and the organisms from which they came. We wish to discuss here some problems with the commonly used method (3, 4) (the matrix method) for calculating evolutionary trees from molecular sequence data and to present an alternative method that, in theory, eliminates these problems. Because the analytic proof of our method depends on a knowledge of existing methods, we will briefly review what is necessary from the existing literature, then discuss the problems with the matrix method in some detail. An evolutionary tree representation of nucleic acid or protein sequence differences among several organisms should give the following information: (I) the correct pattern of branching of the various present-day organisms from one another (that is, the correct tree "topology"); (ii) the correct position on the tree of the common ancestor to all the present-day organisms (that is, the correct tree "root"); and (iii) of secondary importance, a reasonable estimate of the number of mutations along each of the tree branches (branch mutations). One type of such a representation is presented in Fig. la for five hypothetical present-day organisms, A, B, C, D, and E, and their common ancestor X. In this representation, the topology is shown by the order in which the branches connect: C and D are most closely related in the sense that they connect first, then CD connects to E, and A connects to B. The root of the tree with the common ancestor X is shown to be between the groups AB and CDE. This pictorial representation may also be represented in line form as ((AB) X ((CD)E)). The number of branch

shortly.

For our specific example of the matrix method, there is one error in topology: organism D is connected to E (Fig. Id) instead of to organism C (Fig. la). Some researchers (3) guard against such topology errors by using the following procedure. After the topology is found by the matrix method, branch mutations (i.e., numbers of mutations along each branch) can be assigned by using D and the derived topology by a method such as that of Fitch and Margoliash (3), for example, as we have done in Fig. Id. (The actual procedure need not concern us here.) From

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U. S. C. §1734 solely to indicate this fact.

4516

Evolution: Klotz et al. B

A

D

C

Proc. Natl. Acad. Sci. USA 76 (1979)

4517

E

52

2

(x (b)

(a)

A

12\

D

BC

E

A

B

C D E

v -0.5

2/

lx IdI

(d)

(c)

FIG. 1. Evolutionary tree representations of the hypothetical organisms A, B, C, D, and E. The numbers along the branches indicate the number of sequence differences. The vertical positions at which the branches join have no relationship to time. The three trees in a, b, and c have exactly the same difference matrix, but have different roots; a has the correct root, b and c do not. (d) The evolutionary tree and branch lengths calculated from the simple application of the matrix method. The dashed circle at the ancestral organism indicates the uncertainty in the actual location between A and B of the ancestor.

these assigned branch mutations, a new difference matrix Co be calculated. One then calculates the squared difference So2 between the real difference matrix D and the one calculated from the assigned branch mutations Co: can

=EE(dij - cO )2[2 q

S02 2 ~~~~[ ij dij in which the sum is taken over all the elements of the D and Co matrices (10). Next, some other topology close to that found by the matrix method is chosen, new branch lengths are assigned, and a new difference matrix Cl is calculated, yielding a new ~~~ d

squared difference,

S12 = Ei Y.j

_d

dij

[3]

After trying many topologies, the correct topology is taken to be the one for which the squared difference is minimized. Although this procedure will yield the correct topology when applied to exact data such as we are considering here, in practice it may be difficult to apply. For a tree with as few as 10 organisms there are more than 2 X 106 rootless topologies (11, 12), a number for which it would be difficult to calculate all topo-

FIG. 2. (a) An evolutionary tree with widely varying numbers of mutations along the branches. We assume this tree has the correct topology for the organisms A, B, C, D, E, F, and G. (b) The topology calculated by the matrix method on the exact difference matrix obtained from a. (c) An evolutionary tree having the same topology and root as a, but with equal rates of evolution.

logies, branch mutations, and squared differences even on a fast computer. If the matrix method always yielded a topology close to the real one, it would be necessary to analyze only a few topologies similar to the one found by the matrix method to arrive at the correct one. However, as Fig. 2 illustrates, the matrix method can yield a topology radically different from the true one. Here, Fig. 2a is the correct topology and root and

Fig. 2b is that determined by the matrix method. Under such circumstances the correct topology might never be found if it is not practical to try all possible topologies. Examination of the evolutionary trees in Figs. 1 a-c and 2a shows that they differ greatly in the rate of mutation from the common ancestor to the present-day organisms. We define X as the vector whose elements are the number of mutations (i.e., rates of evolution) from the common ancestor X to A, B, C, D, and E. For example, for Fig. la X

=

Table 1. Matrix method applied to difference matrix in Eq. 1 Topology being considered

(average sequence difference) First branch point

(AB)

(AC)

(AD)

(18) (24) (20) Second branch point (AB) (AC) (18) (24) Third branch point

(AB) (18)

14]

AB CD E [14 8 10 6 7]

(AE) (BC) (BD) (BE) (CD) (CE) (DE) (21) (18) (14) (15) (6) (9) (5) (A(DE)) (BC) (B(DE)) (C(DE)) (20.5) (18) (14.5) (7.5) (A(C(DE))) (B(C(DE)))

Minimum value 5

7.5

Topology from minimum value

(DE) (C(DE))

(21.67) 15.67 (15.67) (B(C(DE))) Final topology* ((A)X(B(C(DE)))) Briefly, the method works as follows. All the organisms are considered in pairs to see which pair gives the minimum sequence difference to establish the first branch point. Then this pair is considered to be a single entity, and everything is compared pairwise again to obtain the second branch point from the minimum distance. This second branch point will have either the form (X(XX)) or (XX) and (XX). Then the resulting triplet or each of the two pairs, as the case may be, are considered as single entities and the procedure is applied again to determine the third branch point. * See also Fig. ld.

4518


Evolution: Klotz et al.

and for Fig. lb

is 18-5 - (-1) = 14. The converted matrix D' is, therefore, A B C D E X = [6 12 18 14 151

[5]

It can be proven (10, 13) that for equal rates of evolution (i.e., all elements of X equal), the matrix method will yield both the correct topology and the correct root. This result is easily seen by example, as in Fig. 2c, in which is shown an evolutionary tree with the same topology as that in Fig. 2a, but with equal rates of evolution. Using Fig. 2c to construct D and then using the matrix method will yield the correct topology and root. Evidence is beginning to accumulate that sequence evolutionary rates are not equal (14). Also, molecular evolution studies are now being extended to prokaryotes, higher eukaryotes, and unicellular eukaryotes (15). For these very different organisms, there is no reason to accept a priori the assumption of equal evolutionary rates for molecular sequences because of the presumed different evolutionary histories and pressures encountered by the organisms, because of possible variation in the rates of mutation fixation due to difference in the efficiency of DNA repair and replication mechanisms, etc. Our own preliminary studies on evolutionary trees calculated from 5S RNA sequences indicate some large variations in rates of sequence evolution. For example, tomato and human 5S RNAs have fixed approximately 32 and 19 mutations, respectively, from the time of their common ancestor to the present. The discussion above indicates that if we can somehow first correct the difference matrix for different evolutionary rates, we can calculate the correct topology and root by the matrix method or, in fact, by any other reasonable method. In the next section, we develop a method that exactly corrects the difference matrix for different evolutionary rates and therefore allows us to find the correct topology and root.

The method As an illustration of how the method works, we will use the exact data in the difference matrix, Eq. 1, which was taken from the evolutionary tree in Fig. la. An analytic proof for the validity of this method will be presented afterward. Basically, the method involves correcting the difference matrix D into a related matrix D' which allows for the different evolutionary rates of the organisms-i.e., for the number of sequence mutations separating each organism from the common ancestor. The correction is made as follows: first, construct a vector X whose elements are the number of mutations from the ancestral organism, X. This has already been done in Eq. 4 for the present example. The average number of mutations from the ancestor X is x = (14 + 8 + 10 + 6 + 7)/5 = 9. We now construct a new vector X' whose elements are the elements of

A 0 14

B 14 0

C 18 18

D 18 18

E 18 18

18 18

18 18

0 8

8 0

18

18

10

10

10 10 0J

[8]

D' is a matrix that has the same average sequence difference (averaged over the whole matrix) as the real difference matrix D, and D' has also been converted exactly to equal rates of evolution along all the branches. This is shown by the equivalence matrix elements from one major branch of the tree to the other-i.e., dAC = dAD = dAE = dBc = dBD = dBE, and also dCE = dDE. It can be easily verified that equal rates of evolution give equal matrix elements by calculating the difference matrix for the tree in Fig. 2c. Using the matrix method on this corrected difference matrix D' yields the correct topology and root, shown in Fig. la. Compare this result to that (Fig. ld) obtained by using the matrix method on the standard difference matrix D in Eq. 1. The proof of the validity of the method takes a different form from the example above. The proof is carried out algebraically for a tree consisting of sequences from only three organisms, A, B, and C, and their ancestor X. The three-organism case can be generalized to a tree containing N organisms. Case I. In Fig. 3a, the arbitrary branch lengths a, b, c, and y to be used in the proof are defined. The difference matrix is

LA A 0 D =B a+b+y C +c+y

B a+b+y

a+c+y

0

b+c

b+c

0

C

[9]

Note that, depending on the magnitudes of a, b, c, and y, all of the three possible topologies and roots could be calculated by the matrix method. The X vector for this case is A B X =[a b+y

B

A

C

C

[10]

c+Y]. B

A

C

a

d

IX (a)

lox (b)

X minus x:

A BC D E X' = [ 5-11 -3-2]'

[6]

The converted difference matrix Do is found from the following equation for the ijth matrix element

dj =dj - Xj- X.

[7]

For example, the converted difference for organisms A and B

FIG. 3. Evolutionary trees for the proof of the validity of the method for correcting for different evolutionary rates. See text for details.



Also, x=

1 3

[a +b+c+ 2y],

[11]

and X', which is X minus x, is A B C Xo=[ar- b+ y-x c+ y-I]

[12]

or

B

A I

C

(2a-b-c-2y) I(2b-a-c+ y) (2c-a-b+y)J. [13]

Using Eqs. 7, 9, and 12 to get the corrected difference matrix, D', we have A B C 0 =A 2x 2x [4 Do [14] B 2[ -2y + 2 0 C 2x -2y + 2x 0 The important result is that the elements of Do no longer depend on the magnitudes of a, b, and c at all (except in the x term, which appears as the same value, 2i, in every term of the matrix except the meaningless diagonal zero terms). Therefore, applying the matrix method to Do will yield the correct topology

plus root because the matrix method operates by placing together those two organisms with the minimum matrix element (dBc in this case because y > 0 by definition). Note that this correct topology plus root is obtained regardless of the values of a, b, c, and y, that is, regardless of the rates of evolution of the three sequences. The proof is also independent of whether the mutations occur along the various branches evenly in time or not, as a little thought will show. One final comment on this case: we could have used X instead of X' to get D', and then x would not appear in D' at all and the proof would hold in slightly simpler form. The only reason for not using X is "cosmetic," since it is convenient to have D' = D for the case of equal rates of evolution. Case II. Before proceeding to the case of a general tree of N organisms, we need to consider a slightly different threeorganism case. This case is diagrammed in Fig. Sb, in which we do not know the immediate ancestor, Z, to A, B, and C, but know instead an "ancient" ancestor, X. The branch length, d, indicates the number of mutations that occurred between X and Z. For this case it is clear that D is still the same and thus is given by Eq. 9. Now, A B C X=[a+d b+y+d c+y+d],

[15]

x= S [a+b+c+2y]+d,

[16]

3

and X' is again given by Eq. 13. Because both D and X' are the the previous case, D' will be the same, and the previous proof holds. Case III. A tree containing N present-day organisms is pictured in Fig. 3c. From this tree we pick at random any three organisms I, J, and K. These three organisms are isolated from the tree in Fig. 3d, where their various connecting branches have been labeled also. It is clear that the tree for the three organisms is equivalent to a case II tree with d = yi, a = Y2 + i, b = y4 + j, C = k, and y = ys. Thus, the difference matrix elements for these three organisms, dij, dIK, and djK will give the same as

4519

correct topology and root for the branches on which they reside from the matrix method. To complete the proof it is sufficient to say only that because it is true that any three organisms chosen at random from the tree will have the correct topology and root with respect to each other by using the matrix method on D', the whole tree must then have the correct topology and root. Discussion There are two points that require discussion: (i) the method by which one might obtain the ancestral sequence to calculate X in practice and (ii) the effect of real sequence data on our theoretical results. Most important is the first point. For our method to work, a good estimate of the ancestral sequence, or at least the number of mutations to the present-day sequences, is required. As examples, we will discuss two situations in which such data can be obtained. A problem of interest to us is the evolution of DNA sequence organization and chromosome structure. In this connection, we have studied DNA organization in several organisms that may be near the major prokaryote-eukaryote evolutionary splite.g., dinoflagellates (16), blue-green algae (17), yeast (18, 19), and methanobacteria (20). We wish to determine the evolutionary tree topology and root for these organisms with respect to this major split by using 5S RNA sequences. Our procedure is to pick a "basis set" of a small number of typical aerobic prokaryotes and a small number of plants and animals. The basis set is picked so that we already know the topology and root from mainly biological considerations. We can then use the simple version of the parsimony procedure of Fitch (21) to get an estimate of the ancestral sequence. Preliminary results using computer-simulated evolutionary trees indicate that these ancestral sequences calculated from the parsimony operation are surprisingly close to the "real" ancestor. Now, one at a time, the organisms in whose evolutionary position with respect to each other we are interested (dinoflagellates, etc.) can be placed on the basis-set tree by using the D' matrix calculated from the D matrix and the distance from the estimated ancestor. For this method and for another method in which distances from the root are calculated from branch lengths of the basis set, we are now running extensive computer simulations to determine the accuracy of the methods. A second situation in which an ancestor appears to be available is for ferredoxin sequences. Eck and Dayhoff (22) have presented convincing evidence that present-day ferredoxin sequences arose from simple repeating sequences. This simple repeating sequence, or a later derivative of it, should provide an exact ancestor for evolutionary tree calculations. Now we turn to the problem of real data. The proof and examples presented in the previous section used exact numbers of mutations. Although for theoretical purposes this is a useful way to proceed, comparison of any two real sequences will usually yield a number of sequence differences less than the actual number of mutation differences between the two sequences. The reasons for this are that there can be more than one mutation at a sequence position, which would be recorded as only one sequence difference, and that back mutations to the ancestral base at a sequence position would make two or more mutations appear to be none at all. An average correction can be applied to convert the observed number of sequence differences to number of mutations (6, 7). This average correction has, of course, an error associated with it due to statistical fluctuations, so the difference matrix will also carry some error. How these errors affect the confidence with which one can accept a topology and root found from an analysis of real se-

4520


quence data can be found by performing extensive computer simulations, which include the statistical effects of mutation mentioned above, of evolutionary trees with roughly the same branch mutations, topology, and root of the real sequences. From the present-day sequences resulting from the simulations, one attempts to back calculate the topology and root used for the simulation. The frequency with which the correct topology and root can be back calculated is a measure of confidence. Without such extensive simulations, especially when there may be large differences in evolutionary rates, it is difficult to see how calculations of evolutionary trees can be accepted as correct without question. In summary, we have presented a method for accounting for different evolutionary rates when calculating evolutionary trees. In fact, we can see no better way, in theory, to correct exactly for different evolutionary rates than the correction of the difference matrix as embodied in Eq. 7. In practice, the drawback, of course, is the need for an ancestral sequence or at least the number of mutations from the position of the root to the present-day organisms. Although we have discussed two cases in which a good estimate of the data can be obtained, there are many situations in which it probably cannot. Thus, the method may be of limited usefulness. However, when it can be used, it should provide more believable evolutionary trees and information on the constancy of the rates of molecular sequence evolution. We thank Alan Hinnebusch for a critical reading of the manuscript. This research was supported by National Science Foundation Grant PCM 7611158.

Maxam, A. M. & Gilbert, W. (1977) Proc. Natl. Acad. Sci. USA 74,560-564. 2. Donis-Keller, H., Maxam, A. M. & Gilbert, W. (1977) Nucleic Acid Res. 4, 2527-2538. 1.

Proc. Natl. Acad. Sci. USA 76 (1979) 3. Fitch, W. M. & Margoliash, E. (1967) Science 155, 279-284. 4. Dayhoff, M. O., Park, C. M. & McLaughlin, P. J. (1972) in Atlas of Protein Sequence and Structure, ed. Dayhoff, M. 0. (Nati. Biomed. Res. Found., Washington, DC), Vol. 5, pp. 12-13. 5. Sokal, R. R. & Sneath, P. A. H. (1963) Principles of Numeric Taxonomy (Freeman, San Francisco), pp. 169-203. 6. Dayhoff, M. O., Eck, R. V. & Park, C. M. (1972) in Atlas of Protein Sequence and Structure, ed. Dayhoff, M. 0. (Natl. Biomed. Res. Found., Washington, DC), Vol. 5, pp. 89-99. 7. lizuka, M., Kazushige, I. & Matsuda, H. (1975) J. Mol. Evol. 5, 249-254. 8. Sokol, R. R. & Sneath, P. A. H. (1963) Principles of Numeric Taxonomy (W. H. Freeman, San Francisco), pp. 227-229. 9. Farris, J. S. (1971) Annu. Rev. Ecol. & Sys., pp. 277-302. 10. Moore, G. W., Goodman, M. & Barnabas, J. (1973) J. Theor. Biol. pp. 38, 423-457. 11. Cavalli-Sforza, L. L. & Edwards, A. W. F. (1967) Am. J. Hum. Genet. 19, 233-257. 12. Fitch, W. M. (1977) Am. Nat. 111, 223-257. 13. Moore, G. W. (1971) Dissertation (North Carolina State Univ.,

Raleigh, NC). 14. Dobzhansky, T., Ayala, F. J., Stebbins, G. L. & Valentine, J. W. (1977) Evolution (Freeman, San Francisco), pp. 308-313. 15. Hori, H. & Osawa, S. (1979) Proc. NatI. Acad. Sci. USA 76, 381-385. 16. Allen, J. R., Roberts, T. M., Loeblich, A. R., III & Klotz, L. C. (1975) Cell 6, 161-169. 17. Roberts, T. M., Klotz, L. C. & Loeblich, A. R., III (1977) J. Mol.

Biol. 110,341-361. 18. Lauer, G. D. & Klotz, L. C. (1975) J. Mol. Biol. 95,309-326. 19. Lauer, G. D., Roberts, T. M. & Klotz, L. C. (1977) J. Mol. Biol. 114,507-526. 20. Mitchell, R. M., Loeblich, L. A. & Klotz, L. C. (1979) Science 204, 1082-1084. 21. Fitch, W. M. (1971) Syst. Zool. 20,406-416. 22. Eck, R. V. & Dayhoff, M. 0. (1966) Science 152,363-366.

The inference of evolutionary trees from molecular data.

Evolutionary insights into host-pathogen interactions from mammalian sequence data.

Behavior Trees for Evolutionary Robotics.

phyC: Clustering cancer evolutionary trees.

Phylogeny and evolutionary histories of Pyrus L. revealed by phylogenetic trees and networks based on data from multiple DNA sequences.

A Beta-splitting model for evolutionary trees.

Novel non-parametric models to estimate evolutionary rates and divergence times from heterochronous sequence data.

Using multiple relaxed-clock models to estimate evolutionary timescales from DNA sequence data.

Accelerated isotope fine structure calculation using pruned transition trees.

A well-resolved phylogeny of the trees of Puerto Rico based on DNA barcode sequence data.

Extremely divergent histone H4 sequence from Trypanosoma cruzi: evolutionary implications.

Inferring evolutionary histories of pathway regulation from transcriptional profiling data.

The early evolutionary history of belemnites: new data from Japan.

SICLE: a high-throughput tool for extracting evolutionary relationships from phylogenetic trees.

The evolutionary history of the embiotocid surfperch radiation based on genome-wide RAD sequence data.

City trees: Urban greening needs better data.

Relationship estimation from whole-genome sequence data.

A computer program for processing data from amino acid analysis and for the calculation of molecular weights from those data.

SubClonal Hierarchy Inference from Somatic Mutations: Automatic Reconstruction of Cancer Evolutionary Trees from Multi-region Next Generation Sequencing.

Application of the MAFFT sequence alignment program to large data-reexamination of the usefulness of chained guide trees.

Calculation of sequence divergence from the thermal stability of DNA heteroduplexes.

Molecular Formula Identification Using Isotope Pattern Analysis and Calculation of Fragmentation Trees.

Calculation of the rate of utilization of albumin-bound free fatty acids from specific radioactivity data.

PCA features.