The inference of evolutionary trees from molecular data.

Comp. Biochem. Physiol. Vol. 102B, No. 4, pp. 643-659, 1992. Printed in Great Britain

0305-0491/92 $5.00 + 0.00 © 1992 Pergamon Press Ltd

MINI REVIEW

THE INFERENCE OF EVOLUTIONARY TREES FROM MOLECULAR DATA

TIMOTHY J. BEANLAND and CHRISTOPHER J. HOWE*

Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QW, U.K.

(Received 16 January 1992)

Abstract. 1. Procedures for multiple alignment of sequence data, subsequent phylogenetic inference, and testing of the trees derived are presented.
2. The assumptions underlying different approaches and the extent to which they are valid are discussed.

The construction of evolutionary trees from molecular data ("phylogenetic inference") is a large and ever-expanding field (for recent reviews see Felsenstein, 1982 and 1988; Bishop et al., in Bishop and Rawlings, 1987; Swofford and Olsen in Hillis and Moritz, 1990). Rather than reiterate these excellent general articles, our objective here is the more modest one of providing an introduction for the non-specialist. We have in mind (as had Friday, 1989) the biologist in possession of a set of sequences (e.g. of a certain gene or protein from a range of organisms) who wants to know how they might be related. One aim of this review is to outline the practical steps (including information on computer programs) in going from sequences to evolutionary trees (phylogenies). In our experience many biologists see no need to pursue inference further than this and are prepared to adopt an uncritical "handle-turning" approach. This attitude is compounded by the inaccessibility of much of the theoretical literature on inference, and is reflected in many of the phylogenies inferred by bench scientists. Our second objective is therefore to give sufficient background to the practicalities of inference to allow a more intelligent use of the programs and interpretation of results. The reader should then be equipped to form the necessary bridge between the hands-on program documentation and the theoretical review.

OVERVIEW

An overview of the steps involved in going from molecular data to evolutionary trees is given in Fig. 1; programs are listed with notes and references in Tables 1 and 2. Broadly, there are three phases: (i) sequence alignment to generate the dataset for inference; (ii) phylogenetic inference itself; and (iii) testing of the preferred tree to see how strongly it is favoured over others. The last is usually an option within an inference program and as such, although logically distinct from finding the optimal phylogeny, is not shown separately in the figure.

*To whom correspondence should be addressed.

MULTIPLE SEQUENCE ALIGNMENT

As a starting point, we assume the would-be treebuilder possesses a set of ~5-20 sequences of ~300 amino acids or ~1 kb of DNA or RNA. These figures are only guidelines, many programs being able to deal with substantially larger datasets (more and longer sequences).

Are the sequences homologous?

The first problem to address is whether or not the sequences are homologous, i.e. related by descent from a common ancestor, a prerequisite for inferring a phylogeny. The experimental method used to determine the sequences (e.g. nucleotide hybridisation, immunochemical cross-reactivity) may suggest that they are related, but a test for homology is still a useful exercise. We recommend Doolittle's (1986) book as excellent background reading for this problem. Note that the definition of homology is a qualitative one: either two genes (proteins) are homologous or they are not. The quantitative variables in this area are percent identity (fraction of identical residues shared) and percent similarity (identical amino acid residues plus conservative substitutions). The use of "percent homology" to mean percent similarity is not to be recommended.

Testing the hypothesis that two sequences are homologous reduces to asking whether they show a degree of similarity (or identity) greater than would arise by chance or functional convergence. One way of testing for this is to carry out a dot matrix plot (e.g. using the programs SIP, DOTMATRIX or COMPARE, Table 2). If one has a choice, these are best carried out on amino acid rather than nucleotide data, as the background noise will almost always be less (Doolittle, 1986). If the two sequences share extensive regions of similarity over and above background noise or the short stretches (e.g. an active site) that characterise functional convergence, a hypothesis of homology can reasonably be taken as proven.
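To make the idea of a dot matrix plot concrete, the following minimal Python sketch marks a dot wherever two sequences match. It is an illustration only and is not the algorithm of SIP, DOTMATRIX or COMPARE; the function names, the `window` and `min_matches` parameters (a simple noise filter) and the example peptides are all assumptions introduced here.

```python
def dot_matrix(seq_a, seq_b, window=1, min_matches=1):
    """Return a grid of booleans; cell [i][j] is True when the windows
    starting at seq_a[i] and seq_b[j] share at least min_matches
    identical residues at corresponding positions."""
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = []
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append(matches >= min_matches)
        rows.append(row)
    return rows


def render(matrix):
    """Draw the grid as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row)
                     for row in matrix)


if __name__ == "__main__":
    # Hypothetical peptides sharing the segment "GPFY".
    print(render(dot_matrix("FDFWVGPFYV", "DSQLGPFYL", window=3, min_matches=3)))
```

Regions of extended similarity show up as diagonal runs of dots; raising `window` and `min_matches` suppresses isolated chance matches, which is one simple way of reducing background noise.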


[Fig. 1 flow chart: PRIMARY DATA (immunochemical or hybridization data; unaligned sequences) -> TEST FOR HOMOLOGY (MULTALIGN, RDF2) -> MULTIPLE ALIGNMENT (MULTAL, MULTALIGN, MSA) -> ALIGNMENT -> INPUT FOR INFERENCE (PAM tables) -> INFERENCE, from a distance matrix (NEIGHBOR, UPGMA, FITCH, KITSCH), by parsimony or compatibility (PAUP, HENNIG, DNAPARS, PROTPARS, DNACOMP) or by maximum likelihood (DNAML, Bishop-Friday ML) -> PHYLOGENY.]

Fig. 1. Overview of the steps required in construction of evolutionary trees from molecular data. The scheme shows the overall flow of things rather than every specific detail, and includes only some of the programs available (for an exhaustive list see Table 2).
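The quantitative shuffling test of homology described in this section can be sketched in a few lines of Python. This is a hedged illustration, not the procedure implemented in MULTALIGN or RDF2: as a stand-in for a proper alignment score it uses the similarity ratio from Python's standard `difflib.SequenceMatcher`, and the function names, the number of shuffles and the fixed random seed are assumptions made for the example.

```python
import random
import statistics
from difflib import SequenceMatcher


def similarity(seq_a, seq_b):
    """Stand-in similarity score in [0, 1] (not a true alignment score)."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def shuffled(seq, rng):
    """Random permutation of a sequence; composition is preserved."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def shuffle_test(seq_a, seq_b, n_shuffles=100, seed=0):
    """Compare the real score with scores for shuffled versions.

    Returns (real score, mean of shuffled scores, their standard
    deviation, and the number of S.D. by which the real score exceeds
    the shuffled mean). Raises ZeroDivisionError if every shuffled
    score is identical.
    """
    rng = random.Random(seed)
    real = similarity(seq_a, seq_b)
    null = [similarity(shuffled(seq_a, rng), shuffled(seq_b, rng))
            for _ in range(n_shuffles)]
    mean = statistics.mean(null)
    sd = statistics.stdev(null)
    return real, mean, sd, (real - mean) / sd
```

Under the rule of thumb quoted in the text, a real score lying 3 S.D. or more above the shuffled mean would be taken as evidence of homology, provided the shuffled scores are approximately normally distributed.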

A quantitative "shuffling" test along similar lines (Doolittle, 1981) is to carry out an alignment of the sequences and compare the percent similarity (identity) for these with the same figures for aligned random permutations of the two sequences (programs MULTALIGN and RDF2). Note that simply generating a pairwise alignment for the two unscrambled sequences (SIP, GAP, etc.) is not a test of homology as these programs will align any two sequences, homologous or not. In the shuffling test, the greater the percent similarity (identity) of the aligned "real" sequences is over the jumbled ones then the greater is the likelihood that the former are homologous. If the percent identities for pairwise comparisons of randomized sequences are a population with a normal distribution, then two unshuffled proteins whose alignment score is 3 S.D. or more greater than the mean for the

Table 1. Main features of the most commonly used molecular inference packages and programs Package

Program

PAUP (3.0)

Hennig86 (1.5)

Method Amino acid and DNA parsimony Operator invariants Amino acid and DNA parsimony

Model Pattern Rate

Rooting

Errors

Reference

?

-

MGPD, Outgroup

Bootstrap

Swofford, 1990

TVS

Variable

Outgroup

Lake, 1987a

?

-

Outgroup

Farris, 1988

?

-

Outgroup

?

-

Outgroup

Felsenstein, 1991

PHYLIP (3.4)

PROTPARS DNAPARS/ DNAPENNY

DNACOMP

Amino acid parsimony Nucleotide parsimony

Nucleotide compatibility DNAINVAR Operator invariants DNADIST Estimates dist. matrix KITSCH Phenogram from dist. matrix FITCH Phenogram from dist. matrix NEIGHBOR Neighbour-joining and UPGMA DNAML Nucleotide ML Bishop-Friday Nucleotide ML

K-H K-H, DNABOOT

?

-

Outgroup

SEQBOOT

TVS

Variable

Outgroup

χ2 test

J-C, K2P, DNAML

BF, tsn/tvn J-C

Lake, 1987a

Bootstrap Uniform

Ultrametric

via DNADIST

Variable

Outgroup

via DNADIST

Variable Uniform Variable

Outgroup Ultrametric Outgroup

SEQBOOT SEQBOOT K-H

Uniform

Ultrametric

Saitou and Nei, 1987

Bishop and Friday, 1985

In most cases, the packages contain programs or options for non-molecular data that are not listed here. The tests listed for tree robustness are not exhaustive, e.g. SEQBOOT can in principle be used with all PHYLIP programs. The version numbers given are those for which the data listed apply, not necessarily the most recent releases. Abbreviations: '?' and '-' reflect the difficulty in relating cladistic methodologies to the Markov framework. '?' means that the pattern for cladistic methodologies is more complex than can be readily incorporated into a table. In contrast, '-' under "Rate" for these programs indicates that they have no time structure. TVS, a model of transversional symmetry; MGPD, rooting at the midpoint of the greatest patristic distance; K-H, pairwise Kishino-Hasegawa testing; J-C, the Jukes-Cantor model; K2P, the Kimura two-parameter model; ML, maximum likelihood. 'DNAML' under the pattern heading for DNADIST indicates the same model as the DNAML program, for which 'Variable' in the same column means the transition probabilities can be varied (by default being specified by the observed mean base frequencies in the dataset) and 'tsn/tvn' that the transition/transversion ratio can also be selected as desired.

General comments: PHYLIP is a package including all types of inference program, whose main strength is its statistical framework: Version 3.4 allows one to bootstrap and delete-half jack-knife any program (using SEQBOOT), although for some programs (DNAML) the time to do this is likely to be prohibitive. Pairwise testing is available where appropriate, and input format is standardized across the package to allow results from different programs to be compared. PHYLIP versions prior to 3.4 are non-interactive and notoriously cumbersome to use. Parsimony programs within the package are, compared with PAUP, also unsophisticated, reflecting Felsenstein's lack of esteem for cladistic methodologies.
In contrast, DNAML is arguably the best general nucleotide ML program released. Most programs in PHYLIP search topologies by a branch-swapping algorithm (exceptions: NEIGHBOR and DNAPENNY, the latter being DNAPARS with a branch and bound search). PAUP is recognised as state-of-the-art for parsimony methods and is extremely user-friendly. There is much flexibility in using both the nucleotide and protein parsimony programs, with the step differences between nucleotides and amino acids being selected as required. This allows, for example, a transition/transversion ratio other than unity to be set for nucleotide parsimony. Rooting by the MGPD method is simply placing the root at the midpoint of the longest edge. The search algorithms for PAUP include options for branch-swapping and branch and bound searches.


Table 2. Summary of sequence analysis software cited, other than inference programs

Package    Program            Function                                             Reference
           RDF2               Randomization test of homology                       Pearson and Lipman, 1988
AMPS       MULTALIGN          Randomization test of homology, multiple alignment   Barton and Sternberg, 1987
GCG        COMPARE/DOTPLOT    Dot matrix plots                                     Devereux et al., 1984
GCG        GAP                Pairwise alignment
STADEN     SIP                Dot matrix plots                                     Staden, 1982
NBRF       DOTMATRIX          Dot matrix plots                                     George et al., 1982
NBRF       RELATE             Randomization test of homology
           MSA                SP-multiple alignment                                Lipman et al., 1989
CLUSTAL    CLUSTAL            Multiple alignment                                   Higgins and Sharp, 1988
           MULTAL             Multiple alignment

Programs are listed in the rough order they appear in the text. Functions shown are only those relevant to this review: many packages (especially GCG and STADEN) contain many more sequence analysis facilities.

jumbled sequences can be taken as homologous (Doolittle, 1981). (As a rule of thumb, in such a case it is likely that two proteins of >100 amino acids will have at least 25% identity. For nucleotide sequences, the figure is heavily dependent upon the base compositions.) As this value decreases (amino acid identities 15-20%), distinguishing between a highly divergent pair of homologous proteins and a pair of non-homologous proteins showing some functional convergence becomes more difficult. In cases where one cannot establish homology, it is best to omit the sequences from the inference process altogether (Olsen, 1988).

Multiple sequence alignment

Having established that we are dealing with a family of proteins or genes, the next problem is generating a multiple sequence alignment. This is an important step in phylogenetic inference and worth careful consideration. In an ideal world, alignment and inference would be carried out simultaneously because the sequences are a product of an evolutionary process, related by a unique tree topology and a series of substitution events (Sankoff et al., 1976; Sankoff and Cedergren in Sankoff and Kruskal, 1983; Bishop and Thompson, 1986). So far, this has proven computationally tractable only for short sequences (e.g. the ~120 nt 5S RNA: Hogeweg and Hesper, 1984; Hein, 1990) so for our imaginary dataset, alignment and inference are separate steps (Fig. 1). Alignment is then based upon maximising the number of matched (identical or similar) residues, an approach that though tractable has no clear foundation in the theory of molecular evolution (Bishop and Thompson, 1986; Pearson and Lipman, 1988).

The principles of multiple sequence alignment can be illustrated by considering lining up just a pair of sequences. These methods were developed by Needleman and Wunsch (1970) and Sellers (1974) and the results are hence known as "NWS" alignments. Central to the NWS approach is a similarity matrix containing the scores for matching pairs of residues. For amino acids these scores might be 1.0 for identities (except for cysteines: 2.0) and zero for mismatches (the so-called "unitary" matrix; Doolittle, 1981) or more typically would be on a sliding scale with mismatches scored on physico-chemical similarity, differences in implied DNA substitutions (Sellers, 1974) or empirically observed substitution frequencies (Dayhoff et al., in Dayhoff, 1978). Substitution in vivo is likely to reflect both genetic effects (e.g. residues related by single transitions exchanging most readily) and functional constraint, although these will not always act in tandem. For instance, less than half of the 75 possible single nucleotide substitutions that cause changes lead to conservative amino acid replacements (Doolittle, 1986). For all but closely related proteins (>85% identity) the semiempirical log-odds PAM-250 matrix (Dayhoff et al., in Dayhoff, 1978) has been found to be preferable (Feng et al., 1985 but see also Altschul, 1991) and is indeed the default for most alignment programs (GAP, CLUSTAL, MSA, FASTA). For a comparison of the different schemes, see Feng et al. (1985).

For alignment of nucleotide sequences, some programs (GAP) assume a unitary matrix, whilst others score matches 1.0, mismatches related by a transition (A-G, C-T) as 0.5 and those by a transversion (A-C, A-T, C-G, G-T) zero. Such a scheme is an option within CLUSTAL. If one has both DNA and amino acid sequence data from the same genes (proteins) (e.g. DNA data from a sequencing project reliably translated), then it may be best not to run a nucleotide alignment program but instead to line up the proteins and then manually fit the DNA alignment to correspond. This approach has the advantage that one can infer trees from both datasets and compare these, knowing that any inconsistencies are not the result of differences in alignments.

Generating a NWS pairwise alignment raises the possibility of introducing gaps (Altschul, 1989) into either sequence to maximise an aggregate similarity score, defined as "rewards" for matches (as in the similarity matrix) less penalties for gaps.
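As a hedged sketch of the aggregate score just described, the Python below computes the best global alignment score for two nucleotide sequences using the 1.0 / 0.5 / 0 scheme for matches, transitions and transversions, and charges each gap a fixed weight plus a length-dependent weight. It is an illustration under stated assumptions, not the code of GAP or CLUSTAL: the function names and default weights are invented for the example, and the simple recurrence that tries every gap length is chosen for clarity rather than speed.

```python
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def nucleotide_score(x, y):
    """1.0 for a match, 0.5 for a transition, 0.0 for a transversion."""
    if x == y:
        return 1.0
    if frozenset((x, y)) in TRANSITIONS:
        return 0.5
    return 0.0


def gap_cost(length, gap_weight, gap_length_weight):
    """Total cost of one gap: fixed weight plus a per-residue weight."""
    return gap_weight + gap_length_weight * length


def global_alignment_score(seq_a, seq_b, gap_weight=1.0,
                           gap_length_weight=0.5, score=nucleotide_score):
    """Best aggregate score (match rewards less gap penalties) over all
    global alignments of seq_a and seq_b."""
    n, m = len(seq_a), len(seq_b)
    neg_inf = float("-inf")
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            value = neg_inf
            if i > 0 and j > 0:  # align residue i with residue j
                value = best[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1])
            for k in range(1, i + 1):  # gap of length k in seq_b
                cost = gap_cost(k, gap_weight, gap_length_weight)
                value = max(value, best[i - k][j] - cost)
            for k in range(1, j + 1):  # gap of length k in seq_a
                cost = gap_cost(k, gap_weight, gap_length_weight)
                value = max(value, best[i][j - k] - cost)
            best[i][j] = value
    return best[n][m]
```

For example, with the default weights `global_alignment_score("ACGT", "ACT")` is 1.5: three matches less one gap of length one costing 1.0 + 0.5. Setting `gap_length_weight=0` scores all gaps the same irrespective of length.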
Alignment programs often require that one specifies a gap weight (penalty per gap independent of length) and a gap length weight, the latter to determine how heavily long gaps are penalised. The total cost per gap is then (gap weight) + (gap length weight) × (gap length): setting the length weight to zero will score all gaps the same irrespective of length. It is possible that the mechanism for insertion/deletion of short sections of DNA (errors in replication) differs from that for longer ones (unequal cross-over; Nei, 1987), although, in the absence of information on the


frequency with which insertions and deletions ("indels") occur in vivo, accepting the gap default settings in a program seems sensible. It is, however, worthwhile seeing how different values affect the result, as regions of the alignment that are heavily dependent upon the gap weights should be discarded from inference (Olsen, 1988). Whatever final alignment is decided upon, the temptation to tinker with it by manually introducing extra gaps is also best resisted (Doolittle, 1986).

Generalising NWS principles to multiple sequence alignment poses some problems as pairwise alignments of a group of proteins may be mutually inconsistent: alignment of A with B and of B with C may be incompatible with that of A with C (Murata et al., 1985). There are two ways round this problem. One is to generate the multiple alignment by carrying out sequential pairwise alignments. In some programs of this type (e.g. MULTALIGN), the most similar pair are aligned first and a consensus of these defined as a frequency distribution of amino acids at each site. The sequence most similar to the consensus is then aligned against this, a new consensus defined, and so on (Feng and Doolittle, 1987; Taylor, 1988). Programs using the sequential clustering algorithm (CLUSTAL, MULTAL, MULTALIGN) run rapidly and can handle many sequences. A more sophisticated alternative is to define the best multiple alignment as the one that simultaneously maximises the weighted sum of all the pairwise scores (hence SP-alignment for Sum of Pairs; Altschul, 1989). Programs of this type (MSA: Lipman et al., 1989; Bacon and Anderson, 1986) are less widely used and will take longer to run, though not excessively so for around 10 sequences. In our experience, the differences between the output from sequential clustering and SP-alignment programs are often trivial for moderately diverged proteins (~30-50% identity).

Which sequences to use for inference?

If in possession of protein and coding DNA sequence from the same genes, it makes sense to infer trees from both and compare results. A consideration of sampling errors would argue for preferring the DNA (the dataset will be three times longer), but more subtle systematic errors may mean that the amino acid sequences are more reliable. These include variation in composition of the sequences ("compositional bias"), likely to be more extreme in DNA than the corresponding protein because many nucleotide substitutions will be silent.

The products of most alignment programs are not suitable for feeding direct to an inference program. We are here not thinking of problems of format (although these will often be present) but rather of regions of the alignment that are best omitted from the inference process in order to prevent artefacts when generating trees. Trivially, it will usually be necessary to remove overhanging ends of sequences so that all are of the same aligned length. Less obviously, and depending on the nature of the problem in hand, it is necessary to look carefully at the alignment for potential problems in the form of long gaps common to two or more sequences. This problem is clearly not relevant to those inference programs (Bishop-Friday ML, DNAINVAR) which do not allow gaps. Where a gapped dataset is permitted, long indels shared by two or more sequences are likely to mislead inference (unless downgraded by weighting: see below) because each gap site shared by two sequences will be interpreted by the inference program as a common residue. Morden and Golden (1989) report a case where a single seven-amino-acid indel was sufficient to overwhelm inference from all the remaining sites in a 32 kDa protein. Figure 2(a) shows some partial amino acid data (Beanland, 1990) that contain several long indels in two or more of the six sequences. Left in, this feature would cause the first four sequences to group, swamping any effect from the remaining residues. Unless one regards lengthy indels as overwhelmingly important indicators of phylogeny (Meyer et al., 1986) then inference in this case is best carried out using only the boxed regions indicated (Fig. 2b). Some workers (e.g. Lockhart et al., 1992b) take a more extreme view and would trim even these regions to include only blocks bounded by residues that are physico-chemically conserved in all sequences (Fig. 2c). As a general point, when deciding which regions to include in a dataset for inference, one should be aware of the possibility of indel-related artefacts, compare results from different datasets, and remember that here (if not elsewhere) omission is preferable to commission (Olsen, 1988; Cedergren et al., 1988).

INFERENCE

Viewed from the "handle-turning" perspective, inference programs all appear fairly similar. One feeds in a multiple sequence alignment or a distance matrix derived from an alignment and then waits. The output consists of a tree with the sequences (species) at the tips. The tree has a topology (the way in which the sequences are clustered) and it has branch lengths (the distances between nodes or between nodes and tips). If the program is a modern one, the tree will resemble Fig. 3(a) in which branch lengths are the horizontal distances only (vertical ones being arbitrary), rather than Fig. 3(b) in which branch lengths are variously either the distance along the branch or its horizontal component. The tips may or may not line up. The branch lengths are related not necessarily to geological time but to some measure of "evolutionary distance" or the number of amino acid or nucleotide substitutions. Calibration of a tree in geological time requires extrinsic information, e.g. from the fossil or geochemical record (Wilson et al., 1987; Marshall, 1990). The tree may have a direction of "growth" in time (i.e. be rooted, as Fig. 3b) or not (unrooted, Fig. 3a), equivalent to saying that the ancestor-descendant relationships are defined or undefined respectively. Some programs automatically produce rooted trees, others allow this as an option if the most anciently diverging sequence (the "outgroup") is known beforehand. Many programs also allow one to test how much confidence the data allow in the tree, i.e. how well it is supported over other topologies. All this is necessary but far from sufficient knowledge for an intelligent use of inference programs. It

[Fig. 2, panels (a)-(c): multiple alignment of partial amino acid sequences CfxaurL, CfxaurM, RpsvirL, RpsvirM, Spin D1 and Spin D2, alignment positions 51-150; in panels (b) and (c) the regions retained for inference are boxed.]

Fig. 2. Deciding which regions of a multiple sequence alignment to use for inference. Figure 2(a) shows some partial amino acid data aligned (Beanland, 1990) using MSA. Regions suitable for inclusion in an inference program are boxed in Figs 2(b) and (c). Details are given in the text.
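One simple, mechanical starting point for choosing regions of an alignment is to keep only runs of columns in which no sequence has a gap. The Python sketch below illustrates this under stated assumptions: the function names, the `min_length` threshold and the toy alignment are hypothetical, and the rule is cruder than the boxed regions of Fig. 2, which were chosen by inspection.

```python
def gap_free_columns(alignment, gap_chars="-."):
    """Indices of columns in which no sequence carries a gap symbol."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [col for col in range(length)
            if all(seq[col] not in gap_chars for seq in alignment)]


def contiguous_blocks(columns, min_length=1):
    """Group sorted column indices into half-open (start, end) blocks,
    dropping blocks shorter than min_length."""
    blocks = []
    start = prev = None
    for col in columns:
        if start is None:
            start = prev = col
        elif col == prev + 1:
            prev = col
        else:
            if prev - start + 1 >= min_length:
                blocks.append((start, prev + 1))
            start = prev = col
    if start is not None and prev - start + 1 >= min_length:
        blocks.append((start, prev + 1))
    return blocks


def extract(alignment, blocks):
    """Concatenate the chosen blocks for every sequence."""
    return ["".join(seq[s:e] for s, e in blocks) for seq in alignment]


if __name__ == "__main__":
    toy = ["AC--GT", "ACTTGT", "A-TTGT"]  # hypothetical alignment
    kept = contiguous_blocks(gap_free_columns(toy), min_length=2)
    print(kept, extract(toy, kept))
```

Comparing trees inferred from datasets trimmed with different `min_length` values is one way of checking whether a result depends on the regions retained.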

becomes necessary to understand at least some of the theory of inference for two reasons. One is to allow an informed choice to be made as to which program to run: they make different assumptions, use different algorithms and are not guaranteed to produce the same result from a given dataset (e.g. Hedges et al., 1990). The second reason is that it is important to ask if the assumptions behind a method (the "model") are appropriate to the specific application. Even with the "best" inference programs it is easy to obtain (and publish) a phylogeny for which there is strong statistical support (i.e. that is "robust") but which is meaningless because the basic assumptions behind the method used were incorrect. The recent increase in the use of

tests for tree robustness (sample errors) is welcomed, but should now be followed by a rigorous examination of basic principles, i.e. an analysis of systematic errors.

We address inference by first outlining the three major types of approach (cladistic methods, distance ones, maximum likelihood). This is done at a superficial level that will give an idea of the philosophy behind each but not allow the judgements described above to be made. Discussion of the assumptions and properties of methods in the second part of this section allows this but requires looking at methods in a slightly different way. We then discuss more general issues before reviewing some of the available inference software.

[Fig. 3: tree diagrams, panels (a) and (b), with tips labelled A to E.]

Fig. 3. Different representations of evolutionary trees. Figure 3(a) shows the preferred format for presenting a tree, in which branch lengths are the horizontal distances only. The tree shown happens to be unrooted and have noncontemporaneous tips. The tree in Fig. 3(b) is a representation sometimes seen in which branch lengths are either the distances along branches, or the horizontal components of these. This should be specified in the legend wherever such a tree is given. This tree is of different topology and branch lengths to Fig. 3(a) and is rooted.

"PHILOSOPHICAL" PRINCIPLES BEHIND INFERENCE METHODS

Cladistic approaches

Parsimony and compatibility. The related methods of parsimony and compatibility derive from the phylogenetic systematics school of Hennig (1966). Both aim to determine monophyletic groups (clades) through the identification of "synapomorphies", characters shared by members of a clade that arose for the first time in their ancestor. Once such shared derived characteristics are identified, the species ideally fall into a series of nested groups (ever larger clades) defined by ever more distant synapomorphies. This approach is satisfactory so long as there are no convergences to the same state in disparate lineages and no reversions (Felsenstein, 1982). If these conditions do not hold, the result will be incongruence, i.e. different sites offering support for different topologies. To resolve incongruence, cladists invoke Hennig's "auxiliary principle" that homology is to be assumed in the absence of grounds to the contrary. This leads to the two slightly different cladistic techniques of compatibility and parsimony. In maximal clique compatibility (LeQuesne, 1969), the favoured topology is that implied by the greatest number of sites. Sites incompatible with this topology are simply discarded. This method is used by the program DNACOMP. Parsimony supposes homology wherever possible but (in its commonest "Wagner" form) allows for characters to undergo reversions and for the same state to arise independently in different lineages ("homoplasy"). The parsimony criterion (Fitch, 1971 and 1977; Farris in Platnick and Funk, 1983) is then that one favours the phylogeny that requires the fewest total number of inferred changes of state or that "invokes the minimum net amount of evolution" (Edwards and Cavalli-Sforza, 1963, p. 105). For nucleotide parsimony (e.g. DNAPARS, PAUP), each substitution is considered one step: in transversion parsimony (Fitch, 1977) the greater frequency of transitions over transversions in vivo (Brown et al., 1982) is recognised by scoring only transversions as changes. For protein parsimony (PROTPARS, PAUP), substitutions between amino acids can all be scored as a single step (Eck and Dayhoff, 1966) or as 1-3 steps depending on the implied number of nucleotide changes (Fitch, 1971; PROTPARS). It should be apparent that the problem of scoring steps for amino acid substitutions is close to that of defining a similarity matrix for alignment. The parsimony method requires working back through the tree to propose ancestral sequences (Fitch and Margoliash, 1967), so the output from parsimony programs may include the favoured topology (cladogram) and the estimated nodal states. Branch lengths are then scaled in the number of inferred substitutions. Note that for up to four sequences, compatibility and parsimony will give identical results. For trees containing more than four taxa the two methods may not agree, parsimony having traditionally been the preferred method.

A recognised problem with the cladistic methods is that of "long-edge attraction" (Cavender, 1978; Felsenstein, 1978b; Hendy and Penny, 1989). This is the tendency for branches in which there are many substitutions to group artefactually, best illustrated with reference to a four-taxon topology (Fig. 4). Suppose the correct unrooted topology (Fig. 4a) groups sequence A with D and B with C, and that A and C show a much greater substitution rate than the other two.
Under these conditions, random convergence between the sequences A and C will tend to favour the topology of Fig. 4(b) in which the long

[Fig. 4: four-taxon trees, panels (a) and (b), with tips A, B, C and D.]

Fig. 4. Trees illustrating the problem of long edge attraction. Figure 4(a) is an unrooted topology in which A and C are "fast-clock" sequences. Figure 4(b) shows the tree parsimony or compatibility will infer under such conditions, in which the "fast-clock" taxa are erroneously grouped, and the central branch length over-estimated. For details see text.
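The four-taxon case can be explored with a short Python sketch. For four sequences, a site at which two taxa share one state and the other two share another supports exactly one of the three unrooted topologies, and parsimony favours the topology with the most such sites. The code counts this support, then enumerates a limiting case of long edge attraction. The limiting case rests on assumptions made here for illustration: four equally frequent bases, A and C fully randomised by their long branches, and B and D identical at every site. Function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

SPLITS = ("AB|CD", "AC|BD", "AD|BC")


def supported_split(a, b, c, d):
    """Return the topology supported by one site, or None if the site
    does not discriminate between the three four-taxon trees."""
    if a == b and c == d and a != c:
        return "AB|CD"
    if a == c and b == d and a != b:
        return "AC|BD"
    if a == d and b == c and a != b:
        return "AD|BC"
    return None


def count_support(seq_a, seq_b, seq_c, seq_d):
    """Count sites supporting each topology in four aligned sequences."""
    counts = dict.fromkeys(SPLITS, 0)
    for site in zip(seq_a, seq_b, seq_c, seq_d):
        split = supported_split(*site)
        if split is not None:
            counts[split] += 1
    return counts


def limiting_case(bases="ACGT"):
    """Expected fraction of sites supporting each topology when A and C
    are random and B and D are identical (assumptions, see lead-in)."""
    counts = dict.fromkeys(SPLITS, 0)
    total = 0
    for a, c, shared in product(bases, repeat=3):
        total += 1
        split = supported_split(a, shared, c, shared)
        if split is not None:
            counts[split] += 1
    return {split: Fraction(n, total) for split, n in counts.items()}
```

Under these assumptions `limiting_case()` gives 3/16 for AC|BD and zero for the other two topologies, in line with the figure of three-sixteenths quoted in the text: every discriminating site groups the two long-branched taxa, whatever the true tree.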

649

The inference of evolutionary trees from molecular data edges are erroneously linked and homoplasies between A and C misinterpreted as synapomorphies in the central branch (this is longer in Fig. 4b than in Fig. 4a; branches to A and C are shorter). Random convergence here is the direct result of the amount of change in the "fast-clock" lineages. A pattern in which two taxa share one nucleotide and the other two another is an example of an "informative" site, a concept unique to cladistic methodologies. In the limiting case of infinitely long branches to A and C, then three-sixteenths of all sites and all of the supposedly "informative" ones will favour the incorrect topology (Felsenstein, 1988). In such a case, cladistic methods will favour the incorrect topology with increasing certainty as more data are sampled. Long edge problems are more than theoretical, there being evidence that the supposed monophyly of the "Archaebacteria" (halophiles, methanogens and eocytes--Olsen and Woese, 1989) and the polyphyly of the Metazoa (Field et al., 1988) based on small subunit r R N A data may both be artefactual (Lake, 1988, 1990). A solution to the long edge problem is, however, to hand, if certain conditions hold. This is the method of "evolutionary" parsimony, included here under the cladistic heading because this is the school from which it was developed. Evolutionary parsimony. This curiously named method comprising operator invariants (Lake, 1987a) and operator metrics (Lake, 1987b) was devised to resolve topologies which would otherwise be misled by long edge effects. Operator invariants (DNAINVAR) is a means of selecting which of three four-taxon trees is favoured (and can be generalised for more taxa with some compromise: Lake, 1988 and 1990) and metrics a means of putting branch lengths on the favoured tree. 
Invariants can be seen as a modification of transversion parsimony, its essence being that the transversions supporting each of the three topologies are evaluated (the number of informative sites as for normal transversion parsimony) and the estimated degree of convergence in each tree due to long edge effects then subtracted. The method is workable because, given certain conditions, every apparent synapomorphy shared by two sequences that is really the convergent effect of rate will, on average, be accompanied by a "background" term that one can identify. Operator invariants is not sensitive to differences in rates between disparate lineages, or site-to-site variation in rates within sequences. For further details of this technique the reader is referred to Lake (1987a). Some caution is, however, needed in using evolutionary parsimony (EP). The method is applicable only to nucleotide data containing no gaps and was designed for ancient divergences, e.g. amongst bacteria-archaebacteria-eukaryotes or the Metazoa, requiring sequences that contain many transversions. Resolving between the three topologies (for which a χ² test has been used) for more recent radiations requires longer datasets than are generally necessary for other methods of inference, because EP uses many fewer of the sites. For instance, Holmquist et al. (1988) were unable to resolve the chimp-man-gorilla divergence (time scale ~10 million years) from a


Fig. 5. Idealised additive phenogram. The tree is additive when the sum of the branch lengths (horizontal components only) between two taxa is equal to the observed distance (D) between them. Thus D(A,B) is the sum of branches labelled "1" and "2", D(A,C) is the sum of "1", "3" and "4" etc. (The numbers on the branches are labels only and are not proportional to length.) The tree is shown unrooted, equivalent to saying that only the sum of the lengths of branches labelled "5" and "6" can be determined, not the individual values of each.

trichotomy despite having over 10 kb of nuclear DNA sequence.
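The additivity property stated in the caption of Fig. 5 can be checked numerically. The following is a minimal sketch, not a reconstruction of any program discussed in this review: the tree, its internal node names (`x`, `y`, `z`) and its branch lengths are hypothetical and are not those of Fig. 5. Distances are obtained by summing branch lengths along the path between two taxa. The test applied is the four-point condition, a standard result that is not taken from the text above: for every four taxa, the two largest of the three pairwise sums must be equal if the distances fit some tree exactly.

```python
from itertools import combinations


def path_distances(edges, leaves):
    """Sum branch lengths along the unique path between every pair of leaves.

    `edges` is a list of (node, node, length) tuples describing an unrooted tree.
    """
    adjacency = {}
    for u, v, length in edges:
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))

    dist = {}
    for src in leaves:
        # Depth-first walk from src, accumulating path length to each node.
        reached = {src: 0.0}
        stack = [src]
        while stack:
            node = stack.pop()
            for neighbour, length in adjacency[node]:
                if neighbour not in reached:
                    reached[neighbour] = reached[node] + length
                    stack.append(neighbour)
        for dst in leaves:
            dist[(src, dst)] = reached[dst]
    return dist


def is_additive(dist, leaves, tol=1e-9):
    """Four-point condition: the two largest pairwise sums agree for every quartet."""
    for i, j, k, l in combinations(leaves, 4):
        sums = sorted([
            dist[(i, j)] + dist[(k, l)],
            dist[(i, k)] + dist[(j, l)],
            dist[(i, l)] + dist[(j, k)],
        ])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Hypothetical five-taxon unrooted tree; lengths are illustrative only.
LEAVES = ["A", "B", "C", "D", "E"]
EDGES = [
    ("A", "x", 2.0), ("B", "x", 3.0), ("x", "y", 1.0), ("C", "y", 4.0),
    ("y", "z", 2.0), ("D", "z", 1.0), ("E", "z", 5.0),
]

if __name__ == "__main__":
    d = path_distances(EDGES, LEAVES)
    print(d[("A", "B")], d[("A", "C")], is_additive(d, LEAVES))
```

Distances read off a tree in this way are additive by construction. Observed distances generally are not, and altering a single entry of the matrix (for example, raising the A-B distance) is enough for `is_additive` to return `False`.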

Distance methods

Methods based on a matrix of evolutionary distances all work on the same principle, whether the distances are derived from immunochemical cross-reactivities (Maxson and Maxson, 1986), nucleotide hybridisation (Sibley and Ahlquist, 1987), oligonucleotide catalogues (Fox et al., 1977), or primary sequence identities. The principle is that the taxa are related by an additive tree (Fig. 5) in which the sum of the branch lengths connecting any two taxa is the expected distance (D) between them. Because this is a phenetic definition of relatedness (i.e. one based on aggregate similarity), such a tree is known as a phenogram. Additivity requires that D satisfies three criteria: (i) D(i,i) = 0, (ii) D(i,j) = D(j,i) and (iii) D(i,j) ≤
