Unified approach to alignment and phylogenies.

626

PHYLOGENETIC

1391

TREES

Unix compiler. They can no doubt be made to run on other systems, but the commands for bit extraction will probably need to be changed (we pack 8 nucleotides per 32-bit word). These programs will be provided in their FORTRAN form and, if desired, in compiled form along with sample data and output. One l.ZMbyte floppy disk should be sent to the author (W. F.) for each of the three programs wanted.

[391 Unified

Approach By

to Alignment JOTUN

and Phylogenies

HEIN

Introduction Two important considerations in the analysis of molecular sequences are alignment and phylogeny reconstruction. In both cases the history of the molecules must be reconstructed. The principle of parsimony states that the solution which minimizes the amount of evolution used to explain the extant sequences (sequences from the past cannot be obtained) approximates the real history well. This principle has dominated previous methods used to solve these problems. An alignment of two short DNA sequences is shown below: TTAGTCC - AT T - - GTGCGAT Each sequence is padded with gap signs so similar regions are matched. To define the most parsimonious alignment, the weight assigned to basic evolutionary events, substitutions, and insertion/deletions, must be specified. By choosing these weights to reflect actual occurrences, the usefulness and reliability of the results can be greatly enhanced. The weight, gk, for an insertion/deletion of length k frequently takes the form a + bk (a and b are nonnegative parameters), since this allows the use of a faster algorithm. When a is strictly positive, then runs of gap signs are interpreted as one insertion/deletion; if a is zero, then a stretch of gap signs can be interpreted as composed of several insertions/deletions. Since the parameter a is usually positive, this alignment would postulate an insertion/deletion of length 2 and another of length 1. Only one substitution is postulated, a C for a G or vice versa. Weights of substitutions have been developed from empirical METHODS

IN ENZYMOLOGY,

VOL.

183

Copyright 0 1990 by Academic Pres. Inc. All rights of reproduction in any form reserved.

WI

SIMULTANEOUS

ALIGNMENT

AND

PHYLOGENY

627

tables of mutation frequencies. ‘9’ This is done efficiently by various dynamic programming algorithms developed since the late 1960s. The typical running time for these algorithms are 0(12) or O(ld), where 1is the length of the sequences and d the distance between the two sequences. Five sequences related by an unrooted tree are shown in Fig. 1. Typically, the procedure to find the phylogeny would presume an alignment of all sequences obtained manually. Each column would be treated independently and then be explained by a series of substitutions. Different branching configurations or tree topologies would be checked, and the one allowing for shortest total branch length would be chosen. Conventionally, the alignment problem involves two sequences and must consider both substitutions and insertions/deletions. the phylogeny problem involves more sequences but usually requires that the insertions/ deletions be taken care of beforehand. The accomplishment of the method presented here is to solve both problems simultaneously. The phylogenetic alignment problem is computationally very difficult. It is harder than two special cases of it: First, the traditional parsimony phylogeny problem is known to be NP-complete,’ i.e., no algorithm is likely to exist that can guarantee to find the most parsimonious solution in polynomial time. Second, it is harder than the phylogenetic alignment problem, when the phylogeny has been given and only insertions/deletions of length 1 occur, where the method to solve it needs computing time proportional to 2Vn.4 If longer gaps are allowed, the situation becomes even w~rse.~ The method presented here is therefore a fast approximation algorithm. Method The method has two aspects: finding the phylogeny and aligning the sequences according to the phylogeny. The central approximation and novelty consists of decomposing the many-sequence alignment problem to a series of pairwise problems of aligning sequences, reconstructing ancestral sequences, and then aligning the ancestral sequences. At each such cycle, the set of most parsimonious ancestral sequences is investigated.

1 M. 0. Dayhoff, “Atlas of Protein Sequences and Structure,” Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, D.C., 1978. * R. F. Doolittle, “Of ORFs and URFs.” University Science Books, Mill Valley, California, 1986. 3 L. R. Foulds and R. L. Graham, Adv. Appl. Math. 3,43 (1982). 4 D. Sankoff, SIAM J. Appl. Math. 78,35 (1975). 5 J. J. Hein, J. Theor. Biol. (submitted for publication).

628

PHYLOGENETIC

sl

s2 s3 s4 s5

TREES

1391

TGTGGTA TGAGCAA TCTGAAG CCGTAAG CCCTTTG

s2 s5 1. Five short sequences and one possible relationship (of 15 possibilities) that happens to allow the most parsimonious history of the sequences. Columns one and two indicate that the tree topology must have s4 and s5 as brothers, column four indicates that sl and s2 are brothers, and then the tree is fully determined. Column six would fit better on a tree topology where sl and s5 are brothers. This, however, cannot outweigh the first three mentioned sites. There exist algorithms that will evaluate the weight for a column on a tree topology. Using such algorithms, different tree topologies can be tested and the best topology retained. FIG.

This is accomplished by a generalization of the traditional sequence comparison algorithm that allows comparison of graphs with edges labeled with sets of nucleotides/amino acids. In the present context the new algorithm is used to align alignments or even groups of alignments and thus braid alignments of a few sequences into alignments of more sequences. This method allows for a more thorough search for likely ancestral sequences by not having to choose between equally good alignments arbitrarily. The algorithm that achieves this can best be described as a generalization of a combination of a string comparison algorithm of Gotoh with a graph comparison algorithm of Sankoff.’ The method is described in a simple situation, where only insertions/ deletions of length 1 occur and all events are weighted 1. 6 0. Gotoh, J. Mol. Biol. 162,705 (198 1). 7 D. Sankoff, in “Time Warps, String Edits and Macromolecules” Kruskal, eds.). Addison-Wesley, Reading, Massachusetts, 1983.

(D. Sankoff and J.

WI

SIMULTANEOUS

ALIGNMENT

AND

629

PHYLOGENY

New Algorithm Let sl and s2 be two sequences of length 11 and 12, respectively. The substring consisting of the first i elements of sk (k = 1, 2) is denoted ski, and the ith element in sk[i] refers to the ith element of sk. Dij refers to the distance between Sli and S2j. . Sellers8 and SankoF gave an algorithm that allowed calculation of the distance between sl and s2 and the corresponding alignment. From Eq. ( 1) the distance is determined as Di,j=min(Di-,j-,

+ d(sl[i],

~2[j]),

Di,j+ 1 + 1, Di-lj

+ l}

(1)

subject to the initial condition DO,O= 0, and all Dij with negative arguments are defined to be infinity. Aligning two sequences corresponds to finding the shortest path in a graph with nodes indicated by the grid (0, . . . ,rl) X (0, . . . 12). Each node (i,j) had three predecessors (i - 1,j - I), (i,j - l), and (i,j - 1). The two last edges are weighted 1, since they correspond to insertions/deletions. The first edge is weighted as d(s1 [i],s2[j]. The use of Eq. ( 1) is illustrated for two short nucleotide sequences (Fig. 2), in which Dij values are calculated. Backtracking from D,1,12 and connecting entries on the path will define a backtracking graph. By choosing a path from (11,12) to (0,O) a minimal alignment is chosen. Then, by choosing at each position in the alignment (or path) where the two sequences have different signs, a set of potential ancestral sequences is generated according to the alignment. The set of potential ancestral sequences can be defined by a slight modification of the backtracking graph. This graph, G, will be directed, noncyclic, and connected. The nodes of G will be the nodes (i,j) that were visited in the backtracking. The edges will correspond to one position in an alignment; (i,j) - (i - 1,j - 1) will correspond to a match and will be associated with the nucleotide set that is the union of the two nucleotides that are being matched. Positions corresponding to insertions/deletions (i,j) - (i,j - 1) [or (i - l,j)] will be associated with the nucleotide of the nondeleted sequence and a gap sign. By traversing G from (11,12) to (0,O) and choosing an element at each edge (and jumping over gap signs), a set of sequences, S(G), is generated. S(G) happens to be identical to the following set of sequences: S = (s: D(sl,s) + D(s,s2) = D(sl,s2))

(2)

In other words, these are the sequences that can be postulated as ancestral to sl and s2 without necessitating any extra evolution. Sequences in S are * P. Sellers, J. Comb. Thor. 16,253 9 D. Sankoff, Proc. Natl. Acad. Sci.

(1974). U.S.A. 69,4 (1972).

630

PHYLOGENETIC Sl = TCA

Minimal

i)

i)

92 = TG

alignments: TCA TG-

Ancestral TC,

Sequence

f391

TREES

ii)

TCA T-G

Sequences: TGA

Graph

ii)

Generating

Ancestral

TA.

TCG

Sequences:

FIG. 2. Illustration of how the simplest two-sequence alignment problem is solved by dynamic programming in a graph. Two short strings are given, and the distance between them is calculated by applying Eq. (1) and registering all distances between subsequences in an integer matrix of dimensions (0, . . . ,/l) X (0, . . . , L?). The alignments corresponding to the minimal distance can be obtained by backtracking in this matrix from (Il,f2) to (0,O). The two obtained alignments are shown. Given an alignment, a set of potential ancestral sequences is obtained by choosing between two states at each position in the alignment where the two sequences differ. These ancestral sequences, excluding the original sequences themselves, are shown for the two alignments. The sequence graph corresponding to the backtracking is shown at the bottom of the figure.

said to be between sl and ~2,‘~ in analogy with points on an interval which obeys the same equation relative to the interval ends. The traditional dynamic programming algorithm can be generalized to compare sequence graphs and to find sequence subgraphs that are close to each other according to some sequence metric. Simultaneously, the subgraphs are aligned. This permits a rational representation of potential ancestral sequences and selection among these sets of sequences, because lo W. H. E. Day, Bull. Math. Biol. 46, 321(1984).

WI

SIMULTANEOUS

ALIGNMENT

AND

PHYLOGENY

631

more information is available in the form of closely related sequences. These graphs are used to represent potential ancestor sequences at each node in the phylogeny. The following generalization of an algorithm of Sankoff allows this. Let Gl and G2 be two sequence graphs. The distance between two such graphs is defined as the distance between the two closest nucleotide sequences in S(G1) and S(G2), i.e., d(Gl,G2)

= min([D(sl,s2)]:

sl E S(Gl), s2 ES(G~))

(3)

Let Gli and G2j be the subgraphs consisting only of nodes that can be reached from node i and node j, respectively (including these nodes), and only edges involving these nodes. Let D(GliyG2j) be the distance between these two subgraphs. The presence of gap signs leads to a nonstandard definition of a predecessor: Node i’ is a predecessor to node i if there is an edge pointing from i and i’ or if it is a predecessor to a node that has an edge with a gap sign pointing to it from node i. Between each node, i, and a predecessor, i’, to it there will be exactly one edge with a set of nucleotides associated, and this is denoted n(i, j). Let P(i, 1) and P(j,2) be the predecessors to i andj in Gl and G2, respectively. The distances between subgraphs will now obey an equation similar to Eq. ( 1): D(Gli,GZj)

= min(D(G1ie,G2j,) + dist[n(i,i’),n(jJ)], D(Gli,G2j*) + 1, D(Gl,,G2j) + 1)

(4)

The minimum is taken over i’ in P( 1,i) and j’ in P(2,j). The function dist(x,y) will now be the distance function on sets of nucleotides defined by the two closest members in the two sets according to d(x,~~). The initial condition is d(Gl,,,G2,-,) = 0. Again, successive applications of Eq. (4) allow calculation of d(Gl,G2). Backtracking picks out the pairs of sequences in S(G1) and S(G2) closest to each other and simultaneously aligns them. If Gl and G2 represent ordinary sequences, then Eq. (4) reduces to Eq. (1). The method is illustrated in Fig. 3 for the sequence graph from Fig. 2 and a new sequence. To see the reasoning behind Eq. (4), focus on how D4,3 is calculated, when all other entries in the matrix are known. The last position in the alignment of the closest sequence in S(G2), say, s’, and s3 [S(Gl)] must either be a match between s’ and ~3, a deletion in ~3, or a deletion in s’. If it is a match the nucleotide in s3 must be G, since s3 is a traditional sequence. The last nucleotide in s’ could be an A from edge (2,4) or an A or G from (3,4). It could also be a G or C from edge (1,2) if the gap sign at edge (2,4) had been chosen. In total, there are five ways to choose the last nucleotide in s’. By going through these possibilities, it is

632

PHYLOGENETIC

TREES

1391

easily verified that the most parsimonious solution for making a match is to choose the G from edge (3,4), which gives D.,3 = D3,* + d(G,G) = 1 + 0. Completely analogous reasoning applies to the cases of deletions in s3 or s’. The new sequence, ~3, in effect selects in the set of equally good ahgnments of sl and ~2 and also which ancestor sequences would demand the least extra evolution. This generalization of the string comparison algorithm becomes extremely useful because it allows the integration of alignments of different sequences. The method actually used has some facets not described above. First, it allows for longer insertions/deletions that are weighted by g, = a + bk. This is done by applying the reasoning of Gotoh to the above graph comparison algorithm. One consequence is that a special type of arrow must be introduced to take care of insertions/deletions, since adding a minus sign to an edge is no longer sufficient. The reason is that adding a series of minus signs to consecutive edges would allow them to be chosen independently, while the insertion/deletion must either be chosen or not in its full length. Second, the method of finding sequence graphs from backtracking is generalized to graph comparison and not just string comparison as described above. Details of this method are reported in Hein.” Most implementations of sequence comparison algorithms use an array of dimensions (0, . . . ,11} X (0, . . . ,Z2) to store the Di,j values. As noted by Ukkonen’* calculations can be restricted to a band around the diagonal i = j (Fig. 4a), which reduces the memory requirements correspondingly. If only the distance between two sequences is needed, only two horizontal rows of the band are needed, which allows the calculation of the distance between two very long sequences, with only little memory. However, if the alignment of the sequences is needed, the introduction of this trick will transform the O(Z2) into an O(13) alignment. When the penultimate position in the alignment is to be calculated the rows corresponding to j = 12 - 1 and j = 12 - 2 are needed, but the values in the last row have been “forgotten” and must be recalculated starting from row j = 0 and up to rowj = 12 - 2. These repeated recalculations are very time consuming. It is possible to modify the memory storage and still have an O(Z*) algorithm for much longer sequences. The point is to avoid recalculating all the rows. This can be done by storing the values of evenly spaced rows up through the band (Fig. 4b). Assume the rows are 100 bp long and the computer allows storage for 200 of them. Using 100 rows and spacing them 100 apart and then using the remaining 100 rows to recalculate rows between the saved rows would allow the comparison of two sequences ‘I J. J. Hein, I2 E. Ukkonen,

Mol. &of. Evol. In Press . Inj Control 64, 100 (1985).

1391

SIMULTANEOUS

ALIGNMENT

AND

633

PHYLOGENY

b Dymmic prcgr~g:

d

c Minim11 alignment

of weight

1:

TCG ACG

S(Q) /

Freeing phylogenies from artifacts of alignment.

Unified Alignment of Protein-Protein Interaction Networks.

Inferring phylogenies of evolving sequences without multiple sequence alignment.

A unified approach to computational drug discovery.

Unified approach to carcinoma of the esophagus.

Unified approach to Čerenkov second harmonic generation.

Studies on structure-based sequence alignment and phylogenies of beta-lactamases.

A Simple, Unified Approach to Bayesian Risk Calculations.

A unified radiological approach to the detection of skeletal metastases.

Unified selective sorting approach to analyse multi-electrode extracellular data.

Unified approach to partition functions of RNA secondary structures.

A unified approach to risk assessment for fetal aneuploidies.

Evolution of genes neighborhood within reconciled phylogenies: an ensemble approach.

ECG data compression techniques--a unified approach.

Efficient Approach to Correct Read Alignment for Pseudogene Abundance Estimates.

Sequence alignment approach to pick up conformationally similar protein fragments.

When and How-Long: A Unified Approach for Time Perception.

A Unified Approach to Functional Principal Component Analysis and Functional Multiple-Set Canonical Correlation.

A unified approach to the important protein kinase inhibitor balanol and a proposed analogue.

A unified approach to linking experimental, statistical and computational analysis of spike train data.

A unified approach to the optimization of brachytherapy and external beam dosimetry.

A unified statistical approach to non-negative matrix factorization and probabilistic latent semantic indexing.

A unified approach to universal prediction: generalized upper and lower bounds.

A new approach to homeostatic regulation: towards a unified view of physiological and ecological concepts.