FASTSEQUENCEALIGNMENTS

[311

487

This analysis was taken further by Waterman et al.,‘* who showed that there are regions in the plane defined by the ratios of the mismatch-tomatch and gap-to-match scores in which the maximum expected score grows as n? rather than as log n for the comparison of sequences of length n. 12M. S. Waterman, L. Gordon, and R. Arratia, Pm. Nat/ Acad. Sci. U.S.A. 84, 1239 (1987).

131I Fast Alignment of DNA and Protein Sequences By GAD M. LANDAU, UZI VISHKIN, and RUTHNUSSINOV Introduction The library of sequenced DNA already totals about 20 million nucleotides and is likely to increase dramatically over the next decade. Searching for some “key subsequences” with presumably special coding or regulatory function is a basic problem often arising in the analysis of the biological significance of these data. Similar problems arise in the study of evolution, where we need to assessthe degree of similarity between sections of DNA often belonging to different species or genes. The numerous possible matchings of long sequences generate, for these data, a problem of computational difficulty. In particular, a very large number of elementary operations, namely, checking if two bases are identical, counting mismatches, and comparing numbers of mismatches, is required. New efficient, packaged strategies that employ new algorithms are needed in order to achieve the desired goals. Our algorithms suggest a more efficient organization of the matching task and the involved bookkeeping of the quality of partial matchings. In this way they minimize the total number of elementary operations required. Compared with a naive, simplistic approach, factors of hundreds in efficiency can in principle be gained. We describe below, in general outline and omitting detailed proofs, two such algorithms. Some extra details, proofs, and further references to the extensive literature can be found in our original paper on this subject’ on which the present work is based. We describe the algorithms in DNA context, but these algorithms are obviously applicable to proteins as well. Both algorithms can be viewed as text analysis algorithms, with the text being the complete nucleotide library, a fraction thereof, or some specific virus or gene. I G. M. Landau, U. Vishkin, and R. Nussinov, CABIOS 4, 19 (I 988).

METHODS

IN ENZYMOLOGY,

VOL.

183

Copyright 0 1990 by Academic Press. Inc. All rights of reproduction in any form reserved.

488

ALIGNING

PROTEIN

AND

NUCLEIC

ACID

SEQUENCES

[311

In order to account for mutations, the first algorithm (A) allows imperfect matchings of the standard subsequence (pattern), which is m nucleotides long, with the text, of total length n. Up to k differences are allowed. In the second algorithm (B) we compare two texts of length n, and n, and search for matchings of any subsequence in one text with the other. Differences between the pattern and the text can be due to (1) substitution, (2) deletion, and (3) insertion mutations. Thus, we indicate an imperfect match between the pattern TGATACT (m = 7) and a subsection of the text GTCAAGCTC (n = 9) that starts at the second location and end at the 8th location of the text. For example, in the alignment TgfiTi GG T, there are k = 3 differences arising from a substitution of a C for a G, deletion of T, and insertion of a G. The number of basic operations, and correspondingly the total computation time T, increases with the size of the parameters specifying the problem. For Algorithm A these are n, the total size of DNA searched; m, describing the size of the patterns investigated; and k, indicating roughly the allowed evolutionary distance. For very large values of n,m, and k, only those algorithms for which the computation time T increases most slowly could be feasible. For Algorithm A, our analysis proved that T is proportional to nk, namely, T I constant X nk, the constant depending on the computer being used. We denote this by T = O(nk).The important point is that, unlike previous algorithms, the time bound T is independent of m, the size of the pattern matched. For k < m, this constitutes a major improvement. The linear increase of T with the text(s) length(s) is clearly unavoidable. Finding a weaker than linear dependence on the total number of allowed mutations k seems to be difficult (or perhaps even impossible). Thus, Algorithm A, presented first by Landau and Vishkin2” (building on, and improving resultsr-* may, in fact, be the best obtainable. For Algorithm B the computation time is T = O(n,n2k).Algorithm B is particularly simple and easy to program. The problem solved by Algorithm B lies at the heart of numerous homology searches, since its solution can be used as a database for retrieving many interesting properties regarding relations between the pattern and the text. In particular, the accumulated information allows (1) recombining nonconsecutive optimally aligned subsequences and (2) extracting most frequently repeating sequence 2 G. 3 G. 4 G. s G. 6 G. ’ G. 8 G.

M. M. M. M. M. M. M.

Landau Landau Landau Landau Landau Landau, Landau,

and U. Vishkin, Proc.ACM Symp. Theory Comput., 18th 220 (1986). and U. Vishkin, J. Algorithms 10, 157 (1989). and U. Vishkin, Proc.IEEE Symp.Found. Comput. Sci.,26th 126 (1985). and U. Vishkin, Theor. Comput. Sci.43,239 (1986). and U. Vishkin, J. of Comput. Syst.Sci. 37,63 (1988). U. Vishkin, and R. Nussinov, NucleicAcidsRex 14,3 1 (1986). U. Vishkin, and R. Nussinov, J. Theor. Biol. 126,483 (1987).

1311

FAST

SEQUENCE

ALIGNMENTS

489

motifs. This can be done for those motifs (from the pattern) which repeat (in the text) with 1 differences for any I5 k. From the standpoint of molecular biology, the advantages of such an algorithm are clear. Suppose a newly sequenced gene is compared with a presumably functionally or evolutionarily related one. If only searches for overall optimal alignments are instituted, the result can be disappointing since some badly matched sections imply an overall low homology score. Our algorithm, on the other hand, has the property that badly matched segments will be overlooked and only the functionally better ones noted (point 1 above). Of special interest is point 2, that is, the information which is computed by our algorithm can be used to locate repeating sequence motifs. Thus, regulatory repeats in upstream regions, such as in the yeast amino acid biosynthetic pathway,9 will be noted. Algorithm B is reminiscent of the algorithm of Sankoff,iO where a related problem was considered. The algorithm developed by Goad and Kanehisa” also addresses the same issues as our Algorithm B. However, the mathematical problem solved by them is not identical to the one presented here. Whereas their procedure looks for acceptable “average” homology score between two aligned subsequences, we give the alignments between any two subsequences with 0, 1, 2, . . . , k differences. Thus, their algorithm is not fully comparable to ours. In this chapter, we deal only with the problem of matching biological sequences for regions where the number of differences is relatively small. In practice, it is also often required to consider possibilities where sizable gaps between matching regions occur. Formally, costs are assigned to various gaps among matched subsequences, and, unlike our problems, these costs grow considerably slower than the length of a gap. Such extended problems are discussed in many papers. We refer the reader to the survey article by Waterman’* that gives an excellent review of the literature until 1984. More recent work on this topic includes Lipman and Pearsoni and Altschul and Erickson.r4-I6 Algorithm A is described in the next section. Algorithm B is given in the third section. Considerations regarding retrieval of further information from the database provided by Algorithm B are also given. 9 G. Lucchini, A. G. Hinnebusch, C. Chen, and G. R. Fink, J. Mol. Cell Biol. 4, 1326 (1984). lo D. Sankoff, Proc. Natl. Acad. Sci. U.S.A. 69,4 (1972). I’ W. B. Goad and M. I. Kanehisa, Nucleic Acids Rex 10,247 (1982). I2 M. S. Waterman, Bull. Math. Biol. 46,473 (1984). I3 D. J. Lipman and W. R. Pearson, Science 227, 1435 (1985). I4 S. F. Altschul and B. W. Erickson, Bull. Math. Biol. 48,603 (1986). Is S. F. Altschul and B. W. Erickson, Bull. Math. Biol. 48,6 17 (1986). I6 S. F. Altschul and B. W. Erickson, Bull Math. Biol. 48,633 (1986).

490

ALIGNING

Algorithm

A

PROTEIN

AND NUCLEIC

ACID SEQUENCES

1311

Let b,b2 . . . b, be the base sequence of the text and r,r2 . . . r, that of the pattern. The task addressed by the algorithm is to find all matches of the pattern in the text with at most k differences. Any such match will be viewed as a successful match (within the prescribed standards). To set the stage for presenting Algorithm A, we describe first a simple dynamic programming algorithm which achieves this goal in O(mn) steps. (It was given independently in nine different papers; a list of these papers can .be found in Ref. 17.) This is next improved by a method based on that of Ukkonen,i8 and finally Algorithm A is described. O(mn) Dynamic Programming

Algorithm

We can consider instead of the complete task of matching the entire pattern, r,r, . . . r,, with the text, b,b2 . . . b,,, a series of subtasks. The latter are matchings of increasing (in length) contiguous substrings of the pattern with increasing contiguous substrings of the text. Let us construct an (m + 1) X (n + 1) matrix D of “distances.” Its elements Di,l are entered at the intersection of the ith row, labeled by ri, the ith base in the pattern, and the Ith column, labeled by b, , the Zth base in the text. For convenience, add a zero row and a zero column, corresponding to an empty text or pattern. The Di,l values are the minimal number of differences in matchings of the subsection r,r2 . . . ri and any contiguous substring of the text ending at b,. Evidently, any element in the last row satisfying D,,, 5 k indicates that a “successful” match of the complete pattern and a subsection of the text ending at b, has been obtained. In Fig. 1 we illustrate the D matrix for a concrete example of an n = 7 text, GGGTCTA, and m = 4 pattern, GTTC, while the number of allowed differences is k = 2. We repeatedly refer to this example as we introduce more sophisticated algorithms. From the defintion of D,, it is evident that all Do,, = 0, since we can always match an empty pattern by moving it all the way to the left in front of the text. Also, Die = i since i deletions are required to obtain a match. Given this initialization of the 0th row and 0th column, we proceed to fill up the rest of the matrix by computing Dil from Di-l,,, Di,l--l, and Di-I,/-1, the elements above, to the left of, and diagonally preceding Di,. Given the “distances,” i.e., number of differences in the best matches of I7 D. Sankoff and J. B. Kruskal, eds., “Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison.” Addison-Wesley, Reading, Massachusetts, 1983. ‘8 E. Ukkonen, Proc.Int. Conf Found. Comput. Theor.,Lect. Notes Comput. Sci. 158, 481 ( 1983).

I311

FAST

C

4

3

SEQUENCE

3

491

ALIGNMENTS

3

2

I

2

2

FIG. 1. In our example the text is GGGTCTA (length n = 7) and the pattern is GTTC (length M = 4). We add an auxiliary row and an auxiliary column to our matrix to get an (n + 1) X (m + 1) matrix with entries D,,. These entries give the minimal number of differences between the pattern portion, r,r2 . . . r, and any contiguous substring of the text ending at b,. Suppose we allow at most k = 2 differences. There are occurrences of the pattern in the text with I k differences ending at locations b4, b, , b, , and b, .

r,r2 . . . r+, and a subsection of the text ending at b,, r,r, . . . ri and a subsection of the text ending at b,-,, and r,r2 . . . ‘i-1 and a subsection of the text ending at bl-, , respectively, we need to assessthe minimal number of differences in the best match of r,r, . . . ri and a subsection of the text ending at b/. We do this as follows: if ri = b,, we can match ri with bl and adjoin it to the preexisting match r,r2 . . . ‘i-1 and a subsection of the text ending at b,-, (SO that D,, = Di-,,,-, ). If ri f b,, we can choose the best partial match of r,r2 . . . ri-i and a subsection of the text ending at b,-, and settle for a substitution of ri and b, (SO that Di,, = Di-,,,-l + 1). We could, however, refrain from matching ri and b, and use the best available match of r,r2 . . . ri and a subsection of the text ending at b,-, or of r,r, . . . r+i and a subsection of the text ending at b,, in which case we will have an extra insertion or deletion and D,, = Di,,-, + 1 or Di-1.1 + 1. We choose among these alternatives the one yielding the minimal Di,,. This is summarized in the following program:

Initialization

for all 1, 0 5 1~ n, Do,, : = 0 foralli,l lism,D,,:=i for i : = 1 to m do for 1: = 1 to n do D,, : = min(D+,,l+ 1, D,,_, + 1, Di-I,,-, otherwise)

if ri = b, or Di-I,,-,

+ 1

where Di,, is the minimum of three numbers. These three numbers are obtained from the predecessors of D,, on its column, row, and diagonal, respectively. Evidently it requires O(mn) basic steps to fill in the (m + 1) X (n + 1) elements Di.1.

492

ALIGNING

PROTEIN

AND NUCLEIC

ACID SEQUENCES

1311

Improved AlgorithmI Given a partial matchofr,r, . . . ri and a subsection of the text ending at b,, it is advantageous to try matching additional equal length stretches of pattern and text. This corresponds to proceeding in the D, matrix along diagonal d, defined by 1 - i = d. This observation motivates the improved algorithm, in which we abstract from D,, a new set of quantities Ld,= and directly compute the latter. Obviously, the distances along any diagonal are nondecreasing (either D,! = Di-I,[-1 or D,, = Di-IJ-1 + 1). Therefore, there will be some maximal row number i for which Di,i+d equals some integer e. We label this row number Ld,e. Thus, e is the (minimum) number of differences between r,r2 . . . rL+ (of the pattern) and a subsection of the text ending at bLd,++ Also, we must have rLti+l # bLk+d+, , since otherwise we could extend the match without increasing the number of differences beyond e. For our k differences problem we need to consider only Ld,r values for which e 5 k. Having any such Ld,e = m indicates a successful match of the whole pattern and the portion of the text ending at bm+d. Returning again to our concrete example, we indicate in the legend to Fig. 2 the value of Ld,e for the third (d = 3) diagonal as inferred from elements Di,i+d in Fig. 2. Next we come to the evaluation of Ldp. We will constantly attempt at each stage to maximize the row number for any allowed number of differences and along any diagonal d that we follow. We assume that the LxJI values were already computed for all diagonals and for x < e. We want to find the value of L,, = i, where i is some row number. From the discussion of the simple algorithm, D,l could have been assigned its value e if one of the following conditions is satisfied: (1) D,-,,,- 1, the element preceding D,, on the diagonal d, is also e and ri = bl (the match proceeds and we move along the diagonal) or (2)(a)D+,,,-, is e - 1 and ri # b, (b)Di,,-1 (the

FIG. 2. Example of J& values. The D,,,values from Fig. I along the third, d = 3, diagonal are reproduced. We deduce that the corresponding & values are L,,, = O,L,,, = 3, and L,.J = 4.

I311

FAST

SEQUENCE

493

ALIGNMENTS

element preceding D,, on the ith row, which is on the d - 1 diagonal) is e - 1, or (c)Q-,,, (the element preceding D, on the Ith column, which is on the d + 1 diagonal) is e - 1. Let us trace backward our computation toward the first row of the matrix. We can start from D,, and follow its predecessors on diagonal d by option 1 above until the first time option 2 above occurs. The following program “inverts” this description in order to compute the L,, values. are used to initalize the variable row (correLd,e--ly Ld-l,e-19 and Ld+l,e-l sponding to options a, b, and c in 2 above), which is then increased by one at a time until it hits the correct value of L& (corresponding to option 1 above). As we scan the matrix along successive diagonals and gradually increase the allowed e (for e 5 k), we may reach the last row of the matrix. This means that we can declare a successful match in the following iterative program: 1. Initialization

foralld,Osdsn+ l,Ld,-i:=-1 foralld,-(k+ l)ld(-l,do Ld,,+, : = IdI - 1 L&j,-2 : = IdI - 2 2. for e : = 0 to k do ford:=-etondo 3.

r&V

: =

mm

[(Ld&-,

+

1),

(Ld-+-1),

(Ld+I,e-L

4. while r,,,, , = b,,, , +d do row : = row+ 1 5. ,&:=row 6. if Lde = m then phnt *THERE IS AN OCCURRENCE b d+m*

+

111

ENDING

AT

We skip the detailed proof of the correctness of the improved algorithm,‘* as well as the fairly obvious initialization. The present algorithm evaluates L& for II -l- k + 1 diagonals, and for each diagonal the variable row can get (at most) m + 1 different values. Therefore, it still requires O(mn) steps. New Algorithm

A

We wish to accelerate the above modified algorithm to operate in O(nk) computation time. We do so by modifing the innermost loop (line 4) of the preceding program and jumping over to L,, in o(1) steps (rather than at least m/k, on the average, for the first mismatch in a continuous run when there are altogether k differences in a pattern of length m). To this end, however, we need to first reconstruct the information in the pattern and text in a more efficient form. This is done by using a suffix

494

ALIGNING

PROTEIN

AND

NUCLEIC

ACID

SEQUENCES

1311

tree, an extremely powerful and versatile tool that enables retrieval of information regarding similarities between contiguous substrings of the same string. Let us then detour briefly to define a suffix tree for a general sequence cl, c,, . . . , cl. (For convenience, we assume cl to be a “special base” differing from all the above bases.) Consider the I subsequences Ci=Ci, . . . ) cl of the sequence C. Each such Ci is called a “suffix” of C. Let Ci and Cj be two suffixes. Assume that Cl, . . . , ci+f is the longest common “prefix” of Ci and Cj. That is, ci, . . . ci+/is the longest contiguous substring starting at the leftmost end of Ci (and C’) SO that Ci, . . . , . . . 7 cj+fand Ci+f+l + Cj+f+l - The merit of the suffix tree “data ci+f =Cj, structure” can be summarized as follows: the suffix tree of C enables us to find very fast the longest common prefix for each pair of suffixes. Now that we understand the functionality of suffix trees, we can proceed to describing how they look (see also Fig. 3). The su5x tree has a root, (an unpredictable number of) internal nodes, and 1 “leaves” (i.e., termination nodes). Each of the 1 su5xes defines exactly one leaf. It remains to define the internal nodes of the tree. Recall our two suffixes Ci and Cj. The longest common subsequence Cl, . . . , c,+fdefines an internal node (node which is not a leaf) or a branching point of the tree. We note that the contiguous sequence off+ 1 letters that match exactly the subsequence Cl, * * . , ci+fmay occur more than once in C. Still, all these subsequences together will define only one internal node. So far we have not said anything about the edges of the tree. Each edge connects a pair of nodes. The rationale behind putting the edges of the tree is as follows. There will

RG. 3. Suffix tree for a concrete example, the string abab$, where the special symbol $ is different than a and b. The five suffixes of decreasing length with fixed ending $ are denoted as follows: abab$ = D, bab$ = F, ab$ = E, b$ = G, and $ = C. These suffixes form the extreme tips or tree “leaves.” The nodes ab = A and b = B correspond, according to the rule of tree formation, to LCA(D,E) and LCA(F,G). The LENGTH and START values of the various elements are indicated as follows: START(A) = 1, LENGTH(A) - 2, START(B) = 4, LENGTH(B) = 1, START(C) = 5, LENGTH(C) = 1, START(D) = 1, LENGTH(D) = 5, START(E) = 3, LENGTH(E) = 3, START(J) = 2, LENGTH(F) = 4, START(G) = 4, LENGTH(G) = 2.

1311

FAST

SEQUENCE

ALIGNMENTS

495

be a path in the tree from the root to the leaf defined by Ci and another path from the root to the leaf of Cj. This path must pass through the (internal) node defined by ci, . . . , Ci+p This completes the definition of the suffix tree. An alternative (more formal) definition can be found in Ref. 1. Example. The suffix tree is best illustrated by a concrete example, the sequence abab (see also Fig. 3). We add to it the symbol $ (it is crucial that this symbol is different than both a and b to form ubub$. It contains then five suffixes, abub$, bub$, ab$, b$, and $, which form the leaves of the suffix tree. For instance, the suffixes abab$ and ub$ can track back to a common prefix ab. This indeed will be their lowest common ancestor in the suffix tree. Another instance will be the suffixes bab$ and b$ whose lowest common ancestor in the tree is defined by b. Finally, ub, b, and $ are joined to the root. On construction of the suffix tree, we require that for each node v of the tree a contiguous subsequence 4, . . . , ci+fwhich defines it will be stored as follows: START(v) : = i and LENGTH (v) : =f+ 1. Below, we show how the suffix tree of the above example will be stored. Example (Continued). We denote the five suffixes that form the leaves of the tree: D = ubab$, F = bab$, E = ab$, G = b$, and C = $. We denote the internal nodes A = ub and B = b. Figure 3 gives the full details of how the tree is stored using the START and LENGTH arrays. To set the stage for the improvement, two elements are required. First, the suffix tree of the composite sequence bl, . . . , b,@r,, . . . , rm$, consisting of the text b, , . . . , b,, the pattern r, , . . . , r,, and two extra fictitious bases t and $, different from all bases in both text and pattern. This computation of the suffix tree can be achieved via the standard algorithm of Weineri or McCreightzO in O(n) time and O(n) space requirements. Second, a problem often faced in analyzing the suffix trees is finding the lowest common ancestor of two suffixes (or, pictorially, finding the branching point furthest removed from the root that still leads to both leaves). Remarkably, after some initial overall organization of the suffix tree, queries for the lowest common ancestor can be answered in 0( 1) time. We now return to the crucial step (line 4) of the previous algorithm. For a diagonal d, the situation following line 3 is that we matched (with e differences) ri, . . . , r,, of the pattern with some substring of the text that ends at brow+,. We want to find the largest q for which rrow+, , . . . , r row+q equals brow+d+19 . . . , bow+d+q. We can directly increase row to I9 P. Weiner, Proc. 14th IEEE Symp. Switching Automata Theory, 1 (1973). m E. M. McCreight, J. Assoc. Comput. Mach. 23,262 (1976).

496

ALIGNING

PROTEIN

AND

NUCLEIC

ACID

SEQUENCES

I311

row + q, rather than wasting on this a potentially large number of computations in line 4. I-et LCkow,~ be the lowest common ancestor (LCA) of the leaves of the suffixes bmw+d+l,. . . and rrow+, , . . . in the suffix tree. The desired q is simply LENGTH (LCA,,J. Thus, the problem of finding q is reduced to finding LC&,+ We use either the algorithm of Hare1 and TarjarP or the later simpler algorithm of Schieber and Vishkin** for the purpose of computing LCAs in the suffix tree whenever we need to find such a q throughout the algorithm. In a query of the LCA, the above algorithms require o(1) time [after a reprocessing of the suffix tree which takes O(n) time]. For each of the n + k + 1 diagonals we evaluate in Algorithm A a total of k + 1 L,, values. Therefore, we have O(nk) LCA queries. It will take O(nk) time to process then, yielding the claimed O(nk) time algorithm. To illustrate how this procedure works, we refer again to our example with GGGTCTA as the text and GTTC as the pattern (Fig. 4). Specifically, let us explain how we compute L3,1.For this, we use L2,0,L3,0,and L4,0. Specifically, L,-, = 2, L3,0= 0, and L4,0= 0. Our algorithm (line 3) initializes row to max(L2,,, L,,. + 1, L4,0+ 1) = 2. This is reflected in the box in which “Initially row = 2” is written. From the suffix tree we get that q = 1 (since r, = be = T and r, # 6,). Therefore, L3., : = 3. Algorithm

B

Algorithm B looks for the longest alignment with 1 differences, 15 k, of any consecutive subsequence in one sequence with a consecutive subsequence of another sequence. In order to have uniform notations, we will call the first sequence the pattern and assume the total base number n, = m. The second sequence will be the text of length n2= n. The basic construct here is a three-dimensional, m X n X k + 1, array In this array the pattern R = rl, . . . r, is D[i-l... m;j--l...n;I-O...k]e stretched along the y axis, the sequence B = bl, . . . , b,,along the x axis, and the number of differences I= 0 . . . k is stretched along the z axis. In each call we store a pair of numbers, D,, = 0. i +fand j + g mark the termination points of the “longest” alignment (as defined in the next paragraph) with I differences, starting at i and j, in the pattern and text, respectively. If ri # bj and I= 0, there is no such alignment, and we therefore assign = Kg) = (- 1,- 1). Given the two locations ri and bj; the “longest alignment starting there Di,j,O

*I D. Hare1 and R. E. Tarjan, SIAM J. Comput. 13, 338 (1984). 22 B. Schieber and U. Vishkin, SIAM .I Compuf. 17, 1253 (1988).

[311

FAST

SEQUENCE

497

ALIGNMENTS

A

G

T

T

C

FIG. 4. Illustration of how to find the elements L+ As before, e is some allowed number of differences, and Ldec equals i if i is the maximal row number on diagonal d such that Di,i+d = e. For instance, the entry L2,0 reflects the fact that the immediate next bases C and T do not match. We explain how to compute L,,, . We use Lao (=2), L,,, W), and J& W). Our algorithm (line 3) initializes row to max(Lz,,, L,,, + l,L,,, + 1) = 2 (see the box in which “Initially row = 2” is written). Next, we turn to the suffix tree. We need to find the LCA of the suffix of the pattern r,,r, (=TC) and the suffix of the text &,b, (=TA). From the suffix tree we get that the length of the matching prefix between these suffixes is q = 1 (since r, = b, = T and r, # b,). Therefore, L,,, : = 3.

(with 1~ k differences) can be naturally defined as that extending furthest to the right in the text (i.e., with g maximal). The following algorithm searches for such alignments. However, other definitions of longest alignments can be easily incorporated by some modification. To visualize the following procedure, imagine slicing the three-dimensional array into horizontal layers. Each layer with a fixed height I is then displayed as a separate DLi,,, matrix. For every matrix we compute the entries D,j = V;g) along diagonals i + j = d (orthogonal to those used earlier), proceeding along any diagonal d from bottom up (with decreasing i). We start with the furthest diagonal d = n + m and conclude with d = 2. This is illustrated in Fig. 5 for our standard example, m = 4, n = 7. At each location (i,j) of the diagonal, we compute all the values D,,, . . . , D,, climbing “up” along the vertical column in the three-dimensional array, directly above i,j. Let us now explain how any D,j,, (e.g., cell X in Fig. 6) is computed. Any DiJ,, consists of two entriesfand g denoted by Xfand Xg for all X. This computation is done using information in neighboring cells (Y, W,Z) in Fig. 6, which, according to the above ordering has already been computed.

498

ALIGNING

PROTEIN

AND

NUCLEIC

ACID SEQUENCES

1311

FIG. 5. Illustration of the order of computation. Bach continuous line represents a diagonal. Each diagonal has a number. The broken arrows guide us to start at diagonal 11 and end at diagonal 2. The arrow on each continuous line indicates the order of computation on a diagonal.

Recall that, at this point, we are searching for the longest alignment of subsequences of pattern and text, starting with ri and bj, respectively, and with 1 differences. The following alternatives exhaust all possibilities: 1. The bj is not matched with ri and is therefore an “insertion” into the text. 2. The ri is not matched with bje It is “deleted” from the pattern. 3. ri = bj and we have a correct match of these two bases. 4. ri # bj and we still match the two bases, reviewing the situation as a substitution (of bj for ri). In Case 1 consider Dij+,,,-i at the Y cell. It represents an alignment with 1 - 1 differences starting at ri (in the pattern) and bj+l (in the text). By declaring bj to be an insertion we extend this alignment one more base to the left in the text. This yields a tentative alignment with I differences starting at i and j. The g value in cell X (termed Xg, i.e., the length of the

-I/

-1-

FIG. 6. Location of the X, W,Z, and Y cells utilized in Algorithm B. To compute the values of the X cell, we utilize the Y cell if bj is an “insert.” The W cell is used when r, is deleted from the text. For a match or a substitution, we utilize the Z cell.

[311

FAST

SEQUENCE

499

ALIGNMENTS

present alignment), will then be one unit bigger than in Y. In Case 2 consider Di+ l,j,,- 1 at the W cell, representing the longest alignment with I- 1 differences, starting with ri+, and bj. If we declare that ri was deleted in the text, we generate an alignment of the subsection bj, . . . , bj+wg with I differences, so that Xg = Wg. In Case 3 we can start from Di+Ij+l,,, in cell Z, indicating the 1 differences match starting at ri+l,bj+, and extend it by one base so that Xg = Zg + I. Finally, in Case. 4 Di+Ij+I,,-, , also at the cell Z, can serve as a candidate for fixing D,, and again Xg = Zg + 1. Since we are looking for the longest alignment starting at ri,bj, we will choose the alternative, yielding the largest Xg (and take over the corresponding Xf as well). If there are several candidates of maximum g, we further choose the one having the maximalJ: The following is a concise, more formal representation of Algorithm B. Algorithm for Computing

the Matrix

D,,, .

, ,,,; ,, .

, ,,;,,, . . . , kl

ford:=n+mdownto2do if d > m(*see Remark below*) theni:=m;j:=d-m 1 else i : = d- I;j:= Whilei>Oandj

Fast alignment of DNA and protein sequences.

FASTSEQUENCEALIGNMENTS [311 487 This analysis was taken further by Waterman et al.,‘* who showed that there are regions in the plane defined by the...
919KB Sizes 0 Downloads 0 Views