Hierarchical method to align large numbers of biological sequences.

ALIGNING PROTEIN AND NUCLEIC ACID SEQUENCES

Conclusion
Here, several protein sequences were compared simultaneously with a multiple sequence comparison method to show conserved regions of one sequence. However, the method is applicable for comparing DNA and RNA sequences, too. Pairwise comparisons are superimposed; thus, enormous computer power is not required. This approach is very fruitful because a multiple sequence comparison gives more data than individual pairwise analyses. As a test, sequences for liquefying and saccharifying α-amylases were shown to share some common regions with other amylolytic enzymes, but each shared different conserved regions with the others. Several secondary structural features can be predicted, and their use in comparing protein sequences can point to important features that may be invisible to sequence comparisons. However, some caution is necessary in the use of secondary structure predictions, because predictive methods do not always give correct results. In connection with sequence analysis, structure predictions can be valuable in searching for functional sites, e.g., for protein engineering studies. The relatedness of two hydropathy prediction scales was analyzed with the method. Such applications, extended to the analysis of different predictive methods, may have numerous uses in the study of the relationships between different structural features. The method can even be used to analyze three-dimensional structures, either refined or modeled.

Acknowledgments
Antti Euranto and Petri Luostarinen are thanked for implementation of the algorithm. This work was supported by a grant from Neste Oy Foundation.

[29] Hierarchical Method to Align Large Numbers of Biological Sequences
By WILLIAM R. TAYLOR

METHODS IN ENZYMOLOGY, VOL. 183
Copyright © 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.

Introduction
The rapidly increasing determination of biological sequences has increased the corresponding need for a fast and effective multiple sequence alignment computer program for their analysis. The contents of this volume indicate that this need has not been neglected by those who write programs. The number of sequence alignment programs that have appeared in the last few years, in particular those directed toward aligning more than two sequences, seems to have kept pace (at least in variety) with the growing sequence data banks. In this chapter I describe the computer program that resulted from my own need to align more than two protein sequences. However, with the exception of similar contemporary developments, I have made no attempt to review the rapidly changing multiple sequence alignment field.

Background
Several years ago, if the need arose to align more than two sequences I used the ALIGN program to align pairs of sequences and then combined these pair alignments in a text editor to produce a multiple alignment. If the sequences were not very similar (e.g., 100 sequences) consists of several subfamilies, the interrelationships of which have been analyzed by Schwartz and Dayhoff¹² (see also Dayhoff and

¹² R. M. Schwartz and M. O. Dayhoff, Science 199, 395 (1978).


Barker⁶ and Fig. 6a). The program MULTAL was applied to this family using different alignment strategies and the results compared.

Two different strategies were tried. The first allowed no families to align that scored less than 500 using the ID matrix (for a pair of sequences this is ~50% identity). Using the same cutoff, the second approach gradually shifted weight from the ID matrix to the MD matrix as described above. As this allows higher overall scores to be obtained, the gap penalty (gapen) was gradually increased throughout this run in compensation.

The approach using the "softer" scoring scheme created the largest families. These agreed well with the Schwartz and Dayhoff¹² classification by defining a large group of sequences consisting of the eukaryotic mitochondrial and photosynthetic bacterial sequences (80 sequences including cytochromes c, c-550, and most of the c2 sequences). The other sequences fall into two main groups: the c6 sequences from cyanobacteria and chloroplasts (13 sequences) and the aerobic bacterial sequences (c-551 and some c2, 8 sequences). The remainder includes a small family of two c-555 sequences, remotely related to a C~ sequence. These minor families were eventually joined into one large family (see Fig. 6b).

Using the exact matching scheme (ID matrix), the families were more fragmented (e.g., the photosynthetic bacterial and eukaryotic mitochondrial sequences remained distinct). Although the major subfamilies were clearly defined, 14 sequences remained unclassified (Fig. 6c) compared to 4 in the previous run.

Both the above runs began with the order of sequences as found in the PIR database, which is already almost optimally ordered. As a more exacting test, the alignment program was provided with the sequences in the order determined by the program SEQSORT and rerun using the "softer" scoring scheme described above.
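The "softer" scheme, in which weight moves from the ID matrix to the MD matrix while the gap penalty rises, can be illustrated with a minimal sketch. The details of the shift are not given in this section, so the linear blend, the function names, the toy MD values, and the gap-penalty end points below are all assumptions made for illustration and are not taken from MULTAL.

```python
# Hypothetical illustration: linear blend of an identity (ID) score and a
# mutation-data (MD) style score, with a gap penalty ramped over the cycles.

# Toy stand-in for MD values between unlike residues (invented numbers,
# NOT the Dayhoff mutation data matrix).
TOY_MD = {("L", "I"): 6, ("D", "E"): 7, ("K", "R"): 7}


def identity_score(a, b, match=10):
    """ID matrix: full score for identical residues, zero otherwise."""
    return match if a == b else 0


def md_score(a, b, match=10):
    """MD-like matrix: identical residues score fully, similar ones partly."""
    if a == b:
        return match
    return TOY_MD.get((a, b), TOY_MD.get((b, a), 0))


def blended_score(a, b, w):
    """Score a residue pair with weight (1 - w) on ID and w on MD."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie between 0 and 1")
    return (1.0 - w) * identity_score(a, b) + w * md_score(a, b)


def schedule(n_cycles, gap_start, gap_end):
    """Return (cycle, w, gap_penalty) tuples, both ramped linearly (assumed)."""
    steps = []
    for cycle in range(n_cycles):
        t = cycle / (n_cycles - 1) if n_cycles > 1 else 1.0
        steps.append((cycle, t, gap_start + t * (gap_end - gap_start)))
    return steps
```

Under these assumptions a conservative substitution such as L/I scores nothing in the first cycle and gains credit as the weight shifts, which is why the gap penalty is raised in step to keep overall scores comparable.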
The order imposed by SEQSORT reflected all the major subfamilies, but these were arranged in different positions relative to the order in the PIR database. The alignment starting from this order again identified the major subfamilies and, surprisingly, arrived at this grouping by a route that is largely isostructural with the alignment trees described above (Fig. 6d). Some differences were found, however, the most extreme example being the early alignment of CCMST and CCRF2V, proteins that are separated by 55 sequences in the PIR order. This large displacement was a result of the two sequences being adjacent in the SEQSORT-derived order. Despite this misplacement the resulting alignment was correct.

NBRF Data Bank Alignment
As a test of speed and capacity, the program was applied to the current NBRF (PIR) protein sequence database (Release 16.0). To avoid polyproteins and large repetitive sequences, a maximum sequence length of 500 residues was set. Similarly, to avoid short peptides, a minimum length of 50 residues was set, leaving 5727 sequences. It might be noted at this point that because the program controls its own storage requirements, no changes were required to read in this volume of data.

The first two passes over the data base proceeded cautiously, aligning only closely adjacent sequences within 90% identity and of lengths within 5 residues. In subsequent cycles, the score cutoff was linearly reduced to 500 while the window size was similarly increased until a maximum insertion size of 50 residues was allowed. The range of adjacent sequences compared (span) was doubled in each cycle until this exceeded the size of the data bank. In the final cycles, the span was larger than the size of the condensed data bank; thus, all sequences had been compared to each other, either as individuals or as members of a family.

The progressive condensation of the data bank is plotted in Fig. 7 and tabulated, with details of the parameters for each cycle, in Table I. In the final cycles the condensation of the data base had converged, with little reduction being made in the number of distinct entries (single plus consensus sequences). The data base condensed to 2318 distinct entries, of which 714 contained more than one sequence. This degree of compression is similar to a less complete analysis on an older version of the same data bank.¹⁰

¹³ W. C. Barker, L. T. Hunt, B. C. Orcutt, D. G. George, L. S. Yeh, H. R. Chen, M. C. Blomquist, G. C. Johnson, E. I. Seibel-Ross, M. K. Hong, and R. S. Ledley, "Protein Identification Resource," Version 4.3. National Biomedical Research Foundation, Washington, D.C., 1984.

FIG. 6. Progressive clustering of the cytochrome c superfamily by the program MULTAL using different parameters. The 111 sequences are taken from the PIR data bank (Version 4.3)¹³ in the order in which they occur there and are designated by a code followed by the protein name and their source. (a) For comparison the phylogenetic tree derived by Schwartz and Dayhoff¹² is shown in a simplified form (adapted from Dayhoff and Barker⁶). Those sequences known in 1978 are shown in bold-face type. In the following trees (b-d) each level of the tree represents the state of clustering after a cycle of the program. For example, in the run described in (b), CCHO, CCHP, CCRB, CCDG, CCKGG, and CCMST were aligned in the first cycle of the program. The order of the sequences within such groups (and hence their position in the alignment) is determined by the algorithm described in the text. For clarity, however, the order as found in the PIR data bank is retained. To facilitate comparison with (a), elements that correspond to that tree are emphasized. (b) Clusters defined using the parameters specified in the MD matrix (using the "softer" scoring scheme). Careful examination shows that the trees are remarkably similar. (c) Clusters defined using the parameters specified in the ID matrix (scored by amino acid identity). The tree is largely isostructural with (b) but less complete. (d) As (b) but using the sequence order generated by the program SEQSORT (see text). The tree is largely isostructural with (b) barring some localized rearrangement. One unexpectedly long connection is made between CCMST and CCRF2V; for clarity, it is drawn behind the list of sequence codes.
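The parameter schedule used for the NBRF data bank run (length filter, two cautious passes, then a linearly relaxed cutoff, a widening length window, and a doubling span) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the number of cycles, the starting span of 2, and the use of 900 as the score equivalent of 90% identity (inferred from 500 corresponding to about 50%) are hypothetical and are not taken from the program or from Table I.

```python
def length_filter(seqs, min_len=50, max_len=500):
    """Keep sequences of 50 to 500 residues, as in the run described."""
    return [s for s in seqs if min_len <= len(s) <= max_len]


def cycle_parameters(n_cycles, n_entries, cautious=2,
                     cutoff_start=900, cutoff_end=500,
                     window_start=5, window_end=50, span_start=2):
    """Return one dict of parameters per cycle.

    The first `cautious` cycles keep the strict starting values; afterwards
    the cutoff falls and the window grows linearly. The span doubles after
    every cycle and is capped at the number of entries (assumed behavior).
    """
    ramp = n_cycles - cautious  # cycles over which the values are relaxed
    span = span_start
    params = []
    for cycle in range(n_cycles):
        t = 0.0 if cycle < cautious else (cycle - cautious + 1) / ramp
        params.append({
            "cycle": cycle + 1,
            "cutoff": round(cutoff_start + t * (cutoff_end - cutoff_start)),
            "window": round(window_start + t * (window_end - window_start)),
            "span": min(span, n_entries),
        })
        span *= 2
    return params
```

Once the span reaches the number of entries, every entry is compared with every other, which corresponds to the final cycles of the run described in the text.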

FIG. 7. Plot of the condensation of the NBRF data bank. The 5727 starting sequences are condensed into consensus sequences (families), the number of which rises to 714, while the total distinct entries (single plus consensus sequences) falls to 2318, leaving only 1604 unattached sequences (see Table I for full details). The dashed line plots the course of a similar run¹⁰ using an older (smaller) version of the data bank but on a doubled scale.

The above run took 10 days (234 CPU-hr) on a SUN-4 computer and performed roughly 13 million alignments. Such a large run gives a good estimate of the average speed of the basic pairwise alignment algorithm on typical protein sequences (between 50 and 500 residues) and gives a value of 7f set to align a pair of sequences (or consensus sequences). As this step is rate determining for large runs, this value can be used to calculate the total time for any run once the number of pairwise alignments has been estimated.

Summary
The method presented here is intended as a compromise between finding a good overall alignment and the time taken to do so. Many multiple alignment algorithms spend an excessively large amount of effort trying to find the best global alignment. This time is often ill spent because the results of the standard dynamic programming alignment algorithm are dominated by the choice of gap penalty and the form of the score matrix, both of which have a poor theoretical foundation. Nonetheless, it is important that savings in time do not compromise the quality of the alignment. By using the consensus sequence approach, this danger is largely avoided as the conserved features of the sequences are quickly identified and preserved through further cycles.

In the alignment of existing alignments, which is one of the more novel aspects of the method, each alignment was treated as an averaged consensus sequence with gaps making no contribution. This gives rise to the advantageous property that gaps will have a greater propensity to be inserted where there are already gaps and is equivalent to a local change in the gap penalty. This type of behavior represents a transition away from the homogeneous scoring schemes used in aligning two sequences toward a scoring scheme that depends on position in the sequence. The alignment of consensus sequences thus forms a bridge between simple pair alignment and the alignment of discrete patterns in which sequence features and allowed gap locations are exaggerated.

To complete this transition the program described above has been integrated into the earlier pattern matching (template) program. Such templates can reliably locate sequence similarities that are too weak or scattered to be found by the more standard alignment methods¹⁴ and should therefore produce a further condensation of the sequence data bank. Only by continually extending our knowledge of the relationships between sequences to increasingly distant similarities can we hope to avoid being overwhelmed by the increasing amount of data.

¹⁴ L. H. Pearl and W. R. Taylor, Nature (London) 329, 351 (1987).
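The effect of treating an alignment as an averaged consensus in which gaps contribute nothing, as described in the Summary, can be illustrated with a small sketch. It is not the MULTAL code: the function name `column_score`, the identity-based pair score, and the choice to average over every sequence pair (gapped or not) are assumptions made for illustration.

```python
def identity_pair_score(a, b, match=10):
    """Hypothetical residue score: 10 for identical residues, else 0."""
    return match if a == b else 0


def column_score(col_a, col_b, pair_score=identity_pair_score, gap="-"):
    """Average score between one column from each of two alignments.

    Every residue in col_a is scored against every residue in col_b.
    Pairs involving a gap add nothing to the total but still count in the
    average, so gap-rich columns score lower (assumed interpretation).
    """
    if not col_a or not col_b:
        raise ValueError("columns must not be empty")
    total = 0.0
    for a in col_a:
        if a == gap:
            continue  # gaps make no contribution
        for b in col_b:
            if b == gap:
                continue
            total += pair_score(a, b)
    return total / (len(col_a) * len(col_b))
```

With this identity score, `column_score("AA", "AA")` gives 10.0 while `column_score("A-", "AA")` gives 5.0. Less score is therefore lost by opening a new gap opposite a column that already contains gaps, which acts like a locally reduced gap penalty.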

[30] Significance of Protein Sequence Similarities
By JOHN F. COLLINS and ANDREW F. W. COULSON

Introduction
Similarities between pairs of protein sequences are expressed as alignments of subsequences such as:
** * * * ** * * * * * ** *
NYTGLRKQMAVKKYLN S I LNGK2 -NILEDPVPVKRHSDAVFlD 137 -NIVEE LR RRHADGSFSDEMNYV L DS LATRDFINWLLQTK

42 175
