[6] Searching through Sequence Databases

By RUSSELL F. DOOLITTLE

Introduction

Certainly, among the first things an investigator must do upon determining a new sequence is to compare it with all available sequences to see if it resembles something already known.1 The search need not wait until the sequence is known in every detail; indeed, the sequence determination itself often can be guided by comparison with homologous sequences.

At the same time, there is at present an almost excessive zeal in the community for finding “homologies,” to the extent that some researchers would rather find the second member of some new family than the first. Worse, some investigators are misled by marginal resemblances that are likely not due to common ancestry. In either event, the results of a sequence search usually require that judgments be made about the significance of what has or has not been found. The primary aim of this chapter is to provide a few simple guidelines and hints about how to make these judgments. When it comes to low-level similarity, caution is always warranted.

Another topic touched on is how to interpret homologies when they are found. Even though the discovery of an unexpected homology is often exceedingly useful in providing insights about the function of a protein, a still greater significance lies in reconstructing the history of present-day systems. The construction of sequence-based phylogenies is another area laden with pitfalls, however, and as I shall try to show, the beginner must tread lightly. Thus, although the computer is a wonderful helpmate for the sequence searcher and comparer, biochemists and molecular biologists must guard against the blind acceptance of any algorithmic output; given the choice, think like a biologist and not a statistician.

1 R. F. Doolittle, “Of URFs and ORFs: A Primer on How to Analyze Derived Amino Acid Sequences.” University Science Books, Mill Valley, California, 1987.

METHODS IN ENZYMOLOGY, VOL. 183. Copyright © 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.

Tools

In general, there are three variables involved in any sequence search: the nature of the sequence being searched, the size and quality of the sequence collection available for comparison, and the capability of the computer program and the computer that will be used. The query sequence may be nucleic acid or protein, it may be short or long, known precisely or perhaps not so precisely. Similarly, the database may be composed of either nucleic acid or protein sequences. Moreover, the collection may be small or large, redundant or representative, up-to-date or not up-to-date. The searching program may be fast or slow, sensitive or insensitive, capable of identifying reasonably short runs of similarity or not. It remains for the individual investigator to assess his particular situation as the first step in determining the significance of any similar sequences that may be retrieved.

As discussed elsewhere in this volume, the databases are growing at an alarming rate. At the same time, searching programs are becoming faster and, in some cases, more sensitive. Nevertheless, some of the most remarkable sequence matches have been made with rather limited data banks and relatively simple search routines. Of course, the impact of finding certain kinds of matches is lessening. It is no longer such a surprise to find protein sequences falling into well-defined groups. What was remarkable several years ago may be routine today. Thus, it behooves us to ask just what is being sought and what can be expected realistically.

In this regard, it is not uncommon to hear laments about “how far behind” the sequence banks are, the implication being that current research is thus handicapped. In my view, such complaints are ill-founded. Quite apart from the fact that the major sequence banks have done a remarkable job in a time of limited resources, the pari-mutuel nature of the balance between how many sequences have been determined and how many matches have been made seems to me to blunt this criticism effectively.

Nucleic Acid versus Protein Sequence Data Banks

Currently the two biggest sequence data banks, GenBank and the EMBL Data Library, are mostly collections of DNA sequences. Nonetheless, most of the interesting matches that have been made in the last decade have involved protein sequences. Why is this so, and how does it bear on the maintenance of sequence data banks? Certainly, the overwhelming majority of new “protein sequences” are being determined on the basis of DNA sequences, so it is altogether appropriate, and, indeed, desirable, to bank them as the DNA sequences. The DNA sequence provides essential information, in the resource sense, for molecular geneticists and others; there is an enormous amount of molecular biology not directly concerned with the gene product per se. Nonetheless, there are good reasons for searching expeditions to begin at the protein sequence level.
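As a minimal sketch of what beginning at the protein level can involve in practice, a DNA sequence may be translated in all six reading frames before any protein collection is searched. The function names and the example sequence below are illustrative assumptions and do not come from any particular search package; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate from the first base; stops become '*', unknown codons 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict:
    """Translate both strands in all three frames, keyed '+1'..'+3', '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    example = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"  # hypothetical sequence
    for name, protein in six_frame_translation(example).items():
        print(name, protein)
```

Each of the six translations can then be used as a protein query; frames interrupted by many stop codons are usually poor candidates for a real gene product.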

Some investigators are under the erroneous impression that there is more to be gained by searching the actual DNA sequence rather than the amino acid sequence derived from it. That view is greatly mistaken, and resemblances will be missed if it is adopted. The reason is, of course, that there are only four bases but 20 amino acids. As such, the so-called signal-to-noise ratio is improved greatly when the DNA sequence is translated; the “wrong-frame” information is set aside and third-base degeneracies consolidated. As a general rule, then, searchers dealing with potential gene products should translate their DNA sequences into the protein equivalents. Obviously, this implies that the sequence bank to be searched should be in the form of protein sequences also. Several of the earlier chapters in this volume speak to the point. Thus, although GenBank maintains only nucleic acid sequences on-line, they do provide the Protein Identification Resource with translated versions of all open reading frames. The EMBL Data Library has a similar agreement with SWISS-PROT.

Short versus Long Query Sequences

Generally speaking, the longer a sequence is, the easier it will be to establish a level of confidence about its relationship to other sequences (Fig. 1). Practically speaking, however, the researcher should search whatever is available, as the following example should make clear. Some time ago, during the course of characterization of a melanocyte tumor cell antigen by workers at the Fred Hutchinson Cancer Research Center, a small amount of amino-terminal sequence was determined. Although fewer than 13 residues were available for searching, a scan of the modest-sized protein sequence bank denoted NEWAT2 retrieved a single sequence: human transferrin. Seven of 12 matchable residues were identical; was this significant? The answer to the question came not from statistics, but from experiment.
Because transferrin is an iron-binding protein, tests were undertaken to see if the tumor antigen could bind iron also; it did, and to the same extent as transferrin.3 Since that time, the cDNA sequence for the tumor antigen has been determined and the overall protein sequences found to be about 40% identical with serum transferrins.4 The point is that the initial search with a very short sequence set the investigation on a proper course which would not have been possible by other means.

There are many other examples where searches of partial sequences have allowed great leaps forward, including matching of a partial sequence of platelet-derived growth factor (PDGF) with the v-sis oncogene5,6 and identification of the epidermal growth factor (EGF) receptor with the v-erbB oncogene.7 On a less happy note, more than a few studies have been curtailed when a preliminary search of a sequence revealed it to be a common contaminant such as albumin from the serum or concanavalin A from the lectin column used in purification. As a rule, then, the experimentalist should search early and often.

2 R. F. Doolittle, Science 214, 149 (1981).
3 J. P. Brown, R. M. Hewick, I. Hellström, K. E. Hellström, R. F. Doolittle, and W. J. Dreyer, Nature (London) 296, 171 (1982).
4 T. M. Rose, G. D. Plowman, D. B. Teplow, W. J. Dreyer, K. E. Hellström, and J. P. Brown, Proc. Natl. Acad. Sci. U.S.A. 83, 1261 (1986).
5 R. F. Doolittle, M. W. Hunkapiller, L. E. Hood, S. G. Devare, K. C. Robbins, S. A. Aaronson, and H. G. Antoniades, Science 221, 275 (1983).
6 M. D. Waterfield, G. T. Scrace, N. Whittle, P. Stroobant, A. Johnsson, A. Wasteson, B. Westermark, C.-H. Heldin, J. S. Huang, and T. F. Deuel, Nature (London) 304, 35 (1983).
7 J. Downward, Y. Yarden, E. Mayes, G. Scrace, N. Totty, P. Stockwell, A. Ullrich, J. Schlessinger, and M. D. Waterfield, Nature (London) 307, 521 (1984).

FIG. 1. Guide to significance in the comparison of amino acid sequences, emphasizing the importance of sequence length. NAS is a “normalized alignment score,” in which the number of identities between two sequences is multiplied by 10 (20 for cysteines) and the number of gaps is multiplied by 25. After subtraction of the latter from the former, the score is normalized by dividing by the average length of the two sequences and multiplying by 100. (From Ref. 1.)

FIG. 2. Evolutionary distance versus sequence dissimilarity. Two randomly diverging sequences change in a negatively exponential fashion. After the insertion of gaps to align two random amino acid sequences, it can be expected that they will be 10-20% identical. (From Ref. 1.)

Significance of Resemblances

Often, one is faced with the problem of judging the similarity of a pair of amino acid sequences. Alliteratively put, the question is one of chance, convergence, or common ancestry. As a rule of thumb, if two sequences of 100 residues or more in length are more than 25% identical, common ancestry will almost certainly be the case. It is when two sequences are less than 25% identical that one must be cautious about inferring homology (Fig. 2).

On the other hand, it is never possible to show that two sequences are not related by common ancestry. They may just have been changed so much by repeated mutation that common heritage is no longer apparent, at least to the point where any statistical confidence can be expressed. Sometimes, however, it is possible to make a case by considering several sequences at a time rather than just pairs.8

As an example, consider the ever-expanding immunoglobulin-type family. As is well known, vertebrate antibodies are composed of various combinations of variable and constant regions, the prototypic units of which are about 100 residues long, each containing two cysteine residues.9,10 Early on it was noticed that the histocompatibility-involved protein β2-microglobulin11 conformed to this general structural arrangement, and later, a host of other immune response-related proteins were found to fall into the same general family.12 The really surprising finding was that the family includes certain receptor proteins like the PDGF receptor, not ordinarily thought to be an agent involved in the self-nonself recognition that underlies immunology.13,14 The sequence resemblances in these instances are very low level, often amounting to as few as 10 or 12 residues per 100, but they are consistent enough across a large number of comparisons that the statistical likelihood is compelling.13

8 R. A. Jue, N. Woodbury, and R. F. Doolittle, J. Mol. Evol. 15, 129 (1980).
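The two quantities used in this chapter for judging a pairwise resemblance, percent identity and the NAS defined in the legend to Fig. 1, can be computed from an existing gapped alignment roughly as follows. This is a sketch under stated assumptions rather than the program used for Fig. 1: each contiguous run of "-" is counted as one gap, percent identity is taken over the aligned (non-gap) positions, and the function names and example sequences are hypothetical.

```python
def count_gaps(aligned: str) -> int:
    """Count contiguous runs of '-' in one aligned sequence (assumed gap definition)."""
    runs, in_gap = 0, False
    for ch in aligned:
        if ch == "-":
            if not in_gap:
                runs += 1
            in_gap = True
        else:
            in_gap = False
    return runs


def compare_aligned(a: str, b: str) -> dict:
    """Percent identity and NAS for two sequences that are already aligned."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    identities = cys_identities = compared = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # gap positions are not compared
        compared += 1
        if x == y:
            identities += 1
            if x == "C":
                cys_identities += 1
    gaps = count_gaps(a) + count_gaps(b)
    mean_length = (len(a.replace("-", "")) + len(b.replace("-", ""))) / 2
    # Identities score 10 (20 for cysteines); each gap costs 25.
    raw = 10 * (identities - cys_identities) + 20 * cys_identities - 25 * gaps
    return {
        "identities": identities,
        "gaps": gaps,
        "percent_identity": 100 * identities / compared if compared else 0.0,
        "nas": 100 * raw / mean_length if mean_length else 0.0,
    }


if __name__ == "__main__":
    print(compare_aligned("ACDEFGHIKC", "ACDQFG-IKC"))  # hypothetical pair
```

Neither number says anything by itself about chance, convergence, or common ancestry; as the text stresses, both have to be read against sequence length and against what is known of the proteins.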

Worrying about Convergence

Sequence resemblance owing to convergence, that is, selection for a similar order of amino acids in order to build a specific three-dimensional structure, is rare, mostly because there are so many different combinations of amino acids that can be assembled into equivalent structures. Still, one can imagine instances where similar sequences might be observed in proteins not as a result of common ancestry: in proteins rich in amphipathic helices, which are known for a rhythmic occurrence of polar and nonpolar residues, as one example, or in integral membrane proteins that have a preponderance of nonpolar residues, for another. Again, the case for or against common ancestry will usually be made on nonstatistical grounds. In this regard, common function is more persuasive than standard deviations.

As an example, consider the remarkable case of the G-protein (transducer)-linked receptor proteins. The story begins with the visual pigment protein rhodopsin, the first full sequence of which was reported for cattle.15 Not too long after, a rhodopsin sequence was obtained from the fruit fly,16,17 and not long after that came the publication of three human cone pigment sequences.18 Then a sequence was determined for the β-adrenergic receptor, a binder of adrenalin; a computer search revealed that it was an astonishing 25% identical with bovine rhodopsin.19 Because these sequences were of the order of 350 residues in length, there was no doubt that they were homologous. That both proteins transmitted extracellular signals via cytoplasmic G proteins (transducins) seemed to cement the case beyond a doubt.

Since that finding, over a dozen more receptor sequences have been assigned to the same family.20-26 The percent identity of the most distant pairs in this group is well below 25%, however, and the question arises whether convergence might lie at the heart of some of the similarities. As it happens, all of these proteins appear to have seven segments rich in hydrophobic amino acids, as revealed by hydropathy plots.27 A general model has been proposed whereby the amino-terminal segment is exposed extracellularly and, given seven traverses of the cytoplasmic membrane, the carboxy-terminal segment ends up intracellular.

For the sake of argument, let us suppose that there is a strong selective advantage for a receptor protein to cross the cytoplasmic membrane seven times in order to form a stable seven-pillared structure. Could not the restriction of having to use only the nonpolar subset of amino acids, in segments long enough to cross the lipid bilayer (a minimum of 20 residues at 1.5 Å per residue of helix to stretch across 30 Å of lipid bilayer), be responsible for the similar amino acid sequences? Indeed, the resemblances between some of these sequences are limited to the predicted membrane-spanning segments.25 Were it not for the fact that all these diverse receptors interact with similar transducer proteins, an argument for convergence might be made. Thus, in the case of the most distant members of the group, the major evidence for divergence is not statistical; it is physiological.

9 S. J. Singer and R. F. Doolittle, Science 153, 13 (1966).
10 R. L. Hill, R. Delaney, R. E. Fellows, and H. E. Lebovitz, Proc. Natl. Acad. Sci. U.S.A. 56, 1762 (1966).
11 P. A. Peterson, B. A. Cunningham, I. Berggård, and G. M. Edelman, Proc. Natl. Acad. Sci. U.S.A. 69, 1697 (1972).
12 A. F. Williams, Immunol. Today 8, 298 (1987).
13 H. Hayashida, K. Kuma, and T. Miyata, Proc. Jpn. Acad. (Ser. B) 64, 113 (1988).
14 A. F. Williams and A. N. Barclay, Annu. Rev. Immunol. 6, 381 (1988).
15 Y. A. Ovchinnikov, N. G. Abdulaev, M. Y. Feigina, I. D. Artamanov, A. S. Zoletarev, M. B. Kostina, A. S. Bogachuk, A. I. Miroshinikov, V. I. Martinov, and A. B. Kudelin, Bioorg. Khim. 8, 1011 (1982).
16 J. E. O’Tousa, W. Baehr, R. L. Martin, J. Hirsh, W. L. Pak, and M. L. Applebury, Cell 40, 839 (1985).
17 C. S. Zuker, A. F. Cowman, and G. M. Rubin, Cell 40, 851 (1985).
18 J. Nathans, D. Thomas, and D. S. Hogness, Science 232, 193 (1986).
19 R. A. F. Dixon, B. K. Kobilka, D. J. Strader, J. L. Benovic, H. G. Dohlman, T. Frielle, M. A. Bolanowski, C. D. Bennett, E. Rands, R. E. Diehl, R. A. Mumford, E. E. Slater, I. S. Sigal, M. G. Caron, R. J. Lefkowitz, and C. D. Strader, Nature (London) 321, 75 (1986).
20 T. Kubo, K. Fukuda, A. Mikami, A. Maeda, H. Takahashi, M. Mishina, T. Haga, K. Haga, A. Ichiyama, K. Kangawa, M. Kojima, H. Matsuo, T. Hirose, and S. Numa, Nature (London) 323, 411 (1986).
21 Y. Masu, K. Nakayama, H. Tamaki, Y. Harada, M. Kuno, and S. Nakanishi, Nature (London) 329, 836 (1987).
22 B. K. Kobilka, H. Matsui, T. S. Kobilka, T. L. Yang-Feng, U. Francke, M. G. Caron, R. J. Lefkowitz, and J. W. Regan, Science 238, 650 (1987).
23 D. Julius, A. MacDermott, R. Axel, and T. M. Jessell, Science 241, 558 (1988).
24 J. R. Bunzow, H. H. M. Van Tol, D. K. Grandy, P. Albert, J. Salon, M. Christie, C. A. Machida, K. A. Neve, and O. Civelli, Nature (London) 336, 783 (1988).
25 L. Marsh and I. Herskowitz, Proc. Natl. Acad. Sci. U.S.A. 85, 3855 (1988).
26 P. S. Klein, T. J. Sun, C. L. Saxe III, A. R. Kimmel, R. L. Johnson, and P. N. Devreotes, Science 241, 1467 (1988).
27 J. Kyte and R. F. Doolittle, J. Mol. Biol. 157, 105 (1982).

Shuffled Exons

If all extant proteins were descended from a conventional, average-sized parental type consisting of about 350 amino acids, the reconstruction of lineages would be a straightforward problem. As it happens, however, in many instances various parts of gene products appear to have been exchanged. Often, the shuffled segments involve exons, and these in turn frequently correspond to structural domains in the protein.28 Obviously, “exon shuffling” can confound attempts to reconstruct the history of events.

A number of such genetically mobile units have been identified in eukaryotic proteins. These include the EGF (epidermal growth factor) domain, several classes of units found in fibronectin, the “kringles” characteristic of certain blood-clotting proteins, and several others.29 Typically, these involve sequences of 40-80 amino acids that seem to have the wherewithal to fold into compact units; intrasegment disulfide bonds are common.30

The EGF motif, for example, has been found in dozens of animal proteins.31 The segments are only 40-45 residues long, however, and the question arises, how confident are we that all these units share common ancestry? What they have in common, for the most part, is an array of six cysteines. Cysteines that participate in disulfide bonds are known to be among the slowest changing residues in a protein, since the replacement of either of the linked single residues leaves an unpaired partner.
As a result, such cysteines weigh more heavily in decisions about ancestry. The bulk of the present data favors common ancestry for all the identified EGF types, as opposed to convergence or chance.

Cases of segment resemblance have been found that do not involve cysteines, of course. For example, a carboxy-terminal skein of 66 residues of the oncogene known as v-jun was found to resemble the carboxy-terminal segment of the yeast regulatory protein GCN4.32 It was known that this region of the yeast protein binds DNA, and this provided an important clue to understanding how v-jun and its normal cellular counterpart c-jun function. The underlying message is that one must be alert to regions of similarity even when they occur embedded in an overall background of dissimilarity.

28 W. A. Gilbert, Science 228, 823 (1985).
29 L. Patthy, Cell 41, 652 (1985).
30 R. F. Doolittle, Trends Biochem. Sci. 10, 233 (1985).
31 R. F. Doolittle, in “Prediction of Protein Structure and the Principles of Protein Conformation” (G. Fasman, ed.), p. 599. Plenum, New York, 1989.

Reverse Searching

Up to this point, we have been dealing with the situation in which an unknown sequence is “searched” against a large database. There is another circumstance in which the sequence is examined to see if it contains particular diagnostic or consensus sequences. Consensus sequence searching itself has two overlapping but distinctly different applications. In the one case, an investigator examines the newly determined sequence to see if it contains subsequences typical of some previously characterized families. This might include a scan for active sites or binding motifs (Table I), or longer prototypic sequences, including those of EGF domains, “zinc fingers,”33 or immunoglobulins. If one is dealing with raw nucleic acid sequence data, and especially if noncoding regions are suspected, a search for highly conserved but short regulatory elements34 or less conserved but longer “high-repeat” motifs like the primate Alu sequences35 is in order. We call this reverse searching in that extracts from the database are being used to screen the unknown rather than the other way around. The distinction is not trivial. Thresholds can be set much lower when a screen is being conducted against a single new sequence, and things will be found that would be lost in the noise in a forward direction search.
As a simple example, it would be very cumbersome, if not impossible, to detect potential asparagine-linked carbohydrate attachment sites in a conventional search of a new protein sequence versus a protein sequence data bank, but it is obviously easy to identify any such sites merely by scanning the sequence directly for the presence of the consensus NXS or NXT sequence (Table I). Similarly, once any consensus is established, say, for two cysteines separated by 30-40 other noncysteine residues, it is a simple matter to examine a new sequence for its presence.

32 P. K. Vogt, T. J. Bos, and R. F. Doolittle, Proc. Natl. Acad. Sci. U.S.A. 84, 3316 (1987).
33 J. M. Berg, Science 232, 485 (1986).
34 T. Wirth, L. M. Staudt, and D. Baltimore, Nature (London) 329, 174 (1987).
35 P. L. Deininger and C. W. Schmid, J. Mol. Biol. 127, 437 (1979).
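The two consensus scans just described can be sketched with ordinary regular expressions. The pattern names, the helper function, and the test sequence below are assumptions made for illustration; the patterns follow the text literally (X is any residue, and the residues between the two cysteines must not be cysteine).

```python
import re

# NXS or NXT; the lookahead lets overlapping sites all be reported.
N_GLYC = re.compile(r"(?=N[A-Z][ST])")
# Two cysteines separated by 30-40 residues, none of which is cysteine.
CYS_PAIR = re.compile(r"(?=C[^C]{30,40}C)")


def find_sites(pattern, protein):
    """Return the 1-based start position of every match of pattern in protein."""
    return [m.start() + 1 for m in pattern.finditer(protein.upper())]


if __name__ == "__main__":
    new_sequence = "MKNGSAANATC" + "A" * 32 + "CNPT"  # hypothetical sequence
    print("NXS/NXT sites:", find_sites(N_GLYC, new_sequence))
    print("Cys pairs:", find_sites(CYS_PAIR, new_sequence))
```

Because the screen is run against one sequence only, every hit can be inspected by eye; that is what allows the threshold to be set as low as a three-residue pattern.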

TABLE I
SIMPLE SEQUENCES THAT TYPIFY CERTAIN ACTIVE SITES AND BINDING SITES

Protein type                    Sequence
Serine protease (a)             GDSG
Zinc-metallopeptidase (b)       HEXXH
Acid protease (c)               D(S/T)G
Asn-linked carbohydrate (d)     NX(S/T)
Cell adhesion (e)               RGD
Nucleotide binding (f)          GXXXXGK(S/T)
Cytochrome c (g)                CXXCH

a H. Neurath and G. H. Dixon, Fed. Proc. Fed. Am. Soc. Exp. Biol. 16, 791 (1957).
b C. V. Jongeneel, J. Bouvier, and A. Bairoch, FEBS Lett. 242, 211 (1989).
c J. Sodek and T. Hofmann, Can. J. Biochem. 48, 1014 (1970).
d L. T. Hunt and M. O. Dayhoff, Biochem. Biophys. Res. Commun. 39, 757 (1970).
e E. Ruoslahti and M. D. Pierschbacher, Science 238, 491 (1987).
f J. E. Walker, M. Saraste, M. J. Runswick, and N. J. Gay, EMBO J. 1, 945 (1982).
g R. E. Dickerson, J. Mol. Biol. 57, 1 (1971).
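Short motifs of the kind listed in Table I translate directly into pattern strings, and the same patterns can be run through a whole collection of sequences to pull out candidate family members. The sketch below uses a subset of the table; the dictionary keys, the function name, and the two-entry collection are hypothetical and stand in for a real sequence bank.

```python
import re

# A subset of Table I written as regular expressions; X (any residue) becomes '.'.
TABLE_I = {
    "serine protease": "GDSG",
    "zinc metallopeptidase": "HE..H",
    "acid protease": "D[ST]G",
    "Asn-linked carbohydrate": "N.[ST]",
    "cell adhesion": "RGD",
    "cytochrome c": "C..CH",
}


def scan_collection(collection, motifs=TABLE_I):
    """Yield (entry name, motif name, 1-based position) for every motif hit."""
    compiled = {name: re.compile(f"(?={pat})") for name, pat in motifs.items()}
    for entry, protein in collection.items():
        for name, regex in compiled.items():
            for match in regex.finditer(protein.upper()):
                yield entry, name, match.start() + 1


if __name__ == "__main__":
    toy_bank = {  # hypothetical entries, not real proteins
        "seqA": "AAGDSGGPLRGD",
        "seqB": "MKCAACHEAAHW",
    }
    for hit in scan_collection(toy_bank):
        print(hit)
```

Patterns this short will turn up by chance many times in a large collection, so the hits are candidates to be examined further and not evidence of relationship in themselves.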

The other mode of consensus sequence searching arises when one has identified a particular motif and is interested in finding other potential members in the data bank. In this case, the consensus sequence is run through the entire collection in a search for candidates.

Reconstructing History

At the beginning of this chapter I remarked that the most interesting aspect of sequence comparison is the reconstruction of events leading to present-day proteins and organisms. As new proteins evolved, new environments were conquered and novel adaptations enacted. In searching for a good example, I settled on a subset of the transducin-linked receptor proteins discussed in an earlier section. The sequences of a number of visual pigment proteins are available,15-18 and a phylogenetic tree based on these sequences ought to provide insights into the evolution of color vision.

Accordingly, two distantly related rhodopsins (bovine and fruit fly) and three human cone pigments were subjected to three different tree-building regimens.36-38 Other members of the transducin-linked receptor family were used as “outliers” in order to root the trees. All three trees, obtained by the completely different methods, had the same topology and indicated that color vision in primates has evolved incrementally as the result of widely spaced gene duplications on different lineages. The first of these, which ought to have occurred in some early vertebrate, led to the elaboration of a red-green type diverging from the rhodopsin prototype (denoted by a in Fig. 3). The second key duplication also involved the parental rhodopsin gene and led to a blue type opsin (b in Fig. 3). Assuming an approximately constant rate of change, this duplication ought to have occurred well along in vertebrate evolution, but probably before the divergence of reptiles and mammals. Finally, there has been a very recent duplication on the primate lineage that gave rise to the red-green splitting (c in Fig. 3). By coincidence, Yokoyama and Yokoyama39 recently analyzed these same visual pigments, but using the genomic DNA sequences. They constructed a topology by yet a fourth method and generated a quite different branching order. In their tree, the duplication denoted b in Fig. 3 does not involve the parental rhodopsin gene but involves instead the color-sensitive daughter descendant of the first duplication (a). How can we tell which is the correct interpretation? Once again, in my view statistics will not make the case. Rather, opsin sequences must be examined from a variety of vertebrates. If the depiction in Fig. 3 is correct, some lower vertebrates, perhaps fish or amphibia, will be found to have a red-green-sensitive pigment in addition to rhodopsin, but more recently diverged vertebrates (reptiles and birds) will have in addition a blue-sensitive pigment. 
Moreover, when the phylogenies of these sequences are determined they will mirror Fig. 3. If the alternative tree of Yokoyama and Yokoyama39 is correct, then the color-sensitive pigment found in lower vertebrates could well be either red-green or blue sensitive. In either case, the additional sequences should reveal the real history of events. The point I would make is that sequence-based topologies can often be verified by examining a judiciously chosen set of organisms which have diverged before and after the duplication events of interest.

36 D.-F. Feng and R. F. Doolittle, J. Mol. Evol. 25, 351 (1987).
37 J. Hein, this volume, [39].
38 R. F. Doolittle and D.-F. Feng, this volume, [41].
39 S. Yokoyama and R. Yokoyama, Mol. Biol. Evol. 6, 186 (1989).
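The disagreement between the two branching orders can be stated compactly by listing the groupings (clades) that each rooted tree implies. The sketch below encodes both topologies for the four vertebrate pigments as nested tuples, following the verbal descriptions in the text. The nested-tuple encoding, the function name, and the omission of the fruit fly rhodopsin and the outlying receptors are simplifying assumptions made here for illustration; branch lengths are ignored.

```python
def clades(tree):
    """Return the set of leaf groupings (frozensets) implied by a rooted tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


# Topology described for Fig. 3: blue opsin arises from the rhodopsin lineage.
fig3 = (("ROBO", "OPBH"), ("OPRH", "OPGH"))
# Alternative topology: blue opsin arises from the color-sensitive lineage.
alternative = ("ROBO", ("OPBH", ("OPRH", "OPGH")))

shared = clades(fig3) & clades(alternative)
only_fig3 = clades(fig3) - clades(alternative)
only_alternative = clades(alternative) - clades(fig3)

if __name__ == "__main__":
    print("shared:", [sorted(c) for c in shared])
    print("Fig. 3 only:", [sorted(c) for c in only_fig3])
    print("alternative only:", [sorted(c) for c in only_alternative])
```

The groupings that appear in only one of the two trees are exactly the ones that opsin sequences from additional vertebrates would be expected to confirm or refute.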

FIG. 3. Phylogeny of visual pigment proteins15-18 and their relationship to pharmacologic receptors of the β2-adrenergic receptor type. RODM, Rhodopsin from Drosophila melanogaster; ROBO, rhodopsin, bovine; OPRH, opsin red, human; OPGH, opsin green, human; OPBH, opsin blue, human. The arrow marked Time indicates the direction of evolution; the vertical segment lengths are proportional to amounts of evolutionary change (horizontal lengths are arbitrary). The depiction shows that color vision must have evolved incrementally. The prediction is that there are extant creatures which diverged from other vertebrates in the time zone set off by dashed lines before the duplications leading to blue and green (red) pigments and which possess only a red (or green) pigment and rhodopsin. A significantly different version of events has been described by others.39

Summary Comment

The vast majority of extant proteins are the result of a continuous series of genetic duplications and subsequent modifications. As a result, redundancy is a built-in characteristic of protein sequences, and we should not be surprised that so many new sequences resemble already known sequences. All the same, some sequences may have changed so much that their histories are blurred beyond recognition, and investigators must be cautious in interpreting marginal similarities. Furthermore, exon shuffling can confuse situations to the point where the simplest interpretations may be misleading. Still, searching a database with a new sequence is always an exciting proposition. As the data accumulate and the searching becomes more systematic, it should be possible not only to reconstruct the history of most existing proteins but also to relate their appearance to the adaptations that underlie all organismal existence.
