Proc. Natl. Acad. Sci. USA Vol. 88, pp. 8121-8125, September 1991 Evolution
Evolution and relatedness in two aminoacyl-tRNA synthetase families
GLENN M. NAGEL* AND RUSSELL F. DOOLITTLE
Center for Molecular Genetics, M-034, University of California, San Diego, La Jolla, CA 92093
Contributed by Russell F. Doolittle, June 24, 1991
ABSTRACT Sequence segments of about 140 amino acids in length, each containing a selected consensus region, were used in alignments of the aminoacyl-tRNA synthetases with the aim of discerning their evolutionary relationships. In all cases tested, enzymes specific for the same amino acid from a variety of organisms grouped together, reinforcing the supposition that the aminoacyl-tRNA synthetases are very ancient enzymes that evolved to include the full complement of 20 amino acids long before the divergence leading to prokaryotes and eukaryotes. The enzymes are divided into two mutually exclusive groups that appear to have evolved from independent roots. Group I, for which two sequence segments were analyzed, contains the enzymes specific for glutamic acid, glutamine, tryptophan, tyrosine, valine, leucine, isoleucine, methionine, and arginine. Group II enzymes include those activating threonine, proline, serine, lysine, aspartic acid, asparagine, histidine, alanine, glycine, and phenylalanine. Both groups contain a spectrum of amino acid types, suggesting the possibility that each could have once supported an independent system for protein synthesis. Within each group, enzymes specific for chemically similar amino acids tend to cluster together, indicating that a major theme of synthetase evolution involved the adaptation of binding sites to accommodate related amino acids with subsequent specialization to a single amino acid. In a few cases, however, synthetases activating dissimilar amino acids are grouped together.
Aminoacyl-tRNA synthetases catalyze the esterification or "charging" of a single amino acid to its cognate tRNA; thus, 20 such enzymes, one enzyme specific for each amino acid found in proteins, constitute a minimum set for protein biosynthesis. The evolution and structural relatedness of these enzymes have been a subject of intense interest for many years (for review, see refs. 1-3). Because they appear to participate universally in protein synthesis, the origins of these "activating" enzymes must be very ancient (4) and studies of their divergence may shed significant light onto the development of the genetic code and its expression. In addition, the structural basis for nucleic acid-protein recognition and the manner in which enzymes have come to activate only a single amino acid and cognate tRNA while excluding a large number of structurally similar molecules is a subject of great interest. As the sequences of more and more of the aminoacyl-tRNA synthetases became available, we undertook a study of the interrelationships within the more than 50 reported sequences with the aims of tracing their evolution and of identifying more conserved sequence segments that are presumably essential to their biological function.
Because these enzymes catalyze the same overall reaction and utilize a common strategy for chemical activation of their amino acids, aminoacyl-AMP being formed at the expense of ATP, it has long been supposed that all these enzymes had a common ancestral root, even though large differences in polypeptide chain length (303-1104 amino acids) and quaternary structure (α, α2, α4, and α2β2) appeared to argue for more diversity. Initially, studies of primary sequence similarities yielded only limited regions of clear relatedness. The "HIGH" (His-Ile-Gly-His) consensus or signature sequence was identified early in a small group of enzymes (5), but other enzymes appeared to lack this motif. In addition, the "KMSKS" (Lys-Met-Ser-Lys-Ser) sequence was revealed (6, 7) in another group that overlapped significantly in membership with the HIGH enzymes. The HIGH region has been implicated in amino acid activation and the KMSKS region has been implicated in "docking" the acceptor stem of tRNA and/or transferring amino acid to the 3' end of tRNA in these enzymes (7-11). More recently a third consensus sequence denoted "GLER" (Gly-Leu-Glu-Arg) was observed (12, 13). Because the enzymes containing this sequence formed a group apparently exclusive of the earlier group, it was hypothesized (13) that two separate evolutionary families of the synthetases exist. The first x-ray crystallographic structure of an enzyme from the GLER group (14), seryl-tRNA synthetase, SerRS,† supported this view, there being virtually no structural similarity between SerRS and enzymes of the HIGH group MetRS (15), TyrRS (16), and GlnRS (17). Our own studies, which were in progress when these reports appeared, are in agreement with the existence of two separate groups, designated here simply as group I and group II. The work described in this paper identifies sequence segments in members of each group and applies standard computer methods to align the sequences and to discern their evolutionary relationships.
MATERIALS AND METHODS
Amino Acid Sequences. The amino acid sequences used in this study either were taken from the National Biomedical Research Foundation sequence collection (Version 21), translated from DNA sequences in GenBank (Release 61) or EMBL (Release 18 on compact disc, April 1989) or were entered by us directly from the original literature. The latter included GluRS from Rhizobium meliloti (18), LysRS from yeast (19), and yeast GlnRS (20).
Programs. Sequence alignments were made by the progressive method (21). Phylogenetic trees were constructed from multiple sequence alignments by a matrix procedure (22) and by a nearest-neighbor character analysis (23). The program for the latter method is called PAPA (parsimony after progressive alignment). Best trees were defined as those with the lowest percent standard deviations and no negative branch lengths. The most similar portions of various pairs of sequences were identified with a program called INSPECT (24).

*Permanent address: Department of Chemistry and Biochemistry, California State University, Fullerton, CA 92634.
†Abbreviations: Aminoacyl-tRNA synthetases specific for a given amino acid have been abbreviated by the convention employing the three-letter designation for the amino acid followed by RS; e.g., SerRS denotes seryl-tRNA synthetase, etc. Where it was necessary to include a designation of the biological source of the enzyme, the abbreviation was modified to include the single-letter symbol for the amino acid along with ec for Escherichia coli, bs for Bacillus stearothermophilus, rm for Rhizobium meliloti, and yc (yeast) for Saccharomyces cerevisiae; e.g., E. coli seryl-tRNA synthetase is Sec, etc.
The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

RESULTS
The primary sequence segments we have used in our analysis are depicted diagrammatically in Fig. 1. In some instances, it was necessary to delete strings of 5-43 amino acids to obtain correct alignments. Group I enzymes contain two conserved sequences that served as identifiers. The N-terminal segment includes the HIGH sequence and the C-terminal region is characterized by the KMSKS sequence. The large enzymes specific for valine, isoleucine, and leucine also include a central region. MetRS, the next most closely related enzyme, contains a shortened version of this region, whereas in the smallest enzymes (e.g., TrpRS and TyrRS), it is absent altogether. The size of the smaller enzymes, therefore, limits the sequence length that could be utilized in alignments of group I.
In this analysis, our attention has focused primarily on the enzymes of bacterial origin and from yeast cytoplasm, excluding for the moment sequences from organelles and from higher eukaryotes. Restricting the data set in this way simplified the analysis not only by reducing the number of sequences but also by eliminating some sequences, particularly those of mitochondrial origin, which were more variant and sometimes contained strings with little or no sequence similarities to related enzymes. In every case, however, enzymes specific for the same amino acid could be readily aligned with one another after being suitably trimmed, regardless of source. As a general rule, these enzymes clustered together and were more similar to each other than they were to enzymes specific for any other amino acid.
A complete sequence alignment for the N-terminal segments of 16 group I enzymes, specific for nine amino acids, is shown in Fig. 2. In applying the progressive alignment procedure (22), care was taken to optimize the alignment scores by exploring alternate orders of sequence input into the DFALIGN program and by judicious trimming of regions to be aligned. The output from the SCORE and DFALIGN programs was particularly useful in this regard and, in the final result after several iterations, the input and output order of sequences to and from DFALIGN was the same. For enzymes specific for the same amino acid, the percent identity varied from 64% (Yec vs. Ybs) to 28% (Iec vs. Iyc). For enzymes specific for different amino acids, values ranged from 29% (Vyc vs. Lec) to 7% (Wbs vs. Mec) identical.
Because these sequences are quite divergent, we applied a number of internal checks for consistency. First, we calculated distance scores by two independent approaches, matrix and PAPA, with the aim of finding agreement with regard to both the branching order and relative branch lengths. As an additional means of tracing the evolution of these proteins, we applied an identical and independent analysis to the C-terminal sequence segments (Fig. 3). The statistical properties (percent identities and distance scores) of the aligned sequences were remarkably like those of the N-terminal region. (See the two trees obtained by the PAPA analysis in Fig. 5 A and B for the N-terminal and C-terminal segments, respectively.) The main difference between the two trees was the relative branching of leucine and isoleucine.
The same approach was used for group II (14 enzymes specific for 10 amino acids) except that only a single sequence segment was analyzed. The range of sequence identities in the aligned regions (Fig. 4) is very similar to that found for the group I enzymes. For enzymes specific for the same amino acid, the observed range was 50% (Tec vs. Tyc) to 31% (Sec vs. Syc). For enzymes charging different amino acids, the range was 33% (Nec vs. Dyc) to 6% (Aec vs. Syc and Nec vs. Sec). The matrix and PAPA methods for calculating evolutionary distances were in agreement (Fig. 5C). Both methods yielded the same branching order with no negative branches.
[Fig. 1 graphic: (A) group I segments; (B) group II segments. The residue-number limits drawn on the diagram are not reproduced.]
FIG. 1. Aminoacyl-tRNA synthetase sequence segments used in alignments of group I (A) and group II (B) enzymes. Segments are shown as bold lines with the N- and C-terminal limits of each denoted above. Enzymes are abbreviated using the single-letter designation for the specific amino acid activated followed by two lowercase letters for the organism. For the phenylalanine- and glycine-specific enzymes, "S" indicates that the sequences of the small subunits were used. Only bacterial and yeast cytoplasmic sequences were used in this analysis. For Erm, amino acids 74-81 were removed prior to alignment of the N-terminal segments of group I. Similarly, the following residues were removed from the C-terminal segments of group I enzymes: from Yec, residues 251-262; from Ybs, residues 247-258; from Vec, residues 573-609; from Vyc, residues 722-758; from Iyc, residues 666-671; and from Lec, residues 573-615.
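The labeling convention described in this legend is regular enough to be decoded mechanically. The following is a minimal sketch, not part of the original analysis: the function name `parse_label` and the returned tuple layout are hypothetical, and the organism codes are the four given in the abbreviations footnote.

```python
import re

# Standard one-letter amino acid codes.
AMINO_ACIDS = {
    "A": "alanine", "R": "arginine", "N": "asparagine", "D": "aspartic acid",
    "C": "cysteine", "Q": "glutamine", "E": "glutamic acid", "G": "glycine",
    "H": "histidine", "I": "isoleucine", "L": "leucine", "K": "lysine",
    "M": "methionine", "F": "phenylalanine", "P": "proline", "S": "serine",
    "T": "threonine", "W": "tryptophan", "Y": "tyrosine", "V": "valine",
}

# Organism codes from the abbreviations footnote.
ORGANISMS = {
    "ec": "Escherichia coli",
    "bs": "Bacillus stearothermophilus",
    "rm": "Rhizobium meliloti",
    "yc": "Saccharomyces cerevisiae",
}

# One amino acid letter, an optional "S" (small subunit), then the organism code.
LABEL = re.compile(r"^([A-Z])(S?)(ec|bs|rm|yc)$")


def parse_label(label):
    """Return (amino acid, organism, is_small_subunit) for a label such as 'FSec'."""
    match = LABEL.match(label)
    if match is None or match.group(1) not in AMINO_ACIDS:
        raise ValueError(f"unrecognized enzyme label: {label!r}")
    letter, subunit, organism = match.groups()
    return AMINO_ACIDS[letter], ORGANISMS[organism], subunit == "S"


if __name__ == "__main__":
    for name in ("Sec", "FSec", "Erm", "Wbs"):
        print(name, parse_label(name))
```

Note that "Sec" parses as serine from E. coli because the optional subunit marker is only consumed when a second letter precedes the organism code.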
[Fig. 2 graphic: aligned N-terminal segments for Yec, Ybs, Wec, Wbs, Eec, Erm, Qec, Qyc, Vec, Vyc, Iec, Iyc, Lec, Mec, Myc, and Rec. The alignment rows are not reproduced.]
FIG. 2. Multiple sequence alignment of the N-terminal sequence segments of group I enzymes. Precise locations of each sequence segment are shown in Fig. 1. The $ symbol corresponds to the deleted string described in the legend to Fig. 1.
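The trimming step referred to in this legend, in which a string of residues is removed and its position marked with `$`, can be sketched as follows. This is an illustrative helper only: the function name `excise`, the 1-based inclusive numbering, and the toy sequence are assumptions, not the authors' program.

```python
def excise(sequence, first, last, marker="$"):
    """Remove residues first..last (1-based, inclusive) and leave a marker.

    The marker plays the role of the $ symbol in the alignment figures:
    it records where a string was deleted before alignment.
    """
    if not (1 <= first <= last <= len(sequence)):
        raise ValueError("residue range lies outside the sequence")
    # Python slices are 0-based and end-exclusive, hence first - 1 and last.
    return sequence[:first - 1] + marker + sequence[last:]


if __name__ == "__main__":
    toy = "ABCDEFGHIJ"  # made-up sequence, for illustration only
    print(excise(toy, 3, 5))  # residues C, D, E are replaced by the marker
```

Applied to a full-length sequence, a call such as `excise(seq, 251, 262)` would correspond to the Yec deletion listed in the legend to Fig. 1, assuming `seq` is numbered from the first residue of the protein.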
[Fig. 3 graphic: aligned C-terminal segments for Ybs, Yec, Wec, Wbs, Eec, Erm, Qec, Qyc, Vec, Vyc, Iec, Iyc, Lec, Mec, Myc, and Rec. The alignment rows are not reproduced.]
FIG. 3. Multiple sequence alignment of the C-terminal sequence segments of group I enzymes. Precise locations of each segment are shown in Fig. 1. The $ symbols correspond to the deleted strings described in the legend to Fig. 1.
[Fig. 4 graphic: aligned segments for Tec, Tyc, Pec, Sec, Syc, Kec, Kyc, Dyc, Nec, Hec, Hyc, FSec, GSec, and Aec. The alignment rows are not reproduced.]
FIG. 4. Multiple sequence alignment of group II enzymes. Precise locations of each sequence segment are shown in Fig. 1.

DISCUSSION
In view of the observation that the use of 20 aminoacyl-tRNA synthetases appears to be universal, it is logical to conclude that the basic evolution of this part of the protein synthetic apparatus was completed and the full complement of amino acids utilized prior to the time when the major groups of present-day organisms diverged. This conclusion is supported by our observation that synthetases specific for the same amino acid, but from diverse organisms, are more similar to each other than they are to enzymes specific for any other amino acid. Thus, the data place the branching of the prokaryotic and eukaryotic lineages later in time than the branching and specialization of enzymes with regard to particular amino acids.
In working with divergent sequences such as those examined here, one is concerned about finding the "right" alignments. Although we cannot be certain of having succeeded in this quest, it can be stated that many things are right about the alignments and the resulting trees presented herein. (i) All alignments were based on recognizable consensus sequences. (ii) Care was taken to employ sequence segments sufficiently large to yield statistically significant data but short enough to make visual inspection and analysis possible. This allowed us, in some cases, to locate homologous regions that initially escaped detection by computer analysis as a result of radical insertions or deletions. (iii) Because these alignments are sensitive to the relative order in which sequences are submitted to the progressive alignment method, many alternative combinations were tried and the process was refined to cluster closely related sequences. (iv) Each set of aligned sequences was analyzed by two independent methods to yield trees, and we sought alignments that yielded, ideally, the same evolutionary relationships when analyzed by both methods. For both groups I and II, the alignments reported here satisfied, with the few exceptions noted, these criteria of consistency in evolutionary order. Since there were two sequence segments used for the group I enzymes, we imposed the additional criterion that the four trees that could be drawn for each pair of alignments (completed with the same order of sequence input) be the same.
The enzymes are divided into two groups that appear to be exclusive. Although there is clear homology within each group, all attempts to uncover significant sequence relationships between members of the two groups have proven unsuccessful. Similar conclusions have been reported by others (13). Thus, the data argue that the two groups have
undergone a kind of convergent evolution with regard to function in forming the present-day set of enzymes. It is interesting to note that each group contains a rather full complement of chemically diverse amino acids, including acidic, basic, hydrophobic, and hydrophilic representatives. This finding is consistent with each group having arisen independently in two archaic protein synthetic apparatuses, each using a more restricted set of amino acids. Such a situation could have existed in a single organism or in two distinct biological environments that later merged during the course of evolution. If the two did not coevolve, one group must have come first with the other being recruited to supplement the pool of available amino acids. The data provide no reason to believe that one group is the more ancient. Distance scores and percent identity cover a remarkably similar range within each group. Although proteins certainly may evolve at different rates, these data suggest a coordinate development.
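The percent identities compared here are computed over pairs of aligned segments. A minimal sketch of one such calculation follows; it assumes that gaps are written as "-" and that positions where either row has a gap are left out of the count, which may differ from the convention used by the authors' programs. The function name `percent_identity` and the short strings are illustrative.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent of identical residues over positions where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # assumed convention: gapped positions are not compared
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Two short illustrative strings that differ at one of six positions.
    print(round(percent_identity("LHLGHL", "LHIGHL"), 1))
    # A gapped pair: four positions are compared, three are identical.
    print(percent_identity("AC-DE", "AGFDE"))
```

Counting gapped positions as mismatches instead would lower the values, so the convention should be stated whenever identities from different sources are compared.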
[Fig. 5 graphic: (A) group I (HIGH); (B) group I (KMSKS); (C) group II (GLER). Tree drawings and branch lengths are not reproduced.]
FIG. 5. Trees showing evolutionary relationships among group I (A and B) and group II (C) enzymes. Branching order and distance scores are based on the alignment shown in Fig. 2 for A, Fig. 3 for B, and Fig. 4 for C. Distance scores shown were derived from the parsimony-based PAPA program (23).
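Materials and Methods names two criteria for choosing among candidate trees: the lowest percent standard deviation and the absence of negative branch lengths. The sketch below illustrates both checks under stated assumptions. The percent standard deviation is taken here to be the root of the summed squared relative differences between observed pairwise distances and tree path lengths, divided by the number of pairs minus one, times 100; the exact formula used by the authors' programs is not given in the text. The tree encoding, function names, and numbers are all hypothetical.

```python
import math


def path_lengths(edges, leaves):
    """Leaf-to-leaf path lengths for a tree given as {(node_u, node_v): length}."""
    adjacency = {}
    for (u, v), length in edges.items():
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))
    result = {}
    for start in leaves:
        # Depth-first walk accumulating distance from the starting leaf.
        dist = {start: 0.0}
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor, length in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + length
                    stack.append(neighbor)
        for end in leaves:
            if start < end:
                result[(start, end)] = dist[end]
    return result


def percent_sd(observed, edges):
    """Assumed definition: 100 * sqrt(sum(((d - t) / d)**2) / (pairs - 1)).

    `observed` maps alphabetically ordered leaf pairs to nonzero distances.
    """
    leaves = sorted({name for pair in observed for name in pair})
    tree = path_lengths(edges, leaves)
    terms = [((observed[p] - tree[p]) / observed[p]) ** 2 for p in observed]
    return 100.0 * math.sqrt(sum(terms) / (len(terms) - 1))


def has_no_negative_branches(edges):
    """Second criterion: every branch length must be non-negative."""
    return all(length >= 0 for length in edges.values())


if __name__ == "__main__":
    # Toy three-leaf tree joined at one internal node X (made-up numbers).
    edges = {("A", "X"): 2.0, ("B", "X"): 3.0, ("C", "X"): 5.0}
    observed = {("A", "B"): 5.0, ("A", "C"): 7.0, ("B", "C"): 10.0}
    print(has_no_negative_branches(edges), round(percent_sd(observed, edges), 2))
```

With several candidate trees for the same distance matrix, one would keep those passing `has_no_negative_branches` and prefer the smallest `percent_sd`.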
Synthetase pairs specific for the acidic amino acids and their amides, glutamic acid/glutamine and aspartic acid/asparagine (25), cluster closely despite the fact that each is found in a separate group. The data indicate that the addition of amidated amino acids from their corresponding acids (or vice versa) is a relatively recent addition to protein synthesis. Similarly, the data point to the more recent radiation of the aliphatic amino acids (valine, leucine, isoleucine, and methionine), a cluster first identified by Heck and Hatfield (26). The synthetases specific for the aliphatic amino acids are more closely related in terms of percent identity to one another than members of any other cluster, save aspartic acid and asparagine.
In most cases, closely related enzymes recognize amino acids that are chemically similar to one another. Examples already noted include the two acid amide clusters and the aliphatic cluster. In addition, the tryptophan and tyrosine enzymes, although they apparently were among the first to diverge, are closely related as are the enzymes for the hydroxylated amino acids serine and threonine. The close link between the charged and/or polar amino acids aspartic acid, asparagine, and lysine (27, 28) is an additional example as is the relationship between the enzymes recognizing the small amino acids glycine and alanine. These data argue that radiation occurred by adapting binding sites to chemically similar amino acids.
It is possible that, as particular enzymes evolved, an early form may have charged initially a tRNA or a group of tRNAs with two or more amino acids. As long as amino acids were sufficiently similar in chemical properties, this ambiguity may have been tolerable during early stages of life. As protein synthesis became more complex and the requirements for particular amino acids became more stringent, however, one can envision evolutionary pressure for the aminoacyl-tRNA synthetases to select particular tRNA-amino acid pairs.
Thus, a primordial AspRS gene may have undergone duplication and mutation such that there existed both an AspRS and an AsnRS charging the same set of tRNA molecules with either amino acid. Refinements in specificity may have led to a parallel evolution in cognate tRNA and/or synthetase structure such that separate aspartic acid- and asparagine-accepting tRNA families were established. This scenario predicts that relationships similar to those seen here for the synthetase enzymes may be seen in tRNA structures (29) as well.
Not all related enzymes recognize amino acids that are chemically similar. The cluster of histidine and alanine and the inclusion of a proline-specific enzyme in the cluster of threonine and serine are cases in point although in neither case are these amino acids vastly dissimilar. The placement of ArgRS in the aliphatic cluster, however, is surprising. By analogy to group II, one might have expected a closer relationship to the glutamic acid- and glutamine-specific enzymes. ArgRS is an outlier in this cluster, however, suggesting either that the large size of the amino acid side chain may be the common theme in this group or that the ArgRS cannot be assigned reliably to any subgroup. The data also suggest an early divergence leading to enzymes activating either charged or neutral side chains. Enzymes specific for the four aliphatic amino acids appear to have proliferated subsequent to this separation.
The cluster containing the small subunits of the glycine- and phenylalanine-specific enzymes also stands out as a major example of distinctly different amino acids being charged by similar enzymes (21% identical in this region). It is known that these are the only aminoacyl-tRNA synthetases possessing an α2β2 subunit structure and that the enzymes have a number of other physical and immunological properties in common as well (30). The sequence similarities shown here underscore the close evolutionary relationship between these enzymes and also show clearly that the α2β2 enzymes did not evolve independently of the other aminoacyl-tRNA synthetases.
The sequence for Cec has appeared (31, 32). It clearly belongs to group I, bringing to 10 the number in each group. The cysteine-specific enzyme most closely resembles the methionine-specific enzyme.

We thank Da-Fei Feng for many helpful discussions. G.M.N. gratefully acknowledges a sabbatical supplement award for molecular studies of evolution from the Alfred P. Sloan Foundation. This work was supported by National Institutes of Health Grant GM 34434.

1. Schimmel, P. (1987) Annu. Rev. Biochem. 56, 125-158.
2. Schimmel, P. & Soll, D. (1979) Annu. Rev. Biochem. 48, 601-648.
3. Burbaum, J. J., Starzyk, R. M. & Schimmel, P. (1990) Proteins: Struct. Funct. Genet. 7, 99-111.
4. Doolittle, R. F. (1979) in The Proteins, eds. Neurath, H. & Hill, R. L. (Academic, New York), pp. 1-118.
5. Webster, T. A., Tsai, H., Kula, M., Mackie, G. & Schimmel, P. (1984) Science 226, 1315-1317.
6. Houtondji, C., Dessen, P. & Blanquet, S. (1986) Biochimie 68, 1071-1078.
7. Houtondji, C., Lederer, F., Dessen, P. & Blanquet, S. (1986) Biochemistry 25, 16-21.
8. Jasin, M., Regan, L. & Schimmel, P. (1983) Nature (London) 306, 441-447.
9. Leatherbarrow, A. J., Fersht, A. R. & Winter, G. (1985) Proc. Natl. Acad. Sci. USA 82, 7840-7844.
10. Blow, D., Bhat, T. N., Metcalfe, A., Risler, J. L., Brunie, S. & Zelwer, C. (1983) J. Mol. Biol. 171, 571-576.
11. Schimmel, P. (1991) Trends Biochem. Sci. 16, 1-3.
12. Jacobo-Molina, A., Peterson, R. & Yang, D. C. H. (1989) J. Biol. Chem. 264, 16608-16612.
13. Eriani, G., Delarue, M., Poch, O., Gangloff, J. & Moras, D. (1990) Nature (London) 347, 203-206.
14. Cusack, S., Berthet-Colominas, C., Hartlein, M., Nassar, N. & Leberman, R. (1990) Nature (London) 347, 249-255.
15. Bhat, T. N., Blow, D. M., Brick, P. & Syborg, J. (1982) J. Mol. Biol. 158, 699-709.
16. Zelwer, C., Risler, J. L. & Brunie, S. (1982) J. Mol. Biol. 155, 63-81.
17. Rould, M. A., Perona, J. J., Soll, D. & Steitz, T. A. (1990) Science 246, 1135-1142.
18. Laberge, S., Gagnon, Y., Bordeleau, L. M. & LaPointe, J. (1989) J. Bacteriol. 171, 3926-3932.
19. Mirande, N. & Walker, J. P. (1988) J. Biol. Chem. 263, 18443-18451.
20. Ludmerer, S. W. & Schimmel, P. (1987) J. Biol. Chem. 262, 10801-10806.
21. Feng, D. F. & Doolittle, R. F. (1987) J. Mol. Evol. 25, 351-360.
22. Feng, D. F. & Doolittle, R. F. (1990) Methods Enzymol. 183, 375-387.
23. Doolittle, R. F. & Feng, D. F. (1990) Methods Enzymol. 183, 659-669.
24. Doolittle, R. F. (1987) in Of Urfs and Orfs (University Science Books, Mill Valley, CA), pp. 26-28.
25. Anselme, J. & Hartlein, M. (1989) Gene 84, 481-485.
26. Heck, J. D. & Hatfield, G. W. (1988) J. Biol. Chem. 263, 868-877.
27. Gample, A. & Tzagoloff, A. (1989) Proc. Natl. Acad. Sci. USA 86, 6023-6027.
28. Leveque, F., Plateau, P., Dessen, P. & Blanquet, S. (1990) Nucleic Acids Res. 18, 305-312.
29. Fitch, W. M. & Upper, K. (1987) Cold Spring Harbor Symp. Quant. Biol. 52, 759-767.
30. Nagel, G. M., Johnson, M. S., Rynd, J., Petrella, E. & Weber, B. H. (1988) Arch. Biochem. Biophys. 262, 409-415.
31. Hou, Y.-M., Shiba, K., Mottes, C. & Schimmel, P. (1991) Proc. Natl. Acad. Sci. USA 88, 976-980.
32. Eriani, G., Dirheimer, G. & Gangloff, J. (1991) Nucleic Acids Res. 19, 265-269.