Evolution and relatedness in two aminoacyl-tRNA synthetase families.

Proc. Natl. Acad. Sci. USA Vol. 88, pp. 8121-8125, September 1991 Evolution

GLENN M. NAGEL* AND RUSSELL F. DOOLITTLE
Center for Molecular Genetics, M-034, University of California, San Diego, La Jolla, CA 92093

Contributed by Russell F. Doolittle, June 24, 1991

ABSTRACT Sequence segments of about 140 amino acids in length, each containing a selected consensus region, were used in alignments of the aminoacyl-tRNA synthetases with the aim of discerning their evolutionary relationships. In all cases tested, enzymes specific for the same amino acid from a variety of organisms grouped together, reinforcing the supposition that the aminoacyl-tRNA synthetases are very ancient enzymes that evolved to include the full complement of 20 amino acids long before the divergence leading to prokaryotes and eukaryotes. The enzymes are divided into two mutually exclusive groups that appear to have evolved from independent roots. Group I, for which two sequence segments were analyzed, contains the enzymes specific for glutamic acid, glutamine, tryptophan, tyrosine, valine, leucine, isoleucine, methionine, and arginine. Group II enzymes include those activating threonine, proline, serine, lysine, aspartic acid, asparagine, histidine, alanine, glycine, and phenylalanine. Both groups contain a spectrum of amino acid types, suggesting the possibility that each could have once supported an independent system for protein synthesis. Within each group, enzymes specific for chemically similar amino acids tend to cluster together, indicating that a major theme of synthetase evolution involved the adaptation of binding sites to accommodate related amino acids with subsequent specialization to a single amino acid. In a few cases, however, synthetases activating dissimilar amino acids are grouped together.

Aminoacyl-tRNA synthetases catalyze the esterification or "charging" of a single amino acid to its cognate tRNA; thus, 20 such enzymes, one enzyme specific for each amino acid found in proteins, constitute a minimum set for protein biosynthesis. The evolution and structural relatedness of these enzymes have been a subject of intense interest for many years (for review, see refs. 1-3). Because they appear to participate universally in protein synthesis, the origins of these "activating" enzymes must be very ancient (4), and studies of their divergence may shed significant light on the development of the genetic code and its expression. In addition, the structural basis for nucleic acid-protein recognition and the manner in which enzymes have come to activate only a single amino acid and cognate tRNA while excluding a large number of structurally similar molecules are subjects of great interest.

As the sequences of more and more of the aminoacyl-tRNA synthetases became available, we undertook a study of the interrelationships within the more than 50 reported sequences with the aims of tracing their evolution and of identifying more conserved sequence segments that are presumably essential to their biological function. Because these enzymes catalyze the same overall reaction and utilize a common strategy for chemical activation of their amino acids, aminoacyl-AMP being formed at the expense of ATP, it has long been supposed that all these enzymes had a common ancestral root, even though large differences in polypeptide chain length (303-1104 amino acids) and quaternary structure (α, α2, α4, and α2β2) appeared to argue for more diversity.

Initially, studies of primary sequence similarities yielded only limited regions of clear relatedness. The "HIGH" (His-Ile-Gly-His) consensus or signature sequence was identified early in a small group of enzymes (5), but other enzymes appeared to lack this motif. In addition, the "KMSKS" (Lys-Met-Ser-Lys-Ser) sequence was revealed (6, 7) in another group that overlapped significantly in membership with the HIGH enzymes. The HIGH region has been implicated in amino acid activation, and the KMSKS region has been implicated in "docking" the acceptor stem of tRNA and/or transferring amino acid to the 3' end of tRNA in these enzymes (7-11). More recently a third consensus sequence denoted "GLER" (Gly-Leu-Glu-Arg) was observed (12, 13). Because the enzymes containing this sequence formed a group apparently exclusive of the earlier group, it was hypothesized (13) that two separate evolutionary families of the synthetases exist. The first x-ray crystallographic structure of an enzyme from the GLER group (14), seryl-tRNA synthetase, SerRS,† supported this view, there being virtually no structural similarity between SerRS and the enzymes of the HIGH group: MetRS (15), TyrRS (16), and GlnRS (17).

Our own studies, which were in progress when these reports appeared, are in agreement with the existence of two separate groups, designated here simply as group I and group II. The work described in this paper identifies sequence segments in members of each group and applies standard computer methods to align the sequences and to discern their evolutionary relationships.

MATERIALS AND METHODS

Amino Acid Sequences. The amino acid sequences used in this study were taken from the National Biomedical Research Foundation sequence collection (Version 21), were translated from DNA sequences in GenBank (Release 61) or EMBL (Release 18 on compact disc, April 1989), or were entered by us directly from the original literature. The latter included GluRS from Rhizobium meliloti (18), LysRS from yeast (19), and yeast GlnRS (20).

Programs. Sequence alignments were made by the progressive method (21). Phylogenetic trees were constructed from multiple sequence alignments by a matrix procedure (22) and by a nearest-neighbor character analysis (23). The program for the latter method is called PAPA (parsimony after progressive alignment). Best trees were defined as those with the lowest percent standard deviations and no negative branch lengths. The most similar portions of various pairs of sequences were identified with a program called INSPECT (24).

*Permanent address: Department of Chemistry and Biochemistry, California State University, Fullerton, CA 92634.

†Abbreviations: Aminoacyl-tRNA synthetases specific for a given amino acid have been abbreviated by the convention employing the three-letter designation for the amino acid followed by RS; e.g., SerRS denotes seryl-tRNA synthetase. Where it was necessary to include a designation of the biological source of the enzyme, the abbreviation was modified to include the single-letter symbol for the amino acid along with ec for Escherichia coli, bs for Bacillus stearothermophilus, rm for Rhizobium meliloti, and yc (yeast) for Saccharomyces cerevisiae; e.g., E. coli seryl-tRNA synthetase is Sec.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

RESULTS

The primary sequence segments we have used in our analysis are depicted diagrammatically in Fig. 1. In some instances, it was necessary to delete strings of 5-43 amino acids to obtain correct alignments. Group I enzymes contain two conserved sequences that served as identifiers. The N-terminal segment includes the HIGH sequence and the C-terminal region is characterized by the KMSKS sequence. The large enzymes specific for valine, isoleucine, and leucine also include a central region. MetRS, the next most closely related enzyme, contains a shortened version of this region, whereas in the smallest enzymes (e.g., TrpRS and TyrRS), it is absent altogether. The size of the smaller enzymes, therefore, limits the sequence length that could be utilized in alignments of group I.

In this analysis, our attention has focused primarily on the enzymes of bacterial origin and from yeast cytoplasm, excluding for the moment sequences from organelles and from higher eukaryotes. Restricting the data set in this way simplified the analysis not only by reducing the number of sequences but also by eliminating some sequences, particularly those of mitochondrial origin, which were more variant and sometimes contained strings with little or no sequence similarity to related enzymes. In every case, however, enzymes specific for the same amino acid could be readily aligned with one another after being suitably trimmed, regardless of source. As a general rule, these enzymes clustered together and were more similar to each other than they were to enzymes specific for any other amino acid.

A complete sequence alignment for the N-terminal segments of 16 group I enzymes, specific for nine amino acids, is shown in Fig. 2. In applying the progressive alignment procedure (22), care was taken to optimize the alignment scores by exploring alternate orders of sequence input into the DFALIGN program and by judicious trimming of regions to be aligned. The output from the SCORE and DFALIGN programs was particularly useful in this regard and, in the final result after several iterations, the input and output order of sequences to and from DFALIGN was the same. For enzymes specific for the same amino acid, the percent identity varied from 64% (Yec vs. Ybs) to 28% (Iec vs. Iyc). For enzymes specific for different amino acids, values ranged from 29% (Vyc vs. Lec) to 7% (Wbs vs. Mec) identical.

Because these sequences are quite divergent, we applied a number of internal checks for consistency. First, we calculated distance scores by two independent approaches, matrix and PAPA, with the aim of finding agreement with regard to both the branching order and relative branch lengths. As an additional means of tracing the evolution of these proteins, we applied an identical and independent analysis to the C-terminal sequence segments (Fig. 3). The statistical properties (percent identities and distance scores) of the aligned sequences were remarkably like those of the N-terminal region. (See the two trees obtained by the PAPA analysis in Fig. 5 A and B for the N-terminal and C-terminal segments, respectively.) The main difference between the two trees was the relative branching of leucine and isoleucine.

The same approach was used for group II (14 enzymes specific for 10 amino acids) except that only a single sequence segment was analyzed. The range of sequence identities in the aligned regions (Fig. 4) is very similar to that found for the group I enzymes. For enzymes specific for the same amino acid, the observed range was 50% (Tec vs. Tyc) to 31% (Sec vs. Syc). For enzymes charging different amino acids, the range was 33% (Nec vs. Dyc) to 6% (Aec vs. Syc and Nec vs. Sec). The matrix and PAPA methods for calculating evolutionary distances were in agreement (Fig. 5C). Both methods yielded the same branching order with no negative branches.

[Fig. 1 diagram: panel A, Group I; panel B, Group II.]

FIG. 1. Aminoacyl-tRNA synthetase sequence segments used in alignments of group I (A) and group II (B) enzymes. Segments are shown as bold lines with the N- and C-terminal limits of each denoted above. Enzymes are abbreviated using the single-letter designation for the specific amino acid activated followed by two lowercase letters for the organism. For the phenylalanine- and glycine-specific enzymes, "S" indicates that the sequences of the small subunits were used. Only bacterial and yeast cytoplasmic sequences were used in this analysis. For Erm, amino acids 74-81 were removed prior to alignment of the N-terminal segments of group I. Similarly, the following residues were removed from the C-terminal segments of group I enzymes: from Yec, residues 251-262; from Ybs, residues 247-258; from Vec, residues 573-609; from Vyc, residues 722-758; from Iyc, residues 666-671; and from Lec, residues 573-615.
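The trimming described in the legend to Fig. 1 (taking a segment between stated N- and C-terminal limits and removing an internal string of residues before alignment) can be illustrated with a minimal sketch. The function name `extract_segment`, the toy sequence, and the choice to mark each removed string with a single `$` (as the alignment figures do) are assumptions made for illustration; this is not the authors' program.

```python
def extract_segment(sequence, start, end, deletions=()):
    """Return residues start..end (1-based, inclusive), skipping deleted ranges.

    Each deleted range is replaced by a single '$' marker, mirroring the
    symbol used in the alignment figures. Hypothetical helper.
    """
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError("segment limits fall outside the sequence")
    pieces = []
    position = start  # next 1-based residue still to be copied
    for del_start, del_end in sorted(deletions):
        if not (start <= del_start <= del_end <= end):
            raise ValueError("deleted range must lie inside the segment")
        if del_start < position:
            raise ValueError("deleted ranges must not overlap")
        pieces.append(sequence[position - 1:del_start - 1])
        pieces.append("$")
        position = del_end + 1
    pieces.append(sequence[position - 1:end])
    return "".join(pieces)


# Toy 30-residue sequence (hypothetical, not a real synthetase).
toy = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"
trimmed = extract_segment(toy, 3, 20, deletions=[(8, 11)])
print(trimmed)
```

Residue numbering is 1-based and inclusive here because the legend quotes residue ranges in that form.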

[Fig. 2 alignment: rows for Yec, Ybs, Wec, Wbs, Eec, Erm, Qec, Qyc, Vec, Vyc, Iec, Iyc, Lec, Mec, Myc, and Rec.]

FIG. 2. Multiple sequence alignment of the N-terminal sequence segments of group I enzymes. Precise locations of each sequence segment are shown in Fig. 1. The $ symbol corresponds to the deleted string described in the legend to Fig. 1.
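The percent identities quoted in Results for pairs of aligned segments can be illustrated with a short sketch. The paper does not state how gapped positions were counted, so the rule used here (positions where either row carries a gap or the `$` deletion marker are skipped, and the denominator is the number of remaining positions) is an assumption, as are the function name and the toy rows.

```python
def percent_identity(row_a, row_b, gap_chars="-. $"):
    """Percent of compared positions at which two aligned rows agree.

    Positions where either row has a gap or deletion marker are skipped
    (an assumed convention; the source does not specify one).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = identical = 0
    for a, b in zip(row_a, row_b):
        if a in gap_chars or b in gap_chars:
            continue
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * identical / compared


# Two hypothetical aligned rows of equal length.
row_1 = "ACDEF-GHIKL"
row_2 = "ACDQFWGYIRV"
identity = percent_identity(row_1, row_2)
print(round(identity, 1))
```

Other conventions (for example, dividing by the length of the shorter sequence) would give somewhat different values, so figures from this sketch are not directly comparable with those in the text.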

[Fig. 3 alignment: rows for Yec, Ybs, Wec, Wbs, Eec, Erm, Qec, Qyc, Vec, Vyc, Iec, Iyc, Lec, Mec, Myc, and Rec.]

FIG. 3. Multiple sequence alignment of the C-terminal sequence segments of group I enzymes. Precise locations of each segment are shown in Fig. 1. The $ symbols correspond to the deleted strings described in the legend to Fig. 1.

[Fig. 4 alignment: rows for Tec, Tyc, Pec, Sec, Syc, Kec, Kyc, Dyc, Nec, Hec, Hyc, FSec, GSec, and Aec.]

FIG. 4. Multiple sequence alignment of group II enzymes. Precise locations of each sequence segment are shown in Fig. 1.

DISCUSSION

In view of the observation that the use of 20 aminoacyl-tRNA synthetases appears to be universal, it is logical to conclude that the basic evolution of this part of the protein synthetic apparatus was completed and the full complement of amino acids utilized prior to the time when the major groups of present-day organisms diverged. This conclusion is supported by our observation that synthetases specific for the same amino acid, but from diverse organisms, are more similar to each other than they are to enzymes specific for any other amino acid. Thus, the data place the branching of the prokaryotic and eukaryotic lineages later in time than the branching and specialization of enzymes with regard to particular amino acids.

In working with divergent sequences such as those examined here, one is concerned about finding the "right" alignments. Although we cannot be certain of having succeeded in this quest, it can be stated that many things are right about the alignments and the resulting trees presented herein. (i) All alignments were based on recognizable consensus sequences. (ii) Care was taken to employ sequence segments sufficiently large to yield statistically significant data but short enough to make visual inspection and analysis possible. This allowed us, in some cases, to locate homologous regions that initially escaped detection by computer analysis as a result of radical insertions or deletions. (iii) Because these alignments are sensitive to the relative order in which sequences are submitted to the progressive alignment method, many alternative combinations were tried and the process was refined to cluster closely related sequences. (iv) Each set of aligned sequences was analyzed by two independent methods to yield trees, and we sought alignments that yielded, ideally, the same evolutionary relationships when analyzed by both methods. For both groups I and II, the alignments reported here satisfied, with the few exceptions noted, these criteria of consistency in evolutionary order. Since there were two sequence segments used for the group I enzymes, we imposed the additional criterion that the four trees that could be drawn for each pair of alignments (completed with the same order of sequence input) be the same.

The enzymes are divided into two groups that appear to be exclusive. Although there is clear homology within each group, all attempts to uncover significant sequence relationships between members of the two groups have proven unsuccessful. Similar conclusions have been reported by others (13). Thus, the data argue that the two groups have undergone a kind of convergent evolution with regard to function in forming the present-day set of enzymes. It is interesting to note that each group contains a rather full complement of chemically diverse amino acids, including acidic, basic, hydrophobic, and hydrophilic representatives. This finding is consistent with each group having arisen independently in two archaic protein synthetic apparatuses, each using a more restricted set of amino acids. Such a situation could have existed in a single organism or in two distinct biological environments that later merged during the course of evolution. If the two did not coevolve, one group must have come first with the other being recruited to supplement the pool of available amino acids. The data provide no reason to believe that one group is the more ancient. Distance scores and percent identity cover a remarkably similar range within each group. Although proteins certainly may evolve at different rates, these data suggest a coordinate development.
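Criterion (iv), agreement in branching order between trees obtained by two methods, can be checked mechanically. The sketch below is one simple way to do so and is not taken from the paper: it assumes trees are written as rooted nested tuples of leaf names, ignores branch lengths, and compares the sets of leaf groups defined by the internal nodes. The function names and the four-leaf example trees are hypothetical.

```python
def clades(tree):
    """Return the leaf groups defined by the internal nodes of a rooted tree.

    A tree is a nested tuple whose leaves are strings,
    e.g. (("A", "B"), "C"). Child order within a node does not matter.
    """
    found = set()

    def leaves_under(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves_under(child) for child in node))
        found.add(members)
        return members

    leaves_under(tree)
    return found


def same_branching_order(tree_a, tree_b):
    """True when both trees define exactly the same leaf groups."""
    return clades(tree_a) == clades(tree_b)


def differing_clades(tree_a, tree_b):
    """Leaf groups present in one tree but not the other."""
    return clades(tree_a) ^ clades(tree_b)


# Hypothetical four-leaf trees; labels are placeholders, not results.
tree_one = ((("A", "B"), "C"), "D")
tree_two = (("C", ("B", "A")), "D")   # same topology, children reordered
tree_three = (("A", "C"), ("B", "D"))  # different topology

print(same_branching_order(tree_one, tree_two))
print(same_branching_order(tree_one, tree_three))
```

`differing_clades` lists the groupings on which two trees disagree, which is the kind of information needed to localize a difference such as the leucine and isoleucine branching noted in Results.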

[Fig. 5 trees: panel A, Group I (HIGH); panel B, Group I (KMSKS); panel C, Group II (GLER).]

FIG. 5. Trees showing evolutionary relationships among group I (A and B) and group II (C) enzymes. Branching order and distance scores are based on the alignment shown in Fig. 2 for A, Fig. 3 for B, and Fig. 4 for C. Distance scores shown were derived from the parsimony-based PAPA program (23).
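Materials and Methods defines the best trees as those with the lowest percent standard deviation and no negative branch lengths. The paper does not give the formula, so the sketch below assumes a Fitch-Margoliash-style definition: 100 times the root mean square, over all pairs of sequences, of (observed distance minus tree path length) divided by observed distance. The edge-list tree representation, the function names, and the three-leaf example are illustrative assumptions only.

```python
import math
from itertools import combinations


def tree_distances(edges, leaves):
    """Path-length distance between every pair of leaves.

    `edges` is a list of (node_u, node_v, branch_length) tuples describing
    an unrooted tree; `leaves` lists the leaf names.
    """
    neighbors = {}
    for u, v, length in edges:
        neighbors.setdefault(u, []).append((v, length))
        neighbors.setdefault(v, []).append((u, length))

    def distances_from(source):
        seen = {source: 0.0}
        stack = [source]
        while stack:
            node = stack.pop()
            for other, length in neighbors[node]:
                if other not in seen:
                    seen[other] = seen[node] + length
                    stack.append(other)
        return seen

    from_leaf = {leaf: distances_from(leaf) for leaf in leaves}
    return {(a, b): from_leaf[a][b] for a, b in combinations(leaves, 2)}


def percent_sd(observed, edges, leaves):
    """Assumed definition: 100 * sqrt(mean(((d - p) / d) ** 2)) over all pairs."""
    fitted = tree_distances(edges, leaves)
    terms = [((observed[pair] - fitted[pair]) / observed[pair]) ** 2
             for pair in fitted]
    return 100.0 * math.sqrt(sum(terms) / len(terms))


def has_negative_branch(edges):
    """True when any branch length in the tree is below zero."""
    return any(length < 0 for _, _, length in edges)


# Hypothetical three-leaf tree joined at internal node X.
leaves = ["A", "B", "C"]
edges = [("A", "X", 1.0), ("B", "X", 2.0), ("C", "X", 3.0)]
observed = {("A", "B"): 3.0, ("A", "C"): 4.0, ("B", "C"): 4.0}

print(round(percent_sd(observed, edges, leaves), 2))
print(has_negative_branch(edges))
```

Under this assumed definition, candidate trees built from different input orders could be ranked by `percent_sd` after discarding any for which `has_negative_branch` is true.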

Synthetase pairs specific for the acidic amino acids and their amides, glutamic acid/glutamine and aspartic acid/asparagine (25), cluster closely despite the fact that each pair is found in a separate group. The data indicate that the derivation of amidated amino acids from their corresponding acids (or vice versa) is a relatively recent addition to protein synthesis. Similarly, the data point to the more recent radiation of the aliphatic amino acids (valine, leucine, isoleucine, and methionine), a cluster first identified by Heck and Hatfield (26). The synthetases specific for the aliphatic amino acids are more closely related in terms of percent identity to one another than members of any other cluster, save aspartic acid and asparagine.

In most cases, closely related enzymes recognize amino acids that are chemically similar to one another. Examples already noted include the two acid amide clusters and the aliphatic cluster. In addition, the tryptophan and tyrosine enzymes, although they apparently were among the first to diverge, are closely related, as are the enzymes for the hydroxylated amino acids serine and threonine. The close link between the charged and/or polar amino acids aspartic acid, asparagine, and lysine (27, 28) is an additional example, as is the relationship between the enzymes recognizing the small amino acids glycine and alanine. These data argue that radiation occurred by adapting binding sites to chemically similar amino acids.

It is possible that, as particular enzymes evolved, an early form may have charged initially a tRNA or a group of tRNAs with two or more amino acids. As long as amino acids were sufficiently similar in chemical properties, this ambiguity may have been tolerable during early stages of life. As protein synthesis became more complex and the requirements for particular amino acids became more stringent, however, one can envision evolutionary pressure for the aminoacyl-tRNA synthetases to select particular tRNA-amino acid pairs. Thus, a primordial AspRS gene may have undergone duplication and mutation such that there existed both an AspRS and an AsnRS charging the same set of tRNA molecules with either amino acid. Refinements in specificity may have led to a parallel evolution in cognate tRNA and/or synthetase structure such that separate aspartic acid- and asparagine-accepting tRNA families were established. This scenario predicts that relationships similar to those seen here for the synthetase enzymes may be seen in tRNA structures (29) as well.

Not all related enzymes recognize amino acids that are chemically similar. The cluster of histidine and alanine and the inclusion of a proline-specific enzyme in the cluster of threonine and serine are cases in point, although in neither case are these amino acids vastly dissimilar. The placement of ArgRS in the aliphatic cluster, however, is surprising. By analogy to group II, one might have expected a closer relationship to the glutamic acid- and glutamine-specific enzymes. ArgRS is an outlier in this cluster, however, suggesting either that the large size of the amino acid side chain may be the common theme in this group or that the ArgRS cannot be assigned reliably to any subgroup. The data also suggest an early divergence leading to enzymes activating either charged or neutral side chains. Enzymes specific for the four aliphatic amino acids appear to have proliferated subsequent to this separation.

The cluster containing the small subunits of the glycine- and phenylalanine-specific enzymes also stands out as a major example of distinctly different amino acids being charged by similar enzymes (21% identical in this region). It is known that these are the only aminoacyl-tRNA synthetases possessing an α2β2 subunit structure and that the enzymes have a number of other physical and immunological properties in common as well (30). The sequence similarities shown here underscore the close evolutionary relationship between these enzymes and also show clearly that the α2β2 enzymes did not evolve independently of the other aminoacyl-tRNA synthetases.

The sequence for Cec has appeared (31, 32). It clearly belongs to group I, bringing to 10 the number in each group. The cysteine-specific enzyme most closely resembles the methionine-specific enzyme.

We thank Da-Fei Feng for many helpful discussions. G.M.N. gratefully acknowledges a sabbatical supplement award for molecular studies of evolution from the Alfred P. Sloan Foundation. This work was supported by National Institutes of Health Grant GM 34434.

1. Schimmel, P. (1987) Annu. Rev. Biochem. 56, 125-158.
2. Schimmel, P. & Soll, D. (1979) Annu. Rev. Biochem. 48, 601-648.
3. Burbaum, J. J., Starzyk, R. M. & Schimmel, P. (1990) Proteins: Struct. Funct. Genet. 7, 99-111.
4. Doolittle, R. F. (1979) in The Proteins, eds. Neurath, H. & Hill, R. L. (Academic, New York), pp. 1-118.
5. Webster, T. A., Tsai, H., Kula, M., Mackie, G. & Schimmel, P. (1984) Science 226, 1315-1317.
6. Hountondji, C., Dessen, P. & Blanquet, S. (1986) Biochimie 68, 1071-1078.
7. Hountondji, C., Lederer, F., Dessen, P. & Blanquet, S. (1986) Biochemistry 25, 16-21.
8. Jasin, M., Regan, L. & Schimmel, P. (1983) Nature (London) 306, 441-447.
9. Leatherbarrow, R. J., Fersht, A. R. & Winter, G. (1985) Proc. Natl. Acad. Sci. USA 82, 7840-7844.
10. Blow, D., Bhat, T. N., Metcalfe, A., Risler, J. L., Brunie, S. & Zelwer, C. (1983) J. Mol. Biol. 171, 571-576.
11. Schimmel, P. (1991) Trends Biochem. Sci. 16, 1-3.
12. Jacobo-Molina, A., Peterson, R. & Yang, D. C. H. (1989) J. Biol. Chem. 264, 16608-16612.
13. Eriani, G., Delarue, M., Poch, O., Gangloff, J. & Moras, D. (1990) Nature (London) 347, 203-206.
14. Cusack, S., Berthet-Colominas, C., Hartlein, M., Nassar, N. & Leberman, R. (1990) Nature (London) 347, 249-255.
15. Bhat, T. N., Blow, D. M., Brick, P. & Nyborg, J. (1982) J. Mol. Biol. 158, 699-709.
16. Zelwer, C., Risler, J. L. & Brunie, S. (1982) J. Mol. Biol. 155, 63-81.
17. Rould, M. A., Perona, J. J., Soll, D. & Steitz, T. A. (1990) Science 246, 1135-1142.
18. Laberge, S., Gagnon, Y., Bordeleau, L. M. & LaPointe, J. (1989) J. Bacteriol. 171, 3926-3932.
19. Mirande, M. & Waller, J. P. (1988) J. Biol. Chem. 263, 18443-18451.
20. Ludmerer, S. W. & Schimmel, P. (1987) J. Biol. Chem. 262, 10801-10806.
21. Feng, D. F. & Doolittle, R. F. (1987) J. Mol. Evol. 25, 351-360.
22. Feng, D. F. & Doolittle, R. F. (1990) Methods Enzymol. 183, 375-387.
23. Doolittle, R. F. & Feng, D. F. (1990) Methods Enzymol. 183, 659-669.
24. Doolittle, R. F. (1987) in Of Urfs and Orfs (University Science Books, Mill Valley, CA), pp. 26-28.
25. Anselme, J. & Hartlein, M. (1989) Gene 84, 481-485.
26. Heck, J. D. & Hatfield, G. W. (1988) J. Biol. Chem. 263, 868-877.
27. Gampel, A. & Tzagoloff, A. (1989) Proc. Natl. Acad. Sci. USA 86, 6023-6027.
28. Leveque, F., Plateau, P., Dessen, P. & Blanquet, S. (1990) Nucleic Acids Res. 18, 305-312.
29. Fitch, W. M. & Upper, K. (1987) Cold Spring Harbor Symp. Quant. Biol. 52, 759-767.
30. Nagel, G. M., Johnson, M. S., Rynd, J., Petrella, E. & Weber, B. H. (1988) Arch. Biochem. Biophys. 262, 409-415.
31. Hou, Y.-M., Shiba, K., Mottes, C. & Schimmel, P. (1991) Proc. Natl. Acad. Sci. USA 88, 976-980.
32. Eriani, G., Dirheimer, G. & Gangloff, J. (1991) Nucleic Acids Res. 19, 265-269.
