J Mol Evol (1992) 34:416-448

Journal of Molecular Evolution © Springer-Verlag New York Inc. 1992

Evolution of EF-Hand Calcium-Modulated Proteins. II. Domains of Several Subfamilies Have Diverse Evolutionary Histories

Susumu Nakayama,1 Nancy D. Moncrief,2 and Robert H. Kretsinger1

1 Department of Biology, University of Virginia, Charlottesville, VA 22901, USA
2 Department of Mammalogy, Virginia Museum of Natural History, Martinsville, VA 24112, USA

Summary.

In the first report in this series we described the relationships and evolution of 152 individual proteins of the EF-hand subfamilies. Here we add 66 proteins and define eight new subfamilies (CDC, TPNV, CLNB, LPS, DGK, 1F8, VIS, TCBP) and seven new unique proteins (CAL, SQUD, CDPK, EFH5, TPP, LAV, CRGP), which we assume represent new subfamilies. The main focus of this study is the classification of individual EF-hand domains. Five subfamilies--calmodulin, troponin C, essential light chain, regulatory light chain, CDC31/caltractin--and three uniques--cal1, squidulin, and calcium-dependent protein kinase--are congruent in that all evolved from a common four-domain precursor. In contrast, calpain and sarcoplasmic calcium-binding protein (SARC) each evolved from its own one-domain precursor. The remaining 19 subfamilies and uniques appear to have evolved by translocation and splicing of genes encoding the EF-hand domains that were precursors to the congruent eight and to calpain and to SARC. The rates of evolution of the EF-hand domains are slower following formation of the subfamilies and establishment of their functions. Subfamilies are not readily classified by patterns of calcium coordination, interdomain linker stability, and glycine and proline distribution. There are many homoplasies, indicating that similar variants of the EF-hand evolved by independent pathways.

Key words:

EF-hand -- Calcium binding protein -- Gene duplication -- Congruence -- Domain transposition -- Calmodulin -- Troponin C -- Light chains of myosin -- Parvalbumin -- S100 -- Calpain

Offprint requests to: R. H. Kretsinger

Introduction

The first report in this series (Moncrief et al. 1990) summarized the structure of the EF-hand domain, the alignment of amino acid sequences, and the format of the data base of amino acid sequences of 152 proteins. It described the methods for computing minimum mutation distances and for constructing dendrograms. The relationships among 12 subfamilies and eight unique proteins were presented in a general dendrogram. Essential characteristics of each subfamily were reviewed in the context of the dendrogram for that subfamily.

The incorporation of 68 new sequences (see Appendix 1) and correction of several errors, or use of completed sequences where only fragments were previously available, has resulted in the definition of eight new subfamilies and seven more uniques (Table 1). Inevitably, as this report is being prepared and reviewed, additional sequences will become available. They will be incorporated into the data base, which is available on request, and indicated in footnotes as editing permits. We will minimize repetition of information presented in Moncrief et al. (1990); however, for the sake of clarity we briefly summarize some of the rationale, procedures, and characteristics of the proteins.

The primary goal of this study is to compute dendrograms using individual domains. The several domains of some subfamilies have distinct evolutionary histories; therefore, we have classified the

Table 1. Subfamilies and unique EF-hand homolog proteins

    Abbreviated
    name         Protein

    CAM          Calmodulin
    TNC          Troponin C
    ELC          Essential light chain
    RLC          Regulatory light chain myosin
u   CAL          Cal1 (Caenorhabditis)
u   SQUD         Squidulin (Loligo)
    CDC          CDC31 and caltractin
u   CDPK         Calcium-dependent protein kinase
    CALP         Calpain and sorcin
    SARC         Sarcoplasmic calcium-binding protein
    ACTN         α-actinin
u   EFH5         EFH5 (Trypanosoma)
    PARV         Parvalbumin
    CLBN         Calbindin 28 kDa
    TPNV         Troponin, nonvertebrate
    CLNB         Calcineurin B
    LPS          Lytechinus pictus
    AEQ          Aequorin- and luciferin-binding proteins
    DGK          Diacylglycerol kinase
u   TPP          p24 Thyroid protein (Canis)
    1F8          1F8 and TB17 (Trypanosoma)
u   CVP          Calcium vector protein
    S100         S100, intestinal calcium-binding protein
    SPEC         Strongylocentrotus calcium-binding protein
u   LAV          LAV1 (Physarum)
    VIS          Visinin and recoverin
u   CMSE         Calcium-binding protein (Streptomyces)
    TCBP         Tetrahymena calcium-binding protein
u   CRGP         CAM-related gene product

[Columns "Calcium binding by domain" (domains 1, 2, 3, 4): the per-domain +, -, ±, and ? entries are not recoverable from the extracted text.]

We identify 20 subfamilies of EF-hand homologs and 9 possible subfamilies for which there is only one published member, so-called uniques, designated by u at the left margin. The first column lists the abbreviation of the subfamily or of the unique. The second column indicates the full name and, space permitting, the genus of the organism from which the unique protein came. Domains are numbered sequentially N to C. A domain is inferred to bind calcium if its loop contains neither insertions nor deletions relative to the canonical domain and if the five vertices--X, Y, Z, -X, and -Z--all have, in any combination, Glu, Gln, Asp, Asn, Ser, Thr, Cys, or Gly. Inferred calcium binding is indicated by +; no binding by -; and both binding and not binding, inferred in different representatives, by ±. The ? in domain 4 of TPP indicates a 17-residue insertion at position -Z. The S100 subfamily contains proteins of various alphanumeric designations; the (±) indicates that this loop is unique in (potentially) coordinating calcium with four carbonyl oxygen atoms and Glu at vertex -Z. The .... before the EF-hand domains of CDPK, CALP, DGK, and LAV indicates that the catalytic domain(s) of the protein precedes the four or two EF-hand domains; .... after ACTN indicates that those two EF-hand domains are at the N-terminus. Space limits preclude genus designation for CDPK (Glycine), DGK (Sus), and CVP (Branchiostoma). TPNV from Astacus, Tachypleus, and Balanus have been called troponin, as have the TNCs. LPS refers to Lytechinus pictus SPEC-resembling protein. It contains eight tandem EF-hand domains. Correspondingly, the fifth and sixth domains of CLBN are indicated below its first and second domains. SPEC is Strongylocentrotus purpuratus ectodermal calcium-binding protein. CMSE from Streptomyces erythraeus is the only known EF-hand protein from a prokaryote. CRGP is from Homo.

proteins and evaluated their evolution in terms of their constituent domains. This is a relatively novel approach to the study of protein evolution and to the best of our knowledge has not been extensively applied previously.

It has proven difficult to estimate rates of evolution; however, we do infer that these rates have slowed markedly following the formation of each subfamily of proteins and refinement of its function. As new subfamilies arose, evolution was quite rapid and can be likened to the rapid adaptive radiations that occur when organisms encounter a new environment.

Patterns of calcium coordination and of domain linkage as well as occurrence of proline and of glycine frequently show homoplasy. Distantly related EF-hands have often acquired similar characteristics by convergent evolution. The third report in this series will use these dendrograms based on individual domains as a reference for evaluation of intron distribution and evolution.


Methods

Data Base. Our data base consists of 152 proteins analyzed earlier (Moncrief et al. 1990) as well as 66 published subsequently. The Lytechinus calcium-binding protein, LPS, is eight domains long and evolved by a recent gene duplication and splicing. Domains 1 and 5, 2 and 6, 3 and 7, and 4 and 8 closely resemble one another. Both LPS proteins are entered as two sequences of four domains each (1-4 and 5-8) for computational purposes; hence, the data base consists of 218 proteins or 220 sequences. Several amino acid sequences of vertebrate calmodulin are identical; they are all considered to be one of the 220 sequences.

Between the time of the computation of the dendrograms and the submission and subsequent editing of this report several additional sequences have been published. They are indicated in the text or footnotes and have been incorporated, as appropriate, into text and figures without repeating the very demanding computations of relationships among 804 domains of 218 proteins. The incorporation of additional sequences sometimes necessitated realignment of sequences already in the data base, and several corrections were made to the original data base of 153 sequences. Appendix 2 lists the amino acid sequences added to this earlier data base as well as those that have been corrected or realigned.

The data base will continue to be updated. We are pleased to honor electronic mail ([email protected]) requests for the current data base. It has also been deposited with the National Biomedical Research Foundation, Georgetown University, 3900 Reservoir Road NW, Washington, DC 20007, (202) 687-2121, [email protected]. We appreciate corrections and new sequences.

We anticipate some of our results and interpretations in order to clarify and simplify the description of methods. The computation and interpretation of these massive dendrograms is inevitably an iterative process; we spare the reader some false starts and numerous iterations.
One of our results was the classification of these 218 proteins into 20 distinct subfamilies and nine unique proteins, assumed to be the first representatives of new subfamilies. The numbers of domains, calcium affinities, and abbreviations for these 29 total subfamilies are summarized in Table 1.

Identification of Homologs, Distinction from Analogs. The EF-hand domains known to bind calcium are readily recognized by the mnemonic:

         α-helix E               calcium-binding loop               α-helix F
      E  n  *  *  n  n  *  *  n  X  *  Y  *  Z  G  #  I -X  *  * -Z  n  *  *  n  n  *  *  n
      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29

E is Glu; n, hydrophobic; *, variable; G, Gly; #, any residue at -Y; I, Ile, Leu, or Val; X, Y, Z, -X, and -Z represent the vertices of an octahedron. The calcium-coordinating residues at these vertices may be Glu, Gln, Asp, Asn, Ser, Thr, Cys, or Gly; a water molecule coordinates calcium in place of the side chain missing in Gly. Those domains that do not bind calcium are more difficult to recognize; we have developed more sensitive algorithms for identifying homologs and distinguishing them from close analogs (Kretsinger et al. 1991). Even so, the mnemonic is valuable for illustration and manual searches of sequences.
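The mnemonic and the calcium-binding rule of Table 1 lend themselves to a small mechanical check. The sketch below is illustrative only and is not the more sensitive algorithm of Kretsinger et al. (1991). It assumes an already aligned 29-residue domain without insertions or deletions, and it assumes a particular set of "hydrophobic" residues (`HYDROPHOBIC`), which the text does not specify. The function names `mismatches` and `binds_calcium` and the example string are hypothetical.

```python
# Residues allowed at the five vertices X, Y, Z, -X, -Z (from the text).
VERTEX_RESIDUES = set("EQDNSTCG")
# ASSUMPTION: the text says only "hydrophobic"; this particular set is a guess.
HYDROPHOBIC = set("AVLIMFWY")
# "I" in the mnemonic stands for Ile, Leu, or Val.
ILE_LEU_VAL = set("ILV")
VERTICES = ("X", "Y", "Z", "-X", "-Z")

# The 29 positions of the mnemonic, one token per residue.
PATTERN = "E n * * n n * * n X * Y * Z G # I -X * * -Z n * * n n * * n".split()


def allowed(token, residue):
    """Return True if `residue` is compatible with one mnemonic token."""
    if token in ("*", "#"):          # variable position, or any residue at -Y
        return True
    if token == "n":
        return residue in HYDROPHOBIC
    if token == "I":
        return residue in ILE_LEU_VAL
    if token in VERTICES:
        return residue in VERTEX_RESIDUES
    return residue == token           # literal Glu (E) or Gly (G)


def mismatches(domain):
    """1-based positions at which an aligned 29-residue domain departs from the mnemonic."""
    if len(domain) != len(PATTERN):
        raise ValueError("expected an aligned 29-residue domain")
    return [i + 1 for i, (tok, res) in enumerate(zip(PATTERN, domain))
            if not allowed(tok, res)]


def binds_calcium(domain):
    """Table 1 rule: all five vertices carry Glu, Gln, Asp, Asn, Ser, Thr, Cys, or Gly.

    The 'no insertions or deletions' condition is assumed to hold because the
    input is required to be exactly 29 aligned residues.
    """
    if len(domain) != len(PATTERN):
        raise ValueError("expected an aligned 29-residue domain")
    return all(res in VERTEX_RESIDUES
               for tok, res in zip(PATTERN, domain) if tok in VERTICES)


# Illustrative 29-residue string (hypothetical input, calmodulin-like).
example = "EFKEAFSLFDKDGDGTITTKELGTVMRSL"
print(mismatches(example), binds_calcium(example))
```

Replacing the residue at vertex X (position 10) with Lys, for example, makes `binds_calcium` return `False` and `mismatches` report position 10.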

Computation of Distance Matrices and Dendrograms. The major goal of this study was to determine the relationships among the 804 individual EF-hand domains in our data base. All programs used were described in detail in Moncrief et al. (1990). Each domain of each protein was treated as an independent OTU (operational taxonomic unit). Program MMD was used to construct a pairwise distance matrix for the 804 EF-hand domains; then programs UPGMA and FTE were used to construct input trees for SWAP1. Because 804 OTUs are too many to analyze at once, we proceeded in a stepwise fashion, establishing relationships among a large portion of the entire data set, then analyzing subsets of the remaining domains in relation to that large portion, alternately using SWAP1 and SWAP8. The results of each step determined the composition of subsets that were examined in subsequent steps.

We generated several low-scoring dendrograms of 804 individual domains by applying SWAP1 and SWAP8 to various starting conformations, generated as described above. We could hardly assume that any of these dendrograms had the global lowest score of minimum mutation distances, nor was that our goal. However, all of these dendrograms have four characteristics that guided subsequent computations. First, the domains 1 of any given subfamily cluster together, as do the domains 2, etc. Second, the arrangement of domains 1 within the domain 1 subfamily cluster is similar to the arrangement of domains 2 within the domain 2 subfamily cluster from that same subfamily, etc. Further, each of the domain subfamily clusters is similar to the dendrogram based on the entire sequence for that subfamily. The two to four dendrograms based on individual domains within each of the 20 subfamilies are congruent within themselves. This we propose to be one of the defining characteristics of a subfamily. Third, the four SARC domain subfamilies cluster together, as do the four CALP domain subfamilies, both distant from CAM, TNC, ELC, and RLC. The fourth characteristic is that the domains 1 subfamily clusters for CAM, for TNC, for ELC, and for RLC come together, as do the domains 2 subfamily clusters, and 3 and 4. The four subfamilies--CAM, TNC, ELC, and RLC--are fully congruent with one another. In contrast, CALP and SARC are both fully noncongruent with other subfamilies.
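A pairwise matrix of minimum mutation distances can be illustrated with a short sketch. It is a stand-in under stated assumptions, not program MMD itself, which is described in Moncrief et al. (1990) and is not reproduced here. The sketch assumes that the distance between two amino acids is the smallest number of base changes separating any of their codons in the standard genetic code, that distances are summed over aligned positions, and that the sequences contain no gaps. The names `aa_distance`, `mmd`, and `distance_matrix` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (and '*' for stop) to its list of codons.
CODONS = {}
for codon, aa in zip(("".join(p) for p in product(BASES, repeat=3)), AMINO):
    CODONS.setdefault(aa, []).append(codon)


def aa_distance(a, b):
    """Minimum number of base changes needed to turn a codon for `a` into one for `b`."""
    return min(sum(x != y for x, y in zip(c1, c2))
               for c1 in CODONS[a] for c2 in CODONS[b])


def mmd(seq1, seq2):
    """Sum of per-position minimum base changes for two aligned, gap-free sequences.

    ASSUMPTION: gaps are not handled; the original program's treatment of gaps
    is not described in this report.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(aa_distance(a, b) for a, b in zip(seq1, seq2))


def distance_matrix(seqs):
    """Symmetric matrix of pairwise distances, one row and column per OTU."""
    return [[mmd(s, t) for t in seqs] for s in seqs]


# The four hypothetical five-residue sequences of Fig. 1.
otus = ["VTQID", "ATEID", "APQLN", "ATELG"]
for row in distance_matrix(otus):
    print(row)
```

For the real data set the same loop would run over 804 domains of 29 residues each, giving an 804 x 804 matrix.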
Based on these four characteristics we constructed a reference tree comprised of 24 domain subfamilies of CAM, TNC, ELC, RLC, CALP, and SARC (CTER/CS). These six subfamilies contain 123 of the 218 proteins and 492 of the 804 domains and form a robust reference dendrogram for analysis of the remaining 312 domains as well as those more recently added to the data base. Further, 22 of 29 recognized subfamilies and uniques consist of four EF-hand domains as do CTER/CS. Although our primary focus has been individual domains, we also computed dendrograms within subfamilies and among fully congruent subfamilies using all domains plus interdomains in order to determine relationships among entire molecules. For instance in a four-domain protein, the sequence including the first through fourth domains (folders 2, 3, 4, 5, 6, 7, and 8 in our terminology) was used; however, the termini, folders 1 and 9, were not used because of the numerous gaps and difficulty of alignment of these segments of the entire sequence among congruent subfamilies. Each of the 24 subfamilies of domains of CTER/CS was fixed in the relationship previously determined using the domains plus interdomains of the proteins from which those domains were obtained. For example, the 32 domains 1 of CAM were arranged in a dendrogram so that they were related to one another in the manner dictated by the results of analyses of CAM based on entire sequences of CAM. Next, the 14 domains 1 of TNC were arranged in a dendrogram so that they reflected relationships based on entire sequences of TNC. This procedure was repeated with the domains 1 from ELC, RLC, SARC, and CALP, as well as with domains 2, domains 3, and domains 4 for each subfamily. This resulted in a total of 24 OTUs (six subfamilies with four domains each), with the relationship within each OTU being determined by relationships based on domains plus interdomains. 
These relationships were held constant so that relationships could not change within each subfamily of domains. The root of each of these 24 OTUs is the nodal (putative ancestral) sequence for that domain of that subfamily. It is the relationship among these ancestral sequences with which we reconstructed the evolutionary history of individual domains.

In order to use the powerful program SWAP8, which computes minimum mutation distance scores for all 10,395 possible arrangements of eight OTUs, these 24 OTUs were subdivided into three groups of eight OTUs each. Earlier analyses (Moncrief et al. 1990; this study) indicated that CALP and SARC are distantly related to the CAM, TNC, ELC, and RLC grouping. Therefore, relationships among the eight subfamily domains of CALP and SARC were analyzed independently using SWAP8. The four domain subfamilies of SARC cluster together some distance from the four domain subfamilies of CALP. Additionally, earlier studies (Baba et al. 1984; this study) indicated that domains 1 and 3 (odd) of CAM, TNC, ELC, and RLC cluster together and that domains 2 and 4 (even) of CTER are more closely related. Consequently, the odd domains, 1 and 3, of CTER comprised a group of eight OTUs for the second SWAP8 analysis, and the even domains, 2 and 4, of CTER comprised the third group of eight OTUs whose relationships were determined using SWAP8. The resulting lowest-score arrangement among each of these three groups of eight OTUs was then used to construct a starting tree with a total of 24 OTUs (length = 2273) for a SWAP1 analysis of the 24 OTUs that make up the reference tree, abbreviated as CTER/CS. The tree having the lowest score consistent with these constraints is shown in Fig. 2a.

In the second phase of these analyses, we examined relationships among the 24 OTUs of CTER/CS and subsets of domains from the other 23 subfamilies and uniques in this study. After proper assignments to subfamilies, all of the domains of the 23 remaining subfamilies and uniques were included in 89 (113 - 24) subfamily domains. We then examined groups of subfamilies and uniques relative to the 24 OTUs of CTER/CS for two reasons.
At one extreme it is still too difficult to determine the lowest scoring dendrogram of 113 OTUs (subfamily domains). At the other extreme we found that adding each of the 23 remaining subfamilies and uniques individually to CTER/CS was unsatisfactory because their optimal placements are dependent upon the placements of their closer relatives. Hence we went through several iterations to define subgroups of the 23 that were mutually stable when joined to CTER/CS.

In the first iteration the seven groups and the number of different arrangements investigated with SWAP1 were as follows: (1) EFH5, LAV, and 1F8 (>40 trees); (2) DGK, VIS, CLNB, and TPP (>80 trees); (3) CAL, SQUD, TPNV, CVP, and CDC (>100 trees); (4) SPEC, ACTN, LPS (>50 trees); (5) S100, PARV, CLBN, and TCBP (>50 trees); (6) CMSE and AEQ (>40 trees); and (7) CDC, SQUD, CAL (>20 trees). Results from these analyses reduced the number of subfamilies and uniques for further consideration in relation to individual domains of CTER/CS from 23 to 19. At this time the sequence of calcium-dependent protein kinase (CDPK) was published. Because CAL, SQUD, CDC, and CDPK are fully congruent with CAM, TNC, ELC, and RLC, the relationships of CAL, SQUD, CDC, and CDPK to CTER were further investigated using the four domains plus interdomains and domains only of each protein (84 sequences total) in a SWAP8 analysis (Fig. 9b).

The results from the first iteration indicated new groupings as follows: (1) ACTN, S100 (>20,790 trees); (2) TPP, 1F8, and DGK (>10,395 trees); (3) EFH5 and LAV (>41,580 trees); (4) SPEC and LPS (>41,580 trees); and (5) AEQ and CMSE (>10,395 trees).

The results from the second iteration indicated new groupings as follows: (1) TPP, 1F8, and LPS (>5 trees); (2) LAV, AEQ, and CMSE (>7 trees); (3) EFH5, ACTN, and PARV (>12 trees); (4) TCBP, VIS, and S100 (>6 trees); (5) TPNV, SPEC, and CLBN (>7 trees); and (6) CLNB, CVP, and DGK (>7 trees).

The results from the third iteration indicated new groupings as follows: (1) TPP, 1F8, ACTN (>25 trees); (2) EFH5, LPS (>10 trees); (3) VIS, CMSE (>6 trees); (4) PARV, LAV, DGK, AEQ (>25 trees); (5) CVP, S100 (>7 trees); and (6) CLBN, TCBP (>7 trees).

The results from the fourth iteration indicated new groupings for the final iteration as follows: (1) ACTN, EFH5, PARV (>20 trees); (2) CLBN, TPNV, CLNB (>5 trees); (3) LPS, AEQ, DGK (>10 trees); (4) TPP, 1F8 (>3 trees); (5) VIS, CMSE, TCBP (>4 trees); and (6) SPEC, LAV (>5 trees).
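The figure of 10,395 arrangements of eight OTUs, and the reason 24, 113, or 804 OTUs cannot be searched exhaustively, can be checked with a few lines of arithmetic. The sketch below assumes that "arrangements" means unrooted, strictly bifurcating topologies with labeled tips, whose number for n OTUs is the double factorial (2n - 5)!!. The function name `unrooted_tree_count` is hypothetical.

```python
def unrooted_tree_count(n):
    """Number of unrooted bifurcating topologies for n labeled OTUs: (2n - 5)!!."""
    if n < 3:
        raise ValueError("need at least three OTUs")
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count


# Eight OTUs can be enumerated exhaustively; larger sets cannot.
for n in (8, 24, 113):
    digits = len(str(unrooted_tree_count(n)))
    print(f"{n} OTUs: a number with {digits} digits")
```

For n = 8 the product 3 x 5 x 7 x 9 x 11 gives 10,395, matching the number quoted for SWAP8. The counts of 20,790 and 41,580 trees quoted for some groups equal 2 x and 4 x 10,395, which would be consistent with two and four exhaustive eight-OTU searches, although the text does not state this.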

Generation of Nodal Sequences. Nodal sequences were constructed by the method of Sankoff and Cedergren (1983). In Fig. 1 are shown at the terminal nodes four aligned, hypothetical sequences, each five residues long. In the first coalescence to generate a nodal sequence for the two uppermost termini, the two ambiguities of va*T*qe*I*D are resolved to A*T*Q*I*D by considering the adjacent terminal node. The next inner node, A*tp*Q*il*dn, is partially resolved to A*T*Q*L*dn by likewise considering the adjacent terminal node. The next inner node (putative ancestral sequence for all four termini), A*T*qe*L*dng, is unresolved as no adjacent terminal node is shown. The deduced amino acid sequences are shown at each node. The illustrated amino acids at any given position can be related to one another by changes of a single encoding base. At least seven changes are deduced to have occurred. Four amino acid changes in uppercase--A → V, L → I, T → P, and Q → E--are assigned to a branch. Three in lowercase are each shown twice to indicate that one base change occurred in only one of two branches. This procedure is used to generate the nodal sequences used to align interdomain sequences. Additionally, this is the procedure by which inner nodes were generated for each domain subfamily by SWAP1 and by SWAP8. In Fig. 5c we illustrate nodal sequences for the four loops of ELC as well as for the linker between domains 2 and 3. Similarly, in Figs. 6c and 8c we show nodal sequences for RLC and for SARC.

Calculation of Branch Distances and Error Estimations. The hypothetical minimum mutation distance assigned to each branch length, $D_a$, has an estimated error (standard deviation) of $E_a = D_a^{1/2}$. This error estimate assumes a stochastic process like radioisotope decay. For example, the two distances, 9 and 3, of Fig. 1 have errors of 3.00 and 1.73. When two branches are coalesced to a single branch to reflect the average distance from ancestral to descendant sequences, the length is the arithmetic average, $D_{ab} = (D_a + D_b)/2$, e.g., $D_{ab} = (9 + 3)/2 = 6.0$. The error is $E_{ab} = [(E_a^2 + E_b^2)^{1/2}]/2$, e.g., $[(9 + 3)^{1/2}]/2 = 1.73$. Note that the error of this coalesced branch, 6.0 mutations long, is 1.73, not $6^{1/2}$; more observations increase the signal-to-error ratio. The sum of two colinear branches, $D_c + D_d$, is illustrated as 6 + 6 = 12. The error of this sum is $E_{cd} = (E_c^2 + E_d^2)^{1/2} = (2.45^2 + 1.73^2)^{1/2} = (6.0 + 3.0)^{1/2} = 3.0$. All error estimates indicated in the dendrograms have been calculated by these formulae.

We are painfully aware of the many approximations underlying these procedures. Two important points are emphasized. First, any conclusions about rates of evolution are bounded by these or by more sophisticated estimates of error. Second, branch lengths are frequently short relative to their errors; an important consequence is that alternate topologies could be generated by altering branch lengths within their errors. Robust subfamilies do not necessarily have longer branch lengths. They usually have higher length-to-error ratios, because determinations of many diverse proteins have improved the statistics.

Results

Definition and Characterization of Subfamilies

The alignment of amino acid sequences within domains is usually unambiguous, even among distantly related domains that do not bind calcium. In contrast, the alignment of sequences connecting domains is sometimes questionable when comparing different subfamilies. Hence, we have used domains plus interdomains when computing the relationships within subfamilies; however, when comparing the relationships among subfamilies, or domains thereof, we have used the EF-hand domains only. Within subfamilies the dendrograms computed using domains plus interdomains and those using domains only are very similar (see illustrations for CAM, TNC, ELC, RLC, CALP, and SARC, Figs. 3-8). These differ little from those presented in Moncrief et al. (1990) and will be discussed only in terms of new sequences or special characteristics. This guideline of domains only among subfamilies and domains plus interdomains within subfamilies is circular in that the greater constancy of interdomain sequences within subfamilies is one characteristic used to define a subfamily.

In most cases our assignments of proteins to subfamilies as a result of our computations correspond to classifications previously made by researchers who used functional criteria; exceptions are noted below. As discussed in the next report in this series, the distribution of intron sites, as now available for 15 subfamilies, differs among subfamilies but is consistent within subfamilies. For instance, cal1, which is closely related to calmodulin and was put in the calmodulin subfamily in Moncrief et al. (1990), has a different distribution of introns from that in the calmodulins and is treated in this report as a unique.

The distinction between closely related subfamilies can be difficult. To illustrate, cal1 and squidulin might be considered part of the calmodulin subfamily; however, Caenorhabditis and Loligo both appear to have true calmodulins. As discussed in the next section, we consistently find congruence within subfamilies; however, the extent of congruence varies among subfamilies. Troponin C (TNC) and the troponin from nonvertebrates (TPNV) are not fully congruent, although the two groups might otherwise have been considered members of the same subfamily. Other examples will be noted in the discussion of the relevant subfamily or unique.

The 20 subfamilies each have two or more members. The designation as unique (cal1, squidulin, calcium-dependent protein kinase, EFH5, thyroid calcium-binding protein, calcium vector protein, LAV1, Streptomyces calcium-binding protein, and calmodulin-related gene product) is admittedly arbitrary in two senses. If a second close homolog were sequenced, the pair would by definition become a subfamily. As already noted, the lumping and splitting of subfamilies follows several guidelines but in some instances ultimately demands a subjective judgment.

Fig. 1. The generation of nodal sequences is shown for four aligned, hypothetical sequences, each five residues long. In the first coalescence to generate a nodal sequence, the two ambiguities of va*T*qe*I*D are resolved to A*T*Q*I*D by considering the next higher branch. The next higher coalescence, A*tp*Q*il*dn, is partially resolved to A*T*Q*L*dn by considering the next higher branch. The next coalescence, A*T*qe*L*dng, is unresolved, as no higher branch is shown. The deduced amino acid sequences are shown at each node. The illustrated amino acids at any given position can be related to one another by changes of a single encoding base. At least seven base changes are deduced to have occurred. Four in uppercase--A → V, L → I, T → P, and Q → E--are assigned to a branch. Three in lowercase are each shown twice to indicate that one base change occurred in only one of two branches. The hypothetical minimum mutation distances (for the full-length domain) have an estimated error (standard deviation) of $E_a = D_a^{1/2}$. This error estimate assumes a stochastic process like radioisotope decay. For example, the two distances, 9 and 3, have errors of 3.00 and 1.73. When two branches are coalesced to a single branch the length is the arithmetic average, $D_{ab} = (D_a + D_b)/2$, e.g., $D_{ab} = (9 + 3)/2 = 6.0$. The error is $E_{ab} = [(E_a^2 + E_b^2)^{1/2}]/2$, e.g., $[(9 + 3)^{1/2}]/2 = 1.73$. Note that the error of this collapsed branch, 6.0 mutations long, is 1.73, not $6^{1/2}$; more observations increase the signal-to-noise ratio. The sum of two colinear branches, $D_c + D_d$, is illustrated as 6 + 6 = 12. The error of this sum is $E_{cd} = (E_c^2 + E_d^2)^{1/2} = (2.45^2 + 1.73^2)^{1/2} = (6 + 3)^{1/2} = 3.0$.
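The worked example of Fig. 1, both the nodal sequences and the branch errors given in Methods, can be retraced with a short sketch. Two caveats apply. The set rule used here (take the intersection of two residue sets when it is not empty, otherwise the union, then narrow by the adjacent node) is a simplified Fitch-style stand-in that happens to reproduce the strings of Fig. 1; it is not the method of Sankoff and Cedergren (1983). The error functions implement only the three formulae quoted in the text. All function names are hypothetical.

```python
import math


def as_sets(seq):
    """One residue set per aligned position."""
    return [frozenset(ch) for ch in seq]


def combine(left, right):
    """Per position: intersection if non-empty, otherwise union (ASSUMED rule)."""
    return [a & b if a & b else a | b for a, b in zip(left, right)]


def resolve(node, adjacent):
    """Narrow ambiguous positions using the adjacent node where the sets overlap."""
    return [a & b if a & b else a for a, b in zip(node, adjacent)]


def show(node):
    """Uppercase for a resolved position, sorted lowercase letters for an ambiguity."""
    return "*".join(min(s) if len(s) == 1 else "".join(sorted(s)).lower()
                    for s in node)


t1, t2, t3, t4 = (as_sets(s) for s in ("VTQID", "ATEID", "APQLN", "ATELG"))
n1 = resolve(combine(t1, t2), t3)   # first coalescence, resolved by adjacent terminus
n2 = resolve(combine(n1, t3), t4)   # next inner node, partially resolved
root = combine(n2, t4)              # no adjacent node shown, so left unresolved
print(show(n1), show(n2), show(root))


def branch_error(d):
    """E = D**(1/2), the stochastic (counting) error of a branch of length D."""
    return math.sqrt(d)


def coalesce(da, db):
    """Average of two branches and its error, [(Ea^2 + Eb^2)^(1/2)] / 2."""
    ea, eb = branch_error(da), branch_error(db)
    return (da + db) / 2, math.sqrt(ea ** 2 + eb ** 2) / 2


def colinear_sum(dc, ec, dd, ed):
    """Sum of two colinear branches and its error, (Ec^2 + Ed^2)^(1/2)."""
    return dc + dd, math.sqrt(ec ** 2 + ed ** 2)


d_ab, e_ab = coalesce(9, 3)
d_cd, e_cd = colinear_sum(6, branch_error(6), d_ab, e_ab)
print(d_ab, round(e_ab, 2), d_cd, round(e_cd, 2))
```

The ambiguous positions are printed in alphabetical order, so the root appears as A*T*eq*L*dgn rather than A*T*qe*L*dng; the residue sets are the same.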

Congruence within Subfamilies

We find that within statistical error the same topologies of relationships within subfamilies are generated using individual domains, or using all domains in tandem, or using domains plus interdomains (see Figs. 3-8). We refer to this correspondence of topologies as congruence. Of course the statistical error is greater when using data for only one domain of 29 residues than for four domains, 116 residues, or the entire sequence, e.g., calmodulin, 148 residues. Hence, we use domains plus interdomains when generating dendrograms within subfamilies. Of fundamental importance to further analyses is the result that within all subfamilies the domains 1 cluster together, as do domains 2, etc., as illustrated for TNC (Fig. 4).

Congruence implies that all members of any given subfamily of molecules diverged from a single subfamily precursor of full length by subsequent speciation and/or gene duplication events in the ancestral organism without subsequent domain shuffling or splicing events. As illustrated for CAM, TNC, ELC, RLC, CALP, and SARC, the interpretation of congruence within subfamilies is strongly supported. However, because of statistical error the details of relationships among members of a subfamily are slightly different as determined using sequences for each domain. Occasionally, one domain will not cluster with the same numbered domains but with another group of domains of that same EF-hand subfamily. We suggest that these minor variations simply reflect the stochastic nature of the evolutionary process and the statistics of dealing with a short length of 29 residues; they do not alter the important interpretation of congruence. Further, this example illustrates why we seek a simple interpretation involving fewer gene duplications or transpositions, not necessarily the lowest MMD score.
We have used CTER/CS (CAM, TNC, ELC, RLC, CALP, and SARC) as the reference for evaluating the relationships among the domains of the other 23 subfamilies and uniques. In Figs. 9-16 the relationships within the 24 (6 x 4) subfamily domains were held fixed to the relationships for the respective subfamilies as determined by using domains plus interdomains (Figs. 3-8). For some of the 23 subfamilies we "unfixed" CTER/CS during computation of dendrograms and found that only infrequently did individual domains move inside the 24 reference domain subfamilies.

Relationships among Congruent Subfamilies

In contrast to the congruence observed within all of the known subfamilies of EF-hand proteins, the relationships among most subfamilies are not congruent. This finding and its elaboration are essential to all further analyses of the EF-hand family and should form an important part of the analyses of all multidomain families of proteins.

Four major subfamilies are congruent with one another. The domains 1 of CAM cluster together, and this group is in turn near the domains 1 of TNC and the domains 1 of ELC and the domains 1 of RLC. Correspondingly, the domains 2 clusters of CAM, of TNC, of ELC, and of RLC themselves cluster together. This congruence also holds for domains 3 and for domains 4. All four subfamilies--CTER--are congruent with one another. All four evolved from a single four-domain precursor present in an organism ancestral to all extant eukaryotes. Calmodulin is found in all fungi, protists, plants, and animals; essential light chain is found in fungi and animals.

The situation in CALP and in SARC is fundamentally different from that in CTER. The four clusters of domains of CALP are most closely related to one another. The four domains of CALP arose by repeated gene duplications and fusion from a single domain, UR/CALP. CALP is not congruent with any other subfamily. Similarly, the four domains of sarcoplasmic calcium-binding protein are most closely related to one another. The single-domain precursor of CALP, UR/CALP, and the single-domain precursor of SARC, UR/SARC, are closely related; however, the two subfamilies are not congruent (Fig. 2a).

The proposed evolution of these six subfamilies--CTER/CS--is illustrated in Fig. 2b. A UR EF-hand domain duplicated without fusion to form a precursor of CALP and of SARC, indicated as UR/CS, and a precursor of CAM, TNC, ELC, and RLC, indicated as UR/CTER. UR/CS underwent a gene duplication, without fusion, to yield UR/CALP and UR/SARC. Both encoding genes then went through three gene duplications and splicings with fusions to form the precursor four-domain gene of all CALPs and to form the precursor four-domain gene of all SARCs. The other descendant of the UR EF-hand domain, UR/CTER, duplicated with fusion to form the two-domain precursor of all four subfamilies, OD*EV/CTER. This precursor underwent another

[Fig. 2, panels a-c: dendrogram of the 24 reference domains (CAM1-4, TNC1-4, ELC1-4, RLC1-4, CALP1-4, SARC1-4), pathway diagram, and cartoon. Graphic not recoverable from the extracted text.]

Fig. 2. a The relationships within each of the 24 subfamily domains of the 6 subfamilies--CAM, TNC, ELC, RLC, CALP, and SARC (CTER/CS)--were fixed as indicated in the dendrograms of these respective subfamilies computed using entire sequences (Figs. 3-8). The minimum mutation distances indicated on the 24 terminal nodes represent weighted averages for a domain length of 29 residues as described in the Methods. CAM is quite highly conserved as indicated by the shorter lengths of its four branches. The MMDs in parentheses were computed omitting pseudogene subgroups ψA and ψB. Conversely, SARC and CALP have diverged quite rapidly. The domains 4 of CTER have diverged to a greater extent than have the other three. The four domains of CALP cluster together as do the four SARC domains; the two quartets are completely noncongruent. In marked contrast, the domains 1 of CAM, TNC, ELC, and RLC cluster together, as do domains 2, domains 3, and domains 4. Although topologies within the four CTER clusters, 1-4, differ slightly, the interpretation of congruence among CTER subfamilies is solid; the order of branching can be compared to that seen in the dendrograms computed using domains only, as well as domains plus interdomains (Fig. 9b). All four subfamilies evolved from a single four-domain precursor. b A symbolic interpretation of the evolutionary pathways of these six subfamilies. A gene encoding a single UR EF-hand in a precursor to fungi, protists, plants, and animals duplicated without fusion to form UR/CTER and UR/CS. UR/CS again duplicated without fusion to form UR/CALP and UR/SARC. Both underwent several duplications with fusions to form the precursor of all CALPs and of all SARCs respectively. In a different sequence of duplication and splicing, the UR/CTER gene duplicated and fused twice, generating OD*EV/CTER, then 1*2*3*4/CTER; this latter was the precursor of all CAMs, TNCs, ELCs, and RLCs now found in the four eukaryotic kingdoms. c The cartoon emphasizes that gene duplications without splices (--) generated the precursor of CALP and of SARC. In contrast two gene duplications with splices (***) generated the sole four-domain precursor of CAM, TNC, ELC, and RLC.

gene duplication and fusion to generate 1*2*3*4/CTER. It then duplicated, without fusion, to form the four-domain precursors of CAM+TNC and of ELC+RLC; another pair of duplications, without fusions, yielded CAM and TNC as well as ELC and RLC precursors. The evolution of CAM, TNC, ELC, and RLC involved first fusion then duplication. This interpretation of the evolution of CTER/CS is summarized in the cartoon of Fig. 2c. A duplication without fusion generated C and S. A pair of duplications with fusions generated 1*2*3*4, the precursor of CTER. CAM, TNC, ELC, RLC, CALP, and SARC (CTER/CS) then formed the reference dendrogram for evaluations of the remaining 23 subfamilies. These six subfamilies contain 123 of the 220 available sequences and 492 of the 804 available domains. Because so many sequences are available, these six subfamilies form a stable database for reference. Omission of several sequences in the

[Fig. 3, panels a-c: calmodulin dendrograms. Graphic not recoverable from the extracted text.]

Fig. 3. a The relationships of the 32 calmodulins for domains plus interdomains, folders 2, 3, 4, 5, 6, 7, and 8. The alignment of folders 3, 5, and 7 is unambiguous within subfamilies. The topology is identical when computed for domains only. Branch lengths in minimum mutation distances are indicated for domains plus interdomains. Many branch lengths are so short that a few changes in sequence can cause seemingly large changes in branch orders. The CAM sequences from Drosophila and from Aplysia are identical; this sequence is one of 32 indicated. In addition to Arbacia a there are six vertebrate CAMs whose sequences are identical to that of Homo. b The relationships based on the 32 domains 1; similar results were computed for domain 2, domain 3, and domain 4 (not shown). Dotted lines indicate that a member of the domain subfamily, e.g., Saccharomyces, falls outside the CAM domain 1 cluster or that a foreign domain, e.g., of TNC domain 1, falls inside the CAM domain 1 cluster. Because the dendrograms were computed using only 29 residues, we consider these deviations from perfect congruence to be well within the range of fluctuations associated with evolution. Similar results within other subfamilies led to our conclusion of congruence within subfamilies. c The seven subgroups--α1, α2, β, γ, δ, ψA, and ψB--and the lengths of their coalesced branches.

computation does not alter the fundamental conclusions of complete congruence and of complete noncongruence. Three uniques (CAL, SQUD, and CDPK) and the CDC subfamily are congruent with the CTER subfamilies (Table 2 and Fig. 9). The remaining 19 (29 - 6 - 4) subfamilies and uniques appear to have mixed origins; that is, they are only partly congruent. Because most of these subfamilies have few members, or only one, the statistical significances of their branch lengths and branching orders are lower than are those of the CTER/CS reference dendrogram. Further, for some of the 19, the branch lengths are often short; hence, the choice of exact topology is questionable. If several subfamilies, or domains thereof, are closely related, then these several are jointly compared with CTER/CS. Details are discussed for each subfamily.
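The counts quoted in this section can be cross-checked with a few lines of arithmetic. The following is only an illustrative sketch in Python; the variable names are ours and are not part of the analysis, and the numbers are those given in the text (29 subfamilies and uniques, 6 reference subfamilies, 4 groups congruent with CTER, 123 of 220 sequences, 492 of 804 domains).

```python
# Counts quoted in the text; variable names are illustrative only.
total_groups = 29          # subfamilies plus uniques
reference = 6              # CAM, TNC, ELC, RLC, CALP, SARC
congruent_with_cter = 4    # CAL, SQUD, CDPK, and the CDC subfamily

remaining_after_reference = total_groups - reference
mixed_origin = remaining_after_reference - congruent_with_cter

reference_sequences, all_sequences = 123, 220
reference_domains, all_domains = 492, 804

# Each reference sequence contributes four EF-hand domains.
domains_per_reference_sequence = reference_domains / reference_sequences
sequence_share = reference_sequences / all_sequences
domain_share = reference_domains / all_domains

print(remaining_after_reference, mixed_origin)   # 23 19
print(domains_per_reference_sequence)            # 4.0
print(f"{sequence_share:.1%} of sequences, {domain_share:.1%} of domains")
```

The check reproduces the "remaining 23" and "remaining 19" figures and shows that the reference set covers somewhat more than half of the available sequences and domains.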

[Fig. 4, panels a and b: troponin C dendrograms for skeletal (sk) and cardiac (cd) forms and for Halocynthia. Graphic not recoverable from the extracted text.]

Fig. 4. a The relationships of 13 troponin Cs plus the recently determined skeletal form from Mus are identical for domains only and for domains plus interdomains. b The relationship of all 52 (13*4) domains illustrates the range of variation and the strength of congruence seen in this and in other subfamilies. The dotted line joining Halocynthia in domains 3 indicates that it is just outside the subfamily. All domains 1 cluster together as do all domains 2, domains 3, and domains 4. Domain subfamilies 1 and 3 are joined as are 2 and 4. The relationships within each of these four clusters are quite similar to those computed using entire sequences. In order to save space, corresponding results showing congruence are not shown for other subfamilies. The distinction between skeletal (sk) and cardiac (cd) TNCs is maintained in dendrograms based on individual domains; for all four dendrograms Halocynthia TNC is the most divergent.

The important conclusion is that the major subfamilies--CAM, TNC, ELC, and RLC--are highly congruent. They all evolved from a single four-domain precursor. In marked contrast, SARC and CALP are completely noncongruent. SARC evolved from a single-domain SARC precursor, and CALP evolved from a closely related, but distinct, single-domain CALP precursor. Many of the remaining 23 subfamilies are partially congruent. Regardless of the exact details of each subfamily, to be discussed, this observation of partial congruence demands an evolutionary scheme involving complex pathways of gene duplications, transpositions, and fusions. Further, this finding of partial congruence invalidates many previous dendrograms of EF-hand proteins that depict relationships among subfamilies based on entire molecules. It is valid to compare subfamilies that are fully congruent; it is valid to compare the ur-domains of subfamilies that are completely noncongruent. However, it is meaningless to construct a dendrogram relating the tandem domains of partially congruent proteins or subfamilies; one must consider the dendrogram of the individual constituent domains, which may have different evolutionary histories.
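The three categories used above (congruent, noncongruent, partially congruent) can be illustrated with a toy classifier. This is a deliberately simplified sketch in Python, not the procedure used in this study, which rests on dendrogram topology and branch lengths. The function name, the cluster labels such as "CTER1", and the third example are hypothetical; the first two examples only restate what the text says about CAM and CALP.

```python
def classify_congruence(nearest_cluster):
    """Classify a protein from the cluster each of its domains joins.

    nearest_cluster maps a domain position (1-4) to a label for the
    reference cluster that domain falls into. A domain at position p
    counts as congruent with CTER when its label is "CTER<p>".
    This rule is an assumption made for illustration only.
    """
    if not nearest_cluster:
        raise ValueError("at least one domain is required")
    matches = sum(
        1
        for position, label in nearest_cluster.items()
        if label == f"CTER{position}"
    )
    if matches == len(nearest_cluster):
        return "congruent"
    if matches == 0:
        return "noncongruent"
    return "partially congruent"


# CAM: each domain joins the CTER cluster of the same position (as in the text).
cam = {1: "CTER1", 2: "CTER2", 3: "CTER3", 4: "CTER4"}
# CALP: the four domains are most closely related to one another (as in the text).
calp = {1: "CALP", 2: "CALP", 3: "CALP", 4: "CALP"}
# Invented example of mixed origin; it does not describe any real subfamily.
toy = {1: "CTER1", 2: "CTER2", 3: "CALP", 4: "SARC"}

for name, domains in (("CAM", cam), ("CALP", calp), ("toy", toy)):
    print(name, classify_congruence(domains))
```

Working per domain, rather than per whole molecule, is the point of the sketch: a single label for the entire protein would hide the mixed origin of the third example.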

Relationships within and Characteristics of Subfamilies

In Moncrief et al. (1990) we reviewed the characteristics of each subfamily and unique in terms of its relationship to other subfamilies and its own subfamily dendrogram. We discuss here only the new subfamilies or uniques and the new insights into the subfamilies previously discussed. For each of the six subfamilies--CAM, TNC, ELC, RLC, CALP, and SARC--used as reference, we have computed dendrograms based on domains plus interdomains, domains only, and the four individual domains. To illustrate the range of variation, we show, for each of the six, dendrograms computed using domains plus interdomains, and using domain 1 data only. To save space the domains 2, domains 3, and domains 4 dendrograms are not shown except for TNC, for which we show dendrograms for all four individual domains to illustrate congruence within the subfamily (Fig. 4b).

Calmodulin, CAM, 32 sequences (Fig. 3)

Many authors faithfully note that calmodulin is the most conserved protein except for histones. This is half true. The dendrogram of Fig. 3a is coalesced to show seven distinct subgroups--α1 of vertebrates; α2 of plants and protists; β of echinoderms, insects, and molluscs; γ of Chlamydomonas and Dictyostelium; δ of other fungi, pseudogenes A, and pseudogenes B--in Fig. 3c. This presentation emphasizes two points. There is a fivefold change in minimum mutation distance, 28 from the deepest node to the fungi (subgroup δ), versus 6 or 5, to the α1 or β subgroups. Not surprisingly, the pseudogene subgroups, ψA and ψB, have many changes in se-
