J. theor. Biol. (1979) 76, 369-386

How Reliably do Amino Acid Composition Comparisons Predict Sequence Similarities between Proteins?

ATHEL CORNISH-BOWDEN

Department of Biochemistry, University of Birmingham, P.O. Box 363, Birmingham B15 2TT, England

(Received March 20, 1978)

A method for comparing amino acid compositions of proteins (Cornish-Bowden, 1977) has been extended to allow proteins of unequal lengths to be compared. The method has been tested by applying it to proteins of known sequence. It tends to exaggerate the amount of difference between unrelated proteins. It is therefore a reliable guide to possible sequence similarities, in that it does not suggest that sequences are similar when they are not, though it sometimes fails to detect genuine similarities. When applied to related proteins the method gives results in good agreement with those predicted. A phylogenetic tree for 37 snake venom toxins has been constructed from their compositions and is similar in most important respects to one constructed from the corresponding sequences.

1. Introduction

In a previous paper (Cornish-Bowden, 1977) I proposed a theory for the use of amino acid compositions to estimate the amount of identity between the sequences of pairs of proteins. The composition indexes of Marchalonis & Weltman (1971) and of Harris, Kobes, Teller & Rutter (1969) proved to be particularly simple to interpret, at least in the simplest case of proteins of equal length; but further study showed that the index of Metzger, Shapiro, Mosimann & Vinton (1968) could be analysed similarly (Cornish-Bowden, 1978a). In developing the theory it was necessary to make a statistical assumption about the distribution of the various types of amino acid in proteins, and the validity of the theory thus depends on the validity of this assumption. It is obvious that proteins would be unable to fulfil their biological functions if the distribution of residues were entirely random, and in any case there is ample evidence (Holmquist & Moise, 1975; Cornish-Bowden & Marson, 1978) that it is not. Nonetheless, it is not wholly unreasonable to argue that differences between closely related proteins must occur at sites where the requirements for particular types of amino acid are weak, and that these differences may be distributed in a way that can be

0022-5193/79/040369+18 $02.00/0

© 1979 Academic Press Inc. (London) Ltd.


analysed as if it were random. (To understand how a non-random sequence can be analysed as if it were random, it may be helpful to consider the decimal expansion of a transcendental number such as π: the sequence of digits is entirely predictable, but it will give an insignificant result to any statistical test of the hypothesis that the digits are distributed multinomially with a probability of 0.1 of finding any digit at any site.)

In this paper I shall give evidence from proteins of known sequence that the theory provides results in accordance with expectation when it is applied to related proteins, but that it tends to underestimate the amount of similarity between the sequences of unrelated proteins. This tendency to underestimate the amount of similarity in such comparisons is hardly a disadvantage, because it means that there is a negligible danger of deducing a sequence relationship between two proteins if none exists. The difference between the behaviour with related and unrelated pairs of proteins can be explained by the fact that it is a reasonable approximation to suppose that the amino acids are distributed similarly in related proteins, but there is no reason why such an approximation should hold in comparisons between unrelated proteins.

There have been several previous investigations of the behaviour of composition indexes when applied to proteins of known sequence (Marchalonis & Weltman, 1971; Harris & Teller, 1973; Dedman, Gracy & Harris, 1974; Black & Harkins, 1977). These have all suffered, however, from the absence of any theoretical knowledge of how the indexes ought to behave, and especially from the failure to realize that the interpretation of most indexes is crucially dependent on the lengths of the proteins compared. For example, a value of 20 for the “difference index” of Metzger et al. (1968)
would indicate a very high degree of sequence identity, of the order of 95%, if it referred to a pair of proteins with 20 residues each; the same value of 20 would indicate negligible relationship if it referred to a pair of proteins with 100 residues each. To obtain useful information from a comparison between compositions, it is necessary not only to allow for this length dependence, but also to correct for any appreciable difference in length between the two proteins compared: this can be done by a simple extension of the earlier theory.

Some of the results in this paper have been described in preliminary form elsewhere (Cornish-Bowden, 1978b).

2. Theory

(A) ALLOWANCE FOR DIFFERENCES IN LENGTH

For two proteins A and B of equal length, the most convenient index of compositional difference is $S\Delta n$, defined as follows:

$$S\Delta n = \tfrac{1}{2}\sum_{i=1}^{18}(n_{iA}-n_{iB})^2, \tag{1}$$

in which $n_{iA}$ and $n_{iB}$ are the numbers of amino acid residues of the $i$th type in A and B respectively, and the summation is carried out over the 18 types of amino acid that are readily distinguishable in composition measurements. Apart from simplicity of calculation, the main advantage of this index is that it is an unbiased estimator of the number of differences between the two sequences, if one assumes that the probability $p_i$ of finding the $i$th type of amino acid at any site is the same in both sequences (Cornish-Bowden, 1977).

For comparing proteins with lengths $N_A$ and $N_B$ that are similar but unequal I previously suggested a generalized definition using mole fractions (Cornish-Bowden, 1977), by analogy with the index of Marchalonis & Weltman (1971). Further study, however, has shown that that definition leads in general to an inconveniently complicated expression for the expected† value of $S\Delta n$. A much better generalized definition of $S\Delta n$ is the following:

$$S\Delta n = \tfrac{1}{2}\sum(n_{iA}-n_{iB})^2 - \tfrac{1}{2}(N_A-N_B)^2\sum p_i^2 + \tfrac{1}{2}|N_A-N_B|\left(1+\sum p_i^2\right). \tag{2}$$
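As an illustration only, equation (2) can be evaluated directly from two composition vectors. The sketch below is not taken from the paper: the function name `s_delta_n` and the argument `sum_p2` are hypothetical, and the default of 0.07 for $\sum p_i^2$ is the typical value that the text adopts when it rewrites equation (2) as equation (5).

```python
def s_delta_n(comp_a, comp_b, sum_p2=0.07):
    """Composition difference index of equation (2).

    comp_a, comp_b: residue counts per amino acid type, in the same order.
    sum_p2: assumed value of the sum of squared point probabilities.
    """
    if len(comp_a) != len(comp_b):
        raise ValueError("compositions must list the same amino acid types")
    # Crude index of equation (1): half the sum of squared count differences.
    crude = 0.5 * sum((a - b) ** 2 for a, b in zip(comp_a, comp_b))
    # Length difference |N_A - N_B| obtained from the totals.
    d = abs(sum(comp_a) - sum(comp_b))
    # Two length-correction terms of equation (2).
    return crude - 0.5 * d * d * sum_p2 + 0.5 * d * (1.0 + sum_p2)
```

For proteins of equal length the correction terms vanish and the function reduces to equation (1).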

With this definition, and with the same statistical assumption that the probability $p_i$ of finding the $i$th type of amino acid at any site is the same for both A and B, $S\Delta n$ is an unbiased estimator of the number of residues in the longer sequence that are unmatched with identical residues in the shorter sequence when the two are aligned. This may be demonstrated by the following argument. Consider two sequences A and B of lengths $L_A$ and $L_B$ respectively with no identities apart from those that occur by chance when they are aligned. (I assume that there exists a unique alignment that would be obvious if the sequences were known. This is hardly a reasonable assumption for comparing unrelated proteins, but it becomes one if we assume, as I shall below, that there are an appreciable number of loci at which the residues are identical because of ancestral relationship.) If A is taken by definition to be the shorter sequence, and $M$ is defined as the number of unmatched residues in B, the expected value of $M$ is

† The word expected is used in this paper with its precise statistical meaning and not with its vaguer everyday meaning. The expected value of a random variable is defined as the mean of its probability distribution.

$$E(M) = L_A\left(1-\sum p_i^2\right) + (L_B-L_A)$$

(cf. Cornish-Bowden, 1977), and the expected value of the first term on the right-hand side of equation (2) is

$$E\left[\tfrac{1}{2}\sum(n_{iA}-n_{iB})^2\right] = \tfrac{1}{2}(L_A-L_B)^2\sum p_i^2 + \tfrac{1}{2}(L_A+L_B)\left(1-\sum p_i^2\right). \tag{3}$$

Introduction of $(N_A-L_A)$ additional loci at which the sequences are identical because of ancestral relationship has no effect on the number of differences either between the sequences or between the compositions. Consequently equation (3) may be written in terms of $N_A$ and $N_B$ as follows:

$$E\left[\tfrac{1}{2}\sum(n_{iA}-n_{iB})^2\right] = \tfrac{1}{2}(N_A-N_B)^2\sum p_i^2 + L_A\left(1-\sum p_i^2\right) + \tfrac{1}{2}|N_A-N_B|\left(1-\sum p_i^2\right).$$

Combining this with equation (2) gives

$$E(S\Delta n) = L_A\left(1-\sum p_i^2\right) + |N_A-N_B| = E(M), \tag{4}$$

which shows that $S\Delta n$ is an unbiased estimator of the number $M$ of residues in the longer sequence that are unmatched with identical residues in the shorter sequence when the two are aligned.

Practical application of equation (2) requires a value for $\sum p_i^2$, though not, fortunately, values of the individual $p_i$. For almost any pair of proteins, if one puts

one obtains a value of $\sum p_i^2$ that does not differ appreciably from 0.07. There is therefore considerable convenience and very little inaccuracy in rewriting equation (2) as follows:

$$S\Delta n = \tfrac{1}{2}\sum(n_{iA}-n_{iB})^2 - 0.035(N_A-N_B)^2 + 0.535|N_A-N_B|. \tag{5}$$

The two correction terms in this equation approximately cancel out if $|N_A-N_B|$ is small, as shown in Fig. 1. This means that the error introduced by using the simpler definition of $S\Delta n$ expressed by equation (1) is of the order of ±2 if the two proteins differ in length by 18 or fewer residues.

(B) PROTEINS OF VERY DIFFERENT LENGTH

Comparisons of the compositions of proteins of very different lengths (as opposed to those of unequal but similar lengths considered in the previous section) are likely to be meaningless unless there is a good independent reason for supposing a relationship to exist. There is, however, an important exception that applies to pairs of proteins in which one is double, or approximately double, the size of the other. In this case one may reasonably

FIG. 1. Correction for differences in length. The value of $-0.035(N_A-N_B)^2 + 0.535|N_A-N_B|$, i.e. the correction to equation (1) implied by equation (5), is plotted against $|N_A-N_B|$.

postulate that gene duplication has occurred in the evolution of the longer sequence but not in the evolution of the shorter one. The sequences of the α-chains of human haptoglobin provide an unequivocal example of this phenomenon (Black & Dixon, 1968); the compositions of rat-liver glucokinase and rat-muscle hexokinase type II suggest that they may provide another example (Cornish-Bowden, 1977). In such cases the sort of correction for unequal length embodied in equation (2) would be inappropriate. Instead, one may postulate that the two halves of the double-length protein are similar enough in sequence for one to be able to estimate the composition of either half by halving the numbers of residues of each type in the complete composition. This halved composition may then be compared with the composition of the shorter protein by the method described in the previous section, making any further correction that may be required by any residual small difference in length.

(C) PROPERTIES OF $S\Delta n$ UNDER WEAK ASSUMPTIONS

$S\Delta n$ is an unbiased estimator of the number of sequence differences under much weaker assumptions than those that I used to derive expressions for its mean and variance (Cornish-Bowden, 1977). This may help to explain why it behaves well in practice when applied to related proteins (see the Results section of this paper), even though it can hardly be true that proteins are random sequences of amino acid residues. Although it is not easily possible to derive rigorous expressions for the mean and variance of $S\Delta n$ under weak


assumptions, one may readily deduce the assumption needed for any sequence change to have an expected effect of changing $S\Delta n$ by exactly +1. Consider any mutation of an amino acid of the $j$th type into one of the $k$th type in one only of the two sequences compared. For clarity I shall assume that only sequence B is changed, but this is not essential to the argument and the result would be the same if it were sequence A that was affected. The mutation cannot affect the 16 terms in the summation for $S\Delta n$ for which $i \neq j, k$; only $(n_{jA}-n_{jB})^2$ and $(n_{kA}-n_{kB})^2$ are affected. If the subscripts old and new refer to the states before and after the mutation respectively, then

$$n_{jB(\mathrm{new})} = n_{jB(\mathrm{old})} - 1$$

and

$$n_{kB(\mathrm{new})} = n_{kB(\mathrm{old})} + 1,$$

and so

$$(n_{jA}-n_{jB})^2_{\mathrm{new}} = (n_{jA}-n_{jB})^2_{\mathrm{old}} + 2(n_{jA}-n_{jB})_{\mathrm{old}} + 1,$$

$$(n_{kA}-n_{kB})^2_{\mathrm{new}} = (n_{kA}-n_{kB})^2_{\mathrm{old}} - 2(n_{kA}-n_{kB})_{\mathrm{old}} + 1.$$

Reference to equation (2) or equation (5) shows that the change in $S\Delta n$ must be

$$S\Delta n_{\mathrm{new}} - S\Delta n_{\mathrm{old}} = n_{jA} - n_{kA} - n_{jB(\mathrm{old})} + n_{kB(\mathrm{old})} + 1. \tag{6}$$
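Equation (6) can be checked numerically on small compositions. The following is a minimal sketch rather than anything from the paper: the helper names `half_sum_sq`, `mutate` and `predicted_change` are hypothetical, and the check uses only the first term of equation (2), because a substitution leaves both lengths, and hence the correction terms, unchanged.

```python
def half_sum_sq(comp_a, comp_b):
    """First term of equation (2): half the sum of squared count differences."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(comp_a, comp_b))


def mutate(comp, j, k):
    """Return a copy of comp with one residue of type j replaced by type k."""
    if j == k:
        raise ValueError("j and k must be different amino acid types")
    if comp[j] < 1:
        raise ValueError("no residue of type j is available to mutate")
    new = list(comp)
    new[j] -= 1
    new[k] += 1
    return new


def predicted_change(comp_a, comp_b_old, j, k):
    """Right-hand side of equation (6) for a j -> k change in sequence B."""
    return comp_a[j] - comp_a[k] - comp_b_old[j] + comp_b_old[k] + 1


# Usage: compare the predicted and the actual change for one mutation.
a, b = [5, 3, 2], [4, 4, 2]
actual = half_sum_sq(a, mutate(b, 0, 2)) - half_sum_sq(a, b)
```

With these illustrative counts the actual change and the value given by `predicted_change(a, b, 0, 2)` are both 2.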

In the absence of any systematic effects on the way in which $j$ and $k$ are selected in any mutation (assuming, for example, that there is no systematic replacement of leucine by valine), it is reasonable to assume that $(n_{jA}-n_{jB})_{\mathrm{old}}$ and $(n_{kA}-n_{kB})_{\mathrm{old}}$ have expected values of zero. This is a much weaker and more plausible assumption than that the probability distribution of the amino acids is the same at every site, and it predicts that the change in $S\Delta n$ brought about by any sequence change has an expected value of exactly +1. The derivation does, of course, assume that the two proteins compared have arisen by successive mutations from a common ancestor; if this is not true the conclusion will not be true either.

(D) EXPECTED VALUE OF $S\Delta n$ FOR UNRELATED PROTEINS

If $S\Delta n$, or any other composition index, is used to compare proteins that have no ancestral or other relationship, it is unreasonable to assume that the point probabilities $p_i$ are the same for both proteins. This would hardly matter if the proteins were known to be unrelated, because then the number of sequence differences would be of scant interest. However, if one used a composition index to investigate the possibility of relationship, as for example by the test that I suggested previously (Cornish-Bowden, 1977),


one would want an assurance that there was little danger of deducing a relationship that did not exist in reality. It is therefore appropriate to demonstrate that the effect of assuming different probability distributions for the two proteins must be to increase the expected value of $S\Delta n$ and hence to decrease the danger of obtaining a spuriously significant result. If the point probabilities are different in the two sequences the $p_i$ values used previously must be replaced by $p_{iA}$ and $p_{iB}$ for the point probabilities in sequences A and B respectively. Then, for two unrelated sequences of the same length $L$, the expected value of $S\Delta n$, defined by equation (1), is as follows:

$$\begin{aligned}
E(S\Delta n) &= E\left[\tfrac{1}{2}\sum(n_{iA}-n_{iB})^2\right] \\
&= \tfrac{1}{2}L(L-1)\sum p_{iA}^2 + \tfrac{1}{2}L - L^2\sum p_{iA}p_{iB} + \tfrac{1}{2}L(L-1)\sum p_{iB}^2 + \tfrac{1}{2}L \\
&= L\left(1-\tfrac{1}{2}\sum p_{iA}^2-\tfrac{1}{2}\sum p_{iB}^2\right) + \tfrac{1}{2}L^2\sum(p_{iA}-p_{iB})^2
\end{aligned}$$

and the expected value of $M$, the number of differences between the two sequences, is as follows:

$$E(M) = L\left(1-\sum p_{iA}p_{iB}\right).$$

Combining these two results gives

$$E(S\Delta n - M) = \tfrac{1}{2}L(L-1)\sum(p_{iA}-p_{iB})^2. \tag{7}$$

This expression cannot be negative and is zero only if $p_{iA} = p_{iB}$ for all $i$, i.e. only when the point probabilities are the same for both sequences is $S\Delta n$ an unbiased estimator of $M$. In other cases, when the compositions of unrelated proteins are compared, $S\Delta n$ is likely to overestimate the amount of difference between the sequences. The significance test for $S\Delta n$ is therefore likely to be highly conservative in practice, i.e. it is unlikely to indicate significant similarity between unrelated proteins.
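The algebra leading to equation (7) can be confirmed with exact rational arithmetic. The sketch below assumes, as the derivation does, that the counts in each sequence are binomially distributed with the stated point probabilities and that the two sequences are independent; the function names are hypothetical and the probabilities are purely illustrative.

```python
from fractions import Fraction


def expected_s(length, p_a, p_b):
    """E[half the sum of (n_iA - n_iB)^2] from binomial means and variances."""
    total = Fraction(0)
    for pa, pb in zip(p_a, p_b):
        variance = length * pa * (1 - pa) + length * pb * (1 - pb)
        mean_diff = length * (pa - pb)
        total += variance + mean_diff ** 2
    return total / 2


def expected_m(length, p_a, p_b):
    """Expected number of mismatched sites between the aligned sequences."""
    return length * (1 - sum(pa * pb for pa, pb in zip(p_a, p_b)))


def bias(length, p_a, p_b):
    """Right-hand side of equation (7)."""
    spread = sum((pa - pb) ** 2 for pa, pb in zip(p_a, p_b))
    return Fraction(length * (length - 1), 2) * spread


# Usage with three amino acid types and unequal point probabilities.
p_a = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
p_b = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
excess = expected_s(10, p_a, p_b) - expected_m(10, p_a, p_b)
```

For these values `excess` equals `bias(10, p_a, p_b)`, which is 45/8, and it falls to zero when the two probability vectors coincide.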

(E) EFFECT OF EXPERIMENTAL ERROR

In practice, composition determinations are subject to experimental error, which must affect the value of $S\Delta n$ and contribute to the error in estimating the extent of sequence similarity. If $n_{iA}$ contains an error $\varepsilon_{iA}$ and $n_{iB}$ contains an error $\varepsilon_{iB}$, the error in $S\Delta n$ is

$$\begin{aligned}
\text{Error in } S\Delta n &= \tfrac{1}{2}\sum(n_{iA}+\varepsilon_{iA}-n_{iB}-\varepsilon_{iB})^2 - \tfrac{1}{2}\sum(n_{iA}-n_{iB})^2 \\
&= \sum(n_{iA}-n_{iB})(\varepsilon_{iA}-\varepsilon_{iB}) + \tfrac{1}{2}\sum\varepsilon_{iA}^2 - \sum\varepsilon_{iA}\varepsilon_{iB} + \tfrac{1}{2}\sum\varepsilon_{iB}^2.
\end{aligned}$$

Under any plausible assumption about the distribution of errors the first summation on the right-hand side of this expression has an expected value of zero, and so

$$\text{Expected error in } S\Delta n = \sum(1-\rho_i)\sigma_i^2,$$

where $\sigma_i^2$ is the variance of $n_{iA}$ or $n_{iB}$ and $\rho_i$ is the correlation coefficient between the errors in determining the $i$th type of amino acid in different proteins. As some amino acids are consistently underestimated whereas others are consistently overestimated, it is likely that $0 \le \rho_i < 1$ for all $i$ values and so the expected error in $S\Delta n$ is unlikely to exceed $\sum\sigma_i^2$. Regardless of the values of the $\rho_i$, the expected effect of experimental error is to increase $S\Delta n$ and to decrease the chance of detecting sequence similarity.

The value of $\sum\sigma_i^2$ in practice must depend on the techniques used and the care taken, and presumably therefore varies from laboratory to laboratory. However, examination of data for several proteins for which compositions are published both as directly measured and as calculated from the sequences (Ramshaw, Scawen, Bailey & Boulter, 1974; Kelly & Ambler, 1974; Milne, Wells & Ambler, 1974; Aitken, 1975) suggests that a typical value of $\sum\sigma_i^2$ for a 100-residue protein is about 1.5. Making the reasonable assumption that the coefficient of variation for determination of each type of amino acid is

TABLE 1

Comparison of two proteins of different lengths

Columns: Amino acid; Long neurotoxin 1; Cytotoxin 1; Difference squared

Rows: Aspartate + asparagine; Threonine; Serine; Glutamate + glutamine; Proline; Glycine; Alanine; Valine; Methionine; Isoleucine; Leucine; Tyrosine; Phenylalanine; Histidine; Lysine; Arginine; Cysteine; Tryptophan; Total

6

6

1

3

3 5 8 3 ! 3 1 2 3 3 0 1 9 3 10 2 71t

4 z 3 2 6’ 2 3 5 3 0 1 7 z 8 0 6ot

0 16 I 9 16 I 0 9 I 1 3 0 0 0 4 1 3 3 71t

† Insertion of these totals into equation (5) gives $S\Delta n = 35.5 - 4.235 + 5.885 = 37.2$.


approximately independent of the total number of residues, one would expect $\sum\sigma_i^2$ to increase with the square of the length of the protein. Thus one might predict a trivial bias (less than +0.5) in $S\Delta n$ values calculated from compositions of proteins with fewer than 60 residues, but a bias as high as +40 for proteins with more than 500 residues. However, even the latter bias is likely to be unimportant in comparison with the inherent ±40% imprecision in estimating the number of sequence differences from $S\Delta n$.

3. Example

Table 1 shows how the compositions of long neurotoxin 1 and cytotoxin 1, both from forest cobra, can be used to estimate the amount of sequence identity between them. The crude value of $S\Delta n$ given by equation (1) is 71/2 = 35.5, and the difference in lengths is |71 - 60| = 11, so the corrected value of $S\Delta n$ is 35.5 - 0.035 × 121 + 0.535 × 11 = 37.2. This is an estimate of the number of residues in the longer sequence, long neurotoxin 1, that are unmatched when it is aligned with cytotoxin 1. Reference to the sequences (Dayhoff, 1976, p. 150) shows the correct value to be 51, so the value of 37.2 estimated from the compositions is about 27% low.
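The arithmetic of this example can be reproduced in a few lines. This is only a sketch: the function name `corrected_from_totals` is hypothetical, and the inputs are the totals quoted above (a sum of squared differences of 71, lengths of 71 and 60 residues, and 51 observed unmatched residues).

```python
def corrected_from_totals(sum_sq_diff, n_a, n_b, sum_p2=0.07):
    """Equation (5) evaluated from the sum of squared differences and lengths."""
    d = abs(n_a - n_b)
    return 0.5 * sum_sq_diff - 0.5 * sum_p2 * d * d + 0.5 * (1.0 + sum_p2) * d


# Long neurotoxin 1 (71 residues) against cytotoxin 1 (60 residues).
estimate = corrected_from_totals(71, 71, 60)  # 35.5 - 4.235 + 5.885
observed = 51                                 # unmatched residues in the sequences
shortfall = (observed - estimate) / observed  # fraction by which the estimate is low
```

The estimate is 37.15 before rounding, and the shortfall relative to the observed value is close to 0.27.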

4. Results

(A) GENERAL NOTE

The purpose of this paper is to assess the reliability of conclusions derived from composition data in the usual practical situation in which direct sequence information is not available. For this reason, the results given in this paper take no account of amide assignments, which are not usually known from composition measurements. Thus even though most of the sequences used in this study do distinguish between glutamate and glutamine, and between aspartate and asparagine, these distinctions are ignored and the word identical is used to mean “identical apart from possible differences in amide assignments”. As a result, for most comparisons the numbers of sequence identities counted are slightly higher than the numbers given elsewhere, for example by Dayhoff (1972, 1973, 1976).

(B) TESTING FOR POSSIBLE RELATEDNESS

Table 2 shows the results of testing 83 proteins in pairs for significant compositional or sequence similarity. The sequences were chosen from

TABLE 2

Values of $S\Delta n$ for unrelated pairs of proteins

Column headings: Protein (source); Reference†; N; Compared with the protein one above in the Table (M, $S\Delta n$, Ratio); Compared with the protein two above in the Table (M, $S\Delta n$, Ratio)

Proteins, in order of increasing length:
Bombinin (European fire-bellied toad)
Melittin (honey bee)
Secretin (pig)
Glucagon (pig)
Calcitonin (human)
Protamine (tuna)
Corticotropin (cow)
Gastric inhibitory polypeptide (pig)
Viscotoxin A3 (mistletoe)
Posterior pituitary peptide (cow)
Thrombin A chain (cow)
Rubredoxin (Peptostreptococcus elsdenii)
Ferredoxin (Clostridium pasteurianum)
Basic trypsin inhibitor (cow)
Cardiotoxin§ (Formosan cobra)
Neurotoxin β§ (Cape cobra)
Neurotoxin δ§ (Cape cobra)
Erabutoxin§ (sea snake)
Neurotoxin α§ (Cape cobra)
α-Bungarotoxin§ (Formosan banded krait)
Acyl carrier protein (E. coli E-26)
Proinsulin (cow)
Cytochrome c551 (Pseudomonas aeruginosa)
Protease inhibitor (lima bean)
Parathyroid hormone (cow)
Haptoglobin α1 chain (human)
Antibacterial substance A (Streptomyces carzinostaticus)
Lipotropin β (sheep)
Chorionic gonadotropin α-chain (human)
Cytochrome b5 (cow)

D-224 D-223 D-207 D-208 D-205 s-71 D-196

24 26 27 29 32 34 39

23.0 25.5 15: 30.3 33.0 36.3

21.9 24.0 17.9 218 255.9 196.3

@95 0.94 1.20 0.72 I.76 540

25.2 28.2 30.0 323 37.6

42.8 318 38.3 2093 39.5

1.69 1.13 1.28 6.47 1.05

s-51 D-226

43 46

40.6 43.0

346 68.8

0.85 160

41.8 442

267.5 65.5

6.40 1.48

D-194 D-109

48 49

45.3 45.5

509 64.0

1.12 1.41

45.2 463

67.3 119.8

1.49 2.59

D-46

52

49.3

97.8

1.99

4X.8

13.6

1 51

D-43

55

50.0

ax

0.82

53.1

136.0

2.56

D-168

58

54.8

73.8

1.35

54-3

44.0

0.81

D-219 D-217 D-217 D-219 D-220

60 61 61 62 71

54.7 540 ll.O$ 19.01 57.0:

74.9 82.0 14.01 19.011 69.5

I 37 1.52 1.27 1W 1.22

57.2 56.3 53.5 20.01 55.0f

105.3 51.8 90.0 32.0 49.9

I .x4 0.93 1.68 1.60 0.91

S-56

74

41 ,o:

41.8

1.02

69.3

44-4

0.64

D-324 D-209

71 81

7@8 75.4

204.8 141.6

289 1.88

74.0 75.9

252.0 142.5

340 I ,88

D-26

82

74.5

110.0

1.48

77-O

98.3

1.28

D-171 D-205 D-314

84 84 84

80.3 84.0 80.0

280.9 229.0 101.0

3.50 2.13 1.26

77.3 78.3 79.0

298.8 70.9 204.0

3.87 0.91 2.58

D-226 D-197

87 90

82.0 82.3

152-R 163.8

1.86 1.99

X1.8 84.7

1X3.8 178.0

2.25 2.10

s-47 D-31

92 93

85.3 84.5

156.9 110.0

I.84 I.31

86.3 86.5

158.3 89.8

I .83 l-04

TABLE 2-continued

Column headings: Protein (source); Reference†; N; Compared with the protein one above in the Table (M, $S\Delta n$, Ratio)

Proteins, in order of increasing length:
Thyrotropin α chain (cow)
Keratin, high-S fraction (sheep)
Neurophysin (cow)
β2-Microglobulin (human)
Cytochrome c (horse)
Ribonuclease T1 (Aspergillus oryzae)
Cytochrome c3 (Desulfovibrio vulgaris)
Thioredoxin (E. coli)
Cytochrome c2 (Rhodospirillum rubrum)
Haemerythrin (sipunculid worm)
Thyrotropin β-chain (cow)
Adrenodoxin (cow)
Nerve growth factor (mouse)
Lactalbumin (guinea pig)
Ribonuclease (cow)
Histone IIB2 (cow)
Azurin (Pseudomonas fluorescens)
Avidin (chicken)
Lysozyme (human)
Histone IIB1 (cow)
Prophospholipase A2 (pig)
Histone III (cow)
Flavodoxin (Peptostreptococcus elsdenii)
Chorionic gonadotropin β-chain (human)
Haemoglobin α-chain (human)
Haemoglobin β-chain (human)
Haemoglobin γ-chain (human)
Nuclease (Staphylococcus aureus V8)
Aspartate transcarbamylase R chain (E. coli)
Myoglobin (horse)
Coat protein (tobacco mosaic virus vulgare)
Myelin membrane encephalitogenic protein (cow)
Growth hormone (cow)

D-198

96

90.3

128.8

1.43

29.01

D-303 S-42 S-67 D-15

97 97 100 104

92.5 86.0 93.0 98.2

131.0 183.0 265.8 199.6

1.42 2.13 2.86 2.03

92.8 90.5 94.8 98.1

2786 192.0 2198 322.5

3.00 2.12 2.32 3.29

D-132

104

100.0

3540

3.54

97.4

128.6

1.32

D-25 D-50

107 108

101.0 101.5

297.8 229.0

2.95 2.26

98.3 99.4

155.8 240.6

1.59 2,42

D-25

112

103.2

85.6

0.83

103.8

110.3

1.06

D-50 D-200 D-49 S-53 D-136 D-130 D-279

113 113 118 118 123 124 125

104.5 107.0 111.3 108-O 115.2 114.0 113.5

143.0 203.0 231.3 164.0 258.3 215.0 162.0

1.37 1.90 2.08 1.52 2.24 1.89 1.43

105.7 1060 108.3 110.5 114.5 114.9 115.7

88.3 204-o 120.3 148.3 77.3 90.0 329.9

0.84 192 1.11 1.34 0.68 0.78 2.85

D-47 D-319 D-l37 S-68 D-156 S-69

128 128 129 129 130 135

118-8 116.0 122.0 123.0 1220 125.7

212.8 116.0 240.0 201.0 406.0 442.3

1.79 1.97 1.63 3.33 3.52

117.4 120.5 124.5 120.0 120.5 123.4

130.6 259.8 2040 3270 1540 89Q

1.11 2.16 164 2.73 1-28 0.72

S-18

137

124.0

271.9

2-19

128.9

302.5

2.35

s-47 D-56 D-64 D-77

139 141 146 146

130.0 440.9 131.7 426.9 lOO.O$ 91.3 39.ot 61.011

3.39 3-24 o-91 1.56

130.6 133.0 138.5 lOl.O$

400.6 257.6 407.5 116.3

3.07 1.94 2.94 1.15

D-132

149

139.8

169.8

1.21

138.3

2268

164

D-163 D-82

152 153

141.8 140.5

203.8 245.0

144 l-74

141.9 141.4

1360 I196

0.96 0.85

D-285

158

149.2

476.3

3.19

146.1

137.0

0.94

D-324 D-203

170 189

161.2 177.1

497.4

456-O

3.09 2.58

1609 177.8

4205 239.5

2.61 1.35

1Qo

Compared with the protein two above in the Table M SAn Ratio 20.611 0.71

TABLE 2-continued

Column headings: Protein (source); Reference†; N; Compared with the protein one above in the Table (M, $S\Delta n$, Ratio); Compared with the protein two above in the Table (M, $S\Delta n$, Ratio)

Proteins, in order of increasing length:
Lactogen (human)
α-Lytic protease (Myxobacter 495)
Prolactin (sheep)
αs1-Casein (cow)
Papain (papaya)
Immunoglobulin λ-chain, V-III (human SH)
Immunoglobulin κ-chain, V-I (human AG)
Immunoglobulin κ-chain, V-III (human TI)
Immunoglobulin λ-chain, V-I (human NEW)
Immunoglobulin κ-chain, V-II (human CUM)
Trypsinogen (cow)
Enterotoxin B (Staphylococcus aureus S-6)
Elastase (pig)
Chymotrypsinogen A (cow)
Penicillinase (Staphylococcus aureus PC-1)
Tryptophan synthetase α-chain (E. coli K12)
Subtilisin (Bacillus amyloliquefaciens)
Carboxypeptidase A (cow)
Glyceraldehyde 3-phosphate dehydrogenase (pig)
Alcohol dehydrogenase E-chain (horse)

D-201

190

121.0:

1004

0x3

178.5

601-7

3.37

D-110 D-201 D-320 D-121

198 198 199 212

1866 182.0 181.5 197.7

846.0 659.0 267.0 579.5

4.54 3.62 1.47 293

1859 677-5 1x1.7 86.0 191.5 1094.0 700.1 382.6

3.64 0.47 5.7 1 1.91

D-257

213

200~5

376.0

1.88

199.1

479.6

2.41

D-241

214

170.0$

93.0

055

196.7

400.9

2.04

D-251

215

25.0t

64.01

7.56

136.0$

69.9

0.51

D-255

216

159-01 124.0

0.78

170.0f

134.9

0.79

D-247 D-105

221 229

142.01 144.3 2148 321.0

1Q2 1.49

28.0: 212.9

30.01 291.5

1.07 I.37

D-227 D-108 D-107

239 240 245

224.1 92@9 228.5 1134.0 2247 193.3

4.1 1 496 @86

2248 894.3 319.1 334.2 227.7 1082.0

3.98 1.53 4.75

D-159

257

241.0

894.4

3.71

2409

1211.5

5.03

D-163

267

247.0 10749

4.35

249.4

756.8

3.03

D-122 D-126

275 307

255.0 287.1

852.0 664.3

3.34 2.31

‘55.9 1319.3 288.1 719.4

5,15 2.50

D-147

332

311.7

572.0

1.84

311.x

467.3

1.50

D-145

374

35@4

3167

0.90

354.1

715.2

2.02

† Page numbers, in Dayhoff (1972) if prefixed D-, in Dayhoff (1973) if prefixed S-.
‡ Significant sequence similarity at 99.9% confidence.
§ Snake venom toxins are listed in this Table with the names used by Dayhoff (1972). For the relationship of these names with the more systematic nomenclature used by Dayhoff (1976) and later in this paper, see Dayhoff (1976, p. 147).
‖ Significant composition similarity according to the 95% confidence test described in Cornish-Bowden (1977), i.e. $S\Delta n$ less than 42% of the length of the shorter sequence.

Dayhoff (1972, 1973) to form a series of length varying from 24 to 374 with, as far as possible, no neighbouring pair differing in length by more than a few percent. Only fully determined sequences were included, and no more than


one protein listed with the same name was used. The sequences were then arranged in order of increasing length [retaining the order used by Dayhoff (1972, 1973) for proteins of equal length] and each was then compared with the proteins 1 and 2 places ahead of it in the order. In all, therefore, the Table contains 163 comparisons, and each protein, apart from those at the beginning and end, appears in four comparisons.

For each comparison, the two sequences were aligned in all possible ways that permitted no gaps except at the beginning and end of the longer sequence. For lengths $N_A$ and $N_B$, this allowed $|N_A-N_B|+1$ alignments, and for each of these the sequences were considered to be significantly similar if the prior likelihood of at least the observed number of identities was less than $0.001/(|N_A-N_B|+1)$, assuming that the probability of a random identity was 0.07. Thus the test of significant sequence similarity was formally a 99.9% significance test. This high level of significance was chosen because genuinely related sequences are usually so obviously related that a statistical test is hardly necessary. (For example, there are 46 identities between human α- and β-haemoglobins, but as few as 22 would be sufficient to satisfy the 99.9% significance test.)

For each composition comparison, $S\Delta n$ was calculated as defined by equation (5) and was considered significant if it was less than 42% of the length of the shorter sequence: this is formally a 95% significance test if applied to sequences of equal length with amino acids distributed in both sequences with the same point probabilities (Cornish-Bowden, 1977), but one would expect it to be a much more conservative test when applied to unrelated proteins [cf. equation (7)].

Of the 163 comparisons in Table 2, 19 showed significant sequence similarity. Of these 19, 7 also showed significant composition similarity, but 12 did not.
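The two significance tests just described can be sketched as follows. The paper does not say how the prior likelihood was evaluated, so the binomial tail over the aligned sites used here is an assumption, and the function names are hypothetical. The thresholds of 0.001, 0.07 and 42% are the values quoted in the text.

```python
from math import comb


def binomial_tail(n_sites, k, p):
    """P(X >= k) for X ~ Binomial(n_sites, p)."""
    return sum(comb(n_sites, i) * p ** i * (1 - p) ** (n_sites - i)
               for i in range(k, n_sites + 1))


def sequence_threshold(n_a, n_b, alpha=0.001):
    """Per-alignment threshold: alpha shared over |N_A - N_B| + 1 alignments."""
    return alpha / (abs(n_a - n_b) + 1)


def min_significant_identities(n_sites, n_a, n_b, p=0.07, alpha=0.001):
    """Smallest identity count whose tail probability is below the threshold."""
    limit = sequence_threshold(n_a, n_b, alpha)
    for k in range(n_sites + 1):
        if binomial_tail(n_sites, k, p) < limit:
            return k
    return None  # no count of identities reaches significance


def composition_significant(s_delta_n, n_a, n_b, fraction=0.42):
    """Composition test: index below 42% of the shorter length."""
    return s_delta_n < fraction * min(n_a, n_b)
```

Sharing the 0.001 level equally among the possible alignments keeps the overall test at the stated 99.9% level whichever alignment gives the most identities.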
This accords with one’s intuition that it should be more difficult to detect similarity when the sequences are not known. Of the remaining 144 comparisons that showed no significant sequence similarity, all 144 failed to show significant composition similarity.

In summary, the composition significance test proved highly conservative in practice, as predicted by equation (7). Although one might anticipate about 9 spuriously significant results out of 144 trials with a 95% test, none were observed. This suggests that in general a value of $S\Delta n$ less than 42% of the length of the shorter sequence is virtually certain to indicate an appreciable amount of sequence identity.

(C) RELATED PROTEINS

To test the properties of $S\Delta n$ when applied to pairs of related proteins, for which the statistical assumptions underlying the theory of $S\Delta n$ should be


approximately correct, seven sets of related sequences were selected from Dayhoff (1976). These ranged from the snake venom toxins, a diverse set with 77% sequence difference between the most dissimilar pair and lengths ranging from 60 to 74 residues, to the α-chains of haemoglobin from primates, a much more homogeneous set with only 11% sequence difference between the most dissimilar pair and all with 141 residues. Apart from a slight tendency of $S\Delta n$ to overestimate the number of sequence differences, which is understandable in the light of equation (7), the results, which are summarized in Table 3, are remarkably good. Not only is the mean value of $S\Delta n/M$ within a few percent of the predicted value of 1.0 in every case, but the coefficient of variation is in every case close to the predicted value of 38% (Cornish-Bowden, 1977). Although the scatter of $S\Delta n$ values about their means is considerable, it is not so great as to vitiate $S\Delta n$ as a guide to the extent of sequence identity. This is illustrated in Fig. 2, which is a plot of all the results obtained with the

TABLE 3

Values of $S\Delta n$ for sets of related proteins

Column headings: Protein set; Ref.†; Number of sequences; Number of pairs

Protein sets: Insulin‖; Snake venom toxins‖; Cytochrome c‖; Pancreatic ribonuclease¶; α-Haemoglobin††; β, δ-Haemoglobin‡‡; Myoglobin‖

128

18

153

150 34

37 37

666 666

78 212 210 206

11 10 20 15

55 45 190 105

Diversity‡; Length; $S\Delta n/M$: Mean, C.V.§

so 52

l-242

X,7”,,

77” I/ w,,

60 74 102 113

I.173 1,109

41 3”,, 457”,,

39” () 1 I”,, 28” 5ou::

124- 127 141 144 153

1.704 1.108 1 130 0.972

30.9” ,, 42.40,, 385”,, 39.?“,,

† Page number in Dayhoff (1976).
‡ Sequence difference between the most dissimilar pair.
§ Coefficient of variation, calculated by taking the number of degrees of freedom as 1 less than the number of sequences, $n_{\mathrm{seq}}$. The variance was therefore estimated by dividing the sum of squares of deviations from the mean by $(n_{\mathrm{seq}}-1)^2/2$, rather than by the number of pairs, $\tfrac{1}{2}n_{\mathrm{seq}}(n_{\mathrm{seq}}-1)$.
‖ All of those given in Dayhoff (1976) were used.
¶ All pancreatic ribonuclease sequences in Dayhoff (1976) were used, i.e. the enzyme from bovine semen was excluded.
†† All available primate sequences were used apart from the two from irus macaque, which lack residues 105-139.
‡‡ All available primate sequences, including human δ-haemoglobin, were used apart from β-haemoglobin from brown lemur, which lacks residues 121-132.


[Figure 2: scatter plot of SAn against the number of sequence differences, M (0 to 40), with a second scale giving % identity (100 down to 70).]

FIG. 2. Composition and sequence differences between myoglobins. All possible (105) pairs of 15 myoglobin sequences were compared. For each pair the value of SAn, an estimate derived from the compositions of the number of sequence differences, is plotted against M, the actual number of sequence differences. Diagonal groups of points represent multiplets that would be superimposed if plotted in true positions. The continuous line shows the expected value of SAn as an unbiased estimator of M, and the broken lines are drawn at ±1 standard deviation, assuming a coefficient of variation of 38%.

set of 15 myoglobin sequences. The worst group of results in this example consists of the six points at M = 19, which show SAn values ranging from 5 to 40; but this 8-fold range in estimates of the amount of difference between the sequences corresponds to a usefully narrow range, from 74% to 97%, of estimates of the amount of similarity between them, the correct value being 88%. The same effect operates in reverse when very dissimilar sequences are compared, but this is not a severe drawback because there is little value in being able to quantify the amount of similarity accurately if it does not exceed random chance.
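The arithmetic behind the M = 19 example can be checked with a short sketch. It assumes only what is stated in the text and in the legend to Fig. 2: a myoglobin length of 153 residues, an expected value of SAn equal to M, and a coefficient of variation of 38%. The function names `percent_similarity` and `one_sd_band` are illustrative and do not come from the original paper.

```python
def percent_similarity(differences, length):
    """Percentage of identical positions implied by a number of differences."""
    return 100.0 * (1.0 - differences / length)


def one_sd_band(m, cv=0.38):
    """Return (M - sd, M + sd) for an unbiased estimator of M with the given C.V."""
    sd = cv * m
    return m - sd, m + sd


LENGTH = 153                 # residues in myoglobin
M_TRUE = 19                  # actual number of sequence differences
SAN_LOW, SAN_HIGH = 5, 40    # extreme composition-based estimates at M = 19

# An 8-fold range in estimated differences is a narrow range in similarity.
print(round(percent_similarity(SAN_LOW, LENGTH)))    # 97
print(round(percent_similarity(SAN_HIGH, LENGTH)))   # 74
print(round(percent_similarity(M_TRUE, LENGTH)))     # 88

# Position of the broken lines of Fig. 2 at M = 19.
low, high = one_sd_band(M_TRUE)
print(f"{low:.1f} to {high:.1f}")
```

The same functions show why the effect reverses for dissimilar sequences: when the number of differences approaches the sequence length, a given relative error in SAn becomes a large relative error in the small remaining similarity.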

(D) PHYLOGENETIC TREES DERIVED FROM COMPOSITION DATA

In spite of the ease with which protein compositions can be determined, they have not often been used to deduce evolutionary relationships between proteins, because they have been thought to provide too crude a measure of


[Figure 3: two dendrograms for the snake venom toxins, (a) drawn on a scale of M and (b) drawn on a scale of SAn; the leaf labels include green mamba 1 and green mamba 2.]

FIG. 3. Phylogenetic trees for the snake venom toxins. The trees were calculated by the UPGMA method (Sneath & Sokal, 1973) from: (a) values of M, the number of unpaired residues in the longer sequence when two sequences are aligned as given in Dayhoff (1976); (b) values of SAn calculated from the compositions according to equation (5). The main classes of toxin (long neurotoxins, short toxins, short neurotoxins and cytotoxins) are each represented by a separate symbol [symbols not recoverable]. The nomenclature of Dayhoff (1976) is used, except that "broad-banded blue sea-snake" is abbreviated to "blue sea snake".

sequence similarities. However, the micro-complement fixation method, an immunological technique introduced by Sarich & Wilson (1966), has been used extensively in evolutionary studies even though it is of comparable precision. Prager & Wilson (1971) have shown that it can provide phylogenetic trees that are similar to those obtained from comparisons of the sequences of the same proteins. The statistical character of the micro-complement fixation method is less clear than that of composition indexes, because the relationship between sequence and antigenicity is much less well understood than the relationship between sequence and composition. Nonetheless, Nei (1977) has recently analysed several sets of data and has estimated that the coefficient of variation of the micro-complement fixation index of difference


is about 60%, appreciably worse than the 40% predicted for SAn (Cornish-Bowden, 1977) and observed in practice with the seven sets of proteins listed in Table 3.

Figure 3 shows phylogenetic trees for the snake venom toxins derived from their sequences and compositions by the simple clustering method described by Nei (1975). [In the terminology of Sneath & Sokal (1973), this is the UPGMA method, or "unweighted pair-group method using arithmetic averages".] Although there are obvious differences between the two trees, the similarities are nonetheless striking. The main defect of the composition tree is that it separates the long neurotoxins into two groups that appear to be more distantly related than they are in fact. The other major groups, the short neurotoxins and the cytotoxins, are correctly grouped and separated both from each other and from the long neurotoxins. The composition tree also places the two short toxins from green mamba, neither of which has any close relatives, in two separate groups, though the estimated 37 differences between them is not far from the correct value of 33. Within the main groups the toxins are grouped almost correctly: there are no important errors within the group of long neurotoxins; the arrangement of the cytotoxins would be essentially correct if the positions of cytotoxin 2 from Indian cobra and cytotoxin 4 from Mozambique cobra were interchanged; the arrangement of the short neurotoxins is less good but still similar to that in the tree derived from the sequences. Similar pairs of trees have been constructed for the other sets of proteins in Table 3, with similar results: in no case is there perfect agreement about grouping, but in all cases the results are similar enough to suggest that composition comparisons can provide a useful guide to likely evolutionary relationships.
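The clustering step used for trees of this kind can be illustrated with a minimal sketch of UPGMA in Python. The function name `upgma`, the four labels A to D and their pairwise distances are hypothetical and are not taken from Fig. 3 or from Dayhoff (1976); the same routine could in principle be applied to a matrix of either M or SAn values. Each join is reported with the average between-cluster distance at which it occurs, leaving the choice of how to scale branch lengths to the reader.

```python
from itertools import combinations


def upgma(labels, distance):
    """Cluster `labels` by UPGMA (unweighted pair-group, arithmetic averages).

    `distance` maps frozenset({x, y}) to the dissimilarity between leaves x and y.
    Returns (tree, merges): `tree` is a nested tuple of labels, and `merges`
    lists (leaf_members, average_distance) for each join, in order.
    """
    # Active clusters: id -> (subtree, tuple of leaf labels in that subtree).
    active = {i: (label, (label,)) for i, label in enumerate(labels)}
    merges = []
    next_id = len(labels)

    def average(leaves_a, leaves_b):
        # Arithmetic mean over every leaf pair spanning the two clusters.
        total = sum(distance[frozenset((x, y))] for x in leaves_a for y in leaves_b)
        return total / (len(leaves_a) * len(leaves_b))

    while len(active) > 1:
        # Join the pair of active clusters with the smallest average distance.
        i, j = min(
            combinations(sorted(active), 2),
            key=lambda p: average(active[p[0]][1], active[p[1]][1]),
        )
        tree_i, leaves_i = active.pop(i)
        tree_j, leaves_j = active.pop(j)
        merges.append((leaves_i + leaves_j, average(leaves_i, leaves_j)))
        active[next_id] = ((tree_i, tree_j), leaves_i + leaves_j)
        next_id += 1

    tree, _ = next(iter(active.values()))
    return tree, merges


# Hypothetical pairwise differences between four sequences (not data from the paper).
toy = {
    frozenset("AB"): 4, frozenset("AC"): 10, frozenset("AD"): 12,
    frozenset("BC"): 10, frozenset("BD"): 12, frozenset("CD"): 6,
}
tree, merges = upgma(["A", "B", "C", "D"], toy)
print(tree)      # (('A', 'B'), ('C', 'D'))
for members, avg in merges:
    print(members, avg)
```

Running the function once on sequence-derived distances and once on composition-derived distances, and comparing the two nested tuples, mirrors the comparison of trees (a) and (b) described above.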
5. Discussion

The results of this paper show that protein composition indexes can provide useful information about sequence similarity in both of the main contexts in which they are likely to be used. First, in comparisons between proteins that are not known to be related, the test of significant composition similarity (Cornish-Bowden, 1977) provides very conservative results: in all of the cases studied a significant result was obtained in this test only when the sequences were indeed significantly similar; in 144 comparisons between pairs of proteins for which no significant sequence similarity was detected, the composition test indicated no significant similarity of composition. Second, when applied to seven sets of related proteins, the composition index SAn gave results in good agreement with theory, both mean and dispersion being close to the predicted values.


Black & Harkins (1977) have shown that the similarities between the compositions of cytochrome c from different species agree well with the phylogenetic relationships deduced from the sequences. Moreover, both theory (Cornish-Bowden, 1977) and the results of this paper indicate that the scatter of SAn values about their theoretical values is substantially less than the scatter of values given by the micro-complement fixation method (Nei, 1977), a method that has been extensively used for the construction of phylogenetic trees (Prager & Wilson, 1971). This raises the question of whether Prager & Wilson (1971) are justified in their view that comparison of compositions is not sensitive enough to provide a useful method of constructing phylogenetic trees. The results of this paper, and in particular the trees in Fig. 3, suggest that they are not: the tree drawn for the snake venom toxins from their compositions is similar in most respects to that drawn from the sequences. One may argue therefore that amino acid compositions offer a valuable source of information about evolutionary relationships, which has so far remained largely untapped.

REFERENCES

AITKEN, A. (1975). Biochem. J. 149, 675.
BLACK, J. A. & DIXON, G. H. (1968). Nature, Lond. 218, 736.
BLACK, J. A. & HARKINS, R. N. (1977). J. theor. Biol. 66, 281.
CORNISH-BOWDEN, A. (1977). J. theor. Biol. 65, 735.
CORNISH-BOWDEN, A. (1978a). J. theor. Biol. 74, 155.
CORNISH-BOWDEN, A. (1978b). Biochem. Soc. Trans. 6, 767.
CORNISH-BOWDEN, A. & MARSON, A. (1978). J. mol. Evol. 10, 231.
DAYHOFF, M. O. (1972). Atlas of Protein Sequence and Structure, Vol. 5. Silver Spring: National Biomedical Research Foundation.
DAYHOFF, M. O. (1973). Atlas of Protein Sequence and Structure, Vol. 5, suppl. 1. Silver Spring: National Biomedical Research Foundation.
DAYHOFF, M. O. (1976). Atlas of Protein Sequence and Structure, Vol. 5, suppl. 2. Silver Spring: National Biomedical Research Foundation.
DEDMAN, J. R., GRACY, R. W. & HARRIS, B. G. (1974). Comp. Biochem. Physiol. 49B, 715.
HARRIS, C. E., KOBES, R. D., TELLER, D. C. & RUTTER, W. J. (1969). Biochemistry 8, 2442.
HARRIS, C. E. & TELLER, D. C. (1973). J. theor. Biol. 38, 347.
HOLMQUIST, R. & MOISE, H. (1975). J. mol. Evol. 6, 1.
KELLY, J. & AMBLER, R. P. (1974). Biochem. J. 143, 681.
MARCHALONIS, J. J. & WELTMAN, J. K. (1971). Comp. Biochem. Physiol. 38B, 609.
METZGER, H., SHAPIRO, M. B., MOSIMANN, J. E. & VINTON, J. E. (1968). Nature, Lond. 219, 1166.
MILNE, P. R., WELLS, J. R. E. & AMBLER, R. P. (1974). Biochem. J. 143, 691.
NEI, M. (1975). Molecular Population Genetics and Evolution, pp. 197-202. Amsterdam: North-Holland Publishing Company.
NEI, M. (1977). J. mol. Evol. 9, 203.
PRAGER, E. M. & WILSON, A. C. (1971). J. biol. Chem. 246, 5978.
RAMSHAW, J. A. M., SCAWEN, M. D., BAILEY, C. J. & BOULTER, D. (1974). Biochem. J. 139, 583.
SARICH, V. M. & WILSON, A. C. (1966). Science 154, 1563.
SNEATH, P. H. A. & SOKAL, R. R. (1973). Numerical Taxonomy, pp. 230-234. San Francisco: W. H. Freeman & Company.
