Use of homologous sequences to improve protein secondary structure prediction.


[4] Use of Homologous Sequences to Improve Protein Secondary Structure Prediction

By THOMAS NIERMANN and KASPER KIRSCHNER

General Introduction

The development of cDNA cloning and sequencing techniques has led to an accelerated discovery of new protein sequences, but without concomitant information on structure. Because sequence determines structure, the veritable flood of new protein sequences is both a challenge and an opportunity to improve the accuracy of predicting secondary structure.

How can one exploit the information content of new single protein sequences? If the function of the protein is not known, a search of the database for proteins with similar sequences may reveal that the new entry is a member of a known family of homologous proteins, that is, proteins with the same function in different organisms, which have arisen from the same primordial ancestor. If the structure of one family member is known, a tentative three-dimensional model of the new member may be constructed. By contrast, if the function is known, significant sequence similarity can also be found, in principle, to proteins with a different function, that is, paralogous proteins. Such similarity indicates independent evolution of the new sequence after gene duplication of the primordial ancestor.

Because sequence determines the secondary structure of segments of the polypeptide chain, several strategies have been attempted for predicting secondary structure, but they rarely achieve more than 65% correctly predicted residue positions.² Nevertheless, the analysis gives some idea of the structural class (all α, all β, α/β, or α + β) of the new sequence.³ Moreover, the predicted sequence of secondary structural elements is a pattern, which can be used to enhance the discrimination of sequence similarity searches. Linear patterns of various physicochemical properties of the amino acid side chains have also been implemented for this purpose.

It is desirable to improve the accuracy of predicting secondary structure. Progress in these efforts could lead ultimately to the correct prediction of chain folding.⁴ Fortunately, new coding DNA sequences are not chosen at random for cloning and sequencing. Frequently new homologous sequences from organisms living in different environments (e.g., thermophilic organisms) are sought for comparative purposes. Examples are the enzymes of a coordinately regulated pathway such as tryptophan biosynthesis or the targets of important drugs such as dihydrofolate reductase.

We have been developing an approach toward improving the score of predicting secondary structure by exploiting the structural information hidden in sets of aligned, homologous sequences.⁵ The protein folding code is degenerate in the sense that many different sequences can fold to stable proteins with the same structure. Averaging of the empirical folding information at aligned residue positions improves the prediction of secondary structure decisively.⁶,⁷ Importantly, the averaged property profiles that are correlated to particular secondary structural elements are useful quantitative weights for enhancing the prediction score. Analysis of the sequence of predicted secondary structural elements can reveal the folding topology of the protein family in certain cases.⁸,⁹

¹ R. A. Jensen, Annu. Rev. Microbiol. 30, 409 (1976).
² G. D. Fasman, in "Prediction of Protein Structure and Principles of Protein Conformation" (G. D. Fasman, ed.), p. 193. Plenum, New York, 1989.
³ M. Levitt and C. Chothia, Nature (London) 261, 552 (1976).

METHODS IN ENZYMOLOGY, VOL. 202

Copyright © 1991 by Academic Press, Inc. All rights of reproduction in any form reserved.

Alignment

Rules and Strategies

It is known from a number of three-dimensional protein structures that homologous enzymes, which catalyze the same metabolic reaction in different organisms, adopt the same folding topology. The divergence of sequence and structure of eight different protein families has recently been summarized.¹⁰ Details in the tertiary structure of homologous proteins may vary significantly as a result of amino acid substitutions, insertions, or deletions. It is known from spatial superpositions of homologous structures that insertions or deletions in the amino acid sequence occur mainly in surface loops.¹¹,¹² Although core secondary structural elements generally exhibit similar relative positions, it has been observed that they may vary significantly in length¹⁰,¹³ or are absent.¹⁴

The principal procedure to align protein sequences is to maximize the identity or similarity at aligned positions. Residues that perform equivalent structural and functional roles are usually invariant. The average probability of residue exchange by point mutation (the Dayhoff similarity matrix¹⁵) is generally used as an adequate measure of similarity. Gaps must be introduced into individual sequences to accommodate insertions in other sequences.

Any inference from an alignment depends on its accuracy. The confidence with regard to an alignment is limited if the number of identical residues drops below a threshold value¹⁶,¹⁷ and if they are not evenly distributed over the entire sequence. As a consequence, below the threshold value the average mutation probability does not provide sufficient information to align sequences properly, and partially erroneous alignments have resulted.¹⁶,¹⁷ Matrices analogous to the Dayhoff matrix but obtained from homologous proteins of known three-dimensional structure proved to be more sensitive than the Dayhoff matrix when incorporated into alignment programs.¹⁸

Computer programs to automate the alignment procedure have been developed since 1970,¹⁹ but only recently have automated procedures for multiple sequence alignment been published.²⁰ The programs either combine pairwise aligned sequences, proceeding from the pair with the highest similarity to the one with the least, or consider all sequences simultaneously, thus performing a true multisequence alignment. A strategy to avoid biasing a multisequence alignment toward highly related subsets of sequences has recently been described.²¹ To our knowledge, the commercially available software packages only provide programs that either align pairs of sequences or align one sequence to an already aligned set of sequences.²²

⁴ G. E. Schulz, Annu. Rev. Biophys. Chem. 17, 1 (1988).
⁵ J. Garnier, D. J. Osguthorpe, and B. Robson, J. Mol. Biol. 120, 97 (1978).
⁶ M. J. Zvelebil, G. J. Barton, W. R. Taylor, and M. J. E. Sternberg, J. Mol. Biol. 195, 957 (1987).
⁷ T. Niermann and K. Kirschner, Protein Eng. 4, 359 (1991).
⁸ W. R. Taylor and M. N. Green, Eur. J. Biochem. 179, 241 (1989).
⁹ I. P. Crawford, T. Niermann, and K. Kirschner, Proteins 2, 118 (1987).
¹⁰ C. Chothia and A. M. Lesk, EMBO J. 5, 823 (1986).
¹¹ J. Greer, J. Mol. Biol. 153, 1027 (1981).
¹² K. W. Volz, D. A. Matthews, R. A. Alden, S. T. Freer, C. Hansch, B. T. Kaufman, and J. Kraut, J. Biol. Chem. 257, 2528 (1982).
¹³ G. Buisson, E. Duée, R. Haser, and F. Payan, EMBO J. 6, 3909 (1987).
¹⁴ K. Piontek, P. Chakrabarti, H.-P. Schär, M. G. Rossmann, and H. Zuber, Proteins 7, 74 (1990).
¹⁵ M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, in "Atlas of Protein Sequence and Structure" (M. O. Dayhoff, ed.), Vol. 5, Suppl. 3, p. 345. National Biomedical Research Foundation, Washington, D.C., 1978.
¹⁶ G. J. Barton and M. J. E. Sternberg, Protein Eng. 1, 89 (1987).
¹⁷ A. M. Lesk, M. Levitt, and C. Chothia, Protein Eng. 1, 77 (1986).
¹⁸ J. L. Risler, M. O. Delorme, H. Delacroix, and A. Henaut, J. Mol. Biol. 204, 1019 (1988).
¹⁹ S. B. Needleman and C. D. Wunsch, J. Mol. Biol. 48, 443 (1970).
²⁰ D. F. Feng and R. F. Doolittle, J. Mol. Evol. 25, 351 (1987).
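The pairwise step described above, maximizing similarity at aligned positions while introducing gaps, can be illustrated with a minimal Python sketch of a global dynamic-programming alignment in the style of Needleman and Wunsch. The function name `global_align` and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration; a realistic application would substitute a residue exchange matrix such as the Dayhoff matrix for the simple identity score used here.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (aligned_a, aligned_b, score).

    Scoring is a hypothetical identity scheme, not the Dayhoff matrix.
    """
    n, m = len(a), len(b)

    def pair(i, j):
        # Score for placing a[i-1] opposite b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # residue opposite residue
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]
```

Calling `global_align("GNWKMNG", "GNWKNG")` places a single gap opposite the extra residue of the longer sequence. Multiple alignment programs of the progressive kind repeat such a pairwise step, starting with the most similar pair.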

Structural Information Contained in Aligned Sequences

A correct linear alignment must correspond to a unique superposition of the (hypothetical) tertiary structures and secondary structural elements, and therefore provides exceedingly useful information. Although highly diverse sequences can encode the same folding topology, the distribution of sequence variability is not uniform over the alignment. It depends on the role of particular segments either in stabilizing the protein core or in providing clustered catalytic residues. If homologous sequences with sufficiently high diversity are available, residues that are directly involved in catalysis are invariant, and the variability of other residues in contact with the substrate is low. Moreover, because the hydrophobic character of the protein core is generally preserved, buried positions will only tolerate substitutions of one hydrophobic amino acid by another. By contrast, high sequence variability at aligned positions is often characteristic of regions on the surface of the protein. Some sequences with characteristic patterns of polar and apolar residues are associated with amphipathic α helices, and clustered gaps in the alignment, which correspond to tolerated insertions, are useful for locating surface loops.

Sequence alignments of homologous proteins have been used with various aims. First, they allow more sensitive homology searches owing to the possibility of assigning position-specific weights (also including gap positions; see the approach of Gribskov et al.²²). Second, templates derived from aligned sequences have been used to locate a β-α-β supersecondary structure in the mononucleotide binding fold.²³,²⁴ The search template is more effective when a combination of averaged state propensity, hydropathy, and amphipathic moment is used.²⁵ Third, if the three-dimensional structure of one member of a set of homologous sequences is known, all sequences in the set can be analyzed with respect to this known structure. For example, such a database has been used to derive multiplet frequencies.²⁶ Fourth, aligned sequences have also been used for the prediction of secondary structure,⁸,⁹,²⁷⁻³⁰ as described in the following sections.

²¹ S. F. Altschul, R. J. Carroll, and D. J. Lipman, J. Mol. Biol. 207, 647 (1989).
²² M. Gribskov, A. D. McLachlan, and D. E. Eisenberg, Proc. Natl. Acad. Sci. U.S.A. 84, 4355 (1987).
²³ R. K. Wieringa, M. C. H. DeMayer, and W. G. Hol, Biochemistry 24, 1346 (1985).
²⁴ R. K. Wieringa, P. Terpstra, and W. G. Hol, J. Mol. Biol. 187, 101 (1986).
²⁵ T. A. Webster, R. H. Lathrop, and T. F. Smith, Biochemistry 26, 6950 (1987).
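The positional information discussed in this section (invariant residues, variable positions, clustered gaps) can be tabulated directly from an alignment. The following Python sketch is an illustration only: the function names, the use of the number of distinct residue types as a variability measure, and the `min_fraction` cutoff are assumptions, not the measures used in the cited work.

```python
def column_profile(alignment, gap="-"):
    """Return (number of distinct residue types, gap fraction) for each column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        residues = [r for r in column if r != gap]
        profile.append((len(set(residues)), column.count(gap) / len(column)))
    return profile


def invariant_positions(profile):
    """Columns with a single residue type and no gaps."""
    return [i for i, (kinds, gaps) in enumerate(profile) if kinds == 1 and gaps == 0]


def gap_rich_positions(profile, min_fraction=0.5):
    """Columns in which at least min_fraction of the sequences carry a gap."""
    return [i for i, (_, gaps) in enumerate(profile) if gaps >= min_fraction]
```

Runs of adjacent gap-rich columns are the candidates for surface loops, and invariant columns in a sufficiently diverse set are candidates for catalytic residues.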

Secondary Structure Prediction Methods

The ultimate challenge of predicting secondary structure is to recognize the correct chain folding. This goal, however, can hardly be achieved with the accuracy of existing methods. There are three original concepts for predicting secondary structure from the amino acid sequence. (1) The pattern matching method introduced by Lim³¹ relies on physical principles of the protein architecture, and it considers mainly runs of hydrophobic and hydrophilic amino acids that are characteristic of particular secondary structural elements. Later, Cohen et al.³² included more specific patterns, for example, charge pairs with the appropriate spacing characteristic of α helices. (2) Homology methods, which require large databases, were published in 1986.³³⁻³⁵ (3) As for propensity (statistical) methods, Chou and Fasman³⁶ derived local state propensities, and Garnier et al.⁵ derived directional state propensities, from statistical evaluations of the occurrence of particular residues in known secondary structural elements of proteins. Recently, neural networks have also been used to derive directional state propensities.³⁷,³⁸

Despite the variety of different secondary structure prediction programs, only a few are generally available, or their applicability is limited to certain classes of proteins. State propensities have been updated repeatedly, but the accuracy of predictions based on propensities of single residues appears to have reached saturation. These diminishing returns arise because the contribution of an individual residue toward stabilizing the protein is only realized in its three-dimensional context and is strongly influenced by tertiary interactions with other residues. Consideration of specific tertiary interactions is still beyond the reach of secondary structure prediction methods, but nonspecific (e.g., hydrophobic) interactions have been implemented in one of the published prediction algorithms.³⁹

The main criticism of statistical methods focuses on this circumstance. Single residue propensities are an oversimplified representation of local secondary structure. For example, residues interact in pairs in the same secondary structural element, say, an α helix. Therefore, reliable structural propensities for pairs of residues must increase the accuracy of statistical prediction methods. Although the database is still not big enough to quantify all possible residue-pair interactions, propensities for significant pairs have already been derived, leading to improved prediction.⁴⁰

A reappraisal of statistical evaluations of protein structures is supported by recent experiments. It has been shown independently that short isolated peptides can have unexpectedly high α-helical content in solution, which was attributed either to high alanine content⁴¹ or to clearly identifiable side chain interactions.⁴²,⁴³

We have chosen the method of Garnier, Osguthorpe, and Robson⁵,⁴⁰ (GOR method) for our analysis for several reasons. Directional propensities represent the influence of a residue on the secondary structure of its neighbors because particular positions in a secondary structural element are often populated by specific amino acids. By contrast, the average local (Chou and Fasman³⁶) propensity ignores neighbor interactions. For example, the directional propensities of glutamic acid describe quantitatively how its occurrence in α helices decreases gradually from the amino to the carboxyl terminus of the α helix. The explicit use of directional interaction is the main reason for the higher accuracy of the GOR method in comparison to other methods. This approach uses accessible information that lies between the local propensity of the Chou-Fasman method, on the one hand, and the desirable complete set of pair interaction propensities,⁴⁰ on the other. The GOR method is straightforward because the prediction depends on the simple summation of superimposed directional propensities; it does not require extra rules. The GOR method is more versatile than other methods. It allows complicated analyses and manipulations of the state propensity profiles, which were necessary for the prediction of supersecondary structures.⁴⁴ Finally, the inventors of this method have repeatedly adapted the method to the growing database and have added new features to increase its effectiveness.

²⁶ K. Nagano, J. Mol. Biol. 75, 401 (1973).
²⁷ L. Sawyer, L. Fothergill-Gilmore, and P. S. Freemont, Biochem. J. 249, 789 (1988).
²⁸ L. Sawyer, L. Fothergill-Gilmore, and G. A. Russell, Biochem. J. 236, 127 (1986).
²⁹ S. J. Perkins, P. I. Haris, R. B. Sim, and D. Chapman, Biochemistry 27, 4004 (1988).
³⁰ S. J. Perkins, A. S. Nealis, J. Dudhia, and T. E. Hardingham, J. Mol. Biol. 206, 737 (1989).
³¹ V. I. Lim, J. Mol. Biol. 88, 857 (1974).
³² F. E. Cohen, R. M. Abarbanel, I. D. Kuntz, and R. J. Fletterick, Biochemistry 22, 4894 (1983).
³³ J. M. Levin, B. Robson, and J. Garnier, FEBS Lett. 205, 303 (1986).
³⁴ R. M. Sweet, Biopolymers 25, 1565 (1986).
³⁵ K. Nishikawa and T. Ooi, Biochim. Biophys. Acta 871, 45 (1986).
³⁶ P. Y. Chou and G. D. Fasman, Adv. Enzymol. 47, 45 (1978).
³⁷ N. Qian and T. J. Sejnowski, J. Mol. Biol. 202, 865 (1988).
³⁸ L. H. Holley and M. Karplus, Proc. Natl. Acad. Sci. U.S.A. 86, 152 (1989).
³⁹ O. B. Ptitsyn and A. V. Finkelstein, Biopolymers 22, 15 (1983).
⁴⁰ J.-F. Gibrat, J. Garnier, and B. Robson, J. Mol. Biol. 198, 425 (1987).
⁴¹ S. Marqusee, V. H. Robbins, and R. L. Baldwin, Proc. Natl. Acad. Sci. U.S.A. 86, 5286 (1989).
⁴² J. L. Krstenansky, T. J. Owen, K. A. Hagemann, and L. R. McLean, FEBS Lett. 242, 409 (1989).
⁴³ S. Marqusee and R. L. Baldwin, Proc. Natl. Acad. Sci. U.S.A. 84, 8898 (1987).

Prediction of Secondary Structure of Aligned Sets of Sequences

The protein folding code is degenerate, that is, highly diverse amino acid sequences can code for the same topology. Therefore, homologous sequences provide to some extent independent structural information. It has been suggested earlier, but without further qualification, that the utilization of homologous sequences should improve secondary structure prediction.⁴⁵ Although the number of sequences increases more rapidly than the number of elucidated three-dimensional structures, attempts to exploit this gratuitous structural information are rare. In principle, any prediction method should benefit from additional sequences,⁷ but we sought a method that could process this information without the explicit prediction of individual sequences.⁵ As a statistical method the GOR approach can be applied to an aligned set of sequences to produce a condensed form of output.

There are two straightforward approaches: (1) Each sequence is predicted individually by replacing the amino acid symbols by a symbol for one of three secondary structural elements: α helix, β strand, or coil. Then a consensus prediction is obtained from the most frequent symbol at each aligned position. (2) The individual propensities are averaged at the aligned positions, and the highest state propensity determines the secondary structure.

Consensus predictions from individually predicted sequences have serious disadvantages: (1) The information contained in the amplitudes of the propensities is ignored. (2) Ad hoc rules are required where there is no predominantly predicted state. (3) The consensus is necessarily biased toward the subset of sequences that have the highest homology. In contrast to the consensus procedure, the averaging procedure yields an unequivocal prediction. The quantitative output has the form of three propensity profiles and is formally equivalent to that from a single-sequence prediction.
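The difference between the two approaches can be made concrete with a small Python sketch. It assumes that each sequence has already been converted to a per-position triple of (helix, strand, coil) propensities on a common, gap-free set of aligned positions; the function names and the numerical values in the usage note are hypothetical.

```python
from collections import Counter

STATES = "HSC"  # alpha helix, beta strand, coil


def call_states(profile):
    """Pick the state with the highest propensity at every position."""
    return "".join(STATES[max(range(3), key=lambda k: p[k])] for p in profile)


def consensus_prediction(profiles):
    """Approach (1): predict each sequence, then take the most frequent symbol."""
    individual = [call_states(p) for p in profiles]
    # Ties are broken arbitrarily here, which is one of the ad hoc rules
    # that the consensus procedure requires.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*individual))


def averaged_prediction(profiles):
    """Approach (2): average the propensities, then pick the highest state."""
    n = len(profiles)
    mean = [tuple(sum(v[k] for v in col) / n for k in range(3))
            for col in zip(*profiles)]
    return mean, call_states(mean)
```

For a single aligned position with the hypothetical triples `(120, -20, 0)`, `(-5, 10, 0)`, and `(-10, 5, 0)`, the consensus is strand because two of three sequences marginally favor it, whereas averaging yields helix because the amplitude of the first sequence is retained.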
⁴⁴ W. R. Taylor and J. M. Thornton, J. Mol. Biol. 173, 487 (1984).
⁴⁵ P. Argos, J. Schwarz, and J. Schwarz, Biochim. Biophys. Acta 439, 261 (1976).

However, the averaged output incorporates information representative of the entire set of aligned sequences. Averaging biases the prediction toward those segments with high propensities, and the accuracy of a prediction increases with the amplitude of the propensity.⁴⁰ Moreover, the


profiles obtained by averaging can be processed further, for example, by the template matching approach⁴⁴ or by the weighting procedure.⁷ An earlier implementation of homologous sequences⁶ did not lead to a general improvement. In our analysis of the aligned sets of nine different enzymes,⁷ the overprediction of the α-helical state in the highly diverse sets had to be compensated by adjusting the decision constants. Moreover, compared to the known secondary structure of one member of the set, the success rates of individual sequences varied widely. Because a single sequence cannot be representative of all sequences in the aligned set (see application example), the average of the individual success rates was chosen as a basis for comparison. When this was done, the score of the prediction increased in proportion to the variability in the set.⁷

There are two complementary explanations for this improvement. First, averaging suppresses the scatter among the individual predictions. Second, it is known that the intrinsic driving force for the formation of secondary structure varies widely, depending on the different amino acid sequences.⁴¹⁻⁴³ The notion here is that a segment corresponding to a given secondary structural element will not have a strong negative propensity for that secondary structure. Therefore, those segments having a strong intrinsic tendency for the correct secondary structure will dominate.

Joint Prediction

One shortcoming of secondary structure prediction methods seems to be due mainly to the particular bias toward one or the other parameter (either hydropathy or state propensity, or patterns of side-chain properties), thus ignoring other important structural information. Of course these parameters are not independent. For example, most hydrophobic residues are associated with high β-strand propensity. Frequently the methods of Chou and Fasman³⁶ and Garnier et al.⁵ are also combined in a joint prediction for single sequences, but this approach ignores the fact that both methods rely on similar concepts. Although predictions from these methods may differ, this is frequently due to marginal differences in alternative state propensities. Furthermore, it is nearly impossible to weight the predictions adequately because the α helix, β strand, and coil states are predicted with different accuracies by each method.⁴⁶

Based on the three methods for secondary structure prediction mentioned above, various hybrid approaches have been developed.⁴⁷ Where conflicts arise between the results of the various methods, they are resolved by scoring those predictions that have the highest probability.

The correlation of profiles of physicochemical properties with either internal or external segments is frequently used to identify secondary structural elements.⁴⁸,⁴⁹ We observed that local maxima of the hydrophobicity profile coincide well with buried β strands, whereas the correlation of local minima with surface loops was less significant. On one hand, the avoidance of hydrophilic residues in the protein core seems to be more pronounced than the avoidance of hydrophobic residues on the surface, that is, "some of the grease is on the surface."⁵⁰ On the other hand, local maxima of the neighbor-correlated chain flexibility superimposed well onto chain segments on the surface of the protein.⁵¹ Moreover, maxima in the amphipathic moment profile coincided with amphipathic α helices.⁵² However, not every known secondary structural element was associated with a maximum in the cognate property profile, and some maxima were not associated with cognate secondary structural elements.

As for the propensity profiles discussed above, averaged property profiles of aligned sets are obtained easily by averaging at homologous positions. Visual inspection reveals a better correlation with known secondary structural elements than that obtained from property profiles of single sequences. The explanation seems to be that an amphipathic α helix can be realized by various amino acid sequences in the aligned segments, and therefore maxima in the corresponding individual amphipathic moment plots will have different amplitudes. Since the amphipathic character must be conserved for structural reasons, averaging will enhance superimposed signals. Conversely, incorrect maxima in individual sequences, which may arise from ambivalent segments, are suppressed.

⁴⁶ W. Kabsch and C. Sander, FEBS Lett. 155, 179 (1983).
⁴⁷ V. Biou, J. F. Gibrat, J. M. Levin, B. Robson, and J. Garnier, Protein Eng. 2, 185 (1988).
⁴⁸ G. D. Rose, Nature (London) 272, 586 (1978).
⁴⁹ J. L. Cornette, K. B. Cease, H. Margalit, J. L. Spouge, J. A. Berzofsky, and C. Delisi, J. Mol. Biol. 195, 659 (1987).
⁵⁰ F. M. Richards, Annu. Rev. Biophys. Bioeng. 6, 151 (1977).
⁵¹ P. A. Karplus and G. E. Schulz, Naturwissenschaften 72, 212 (1985).
⁵² D. Eisenberg, R. M. Weiss, and T. C. Terwilliger, Proc. Natl. Acad. Sci. U.S.A. 81, 140 (1984).

We have recently incorporated the correspondence between pronounced local maxima in averaged property profiles and cognate secondary structural elements into an automatic weighted prediction method for aligned sets of sequences.⁷ We obtained higher prediction scores in our test set of proteins by optimizing threshold values for the propensity profiles, thus accepting strong amplitudes but ignoring minor ones. The additional information available from the aligned set per se has also been incorporated into the prediction algorithm. For example, clustered gaps


are overriding indicators for loop segments and therefore identify these directly.

The development of an automatic joint prediction algorithm must be based on the processing of quantitative data (i.e., the amplitudes of propensity, property, variability, or other profiles) in order to permit empirical optimization of the procedure. The method must be automatic to provide feedback during intermediate stages of optimization and to give reproducible results in the hands of others. The quantitative output is then also useful for defining a suitable score for comparative purposes.

Our work published hitherto⁷,⁹ concerns monomeric proteins of the α/β and α + β classes, in particular proteins with the parallel 8-fold β/α barrel topology. The automatic prediction procedure that was optimized for seven test proteins led to a relatively high score (about 70% on a per residue basis). The per segment prediction score was above 95%.

Caution is necessary when the approach is applied either to oligomeric proteins or to proteins of other classes (e.g., all α or all β). On one hand, in oligomeric proteins of the α/β or α + β classes, the intersubunit (or interdomain) contacts are frequently made via α helices. Therefore, these internal α helices in general are not amphipathic. On the other hand, the variability profiles of parallel 8-fold β/α barrel proteins reflect the repetitive motif of the eight β/α supersecondary structures and the exclusive location of the invariant residues of the active site in loops between β strands and helices.⁵³ Enzymes with a different topology will not necessarily show a clear-cut correlation between variability and known secondary structure. Based on these considerations we recommend, for the analysis of oligomeric proteins and of proteins of other classes, that the averaged propensity profiles be regarded as the primary output. Where ambiguities (e.g., overlap between maxima of two propensity profiles) occur, inspection of the averaged property profiles can be used to resolve them as discussed below.
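A per residue score of the kind quoted above can be read as the percentage of positions at which the predicted state equals the known state. The Python sketch below makes that reading explicit; the function name, the three-letter state alphabet (H, S, C), and the decision to skip gap positions are assumptions for illustration. The per segment score is not sketched because its definition is not given in this section.

```python
def per_residue_score(predicted, observed, gap="-"):
    """Percentage of compared positions with identical predicted and known state."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have the same length")
    # Positions that are gaps in either string are excluded from the count.
    pairs = [(p, o) for p, o in zip(predicted, observed) if p != gap and o != gap]
    if not pairs:
        raise ValueError("no positions to compare")
    correct = sum(1 for p, o in pairs if p == o)
    return 100.0 * correct / len(pairs)
```

For the hypothetical strings `"HHHHSSSCCC"` (predicted) and `"HHHCSSSSCC"` (known), eight of ten positions agree.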

⁵³ G. K. Farber and G. A. Petsko, Trends Biochem. Sci. 15, 228 (1990).

Method

After we had published the prediction of the secondary structure of the α subunit of tryptophan synthase,⁹ we received strong feedback from colleagues who sent us their alignments for prediction. We conclude, at least from our experience, that there is a real need for generally available and user-friendly computer programs for predicting the secondary structure of aligned sets of protein sequences. The usefulness of prediction programs


in commercially available software packages is often limited in terms of their applicability to aligned sequences.

Input

The computer program processes aligned sequences. Each sequence, which is usually edited during alignment by the insertion of gaps, must be stored as a new file. The computer program then opens a file that must contain the file names of these edited sequences. The running program also requests the input of several critical parameters that control its operation.

State Propensities

The secondary structural propensities for α helix, β strand, and coil are calculated for each individual sequence by the original method of Garnier et al.,⁵ using the parameters of Gibrat et al.⁴⁰ The three directional GOR state propensities for each amino acid cover 8 positions in both the amino- and the carboxyl-terminal direction. The propensity of an amino acid at the ith position for the α helix, β strand, and coil state is obtained as the sum of the overlapping state propensities of the 17 amino acids at positions i - 8 to i + 8 in the sequence. The secondary structure of that position is then determined by the largest propensity. The method can be adjusted either to a special protein class or to the aligned set of sequences being processed by choice of suitable decision constants.⁷ These parameters effect a vertical displacement of the propensity profiles relative to each other over the entire sequence.

Gaps in aligned sequences need special treatment, as follows. First, the three propensities are calculated for each continuous sequence (i.e., with the gaps removed). Second, the propensities are relocated to the actual positions of each residue in the alignment. Third, the averaged propensity profiles are smoothed with a span of three residues. Smoothing of single sequences can therefore only be achieved if the same sequence is entered twice, that is, treated as a pseudo set.
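The steps described in this section (window summation over i - 8 to i + 8, relocation to alignment positions, averaging, and smoothing) can be sketched in Python as follows. This is a schematic illustration under stated assumptions and not the program described here: the table `TOY_TABLE` contains made-up, symmetric placeholder values rather than the directional parameters of Gibrat et al.; the decision constants are assumed to be simple additive offsets; and columns are assumed to be averaged over only those sequences that carry a residue there. All function names are hypothetical.

```python
WINDOW = 8                  # residues considered on each side of position i
STATES = ("H", "S", "C")    # alpha helix, beta strand, coil


def _toy_row(peak):
    """Hypothetical 17-value row that decays linearly away from offset 0."""
    return [peak * (1 - abs(m) / (WINDOW + 1)) for m in range(-WINDOW, WINDOW + 1)]


# Placeholder values only; these are NOT the published GOR parameters.
TOY_TABLE = {
    "H": {"A": _toy_row(8), "E": _toy_row(10), "G": _toy_row(-10)},
    "S": {"V": _toy_row(10), "I": _toy_row(10)},
    "C": {"G": _toy_row(10), "P": _toy_row(12)},
}
ZERO_ROW = [0.0] * (2 * WINDOW + 1)


def state_propensities(seq, table, constants=(0.0, 0.0, 0.0)):
    """Sum the overlapping contributions of the residues at i-8 .. i+8."""
    profile = []
    for i in range(len(seq)):
        totals = []
        for k, state in enumerate(STATES):
            total = constants[k]  # assumed: constant shifts the whole profile
            for m in range(-WINDOW, WINDOW + 1):
                j = i + m
                if 0 <= j < len(seq):
                    total += table[state].get(seq[j], ZERO_ROW)[m + WINDOW]
            totals.append(total)
        profile.append(tuple(totals))
    return profile


def relocate(aligned_seq, table, constants=(0.0, 0.0, 0.0), gap="-"):
    """Predict on the gap-free sequence, then map values back to alignment columns."""
    values = iter(state_propensities(aligned_seq.replace(gap, ""), table, constants))
    return [None if ch == gap else next(values) for ch in aligned_seq]


def average_columns(relocated):
    """Average each column over the sequences that have a residue there."""
    averaged = []
    for column in zip(*relocated):
        present = [v for v in column if v is not None]
        if present:
            averaged.append(tuple(sum(v[k] for v in present) / len(present)
                                  for k in range(3)))
        else:
            averaged.append((0.0, 0.0, 0.0))  # column of gaps only
    return averaged


def smooth(profile, span=3):
    """Running mean over `span` columns (shorter windows at the ends)."""
    half = span // 2
    out = []
    for i in range(len(profile)):
        window = profile[max(0, i - half): i + half + 1]
        out.append(tuple(sum(v[k] for v in window) / len(window) for k in range(3)))
    return out


def predict_aligned_set(aligned, table, constants=(0.0, 0.0, 0.0)):
    """Return the smoothed averaged profile and the state string derived from it."""
    relocated = [relocate(s, table, constants) for s in aligned]
    profile = smooth(average_columns(relocated))
    states = "".join(STATES[max(range(3), key=lambda k: p[k])] for p in profile)
    return profile, states
```

With the toy table, `predict_aligned_set(["AEAEA-VIVIV", "AEAEAGVIVIV"], TOY_TABLE)` returns one propensity triple and one state symbol per alignment column. Passing the same sequence twice reproduces the pseudo set mentioned above.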

Hydropathy, Amphipathy, and Chain Flexibility

Hydropathy and amphipathy are correlated with certain secondary structural elements in proteins. Maxima in hydropathy plots are expected to correlate with β strands that are buried in the interior of the protein. A characteristic pattern of hydrophobic and hydrophilic residues is observed for external α helices that pack against the protein core and mediate contact with the solvent at the opposite face. Therefore, these α helices are associated with maxima in the amphipathic moment plot.
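Both profiles can be sketched in Python with the hydropathy scale of Kyte and Doolittle. The sketch makes several assumptions: the amphipathic moment is taken in the form of the hydrophobic moment of Eisenberg et al.,⁵² with an angle of 100° per residue for an α helix; both profiles are normalized by the window length; and windows are truncated at the chain ends. The default spans of 5 and 11 residues follow the recommendation given in this chapter. The function names are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, span=5, scale=KYTE_DOOLITTLE):
    """Mean hydropathy in a window of `span` residues centered on each position."""
    half = span // 2
    out = []
    for i in range(len(seq)):
        window = seq[max(0, i - half): i + half + 1]
        out.append(sum(scale[r] for r in window) / len(window))
    return out


def amphipathic_moment_profile(seq, span=11, angle_deg=100.0, scale=KYTE_DOOLITTLE):
    """Windowed hydrophobic moment, assuming `angle_deg` of rotation per residue."""
    half = span // 2
    delta = math.radians(angle_deg)
    out = []
    for i in range(len(seq)):
        lo, hi = max(0, i - half), min(len(seq), i + half + 1)
        sin_sum = cos_sum = 0.0
        for n in range(lo, hi):
            h = scale[seq[n]]
            sin_sum += h * math.sin(n * delta)
            cos_sum += h * math.cos(n * delta)
        # Length of the vector sum, normalized by the number of residues used.
        out.append(math.hypot(sin_sum, cos_sum) / (hi - lo))
    return out
```

A segment with hydrophobic residues spaced three to four positions apart, such as the hypothetical `"LKKLLKKLLKK"`, gives a much larger moment at its center than a uniform segment of the same length. Averaged profiles for an aligned set follow by averaging these values at homologous positions.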

A number of measured and empirical hydropathy scales have been developed, many of which are, however, strongly correlated with one another.⁵⁴ The scale of Kyte and Doolittle⁵⁵ is implemented in our program for the calculation of both the hydropathy and the amphipathic moment profiles. Of course, it is possible to replace this scale by any other desired scale. The span setting for the calculation of the profiles should correspond to the approximate length of the known secondary structural elements. We recommend a span of 5 residues for the hydropathy plot and a span of 11 residues for the amphipathic moment plot. Each of these spans is requested by the program. The neighbor-correlated chain flexibility needs explicit mention here because it comprises a set of three indices for each residue. The flexibility of the two neighboring residues defines the actual index of the central

[FIG. 1. Plot not recoverable from the extracted text; the caption was not preserved. Surviving labels: alignment positions 5 to 95 on the horizontal axis; one panel with a vertical scale from -250 to 250 and points marked H, S, and C, with rows labeled av., Sce, and Tbr showing runs of H and S; and a second panel with a vertical scale from about -100 to 100 and points marked *, #, and o.]

[4]


57

K. 5 i av.

15 P

25

Sce Tbr

HSSHHHHHH

SSSSSSS

45 I

55 I

65 I

SSSSS

HHHHHHHHHHH SSSSS HHHHHHHHH SSSSSSS_~HHHHHHH 82 a2

HHHHHHHHHHHHHHH

SSSSSSSSS 81

cons.

35 I

I

SSSSS

HHHHHHHHHHHHH al

HHHHHHHHHS SSSSS SSSSS ~3

75 I S

85 I

95 I

SSSSS SSSSS HHHHHHHH[~S S S SS---~ H l q H H H H -- SSSSSHH a3 ~4

r v gn k nG 1 v AQn GAfTGE S g w GHS .......... DHQQAI GTVQKLAFALPKE YFEKV .... DVAVTVPETD IRSVQTLVEGDKLEVTFGAQDVSQHE SGAYTGEVSASMLAKLNCSWVVVGHS --MPRKFFVGGNFKMNGNAES TT S I I KNLNSANLDK-SVEVVVSpPALYLLQAREVANK-E I GVAAQNVFDKPNGAFTGE I SVQQLREANI DWT I LGHS --APRKFFVGGNWKMNGKRKS LGE L I HTLDGAKLSA--DTEWCGAP S I YLDFARQKLDA-K I GVAAQNCYKVPKGAFTGE I SPAMIKD IGAAWVILGHS ---MRHP LVMGNWKLNG SRHMVHELVSNLRKELAGVA-GCAVAIAPPEMY I DMAKREAEGSH IMLGAQNVNLNLSGAFTGETS~ I GAQY I I I GHS MAP SRKFFVGGNWKMNGRKQS LGEL I GTLN KVPA--DTEVVCAPPTAY I DFARQKLDP-K IAVAAQNCYKVTNGAFTGE I SPGMIKDCGATWVVLGHS MAP SRKFFVGGNWKMNGRKQNLGEL I GTLNAAKVPA--D TEWCAPPTAY I DFARQKLDP -KIAVAAQNCYKVTNGAFTGE I SPGMIKDCGATWVVLGHS -AP SRKFFVGGNWKMNGRKKNLGE L I S TLQ AKV A--D TEVVC I GPTAYLDFARQKLDQ-KIAAGAQNCYKVTNGAFTGE I SPGMI KDCGATWVVLGHS --APRKFFVGGNWKMNGDKKSLGE L I QTLNAAk"4PF --TGE IVCAPPEAYLDFARLKVDP-KFGVAAQNCYKVSKGAFTGE I SPAMIKDCGVTWVI LGHS --MGRKFFVGGNWKCNGTTDQVEK I VKTLNE GQVPP S DVVEVVVSpP yVFLPVVKSQLRQ-EFHVAAQNCWVKKGGAFTGEVSAEMLVNLGVPWVI LGHS - -MGRKFFVGGNWKCNGTS EEVKK IVTLLNEAEVPSE DVVEVVVSPP YVFLPFVKNLLRA-DFHVAAQNCWVKKGGAFTGEVSAEMLVNLGIPWVI LGHS -AP SRKFFVGGNWKMNGRKKNLGE L I TTLNAAKVPA--DTEWCAPPTAY I DFARQKLDP-K IAVAAQNCYKVTNGAFTGE I SPGMIKDCGATWVVLGHS --MARTFFVGGNFKLNGSKQS IKE IVERLNTAS I PE--NVEVVI CPPATYLD YSVS LVKKPQVTVGAQNAYLKASGAFTGENSVDQI KDVGAKWVI LGHS --MARKFFVGGNFKMNGS LE SMKT I I EGLNTTKLNVGDVETVI FPQNMYL I TTRQQVKK-D I GVGAQNVFDKKNGAYTGENSAQS L I DAG I TYTLTGHS -MSKP QP I AAANWKCNGS QQSLSE L I D LFNSTS INH--DVQCWASTFVHLAMTKERLSHPKFVIAAQNAIA-KSGAFTGEVS LP I LKDF GVNWIVLGHS

Cgl And

Chi Eco Hli

Mac Hpl Lat Mal

CJa Rab Sce Spo Tbr vat.

12 2111 1 61 2 1 21 1 1 21 ii Ii 12 1 1300252424424272119429685532474617816118454767487651277647216051731125249491121

23-1

201 191 181 171 161 151 14 I 131 121 Ii] 10d 9i

8i 7 I 61 51 41 31 21 iI

*

*

*

*

*

*

*

* * * * * * * * *

* * * * * * * * *

* ** * * ** * * ** * * * ** * * *** * * * *** * * * **** * ** * **** * ** * * * ** **** * ** * * * ** ****** ** * * * ** ******* ** * * ** ** * ******* ** ** **** ** * ******* ** ******* ** * * *********** ******* ** * * ** * * ******************* ** * * ** * * ******************* ************** ******************* **********************************

I 5

I 15

I 25

I 35

*

2

ii

2

2 1 1 i18194648672892465111

*

*

*

*

* * * * * * ** * * ** * * * ** * * * ** ** * * ** ** * * ** * ** ** * * ** ** ** ** * * ** ** ** ** * * ** ** ** ** ** * ** ** ** ** ** *** ** ********* ** *** *** ********* ** * *** ************* ********************** ********************** ********************** **********************

* * * * * * * * * *

* *

* * ** **

* * * * * * * * *

* *

**

*

i 45

I 55

*

*

* * * * * *

* * * * * *

* * * * * *

* *

* *

* *

* *

*****

**

*

* * ***** * ** * * ** * ** ***** * ** * * ** * ** ***** * *** *** ** * **** ****** * *** *** ** ** **** ****** * ******* ** *** ***** ****** * ******* ** *** ***** ******* * * ************** ***************************************

*

I 65

r 75

I 85

I 95

FIG. 1. (a) Averaged profiles of the state propensities (H, α helix; S, β strand; C, coil) and residue properties (#, amphipathic moment; o, hydrophobicity; *, flexibility) from 14 different sequences of triose-phosphate isomerase. The averaged secondary structure prediction (av.) is compared to the known secondary structural elements of the protein from Saccharomyces cerevisiae (Sce) 57 and Trypanosoma brucei (Tbr). 56 Blank spaces in these lines correspond to gaps in the alignment. The known non-α helix and non-β strand states (Sce and Tbr) as well as the predicted coil states (av.) are represented by lines. (b) Section of aligned sequences of 14 triose-phosphate isomerases from 13 different organisms, comprising the region from β1 to α4. Key to organisms: Cgl, Corynebacterium glutamicum; And, Aspergillus nidulans; Chi, chicken muscle; Eco, Escherichia coli; Hli, human liver; Mac, Macaca mulatta; Hpl, human placenta; Lat, Latimeria; Mai, maize; Cja, Coptis japonica; Rab, rabbit; Sce, Saccharomyces cerevisiae; Spo, Schizosaccharomyces pombe; Tbr, Trypanosoma brucei. The triose-phosphate isomerase sequences used here were obtained from the MIPS data bank. 58 The row above the stack of aligned sequences (cons.) indicates conserved residues: uppercase letters correspond to 14 identical residues, and lowercase letters allow for up to two nonidentical residues. The variability index 59 (var.) for each position of the alignment is represented below as a truncated histogram. Blank columns in the histogram correspond to gaps in the alignment.
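The cons. and var. rows described in the caption can be illustrated with a minimal Python sketch. It assumes the usual definition of the Wu and Kabat variability index (number of different residues at a position divided by the relative frequency of the most common one). The function names, the gap character, the "." placeholder for nonconserved columns, and the toy alignment are hypothetical and are not taken from the authors' program or from the TIM data.

```python
from collections import Counter

GAP = "-"  # assumed gap character


def columns(alignment):
    """Return the alignment as a list of column strings."""
    return ["".join(col) for col in zip(*alignment)]


def wu_kabat_variability(column):
    """Number of distinct residues divided by the frequency of the most common one."""
    if GAP in column:
        return None  # columns with gaps are left blank, as in the histogram
    counts = Counter(column)
    most_common_count = counts.most_common(1)[0][1]
    return len(counts) * len(column) / most_common_count


def consensus_symbol(column, tolerated=2):
    """Uppercase if all residues agree, lowercase if at most `tolerated` differ."""
    if GAP in column:
        return " "
    residue, count = Counter(column).most_common(1)[0]
    mismatches = len(column) - count
    if mismatches == 0:
        return residue.upper()
    if mismatches <= tolerated:
        return residue.lower()
    return "."  # assumed marker for a nonconserved position


# Toy alignment (hypothetical, for illustration only)
toy = ["GAFTGE", "GAYTGE", "GAFTGD", "G-FSGE"]
for col in columns(toy):
    print(col, consensus_symbol(col), wu_kabat_variability(col))
```

A fully conserved column gives the minimum variability of 1, and the value grows both with the number of different residues and with a falling share of the most common one.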


residue. A weighted average over 7 residues is then calculated for the central residue according to the algorithm of Karplus and Schulz.

Output

The averaged profiles are displayed on a line printer plot. This representation is an adequate compromise for simultaneously showing the alignment, the averaged properties, the averaged state propensities, and the prediction.

Example

The application of the computer program described briefly above is demonstrated in Fig. 1 56-59 with an alignment of 14 different sequences of triose-phosphate isomerase (TIM). In Fig. 1a (top) the averaged propensity profiles for the three states (α helix, β strand, and coil) are visualized by plots of the characters H, S, and C, respectively. The scaling of the Y axis is in centinats, 5 and the continuous numbering of the X axis refers to the residue positions in the alignment. In the lines below, the predicted secondary structure (labeled av.) is compared to the known secondary structures of TIM from both Trypanosoma brucei 56 (labeled Tbr) and Saccharomyces cerevisiae 57 (labeled Sce). Both sequences are members of the aligned set. In the lower portion of Fig. 1a the averaged property profiles are visualized by plots of the following characters: #, helical amphipathic moment; o, hydrophobicity; and *, chain flexibility.

Statistical methods ignore not only the physicochemical properties of α helices (amphipathy), β strands (hydrophobicity), and turns (flexibility), but also their preferred lengths. Therefore the prediction based only on the GOR method must be reconsidered in view of these characteristic properties. Correlated averaged property profiles can be used to resolve ambivalent averaged propensities. This joint prediction can be done either quantitatively, by using the cognate property profiles as weights, 7 or qualitatively, by visual inspection of the amphipathic moment, hydrophobicity, and chain flexibility profiles. In the example of Fig. 1a, the profiles of the α helix and β strand properties strongly overlap around position 25, but a strong maximum in the amphipathic moment plot supports the prediction of helix α1.

Local maxima of averaged propensity that are submerged under a broad plateau of propensity for another secondary structure present an interesting problem. This situation accounts for the failure to predict strand β3: its propensity maximum is submerged under that of the otherwise correctly predicted helix α3. However, a local maximum in the hydrophobicity, a minimum in the flexibility, and insignificant amphipathy rather favor a β strand in this region. This pattern is typical for all correctly predicted β strands in Fig. 1.

54 G. D. Rose, L. M. Gierasch, and J. A. Smith, Adv. Protein Chem. 37, 1 (1985).
55 J. Kyte and R. F. Doolittle, J. Mol. Biol. 157, 105 (1982).
56 R. K. Wieringa, K. H. Kalk, and W. G. Hol, J. Mol. Biol. 198, 109 (1988).
57 E. Lolis, T. Alber, R. C. Davenport, D. Rose, F. C. Hartman, and G. A. Petsko, Biochemistry 29, 6609 (1990).
58 H. W. Mewes, Max-Planck-Inst. Protein Sequences (1990).
59 T. T. Wu and E. A. Kabat, J. Exp. Med. 132, 211 (1970).
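The averaged hydropathy profile used in this chapter can be sketched in Python as follows. The sketch uses the published Kyte and Doolittle scale and the recommended span of 5 residues. The function names, the unweighted window mean, and the treatment of gaps (each sequence is profiled without its gaps and the values are then mapped back to alignment columns) are assumptions made for illustration, not a description of the authors' program.

```python
# Kyte and Doolittle hydropathy values (J. Mol. Biol. 157, 105, 1982)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_profile(seq, scale, span=5):
    """Mean scale value over a centered window; None where the window does not fit."""
    half = span // 2
    values = [scale[aa] for aa in seq]
    profile = [None] * len(seq)
    for i in range(half, len(seq) - half):
        profile[i] = sum(values[i - half:i + half + 1]) / span
    return profile


def averaged_profile(alignment, scale, span=5, gap="-"):
    """Average the per-sequence profiles at each alignment column, skipping gaps."""
    n_cols = len(alignment[0])
    sums = [0.0] * n_cols
    counts = [0] * n_cols
    for row in alignment:
        # Profile the ungapped sequence, then walk the row to map values to columns.
        per_residue = iter(window_profile(row.replace(gap, ""), scale, span))
        for col, symbol in enumerate(row):
            if symbol == gap:
                continue
            value = next(per_residue)
            if value is not None:
                sums[col] += value
                counts[col] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy alignment (hypothetical, for illustration only)
print(averaged_profile(["IIIIIII", "III-III"], KYTE_DOOLITTLE))
```

Any other scale can be substituted by passing a different dictionary, and the same averaging step applies to other per-residue profiles once they have been computed for each sequence.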

[5] Conformation of β Hairpins in Protein Structures: Classification and Diversity in Homologous Structures

By BANCINYANE LYNN SIBANDA and JANET M. THORNTON

Introduction

β Hairpins are widespread in globular proteins, occurring as separate motifs or forming part of an extended β-sheet structure. Recent studies 1-5 have provided new insights into hairpin classification and revealed structural families in the short loops connecting the strands. In this chapter we describe a systematic classification of β hairpins that is indicative of both the length of the polypeptide and the hydrogen bonding between the two antiparallel strands. The classification provides a useful tool to aid "modeling by homology" of protein structure. In this approach the sequence of the "unknown" is modeled onto the three-dimensional coordinates provided by one or more homologous proteins (see Ref. 6 for review). Comparisons of three-dimensional structures of homologous proteins demonstrate that the cores of proteins are highly conserved, although there are differences in relative positions and orientations of the α helices and β strands. 7,8 Insertions, deletions, and radical changes in conformation

1 B. L. Sibanda and J. M. Thornton, Nature (London) 316, 170 (1985).
2 B. L. Sibanda, T. L. Blundell, and J. M. Thornton, J. Mol. Biol. 206, 759 (1989).
3 J. Milner-White and R. Poet, Biochem. J. 240, 289 (1986).
4 V. Pavone, Int. J. Biol. Macromol. 10, 238 (1988).
5 A. V. Efimov, Mol. Biol. (Moscow) 20, 2250 (1986).
6 T. L. Blundell, B. L. Sibanda, M. J. E. Sternberg, and J. M. Thornton, Nature (London) 326, 347 (1987).
7 J. Greer, J. Mol. Biol. 153, 1027 (1981).
8 C. Chothia and A. M. Lesk, EMBO J. 5, 823 (1986).

METHODS IN ENZYMOLOGY, VOL. 202

Copyright © 1991 by Academic Press, Inc. All rights of reproduction in any form reserved.
