On the formation of protein tertiary structure on a computer.

Proc. Natl. Acad. Sci. USA Vol. 75, No. 2, pp. 554-558, February 1978

Chemistry

(protein folding/computer simulation/protein evolution/role of glycines)

ARNOLD T. HAGLER* AND BARRY HONIG† * Department of Chemical Physics, Weizmann Institute of Science, Rehovot, Israel; and † Department of Physical Chemistry, The Hebrew University, Jerusalem, Israel

Communicated by Cyrus Levinthal, November 17, 1977

ABSTRACT In this paper we carry out computer simulation studies of some of the factors responsible for protein tertiary structure. We show that it is possible to obtain (fold) a compact globular conformation from a sequence of amino acids consisting of only glycines and alanines. Our results indicate that glycines play a central role in stabilizing globular structures by facilitating the formation of turns and by destabilizing helical structures. Using this simple two-amino-acid representation, which serves as a control experiment, we are able to obtain a conformation that resembles the native structure of pancreatic trypsin inhibitor as closely as any obtained previously in folding studies. However, careful examination reveals that the true chain topology has not been reproduced here or in previous studies. We suggest that the discrepancies between calculated and observed structures are more significant than the similarities. The implications of these results for the validity of models for protein folding, the use of pancreatic trypsin inhibitor in folding studies, and the possible role of glycine in the evolution of protein structure are discussed.

There has recently been an enormous increase in the number and scope of computational studies related to the problem of how proteins acquire their tertiary structure (folding) (1, 2). For example, several attempts to fold a complete protein on a computer (3-7) have been reported. Pancreatic trypsin inhibitor (PTI) has been chosen for studies of this type because it is a small protein containing only 58 amino acids, and its x-ray structure is known (8). Burgess and Scheraga (5) obtained a starting conformation from empirical prediction algorithms of secondary structure and then minimized the total conformational energy. They emphasize the difficulties inherent in using minimization techniques to predict three-dimensional protein conformation. In a more recent series of papers Tanaka and Scheraga (6) have adopted Monte Carlo techniques and in a preliminary study have predicted a number of the long-range residue-residue contacts present in PTI. Levitt and Warshel (3), Levitt (4), and Kuntz et al. (7) have used extremely simplified representations of PTI, which, upon energy minimization, fold into globular structures that in some ways resemble the native protein. The most far-reaching claims of success have been made by Levitt and Warshel (3), and Levitt (4), who suggest that their technique may be used to simulate the entire folding pathway leading to a final conformation close to that of native PTI. The impression generated by these various simulations is that major progress has been made towards predicting the tertiary structure of a protein from its amino-acid sequence; i.e., the folding problem may be far more tractable than has generally been considered (9).

In light of these developments it seems important to carefully evaluate the extent to which computer-based models may be expected to represent the in vivo events associated with protein folding. In this study we report the results of energy minimizations performed on polypeptide sequences consisting of only alanines and glycines. The types of conformations obtained have important implications for the mechanism of protein folding, for the role of specific structural elements in determining three-dimensional conformation, and for the strategies required for its simulation on a computer.

We show that it is easy to obtain globular structures from a sequence containing only alanines and glycines. Our calculations of the conformational energetics indicate that glycines play a crucial role in protein folding, both by destabilizing helical structures (as has been indicated by statistical studies) and by stabilizing globular structures. Moreover, our results suggest the possibility of obtaining globular structures from a sequence consisting of only a limited set of amino acids containing glycine, which may have implications for the evolution of early proteins.

One of the major conclusions of this study is that the criteria that have been used to evaluate the success of most folding simulations have been overly permissive. Our argument has two parts. First, we show that it is possible to obtain a computed structure of PTI that satisfies all of the criteria that have been used previously to define successful folding simulations, from a sequence that would certainly not yield a PTI-like conformation in vivo. Specifically, we obtain a structure superficially resembling that of native PTI from a sequence containing only alanines and glycines. Our analysis of the factors that yield this result leads to the conclusion that many current models contain built-in features specific to PTI. Many of the positive results that have been reported are due entirely to these features and may thus be regarded as artifacts.

The second part of our argument is based on comparison of all simulated structures reported here and previously to the structure of native PTI. A careful examination reveals that despite superficial similarities to the native protein, all computed structures have fundamental flaws in that they fail to reproduce important features characteristic of the tertiary structure of native PTI. In fact, as shown in detail below, the computed structures that have been reported appear sterically inaccessible from the native conformation and would have to partially unfold before the correct conformation could be obtained.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U. S. C. §1734 solely to indicate this fact.

Abbreviation: PTI, pancreatic trypsin inhibitor.

Representation of PTI

In order to determine the minimum information required to produce tertiary structure with specific features (in this case some facsimile of native PTI), we limited ourselves to a representation involving only sequences of alanine separated by glycine. There are several reasons underlying this choice: (a) We want to choose the simplest representation possible without introducing any ad hoc assumptions or building any "models." (b) We pick alanine and glycine because we feel that these two amino acids alone, along with their relative positions in the chain, contain a great deal of structural information. Alanine, which is the simplest amino acid containing a β carbon, tends to form secondary structures (α helix and β strands). Glycine, with its greater flexibility, tends to disrupt secondary structure, and, more importantly, allows the chain to fold back on itself. (c) It is likely that the factors enumerated in (b) are the essential factors responsible for the results obtained in previous simulations.

In contrast to our analysis, Levitt and Warshel (3, 4), for example, have placed particular emphasis on the role of long-range "time averaged forces" between all 20 types of amino acids in the protein, ascribing the structures obtained in their folding simulation to these interactions. They draw conclusions as to the strategy to be used in attempting to simulate protein folding on the computer, and more importantly about the actual mechanism of protein folding itself. Thus, it is important to determine to what extent the crucial information for obtaining tertiary structure in this type of study results from a specific sequence or from the presence or absence of β carbons at particular positions along the chain.

Potential functions and initial configuration

As was the case in previous studies of PTI, we start with a completely extended structure. In fact, much of native PTI is composed of extended strands and thus much of its secondary structure is in fact defined in published studies, including our own, before folding begins (10).
The complete backbone of PTI is included in the calculation together with all CH3 groups, using standard amide geometry. No further assumptions as to the degrees of freedom needed to specify the chain conformation are made beyond those usually used in peptide conformational analysis (1, 11). Thus, we assume rigid, trans, planar peptide groups, with each residue having two torsional degrees of freedom, φ and ψ. The energy parameters used in the calculations were those derived previously for peptide systems. These parameters (1, 11) were shown to account for the packing modes and energetics of amide crystals. The 6-9 potential was used (11, 12), and no additional parameters or assumptions were introduced, again to avoid introducing possible artifacts. Minimization of the energy of the polypeptide with respect to the φ and ψ angles was carried out by the routine VA09A, available from the Harwell (England) subroutine library. This routine has proved its effectiveness in the minimization of lattice energies (12) as well as in other conformational calculations (3, 4).

For the folding simulations, alternate minimization and "kicking" steps were carried out. In order to "kick" the polypeptide out of a local minimum, the normal coordinates were incremented by a random number multiplied by a constant, as done by Levitt and Warshel (3) and Levitt (4). In practice, with the high "effective temperatures" used, this is essentially equivalent to the random search procedure discussed by Crippen and Scheraga (13).

On the success of folding simulations

The structure of native PTI (Fig. 1A) has been described in detail by Huber et al. (8). Its major feature is an antiparallel β sheet from residues 16 to 36 which is twisted by 180° (Fig. 2B). An α helix consisting of residues 47-56 is packed against the β sheet, as is the extended region from residues 7 to 13. A distinct hairpin turn is formed by residues 24-28, while less sharp turns that contain no local hydrogen bonds and are not "β type" are found around residues 12-17 and 36-40. Basically, the chain involves extended regions that fold back on themselves at each turn. The most unusual topological features are the "threading" (Fig. 2A) of the segment running from residues 14 to 25 through the covalent loop formed by the disulfide bond 30-51, and the 180° twist in the β sheet alluded to above.

FIG. 1. Stereo figures of the conformation of PTI, represented by connecting the Cα atoms of each residue. (A) Experimental structure. (B) Results of folding simulation of sequence of Ala and Gly with glycines placed in positions occupied by Asp, Asn, or Gly in native structure (see text).

In the first set of folding simulations attempted, we represented all glycines, asparagines, and aspartic acids as glycines [these are the amino acids that were ascribed a glycine backbone potential in the work of Levitt and Warshel (3, 4)]. All other residues were taken to be alanines. It should be noted that, when the sequence is represented in this way, glycines are present in each of the three major turns of PTI but are absent from the extended regions (10).

The results of the first folding simulation we attempted are shown in Fig. 1B. The energy of this conformation is 59 kcal/mol, as opposed to 195 kcal/mol for the fully extended minimized conformation (1 kcal = 4.184 kJ). It is clear that the conformation of Fig. 1B superficially resembles that of native PTI. A structure resembling an antiparallel β sheet has formed between residues 16 and 36, and the chain turns and folds back on itself at approximately the right places. This type of general agreement has been obtained in previous studies.
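The two-residue representation used in the first set of simulations can be written down in a few lines. The following is a minimal sketch in Python, assuming the PTI sequence as listed in the footnote to Table 1; the names `to_ala_gly` and `glycine_positions` are hypothetical helpers introduced only for illustration.

```python
# PTI sequence as given in the footnote to Table 1 (three-letter codes).
PTI_SEQUENCE = (
    "Arg-Pro-Asp-Phe-Cys-Leu-Glu-Pro-Pro-Tyr-Thr-Gly-Pro-Cys-Lys-Ala-Arg-Ile-Ile-"
    "Arg-Tyr-Phe-Tyr-Asn-Ala-Lys-Ala-Gly-Leu-Cys-Gln-Thr-Phe-Val-Tyr-Gly-Gly-Cys-"
    "Arg-Ala-Lys-Arg-Asn-Asn-Phe-Lys-Ser-Ala-Glu-Asp-Cys-Met-Arg-Thr-Cys-Gly-Gly-Gly"
).split("-")


def to_ala_gly(sequence, glycine_like=("Gly", "Asp", "Asn")):
    """Map every residue to Gly if it is in glycine_like, otherwise to Ala."""
    return ["Gly" if residue in glycine_like else "Ala" for residue in sequence]


def glycine_positions(reduced):
    """Return the 1-based positions of glycine in a reduced sequence."""
    return [i for i, residue in enumerate(reduced, start=1) if residue == "Gly"]


# Representation of the first simulation: Gly, Asp, and Asn become glycine.
reduced = to_ala_gly(PTI_SEQUENCE)
print(len(reduced), glycine_positions(reduced))

# Variant with glycine kept only where the native sequence has glycine.
gly_only = to_ala_gly(PTI_SEQUENCE, glycine_like=("Gly",))
print(glycine_positions(gly_only))
```

Passing an empty tuple for `glycine_like` gives the all-alanine chain, so the same helper covers the three glycine placements compared in Table 1.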
An interesting feature of our simulated structure is that a helical structure has formed in the COOH-terminal region, although it is a left-handed rather than an αR helix. The helix is packed against the β sheet as it is in native PTI and many other proteins.
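The alternation of minimization and "kicking" steps used to obtain such structures can be illustrated on a toy problem. The sketch below is not the procedure of the paper: it assumes a hypothetical one-dimensional energy with two minima in place of the chain energy, uses plain gradient descent in place of VA09A, kicks the coordinate itself rather than the normal coordinates, and keeps a kicked structure only if its minimized energy is lower (an acceptance rule chosen here for simplicity).

```python
import random


def energy(x):
    """Toy energy with a local minimum near x = +1 and a lower one near x = -1
    (hypothetical stand-in for the conformational energy of the chain)."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def gradient(x):
    """Analytical derivative of energy(x)."""
    return 4.0 * x * (x * x - 1.0) + 0.3


def minimize(x, step=0.01, iterations=2000):
    """Fixed-step gradient descent; a simple stand-in for a library minimizer."""
    for _ in range(iterations):
        x -= step * gradient(x)
    return x


def fold(x0, kick_scale=1.5, cycles=200, seed=0):
    """Alternate minimization and random kicks, keeping the lowest minimum found."""
    rng = random.Random(seed)
    best = minimize(x0)
    for _ in range(cycles):
        # Kick: increment the coordinate by a random number times a constant.
        kicked = best + kick_scale * rng.uniform(-1.0, 1.0)
        trial = minimize(kicked)
        if energy(trial) < energy(best):
            best = trial
    return best


start = minimize(1.0)
final = fold(1.0)
print("first minimum:", round(start, 3), "energy", round(energy(start), 3))
print("after kicking:", round(final, 3), "energy", round(energy(final), 3))
```

A single minimization from x = 1 stays in the nearby local minimum; only the kicks allow the lower minimum to be reached, which is the role the kicking step plays in the folding simulations.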


FIG. 2. Stereo figures of important topological features of PTI. (A) Threading of 30-51 loop by chain. Residues 10-51 are shown along with the 30-51 disulfide bond. (B) The 180° twist in the β structure running from residues 16 to 36. Hydrogen bonds are indicated by broken lines.

The root mean square deviation from the native structure obtained, 6.2 Å, is within the range used to define successful folding simulations in previous studies (3, 4, 7). An additional run with a different set of random numbers led to a structure with many of the same features as those seen in Fig. 1B, and with a root mean square deviation of 6.8 Å from the native. These results show that a polypeptide of 58 amino acids consisting of just alanines and glycines can be folded on a computer to a conformation containing many of the elements of tertiary structure normally associated with proteins. Moreover, the computed structure bears a resemblance to PTI, as do many of the conformations of PTI obtained by other workers (3, 4, 7).

However, as noted above, a close comparison of the various computed structures with native PTI reveals differences that, it appears to us, are more significant than the similarities. In particular, neither the threading of the NH2-terminal sequence through the 30-51 loop (which is here hypothetical because no disulfide bonds are actually formed) nor the 180° twist in the β sheet is reproduced. In fact, none of the published folding simulations have yielded these features, so that all attempts to date to fold PTI on a computer have failed to predict key elements of its tertiary structure. Moreover, none of the computed structures would fold correctly even if more accurate potentials were introduced after the folding step, because a major rearrangement is required in order to thread the chain correctly. Because a rather large number of folding simulations have been attempted, the failure to obtain even a single positive result with regard to these features may imply that a very fundamental error is inherent in these studies.

Before commenting on this error and its implications, it is important to consider why it is possible, with simplified representations, to produce conformations that bear even superficial similarities to native PTI, and what, if any, significance this resemblance has. In fact, our results provide an immediate answer to this question. PTI is a small protein that can be roughly described in terms of four short, nearly parallel strands (residues 7-14, 15-26, 28-38, 39-45), and a COOH-terminal α helix (residues 47-56). To achieve a structure fitting this approximate description requires little more than building in the (three) turns in the right places and ensuring that the regions between the turns have a roughly extended conformation. This is particularly true when starting from the favorable (from the point of view of PTI) extended conformation. The general similarity obtained is related entirely to the definition of the backbone in terms of two types of residues, one that favors turns in the chain at the correct place and the other that favors extended regions. It appears, then, that the "time-averaged" potentials that have been introduced to represent side chain interactions (3, 4) have little to do with the quality of the final result. Our results demonstrate that it is possible to obtain a structure that resembles the native conformation from a sequence that contains essentially no relation to the true primary structure and therefore no information relevant to the way it actually folds in vivo. Clearly, using overly permissive criteria for the success of a folding simulation can lead to incorrect conclusions as to the validity of a particular representation of protein structure. The features of PTI that were not reproduced by the Ala-Gly sequence (or any other representation) were the 180° twist in the β structure and proper threading of the chain.
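To make the root mean square criterion discussed above concrete, the following Python sketch computes a distance-based root mean square deviation between two sets of Cα coordinates. This is an assumption made for illustration: it compares all interatomic distances and so needs no superposition, and it is not necessarily the exact measure used in the studies cited. The function names are hypothetical.

```python
import math


def distance(p, q):
    """Euclidean distance between two points given as (x, y, z) tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_rms(coords_a, coords_b):
    """Root mean square difference between corresponding interatomic
    distances in two structures with the same number of atoms."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must contain the same number of atoms")
    n = len(coords_a)
    if n < 2:
        raise ValueError("at least two atoms are required")
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = distance(coords_a[i], coords_a[j]) - distance(coords_b[i], coords_b[j])
            total += diff * diff
            pairs += 1
    return math.sqrt(total / pairs)


# Small made-up example: a bent four-atom chain and its mirror image.
chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (3.8, 3.8, 3.8)]
mirror = [(-x, y, z) for (x, y, z) in chain]
straight = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]

print(distance_rms(chain, mirror))
print(distance_rms(chain, straight))
```

A measure of this kind gives zero for a structure and its mirror image, so it cannot detect features such as the handedness of a helix or of a twist. This illustrates why a low deviation alone does not establish that the chain topology has been reproduced.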
This, then, suggests that if PTI is to be used as a system for developing folding algorithms, it is precisely these more subtle features, and their underlying cause in terms of primary structure, that must be reproduced and understood. The recent experimental studies of Creighton (ref. 14 and references therein) may provide extremely useful information for this task. Creighton has found (15) that PTI must form an intermediate containing an "incorrect" disulfide bond before the native conformation is reached. Furthermore, an intermediate containing two correct disulfide bonds cannot fold to the native structure without first undergoing rearrangement. It appears then that quite specific directed interactions are operating throughout the folding pathway and, as suggested by Levinthal (16), that a significant fraction of the information present in the primary sequence is used in defining this pathway. In the case of PTI the achievement of proper threading appears to require an intricate mechanism, associated perhaps with the formation of the "incorrect" disulfide intermediate (14). It is possible that the simulation of such a specific pathway will require the introduction of highly detailed interactions at relatively early stages in folding. This point of view should be contrasted with the suggestion that folding may be represented in terms of several distinct stages, requiring the successive addition of greater detail into the potential functions as the final conformation is approached. While it is difficult to draw general conclusions, it is clear that crude representations of early steps in folding have biased current computer simulations to incorrect local minima. It is, of course, not at all clear that the simulation of an entire pathway is the most plausible approach for the prediction of tertiary structure. One might, for example, ignore the details of the pathway and try to predict the final structure directly. 
This philosophy is implicit in the Monte Carlo studies of Tanaka and Scheraga (6) and in the recent folding simulation of Kuntz et al. (7). The best structure obtained by Kuntz et al. (7) had a root mean square deviation of only 4.7 Å, which may be due in part to the imposed requirement that the correct disulfide bonds be formed. Kuntz et al. (7) present a thoughtful discussion of the criteria needed to properly assess the success of folding simulations.

The role of glycine in protein folding

We have demonstrated in this work that a remarkably simple sequence consisting of just glycine and alanine can produce, at least on the computer, a globular structure with many of the features characteristic of native proteins. In order to obtain further insights into the factors involved, a number of other computer experiments were performed and are summarized in Table 1. All chains in the table contain only alanine and glycine and are 58 residues long. The positions of the glycines are indicated by the corresponding amino acids they replace in the primary sequence of PTI. In all cases, the final conformations generated were similar to those of the initial structure because only a single minimization was carried out.

Table 1. Conformational energies of various Gly-Ala sequences

Initial structure    Gly position in       Energy, kcal/mol
                     PTI sequence*         Initial    Final
Extended             Asp, Asn, Gly           333       195
Extended             Gly                     331       197
Extended             None (all Ala)          330       202
Helix                Asp, Asn, Gly          3207        52
Helix                Gly                    3643        38
Helix                None (all Ala)         3727        30
Folded (Fig. 1B)     Asp, Asn, Gly            59        59
Folded (Fig. 1B)     Gly                     225        93
Folded (Fig. 1B)     None (all Ala)          343       113

* Arg-Pro-Asp-Phe-Cys-Leu-Glu-Pro-Pro-Tyr-Thr-Gly-Pro-Cys-Lys-Ala-Arg-Ile-Ile-Arg-Tyr-Phe-Tyr-Asn-Ala-Lys-Ala-Gly-Leu-Cys-Gln-Thr-Phe-Val-Tyr-Gly-Gly-Cys-Arg-Ala-Lys-Arg-Asn-Asn-Phe-Lys-Ser-Ala-Glu-Asp-Cys-Met-Arg-Thr-Cys-Gly-Gly-Gly.

The importance of placing glycines in appropriate positions is apparent. Minimization from an extended structure leads to conformations which are not strongly influenced by the presence or absence of glycines along the chain. On the other hand, glycines clearly play a major role in stabilizing the folded conformation because their removal leads to much higher energy structures. For example, when aspartic acids and asparagines are represented by alanine rather than glycine, the energy of the folded conformation of Fig. 1B is increased from 59 kcal/mol to 225 kcal/mol and could not be reduced to less than 93 kcal/mol by subsequent minimizations. In another set of experiments we used a starting conformation corresponding to that of an α helix. As can be seen in Table 1, when all residues were taken to be alanine, a stable conformation of 30 kcal/mol (compared to 93 kcal/mol for the folded conformation) was obtained following minimization. This helical structure is clearly destabilized by the presence of glycines, which raise its energy to 52 kcal/mol, close to that of the corresponding folded conformation (59 kcal/mol).
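The comparisons drawn from Table 1 amount to simple differences between final energies. As a short worked example, the Python sketch below tabulates the "Final" column of Table 1 and prints, for each glycine placement, how far the folded structure lies above the helix and how far the extended structure lies above the folded one. The key "none" is an assumed label for the rows of the table that list no glycine positions, and `gap` is a hypothetical helper.

```python
# Final energies (kcal/mol) from Table 1, keyed by initial structure
# and by the residues of native PTI that are represented as glycine.
FINAL_ENERGY = {
    "extended": {"Asp, Asn, Gly": 195, "Gly": 197, "none": 202},
    "helix": {"Asp, Asn, Gly": 52, "Gly": 38, "none": 30},
    "folded": {"Asp, Asn, Gly": 59, "Gly": 93, "none": 113},
}


def gap(structure_a, structure_b, placement):
    """Energy of structure_a minus energy of structure_b for one glycine placement."""
    return FINAL_ENERGY[structure_a][placement] - FINAL_ENERGY[structure_b][placement]


for placement in ("Asp, Asn, Gly", "Gly", "none"):
    print(
        placement,
        "| folded - helix:", gap("folded", "helix", placement),
        "| extended - folded:", gap("extended", "folded", placement),
    )
```

With the tabulated values, the folded structure lies 83 kcal/mol above the helix when no glycines are present but only 7 kcal/mol above it with the full set of glycines, which is the quantitative content of the statement that glycines destabilize the helix and stabilize the folded form.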
Thus, the role of glycine as a helix breaker found in statistical studies is also reflected by these calculations. The role of glycines in stabilizing folded structures results from their ability to assume conformations that are forbidden to other residues and which facilitate the formation of turns. In our calculations this arises from their tendency to form C7 structures (φ ≈ +80°; ψ ≈ −80°) that correspond to one of the four local minima on their Ramachandran map. This computational result that a polypeptide sequence tends to form turns at glycine positions provides the theoretical demonstration of this characteristic behavior, using realistic potentials that reflect the structure of the glycine residue. It should be emphasized that the result is not an artifact because the turn is not imposed by local potentials. Rather, the turns that form in the simulation are a consequence of long-range interactions providing a driving force for globularity combined with the ability of glycines to assume bent conformations. In the Levitt-Warshel studies (3, 4) the chain also bends at the glycines, but this appears to be a direct consequence of the unusual backbone potentials they used. That is, a potential with only a single minimum in a conformation corresponding to a turn was used for glycines, while a single minimum in a conformation corresponding to β structure was used for alanines. A justification for the removal of other minima in the Ramachandran map is not given.

The results obtained here suggest that glycine may have played a central role in the evolution of early proteins. First, glycine residues destabilize helical structures that lack the complex geometry required for subtle biological functions. Second, glycines tend to allow the formation of turns, which are an inherent property of any globular structure. In this way glycines stabilize globular structures, which is reflected in Table 1 by the steep increase in the energies of the folded structures as glycine is replaced by alanine. Of course, many turns in proteins do not contain glycine, but these, based on statistical studies, are dependent on a specific sequence containing more than one amino acid. Glycine is the only amino acid that, alone, allows sharp reversal in chain direction.
This last feature suggests that specific glycines play an important role in a folding pathway and may explain why glycines in certain positions are strongly conserved in evolution (17).

It is generally believed that the probability of obtaining an active protein from a random sequence of amino acids is vanishingly small. While this is certainly the case, our results lead us to the conjecture that the criteria for obtaining globularity are less stringent than for activity and that many sequences, given a favorable distribution of glycines, might be expected to assume a well-defined tertiary structure rather than an ensemble of conformations. Efforts to simulate prebiotic synthesis (see, for example, discussions in ref. 18) indicate that glycine was a common amino acid in the primordial environment, and this may have favored the formation of folded structures from essentially random sequences. Prebiotic experiments also suggest that a limited number of amino acids were abundant early in evolution. On the basis of our results, it seems plausible that tertiary structure can be obtained from a sequence consisting of only a limited set of amino acids. It is of major interest, in this respect, to determine whether relatively small polypeptides of this size (58 residues) containing a limited number of amino acids can form a stable globular structure.

Some of the ideas discussed can be tested by synthetic studies. Two major questions that can be addressed in such studies are: (i) Can a chain of approximately 60 residues containing only three or four amino acids form a unique, stable tertiary structure in aqueous solution? (ii) Are the presence and location of glycines as crucial as our results indicate in determining the turns, and through them, the final conformation?

We thank Ms. Ruth Sharon for her invaluable aid in the programming of the algorithms and computations. We are indebted to D. Osguthorpe for help in preparation of the figures. We are very grateful to Drs. T. Creighton, C. Levinthal, S. Lifson, J. Moult, and J. Sussman for their thoughtful comments on the manuscript. This work was supported in part by the Israel Academy of Sciences (A.T.H.) and by the U.S.-Israel Binational Science Foundation, Jerusalem (B.H.).

1. Anfinsen, C. B. & Scheraga, H. A. (1975) Adv. Protein Chem. 205, 300-331.
2. Hagler, A. T. & Lifson, S. (1978) in The Proteins, eds. Neurath, H. & Hill, R. L. (Academic Press, New York), 3rd Ed., Vol. 5, in press.
3. Levitt, M. & Warshel, A. (1975) Nature 253, 694-698.
4. Levitt, M. (1976) J. Mol. Biol. 104, 59-107.
5. Burgess, A. & Scheraga, H. A. (1975) Proc. Natl. Acad. Sci. USA 72, 1221-1225.
6. Tanaka, S. & Scheraga, H. A. (1975) Proc. Natl. Acad. Sci. USA 72, 3802-3806.
7. Kuntz, I. D., Crippen, G. M., Kollman, P. A. & Kimelman, D. (1976) J. Mol. Biol. 106, 983-994.
8. Huber, R., Kukla, D., Ruhlman, A. & Steigemann, W. (1971) Cold Spring Harbor Symp. Quant. Biol. 36, 141-148.
9. Schultz, G. E. (1977) Angew. Chem. Int. Ed. Engl. 16, 23-32.
10. Honig, B., Ray, A. & Levinthal, C. (1976) Proc. Natl. Acad. Sci. USA 73, 1974-1978.
11. Hagler, A. T., Huler, E. & Lifson, S. (1974) J. Am. Chem. Soc. 96, 5319-5327.
12. Hagler, A. T. & Lifson, S. (1974) J. Am. Chem. Soc. 96, 5327-5335.
13. Crippen, G. & Scheraga, H. A. (1971) Arch. Biochem. Biophys. 144, 453-461.
14. Creighton, T. (1975) J. Mol. Biol. 95, 167-199.
15. Creighton, T. (1977) J. Mol. Biol. 113, 175-193.
16. Levinthal, C. (1968) J. Chem. Phys. 69, 44-45.
17. Rossman, M. G. & Argos, P. (1977) J. Mol. Biol. 109, 99-129.
18. Margulis, L., ed. (1977) Origins of Life (Gordon and Breach, New York).
