Calculation of protein tertiary structure.

,I. Mol.

Riol. (1976)

106, 983-994

Calculation

of Protein Tertiary Structure

1. D. KUTNTZ, G. M. CRIPPEN, P. A. KOLLMAN Department 1inGxmity

AND D. KIMELM~N

of Pharmaceutical Chemistry, Xchool of Pharmacy #an Francisco, Calif. 94143, U.S.A.

of Cahfomia,

(Received

9 March

1976)

We describe a method for calculating the tertiary structure of proteins given their amino acid sequence. The algorithm involves locally minimizing an energylike expression as a function of the Cartesian co-ordinates of the Co of all residues. .Uhough the approximation to the true polypeptide geometry and conformat)ional energies is extremely approximate, quite respectable results have been ejbtained for the small proteins rubredoxin and trypsin inhibitor, where the root

mean

square

errors

are

as low

as 4.0 A and

4.7 .i,

respectively.

1. Introduction There is a developing interest in extremely simplified models to predict the structure of small globular proteins (Levitt & Warshel, 1975; Ptitsyn & Rashin, 1975; Burgess 6 Scheraga, 1975). These models represent the protein backbone and side chains as one or two “united” atoms per amino acid residue. In some cases, entire secondary structural elements such as a-helices are treated as single units. Approximate. empirical potential functions are used for the steric and energetic requirements of the peptide backbone and for selected long-range interactions between amino acid sidechains. Empirical hydrophobic interactions are frequently included. The most successful of these calculations, by Levitt & Warahel, can “fold up” pancreatic trypsin inhibitor (58 amino acids) starting from an extended chain conformation. A number of compact conformations can be produced, depending on the computational details. These structures compare moderately well with the known crystal structure (Huber et al., 1972), having r.m.s. errors of 5 to 7 At. This is a quite respectable result, given the major approximations used. and holds promise for a real advance in our understanding of protein structures. In this paper, we describe a calculation that shares many of the features of that of Levitt & Warshel, although our formulation of the problem has some novel aspects. We focus on the development of a model that is sufficient t’o generate compact structures with moderately good correspondence to crystallographic data. We try to incorporate, in a computationally useful way, the factors that appear to govern protein folding (Kauzmann, 1959; Kuntz, 1972). We will not attempt to determine the “correct” forces that operate for real proteins, and we do not suggest that t,he interactions we have chosen to work with are faithful representations of actual tThe error is determined from the sums of the squares of the differences for all corresponding (‘p~-(‘,q distances between the calculated and crystal structures (Levitt & Warshel use C’, (‘,, distances in this computation). Of course, c’, co-ordinates are used for Gly residues. 983

984

I.

D.

KUNTZ

ET

AL.

physical forces. Nor are we concerned with simulation of the protein folding process, per se (that is, the pathway of folding). Rather we ask: what is the minimum input required to “predict” the tertiary structure from the amino acid sequence! We are also concerned that any calculation should be sufficiently simple, flexible, and eflicient that we can examine a number of different proteins of various sizes,potential functions, and geometric constraints. We have had some successin meeting these objectives for two small proteins: trypsin inhibitor and rubredoxin. We can generate a number of structures for each protein with r.m.s. errors of 4.7 to 6.5 A for trypsin inhibitor and 4.0 to 6.0 A for rubredoxin. These compact structures are forced to have the proper disulfide (trypsin inhibitor) and sulfur-iron (rubredoxin) bonding. The backbone turns and major loops are not forced and occur in the proper places. There is some resemblance to the crystal structures (Huber et al., 19’72; Watenpaugh et al., 1972). The computer program takes approximately 30 secondsper protein on a large computer (Control Data 7600, Lawrence Berkeley Laboratory). It is written in Fortran and has not been optimized for efficiency.

2. Protein

Model

In this section we describe the main features of our model for proteins. We point out the more unusual features of the model, with a brief motivation for the novel aspects. (1) The n amino acids of the protein (n = 58 for trypsin inhibitor; n = 53 for rubredoxin) are represented by n spheres located at the beta carbon positions, see below. Each sphere has a radius roughly equivalent to the van der Waals’ radius, V: of the appropriate side chaint. (2) We select as independent variables the 3n Cartesian co-ordinatesof the residue spheres. This choice has a number of implications for the rest of the model. Our basic reason for using Cartesian co-ordinates rather than angular variables was the assumption that the optimization process would be more eflicient. (A. Warshel & M. Levitt Quantum Chemistry Program Exchange, no. 247; L&on & Warshel, 1968; Wilde & Beightler, 1967). This hope has been at least partly realized as we shall see. (3) Residuesare allowed to move subject to a set of constraints. Violation of these constraints contributes to an error function that is the actual quantity minimized during the calculation. Each constraint is meant to “mimic” some important aspect of the folded protein structure, but there is considerable overlap between some of the constraints, and, as alluded to earlier, they are not directly related to conventional interatomic forces. We now use five constraints, listed below. (A) The distances between the ith and jth residue, rij, for j = i + 1 (that is for residuesadjacent in sequence), are allowed to be 3.8 to 4.9 A without penalty. It is this constraint that defines the residue locations as the beta carbons; for alpha tWe use the following values for the radii & Rubin, 1971) in Angstrom units: Lys,

4.3;

Arg,

of the residues

(i.e. sideohain

4.3;

Glu,

His,

Ser, 3.5; Thr,

3.8; Gin,

Ala,

3.0;

Pro,

3.5; Cys,

Leu,

4.2;

He, 4-2; Phe,

4.0;

4.2; Am,

4.2;

Asp,

4.0;

Gly,

and backbone) 4.0; 2.0;

3.0; Met,

3.8; Val,

4.0;

4.6; Trp,

6.0; Tyr,

4.6.

(Krigbaum

CALCULATION

OF

PROTEIN

TERTIARY

carbons, the separation would always be 3.8 8. Nearest-neighbor this range contribute an error proportional to

where

L, =

986

STRUCTURE

distances

outside

3.8, for rij < 3.8 a 4.9, for rij > 4.9 A , 0, for 3.8 _( rij 2 Ii4.9

This term defines the potential that we use to fix the local chain geometry. It contains little direct information about conventional secondary structures. In addition, the rt,i+z and riyii3 are constrained to be in the range allowed for helix and extended chain conformations. This extra set of restrictions is included in the results, below, but does not have an appreciable impact on the r.m.s. error. (B) Residue interactions are described by a matrix of interaction coefficients, clj, Table 1. Negative signs indicate attractive interactions. Interactions between ionic residues are in the upper left corner of the Table, interactions between the hydrophobic residues are in the lower right corner. Hydrophobic residues attract each other, as do appropriately charged ionic sidechains ; ionic-hydrophobic interactions are repulsive, as are those for polar-hydrophobic residues. Polar-ionic interactions are weakly attractive. We take the amino and carboxyl terminal residues t,o be ionic. These coefficients were chosen to allow hydrophobic terms of approximately equal strength, on the average, to the ionic interactions. The hydrophobic terms progress roughly as the geometric mean of the number of carbon atoms in the intera,cting side chains. No attempt at systematic adjustment has been made. Pairs of residues with G,~less than zero are constrained to be within 10 A of each other. while cij values greater than zero constrains the residues to be further apart than 15 8. If cii equals zero, the inter-residue separation can have any value without penalty from this term. The error penalty for violation of this constraint is of the form: -cij. (rFj - LB2), where L, is either 10 A or 15 A as appropriate. Note that this function depends quadratically on the inter-residue distance providing a much longer range contribution than the usual formulation of interactive potential functions. Our expression does serve t’o pack attractive residues near each other and to tbrce repulsive residues apart. (C) “Hydrophilic” residues, defined as residues 1 to 13 in Table 1. are constrained to be more than S L%from the center of mass, where the constant S term roughly determines the size of t’he hydrophobic core?. We set S = 10 A based on examination of the polar and non-polar distributions in rubredoxin and trypsin inhibitor. Clearly, S is expected to be a function of protein size. The error contribution for placing hydrophilic residues within the core is proportional to (S2- df) where d, is the distance from the ith residue to the center of mass. (D) Residues should be non-overlapping, using the van der Waals’ radii to define the size of each residueI. Residues that are too close together are penalized as I’~,-& where Vii is the sum of the van der Waals’ radii of the ith and jth residues. (E) We form the disulfide bonds in trypsin inhibitor and the sulfur-iron bonds in ruhredoxin by providing an error term if r,, is larger than a characteristic length, tNiote that Cys is included. Disulfide bonds in trypsin inhibitor doxin are observed to be on the “outside” of the molecules. 3 See footnote to p. 984.

and the S-Fe

cluster

in rubre-

tC!ys

25 25 25

0 0 0 10 10 10 15 15 15 15 -5

-10 -10 -5 -5 -5 -5

interactions

Lys Arg His Glu ASP Ser Thr Am Gin GUY Ala Pro cys Met Val Leu Ile Phe Trp Tyr

Lys

are

25 25 25 - 10 -10 -5 -5 -5 -5 0 0 0 10 10 10 15 15 15 15 -5

Arg

- 1 if not

25 25 25 -10 -10 -5 -5 -5 -5 0 0 0 10 10 10 15 15 15 15 -5

His

paired,

- 10 -10 -10 25 25 -5 -5 -5 -5 0 0 0 10 10 10 15 15 15 15 - 10

Glu

-- 100

25 25 -5 -5 -5 -5 0 0 0 10 10 10 15 15 15 15 -10

-10

-10 - 10

Asp

0 0 0 2 2 2 2 2 2 2

if paired.

-3

-5 -5 -5 -5 -iTi -5 -5 -5 -5

SW

-5 -5 -5 -5 -5 -5 -5 -5 -5 0 0 0 2 2 2 2 2 2 2 -- 3

0 0 0 2 2 2 2 2 2 2 - 3

Am

-5 -5 -5 -5 -5 -5 -5 -5 -5

Thr

-3

-5 -5 -5 -5 -5 -5 -5 -5 -5

Gln

0 0 0 2 2 2 2 2 2 2

Interaction

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

WY

TABLE

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Ala

coefficients

1

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

PI.0

-1 -1 -1 -1 -1 -1 -4

t”

10 10 10 10 10 2 2 2 2 0 0

C!y3

-8 -8 -8 -8 -16 -16 -4

.-:

10 10 10 10 10 2 2 2 2 0 0

Met

-8 -8 -12 -12 -20 -20 -10

10 10 10 10 10 2 2 2 2 0 0 0 -1

Val

-8 -12 -16 -16 -30 -30 -15

15 15 15 15 15 2 2 2 2 0 0 0 -1

Leu

-8 -12 -16 -16 -30 -30 -15

15 15 15 15 15 2 2 2 2 0 0 0 -1

Ile

- 16 -20 -30 -30 -50 -60 -15

15 15 15 15 15 2 2 2 2 0 0 0 -1

Phe

- 16 -20 -30 -30 -60 -80 -20

15 15 15 15 15 2 2 2 2 0 0 0 -1

Trp

-10 - 10 -3 -3 -3 - 3 0 0 0 -4 -. 4 - 10 -15 ~ 15 -- 20 - 20 -30

1;

-5

TY~

CALC’CLATION

OF

PROTEIN

TERTIARY

STRI~t”LYlJRE

9s;

L,. taken

as 6 A, whenever residues i and j are appropriate members of disultide pairs or are Cys residues in rubredoxin. We note that all Cys residues in rubredoxin are bonded to the iron. The error is proportional to (r:j - Lz)‘. Our present calculation does not contain any representation of hydrogen bonds or dispersion forces. Electrostatic contributions are crudely handled by term R. abow. hut no partial charges are permitted. Our expectation was that the interaction matrix, as written. would lead to a hydrophobic core while t.he ion&non-polar repulsions and the constraints on the hydrophilic residues would generate a “hydrophilic” shell. These expectations are largely realized in the structures w shall describe.

3. Computational The

calcrilatjioll

is performed

in the T =

following

(x.lr yl,

Method way.

For

21, . . ., .r,,, y,,

a ,gi\-tL11 cc,rlforlll~rtic,ll. 2,).

expressed as the set of Cartesian co-ordinates of the n spheres (.ri, yi. zi, for i : 1. . . ., jr). we calculate 2 quantities: the error funct,ion E(T), and the 3n component gratlit~r~t vtvtor. G(T) = VE. The error function is made up of several t,erms, as outlined abo\.c>. For con\.enienct~ wc as tllr pre\%nls disindex these terms with the subscripts A, . ., E, in the same ordt,r cussion. Then,

where the e terms are the individual error terms described earlier, and the Cl. trrlns ar(’ empirically determined weighting factors. The numerical constants used for the strurtllrc of Figures 1 and 2 are: WA = 100, W, = 25, Wc = 100, U’, : 100. II’, = 100. The components of the gradient vector can be readily determinrd analyt,irally. givcbn the functional forms presented in the last section. The error function is minimized with respect to tllc Cartesian co-ordinates of (sac11 residue using a simple steepest descent minimizer routine without line research. That is, only one step is taken along the direction of tile steepest, descent before the gradient is recalculated. The step size for the next step is allowed to increase if the previous iteration was successful, otherwise it decreases until it, reaches a predetermined minimum level. At tllis point, the error associated with the first term, A (the term dealing with the nearestneighbor distances), is examined. If this sum is less than a fixed limit (current,ly t,he allo\{ able average error in the i, i f 1 distances is about 043 -h), the calculation termina.trs. Otherwise, the weighting factor for this term, H’*. is doubled and the calculation con Gnues. Typically 2 to 5 complete cycles (approx. 200 to 500 iterations through t,he minimizar) are needed to collapse from an extended chain starting conformation to a rompact structure with acceptable nearest-neighbor distances. The present version of tile prograIn requires less than 0.05 s/minimizer iteration on the CDC 7600. We review the unusual features of this calculation compared with t.he more c~ollvr,t~t,ioll;lI csnergy minimization procedures (Levitt & Warshel, 1975 ; Scheraga, 1 I)‘? 1). (1) The use of Cartesian co-ordinates instead of internal angular co-ordinates appears to generate a smooth surface for the error function that permits rapid conrergenrc’. No local minima are detected until a quite compact conformation is achieved for tile small proteins we have studied. The ease of computing a simple analytical gradient vt,ctor IS also an important feature of this choice of co-ordinates. We are thus able to use a particlelarly simple optimizing algorithm. A disadvantage of this approach is that it, is easy to generate larger local geometry distortions that are physically reasonablr, as if t’hc virtual CR---CO bonds were quite “rubbery” during the early phases of the calculatioll. \Yhile thts

988

I. D. KUNTZ

ET

AL.

effect is a distinctly non-physical one, the local distortions are readily removed in later steps. (2) We do not calculate an energy function. Nor is there a simple proportional factor to convert the total error function into energy units. One can, however, think of energy

analogsfor each of the terms we have described.Thus, the nearest-neighborterm corresponds to bond stretching. The interaction coefficients clearly act as “conventional” residue-based pair-wise energies, except that the shape of the interactive “potential” is of a parabolic type. Our van der Waals’ overlap expressionis a “soft” sphereapproximation, since the error increases only as I/?, instead of, say, I/@. Our third term is an approximation to water-polar residue interactions. It favors a spherical hydrophilic shell that is only a fair approximation to the ellipsoidal protein molecules. There is, of course, no direct physical analog for the disulfide pairing function we employ. It cannot be thought of as the energy of the S-S bond, because it operates at long distances as a very specific, attractive force. While this term does not dominate the folding process, see below, it

clearly contributes to the overall successof our procedure.

4. Results We make use of distance and direction map representations to present our results. These maps are independent of rotation and translation of the molecular framework and thus serve as a convenient method of comparing three-dimensional structures (Rossmann & Liljas, 1974; Kuntz, 1975; Crippen & Kuntz, 1976). We also inolude error plots that subtract the native protein inter-residue distances from those for the calculated structures. Table 2 summarizes a number of secondary and tertiary structural features for the native and calculated conformations of trypsin inhibitor and rubredoxin. Two calculations are shown for each protein: in each case the first map is the result when identical parameters are used for both proteins, while the second map is the best structure obtained thus far for the particular protein (Figs 1 and 2). These structures were generated from an extended starting conformation in which each residue was placed at 3.8 A from the preceding residue, the step being made cyclically along the X, Y, Z axes. Inspection of the Table and the Figures showsthat the major bends in the peptide chains, for both proteins, are quite well-represented in the calculated structures. There are some stretches of crude secondary structure, better defined in trypsin inhibitor than in rubredoxin, but the match is only fair. The intermediate level tertiary structure, expressed in terms of specific large loops (Table 2) and in terms of the general off-diagonal pattern of the distance maps, is obviously similar. The major defects in the computed structures include (seeerror plots and Fig. 2) : (1) the “hairpin” loops are generally too compact ; (2) the turns themselves are too gradual; (3) a few long range contacts are not close enough in the calculated structures, particularly residues5 to 43 and 23 to 43 in trypsin inhibitor and residues 1 to 28 in rubredoxin. Generally, the distance maps for the calculated structures are less sharply delineated than those for the proteins themselves. This feature, and the relatively poor correspondenceof the maps near the diagonal come about becauseof the lack of good secondary structure potential functions in our procedure and the lack of concerted motion of the residues. Inspection of the stereo pictures shows numerous defects in the calculated structures, the principal ones being poor correspondence of packing of some of the large loops. The r.m.s. errors for our best trypsin inhibitor structures are 4.7 to 5.0 A, comparing favorably with the best structure from Levitt and Warshelt (r.m.s. error, 5.3 A). A t See footnote to p. 983.

47-56

/I /3 j3 p p

6-12 16-22 25-29 3641 47-55

5.08

3-27, 5-22 15-38 22-45 3652’

‘cm,’

l-5

5 13 26 33 45

9.9

Trypsin inhibitor Model ( l)a

of structural

p B p p

4.70

5-23, 5-32 13-40’ 23-46 32-46, 3&53’

“a”

5 12 26 36 46

9.3

(2)”

for m.odel

1620 26-33 37-42 4C-55

l-5

Model

features

2

3.32

1-14 15-18 19-38 30-46 33-46

11-13 17-19

4-68

7 15 21 27 34 40 46

9.7

Native

/3 B

( l)a

4.31

1-14 13-30 21-36 30-52 36-45

34-36 41-44

p p

l-4 p 8-12 /9

6 14 22 29 33 39 47

9.8

Rubredoxin Model

and nakive conformations

9.4

3.99

1-14 14-29 23-37 39-46 36-46

34-37

l-4

7 l&17 23 27 32-34 39-41 47

Model

j3

p

(2)”

a Structures numbered 1 were obtained using identical parameters. The second structure in each pair (2) has the lowest r.m.s. error obtained for that protein. b Average of the distances of each residue from the center of mass. c Identified from near-diagonal features of distance maps (see also Kuntz, 1975). d Assignments for native proteins taken from refs (Huber et al., 1972; \Vatenpaugh et aZ., 1972; Chou & Fasman, 1974); a is helical, /3 is extended, “a” for the model structures refers to compact regions of chain but, is not meant to imply precise “helical” co-ordinates. e Major off-diagonal contacts, taken from distance maps; these are the smallest loop structures. Larger loops, incorporating these, can also be identified. f Reflects disulf?de bridge formation; none of the loops listed for rubredoxin includes S--Fe-S bridges. g Error as calculated in text and footnute to p. 983; the error listed for native structures is the error obtained when the native st,ructure is used as the starting conformation. The other entries in the native columns are based on the crystallographic coclrdinates.

2.28

(A)

16-24 27-36

r.m.s

b fl IX

u.

2-6

5 13 23 36 45

10.3

5-23 14-38’ 22-43 30-51’

error=

(A)

Major 1oopse

Secondary structured

TWX&

Sizeb

Native

Comparison

TABLE

FIG. 1. The upper tbe darkest symbols larger than 12 A; B structures as defined cate calculated-native of the backbone, the alignment; E is the

the row refers to trypsin inhibitor; indicate contacts within 6 A, the is the distance map for the crtlculeted in the footnote to p. 983; regions errors larger than 6 A. D is the intermediate shading indicating equivalent matrix for the calculated

row

to rubredoxin.

\

For

each

row

: A is the

I ::.:. :. .

map

for

the

regions white

structure

(p carbons)

parallel roughly

alignment anti-parallel

regions indicate separations the native and calculated regions marked “-I’ indi-

native

indicating roughly regions indicating

A, and the white of errors between than 5 A while

distance

regions indicate contacts between 6 and 12 structure (model (1) in Table 2); C is the map native-calculated errors larger marked “+ ” indicate direction matrix for the native structure, the darkest and the perpendicular alignment of the backbone, structure.

shaded

lower

C

.. ::

;

. : --.: . -I :. :::..: I . : -. . -. . -. __...... . :: ‘...::..: .I .::..: v ::-.. . C

CALCULATION

OF PROTEIN

TERTIARY

991

STRUCTURE

(d)

FIG. 2. Htereoviews of X-ray and calculated structures. native (b); rubredoxin: calculated (c), and native (d). The the nlain loops are readily seen.

Trypsin inhibitor: structures

have

calculated (a), and been

rotated

so that

rubredoxin structure calculated with exactly the same weighting factors and other parameters as those used for trypsin inhibitor had an r.m.s. error of 4.7 A. Changing the refinement parameters led to better rubredoxin structures; the best we have found has an r.m.s. error of 4-O 8. In fact, this last structure is quite close to the one we get using the native conformation as the starting point of the calculation. suggesting that there are relatively few local minima on the error function surface for rubredoxin, and that these minima are located in the vicinity of t,he native conformation. We have explored, in a limited way, the sensitivity of the folding process and the final conformation to variation of our parameters. The folding occurs in two stages. First, there is a general collapse of t,he extended starting conformation into a compact shape that is still expanded by 20 to 30% compared to the actual protein. Secondly, there is a process of chain refinement that reduces the average radius (that is, the

992

I. D. KUNTZ

ET

AL.

average distance of the residues to the center of mass) to within 5% of that observed for the native protein. We should stress that assessing the effects of small changes in the parameters on the calculated structures is not always straightforward. We have some evidence, especially for trypsin inhibitor, that our final structures are “kinetically” determined. We know, for example, that the structures represented in Figures 1 and 2 are local but not global minima, because we find deeper minima quite near the native conformation by using the native conformation as our starting point. Alteration of any of the calculation parameters by even 5 to 10% can sometimes lead to different local minima end-points, having an unpredictable effect on the r.m.s. error. We can make a few tentative generalizations about our model. (1) The initial collapse of the extended conformation is primarily driven by the attraction of hydrophobic residues for each other via term B, and the cysteine pairing term, E, with the former term being the more important: that is, collapse occurs if W, = 0 but not if W, = 0. (2) Several terms act to stop complete collapse; namely, term A and the repulsive elements in B, C and D, with no one term being predominant. The hydrophilic shell is important in preventing a rather efficient segregation of the residueswhereby the polar and ionic residuesare located at one end of the molecule and the hydrophobic residuescluster at the other end. (3) Refinement of the structure in the second stage is almost entirely guided by the nearest-neighbor term, A, becausethe i to i + 1 distances become considerably distorted during the first part of the calculation. Setting WA to zero doesnot prevent the collapse, but it does preclude r.m.s. errors below approx. 7 A. (4) Normally, small variations in weighting factors and other parameters produce small variations in structure, with the usual effect being displacement of “loops” with respect to one another. In most cases,one can vary the weighting factors at least two- to fivefold without producing r.m.s. errors larger than 7 A. As noted above, however, these structures involve different local minima and similarity of r.m.s. errors must not be taken as evidence that such structures are readily interconvertible in a kinetic sense. In sum, our procedure generates families of related conformations that have respectable agreement with the tertiary aspects of two small globular proteins. The calculated secondary structures are not in very good agreement, but this does not seriously compromise the calculated structures.

5. Discussion Our calculation is most closely comparable to that of Levitt $ Warshel (1975). Both methods use a greatly simplified version of the protein, with that of Levitt & Warshel being more realistic. The Levitt $ Warshel potential function includes a unidimensional version of the dipeptide potential surface, a 6,12 potential, and hydrophobic attractions. Our approach has very approximate models for the first two terms and a somewhat more elaborate development of residue-residue interactions. The use of n angular variables by Levitt t Warshel requires a relatively time-consuming calculation of the gradient and a sophisticated optimizer routine. Our choice of Cartesian co-ordinates requires 3%variables, but permits a very simple gradient calculation and minimizer. Both methods require a few hundred cycles of the

CALCULATION

OF

PROTEIN

TERTIARY

STRUCTURE

993

optimizing routine for convergence. The time required for both methods is at least quadratically dependent on the number of amino acids in the peptide chain, since all pairwise interactions of the residues are considered. The local minimum problem is very complex, with neither approach able to guarantee a global minimum (Crippen, 1975a). Levitt & Warshel compute the normal modes of the model protein chain and use a random “t,hermalization” to proceed out of local minima. It is not clear, at t,his point, how efficient this method will be, since some fraction of t,he runs does not. reach compact forms. In our case, refinement of the i to i + 1 distances by changing the weighting factor, W,, provides a means of leaving a local minimum, by varying the shape of the error function surfacet. We observe that the r.m.s. error decreases R,S WA is progressively increased, reaching a limiting value as the error for term A approaches acceptable values. Roth we and Levitt & Warshel use ad hoc assumptions that significantly influence the folding pathway and the final folded states. As described above, we force the S-S and S-Fe bonds to form correctly. In rubredoxin, this constraint clusters t,he four cystines in a rough tetrahedron about the iron. In trypsin inhibitor, the disulfide pairs 5-55, 14-38, 30-51 are required to form, but we do not control the separation or orientation of the three pairs with respect to each other. Each pair of cystine residues moves independently under the influence of the appropriate constraints given earlier. Our approach makes use of information that is generall) availa,ble in advance of crystallization and X-ray studies. While it does not offer iI means of obtaining a kinetic pathway, it appears to be a useful stratagem for efficient generation of satisfactory equilibrium structures. We note in passing that rubredoxin folds so readily that, a fairly good structure (r.m.s. error approx. 6 A) is obtained tlven when the Cys-Fe interaction is not included. The crucia,l ad hoc assumption of Levit,t & Warshel is their use of the Gly potential function for Asn and Asp as a method of generating turns. This assumption favors turns at residues 3, 24, 43-44. and 50; the middle two being essential to proper folding. We are not, aware of a fundamental reason for selecting these two residues from a number of other “turn farmers” (Chou & Fasman. 1974; Lewis et al., 1971; Kuntz, 1972). The Levitt & Warshel procedure is clearly superior to ours for characterizing the kinet’ic folding process. The kinetic pathway scans several regions of the potential surface. a,nd is a much more sensit)ive test of a potential function than is a single structure that,, in essence, only requires information nea,r a local minimum. Our computation contains “non-physical” aspects such as the disulfide forming function, t’hat could only fortuitously produce an approximation to the true folding mechanism. However, it appears that our technique offers some savings in computer time and cost compared with that of Levitt t Warshel. Either approach is far more economical than the convent,ional calculations (Scheraga, 1971) for “predicting” equilibrium strucbtures. It is interesting that two ca,lculations that differ appreciably in detail can haw comparable success. The significant common features of the two procedures seem t,o he (1) the simple chain description that greatly reduces the local minimum problem, and (2) the inclusion of hydrophobic interactions that produces compact, reasonably realistic model structures. tThis approach is similar to the method of successive approximat,ion in difficult optimization tasks that initially uses easy optimization tasks, the solutions to these being good starting points for the more difficult ones (Wilde & Beightler, 1967).

994

I.

D.

KUNTZ

ET

AL.

Finally, we make a few general comments on calculations of this type. First, we find it helpful to think of the r.m.s. error as a rough “resolution” parameter. In these terms, r.m.s. errors of 5 to 7 A correspond approximately to the first preliminary X-ray results on a new protein. The molecular outline is reasonably clear, but “active sites” and precise chain directions are not discernible. We are optimistic that there is much to learn from “low resolution” models, but, unlike X-ray studies, there is no assurance that the model calculations can ever be refined to high accuracy. Secondly, there is a distinct problem in speaking in a precise way of the “quality” of the model structures. We feel keenly the need for a general method of characterizing structures in such a way that realizable conformational transitions can be recognized and families of structures can be delineated. The r.m.s. error from distance maps is not very useful in this regard. A model structure might have co-ordinates very similar to that of the native conformation, yet be convertible to the native one only by passing one segment of chain through another (Crippen, 1974,19756). Such a structure would have a low r.m.s. error but could not be considered a good approximation to the native state. A more realistic, but less convenient comparison of conformations might be the length of the shortest path in conformation space connecting the two conformations, such that the energy along the path remains below some given value. Such a measure of similarity has yet to be precisely defined in a feasible algorithm. Thirdly, one must face the problem of extending these calculations to other proteins. Preliminary studies with lysozyme (129 amino acids) indicate that our present program can accommodate a molecule of this size with t’he computer time required being about two to three minutes. Obviously, larger molecules will provide a severe test of the procedure we have outlined. Financial support from the National Institutes of Health (to I. D. K. and P. A. K.) and the Academic Senate of the University of California (to I. D. K. and G. M. C.) and stimulating discussions with Lynn Ten Eyck and David Hayes are gratefully acknowledged . REFERENCES Burgess, Chou, Crippen, Crippen, Crippen, Crippen, Huber,

A. W.

& Scheraga,

A. (1975). Proc. Xut. Acad. Sci., U.S.A. 72, 1221.--1225. Biochemistry, 13, 211-245. G. M. (1974). J. Theoret. Biol. 45, 327-338. G. M. (197%). J. Comp. Phya. 18, 224-231. G. M. (197%). J. Theoret. Biol. 51, 495-500. G. M. & Kuntz, I. D. (1976). J. Theoret. Biol. In tile press. R., Kukla, D., Ruhlman, A. & Steigemann, W. (1972). Gold Ljpring Harbor Symp. Quant. Biol. 36, 141-151. Kauzmann, W. (1959). Advan. Protein Chem. 14,1-63. Krigbaum, W. R. & Rubin, B. H. (1971). Biochim. Biophys. Acta, 229, 368-383. Kuntz, I. D. (1972). J. Amer. Chem. Sot. 94, 8568-8572. Kuntz, I. D. (1975). J. Amer. Chem. Sot. 97, 4362-4366. Levitt, M. & Warshel, A. (1975). Nature (London), 253, 694~-698. Lewis, P. N., Momany, F. A. & Scheraga, H. A. (1971). I’roc. LVat. dcad. Sci., U.S.A. 65, 2293-2297. Lifson, S. & Warshel, A. (1968). J. Chem. Phys. 49, 5116~5129. Ptitsyn, 0. B. & Rashin, A. A. (1975). Biophys. Chem. 3, I-20. Rossmann, M. G. & Liljas, A. (1974). J. Mol. Biol. 85, 177--181. Scheraga, H. A. (1971). Chem. Rev. 71, 195-217. Watenpaugh, K. D., Sieker, L. C., Herriott, J. R. & Jensen, L. H. (1972). CoZd Spring Harbor Symp. Quant. Biol. 36, 359-367. Wilde, D. J. & Beightler, C. S. (1967). Foulzdotions of Optimization, Prentice-Hall, Englewood Cliffs.

P. Y. & Fasman,

H.

G. D. (1974).

Protein-Protein Interactions Mediated by Helical Tertiary Structure Motifs.

On the formation of protein tertiary structure on a computer.

Rapid search for tertiary fragments reveals protein sequence-structure relationships.

Making the bend: DNA tertiary structure and protein-DNA interactions.

Generalized protein tertiary structure recognition using associative memory Hamiltonians.

Protein tertiary structure recognition using optimized Hamiltonians with local interactions.

Serpin tertiary structure transformation.

H3N2.

Direct Calculation of Protein Fitness Landscapes through Computational Protein Design.

The effects of deoxyribonucleic acid secondary structure on tertiary structure.

Quaternary and tertiary structure of haemerythrin.

Tertiary structure of myohemerythrin at low resolution.

Proposed tertiary structure of the sodium channel.

Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels.

Bayesian weighting of statistical potentials in NMR structure calculation.

Protein backbone engineering as a strategy to advance foldamers toward the frontier of protein-like tertiary structure.

SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information.

Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction.

Large-scale model quality assessment for improving protein tertiary structure prediction.

CoMOGrad and PHOG: From Computer Vision to Fast and Accurate Protein Tertiary Structure Retrieval.

A practical implicit solvent potential for NMR structure calculation.

Accelerated isotope fine structure calculation using pruned transition trees.

Structure calculation, refinement and validation using CcpNmr Analysis.

Calculation of the free energy of association for protein complexes.