156
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS
[9]
but an artificial model of what the receptor surface might look like. The surface is triangulated and dotted with control points that can be "grabbed" interactively and moved to a new position in space. The surface is malleable, so that when one control point is moved, the local surface deforms to take account of the motion: thus bulges or dents can be added to the KARMA surface. This allows the user to change the model of the site interactively. The KARMA surface can also be colored with hydrophobic or electrostatic parameters. QSAR studies on families of derivatives provide detailed information about local regions of the site based on local variations within a family of ligands. KARMA aids the interpretation of these data by providing an interactive way of visualizing the model of the deduced binding site and changing the model in response to new information. Finally, as it is in many ways similar to the surface models derived by direct methods, it can also be used to create a negative sphere image for database searches or other forms of rational drug design. Conclusion There are many methods available for directly determining clefts and binding sites in receptors, of which the negative sphere method appears to give the most reliable results. Indirect methods for site mapping can give valuable qualitative pictures of the binding site, but the models need to be verified experimentally. Acknowledgments R.A.L. wouldlike to thankthe RoyalCommissionof 1851for a ResearchFellowshipand the FulbrightCommissionfor a Senior Scholarship. This work was partially supported by National Institutes of Health Grants GM-31497 and GM-39552 (George Kenyon,Principal Investigator). Elain Mengand Natalie Colloc'h are thankedfor helpfuldiscussions.
[9] M o d e l i n g o f S i d e C h a i n s , L o o p s , a n d I n s e r t i o n s in P r o t e i n s
By
NEENA
L. SUMMERS and
MARTIN KARPLUS
Introduction Detailed knowledge of the three-dimensional structure of a globular protein is essential for an understanding of its biological function. Experimental data at atomic level resolution can be obtained by using X-ray METHODS IN ENZYMOLOGY, VOL. 202
Copyright © 1991by AcademicPress, Inc. All fights of reproduction in any form reserved.
[9]
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
157
diffraction, neutron diffraction, and/or nuclear magnetic resonance (NMR) techniques. Ideally, a well-determined experimental structure would serve as a starting point for the analysis of the function of a protein. However, there is a rapidly increasing number of protein sequences for which structural information is nonexistent or incomplete. Also, protein engineering has as its objective the design of folded structures with predetermined functions and/or physical properties (e.g., enhanced thermal stability or shifted pH optima). For both problems, methods that permit the modeling of protein structures would be very useful. It is known that the three-dimensional structure of a protein is determined by its amino acid sequence.l However, no methodology exists for ab initio structure prediction from the sequence alone. In this chapter, we concern ourselves with a more limited problem, that of homology modeling. Homology modeling is the most hopeful approach for predicting the three-dimensional structure of a protein. 2 It is founded on two observations. First, proteins belong to a limited number of structural families) Second, proteins within the same family have very similar three-dimensional structures. 4-6 Based on these observations, homology modeling uses known (template) structures of one or more structure-function family members to predict the (target) structure of yet another family member whose sequence is known. 7-11 In homology modeling, most of the target backbone can be adapted from the template molecule because of the similarity of the global fold within a given structure-function family. Two exceptions to backbone transferability 12are encountered. Backbone conformations differ near residue insertions and deletions because similar distances must be spanned by fragments of different lengths. Also, local fragments containing non-Pro --~ Pro mutations must sometimes be reconstructed because the structural rigidity and steric requirements of the pyrrolidine ring of the Pro residue i G. E. Schulz and R. H. Schirmer, "Principles of Protein Structure." Springer-Verlag, New York, 1979. 2 D. Blow, Nature (London) 304, 213 (1983). 3 M. O. Dayhoff, "Atlas of Protein Sequence and Structure," Vol. 5. National Biomedical Research Foundations, Washington, D.C., 1972. 4 A. M. Lesk and C. Chothia, J. Mol. Biol. 136, 225 (1980). 5 C. Chothia and A. M. Lesk, J. Mol. Biol. 182, 151 (1985). 6 C. Chothia and A. M. Lesk, Trends Biochem. Sci. 190, 116 (1985). 7 N. L. Summers and M. Karplus, J. Mol. Biol. 210, 785 (1989). s N. L. Summers, Chem. Des. Autom. News 1 (December, 1989). 9 j. Greer, J. Mol. Biol. 153, 1027 (1981). l0 M. J. Sutcliffe, I. Haneef, D. Carney, and T. L. Blundell, Protein Eng. 1, 377 (1987). 11 M. J. Sutcliffe, F. R. R. Hayes, and T. L. Blundell, Protein Eng. 1, 385 (1987). 12 N. L. Summers and M. Karplus, J. Mol. Biol. 216, 991 (1990).
158
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS
19]
limit the number of energetically accessible conformations of both the Pro and the residue immediately preceding it. Side-chain positions must also be determined accurately to complete the structure for which a backbone model has been developed. Further, side chains are involved in alterations of secondary structure packing, as well as in the positioning of the inserted/deleted regions. Finally, sidechain positions as well as main-chain positions are important in assuring that the potential energy surface of the protein in chemically or structurally interesting regions of the molecule is complete and accurate. A knowledge of the spatial arrangement of residues other than those participating directly in binding may be required to examine and model ligand-protein interactions. Two types of approaches are being used to develop methodologies for structural prediction: energy function based conformational searches and rules derived from the analysis of structural data. Each approach has advantages and limitations. In principle, complete conformational searches locate all possible conformations. However, the amount of cpu (central processing unit) time and/or disk space required to complete a comprehensive search successfully may be prohibitive. Moreover, even if all the conformations are found, determining which one is most stable is very difficult. The results depend on the accuracy of the force field and, in many cases, on the proper treatment of solvent. Statistical trends obtained from databases are indicative of conformation selection in evolution and/or inherent properties of amino acid sequences. Once identified, trends can be used to generate approximate structures. This approach drastically reduces the extent of conformational space which must be examined in order to generate protein models. The cpu and disk storage requirements also are significantly reduced compared with most conformational search methods. However, no statistically based conformational selection is universally applicable. Any structural or functional requirement that is unique, infrequently observed, or not represented proportionately in the database from which the rules are derived cannot be obtained by a statistical analysis. This chapter proposes a hybrid approach to structural modeling which combines the detection, extension, and application of statistical trends in known protein structures with the identification of energetic fingerprints which characterize well-folded, globular proteins. 8 All methods based on the application of statistical trends incur exceptions. However, the exceptions are also shown to follow readily identifiable statistical trends. To find and use the trends and fingerprints, side-chain and main-chain modeling can be considered to be loosely coupled problems that can be solved in an iterative fashion. We describe side-chain modeling methods
[9]
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
159
before the main-chain techniques, even though the reverse order would be employed in most applications. An important aspect of the present approach is that it not only provides a prescription for homology modeling but also makes it possible to test the results and determine where they are most likely to be in error. The methods described here are expected to be useful not only for homology modeling but also for completing structures for which experimental data are incomplete (e.g., some NMR structures), as well as for refinement and testing of the results of more general modeling efforts. Side-Chain Modeling
General Background and Modeling Applications Three situations commonly arise in which side-chain conformation predictions are of use. First, there is the prediction of the conformational changes induced by single, double, or multiple point mutations. This may be needed if the structures are not available to facilitate the interpretation of experimental data or as a prelude to introducing mutations to alter protein function or specificity [e.g., to induce enhanced thermal stability by incorporating additional (nonnative) disulfide bridges13]. Second, the conformations of side chains forming a contiguous segment are of interest. Such predictions might be used to interpret specificity differences (e.g., the antigen binding loops ofimmunoglobulinsl4'lS), to assist in the interpretation of NMR 16or X-ray crystallographic 17'18data, or to augment experimental data in regions where the data are poor or nonexistent (e.g., to partially define incompletely characterized portions of NMR structures or to assign preliminary conformations to segments of X-ray structures in regions of poor electron densitylS). A similar application is to endow a protein with a new, bioengineered function (e.g., to initiate calcium uptake through the addition of a calcium-binding loop). Third, in order to generate a model for an entire homologous protein it is necessary to examine all of the side chains, even if many are identical in the template and target systems. In practical terms, the first two applications are special cases of the third. 13 j. A. Wells and D. P. Powers, J. Biol. Chem. ?.61, 6564 (1986). 14 C. Chothia, A. M. Lesk, M. Levitt, A. G. Amit, R. A. Mariuzza, S. E. V. Phillips, and R. J. Poljak, Science 233, 755 (1986). 15 C. Chothia and A. M. Lesk, J. Mol. Biol. 196, 901 (1987). 16 p. j. Kraulis and T. A. Jones, Proteins 2, 188 (1987). 17 T. A. Jones and S. Thirup, EMBO J. 5, 819 (1986). 18 N. L. Summers, unpublished work (1989, 1990).
160
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS
[9]
A number of approaches to side-chain modeling have been tried with varying degrees of success. It has generally been assumed that identical residues at structurally equivalent positions in homologous proteins adopt similar conformations. However, the procedures used to place the side chains, particularly those that are not identical, have varied considerably. If the sequentially paired residues are conservative analogs (i.e., substitutions with similar shapes and chemical properties such as Ile and Val or Asn and Asp), similar conformations are often chosen. 19-24 Alternatively, the Xl and X2 dihedral angles of the homologous residue have been u s e d , 9'25-31 or the root mean square (rms) deviation of the side-chain atoms with respect to the homolog have been minimized. 10,11In either case, it is unclear what starting conformation should be selected for branched side chains, particularly when the residues in question are not conservative analogs. When insufficient data are available, either because the side chains to be constructed are longer than their counterpart in the homologous protein or are structurally or chemically dissimilar, they have been positioned arbitrarily in the fully extended conformation, 24 omitted, 28,3° positioned so as to optimize tertiary c o n t a c t s , 21'22 taken to be equivalent to conformations observed for small model c o m p o u n d s , 31 or chosen from the calculated minima of the appropriate dipeptide potential energy surface. 25-27,32
Several methods do not distinguish between identical and mutated I9 W. J. Browne, A. C. T. North, D. C. Phillips, K. Brew, T. C. Vanaman, and R. L. Hill, J. Mol. Biol. 42, 65 (1969). 20 S. Bedarkar, W. G. Turnell, and T. L. Blundell, Nature (London) 270, 449 (1977). 21 T. L. Blundell, S. Bedarkar, E. Rinderknecht, and R. E. Humbel, Proc. Natl. Acad. Sci. U.S.A. 75, 180 (1978). 22 T. Blundell, B. L. Sibanda, and L. Pearl, Nature (London) 304, 273 (1983). 23 R. J. Feldmann, D. H. Bing, B. C. Furie, and B. Furie, Proc. Natl. Acad. Sci. U.S.A. 75, 5409 (1978). 24 R. J. Feldmann, D. H. Bing, M. Potter, C. Mainhart, B. Furie, B. C. Furie, and L. H. Caporale, Ann. N.Y. Acad. Sci. 439, 12 (1985). 25 p. K. Warme, F. A. Momany, S. V. Rumball, R. W. Tuttle, and H. A. Scheraga, Biochemistry 13, 768 (1974). 26 G. F. Endres, M. K. Swenson, and H. A. Scheraga, Arch. Biochem. Biophys. 168, 180 (1975). 27 M. K. Swenson, A. W. Burgess, and H. A. Scheraga, "Frontiers in Physicochemical Biology." Academic Press, London, 1978. 28 j. Greer, Proc. Natl. Acad. Sci. U.S.A. 77, 3393 (1980). 29 j. Greer, J. Mol. Biol. 153, 1042 (1981). 30 j. Greet, Ann. N. Y. Acad. Sci. 439, 44 (1985). 31 W. Carlson, M. Karplus, and E. Haber, Hypertension 7, 13 (1985). 32 K. A. Palmer, H. A. Scheraga, J. F. Riordan, and B. L. Vallee, Proc. Natl. Acad. Sci. U.S.A. 83, 1965 (1986).
[9]
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
161
residue substitutions. The Ponder and Richards rotamer library approach 33 collects and then applies the statistical distributions of side-chain conformations observed in known protein structures. Molecular dynamics has been used to anneal side-chain atoms into position one degree of freedom per side chain at a time, 34 and Monte Carlo searches have been employed to manipulate all side-chain atoms simultaneously. 35 The algorithmic simplicity of such approaches, even if they are computationally intensive, is very seductive. For most of these approaches, tests remain to be done to determine how robust the methodology is with respect to uncertainty in the backbone structure. In what follows, we describe first the statistical trends observed for the amino acids in homologous proteins and then show how to use these results and empirical energy calculations to develop a scheme for side-chain modeling. Statistical Trends
It is possible to derive reliable side-chain conformation selection criteria for most residues by studying pairs of side chains in topologically equivalent positions in homologous proteins 7,9,11,36;by topologically equivalent, we mean residues in homologous proteins which occupy equivalent spatial positions with respect to the global fold. To establish meaningful trends, comparisons with respect to well-refined, high-resolution X-ray crystallographic structures must be made. Equally important, the database should contain proteins representative of many situations in which modeling might be attempted if identifiable trends are to be generally applicable. The 10 proteins chosen for examination 37 are shown in Table I. They are treated as 7 pairs of homologous proteins (Table II). To obtain clear and predictable patterns of side-chain conformational transferability between homologous proteins, the database is analyzed in the following manner. First, the dihedral angles which define side-chain conformations are compared rather than considering side-chain atom rms deviations. The atomic motions that would be characterized by normal modes of vibration and packing flexibilities are described in terms of variations in the low frequency, dihedral degrees of freedom. Second, dihedral angles which differ by 40 ° or less are assumed to correspond to the same minimum energy conformation. A three-fold potential energy surface for a Val X~rotation is shown in Fig. la. Such a surface is character33 j. W. Ponder and F. M. Richards, J. Mol. Biol. 193, 775 (1987). 34 p. E. Correa, Proteins 7, 366 (1990). 35 C. Lee, S. Subbiah, and M. Levitt, unpublished (1990). 36 A. M. Lesk and C. Chothia, Philos. Trans. R. Soc. London, Ser. A 317, 345 (1986). 37 N. L. Summers, W. D. Carlson, and M. Karplus, J. Mol. Biol. 196, 175 (1987).
162
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS
[9]
TABLE I X-RAY AND SIZE DATA FOR SELECTED PROTEINS Protein
ID a
No. of residues
Resolution (~)
R factor
Penicillopepsin b Endothiapepsin c Rhizopuspepsin d Trypsin e T-Chymotrypsin f Myoglobin g Leghemoglobinh Erythrocruorin i Lysozyme Hen egg white j Human k
PEN END RHI TTRP GCHY MYO LEG ERY
323 330 325 223 241 153 153 136
1.8 2.1 1.8 1.7 1.9 2.0 2.0 1.4
0.14 0.16 0.14 0.20 0.18 0.24 0.18 0.19
HEN HUM
129 130
1.6 1.5
0.22 0.19
a Abbreviations are also used in the text. b M. N. G. James and A. R. Sielecki, J. Mol. Biol. 163, 299 (1983). c L. Pearl and T. Blundell, FEBS Lett. 174, 96 (1984). d K. Suguna, R. R. Bott, E. A. Padlan, E. Subramanian, S. Sheriff, G. H. Cohen, and D. R. Davies, J. Mol. Biol. 196, 877 (1987). e Brookhaven Protein Data Bank, PDB-ID 3PTN. f G . H. Cohen, E. W. Silverton, and D. R. Davies, J. Mol. Biol. 148, 449 (1981). g T. Takano, J. Mol. Biol. 110, 537 (1977). h E. H. Arutyunyan, J. Deisenhofer, A. V. Teplakov, I, P. Kuranova, H. B. Obmolova, and B. K. Vainshtein, Dokl. Akad. Nauk SSSR 270, 732 (1983). i W. Steigemann and E. Weber, J. Mol. Biol. 127, 309 (1979). J H. H. G. Handoll, unpublished results (1984). k p. j. Artymiuk and C. C. F. Blake, J. Mol. Biol. 152, 737 (1981).
ized by three maxima near Xl = ( - 120°, 0°, and 120°) and three minima near XI = ( - 180°, - 60°, and 60°). Each local minimum and maximum can be displaced by 20 ° or 30 °, and the relative energies of the maxima and minima can be strongly influenced by local packing interactions specific to the region of the protein under study (Fig. la,b); however, the basic three-fold nature of the potential is retained in most cases. Clearly, if a side chain varies by 40 ° about any minimum its local minimum energy conformation remains the same. In addition, crystallographic uncertainty in side-chain atom positions leads to a variability in dihedral angles of at least ---10° even in the most well-defined regions. Therefore, if topologically equivalent dihedral angles differ by 40 ° or less they are considered to describe equivalent conformations. The comparison criteria used for y atoms are as follows: (I) For topologically equivalent residue pairs, identical or mutated, in which both residues have a single 31 heavy atom (pairs comprised of Arg, Asn, Asp,
[9]
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
163
TABLE II PAIRED PROTEIN SUPERPOSITIONDATA
(Proteinl ,Protein2)
Root mean square deviation (/~,)~
(HEN,HUM) (PEN,END) (TTRP,GCHY) (PEN,RHI) (MYO,ERY) (MYO,LEG) (ERY,LEG)
0.63 1.62 1.20 1.83 1.70 2.90 3.12
Sequence homology (%)b,c 60.5 56.4 44.1 39.3 22.1 17.7 16.2
(78) (182) (99) (127) (30) (27) (22)
Structure class a +/3 /3 /3 /3 a a c~
Global backbone rms deviations are given. b All percentages are based on the number of residues in the smaller protein of each pair: PEN, 323 residues; END, 330 residues; RHI, 325 residues; TTRP, 223 residues; GCHY, 241 residues; MYO, 153 residues; ERY, 136 residues; LEG, 153 residues; HEN, 129 residues; and HUM, 130 residues. c The number of sequential residue pairs are given in parentheses for each protein pair. Gly-, Ala-, and Pro-containing residue pairs are included.
Cys, Gin, Glu, His, Met, Phe, Ser, Trp, and Tyr), the X~ dihedral angles are compared. (2) For pairs in which both residues are branched at 3"(pairs comprised of lie, Thr, and Val), the approximate superposition of both atoms is required. This is equivalent to comparing X1 dihedral angles of identical pairs and of (Ile,Thr) pairs. In (Ile,Val) pairs, the Ile N-C~-C~-Cv2 and Val N-C~-C~-CVI dihedral angles are compared38; (Thr,Val) pairs are treated analogously. (3) For pairs in which one residue of the pair is branched at the 3' position (i.e., Ile, Thr, or Val) and the other is not, the dihedral angles of the atoms most similar in chemical properties are compared (e.g., Thr C~2 and Gin CV). If the two atoms of the branched residue are chemically equivalent (e.g., Val C~T and Cv0 comparisons are made with respect to either 3' atom. The 8 atoms and the associated X2 angles are treated analogously. The comparison of higher dihedrals, X3, X4, and ×5, has not been attempted as crystallographic uncertainty tends to increase significantly for side-chain heavy atom positions defined by these degrees of freedom. The comparison results 37 are summarized in Table III and Figs. 2-5. For the highly homologous proteins [i.e., those with sequence homologies of 39% or better, i.e., (HEN,HEN), (PEN,END), (TTRP,GCHY), and (PEN,RHI)], 80% or more of the identical residue pairs have matching 3, 38 j. C. Kendrew, Biochemistry 9, 3471 (1970).
164
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS
[9]
a
I
40
0
I
30
20
Z o
1
10
CJ
0
-lO -180
I
I
I
I
I
-120
-60
0
60
120
I
I
180
X I (degrees)
40
_ o
I
I
I
t
30
0 ,M
20 r.~
z o
10
-10 -180
]
I
I
l
I
-120
-60
0
60
120
X1
180
(degrees)
FIG. I. Two three-fold torsional potentials showing differences characteristic of changes in the local environment. Pepsinogen Val-197 (a) and Val-208 (b).
[9]
165
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS TABLE III •y- AND 8-ATOM MATCHING RESULTSa y matches (%) Protein
All
P,B b
(HEN,HUM) (PEN,END) (TTRP,GCHY) (PEN,RHI) (MYO,ERY) (MYO,LEG) (ERY,LEG) All proteins
90.5 81.6 81.7 91.2 61.5 76.2 64.7 82.8
(HEN,HUM) (PEN,END) (TTRP,GCHY) (PEN,RHI) (MYO,ERY) (MYO,LEG) (ERY,LEG) All proteins
80.6 50.0 60.6 59.2 55.6 64.5 60.3 59.6
No. of pairs All
P,B
8 matches (%)
No. of pairs
All
P,B
All
P,B
Identical residue pairs 97.0 63 33 92.4 125 79 82.0 71 50 92.1 91 63 68.8 26 16 76.9 21 13 61.5 17 13 87.3 414 267
89.2 86.7 78.1 90.6 78.6 53.8 87.5 84.3
94.4 91.7 76.2 90.7 100.0 62.5 100.0 88.9
37 60 32 53 14 13 8 217
18 48 21 43 10 8 5 153
Mutated residue pairs 88.6 31 9 71.4 84 28 76.0 81 25 68.1 115 47 75.0 63 24 71.4 76 21 60.0 68 20 71.2 518 174
64.7 84.2 75.0 71.9 72.7 55.6 66.7 69.7
100.0 80.0 100.0 79.0 70.0 75.0 100.0 82.5
17 19 20 32 22 27 18 155
4 10 7 19 10 8 5 63
From N. L. Summers, W. D. Carlson, and M. Karplus, J. Mol. Biol. 196, 175 (1987). b B denotes buried side chains; P denotes partly buried side chains.
values. Of the y matching residues, 78% or more have matching 8 values. In general, side chains which are not highly accessible 39-42show somewhat better agreement than accessible residues. For the proteins with low sequence homology and large global rms backbone deviations [(MYO,ERY), (MYO,LEG), and (ERY,LEG)], the matching probability decreases; 62 to 39 Accessibilities are calculated using a modified Lee and Richards algorithm40,41 and CHARMM radii. 42Accessibilities are classified as buried (B), partly buried (P), or accessible (A) according to side-chain fractional accessibilities and absolute atom accessibilities. The maximum accessibility was calculated for each residue, R, from the tripeptide Gly-R-GIy in the protein conformation. Unless otherwise noted, a 1.4-/~ probe was used. Buried side chains have less than 35% total exposure, and all single atom accessibilities decrease as the probe size is increased from 1.4 to 2.8 A. This behavior is found for side chains that lie in crevices on the surface and whose conformations should be more strongly influenced by the protein than by the solvent. Accessible (A) side chains have exposures of at least 35% or have atoms whose accessibility does not decrease with probe size. 40 B. Lee and F. Richards, J. Biol. Chem. 55, 379 (1971). 41 M. Handschumacher, unpublished results (1984). 42 B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus, J. Comput. Chem. 4, 187 (1983).
166
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS
[9l
76% of y atoms and 54 to 88% of the 8 atoms match. The probability of achieving a 8 atom match is considerably better for buried/partly buried residues (i.e., 80 versus 62% for all residues). Since X2for aromatic residues tends to be -+90°, the 40 ° cutoff for matching employed here implies that most aromatic residues should match. However, this does not appear to bias the results. All of the remaining residue pairs are nonidentical and were classified as mutated. They included pairs with disparate shapes or chemical properties. The y values for mutated residues in highly homologous proteins still correspond in most cases. The matching probabilities range from 50 to 80%. For the residues with matched y dihedrals, the 8-dihedral agreement ranges from 65 to 85%. In striking contrast to the identical residue pairs, a significant improvement in the y matching propensity of buried/partly buried, relative to accessible residue pairs, was observed; the matching probabilities of buried/partly buried residues is 10 to 20% greater than for accessible residue pairs. With the exception of the (PEN,END) protein pair, a similar improvement for buffed/partly buried residues is found for the 8 atoms. The matching probabilities found for the low homology globins follow the same trend, with good transferability maintained for many of the structurally or functionally important residues even when the relative displacement of helices between homologous proteins is large. The side-chain conformations in loop regions tend not to match so well, in accord with their larger differences in main-chain conformations (C" deviations are in the 3-8 ,~ range). 43 The correspondence of side-chain conformations is actually better than these matching probabilities imply. For all pairs of homologous proteins studied, the y-dihedral and 8-dihedral angle deviations have approximately Gaussian distributions centered about lAy1 = 0° and IASI = 0 °, respectively (Figs. 2-5). Thus, at least 80% of the conformational matches (92% of the identical y-dihedral matches) are characterized by deviations of 20° or less. The distribution of identical residue deviations tends to be broader for highly accessible matches and for the 8 atoms of nonaromatic residues. The distributions for mutated residues tend to be broader than for their identical counterparts, and the accessible residues tend to be nearly uniformly distributed. This uniformity, along with the lower matching probabilities of accessible, mutated residues, probably reflects the importance of environmental effects on the conformational selection of such highly exposed side chains. Why similar behavior is not observed for identical pairs is unclear. Residue mutations that result in the retention of similar volume and hydrophobicity profiles ("conservative" analogs) are ex43 N. L. Summers, unpublished results (1985).
[9]
167
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS a
g
d
60
20
50
16
6O 5O 4O
40 12
3O 8 20
20 4
'°
0
10
'
20
'
30
' "'-
10 0
0
10
40
zxX1
20
30 AX 1
40
10
2O
30
40
e
b
20
70 6O
16
50
12
40 3O
8
20
10
20
30
10
40
z~X1
20
3O
4O
zxX1
f
C 6O
2O
16
4O 12
20 10 0
4
lo
20
3o AXl
40
~o
, []~, 20
I'PI , ao
40
~'N
FIG. 2. Distribution of 7-atom dihedral angle deviations for pairs of identical residues. (a) (PEN,END), (b) (PEN,RHI), (c) (TTRP,GCHY), (d) (MYO,ERY), (e) (MYO, LEG), (f) (ERY,LEG), and (g) (HEN,HUM). Open bars represent all residue pairs, horizontally hatched bars represent buried/partly buried residue pairs, and diagonally hatched bars represent accessible residue pairs. Pairs with mixed accessibility lie above the horizontal division in the diagonally hatched bars. [From N. L. Summers, W. D. Carson, and M. Karplus, J. Mot. Biol. 196, 175 (1987).]
168
PROTEINS AND PEPTIDES" PRINCIPLES AND METHODS a
30
d
30 25
25
20
2O
20
15
15
15
I1 10
20
30
o
0l,.-
40
10
20
z~ X1
30
40
~X1
b
40
g
30
25
0
[9]
~o.
I" 10
20
30
n 40
A X1
e
35 30 25
15 10
o
10
20
30
nE
C
30
10
40
25
20
20
15
15
10
10
5
5
20
30
40
40
f
30
25
10
20 30 AX1
10
20
30
40
&XI
FIG. 3. Distribution of y-atom dihedral angle deviations for pairs of mutated residues. (a) (PEN,END), (b) (PEN,RHI), (c) (TTRP,GCHY), (d) (MYO,ERY), (e) (MYO,LEG), (f) (ERY,LEG), and (g) (HEN,HUM). Open bars represent all residue pairs, horizontally hatched bars represent buried/partly buried residue pairs, solid bars represent buried/partly buried residue pairs lacking pseudohomology, and diagonally hatched bars represent accessible residue pairs. Pairs with mixed accessibility lie above the horizontal division in the diagonally hatched bars. [From N. L. Summers, W. D. Carlson, and M. Karplus, J. Mol. Biol. 196, 175 (1987).]
[9] 30
169
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS a
-
d
10
25 t
g
30
8
25
20
.~0 6
11
0
I0
~ 10
20
30
40
10
20
~
e
10
20
,',X2
0 .30
40
z~ X2 10
30 25
1 10
20
30
I"1 40
AX 2
8
20 6 15 10 5 0 10
20
30
40
z~X2 c
30
30
40
,,,X2 f
lO
25 20 15
5
2
o
o 10
20 AX 2
30
40
r~ 10
2o
30
40
AX 2
FIG. 4. Distribution of 8-atom dihedral angle deviations for pairs of identical residues. (a) (PEN,END), (b) (PEN,RHI), (c) (TTRP,GCHY), (d) (MYO,ERY), (e) (MYO,LEG), (f) (ERY,LEG), and (g) (HEN,HUM). Open bars represent all residue pairs, horizontally hatched bars represent buried/partly buried residue pairs, solid bars represent buried/partly buried, nonaromatic residue pairs (i.e., Phe, Tyr, and Trp data are excluded), and diagonally hatched bars represent accessible residue pairs. [From N. L. Summers, W. D. Carlson, and M. Karplus, J. Mol. Biol. 196, 175 (1987).]
170
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS
10
6
10
10
8
8
6
6
[9]
4
2
0
10
20
30
40
10
20
zxX2
b
10
30
40
zx×2
10
20
30
40
X2
e
10
8
6
10
20
30
10
40
',x2
30
40
zxX2
c
10
20
f
10
4 2 o
10
20
30
zXX2
40
10
20
30
40
Ax2
FIG. 5. Distribution of 8-atom dihedral angle deviations for pairs of mutated residues. (a) (PEN,END), (b) (PEN,RHI), (c) (TTRP,GCHY), (d) (MYO,ERY), (e) (MYO,LEG), (f) (ERY,LEG), and (g) (HEN,HUM). Open bars represent all residue pairs, horizontally hatched bars represent not fully accessible residue pairs, solid bars represent buried/partly buried residue pairs lacking pseudohomology, and diagonally hatched bars represent accessible residue pairs. [From N. L. Summers, W. D. Carlson, and M. Karplus, J. Mol. Biol. 196, 175 (1987).]
[9]
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
171
pected to create only small structural perturbations. Hence, conserved substitutions are often treated analogously to identical substitutions. However, when the conservative analogs are removed from the buried/partly buried distributions, those distributions are not altered. With respect to the comparison criteria employed here, the behavior of the mutated residues which are not conservative analogs is not appreciably different from that exhibited by all buried/partly buried, mutated residues. Clearly, a high degree of positional matching can be expected for the y and 8 atoms of side chains in homologous proteins whether residue pairs are identical, conservative analogs, or nonconservative mutations. (In each case, a 45% matching probability is expected for random side-chain orientations when the 40 ° criterion is used.) In fact, the matching probabilities and difference distributions are very similar for all residues. The only exceptions are the highly accessible, mutated residue pairs, which have uniform distributions as mentioned above. This is an important result because it implies that, to a first approximation, most residue substitutions are made in essentially the same way. Similar atom positions tend to be maintained even when side-chain connectivities differ. However, because of the significant number of exceptions, it is important to be able to anticipate and, if possible, to predict the side chains whose conformations are not expected to match in homologous proteins. In virtually all cases, conformational deviations in homologous residues can be sorted into six distinct categories) 7 We emphasize these here because such an analysis has not been reported previously. 1. Residues which participate in specific interactions, for example, ion-pair formation or chelation, in one but not both homologous proteins are more likely to be observed in different conformations. For example, TTRP 44 contains a calcium-binding loop; its counterpart in GCHY 45 does not bind calcium. A number of side-chain conformation mismatches are located in this region. 2. Side-chain conformational transferability is poor for lie to aromatic and nonidentical, Met-containing mappings. In a helices Ile residues favor the Xl = g+ conformation owing to steric restrictions. 37,46 No similar restraint is placed on aromatic residues. Ile C ~ atoms are apparently oriented randomly with respect to their aromatic counterparts in/3 structures. For Met-containing pairs, a preferred orientation of the Met S 8 lone pairs toward the ring plane of aromatic groups has been demonstrated. 37,47 44 Brookhaven Protein Data Bank, PDB-ID 3PTN. 45 G. H. Cohen, E. W. Silverton, and D. R. Davies, J. Mol. Biol. 148, 449 (1981). 46 j. Janin, S. Wodak, M. Levitt, and B. Maigret, J~ Mol. Biol. 125, 357 (1978). 47 K. S. C. Reid, P. F. Lindley, and J. M. Thornton, FEBS Lett. 190, 209 (1985).
172
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS
[9]
Several rationalizations have been offered. It has been suggested that this orientation arises from electrostatic interactions between negatively charged Met S 8 atoms and positively charged, aromatic hydrogen atoms.47 Also, S-~r interactions 48'49and multipole moment 37interactions have been suggested. In any case, the overriding factor in conformational selection may be the maintenance of this highly specific interaction, which is not present for other residues that replace Met. 3. Mismatches are most likely to occur for residue pairs which have at least one highly accessible member. 4. Atomic B factors are proportional to the square of the positional uncertainty: ° A strong correlation between conformational mismatches and large B factors is found. For example, in (PEN,END) side chains with B factors greater than about 2.1Bave(PEN) (Bave = 14.5 ~2,50a 2.1Baye -30 A 2) often mismatch. This is not unexpected since a B factor of this size is approximately equivalent to a rotation from one well of a three-fold potential to another. However, because the overall magnitude of B factors is sensitive to the restraints applied during minimization, the optimal B factor cutoff will vary for different crystal structures. 5. Multiple minima are observed on the rigid rotation potential energy surfaces (RR surfaces 37'42'51-57) of a number of mismatched side chains, especially Val, Ser, Thr, Ile, and Leu, when such surfaces are calculated in the presence of the protein. Typically, minima of equivalent energy and equivalent extent are present on one or both surfaces. The side-chain conformation observed in one protein of the homologous pair may then correspond to one minimum and the observed conformation of the other
R. S. Morgan, C. E. Tatsch, R. H. Gushard, J. M. McAdon, and P. K. Warme, Int. J. Pept. Protein Res. 11, 209 (1978). 49 R. S. Morgan and J. M. McAdon, Int. J. Pept. Protein Res. 15, 177 (1980). s0 j. D. Dunitz, "X-Ray Analysis and the Structure of Organic Molecules." Cornell Univ. Press, Ithaca, New York, 1979. Brookhaven Protein Data Bank, PDB-ID 2APP. 51 RR surfaces are obtained by calculating the energy of all possible side-chain orientations with respect to the potential energy field generated by the rest of the protein. Typically, 10° dihedral angle increments are used. In all cases, the program CHARMM 42 and parameter sets 42"52-54detailed elsewhere were employed. A complete description of the RR surface requires the inclusion of polar hydrogen atoms in addition to the heavy atoms. Polar hydrogens were placed by the method of Brunger and Karplus.55 52 D. J. States, Ph.D. Thesis, Harvard University, Cambridge, Massachusetts (1984). 53 W. E. Reiher, Ph.D. Thesis, Harvard University, Cambridge, Massachusetts (1985). R. E. Bruccoleri and M. Karplus, J. Comput. Chem. 7, 165 (1986). 55 A. T. Brunger and M. Karplus, Proteins 4, 148 (1988). B. R. Gelin and M. Karplus, Proc. Natl. Acad. Sci. U.S.A. 72, 2002 (1975). 57 B. R. Gelin and M. Karplus, Biochemistry 18, 1256 (1979).
[9]
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
173
to the second minimum (Fig. 6a). Inclusion of environmental effects, bound solvent, or neighboring protein molecules may be necessary to distinguish the experimental minimum. It is also possible that at least some multiple minimum surfaces actually describe multiple side-chain conformations. It has been elegantly shown by Smith et al. 58that a number of side chains do assume two or more distinct conformations in protein crystals. As the resolution and refinement quality of structures improve, the relative number of multiple conformations tends to increase. 6. For mismatched residues with significant solvent exposure, the detection of very broad minima characterized by dihedral values which extend over 60 ° to 100° is likely (Fig. 6b). The broad well contains the sidechain conformations of each member of the residue pair, frequently with no apparent transition barrier. These conformations have a tendency to converge after brief minimization in the absence of solvent or neighboring protein molecules, again suggesting that either the inclusion of solvent is necessary to correctly identify observed conformations or there is a significant conformational flexibility in the side chain. 37'57 7. Overall agreement is poorer for residue pairs containing the longer, charged side chains, Lys and Arg. Two factors contribute to this behavior. First, multiple conformations may be present. Side chains are anchored at the C t~ position by the backbone directionality, and charged ends are frequently anchored by electrostatically stabilizing interactions. However, the intervening dihedral degrees of freedom, four for Lys and five for Arg, need not be restricted to a single conformation to satisfy the unique spatial requirements of the side-chain end points. Second, owing in part to this inherent flexibility, the longer side chains are often not as well refined as other parts of a protein structure. Hence, the probability of conformational disagreement is enhanced by a larger experimental error in side-chain position. Energetic Fingerprints
The generation and analysis of a large number of side-chain rigid rotation (RR) surfaces 51 led to the characterization of several interesting and useful properties which can serve as energetic fingerprints. 7 These consist of characteristic well depths and well shapes. Because correct high-resolution protein structures tend to conform to the fingerprints, their violations can be useful for indicating regions which are poorly modeled. The specific results discussed in the following paragraphs were calculated using the CHARMM force field42 with a polar hydrogen parameter set detailed 58 j. L. Smith, W. A. Hendrickson, R. B. Honzatko, and S. Sheriff, Biochemistry 25, 5018 (1986).
174
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS
[9]
8 180
120
-
60
k.\X ( ,,-, ~, I ~ "?f'A',?. , , I I / "1
-,~o[ ),))"//,--\-~.v.}I',',',NN/ I -180
-120
-60
0
80
120
'
'
180
Xl ( d e g r e e s )
IS°l ~ ~
..
2.~'~
'
-
-1,o -180 -180
I
I
l
IA
~/l
I ~. /, -120
-80
-(
~ 0
80
120
180
X1 ( d e g r e e s ) FIG. 6. Side-chain rigid rotation potential energy surfaces. (a) Multiple, equivalent m i n i m a are f o u n d for G C H Y Ser-190 (Xl = 62 °) [contour levels: (--) - l, 1, 3 kcal/mol; ( - - ) 5, 7, 9 kcal/mol; ( - - ) 11, 13, 15 kcal/mol]. (b) A broad m i n i m u m is f o u n d for p e p s i n o g e n Asn-142 {~contour levels: ( ~ ) - I 1, - 9, - 7 kcal/mol; ( - - ) - 5, - 3, - I kcal/mol; ( - - ) 1,3, 5 kcal/mol].
[9]
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
175
elsewhere. 53However, the fingerprints themselves are quite general. Similar results should be obtained with any other well-characterized, carefully executed force field-parameter set combination, although the specific energies involved in the fingerprints would have to be recalibrated. In well-refined X-ray structures of proteins, the RR surface associated with a given residue type has a unique and characteristic surface shape. Contours drawn at 5, 10, and 25 kcal/mol above the global minimum are provided for typical Ile, Leu, Asp, Asn, Thr, Phe, and Tyr RR surfaces in Fig. 7. A Ser surface is shown in Fig. 6a; Val surfaces are shown in Fig. I. The highest energy contour can be considered to define the shape of the van der Waals (VDW) envelope surrounding the side chain. Although the packing environment dictates which conformation is actually observed for a specific side chain in a given protein, the basic shape of the high-energy contours about the observed conformation are preserved. This implies that the packing environment for every residue of a given type is qualitatively the same regardless of local connectivities and the amino acids in close proximity. In addition, most side chains are found within 10° to 15° of the global minimum on their RR surfaces,7,56,57 and each dihedral angle has at least 30° to 40 ° of flexibility within 5 kcal/mol of energy above the minimum on an RR surface. Although 5 kcal/mol may at first appear to be an unrealistically large energy to consider, it is in the range that can be obtained by allowing bond lengths and bond angles to relax to the values optimal for the local environment. Such relaxation is, by definition, not allowed in the evaluation of an RR surface. Use of RR surfaces with such energy criteria allows relatively inexpensive and qualitatively complete characterization of the local environment. Although the correspondence of observed side-chain conformations in globular proteins to RR surface minima was first reported a number of y e a r s ago, 56,57 the relation of RR surface shape with residue type and the identification of a characteristic "extent" for the minimum are new. It is particularly important that all good models meet the "extent" criteria. The extent of the minima can be associated with estimates of local entropy. 59 Further, they tend to correlate with reported atomic B factors in accord with the view of protein equilibrium structures as nonstatic entities. The RR surface picture of protein side chains "rattling around" in the cage formed by the surrounding environment is very similar to descriptions based on molecular dynamics trajectories (see, e.g., Ref. 59a). If this RR surface interpretation is correct, the surfaces would be 59 N. G6 and H. A. Scheraga, J. Chem. Phys. 51, 4751 (1969). 59a C. B. Post, B. R. Brooks, M. Karplus, C. M. Dobson, P. J. Artymiuk, J. C. Cheetham, and D. C. Phillips, J. Mol. Biol. 191}, 455 0986).
176
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS
[9]
It
180
/'/'
I'l
so-
'[
O-
v
g! -eO -120
-
-180 - 180
-120
-60
0
60
120
180
X1 (degrees)
b
180 120 80 tP
III
-80
-120
-180 -180
I -120
~
1 -60 Xl
0
I
I
60
120
180
(degrees)
FIG. 7. Characteristic side-chain rigid rotation potential energy surface shapes for (a) Ile, (b) Leu, (c) Asp, (d) Asn, (e) Thr, and (f) Phe, Tyr [contour levels (relative to global minimum): (--) +5 kcal/mol; (--) + 10 kcal/mol; (--) +25 kcal/mol].
[9]
MODELINGSIDECHAINS,LOOPS,AND INSERTIONS
177
o
1gO
Q;,/
eo
-t
-80 -120
-180 ~ -180
~
-120
-80
Xl
I
0
60
120
I
I
180
(degrees)
d 180
I
120
m
80
i°
w
-80
-
120
-180 -180
I
I
I
I
I
- 120
-60
0
80
120
X1 (degrees)
FIG. 7. (continued)
1BO
178
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS
[9]
e
1~o
I"
l/,
' J "l
' ,I I-,
'°
\, Vl ' /
I
\ I
-60
%.
-1:zo -zso -180
t t -120
J~'ll
-60
0
'
60
120
180
Xl (degree,)
180
1'
t
t' "
I
I
60
1gO
1~0
!°
60
I~
-eo
-1~Z0
-180 -180
I -120
I,, -60
,I 0
7.1 (degrees) FIo. 7, (continued)
I 180
[9l
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
179
able to provide a qualitative measure of local configurational free energy distributions of the side chains. Individual residues in well-refined proteins are rarely found in energetically unfavorable conformations, that is, conformations in which one or more intraresidue energy terms, primarily backbone-side-chain interactions, are strongly destabilizing. Because these are correlated with the specific backbone conformation of the residue, a general list of prohibited side-chain conformations is difficult to compile. However, nearly all prohibited side-chain conformations for a given backbone structure can be identified by simple energy criteria in terms of residue "self-energies." The self-energy is defined as the internal energy of the residue in the conformation observed in the protein, without including any of the interactions with the rest of the protein. The upper limit for the total residue selfenergies and individual self-energy components (i.e., bond stretching, angle bending, torsional, van der Waals, and electrostatic contributions) for residues that are clearly defined in a number of the high-resolution X-ray structures is found to be 9 and 5 kcal/mol, respectively. In addition to the above, thresholds for the interaction energy of each residue with the rest of the protein can be identified. The energy range for each residue (Gly, Ala, Val, Ile, etc.) is different because they differ in the number of atoms and have distinct van der Waals envelopes, hydrophobicity profiles, and electrostatic properties. However, each residue does appear to have a characteristic minimum energy and maximum energy value. The ranges determined for the aspartyl proteinases are reproduced in Table IV. The existence of the thresholds can be rationalized as follows. The free energy difference between fully folded protein and completely unfolded protein is quite small (on the order of 5-15 kcal/mol at 298 K; the specific enthalpy of unfolding of proteins has a broader range at 298 K, but the specific enthalpies of all compact proteins are about 13 cal/g at 373-383 Kr°). Many more or less randomly selected mutations result in engineered variants which fail to fold. This suggests that each residue contributes a minimum stabilization free energy to the folded protein. The RR surfaces indicate that a certain minimum entropy is associated with each dihedral degree of freedom. To a first approximation, the entropy distribution is uniform; this is in accord with normal mode analysis of per residue entropies in proteins. 6~ Hence, a minimum enthalpy contribution per residue, the minimum interaction energy threshold, seems physically reasonable. Entropy arguments must again be invoked to interpret maximum resi60 p. L. Privalov, Adv. Protein Chem. 33, 167 (1979). 61 M. Karplus, T. Ichiye, and B. M. Pettitt, Biophys. J. 52, 1083 (1987).
180
[9]
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS
TABLE IV RESIDUE INTERACTION ENERGY RANGES FOR ASPARTYL PROTEINASES WITH POLAR HYDROGENSa
Energies
Energies
Residue
Min.
Max.
Residue
Min.
Max.
Gly Ala Val Ile Leu Cys Met Phe Trp Tyr
- 29 - 33 -38 - 33 - 38 - 29 - 38 - 40 - 48 - 50
- 6 - 10 -20 - 23 - 20 - 22 - 20 - 22 - 37 - 28
Hisb Ser Thr Asn Gln Asp Glu Arg Lys Pro
- 62 -41 - 42 - 48 - 50 - 53 - 45 - 62 - 62 - 28
- 11 -15 - 16 - 14 - 12 - 16 - 23 - 30 - 30 - 9
a Constructed following the method of Brunger and Karplus [A. T. Brunger and M. Karplus, Proteins 4, 148 (1988)]. Values are given in kcal/mol. b Values refer to doubly protonated histidines.
d u e stabilities. A t s o m e critical p o i n t the i n t e r a c t i o n e n e r g y o f a r e s i d u e w i t h the r e s t o f the p r o t e i n c a n i n c r e a s e o n l y as a r e s u l t o f c o m p r e s s i o n o f the p r o t e i n . I n t e r m s o f the p a i r w i s e a t o m i c force fields d i s c u s s e d h e r e , if v a n d e r W a a l s a n d / o r e l e c t r o s t a t i c c o m p o n e n t s are to c o n t r i b u t e m o r e , t h e r e s i d u e s m u s t p a c k m o r e tightly a n d the q u a l i t a t i v e e n t r o p y f i n g e r p r i n t is lost; i n o t h e r w o r d s , the i n h e r e n t d i h e d r a l flexibility i m p l i e d b y the R R s u r f a c e s w o u l d b e " c o m p r e s s e d a w a y " b y the c l o s e p r o x i m i t y o f o t h e r r e s i d u e s . T h i s is p a r t i c u l a r l y e v i d e n t d u r i n g free m i n i m i z a t i o n in vacuo w h e r e t h e r e is a t e n d e n c y for large e l e c t r o s t a t i c g r a d i e n t s to d r i v e the late stages o f m i n i m i z a t i o n . A s a r e s u l t , the p r o t e i n s t r u c t u r e b e g i n s to distort. It b e c o m e s m o r e tightly p a c k e d a n d c o r r e s p o n d i n g l y m o r e e l e c t r o s t a t i c a l l y s t a b l e while, s i m u l t a n e o u s l y , v a n d e r W a a l s r e p u l s i o n s i n c r e a s e , b o n d a n d a n g l e g e o m e t r i e s d i s t o r t , a n d the e x t e n t f i n g e r p r i n t is lost. T h a t B f a c t o r s are o f significant m a g n i t u d e , e v e n for h i g h - r e s o l u t i o n s t r u c t u r e s o f small m o l e c u l e s , is i n c o n s i s t e n t w i t h the o c c u r r e n c e o f e x t r e m e c o m p r e s s i o n .
Formulation o f Modeling Scheme for Side Chains T h e i d e n t i f i c a t i o n o f statistical t r e n d s a n d the a b i l i t y to i d e n t i f y c l e a r l y a n d r e l i a b l y s i t u a t i o n s in w h i c h t h o s e t r e n d s are likely to b e v i o l a t e d s u g g e s t t h a t t h e s e c o r r e l a t i o n s c a n b e r e f o r m u l a t e d as a c o n c i s e set o f m o d e l i n g steps 7 t h a t d o n o t rely o n i n t u i t i v e d e c i s i o n s o r insights d e r i v e d
[9]
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
181
from graphical examination. Further, the "accuracy" of the model constructed can be tested by verifying the existence of known protein characteristics. These include hydrogen bond saturation, 62 the transferability of side-chain accessibility classifications, 37 and, most significant, the energetic fingerprints. 7 The sequence of modeling steps shown in Fig. 8 provides a tested protocol for side-chain placement (for more details, see Ref. 7). With a predetermined topological residue mapping from the chosen template to the target, Stage 1 transfers all possible side-chain conformations and partial conformations (e.g., for a Thr ---> Gin mutation transferability exists for y atoms only) from the template to the target according to the side-chain comparison criteria outlined in the Statistical Trends subsection. A limited number of side-chain clashes invariably results from the initial placement. In Stage 2, rather than making small changes in several side-chain conformations or subjecting the model to local minimization, dynamics, or Monte Carlo searches to relieve repulsive van der Waals interactions, corrections are made via conformational transitions. For example, the side chain might be shifted from conformation g ÷, g - , or t to another. The categories of probable side-chain mismatches given in the Statistical Trends subsection are used to determine which side-chain conformation should be changed. These are scanned for the applicable category with the highest probability, and the suggested change is then made to the model. If identical and mutated side chains are involved and no category is applicable, the mutated side chain should be altered. The new conformation is chosen by analyzing a contact list of van der Waals repulsive interactions generated from the RR surface. To reproduce the characteristic extent of the side-chain minima observed in globular proteins and to correct unfavorable steric contacts, the new conformation is required to have no van der Waals overlap interactions exceeding 1 kcal/mol within a -+20° increment of the chosen dihedral angle values. In most instances a single conformation will meet this criterion. If there are several van der Waals allowed conformations, the one that has the greatest rotational freedom and is within 30° of an ideal side-chain dihedral angle value (e.g., 60 °, - 6 0 °, or 180° for an sp3-hybridized atom) is chosen. The criterion is thus based on crystallographic survey r e s u l t s 37'46'63 and entropy considerations. This prescription is used again in Stage 3 to place any side-chain heavy atoms for which no information was contained in the template (e.g., the C a, O~1 and N~ atoms of a Thr ---> Gin mutation). Since atomic mobility 62 E. N. Baker and R. E. Hubbard, Prog. Biophys. Mol. Biol. 44, 97 (1984). 63 M. N. G. James and A. R. Sielecki, J. Mol. Biol. 163, 299 (1983).
182
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS , o o
. . . . . . .
o , ° . °
. . . .
°
. . . . . .
°°
. . . .
° ° . ° ° ° ° ° ° o ° ° ° o ° * * ° °
. . . .
[9]
° ° ° ° ° ° °
I
STAGE i
INITIAL SIDE-CHAIN PLACEMENT
STAGE 2
INITIAL SIDE-CBAIN CORRECTION from VDV contact check
.......................... STAGE 3 PLACE
i ...................................
REMAINING ATOMS
heavy atoms by
I RR(VDW)
evaluation
( I ) X1 only
(2) X1,X2 only (3) longer side chains hydrogen atoms by RE evaluation . . . . . . . . . . . . . .
STAGE 4
,
. . . . . . . . . . .
,
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
MODEL REFINEMENT & EVALUATION Self-enerE7 check .............. Interaction energY check
...........
Accessibility classification check Hydrogen bond check
RR map RR map
....... RR map
...............
RE map
Minimize
RR check Reminimize Minimization check: Asn(X2), Gln(X3)
FIG. 8. Flow chart for the proposed side-chain construction scheme. R R indicates the apphcation of the rigid rotation mapping procedure describcd in the Energetic Fingerprints subsection. R R ( V D W ) dcnotcs a rigid rotation sampling of the van der Waals surfacc for a specific sidc chain. XI and X2 refer to the generation of a rigidrotation potentialenergy surface for a specific dihcdral angle. Minimization is with respcct to all protein atoms and is done using an adopted-basis Newton-Raphson procedure with diminishing mass-weighted force constants. Minimization check refers to an adopted-basis Ncwton-Raphson energy minimization procedure applied to the Asn sidc chains while holding the remaining protein atoms fixed. [From N. L. S u m m e r s and M. Karplus, J. Mol. Biol. 210, 785 (1989).]
[9]
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
183
and, therefore, uncertainty are likely to increase as the number of degrees of freedom required to specify the atomic position increases, 64 "missing" atoms defined by a single degree of freedom are placed first, followed by atoms defined by two dihedral degrees of freedom, etc. Finally, polar hydrogens are built using the procedure of Brunger and Karplus. 55 Although hydrogen positions are not identifiable in X-ray structures, polar hydrogen atoms must be included in order to obtain an adequate energetic description of the molecule. Stages 1-3 have been based on either statistical data or the steric interaction of side chains with other residues. Generally at this point in more conventional modeling studies, the models are evaluated with respect to their apparent agreement or disagreement with available experimental data, typically fluorescence measurements, the effect of specific mutations on reactivity, low resolution X-ray data, or incomplete NMR data. The interpretation of these data at the molecular level is frequently somewhat ambiguous, and the data are never sufficient to test the model comprehensively. In the presence procedures, an additional state of model testing and refinement is used prior to experimental comparisons. The model is evaluated by the hierarchical procedure shown in Stage 4 of Fig. 8. The most sensitive checks are made in the latter steps after many of the modeling errors have already been detected and corrected. In each step, when violations are found, a RR surface is produced. If the model conformation lies near the minimum, the minimum possesses the characteristic extent, and the RR surface has a shape appropriate for the residue type, no corrections are made. Some problems of this type are expected because no purely statistical fingerprint, such as a self-energy, will be valid for all residues. Unique environments can require unique conformational solutions. Whenever another conformation satisfies the comprehensive RR surface requirements, that conformation is adopted. Since initial modeling errors will be amplified as they are carried through the modeling procedure, the corrections indicated should be made and verified during each step. If numerous changes are made, iteration through the previous steps for comprehensive verification is suggested. Refinement proceeds as followsT: First, the conformation of each residue is verified as probable by establishing that the residue self-energy and self-energy components are characteristic of well-folded proteins (see Statistical Trends). Second, the local packing environment of each residue is examined and verified as being typical of well-folded protein structures by establishing that the residue interaction energy lies within the characteristic range for the residue type. Third, a simple check of the detailed shape 64 G. A. Petsko and D. Ringe, Annu. Rev. Biophys. Bioeng. 13, 331 (1984).
184
PROTEINS AND PEPTIDES" PRINCIPLES AND METHODS
[9]
of the molecular envelope is made. It was noted, when the statistical trends for side-chain conformations were generated, that buried side chains almost invariably map into buried side chains, partly buried side chains map into partly buried side chains, and accessible side chains map into accessible side chains regardless of the mutation involved. Thus, a check of the transferability of side-chain accessibility classifications 39 between the template and target structures verifies that the molecular surface texture has been reproduced. Fourth, because the global fold is preserved within a protein family, hydrogen bonding patterns are expected to be similar. The hydrogen bonding pattern is examined to determine that it cross-correlates between the model and template structures, 9'65 and the overall structure is verified in the sense that it contains the level of hydrogen bond saturation characteristic of well-folded proteins. 62 It has been observed, for both the main chain and side chains, that all polar hydrogen atoms and all possible hydrogen bond acceptor atoms which are not exposed to solvent participate in strong protein-protein hydrogen bonds. 62 The few atoms which do not, reside in highly stabilizing electrostatic environments; a case of this type might be described as one in which the atom participates in two to four badly distorted hydrogen bonds. Fifth, local packing should be optimized by a brief energy minimization before the RR surface fingerprints are checked in order to avoid ambiguous interpretations. Prior to optimization, all side-chain conformations remain approximate with respect to their defining dihedral angles, as well as to their bond lengths and bond angles. Internal geometries as well as any remaining small van der Waals repulsive interactions can be removed, and local electrostatic interactions and hydrogen bonds can be optimized using a brief minimization procedure. Minimization must be done carefully, otherwise the RR surface fingerprints are "compressed away." It is suggested that the minimization be comprised of alternating steps of backbone relaxation with side-chain conformations fixed and side-chain relaxation with mass-weighted constraints on the backbone. This allows local packing optimization and secondary structural shifts while preventing most minimization artifacts (e.g., the distortion of angle geometries and the destabilization of van der Waals interactions). At this point the most comprehensive checks can be performed, namely, each side chain is verified to lie at the global minimum on its RR surface and the surfaces are verified to have the proper shapes and characteristic extents. If numerous changes have been made (i.e., if 10% or more of side chains are altered), the entire Stage 4 refinement should
65 C. Chothia a n d A. M. L e s k , EMBO J. 5, 823 (1986).
[9]
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
185
be repeated. The X2 dihedral of Asn and X3 of Gin are verified last because the stable conformation is sensitive to the local packing environment.
Sample Application To illustrate the utility of the method, an aspartyl proteinase test case has been worked out; it is presented in detail in Ref. 7. The C-terminal lobe side chains of rhizopuspepsin 66 (RHI) were placed using the homologous protein, penicillopepsin 63 (PEN), as a template. The aspartyl proteinases (Fig. 9) are bilobal. The lobes are independent folding entities with sidechain interactions limited to one ion pair and the arrangement of the two catalytic aspartic acids. Therefore, side-chain construction in the two lobes can be approached as nearly independent modeling problems. The C-lobe of PEN consists of residues 177 to 323.63 As stated previously, the methodology performs most reliably for side chains which are not highly accessible to solvent. Owing to the bilobal structure of the aspartyl proteinases, the fraction of highly accessible side chains in the C-lobe is significantly less than that of a 146-residue globular protein. The level of C-lobe sequence homology between PEN and RHI is 35.2%. This is probably near the lower limit for successful application of homology modeling. Large secondary structure drifts between template and target structures begin to become a problem for lower sequence homologies. Further, the homologous residues are not found uniformly along the sequence. Several long segments have the sequence homology expected for random sequences; that is, the segment from 181 to 262 has 13.9% sequence homology, while the segment from 267 to 294 has only 7.1% homology. Such regions, though not atypical of proteins, are difficult to model. Because high-resolution X-ray structures 63'66are available for both proteins (Table I), there is no ambiguity in the template structure, and detailed comparisons of the model and the actual structure can be made to evaluate unequivocally the reliability of the method. These factors make the C-lobe a particularly comprehensive test case for the proposed methodology. Factors other than the modeling methodology which might influence the predicted side-chain conformations must be eliminated if a concise evaluation of the modeling scheme is to be made. The propagation of modeling errors initiated by incorrect sequence alignments was avoided by adopting the topological residue mapping which resulted from the optimal root mean square (rms) overlay of the PEN and RHI backbones. Similarly, errors linked to incorrect local backbone models were avoided 66 K. Suguna, R. R. Bott, E. A. Padlan, E. Subramanian, S. Sheriff, G. H. Cohen, and D. R. Davies, J. Mol. Biol. 196, 877 (1987).
186
PROTEINS AND PEPTIDES." PRINCIPLES AND METHODS
[9]
b
FIG. 9. Stereographic views of the optimal C a trace overlays of the homologous aspartylproteinases (a) penicillopepsin (--) and rhizopuspepsin (--) (39% sequence homology) and (b) pepsinogen (--) and rhizopuspepsin (--) (34% sequence homology).
[9]
187
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
120 BO 40 0 -40 -BO -120
120 BO 40 0 -40 -80 -120 1BO
i
t
I
200
220
240
-
260
-
~
280
300
320
RESIDUE NO.
FIG. 10. Deviations of the 7-dihedral angles in the model of the RHI C-lobe from (a) the X-ray structure of RHI and (b) the X-ray structure after energy minimization. Open circles indicate that the 7-dihedral angles differ by less than 2°. [From N. L. Summers and M. Karplus, J. Mol. Biol. 210, 785 (1989).]
by building directly onto the backbone obtained from the RHI X-ray structure. All RHI C-lobe side chains were constructed by applying the scheme outlined in Fig. 8 using only PEN template information. The 3/ dihedrals of the final RHI model and RHI X-ray structure are compared schematically in Fig. 10a; the 8 dihedrals are compared in Fig. 1la. The agreement between the model and X-ray structures is quite good. Using the 40 ° comparison criterion, 91.6% of the 3/ dihedrals were modeled correctly. In fact, most of the y-dihedral deviations are less than 15°. For the 8 dihedrals, 80.9% were modeled correctly; most of the 8 deviations were within 20 °. Further, when the X-ray structure is subjected to the same brief minimization as the model (see Stage 4), many of the dihedral deviations disappear (Figs. 10b and 1lb). The side chains whose predicted and observed conformations differ
188
PROTEINS AND PEPTIDES" PRINCIPLES AND METHODS
19]
160
120 00
I
40
J'""
0
1'" ....
-40
,,., ]l, J ,, ,.~.,i., I,,
-60 - 120
-160
-
160 120 80
40 0 -40
I "" .....
-80
I "'lj "~"
....
1
L
' .......
r'""J
J'"
|
-120 -100 180
200
220
240
260
280
300
320
RESIDUE NO.
FIG. 11. Deviations of the 8-dihedral angles in the model of the RHI C-lobe from (a) the X-ray structure of RHI and (b) the X-ray structure after energy minimization. Open circles indicate that the 8-dihedral angles differ by less than 2°. [From N. L. Summers and M. Karplus, J. Mol. Biol. 210, 785 (1989).]
significantly fall into three of the probable mismatch categories. The appearance of large deviations is frequently related to very large side-chain accessibilities (>50%). The conformations of the side chains are likely to be strongly influenced by solvation and intermolecular packing forces not accounted for by the modeling scheme. Mismatches also tend to be observed for side-chain positions with large experimental errors (elevated B factors). In either case it is difficult to determine the "correct" result. Most of the remaining mismatches have RR surfaces which are virtually identical in the model and in the X-ray structure. There are broad or multiple minima so that it is difficult to determine whether, in actuality, the side chains sample several conformations. It is clear that the methodology reproduces the side-chain potential energy surfaces and, therefore, the complex side-chain environments remarkably well.
[9]
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
189
Main-Chain Segment Modeling
General Background and Modeling Applications There are four situations in which the modeling of segments of the backbone are of interest. First, fragments near residue insertions and deletions must be reconstructed. Second, where there is a Pro ~ non-Pro substitution owing to a mutation with respect to the homologous protein, fragment modeling may be required. The Pro tk dihedral angle is restricted to values near - 6 0 ° while the ~bdihedral angle must assume values near - 5 5 ° or 145°. For the residue preceding Pro, the ~b value must lie in the ranges - 210° to - 30° or 30° to 90°, and the ~bvalue must be between 60° and 180° . When the template backbone fails to meet these geometric criteria, the local backbone conformation must be generated. Third, fragment construction may be useful in the interpretation of experimental NMW 6 or X-ray crystallographic 12: data, particularly for regions where the data are poor or nonexistent. 18 Fourth, the effect of residue insertion or deletion by protein engineering may be predicted. Several statistical approaches to generate fragments have appeared in the literature. It has been suggested that for topologically equivalent segments of homologous proteins which contain the same number of residues, the template backbone can always be adopted. 4 However, the assumption that such fragments will always have the "correct" conformation for the target molecule is not v a l i d . 9'14'15'67'68 Structural motif approaches assume that the amino acid sequence or the residue identity at certain key positions uniquely determines fragment conformation. 11,69-73Although the concept is an appealing one, no study has established unique relationships between sequence and conformation. In fact, it is clear that fragments with identical amino acid sequences can have different conformations. 69'71'72'74-77 However, for homologous proteins, closer agreement is 67 R. J. Read, M. Fuginaga, A. R. Sielecki, and M. N. G. James, Biochemistry 22, 4420 (1983). 68 R. J. Read, G. D. Brayer, L. Janasek, and M. N. G. James, Biochemistry 23, 6570 (1984). 69 B. L. Sibanda and J. M. Thornton, Nature (London) 316, 170 (1985). 70 E. J. Milner-White and R. Poet, Biochem. J. 240, 289 (1986). 71 p. Argos, J. Mol. Biol. 197, 331 (1987). 72 M. S. Edwards, M. J. E. Sternberg, and J. M. Thornton, Protein Eng. 1, 173 (1987). 73 T. L. Blundeil, B. L. Sibanda, M. J. E. Sternberg, and J. M. Thornton, Nature (London) 326, 347 (1987). 74 p. y . Chou and G. D. Fasman, J. Mol. Biol. 115, 135 (1977). 75 W. Kabsch and C. Sander, Proc. Natl. Acad. Sci. U.S.A. 81, 1075 (1984). 76 p. L. Easthope and G. D. Brayer, Int. J. Pept. Protein Res. 27, 666 (1986). 77 j. K. Leszczynski and G. D. Rose, Science 234, 849 (1986).
190
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS
[9]
expected. Further, the work of Sibanda and Thornton 69 suggests that if such a correlation does exist it is likely to degrade as the fragment length increases. For two- to four-residue fragments, sequence-conformation relationships may predict the correct fragment conformations in 80% of all test cases. But as the fragment length increases, conformations other than those predicted are more likely to be observed. Conformational search algorithms 78-8°for insertions or deletions do not have this limitation. In principle, these algorithms generate all possible fragment conformations and evaluate them with respect to the potential energy surface of the protein. However, search algorithms are computationally intensive; the number of fragment conformations which must be considered frequently exceeds a thousand for a five-residue fragment. For homology modeling applications, it is unlikely that all other atoms of the protein have been placed prior to fragment construction. This means that the potential energy surface which is evaluated is "incomplete," and the relative energies of fragment conformations are shifted in an unpredictable manner. In addition, in many fragment construction applications the atom positions on either side of the model fragment are likely to be in error by 1.0/~ or more. These atoms are the structural "anchors" for the search. The effect of such large coordinate uncertainties on the conformations generated and on their relative energies is uncertain and likely to differ for each application. It has been shown that approximate C ~ positions could be used to generate reasonably good first approximations for segment conformations during X-ray crystallographic 17or N M R 16refinement. A database of highresolution crystal structures is scanned for fragments whose relative C ~ positions are similar to the relative C ~ positions in the protein being constructed. An array of conformationally "correct" and conformationally "incorrect" fragments is generated. The correct conformation is identified after least-squares refinement against either experimental electron density 17 or two-dimensional NMR resonance assignments. 16 This type of approach can also be applied to conformational searches in the absence of experimental data. Its use can drastically reduce the extent of a potential energy surface which must be examined. In an application of this approach to a segment of retinol-binding protein, Jones and Thirup 17 found the correct solution in addition to a number of other possible fragment struc78 R. M. Fine, H. Wang, P. S. Shenkin, D. L. Yarmush, and C. Levinthal, Proteins 1, 342 (1986). 79 j. Moult and M. N. G. James, Proteins 1, 146 (1986). 8o R. E. Bruccoleri and M. Karplus, Biopolyrners 26, 137 (1987).
[9]
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
191
tures but gave no criteria for distinguishing the correct solution from the others. Using the work of Jones and Thirup ~7 as a point of departure, the strategy of identifying and applying statistical trends to limit the number of conformations selected as modeling options has been developed recently. 12 Further, it is often possible to identify the "correct" fragment from the subset of selected conformations by examining energetic fingerprints. 12 This is an extension of the approach already described for side-chain placement on the main chain. Statistical trends are used to select initial fragment conformations. Then, energetic fingerprints are used to refine the fragment structure and to identify incorrect selections. In what follows we describe a modeling scheme based on these ideas.
Formulation of Modeling Scheme for Main Chains The modeling scheme is based on four assumptions. First, in accord with the work of Jones and Thirup, ~7 it is assumed that the backbone conformation of any folded protein fragment can be (partially) defined with respect to the position of the backbone flanking the fragment. For any nresidue fragment, (3n + 3) dihedral degrees of freedom [~b, ~b, and to for each residue in the fragment plus ~b of residue (gap - 1), and to and tb of residue (gap + 1) which are defined by the docking of the fragment with the rest of the backbone model] are required to specify completely the fragment conformation with respect to the rest of the protein. If no Pro segments are contained in the fragment, all trans peptide bonds can be assumed, although a small number of non-Pro trans peptide bonds have been reported (see, e.g., Ref. 81). This reduces the fragment degrees of freedom to (2n + 2). Thus, the backbone conformation of any threeresidue fragment is defined by approximately 8 dihedral degrees of freedom; correspondingly, six-residue fragments have 14 dihedral degrees of freedom. This is already a large and computationally difficult conformational search problem. However, the total number that needs to be examined is assumed to be reduced significantly because only a few conformations will satisfy the geometric closure requirements for the energetically favorable attachment of each end of the fragment to the protein. Second, this conformational subset is assumed to be nearly independent of sequence in accord with the already mentioned lack of identifiable correlation between fragment sequence and conformation. Third, a relatively small database of well-refined, high-resolution protein crystal structures sl D. C. Rees, M. Lewis, and W. N. Lipscomb, J. Mol. Biol. 168, 367 (1983).
192
[9]
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS
Step i :
] CALCULATION OF DESCRIPTORS (I)I
I Step 2 :
[ SELECTION (t1)[
I
d a t a base scan s o r t - s e l e c t e d conformers
I Step 3 :
[ MANIPULATION (iii)[
I one-constraint minimization I five-constraint minimization I ............... I R...s. INs~rno. ~zsT
I
I Step
4
[ INSERTION (iv)[
:
I
minimization v i t h d e c r e a s i n g harmonic c o n s t r a i n t s
.I. . . . . . . . . . . . . . . I
I MINIMUM ENERGY TEST I I SUSSTZnWZO. ~CY
~ST I
I Step 5 :
I ITLRATION (v) I cycle cycle cycle cycle cycle cycle cycle cycle cycle cycle cycle
a: b: cz d: e: f: g: hz i: J: k:
I
s h i f t n - r e s i d u e gap p o s t i o n by +-1 r e s i d u e s f o r I/D only, repeat (a) u s i n g a 2 residue s h i f t change gap length by +1 r e s i d u e apply cycle ( a ) - ( b ) methodology to gaps in (c) change gap length by -1 apply cycle ( a ) - ( b ) methodology to gaps in (e) f o r n~4 only, repeat c y c l e s ( e ) - ( f ) f o r n~5 only, repeat c y c l e s ( e ) - ( f ) s h i f t l o c a l sequence allEnaent by i residue end repeat (a)-(h) s h i f t l o c a l sequence a l l p m e n t by 2 r e s i d u e s end r e p u t (a)-(h) t e s t any segment c o n s i s t e n t l y s e l e c t e d in s t e p s ( a ) - ( J )
FIG. 12. F l o w c h a r t of the s e g m e n t modeling s c h e m e . [From N. L. S u m m e r s and M. Karplus, J. Mol. Biol. 216, 991 (1990).]
will contain a sufficient range of variability to generate satisfactory test fragments for many modeling problems. Fourth, the arguments for the existence of serf-energy and interaction energy thresholds invoked for side-chain modeling can be extended with equal applicability to short, well-formed fragments. Based on these four assumptions, a five-stage modeling scheme has been developed and tested.~2 A flow chart of the procedure is given in Fig. 12.
[9]
193
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
n-residue gap
I m
(~:ap+l
./ FIG. 13. Schematic representation of the set of target distances, {tn(i,j)}, generated by the three C~ positions immediately preceding and following an n-residue gap in a protein backbone, that is, an (n,3) segment. The end-to-end distance is represented by a dotted line; target distances associated with C%~p-3are shown as dashed lines. Virtual bonds are shown as solid lines. [From N. L. Summers and M. Karplus, J. Mol. Biol. 216, 991 (1990).]
Fragment Descriptors In Step 1, the relative orientations o f the protein b a c k b o n e flanking the missing segment are described by a set o f C ~ separations. T h e s e target distances, {tn(i,j)}, w h e r e n is the number o f fragment residues, are initially generated from the three C ~ positions immediately preceding and following the "missing" fragment. Fifteen unique internuclear distances are generated. One (the dotted line in Fig. 13) specifies the end-to-end distance or closure condition for the gap. The remaining m e m b e r s o f the set qualitatively describe the conformational restrictions placed on the ends o f the gap fragment by the rest o f the molecular model. At least three C ~ positions on either side o f the gap are required for a local description o f the b a c k b o n e c o n f o r m a t i o n on either side o f the "missing" fragment. T w o adjacent C ~ positions describe a virtual bond; their internuclear separation is relatively constant. The C ~ subset o f b a c k b o n e atoms is c h o s e n to generate the target distances b e c a u s e crystallographic uncertainty tends to be smallest for these atoms.
194
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS
[9]
Statistical Trends In Step 2, the database is scanned for all n-residue fragments with similar target distances. The search is made in a straightforward way by calculating the root mean square deviation, RMS(n,P,S), between the set of target distances {tn(i,j)} and the equivalent set of distances {rn(P,S;k,l)} generated for every n-residue segment S in each protein P in the database, where
RMS(n,P,S) = I X [tn(i'J) - r~P'-S;k'l)]211/2 ~:~
2n+2
J
The indices i and j run over the descriptor C a positions in the target structure; the indices k and I correspond to the descriptor C a positions in the test fragment equivalent to descriptors C{ and C;, respectively. The ten segments with the smallest RMS(n,P,S) values are selected from the database. Use of a root mean square cutoff is important. In comprehensive conformational search algorithms, the segment end points are taken to be fixed for simplicity, that is, there is no flexibility in the atom positions flanking the missing segment, although such flexibility could be introduced. By making selections with respect to a cutoff in the database search, the atom positions are no longer fixed in space; they are constrained to spherical volumes. Because the closure condition is no longer exact, there is an effective increase in the number of independent degrees of freedom. This implies a tolerance for normal experimental coordinate uncertainty in the backbone atom positions anchoring the search. Moreover, for homology modeling there is no reason that the target and template positions be identical, and larger differences are expected when sequence identity is low. A corresponding uncertainty is expected for bioengineering applications involving main-chain design. In each case, relatively large coordinate uncertainty is anticipated for the residues flanking the gap. The selected segments are sorted into conformational groups with rspect to the (co,~b,~b) dihedral angles. Within a conformational group, corresponding dihedral angles are required to lie within 60 ° of each other. These large conformational bins tend to normalize the statistical scatter in the database by allowing for dihedral angle uncertainties arising from the inherent crystallographic error associated with each atom, differing data resolutions, varying degrees of refinement, and the use of a variety of refinement algorithms. Any "conformational group" which contains three or more members is considered to represent a statistically viable conformational solution.
[9]
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
195
Energetic Fingerprints In Steps 3 and 4 energetic fingerprints are used to evaluate whether the statistically derived "multiple selections" are " g o o d " modelsJ 2 In Step 3, the stability of the model fragments are evaluated in isolation from the rest of the protein. Since a root-mean square cutoff is used for fragment selection, it is likely that no trial segment exactly satisfies the closure conditions for the gap. If a trial segment were inserted directly into the protein model, localized high-energy regions ("hot spots") might be created on the fragment potential energy surface. The problem can be avoided by first isolating the trial fragment consisting of the n-residue gap fragment plus the (gap - 1)st and (gap + 1)st residues from its native protein, removing the side-chain atoms beyond C t~, and then briefly minimizing the fragment in the presence of a harmonic potential that forces the C~gap_1)C~gap+1) separation to converge to the end-to-end gap length in the model. By setting the force constant for the additional term approximately equal to that o f a C - C bond stretch, artifactual compression during minimization is avoided, and the internal energy changes which result tend to be uniformly distributed along the fragment, if the trial fragment is a viable modeling option. Although the resulting conformation has the correct end-to-end length for insertion, there is no unique orientation for insertion relative to the rest of the protein model. The orientation is fixed by a second brief minimization step with five additional harmonic terms represented as thin lines in Fig. 14. Again, the additional harmonic force constants are set equal to that of a C - C stretch. If the trial fragment can be made to satisfy these additional energetic restraints through brief minimization, the trial conformation can assume a conformation appropriate for insertion. When the root mean square deviation of the N and C ~ atoms of residue ( g a p - 1) and the C ~ and C atoms of residue (gap + 1) between the target and manipulated fragment backbones (R.M.S. Insertion Test in Fig. 12) is greater than about 0.25 ,~, large conformation changes were found to be required to force the trial segment to satisfy the insertion constraints. As long as the force constants assigned to the additional terms are in proportion to the rest of the force field, the additional terms are usually not large enough to initiate the required conformational change. The fragment is discarded because it is not a viable modeling option. Trial segments which pass the insertion rms test are examined in Step 4, where the fingerprints which characterize the interaction of the docked fragment with the rest of the protein model are evaluated. The side chains are built on the fragment to complete the potential energy surface (see the Side-Chain Modeling section).
196
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS
[9]
Nf/~
/ / /
C
I l
I I
~
//C~"~
I I
I
c ............ target
N~
C°
~~
i
segment
FIG. 14. Typical trial segment end-to-end length and orientation compared to the rest of a molecular model. Backbone atoms for the residues ( g a p - 1 ) and (gap+ 1) are shown explicitly, and bonds are drawn as heavy lines. The atoms of the n-residue gap are represented by dotted lines. Narrow lines indicate the additional harmonic potential terms applied to the trial segment in order to satisfy the end-to-end closure and the fragment orientation conditions for the n-residue gap. Trial segment insertion occurs with respect to the N and C ~ atoms of residue ( g a p - 1 ) and the C ~ and C atoms of residue (gap+ 1); it is represented by dashed lines. [From N. L. Summers and M. Karplus, J. Mol. Biol. 216, 991 (1990).]
The simple docking algorithm which has been used to insert the fragment frequently results in highly repulsive van der Waals overlaps with the rest of the protein model. Prior to the evaluation of the interaction of the fragment with the rest of the protein and before the side-chain fingerprints can be examined, the trial segment must be relaxed in the model to relieve internal and steric strain. As before, the minimization procedure must be gradual and brief so that compression artifacts do not destroy the energetic fingerprints. This is accomplished by briefly minimizing the segment with diminishing mass-weighted harmonic constraints 54 that tether the atoms to specific points in space in the presence of the fixed protein. At this point the interaction energy fingerprints can be applied. First, the stabilization of the relaxed fragment by the rest of the protein is evaluated. Unminimized n-residue segments in high-resolution crystal structures (2.0 ,~ or better resolution, R -< 20.0%) typically have energies below - 8n kcal/mol, that is, the average minimum stability of each residue is 8 kcal/mol. Although force field and parameter set dependent, this minimum energy threshold per residue appears to have little sequence or fragment length dependence. Viable modeling options must satisfy this minimum energy criterion (Minimum Energy Test in Fig. 12). Extended minimization could cause the interaction energy criteria to be satisfied through a combination of electrostatic and van der Waals compression. Therefore, a second simple test for minimization artifacts at the junctions
[9]
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
197
of the trial fragment with the rest of the model is made. If the trial segment is a good model, it should not have induced significant strain in the rest of the model when it was inserted. The induced strain can be estimated by removing the n-residue gap segment from the relaxed model and reinserting it in the original protein model prior to minimization. If the interaction energy of the fragment with the rest of the protein is still negative (Substitution Energy Test in Fig. 12), the trial fragment induced little or no strain in the model. Finally, a more detailed fragment evaluation can be made by verifying the side-chain interaction and hydrogen bond saturation fingerprints discussed in the previous section. The accessibility classification fingerprint should also be satisfied if it is applicable. If all of the above criteria are satisfied, it is still not guaranteed that the fragment conformation is, in fact, correct. In the best case, any fragment which satisfies these criteria will be indistinguishable (within experimental error) from the correct, well-folded fragment. In the worst case, the propagation of error in the model caused by an incorrect fragment conformation should be minimal since its energetic behavior should be very similar to that of the correctly folded protein fragment, ff no fragment satisfied all of the criteria, a new set of descriptors can be defined and the scheme repeated until a viable model fragment has been identified. The iteration scheme (Step 5) addresses the likely sources of error in decreasing probability of occurrence. 12 The first iteration steps shift the position of the fragment by one and then two residues. These shifts tend to compensate for structural perturbations between the model and actual atoms in the backbone flanking the gap segment. The last iteration steps tend to compensate for local sequence alignment errors in homology modeling applications. If no iteration step locates a viable fragment conformation, a complete conformational search must be considered.
Applications The test cases for the fragment modeling methodology have been reported in detail elsewhere.12 High-resolution X-ray structures were used to derive the descriptor distances and to compare the final results. The database consisted of 31 high-resolution X-ray structures (Table V). Briefly, the test cases consisted of ten three-residue X-Pro-Y PEN segments of antiparallel fl sheet, isolated 350 helix, 310 helix associated with a helix, and random coil structures. Two cis-Pro conformations were ineluded as well. Eighteen one-, two-, and four-residue insertions and deletions in the homologous protein pairs (PEN,END), (TTRP,GCHY) and ( H E N , H U M ) were also used. For the Pro-containing test cases the descriptor distances were taken from the PEN protein itself. For the inser-
198
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS
[9]
TABLE V DATA FOR PROTEINS CONTAINED IN DATABASE a
Protein Actinidin c Endothiapepsin d Penicillopepsin e Rhizopuspepsinf Carboxypeptidase A g Parvalbuminh Crambin i Cytochrome c j Dihydrofolate reductase k Erythrocruorin ~ Ferredoxin m Flavodoxin n 3,-Chymotrypsin ° High-potential iron proteinP Insulin q
Lactate dehydrogenase r Leghemoglobin ' Myoglobin t Neurotoxin B u Proteinase inhibitor, Kazal v
Plastocyanin w A v i a n pancreatic peptide x T r y p s i n inhibitory Trypsin z Bence-Jones protein a~ R h o d a n e s e bb R u b r e d o x i n ad Protease A ee Thermolysinff Lysozyme H e n egg white ge H u m a n hh
No. of residues
Resolution (/~)
R factor
2ACT 4APE 2APP 2APR 5CPA 1CPV 1CRN 4CYT 3DFR
220 330 323 325 307 108 46 103 162
1.7 2.1 1.8 1.8 1.5 1,9 1,5 1.5 1.7
0.17 0.16 0.14 0.14 0.19 0.25 0.11 0.17 0.15
IECA 1FDX 3FXN 2GCH 1H I P
136 54 138 241 85
1.4 2.0 1.9 1.9 2.0
0.19 0.19 0.21 0.18 0.24
IlNS A B C D 4LDH
21 30 21 30 329
1.5
0.18
2.0
0.20
1LH2 2MBN 1NXB 1OVO A B C D 1PCY 1PPT
153 153 62 56 56 56 56 99 36
1.5 2.0 1.4 1.9
0.18 0.24 0.24 0.20
1.6 1.4
0.17 0.28
4PTI 2PTN 1REI A B IRHD 2RXN 2SGA 3TLN
58 223 107 107 293 53 181 316
1.5 1.7 2.0
0.16 0.20 0.24
2.5 1.5 1.5 1.6
m 0.76 c¢ 0.13 0.13 0.21
HEWL HUML
129 130
1.6 1.5
0.22 0.19
ID/PDB-ID b
[9]
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
199
TABLE V (continued) a From N. L. Summers and M. Karplus, J. Mol. Biol. 216, 991 (1990). b F. C. Bernstein, T. F. Koetzle, G. J. B. Williams, E. F. Meyer, Jr., M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi, J. Mol. Biol. 112, 535 (1977). c E. N. Baker and E. J. Dodson, Acta Crystallogr., Sect. A. 36, 559 (1980). d L. Pearl and T. Blundell, FEBS Lett. 174, 96 (1984). e M. N. G. James and A. R. Sielecki, J. Mol. Biol. 163, 299 (1983). f K. Suguna, R. R. Bott, E. A. Padlan, E. Subramanian, S. Sheriff, G. H. Cohen, and D. R. Davies, J. Mol. Biol. 196, 877 (1987). g D. C. Rees, M. Lewis, and W. N. Lipscomb, J. Mol. Biol. 168, 367 (1983). h p. C. Moews and R. H. Kretsinger, J. Mol. Biol. 91, 201 (1975). i W. A. Hendrickson and M. M. Teeter, Nature (London) 290, 107 (1981). J T. Takano and R. E. Dickerson, Proc. Natl. Acad. Sci. U.S.A. 77, 6371 (1980). k j. T. Bolin, D. J. Filman, D. A. Matthews, R. C. Hamlin, and J. Kraut, J. Biol. Chem. 257, 13650 (1982). t W. Steigemann and E. Weber, J. Mol. Biol. 127, 309 (1979). m E. T. Adman, L. C. Sieker, and L. H. Jensen, J. Biol. Chem. 251, 3801 (1976). n W. W. Smith, R. M. Burnett, G. D. Darling, and M. L. Ludwig, J. Mol. Biol. 117, 195 (1977). o G. H. Cohen, E. W. Silverton, and D. R. Davies, J. Mol. Biol. 148, 449 (1981). P C. W. Carter, J. Kraut, S. T. Freer, W.-H. Xuong, R. A. Alden, and R. G. Bartsch, J. Biol. Chem. 249, 4212 (1974). q J. Bordas, G. G. Dodson, H. Grewe, M. H. J. Koch, B. Krebs, and J. Randall, Proc. R. Soc. London, Set. B 219, 21 (1983). r j. L. White, M. L. Hackert, M. Buehner, M. J. Adams, G. C. Ford, P. J. Lentz, I. E. Smiley, S. J. Steindel, and M. G. Rossman, J. Mol. Biol. 102, 759 (1976). s E. H. Arutyunyan, J. Deisenhofer, A. V. Teplakov, I. P. Kuranova, H. B. Obmolova, and B. K. Vainshtein, Dokl. Akad. Nauk SSSR 270, 732 (1983). t T. Takano, J. Mol. Biol. 110, 537 (1977). u D. Tsernoglou and G. A. Petsko, FEBS Lett. 68, l (1976). E. Papamokos, E. Weber, W. Bode, R. Huber, M. W. Empie, I. Kato, and M. Laskowski, J. Mol. Biol. 158, 515 (1982). w j. M. Guss and H. C. Freeman, J. Mol. Biol. 169, 521 (1983). x T. L. BlundeU, J. E. Pitts, I. J. Tickle, S. P. Wood, and C.-W. Wu, Proc. Natl. Acad. Sci. U.S.A. 78, 4175 (1981). YM. Marquart, J. Walter, J. Deisenhofer, W. Bode, and R. Huber, Acta Crystallogr., Sect. B 39, 480 (1983). z Brookhaven Protein Data Bank, PDB-ID 3PTN. aa O. Epp, E. E. Lattman, M. Schiffer, R. Huber, and W. Palm, Biochemistry 14, 4943 (1975). bb j. H. Ploegman, G. Drent, K. H. Kalk, and W. G. J. Hol, J. Mol. Biol. 123, 557 (1978). cc Average figure-of-merit. dd K. D. Watenpaugh, L. C. Sieker, J. R. Herriott, and L. H. Jensen, Acta Crystallogr., Sect. B 29, 943 (1973). ee M. N. G. James, A. R. Sielecki, G. D. Brayer, L. T. J. Delbaere, and C.-A. Bauer, J. Mol. Biol. 144, 43 (1980). ff A. F. Monzingo and B. W. Matthews, Biochemistry 21, 3390 (1982). gg H. H. G. Handoll, unpublished results (1984). hh p. j. Artymiuk and C. C. F. Blake, J. Mol. Biol. 152, 737 (1981).
b
c
d
[9]
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
201
tion/deletion test cases the descriptor distances were taken from the homologous protein to realistically evaluate the effect of normal modeling errors. To avoid the propagation of error owing to incorrect side-chain conformations, they were taken from the internal coordinates of the X-ray structures. However, because the backbone rms deviation between the X-ray and trial segments is frequently on the order of 1 to 1.5 ,~, the individual side-chain atoms may be displaced by as much as 2.5 A with respect to the X-ray structure. The side-chain modeling methodology generally produces smaller side-chain placement errors. Therefore, the initial side-chain placement errors are reasonably representative of those incurred by a carefully executed modeling procedure, such as the one described above. Overall, the scheme performed quite well; 79% (22 out of 28) of the test cases were modeled to within experimental error. Several of these are illustrated in Fig. 15. For 97% of the conformations generated by the search scheme, the statistical and energetic fingerprints "correctly" discriminate between trial conformations which correspond to the X-ray conformations and those which are different. The origins of the viable fragment conformations are of particular interest. Most of the fragments were taken from proteins which belonged to structure-function families different from that of the target protein and exhibited no identifiable sequence or structural relationships to the target protein. This supports the assumption that the local backbone fold of many shorter segments is dominated by the adjacent backbone conformations without any definite sequence dependence. Failures of the scheme were invariably correlated with the presence of infrequently observed conformations (e.g., cis-prolines) in the target protein or with locally unique protein folds. The level of success achieved implies that this combination of statistical trends and energetic fingerprints is relatively insensitive to uncertainty and/or error in the model coordinates and so should be of general utility in evaluating segment models, even when generated by schemes other than the one proposed here. Concluding Discussion Detailed, atomic level descriptions of proteins are required to interpret and extrapolate sequence, structure, and function data effectively. Since data from sequence analyses, optical spectroscopy mutation studies, and kinetic measurements can be collected and analyzed far more readily than
FIG. 15. Stereographic views of protein fragments (--) relative to predicted conformations (--) for (a) PEN 266-270, (b) PEN 252-256, (c) GCHY 35-40, and (d) GCHY 201-208.
202
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS
[9]
atomic resolution structures can be determined, modeling methodologies are needed. Although no ab initio methods for complete protein structure prediction exist, significant advances have been made, particularly in the area of homology modeling where the prediction of side-chain conformations and main-chain segment conformations is now possible. In the past, two approaches have been taken, conformational searches and the application of conformational statistics. Each approach has advantages and disadvantages. Conformational searches are computationally expensive and yield a range of possible modeling solutions. It is generally difficult to select the "correct" model from the subset generated by the search. The selection is typically made with respect to relative conformer energies. However, Novotny et al. 82'83 have shown that this criterion is generally unable to distinguish well-folded from poorly folded structures. The application of statistical trends can drastically limit the extent of conformational space that must be searched as well as the number of possible modeling solutions which are generated. However, the effectiveness of statistically derived methods is critically dependent on the size and composition of the database used. Little has been done so far to develop criteria for the determination of the correctness of the selected structures. A hybrid approach to structural modeling has been proposed in this chapter which combines the detection, extension, and application of statistical trends with the identification of energetic fingerprints which characterize well-folded, globular proteins. The coupling of the two approaches circumvents many of the difficulties encountered when either is applied alone. The Statistical Trends subsections have shown that relatively small databases comprised of well-determined, experimental structures are sufficient for many applications under two conditions. First, the databases must be representative of the modeling problems under consideration. Second, even high-resolution, well-refined structures are not of uniform quality. Selection algorithms should be robust enough to disregard poorly or incompletely refined side chains or segments. Under these two conditions, several possible conformations or incorrect conformations may still result. Subsequent identification of the correct conformation is required to obtain a structural model. We have described criteria that make it possible to differentiate correct from incorrect conformations in many cases and to determine when such an evaluation is likely to be difficult. Modeling accuracy can be improved significantly by the identification and use of statistical trends which pinu j. Novotny, R. Bruccoleri, and M. Karplus, J. Mol. Biol. 177, 787 (1984). 83 j. Novotny, A. A. Rashin, and R. E. Bruccoleri, Proteins 4, 19 (1988).
[9l
MODELING SIDE CHAINS, LOOPS, AND INSERTIONS
203
point modeling situations in which the most statistically probable conformations are not likely to be observed. In addition to observed geometric properties such as dihedral angle transferabilities between homologous structures, potential energy surface characteristics also exhibit useful trends that have not been pointed out previously. If the force field description chosen for a protein is reasonably accurate, carefully executed, and internally consistent, each residue and each contiguous segment possess energetic traits which appear not to be violated in correctly folded protein structures. The rigid rotation, potential energy surfaces of side chains are particularly sensitive to the quality of a folded structure. Each residue (Val, Ile, Phe, etc.) has a characteristic shape which is unique to that residue type. These shapes describe surface residues as well as core residues. Each side chain is usually found within 10° to 15° of its global minimum on these surfaces. Further, the shape of the minima are such that the introduction of about 5 kcal/mol of energy (for the CHARMM 42 force field) permits a range of at least 30° to 40 ° in each side-chain dihedral degree of freedom. These characteristic extents can be viewed as a qualitative measure of the local configuration entropy present in a protein structure. In addition, the self-energy of each residue (calculated in the absence of interactions with the other residues in a protein) corresponds to a characteristic minimum stability. Similarly, the energies of interaction of residues and continuous protein fragments with the rest of the molecule have characteristic threshold values of minimum and maximum stability. Minimum stability thresholds can be rationalized in terms of the stabilization energy necessary to favor a folded state over a denatured state. Maximum thresholds can be related to the entropy requirements. These energetic fingerprints appear to be rather general, though more work is required to establish their limitations. The importance of such energetic fingerprints is that they can be used to evaluate conformational search results or to evaluate the quality of a protein structure, whether it is a theoretical model or derived directly from experimental measurements. We have tried to show how the statistical trends and energetic fingerprints have been combined to create main-chain and side-chain modeling methodologies which are generally applicable, require significantly less disk storage space than conformational search methods, are cpu efficient, and generate model structures which are fully reproducible. The energetic fingerprints can frequently distinguish "incorrect" from "correct" structures and are relatively insensitive to small parameterization inconsistencies, simplified force field formulations, and coordinate errors typical of most modeling applications.
204
PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS
[10]
The insertion-deletion geometries and the side-chain conformations provide a basis for approaching the most difficult aspect of homology modeling, that of correctly predicting the differences in more global aspects in the structure of the target and the template protein. The global fold is qualitatively preserved throughout a structure-function family. However, the individual members of the family may differ significantly in terms of the relative rotation and translation of their secondary structure elements (Fig. 9). Chothia and Lesk 84 have shown that these relative displacements increase as sequence homology decreases. For the homology range of 25-40% in which protein modeling is most frequently attempted, the global backbone rms deviation between the homologous structures tends to be between 1.2 and 3.0 ,A. These differences are induced by the insertion and/or deletion of residues in one sequence with respect to another and by the secondary structure packing differences resulting from residue mutations. Initial homology models consist of the target sequence mounted on a known template backbone. Once "correct" main-chain and side-chain conformations can be predicted in the presence of such backbone coordinate differences, additional algorithms need to be developed to optimize the secondary structure packing and, consequently, produce models with quantitatively correct global structure. It is these models which would be excellent starting points for structure-function studies, inhibitor-drug design, or larger-scale bioengineering problems. The methodologies presented in this chapter provide a procedure for the critical steps of generating "correct" initial main-chain and side-chain conformations. Full global optimization of the resulting homologous structure remains as a task for the future. u C. Chothia and A. M. Lesk, "Computer Graphics and MolecularModelling" Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, 1986.
[10] N e u r a l N e t w o r k s for P r o t e i n S t r u c t u r e P r e d i c t i o n
By L. HOWARD HOLLEY and MARTIN KARPLUS Introduction The prediction of the structure of a protein from its amino acid sequence remains one of the most challenging problems in biochemistry. Although X-ray crystallography has revealed the complexity of the threeMETHODS IN ENZYMOLOGY, VOL. 202
Copyright © 1991 by Academic Press, Inc. All rights of reproduction in any form reserved.