Native contacts determine protein folding mechanisms in atomistic simulations Robert B. Best1, Gerhard Hummer2, and William A. Eaton Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892-0520 Edited by Michael Levitt, Stanford University School of Medicine, Stanford, CA, and approved September 10, 2013 (received for review June 18, 2013)

The recent availability of long equilibrium simulations of protein folding in atomistic detail for more than 10 proteins allows us to identify the key interactions driving folding. We find that the collective fraction of native amino acid contacts, Q, captures remarkably well the transition states for all the proteins with a folding free energy barrier. Going beyond this global picture, we devise two different measures to quantify the importance of individual interresidue contacts in the folding mechanism: (i) the log-ratio of lifetimes of contacts during folding transition paths and in the unfolded state and (ii) a Bayesian measure of how predictive the formation of each contact is for being on a transition path. Both of these measures indicate that native, or near-native, contacts are important for determining mechanism, as might be expected. More remarkably, however, we found that for almost all the proteins, with the designed protein α3D being a notable exception, nonnative contacts play no significant part in determining folding mechanisms. funnel

| internal friction | frustration | reaction coordinate | Go models

P

rotein folding theory and simulations of coarse-grained models have yielded many insights into the principles that allow a protein to fold rapidly to its native state (1–7). To a great extent, such models have also been successful in reproducing and helping to interpret experimental data. Nonetheless, to be amenable to analytical solution, or to simulations of limited computational complexity, they required certain simplifying assumptions. One of the most powerful assumptions is that only native contacts play a significant role in determining the folding mechanism; this statement is motivated by the so-called “principle of minimal frustration” (1); that is, the folding energy landscape of proteins has been designed by evolution such that the energy is correlated as far as possible with the nativeness of the structure, with local traps due to misfolding or nonnative interactions being reduced or eliminated [leading to a relatively smooth, “funneled” energy landscape (8)]. This assumption is also the basis for using G o-type models (9), where only residue–residue contacts present in the native state are energetically favorable by design, leading to a perfect correlation of energy and nativeness (10). These models have successfully predicted a wide range of properties (11), including experimental ϕ-values probing the protein folding mechanism (12), hydrogen exchange protection factors (13), the mechanism of protein–protein association (14, 15), domain swapping (16), and misfolding in multidomain proteins (17), to list a few. The same assumption underlies the use of even further simplified Ising-like theoretical models for protein folding based on the native structure (18–20) and motivates the use of low-dimensional models to interpret folding experiments (21) and simulations (22). This native-centric assumption even underlies the interpretation of experimental folding ϕ-values in terms of native structure formation (23). That is not to say that nonnative contacts play no role in folding; indeed, in a free energy landscape description, making and breaking nonnative contacts may slow folding by increasing the free energy barrier height or by decreasing the effective diffusion coefficient on the energy landscape (24–27). It has even been shown that in some circumstances, weak nonnative inter17874–17879 | PNAS | October 29, 2013 | vol. 110 | no. 44

actions may accelerate folding by lowering the free energy barrier (22, 28–30). Despite its many successes in explaining experimental results, the assumption that nonnative contacts do not play any specific role in the folding mechanism of natural proteins is hard to test explicitly and has indeed often been questioned (31–33). For example, nonnative contacts have been implicated in the folding of the protein Im7 (34), based on the experimental observation of “nonclassical” ϕ-values, and several groups have postulated the importance of specific nonnative contacts for the folding of knotted proteins (35, 36). Moreover, although a funneled landscape will certainly facilitate folding, alternative scenarios in which specific nonnative contacts help to guide folding cannot be ruled out a priori. Molecular dynamics (MD) simulations allow us to examine folding mechanisms in atomistic detail (37), particularly now that simulation techniques have advanced to sample the millisecond time scale (38–41). The recent publications by Shaw and coworkers (40–44) of atomistic folding trajectories for more than 10 relatively small proteins provide us with a valuable resource. These MD simulations used a transferable force field in which every atom was represented and the protein was surrounded by an environment of explicitly represented water molecules. Therefore, these systems represent the ultimate level of detail and accuracy that is currently computationally accessible, albeit only on the specialized computers developed by the Shaw group (45). Because the trajectories were obtained from long equilibrium simulations at constant temperature, there is no question of mechanism bias due to the sampling method within the given context of the empirical energy function being used. These successful equilibrium folding simulations have many implications, including the possibility of testing methods for enhancing sampling Significance Understanding the mechanism by which proteins fold to their native structure is a central problem in protein science. Clearly, interactions between residues that are in contact in the folded state (native interactions) are likely to be important for folding, but, in principle, nonnative interactions may play a role. Here, we use recently published microsecond to millisecond all-atom molecular dynamics simulations of proteins folding and unfolding to show, remarkably, that nonnative contacts are irrelevant to the mechanism of folding in most cases. This statistical analysis would be very difficult to perform by experiment. Although this is a limited set of proteins, the results nonetheless strongly support coarse-grained theoretical and simulation models of folding in which only native contacts are energetically favorable. Author contributions: R.B.B., G.H., and W.A.E. designed research; R.B.B. performed research; R.B.B. analyzed data; and R.B.B., G.H., and W.A.E. wrote the paper. The authors declare no conflict of interest. This article is a PNAS Direct Submission. 1

To whom correspondence should be addressed. E-mail: [email protected].

2

Department of Theoretical Biophysics, Max Planck Institute of Biophysics, 60438 Frankfurt am Main, Germany.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. 1073/pnas.1311599110/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1311599110

of reaction coordinate for G o-type models in which the energy is strongly correlated with Q; indeed, it often is a good reaction coordinate in those cases (a precise definition of Q is provided in SI Appendix). To test whether it is also a good coordinate for atomistic simulations, we use a Bayesian criterion based on TP theory (55), which has previously been used, e.g., to assess reaction coordinates for folding (50), coupled folding and binding (15, 56), conformational transitions (57), and water transport in nanotubes (50). To illustrate the formalism of testing Q as a reaction coordinate for folding, we show in Fig. 1A long equilibrium folding simulations of a G o model of protein G (22, 58) (details are provided in SI Appendix) and also a part of the all-atom folding trajectories of the protein G variant studied by Shaw and coworkers (41). We start by defining TPs as regions of the trajectories that cross directly from the unfolded well to the folded well, specifically crossing between the Q values corresponding to the two vertical blue lines in Fig. 1A (two example TPs are highlighted). Transition states are then defined as the configurations x that are most predictive of being on TPs [i.e., for which the conditional probability P(TPjx) is highest, with a maximum possible value of 1/2 for diffusive dynamics] (55). An ideal reaction coordinate Q could be considered as one that localizes only transition states at a single value of Q(x), that is, for which the projected P(TPjQ(x)) is as close as possible to 1/2 (i.e., avoiding overlap of transition states with other Best et al.

Time [µs]

100

1

Peq(Q)

TP

A

50

TP

0

6

6

4

4

2

2

0

0

3

3

2

2

1

1

0 0.5 0.4 0.3 0.2 0.1 0 0

0.2 0.4 0.6 0.8

Q

0 0.5 0.4 0.3 0.2 0.1 0 0 1

B

C

D 0.2 0.4 0.6 0.8

1

Q

Fig. 1. Bayesian criterion for quantifying the quality of Q as a reaction  model of protein G (Left) and an all-atom model coordinate. Results for a Go of protein G variant (Right) are shown. (A) Projection of trajectories onto fraction of native contacts, Q. (B) Potential of mean force F(Q) (units of kBT where kB is Boltzmann’s constant and T is the absolute temperature; solid curve) and corresponding equilibrium probability density Peq(Q) (broken red curve). (C) Probability density of Q on TPs, P(QjTP). (D) Conditional probability P(TPjQ) of being on a TP for a given value of Q (solid curve); the broken red line shows the theoretical maximum of P(TPjQ) of 0.5. In each plot, the vertical blue lines show the boundaries used to define TPs and the vertical green line shows the Q with the highest value of P(TPjQ).

nonreactive states). The latter quantity can be straightforwardly computed from equilibrium trajectories using Bayes theorem via PðTPjQÞ = PðQjTPÞPðTPÞ=Peq ðQÞ , where P(QjTP) is the probability density of Q on TPs (Fig. 1C), P(TP) is the fraction of time spent on TPs, and Peq(Q) is the equilibrium distribution of Q (Fig. 1B). As shown in Fig. 1D, P(TPjQ) attains a value of ∼0.46, close to the theoretical maximum of 0.5, at a Q value of ∼0.5 for the G o model. Remarkably, however, P(TPjQ) also reaches a similar value of ∼0.45 at a Q value of ∼0.67 in the all-atom force field model, showing that Q is a coordinate of similar quality in that case also. Note that the positions of the maxima on Q may differ due to the different resolutions of the models and consequent differences in the definitions of Q. Will Q also be a good coordinate in the other all-atom simulations? We have applied the same test to all the proteins studied by Shaw and coworkers (40, 41, 44), which (i) have a folding free energy barrier when projected onto Q and (ii) have at least 20 residues. This list comprises BBA (1FME), villin subdomain HP35 (1YRF), NTL9, Trp cage (2JOF), α3D (A3D), the designed protein G already mentioned (NuG2), ubiquitin, a fast-folding WW domain mutant (GTT), and λ-repressor (lambda). Note that we are analyzing the wild type villin (44) rather than the fast folding mutant (although comparable results are obtained for the latter). For each of these proteins, we have identified TPs using a similar scheme to that for protein G; a full summary of simulation statistics and the bounds used to define unfolded and folded states is given in SI Appendix, Table S1. In SI Appendix, Fig. S1, we present the corresponding free energy surfaces F(Q) and Bayesian criterion P(TPjQ). Remarkably, we find that in all these cases, spanning a range from 20 to 80 residues and with folding times from the microsecond to the millisecond time scales, the function P(TPjQ) is sharply peaked with a maximum near 1/2. In SI Appendix, Fig. S2, we show that consistent results are obtained PNAS | October 29, 2013 | vol. 110 | no. 44 | 17875

CHEMISTRY

Fraction of Native Contacts Is a Good Folding Coordinate for Atomistic Simulations. The fraction of native contacts, Q, is a natural choice

150

BIOPHYSICS AND COMPUTATIONAL BIOLOGY

Results and Discussion

2

0

P(TP|Q) P(Q|TP)

of folding events (46, 47). For our present purpose, they may be used to ascertain whether nonnative contacts play any specific role in determining folding mechanism, provided an appropriate metric can be devised. In the accompanying paper, the distribution of folding mechanisms predicted by the molecular dynamics simulations for the villin headpiece subdomain are compared with those of an Ising-like model that makes the G o assumption (48). The outline of this paper is as follows. We first examine how well the fraction of native amino acid contacts, Q (49), performs as a reaction coordinate for folding. Defined as the fraction of heavy atoms proximal in the native structure that are in close spatial proximity at some instant in time, Q is commonly used in combination with G o models for folding, where it is a natural choice that is usually very successful (22, 25, 50, 51). Although it is not at all obvious that Q would be a good reaction coordinate for all-atom simulations in which nonnative interactions are fully accounted for, we find that for all the proteins with an energy barrier and having at least 20 residues, Q is a remarkably good coordinate. Because Q is a global measure and does not (as we show) preclude nonnative contacts from also playing a relevant role, we have devised additional measures that can be used to quantify the importance of every pair of residue–residue contacts on folding transition paths (TPs): the rare reactive portions of the trajectories crossing between the unfolded and folded states that contain the mechanisms of folding. The first of these measures is based on the lifetimes of contacts on TPs relative to the unfolded state, whereas the second is a Bayesian criterion for how much a given contact determines the likelihood of being on a TP. Both criteria establish that native contacts do indeed play a dominant role on folding TPs. Most surprisingly, we find that nonnative contacts appear to be unimportant, in almost all cases, for determining folding mechanism. For most of the proteins, neither criterion reveals any significant role for nonnative interactions. However, for one designed protein, α3D, we find that nonnative contacts are an integral part of the folding mechanism. This latter finding is consistent with the idea that designed protein sequences are selected to yield a stable folded structure, and not also to be minimally frustrated, as in naturally occurring proteins (52–54).

Contact Lifetime Test. We first consider the relative lifetimes of contacts on TPs and in the unfolded state. The choice of this quantity is motivated by the expectation that contacts contributing to the folding mechanism would, on average, have longer lifetimes during the TPs. We define contact formation to occur when any two heavy atoms from two different residues that are not yet forming a contact come within 3.5 Å of each other, and contact breaking to occur when the distance between the nearest two heavy atoms of a residue pair forming a contact becomes greater than 8 Å. The reason for this assignment, based on TPs for contact formation, is to suppress any apparent reduction of the contact formation/breaking times that would be obtained due to recrossing when using a single dividing surface (63). In SI Appendix, Fig. S4, we show that the results are robust to the choice of these threshold distances. To compare the average contact lifetimes on TPs, tTP, and in the unfolded state, tU, we plot in Fig. 2 their log-ratio, log10[tTP/tU], for each of the nine proteins under consideration. By comparing log10[tTP/tU] (Fig. 2, above diagonal) with the native contact map (Fig. 2, below diagonal), it is clear that positive values of log10[tTP/tU] tend to be associated with native contacts. It is also clear, as might be expected, that not all native contacts are equally important by this metric, as discussed further below. To quantify the relation between log10[tTP/tU] and native contacts better, we have plotted in Fig. 3 A and B the cumulative distributions of log10[tTP/tU] for native and nonnative contacts for villin and ubiquitin as representative of the smallest and fastest and the largest and slowest extremes of the protein dataset (distributions for the other proteins are given in SI Appendix, Fig. S5). The distribution of log10[tTP/tU] for native contacts is substantially shifted toward positive values. By contrast, the distribution for nonnative contacts is centered on zero and is approximately symmetrical. We can show that the width of this distribution mostly comes from statistical errors and does not reflect any special role of nonnative contacts by computing a control distribution of log10[tU*/tU] for nonnative contacts formed during a synthetic set of “TPs” comprising short segments chosen, without replacement, from the unfolded portion of the trajectory to have the same length distribution as the real TPs. The cumulative distribution for this reference dataset has been shaded in Fig. 3 such that any distribution skewed toward longer log10[tTP/tU] will appear in the unshaded region of the plot. For all the proteins, there is a strong similarity of the control distribution for log10[tU*/ 17876 | www.pnas.org/cgi/doi/10.1073/pnas.1311599110

]

0.5

ax

0

1

1.5

2 1

0.5

m

g

/N

10

[t

TP

/t

U

0

ij

lo

N

1FME

log10[t TP /t U] Nij / N max

1YRF

NTL9

30 30

20 20 10

0

20

10

0

10

10

20

30

0

10

0 0

A3D

30

40

Ubiquitin

GTT 30 20 10 0 0

20

50

40 30 20 10 0 0 10 20 30 40 50 60 70 20

70 60 50 40 30 20 10 0 0 10 20 30 40 50 60 70

10

NuG2

70 60 50

10

0

0 0

20

10

2JOF

20

Residue j

for the distribution of P(TPjQ) for a range of reasonable choices for the boundaries of the folded and unfolded states. In some cases, P(TPjQ) exceeds 1/2, although with large statistical uncertainties in the estimate arising from a small number of TPs in those instances. Thus, in all these cases, it appears that formation of native contacts alone, as captured by Q, is sufficient to locate transition states. The finding that Q is a good coordinate for off-lattice G o model simulations and for atomistic folding simulations is in contrast to results showing that Q is often a poor coordinate for lattice model simulations of folding (59). Because we have not yet explored this discrepancy, we can only speculate that it may relate to topological frustration (12, 60–62) due to the limited local move set typically used in lattice folding simulations, such that structures with relatively high Q may still require many moves to reach the folded state. However, the quality of Q as a reaction coordinate here does not indicate which native contacts are indicative of being on TPs and leaves open the possibility that nonnative contacts may also be indicative (but the total number of nonnative contacts, shown in SI Appendix, Fig. S3, is not a good folding coordinate). To determine the relevance of individual native and nonnative contacts, a contact-specific metric is required. We have devised two such metrics, as described below.

10

20

30

30 20 10 0

0 10 20 30 40

80 70 60 50 40 30 20 10 0

50

lambda

0 10 20 30 40 50 60 70 80

Residue i Fig. 2. Lifetime test. For each contact (i, j) between residues i and j, the logratio of its lifetime on TPs, tTP, to the lifetime in the unfolded state, tU (i.e., log[tTP/tU]) is plotted for j > i. The native contact map is shown for j < i, colored by the number of contacts between residues. A3D, α3D; GTT, a fastfolding WW domain mutant; lambda, λ-repressor.

tU] to that of log10[tTP/tU] for nonnative contacts, indicating that statistical noise largely explains the width of the distribution. The distribution for nonnative contacts does have a slight tail within this unshaded region, suggesting that some nonnative contacts may be significant. Because such a tail may be visually suppressed in a cumulative distribution due to the large number of other nonnative contacts, in Fig. 3 C and D, we also plot the unnormalized tail distributions for villin and ubiquitin contact lifetimes, that is, the absolute number of each type of contact for which log10(tTP/tU) is greater than a certain value. Presented this way, the number of nonnative contacts that are long-lived on TPs appears more comparable to that for native contacts. Another possibility, however, is that they are merely close to long-lived native contacts. To test this hypothesis, we have removed from the set of nonnative contacts any contact between a pair of residues (k, l) with a Manhattan distance to a native contact between a pair of residues (i, j) of 4 or less, that is, ji − kj + jj − lj ≤ 4, where i, j, k, and l are the positions in the amino acid sequence. This almost completely eliminates the tail at large log10[tTP/tU] from the resulting distribution of nonnative contact lifetimes. The results for the other proteins (shown in SI Appendix, Figs. S5 and S6) show a similar pattern, with the exception of α3D, as discussed further below. The clear separation in the distributions of native and nonnative contact log-lifetime ratios suggests that, for the proteins studied, nonnative contacts play a negligible role in determining folding mechanism. Bayesian Contact Test. One general caveat to the above analysis is that a nonnative contact might have the same average lifetime during a TP and in the unfolded state yet be formed more frequently on TPs than in the unfolded state, suggesting that its formation is important for the mechanism. A second consistency Best et al.

2

0.8 0.6 0.4

E

0.2 0

G

150 100 50 0 0

0.1

0.2

0.3

P(TP|qij)nn

0 -2 1

-1

0

1

2

log10[tTP/tU]

0.8 0.6 0.4

F

0.2 0

H

60 40 20

0 0 0.05 0.1 0.15 0.2 0.25

P(TP|qij)nn

Fig. 3. Distributions of lifetime ratios and P(TPjqij)nn for native and nonnative contacts. The cumulative distributions of the log-ratio of contact lifetimes on TPs and in the unfolded state are shown for ubiquitin (A) and villin (B) with the distribution (Dist.) for native contacts in black, that for nonnative contacts in red, that for nonnative contacts not close to native contacts (definition is provided in main text) in blue, and that for nonnative contacts from a synthetic reference dataset in which synthetic TPs are selected randomly without replacement from the unfolded state in brown. The distribution for the reference dataset is shaded such that any distribution skewed to longer log10[tTP/tU] will appear in the unshaded region. The corresponding unnormalized tail distributions [i.e., the absolute number of contacts for which log10(tTP/tU) is greater than a given value] are shown for ubiquitin (C) and villin (D), respectively. The cumulative distributions of P(TPjqij)nn for contact pairs (i, j ) are given for ubiquitin (E) and villin (F ) using black for native contacts, red for nonnative contacts, and blue for nonnative contacts not close to native contacts (as defined in the main text). The analogous tail distributions for P(TPjqij)nn are given for ubiquitin (G) and villin (H), respectively. Corresponding plots for the other proteins are given in SI Appendix, Figs. S4–S7.

1FME

where P(qijjTP)nn is the probability that a given contact is formed on TPs (i.e., the fraction of time for each of the TPs in which the contact is present), P(TP)nn is the probability of being on a TP Best et al.

0.2

0.4

1YRF 30

20

Residue j

20

0 0

20

10

10

2JOF

20

30

0

0

A3D

10

Ubiquitin 70 60 50 40 30 20 10 0 0 10 20 30 40 50 60 70

GTT

20 10

10

20

20

30

50 40 30 20 10 0 0

10 20 30 40 50

lambda

30

0 0

10

NuG2

70 60 50 40 30 20 10 0 0 10 20 30 40 50 60 70 20

10

0 0

20

10

10

P(TP|q )nn ij Nij / N max

NTL9

20

10

0.6 1

0.5

30

0 0

test, which also addresses this caveat, is to identify those contacts that are best able to distinguish TPs from the unfolded state. To do this, only the nonnative segments of the trajectory were considered, that is, excluding the folded state but including TPs. We asked, given that a contact qij is formed between residues i and j (as defined above), what is the probability that the protein is on a TP [i.e., P(TPjqij)nn]? The subscript “nn,” meaning “nonnative,” indicates that the segments of the trajectory corresponding to the folded state are excluded from this analysis, because native contacts will almost always be formed there. Using Bayes theorem, this conditional probability can be expressed in terms of quantities that can be straightforwardly computed from the equilibrium simulations,   P qij jTP nn P TP nn   P TPjqij nn = ; [1] P qij nn

0 0

30

80 70 60 50 40 30 20 10 0 0 10 20 30 40 50 60 70 80

Residue i Fig. 4. Bayesian measure for contacts on TPs. For each protein, the quantity P(TPjqij)nn is plotted for each pair of residues (i, j) for j > i, whereas the native contact map, colored by number of native contacts, is shown for j < i.

PNAS | October 29, 2013 | vol. 110 | no. 44 | 17877

CHEMISTRY

1

BIOPHYSICS AND COMPUTATIONAL BIOLOGY

0

log10[tTP/tU]

n

1

-1

20

)n

0 -2

40

ij

50

D

60

ax

100

0

m

C

150

B

0.2

TP / N |q

0

0.6 0.4

P(

A

0.2

1 0.8

ij

Tail Distribution Cumulative Dist.

0.6 0.4

Tail Distribution Cumulative Dist.

Tail Distribution Cumulative Dist.

Tail Distribution Cumulative Dist.

1 0.8

(the fraction of the total nonnative portion of trajectory spent on TPs), and P(qij)nn is the probability (fraction of time) of a given contact in the total nonnative portion of the trajectory. The definitions of TPs are the same as used for the other calculations above. If the formation of a certain nonnative contact qij were obligatory on TPs [large P(qijjTP)nn] and that contact formed rarely in the unfolded state, then P(TPjqij)nn would be large. Note that, unlike P(TPjQ), P(TPjqij)nn is not limited to values less than 0.5, even for diffusive dynamics. This analysis of P(qijjTP)nn (Fig. 4) results in a similar picture to that obtained from the lifetimes (Fig. 2), but with some quantitative differences. Regions of high P(TPjqij)nn are typically located in the vicinity of native contacts, with long-range contacts showing up prominently as being indicative of TPs. Although the cumulative distribution of P(TPjqij)nn for nonnative contacts for ubiquitin and villin indicates that most values are near zero (Fig. 3 E and F; data for other proteins are given in SI Appendix, Fig. S7), the tail distributions again indicate a number of nonnative contacts with higher P(TPjqij)nn (Fig. 4 G and H; data for other proteins are given in SI Appendix, Fig. S8). However, as was the case for the lifetime analysis, this tail arises primarily from nonnative contacts that happen to be near native contacts. Indeed, by excluding nonnative contacts that are proximal to native contacts, as defined above, the tail at high P(TPjqij)nn for nonnative contacts is essentially eliminated. By contrast, the distribution for native contacts has a significant tail at higher values, arising from individual amino acid contacts that are central to the folding mechanism. One might ask whether the observation that nonnative contacts proximal to native contacts are more long-lived and more indicative of being on a TP is due merely to geometric constraints, as we have thus far implied, or whether they have been specifically optimized to guide the correct formation of native contacts. We can discriminate between these possibilities by applying our two

N

Non-native contacts Non-native contacts not near native contacts

Non-native Reference Native Contacts

17878 | www.pnas.org/cgi/doi/10.1073/pnas.1311599110

A

B

70

60

50

50

40 30 20 10 0

C

70

60

Residue j

Residue j

tests to the folding of the G o model of protein G discussed earlier, in which all nonnative interactions are simple short-ranged repulsions. The results, presented in SI Appendix, Fig. S9, show that, even in this case, the proximal nonnative contacts are indicative of TPs and more long-lived. Therefore, geometric constraints are a sufficient explanation of our observations on proximal nonnative contacts in the all-atom case. Of course, not all native contacts are equally important, as may be expected from differences in relative strength, or in sequence separation. An interesting observation is that those contacts with high P(TPjqij)nn tend to be anticorrelated with those that have a high frequency of formation in the unfolded state Punf(qij). That this should be so follows from the definition of P(TPjqij)nn or by considering that formation of these persistent contacts would, by themselves, not indicate an increased likelihood of folding. A good example is the N-terminal hairpin of the protein G variant NuG2, which is largely formed in the unfolded state and plays little role in determining TPs. However, this conclusion suggests that although interactions observed in the unfolded state may affect the folding kinetics via their influence on unfolded state stability (64), according to our analysis, those interactions would not be expected to be important on TPs (i.e., in deciding the likely outcome of a possible folding transition). We have mainly considered the possibility that nonnative contacts may influence folding mechanism through their formation. An alternative possibility is that the breaking or absence of specific nonnative contacts may play a role in folding. To investigate this, we have computed the complementary quantity PðTP qij Þnn , the probability of being on a TP, given that the contact between residues i and j is not formed. This is closely related to the former quantities, being given by PðTP qij Þnn = ð1 − Pðqij jTPÞnn ÞPðTPÞ= ð1 − Pðqij Þnn Þ. The analysis, shown in SI Appendix, Fig. S10, gives only very low values for PðTP qij Þnn ; indicating that the absence of contacts is not predictive of TPs, and therefore that breaking of specific nonnative contacts is also not significant mechanistically. However, in the designed protein α3D (65, 66), nonnative interactions appear to play a significant role, even though they are not proximal to the native contacts. For this protein a distinct pattern of nonnative contacts between helices exists (Fig. 5). Because these are essentially shifts of helix register, a possible explanation may lie in the repeating pattern within the sequence of each helix in this designed protein. This pattern may allow these helices to associate with a nonnative register shift. Indeed, a closer examination of the α3D TPs clearly shows this to be the case. In Fig. 5B, we show a superposition of contact maps for a sequence of structures extracted from the same TP (Fig. 5C). As is evident from comparing Fig. 5 A and B, the contacts present in these contact maps, many of them nonnative, are the same as those with nonzero P(TPjqij)nn. The structures giving rise to these nonnative contact maps are indeed those with “offregister” helix packing. A similar pattern observed for the λ-repressor suggests that this may be a more generic feature of helical proteins and may be related to the increased internal friction observed in certain spectrin repeats (67, 68). The additional slight frustration for α3D may also be the consequence of optimizing a structure for stability, a phenomenon observed in a more dramatic fashion in another designed protein, Top7 (54). As in protein–protein binding (69), designing not only for overall stability but explicitly against nonnative interactions should result in more facile self-assembly processes. One might ask if the set of proteins chosen, including many fast folders, may be biased toward proteins with ideally funneled landscapes, and therefore the potential influence of nonnative interactions may be suppressed. For example, if folding and collapse were nearly simultaneous in the chosen sequences, there may be little opportunity for frustration due to nonnative contacts

40 30 20 10

0

10 20

30

40

50

Residue i

60

70

0

0

10 20

30

40

50

60

70

Residue i

Fig. 5. Nonnative contact formation on the TPs for α3D due to off-register helix packing. (A) Map of P(TPjqij)nn (above diagonal) and native contact map (below diagonal), with the color scale as in Fig. 4. (B) Contact maps of snapshots chosen from one TP. The contact map of each snapshot is assigned a different color. (C) Structures corresponding to the snapshots; boxes give the color key used in B. The backbone structure is shown in cartoon form, colored on a blue, green, and red scale. Asp and Glu side chains are colored red, and Arg and Lys side chains are colored blue.

(70, 71). In fact, we find that in the MD simulations these chains are almost always quite compact, with the average radius of gyration in the unfolded and transition states being no more than ∼10% larger than that of the folded state for almost all the proteins (SI Appendix, Figs. S11 and S12). Whether or not such compact denatured states are in accord with experiment, it is clear that they should provide ample possibilities for the formation of nonnative contacts. Thus, the lack of a role for nonnative contacts in folding mechanism observed here is not just a consequence of relatively expanded nonnative states. Conclusions Our findings generally support the view that native interactions are the key to determining folding mechanism for most small proteins, in agreement with the expectations for a landscape designed such that energy is correlated with native structure formation. In most cases, nonnative interactions (aside from those immediately adjacent to native contacts) play no role in determining mechanism. This overall consistency between a native-centric mechanism and the full MD simulations with a generic, transferable energy function mirrors findings based on lattice models for protein folding by Gin et al. (72), who concluded that although nonnative contacts influence stability and rates, they play a limited role in determining the mechanism. The many experimental and computational studies that have invoked an important role for nonnative contacts are not necessarily in contradiction with our results. Nonnative contacts can influence folding in many ways, reducing the folding rate via local energetic traps or by modulating the free energy landscape. Many studies have inferred such a role for nonnative interactions. In fact, by the nature of our analysis, we cannot comment directly on these effects. However, our analysis of the MD folding simulations for a range of proteins strongly suggests that specific nonnative interactions are largely irrelevant for determining the mechanism of their folding. ACKNOWLEDGMENTS. We thank D. E. Shaw Research for making this work possible by giving us access to the MD trajectories; Prof. Eugene Shakhnovich, Dr. Attila Szabo, and Dr. Eric Henry for many insightful discussions; and Prof. Kresten Lindorff-Larson and Dr. Stefano Piana for their comments on the manuscript. This work was supported by the Intramural Research Program of the National Institute of Diabetes and Digestive and Kidney Diseases of the National Institutes of Health.

Best et al.

Best et al.

PNAS | October 29, 2013 | vol. 110 | no. 44 | 17879

CHEMISTRY

37. Snow CD, Nguyen H, Pande VS, Gruebele M (2002) Absolute comparison of simulated and experimental protein-folding dynamics. Nature 420(6911):102–106. 38. Voelz VA, Bowman GR, Beauchamp K, Pande VS (2010) Molecular simulation of ab initio protein folding for a millisecond folder NTL9(1-39). J Am Chem Soc 132(5): 1526–1528. 39. Bowman GR, Voelz VA, Pande VS (2011) Atomistic folding simulations of the fivehelix bundle protein λ(6−85). J Am Chem Soc 133(4):664–667. 40. Piana S, Lindorff-Larsen K, Shaw DE (2013) Atomic-level description of ubiquitin folding. Proc Natl Acad Sci USA 110(15):5915–5920. 41. Lindorff-Larsen K, Piana S, Dror RO, Shaw DE (2011) How fast-folding proteins fold. Science 334(6055):517–520. 42. Piana S, Lindorff-Larsen K, Shaw DE (2011) How robust are protein folding simulations with respect to force field parameterization? Biophys J 100(9):L47–L49. 43. Shaw DE, et al. (2010) Atomic-level characterization of the structural dynamics of proteins. Science 330(6002):341–346. 44. Piana S, Lindorff-Larsen K, Shaw DE (2012) Protein folding kinetics and thermodynamics from atomistic simulation. Proc Natl Acad Sci USA 109(44):17845–17850. 45. Shaw DE, et al. (2007) Anton, a special-purpose machine for molecular dynamics simulation. Isca’07: 34th Annual International Symposium on Computer Architecture, Conference Proceedings (Association of Computing Machinery, New York), pp 1–12. 46. Best RB (2013) A “slow” protein folds quickly in the end. Proc Natl Acad Sci USA 110(15):5744–5745. 47. Lane TJ, Shukla D, Beauchamp KA, Pande VS (2013) To milliseconds and beyond: Challenges in the simulation of protein folding. Curr Opin Struct Biol 23(1):58–65. 48. Henry ER, Best RB, Eaton WA (2013) Comparing a simple theoretical model for protein folding with all-atom molecular dynamics simulations. Proc Natl Acad Sci USA 110: 17880–17885. 49. Shakhnovich EI, Farztdinov G, Gutin AM, Karplus M (1991) Protein folding bottlenecks: A lattice Monte Carlo simulation. Phys Rev Lett 67(12):1665–1668. 50. Best RB, Hummer G (2005) Reaction coordinates and rates from transition paths. Proc Natl Acad Sci USA 102(19):6732–6737. 51. Best RB, Hummer G (2006) Diffusive model of protein folding dynamics with Kramers turnover in rate. Phys Rev Lett 96(22):228104. 52. Bryngelson JD, Onuchic JN, Socci ND, Wolynes PG (1995) Funnels, pathways, and the energy landscape of protein folding: A synthesis. Proteins 21(3):167–195. 53. Sadqi M, de Alba E, Pérez-Jiménez R, Sanchez-Ruiz JM, Muñoz V (2009) A designed protein as experimental model of primordial folding. Proc Natl Acad Sci USA 106(11): 4127–4132. 54. Watters AL, et al. (2007) The highly cooperative folding of small naturally occurring proteins is likely the result of natural selection. Cell 128(3):613–624. 55. Hummer G (2004) From transition paths to transition states and rate coefficients. J Chem Phys 120(2):516–523. 56. De Sancho D, Best RB (2012) Modulation of an IDP binding mechanism and rates by helix propensity and non-native interactions: Association of HIF1α with CBP. Mol Biosyst 8(1):256–267. 57. Best RB, Chen Y-G, Hummer G (2005) Slow protein conformational dynamics from multiple experimental structures: The helix/sheet transition of arc repressor. Structure 13(12):1755–1763. 58. Graham TGW, Best RB (2011) Force-induced change in protein unfolding mechanism: Discrete or continuous switch? J Phys Chem B 115(6):1546–1561. 59. Du R, Pande VS, Grosberg AY, Tanaka T, Shakhnovich ES (1998) On the transition coordinate for protein folding. J Chem Phys 108(1):334–350. 60. Shea J-E, Onuchic JN, Brooks CL, 3rd (1999) Exploring the origins of topological frustration: Design of a minimally frustrated model of fragment B of protein A. Proc Natl Acad Sci USA 96(22):12512–12517. 61. Li MS, Klimov DK, Thirumalai D (2002) Dependence of folding rates on protein length. J Phys Chem B 106(33):8302–8305. 62. Thirumalai D, Klimov DK, Woodson SA (1997) Kinetic partitioning mechanism as a unifying theme in the folding of biomolecules. Theor Chem Acc 96:14–22. 63. Buchete N-V, Hummer G (2008) Coarse-grained master equations for peptide folding kinetics from atomistic molecular simulations. J Phys Chem B 112:6057–6069. 64. Brewer SH, et al. (2005) Effect of modulating unfolded state structure on the folding kinetics of the villin headpiece subdomain. Proc Natl Acad Sci USA 102(46): 16662–16667. 65. Walsh STR, Cheng H, Bryson JW, Roder H, DeGrado WF (1999) Solution structure and dynamics of a de novo designed three-helix bundle protein. Proc Natl Acad Sci USA 96(10):5486–5491. 66. Zhu Y, et al. (2003) Ultrafast folding of α3D: A de novo designed three-helix bundle protein. Proc Natl Acad Sci USA 100(26):15486–15491. 67. Wensley BG, et al. (2010) Experimental evidence for a frustrated energy landscape in a three-helix-bundle protein family. Nature 463(7281):685–688. 68. Borgia A, et al. (2012) Localizing internal friction along the reaction coordinate of protein folding by combining ensemble and single-molecule fluorescence spectroscopy. Nat Commun 3:1195. 69. Johnson ME, Hummer G (2011) Nonspecific binding limits the number of proteins in a cell and shapes their interaction networks. Proc Natl Acad Sci USA 108(2):603–608. 70. Socci ND, Onuchic JN (1995) Kinetic and thermodynamic analysis of proteinlike heteropolymers: Monte Carlo histogram technique. J Chem Phys 4732–4744. 71. Camacho CJ, Thirumalai D (1993) Kinetics and thermodynamics of folding in model proteins. Proc Natl Acad Sci USA 90(13):6369–6372. 72. Gin BC, Garrahan JP, Geissler PL (2009) The limited role of nonnative contacts in the folding pathways of a lattice protein. J Mol Biol 392(5):1303–1314.

BIOPHYSICS AND COMPUTATIONAL BIOLOGY

1. Wolynes PG (2005) Recent successes of the energy landscape theory of protein folding and function. Q Rev Biophys 38(4):405–410.  models for protein 2. Hills RD, Jr., Brooks CL, 3rd (2009) Insights from coarse-grained Go folding and dynamics. Int J Mol Sci 10(3):889–905. 3. Thirumalai D, O’Brien EP, Morrison G, Hyeon C (2010) Theoretical perspectives on protein folding. Annu Rev Biophys 39:159–183. 4. Shakhnovich EI (2006) Protein folding thermodynamics and dynamics: Where physics, chemistry, and biology meet. Chem Rev 106(5):1559–1588. 5. Karplus M, Kuriyan J (2005) Molecular dynamics and protein function. Proc Natl Acad Sci USA 102(19):6679–6685. 6. Dill KA, MacCallum JL (2012) The protein-folding problem, 50 years on. Science 338(6110):1042–1046. 7. Chan HS, Zhang Z, Wallin S, Liu Z (2011) Cooperativity, local-nonlocal coupling, and nonnative interactions: Principles of protein folding from coarse-grained models. Annu Rev Phys Chem 62:301–326. 8. Wolynes PG, Onuchic JN, Thirumalai D (1995) Navigating the folding routes. Science 267(5204):1619–1620.  N (1975) Studies on protein folding, unfolding and fluctu9. Taketomi H, Ueda Y, Go ations by computer simulation. I. The effects of specific amino acid sequence represented by specific inter-unit interactions. Int J Peptide Protein Res 7:445–459. 10. Nymeyer H, García AE, Onuchic JN (1998) Folding funnels and frustration in off-lattice minimalist protein landscapes. Proc Natl Acad Sci USA 95(11):5921–5928. 11. Kubelka J, Henry ER, Cellmer T, Hofrichter J, Eaton WA (2008) Chemical, physical, and theoretical kinetics of an ultrafast folding protein. Proc Natl Acad Sci USA 105(48): 18655–18662. 12. Clementi C, Nymeyer H, Onuchic JN (2000) Topological and energetic factors: What determines the structural details of the transition state ensemble and “en-route” intermediates for protein folding? An investigation for small globular proteins. J Mol Biol 298(5):937–953. 13. Craig PO, et al. (2011) Prediction of native-state hydrogen exchange from perfectly funneled energy landscapes. J Am Chem Soc 133(43):17463–17472. 14. Levy Y, Cho SS, Onuchic JN, Wolynes PG (2005) A survey of flexible protein binding mechanisms and their transition states using native topology based energy landscapes. J Mol Biol 346(4):1121–1145. 15. Turjanski AG, Gutkind JS, Best RB, Hummer G (2008) Binding-induced folding of a natively unstructured transcription factor. PLOS Comput Biol 4(4):e1000060. 16. Yang S, et al. (2004) Domain swapping is a consequence of minimal frustration. Proc Natl Acad Sci USA 101(38):13786–13791. 17. Borgia MB, et al. (2011) Single-molecule fluorescence reveals sequence-specific misfolding in multidomain proteins. Nature 474(7353):662–665. 18. Henry ER, Eaton WA (2004) Combinatorial modeling of protein folding kinetics: Free energy profiles and rates. Chem Phys 307:163–185. 19. Alm E, Morozov AV, Kortemme T, Baker D (2002) Simple physical models connect theory and experiment in protein folding kinetics. J Mol Biol 322(2):463–476. 20. Muñoz V, Eaton WA (1999) A simple model for calculating the kinetics of protein folding from three-dimensional structures. Proc Natl Acad Sci USA 96(20):11311–11316. 21. Sabelko J, Ervin J, Gruebele M (1999) Observation of strange kinetics in protein folding. Proc Natl Acad Sci USA 96(11):6031–6036. 22. Best RB, Hummer G (2010) Coordinate-dependent diffusion in protein folding. Proc Natl Acad Sci USA 107(3):1088–1093. 23. Matouschek A, Kellis JT, Jr., Serrano L, Fersht AR (1989) Mapping the transition state and pathway of protein folding by protein engineering. Nature 340(6229):122–126. 24. Chan HS, Dill KA (1994) Transition states and folding dynamics of proteins and heteropolymers. J Chem Phys 100:9238–9257. 25. Socci ND, Onuchic JN, Wolynes PG (1996) Diffusive dynamics of the reaction coordinate for protein folding funnels. J Chem Phys 104(15):5860–5868. 26. Klimov DK, Thirumalai D (2001) Multiple protein folding nuclei and the transition state ensemble in two-state proteins. Proteins 43(4):465–475. 27. Shea J-E, Nochomovitz YD, Guo ZY, Brooks CL (1998) Exploring the space of protein folding Hamiltonians: The balance of forces in a minimalist beta-barrel model. J Chem Phys 109:2895–2903.  model: How a little frus28. Plotkin SS (2001) Speeding protein folding beyond the Go tration sometimes helps. Proteins 45(4):337–345. 29. Clementi C, Plotkin SS (2004) The effects of nonnative interactions on protein folding rates: theory and simulation. Protein Sci 13(7):1750–1766. 30. Camacho CJ, Thirumalai D (1995) Modeling the role of disulfide bonds in protein folding: Entropic barriers and pathways. Proteins 22(1):27–40. 31. Faísca PFN, Nunes A, Travasso RDM, Shakhnovich EI (2010) Non-native interactions play an effective role in protein folding dynamics. Protein Sci 19(11):2196–2209. 32. Schwantes CR, Pande VS (2013) Improvements in Markov state model construction reveal many non-native interactions in the folding of NTL9. J Chem Theory Comput 9(4):2000–2009. 33. Prigozhin MB, et al. (2013) Misplaced helix slows down ultrafast pressure-jump protein folding. Proc Natl Acad Sci USA 110(20):8087–8092. 34. Capaldi AP, Shastry MC, Kleanthous C, Roder H, Radford SE (2001) Ultrarapid mixing experiments reveal that Im7 folds via an on-pathway intermediate. Nat Struct Biol 8(1):68–72. 35. Wallin S, Zeldovich KB, Shakhnovich EI (2007) The folding mechanics of a knotted protein. J Mol Biol 368(3):884–893. 36. a Beccara S, Skrbic T, Covino R, Micheletti C, Faccioli P (2013) Folding pathways of a knotted protein with a realistic atomistic force field. PLOS Comput Biol 9(3): e1003002.

Supporting  Information  For:  Native  Contacts   Determine  Protein  Folding  Mechanisms  in   Atomistic  Simulations      

Robert  B.  Best,  Gerhard  Hummer  and  William  A.  Eaton     Laboratory  of  Chemical  Physics,  National  Institute  of  Diabetes  and   Digestive  and  Kidney  Diseases,  National  Institutes  of  Health,  Bethesda  MD   20892-­‐0520  

                                                                   

1  

Supporting  Text     Coarse-­‐grained  Simulation  Methods     A  coarse-­‐grained  Gō  model  of  protein  G  was  built  from  its  protein  data  bank   structure  2gb1  (1)  using  the  Karanicolas  and  Brooks  Gō  model  (2).  The  protein  is   represented  by  beads  located  at  the  alpha  carbon  positions,  with  fixed  Cα-­‐Cα  “bond”   lengths,  harmonic  angle  (Cα-­‐Cα-­‐Cα)  potentials,  and  a  knowledge-­‐based  Fourier   series  for  the  pseudo-­‐torsional  potential.  Long  Langevin  dynamics  simulations  were   run  at  the  folding  temperature  with  a  friction  coefficient  of  0.1  ps-­‐1  with  Chemistry   at  HARvard  Molecular  Mechanics  (CHARMM)  (3),  with  a  time  step  of  10  fs.     Definition  of  Fraction  of  Native  Contacts     The  fraction  of  native  contacts  is  defined  via  a  list  of  native  contact  pairs  (i,j).  This   list  was  constructed  for  the  Protein  G  Gō  model  as  described  previously  (4).  For  the   all-­‐atom  model,  all  pairs  of  heavy  atoms  i  and  j  belonging  to  residues  θi    and  θj  are  in   contact  provided  that   𝜃! − 𝜃! > 3  and  the  distance  between  i  and  j  is  less  than  4.5   Å.  Then  Q  is  defined  via:   1 1 𝑄(𝑋) =   𝑁 1 + exp  [𝛽(𝑟!" (𝑋) − 𝜆𝑟!"! )] (!,!)

  where  the  sum  runs  over  the  N  pairs  of  native  contacts  (i,j),  rij(X)  is  the  distance   between  i    and  j  in  configuration  X,  𝑟!"!  is  the  distance  between  i    and  j  in  the  native   state,  β  is  a  smoothing  parameter  taken  to  be  5  Å-­‐1  and  the  factor  λ  accounts  for   fluctuations  when  the  contact  is  formed,  taken  to  be  1.2  for  the  Gō  model  and  1.8  for   the  all-­‐atom  model  (the  difference  in  λ  being  due  to  the  Gō  pair  interactions  being   more  short  ranged  relative  to  the  native  contact  distances  than  in  the  all-­‐atom   model).       Definition  of  Residue-­‐Residue  Contacts     For  analysis  of  contact  lifetimes  and  P(TP|qij),  a  residue-­‐residue  contact  𝑞!" ∈ {0,1}   between  a  pair  of  residues  (i,j)  was  defined  to  be  formed  (qij  =  1)  when  the   minimum  distance  between  any  pair  of  heavy  atoms  in  i  and  j  fell  below  3.5  Å,  and   remained  formed  until  the  minimum  distance  between  heavy  atoms  of  i  and  j   exceeded  8.0  Å.  Otherwise  qij=0.                      

2  

      Table  S1.  Summary  of  simulation  statistics  and  folded  and  unfolded  bounds  for   defining  transition  paths.  Proteins  for  which  results  are  presented  in  the  main  text   are  highlighted  in  bold.  Further  simulation  details  and  native  structures  are  listed  in   the  references  given  below.  The  C22*  force  field  refers  to  the  CHARMM22  force   field(5)  with  empirical  backbone  adjustments  (6).  The  ff99SB*-­‐ILDN  force  field   refers  to  the  Amber  ff99SB  force  field  (7)  with  both  empirical  backbone  adjustments   (8)  and  side-­‐chain  adjustments  based  on  energy  surfaces  derived  from  quantum   chemical  calculations  (9).     Protein   Short   Nres   Force   tsim   Qu   Qf   Name   Field   [μs]   From  Lindorff-­‐Larsen  et  al.(10)   BBA   1FME   28   C22*   325   0.05   0.77   Villin  HP35/NLE   2F4K   35   C22*   125   0.28   0.94   Trp-­‐Cage   2JOF   20   C22*   208   0.10   0.95   BBL   2WAV   47   C22*   429   N/A   N/A   α3D     A3D   73   C22*   707   0.19   0.70   Chignolin   CLN025   10   C22*   106   0.03   0.80   WW  domain   GTT   35   C22*   1137   0.05   0.92   λ-­‐repressor   Lambda   80   C22*   643   0.40   0.90   Protein  G   NuG2   56   C22*   1155   0.30   0.92   variant   NTL9   NTL9   39   C22*   2936   0.12   0.96   Homeodomain   UVF   52   C22*   327   N/A   N/A   Protein  B   PRB   47   C22*   104   N/A   N/A   From  Piana  et  al.(11)   Villin  HP35     1YRF   35   ff99SB*-­‐ 398   0.20   0.89   ILDN   From  Piana  et  al.(12)   Ubiquitin   Ubiquitin   76   C22*   7839   0.10   0.95                          

 

3  

          1FME

1YRF

NTL9

8 6 F(Q) 4 2 0 0.6

P(TP|Q)

0.4 0.2 0

0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 1

2JOF

A3D

NuG2

8 6 F(Q) 4 2 0 0.6

P(TP|Q)

0.4 0.2 0

0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 1

Ubiquitin

GTT

lambda

8 6 F(Q) 4 2 0 0.6

P(TP|Q)

0.4 0.2 0

0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 1

Q             Fig.  S1.  Free  energy  surfaces,  F(Q),  and  Bayesian  criterion,  P(TP|Q),  for  the  reaction   coordinate  Q.  For  each  protein,  the  free  energy  surface  is  shown  (units  of  kBT),   above  the  corresponding  P(TP|Q).  The  protein  structure  is  shown  above  the  free   energy  surface.  Broken  red  lines  indicate  the  stable-­‐state  bounds,  which  are  used  to   delimit  transition  paths.  Broken  blue  lines  indicate  the  theoretical  maximum  of   P(TP|Q)  for  diffusive  dynamics.    

 

4  

1FME

1YRF

NTL9

8 6 F(Q) 4 2 0 0.6

P(TP|Q)

0.4 0.2 0

0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 1

2JOF

A3D

NuG2

8 6 F(Q) 4 2 0 0.6

P(TP|Q)

0.4 0.2 0

0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 1

Ubiquitin

GTT

lambda

8 6 F(Q) 4 2 0 0.6

P(TP|Q)

0.4 0.2 0

0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 1

Q

    Fig.  S2.  Robustness  of  the  function  P(TP|Q)  to  different  choices  of  bounds  for  the   folded  and  unfolded  states.  In  each  case,  the  vertical  black  broken  lines  indicate  the   boundaries  used  in  the  main  text,  located  at  the  center  of  the  folded  and  unfolded   free  energy  minima.  The  magenta  broken  lines  correspond  to  a  more  conservative   choice  of  boundaries  in  each  case,  i.e.  even  further  from  the  barrier,  while  the  green   broken  lines  correspond  to  a  less  conservative  choice.  The  P(TP|Q)  resulting  from   each  choice  are  shown  by  solid  curves  with  the  same  color.                  

5  

    Fig.  S3.  Comparison  of  the  fraction  of  native  contacts,  Q(t),  with  the  total  number  of   residue-­‐residue  non-­‐native  contacts  Nnn(t),  for  representative  trajectory  fragments   for  each  protein  considered  (definition  of  non-­‐native  contacts  is  the  same  as  in  all   contact  analyses  in  the  paper).                    

6  

 

P(TP|q ij )nn 0

0.2

0.4

log(t TP/t U) 0.6

0

30

30

20

20

0.5

1

1.5

2

A 10

10 1

1

10

20

30

30

1

1

10

20

30

1

10

20

30

1

10

20

30

1

10

20

30

30

20

20

B 10

Residue j

10 1

1

10

20

30

30

30

20

20

C 10

10 1

1

1

10

20

30

30

1

30

20

20

D 10

10 1

1

10

20

30

1

Residue i       Fig.  S4.  Robustness  of  the  results  for  Villin  HP35  to  changes  in  the  choice  of  contact   definition.  Left  panels  give  the  probability  of  being  on  a  transition  path  given   formation  of  a  contact,  and  right  panels  the  log-­‐ratio  of  contact  lifetimes  on   transition  paths  and  in  the  unfolded  state.  (A)  Definition  used  in  main  text,  rform  =  3.5   Å,  rbreak  =  8  Å;  (B)  rform  =  3.0  Å,  rbreak  =  8  Å;  (C)  rform  =  3.5  Å,  rbreak  =  6  Å;  (D)  rform  =  4.5   Å,  rbreak  =  8  Å.    

7  

      Native Contacts Non-Native Contacts Non-Native Reference Non-Native Contacts (non-proximal to native) 1

Cumulative Distribution

0.8

1

1FME

0.8

1

1YRF

0.6

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

0 -2 1 0.8

-1

0

1

2JOF

0 2 -2 1

-1

0.8

A3D

0

1

0 2 -2 1 0.8

0.6

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

0 -2 1 0.8

-1

0

1

Ubiquitin

0 2 -2 1 0.8

-1

0

1

GTT

0.8

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

-1

0

1

0 2 -2

-1

0

1

0 2 -2

log10[tTP/tU]

-1

0

1

2

1

2

1

2

NuG2

0 2 -2 1

0.6

0 -2

NTL9

0.8

-1

0

lambda

-1

0

    Fig.  S5.  Cumulative  distributions  of  the  log-­‐ratio  of  contact  lifetimes  on  transition   paths  relative  to  the  unfolded  state.  Distributions  of  log10[tTP/tU]  for  native  (black   curve)  and  non-­‐native  (red  curve)  contacts.  The  width  of  the  distribution  for  non-­‐ native  contacts  is  almost  entirely  due  statistical  error,  as  evidenced  by  the  width  of   the  distribution  when  segments  of  the  unfolded  state  are  substituted  for  the   transition  paths  and  log[tU*/tU]  calculated,  i.e.  “control”  contacts  (brown  “reference”     curve).  The  reference  curve  has  been  shaded  so  that  all  distributions  skewed  to   longer  log10[tTP/tU]  will  appear  in  the  unshaded  region.  The  distributions  for  non-­‐ native  contacts  that  are  not  close  to  native  contacts  (see  main  text)  are  given  by  the   broken  blue  lines.    

 

8  

     

Native Contacts Non-Native Contacts Non-Native Reference Non-Native Contacts (non-proximal to native) 80

50 40

1FME

60

30

40

20

20

Tail Distribution

10 0 -2 40 30

1YRF

-1

0

1

2JOF

0 2 -2 160

-1

0

1

A3D

120

120 100 NTL9 80 60 40 20 0 2 -2 -1 0 160 120

20

80

80

10

40

40

0 -2 200 150

-1

0

1

Ubiquitin

0 2 -2 80 60

-1

0

1

GTT

150

40

100

50

20

50

0 -2

-1

0

1

0 2 -2

-1

0

1

0 2 -2

log10[tTP/tU]

2

1

2

1

2

NuG2

0 2 -2 200

100

1

-1

0

lambda

-1

0

    Fig.  S6.  Unnormalized  tail  distributions  corresponding  to  Fig.  S5,  i.e.  the  absolute   number  of  contacts  of  each  type  for  which  log10(tTP/tU)  is  greater  than  the  specified   value.  Legend  for  curves  as  in  Fig.  S5.  

 

9  

Cumulative Distribution

Native Contacts Non-native Contacts Non-native Contacts (not proximal to Native) 1 1 1 0.8 0.8 0.8 0.6 0.6 0.6 0.4 0.4 0.4 NTL9 1FME 1YRF 0.2 0.2 0.2 0 0 0 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5 1 1 1 0.8 0.8 0.8 0.6 0.6 0.6 0.4 0.4 0.4 NuG2 2JOF A3D 0.2 0.2 0.2 0 0 0 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5 1 1 1 0.8 0.8 0.8 0.6 0.6 0.6 0.4 0.4 Ubiquitin 0.4 GTT lambda 0.2 0.2 0.2 0 0 0 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5

p(TP|qij)nn

      Fig.  S7.  Distributions  of  p(TP|qij)nn  for  native  (black)  and  non-­‐native  (red)    contacts   from  MD  simulations  of  Lindorff-­‐Larsen  et  al.(10)  The  broken  blue  curve  is  for  non-­‐ native  contacts  that  are  not  proximal  to  native  contacts  (see  main  text).                                      

10  

Tail Distribution

50 40 30 20 10 0 0 40

Native Contacts Non-native Contacts Non-native Contacts (not proximal to Native) 80 120 100 60 80 40 60 40 NTL9 1FME 1YRF 20 20 0 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5 160 160

30

120

20 10

120

80

2JOF

40

80

A3D

40

NuG2

0 0 0.1 0.2 0.3 0.4 0.5 200

0 0 0.1 0.2 0.3 0.4 0.5 80

0 0 0.1 0.2 0.3 0.4 0.5 200

150

60

150

100 50 0

Ubiquitin 0 0.1 0.2 0.3 0.4 0.5

40 20 0

100

GTT 0 0.1 0.2 0.3 0.4 0.5

p(TP|qij)nn

50 0

lambda 0 0.1 0.2 0.3 0.4 0.5

      Fig.  S8.  Unnormalized  tail  distributions  corresponding  to  Fig.  S7,  i.e.  the  absolute   number  of  contacts  of  each  type  for  which  p(TP|qij)nn  is  greater  than  the  specified   value.  Legend  for  curves  as  in  Fig.  S7.            

                   

11  

A

3

40

2

30 1

20

P(TP|q )

ij nn

1.0

50 40 Residue j

50

Residue j

B

log[t TP/t U]

30

0.5

20 10

10 0

0 0

10

20 30 40 Residue i

0

50

0.0 0

10

20 30 40 Residue i

50

0.8 0.6 0.4 0.2 0 -1 100

C -0.5

0

0.5

1

1.5

2

1 0.8 0.6 0.4 0.2 0

D 0

0.2

0.4

0.6

0.8

1

100

E

80 60 40 20 0 -1

Cumulative Distribution

1

Tail Distribution

Tail Distribution

Cumulative Distribution

Non-Native Reference Native Contacts Non-Native Contacts Non-native Contacts (non-proximal to native)

-0.5

0

0.5

1

1.5

2

F

80 60 40 20 0

0

0.2

0.4

0.6

p(TP|qij)nn

log[tTP/tU]

0.8

1

 

  Fig.  S9.  Tests  for  relevance  of  native  and  non-­‐native  contacts,  applied  to  the  folding   of  the  Gō  model  of  protein  G  described  in  the  text.  (A)  Lifetime  test:  log-­‐ratio  of   contact  lifetimes  on  transition  paths  to  the  corresponding  lifetimes  in  the  unfolded   state  are  shown  for  j>i,  while  the  native  contact  map  (here:  all  residue  pairs  with   favorable  interactions  in  the  model)  is  shown  for  ji,  and  the  contact  map  for  j

Native contacts determine protein folding mechanisms in atomistic simulations.

The recent availability of long equilibrium simulations of protein folding in atomistic detail for more than 10 proteins allows us to identify the key...
5MB Sizes 0 Downloads 0 Views