FOR THE RECORD

A comprehensive database of verified experimental data on protein folding kinetics

Amy S. Wagaman,1 Aaron Coburn,2 Itai Brand-Thomas,3 Barnali Dash,4 and Sheila S. Jaswal5*

1 Department of Mathematics and Statistics, Amherst College, Amherst, Massachusetts
2 Department of Information Technology, Amherst College, Amherst, Massachusetts
3 Department of Biology, Amherst College, Amherst, Massachusetts
4 Department of Mathematics and Statistics, Mt. Holyoke College, South Hadley, Massachusetts
5 Department of Chemistry, Amherst College, Amherst, Massachusetts

Received 24 June 2014; Accepted 15 September 2014
DOI: 10.1002/pro.2551
Published online 17 September 2014

Abstract: Insights into protein folding rely increasingly on the synergy between experimental and theoretical approaches. Developing successful computational models requires access to experimental data of sufficient quantity and high quality. We compiled folding rate constants for what initially appeared to be 184 proteins from 15 published collections/web databases. To generate the highest confidence in the dataset, we verified the reported lnkf value and exact experimental construct and conditions from the original experimental report(s). The resulting comprehensive database of 126 verified entries, ACPro, will serve as a freely accessible resource (https://www.ats. for the protein folding community to enable confident testing of predictive models. In addition, we provide a streamlined submission form for researchers to add new folding kinetics results, requiring specification of all the relevant experimental information according to the standards proposed in 2005 by the protein folding consortium organized by Plaxco. As the number and diversity of proteins whose folding kinetics are studied expands, our curated database will enable efficient and confident incorporation of new experimental results into a standardized collection. This database will support a more robust symbiosis between experiment and theory, leading ultimately to more rapid and accurate insights into protein folding, stability, and dynamics.

Keywords: protein folding; folding kinetics; database; reporting standards

Additional Supporting Information may be found in the online version of this article.
Grant sponsor: NSF (UBM-Institutional-Collaborative: The Four-College Biomath Consortium); Grant number: 1129152.
*Correspondence to: Sheila Jaswal, Department of Chemistry, Amherst College, P.O. Box 5000, Amherst, MA 01002. E-mail: [email protected]

PROTEIN SCIENCE 2014 VOL 23:1808–1812
© 2014 The Protein Society. Published by Wiley-Blackwell.

Introduction

Predicting protein folding kinetics is important for understanding the basis of protein structure and energetics, and has applications to diseases linked to protein misfolding, such as cystic fibrosis, type II diabetes, Alzheimer’s disease, and others.1 High-quality experimental data are critical for building accurate models to predict and rationalize the behavior of any system, including protein folding. Acquiring access to a comprehensive, verified dataset is one of the biggest hurdles faced by every researcher modeling a process. In the protein folding community, the experimental reporting methods and formats vary greatly with the research group and the research questions under investigation. This variation makes it difficult to collate data from published research. Recognizing this, in 2005, a folding consortium organized by Plaxco proposed consensus experimental conditions and a standardized reporting format.2 They initiated a pioneering collection of folding and unfolding kinetics data for a set of 30 two-state folding proteins (hereafter referred to as two-state proteins) by compiling existing data and re-measuring the kinetics under their consensus conditions for many of the proteins.

While the dataset from Maxwell et al.2 is a gold standard, it is limited to those 30 two-state proteins. In an effort to include multi-state folding proteins (multi-state proteins) and take advantage of folding kinetics studies on additional two-state proteins, many research groups have created their own datasets by aggregating additional folding kinetics data from reports that may not conform to the consensus conditions or reporting standards.3–16 In many cases, extracting the complete information relevant to the experiment (an extrapolated lnkf value, experimental conditions, associated Protein Data Bank (PDB) code, and the details of the actual construct used for the folding studies) is challenging, if not impossible. Subsequent expansion of collections through the addition of newly published experimental results has propagated existing errors and has often introduced new inaccuracies due to the same data extraction challenges. The consequences of the unstandardized reporting in the protein folding community are apparent from numerous discrepancies among the different databases in the reported rate constants and associated PDB files.
Recognizing these issues, we compiled folding rate constants from 15 published collections/online databases that have varying degrees of overlap in included proteins.2–16 We resolved discrepancies by verifying the experimental conditions, rate constants, and construct information from the original experimental sources. We constructed the Amherst College Protein Folding Kinetics (ACPro) Database containing this verified experimental information along with additional associated structural information, which is freely available to view and download at https://www. To facilitate sharing of high-quality, standardized folding kinetics data across the protein folding community, we also incorporated a standard form for submission of new folding kinetics results that conform to the folding consortium’s consensus conditions and reporting standards.2 We welcome use of the database by other researchers, reports of inconsistencies or bugs, and submissions of new protein folding kinetics results.

Wagaman et al.

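Table I. Current Database Composition

Structural class          Two-state   Multi-state   Total
[class label missing]     22          7             29
[class label missing]     29          10            39
[class label missing]     1           9             10
[class label missing]     23          10            33
[class label missing]     8           7             15
Total                     83          43            126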

The Database Includes Comprehensive and Verified Kinetics Data

To create our database on protein folding kinetics, we combined all entries from the 15 datasets mentioned above, which yielded a total of 184 possible entries.2–16 For each possible entry, we verified the following information using the original experimental source(s): the experimental construct used and its associated PDB code (with fragments specified by start and end residues), whether the protein is a two- or multi-state folding protein, the temperature, pH, buffer, denaturant, folding/detection method used, and the experimental kinetics results (folding rate constant lnkf and associated m-value, if reported). The challenges we encountered have compelled us to renew the folding consortium’s 2005 call for reporting standards,2 and are described below. Through this verification process we corrected inaccuracies and eliminated duplicates, obtaining a current total of 126 proteins to include in the present database. The current composition of the database in terms of structural and folding class is summarized in Table I.
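As an illustration of the kind of cross-collection check described above, the following minimal sketch groups entries from several collections by PDB code and flags codes whose reported lnkf values disagree. It is not the procedure used to build ACPro, which relied on manual verification against the original sources. The function name, the dictionary keys, the tolerance, and the example entries (with invented PDB codes such as "0AAA") are all assumptions made for illustration.

```python
from collections import defaultdict


def find_discrepancies(entries, tolerance=0.05):
    """Return PDB codes whose reported ln(kf) values differ across collections.

    `entries` is an iterable of dicts with the hypothetical keys
    'collection', 'pdb', and 'ln_kf'. A code is flagged when the spread
    between its largest and smallest reported value exceeds `tolerance`.
    """
    by_pdb = defaultdict(list)
    for entry in entries:
        # PDB codes are case-insensitive, so normalize before grouping.
        by_pdb[entry["pdb"].upper()].append(entry)

    flagged = {}
    for pdb, group in by_pdb.items():
        values = [e["ln_kf"] for e in group]
        if max(values) - min(values) > tolerance:
            flagged[pdb] = sorted((e["collection"], e["ln_kf"]) for e in group)
    return flagged


# Invented example data; these are not real PDB codes or measured values.
example_entries = [
    {"collection": "A", "pdb": "0aaa", "ln_kf": 6.9},
    {"collection": "B", "pdb": "0AAA", "ln_kf": 6.9},
    {"collection": "C", "pdb": "0AAA", "ln_kf": 4.1},
    {"collection": "A", "pdb": "0BBB", "ln_kf": 2.0},
]
example_flagged = find_discrepancies(example_entries)
```

A check of this kind only locates candidate conflicts. Deciding which value is correct, or whether two PDB codes refer to the same experiment, still requires returning to the original experimental report.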

ACPro Database Combines Kinetics with Sequence and Structural Information

ACPro is a modern web application that is built upon a MySQL relational database. Numerous scripts written in Perl, PHP, Python, and JavaScript are used for retrieving, analyzing, and displaying data to users. When a user provides a PDB code via Search, the database reports the protein folding kinetics results and experimental conditions for those proteins with verified experimental results contained in ACPro, and displays the molecular model, protein sequence, and structural information compiled from a variety of sources for any protein in the RCSB database.17 The RCSB PDB Web Service interface (http:// provides the molecule name, description, list of chain IDs, taxonomy, and length. The FASTA sequence for each protein is retrieved from the RCSB PDB site and used to extract additional amino acid sequence-dependent protein characteristics from the ProtParam website.18 These characteristics include the molecular weight, the composition and count of amino acids, theoretical isoelectric point (pI), aliphatic index, grand average of hydropathy values (GRAVY), total number of negatively charged residues, and total number of positively charged residues.

Figure 1. ACPro homepage.

Protein structural characteristics, including the number of alpha-helices and beta-strands, and the SCOP structural class annotation19 are extracted directly from the RCSB website using a simple HTML web parser. For protein fragments, at this time, the structural class is taken from the full protein PDB entry. To calculate structural parameters with known relationships to protein folding rate constants, the database downloads the PDB file from the RCSB site for local processing. A set of values for contact order, a widely used topological parameter,4 is computed using cutoff distances of 6, 8, 10, and 12 Å, using a multiple contact all-heavy atom method for counting contacts.20 This computation is handled by a Perl program, modified to correct a minor undercount of heavy atoms, that is based on the original script developed and hosted online ( contact_order/) by the Baker group.21 The difference in accessible surface area (ASA) between the folded and unfolded protein conformations is computed using the University of California at San Francisco’s Chimera software22 for the folded ASA and the transition midpoint method23 for the unfolded ASA.
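The contact order calculation described above can be sketched as follows. This is not the modified Perl program used by ACPro. It is a minimal illustration that assumes every pair of heavy atoms belonging to different residues and lying within the cutoff counts as one contact (so a residue pair may contribute multiple contacts), that absolute contact order is the mean sequence separation over all such contacts, and that relative contact order divides this by the chain length. The function name and the input format are hypothetical.

```python
from itertools import combinations
from math import dist


def contact_order(atoms, n_residues, cutoff=6.0):
    """Return (absolute, relative) contact order for one chain.

    `atoms` is a list of (residue_number, (x, y, z)) tuples for heavy atoms,
    with coordinates in angstroms. Every atom pair from different residues
    that lies within `cutoff` is counted as a separate contact.
    """
    n_contacts = 0
    total_separation = 0
    for (res_i, xyz_i), (res_j, xyz_j) in combinations(atoms, 2):
        separation = abs(res_i - res_j)
        if separation == 0:
            continue  # atoms within the same residue are not contacts
        if dist(xyz_i, xyz_j) <= cutoff:
            n_contacts += 1
            total_separation += separation

    if n_contacts == 0:
        return 0.0, 0.0
    absolute = total_separation / n_contacts
    relative = absolute / n_residues
    return absolute, relative


def contact_order_by_cutoff(atoms, n_residues, cutoffs=(6.0, 8.0, 10.0, 12.0)):
    """Evaluate contact order at each of the cutoff distances named in the text."""
    return {c: contact_order(atoms, n_residues, c) for c in cutoffs}
```

The all-pairs loop is quadratic in the number of atoms, which is acceptable for the small single-domain proteins typical of folding studies; a spatial grid or k-d tree would be a reasonable substitute for larger structures.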

ACPro User Tools and Functionality

Our database has several tools that enable a user to engage the collection. The ACPro homepage (Fig. 1) offers choices of searching, browsing, accessing resources, contributing, and contacting the database managers via the main menu. Users can engage the collection without logging in, which is part of our design for accessibility. The Search function enables a user to bring up information about any protein by name or PDB code, as previously described. The structural information is displayed even for proteins that do not yet have any available folding kinetics information, which increases the utility of the database.

The Browse function allows a user to view all proteins with verified folding kinetics entries organized by PDB code. The initial browsing view displays preset information about each protein, including but not limited to its name, folding type, structural class, and lnkf (Fig. 2). Navigation within Browse to see all the entered proteins online is accomplished with the pagination toolbar. Above that toolbar, the user has the option to turn on additional parameters in the display by clicking the associated button. These are divided into four sets of additional parameters: structural properties, experimental conditions, (detailed) contact order data, and additional notes. Users can download the complete dataset with their choice of parameters selected.

Figure 2. ACPro browse page.

The Resources page contains links to other folding databases, the list of experimental sources for our database entries, the SCOP and ProtParam structural sites, code used to calculate contact order, and other protein-folding related online resources. Finally, the Contribute page allows users to submit new entries for consideration in the database. To ensure that this collection maintains the rigorous reporting standards advocated by the folding consortium,2 our template for submission of new entries has required fields for construct information and experimental conditions. The list of required fields is as follows: contributor’s name and email, protein name, PDB code, chain and fragment information (start and end), folding type (two- or multi-state), source for folding kinetics data, temperature, pH, buffer, denaturant, experimental method, the kinetic parameters (including associated errors), and the full sequence information of the actual experimental protein construct. In addition, for multi-state proteins, amplitudes and rate constants for each observed phase are requested, as well as details about the models fit and any resulting intrinsic rate constants. Users can provide feedback and suggestions, particularly for additional resources, using the Contact page.
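The required submission fields listed above can be pictured as a simple record with validation. The sketch below is not the ACPro submission form or its database schema. The class name, field names, and the specific checks (a four-character PDB code, a start residue no greater than the end residue) are assumptions chosen to mirror the list in the text, and the kinetic parameters are reduced to a single lnkf value with its error for brevity.

```python
from dataclasses import dataclass, fields


@dataclass
class FoldingSubmission:
    """Hypothetical record mirroring the required fields named in the text."""
    contributor_name: str
    contributor_email: str
    protein_name: str
    pdb_code: str
    chain: str
    fragment_start: int
    fragment_end: int
    folding_type: str          # "two-state" or "multi-state"
    source: str                # citation for the folding kinetics data
    temperature_c: float
    ph: float
    buffer: str
    denaturant: str
    method: str
    ln_kf: float
    ln_kf_error: float
    construct_sequence: str    # sequence of the actual experimental construct


def validation_errors(sub):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for f in fields(sub):
        value = getattr(sub, f.name)
        # Every text field is required, so blank strings are rejected.
        if isinstance(value, str) and not value.strip():
            errors.append(f"{f.name} is required")
    if sub.folding_type not in ("two-state", "multi-state"):
        errors.append("folding_type must be 'two-state' or 'multi-state'")
    if sub.pdb_code.strip() and len(sub.pdb_code.strip()) != 4:
        errors.append("pdb_code must have four characters")
    if sub.fragment_start > sub.fragment_end:
        errors.append("fragment_start must not exceed fragment_end")
    return errors
```

For multi-state proteins, a fuller model would also carry the amplitude and rate constant of each observed phase and a description of the fitted model, as the text requests.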

Renewed Call for Reporting Standards

As previously mentioned, in 2005, a consortium of protein folding researchers organized by Plaxco called for a standard set of experimental conditions for kinetics experiments and reported results under those conditions for a collection of two-state proteins.2 Specifically, they encourage performing folding experiments (1) at 25 °C, (2) with urea as the denaturant, and (3) using pH 7.0 buffers. Regardless of the conditions used, the temperature, pH, denaturant, and buffer should be clearly reported. In addition, the experimental construct and the relevant PDB code should be specified.2 In terms of reporting the results, they suggest preferred methods of obtaining estimates from the data, and they encourage reporting the natural log of the folding rate constant in the absence of denaturant (lnkf, with kf in units of sec⁻¹), and m-values in the IUPAC-approved units of (kJ/mol)/M.

In building this new database, we found that many publications of experimental results since 2005 do not adhere well to these critical reporting standards. Unfortunately, this inconsistency leads to ambiguity when other researchers attempt to extract the folding parameters and associated sequence and structural information. As a consequence, the contributions of the original experimental results are limited (or negative, if erroneously extracted), as are the impact and accuracy of any ensuing theoretical model based on these data.

In terms of structural information, specific construct information and associated PDB codes are often not provided. This omission causes the majority of the discrepancies across collections. Various researchers have extracted the same lnkf from an original source, but have associated it with different PDB codes, leading subsequent collections to include multiple entries for the same original lnkf data. Other lnkf values have been incorrectly associated with a full-length PDB code when specific details of the actual smaller experimental construct are not provided.

Both the experimental conditions and results can be surprisingly difficult to extract from original sources. In some cases, pH and temperature are missing completely, or are available only in figure captions, Supporting Information, or previous references. When lnkf in the absence of denaturant is not explicitly reported, in some cases this value may be back calculated from other reported thermodynamic parameters. In other cases, extracting lnkf requires estimating the y-intercept value from a chevron plot. For multi-state proteins, we verified the value of the rate constant selectively reported in previous sources. For most proteins, this appeared to be the slowest observed non-proline isomerization rate constant. Without a standardized reporting method, confirming this and assigning the reported rate constant to a particular phase or step in a model pathway is practically impossible for anyone not involved in the work.
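To make concrete what estimating the y-intercept from a chevron plot involves, the sketch below fits a straight line to points on the folding arm and extrapolates to zero denaturant. It assumes that, at low denaturant, the natural log of the observed rate constant is dominated by folding and varies linearly with denaturant concentration. It is not the procedure used for any specific ACPro entry, and the function name and arguments are hypothetical.

```python
def extrapolate_ln_kf(denaturant_molar, ln_k_obs):
    """Least-squares line through folding-arm points of a chevron plot.

    Returns (intercept, slope), where the intercept is the estimate of
    ln(kf) at zero denaturant and the slope is the change in ln(k_obs)
    per molar denaturant.
    """
    n = len(denaturant_molar)
    if n < 2 or n != len(ln_k_obs):
        raise ValueError("need at least two matched (concentration, ln k) points")

    mean_x = sum(denaturant_molar) / n
    mean_y = sum(ln_k_obs) / n
    sxx = sum((x - mean_x) ** 2 for x in denaturant_molar)
    if sxx == 0:
        raise ValueError("denaturant concentrations must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(denaturant_molar, ln_k_obs))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope
```

Reading points off a published figure adds digitization error on top of the fitting error, which is one reason an explicitly reported lnkf with its uncertainty is preferable to a value recovered this way.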



In compiling our database, we have addressed these issues and numeric data entry errors through a time-intensive verification and data entry process. Each entry was confirmed with the original published source cited, and with additional references as needed. Users may confidently use this comprehensive dataset to develop and test theoretical models for protein folding kinetics, or for any other relevant analysis. To maintain this high quality of data as the pool of available protein folding kinetics results expands, we renew the call for the community to adopt the clear and standard reporting of results proposed in Maxwell et al.,2 adhering to the consensus experimental conditions when possible, and we strongly encourage researchers to submit their data to ACPro.

Conclusion

Our new protein folding kinetics database, ACPro, ( is a freely accessible resource. Multiple tools to engage with the database are provided: information on individual proteins can be retrieved, the collection can be downloaded, and new entries may be submitted. The issues we encountered working from experimental sources in our verification process highlight the urgency for the protein folding community to recommit to following the existing reporting, if not experimental, standards. Since the current 126 entries constitute the most comprehensive and verified collection of associated protein folding kinetics and structural information, we propose the ACPro dataset as the standard for development and testing of future predictive theoretical models.

Acknowledgments

The authors thank the Jaswal Lab members and Summer Biomath students for assistance with data acquisition and entry: Catherine Amaya, Paul Yao, Roy Jung, Shennon Lu, Nevon Song, and Jenny Xu. The Amherst College Dean of the Faculty is acknowledged for student support. Thanks to Nese Kurt Yilmaz and two anonymous reviewers for careful reading of the manuscript.

References

1. Knowles TPJ, Vendruscolo M, Dobson CM (2014) The amyloid state and its association with protein misfolding diseases. Nat Rev Mol Cell Biol 15:384–396.
2. Maxwell KL, Wildes D, Zarrine-Afsar A, De Los Rios MA, Brown AG, Friel CT, Hedberg L, Horng J-C, Bona D, Miller EJ, Vallée-Bélisle A, Main ERG, Bemporad F, Qiu L, Teilum K, Vu N-D, Edwards AM, Ruczinski I, Poulsen FM, Kragelund BB, Michnick SW, Chiti F, Bai Y, Hagen SJ, Serrano L, Oliveberg M, Raleigh DP, Wittung-Stafshede P, Radford SE, Jackson SE, Sosnick TR, Marqusee S, Davidson AR, Plaxco KW (2005) Protein folding: defining a “standard” set of experimental conditions and a preliminary kinetic data set of two-state proteins. Protein Sci 14:602–616.
3. Zhou H, Zhou Y (2002) Folding rate prediction using total contact distance. Biophys J 82:458–463.
4. Ivankov DN, Garbuzynskiy SO, Alm E, Plaxco KW, Baker D, Finkelstein AV (2003) Contact order revisited: influence of protein size on the folding rate. Protein Sci 12:2057–2062.
5. Ivankov DN, Finkelstein AV (2004) Prediction of protein folding rates from the amino acid sequence-predicted secondary structure. Proc Natl Acad Sci USA 101:8942–8944.
6. Jung J, Lee J, Moon H-T (2005) Topological determinants of protein unfolding rates. Proteins 58:389–395.
7. Huang J-T, Cheng J-P, Chen H (2007) Secondary structure length as a determinant of folding rate of proteins with two- and three-state kinetics. Proteins 67:12–17.
8. Istomin AY, Jacobs DJ, Livesay DR (2007) On the role of structural class of a protein with two-state folding kinetics in determining correlations between its size, topology, and folding rate. Protein Sci 16:2564–2569.
9. Ouyang Z, Liang J (2008) Predicting protein folding rates from geometric contact and amino acid sequence. Protein Sci 17:1256–1263.
10. Bogatyreva NS, Osypov AA, Ivankov DN (2009) KineticDB: a database of protein folding kinetics. Nucleic Acids Res 37:D342–D346.
11. Jung J, Buglass AJ, Lee E-K (2010) Topological quantities determining the folding/unfolding rate of two-state folding proteins. J Solution Chem 39:943–958.
12. Guo J, Rao N (2011) Predicting protein folding rate from amino acid sequence. J Bioinform Comput Biol 9:1–13.
13. De Sancho D, Muñoz V (2011) Integrated prediction of protein folding and unfolding rates from only size and structural class. Phys Chem Chem Phys 13:17030–17043.
14. Zou T, Ozkan SB (2011) Local and non-local native topologies reveal the underlying folding landscape of proteins. Phys Biol 8:066011.
15. Huang JT, Xing DJ, Huang W (2011) Relationship between protein folding kinetics and amino acid properties. Amino Acids 43:567–572.
16. Garbuzynskiy SO, Ivankov DN, Bogatyreva NS, Finkelstein AV (2013) Golden triangle for folding rates of globular proteins. Proc Natl Acad Sci USA 110:147–150.
17. Rose PW, Bi C, Bluhm WF, Christie CH, Dimitropoulos D, Dutta S, Green RK, Goodsell DS, Prlic A, Quesada M, Quinn GB, Ramos AG, Westbrook JD, Young J, Zardecki C, Berman HM, Bourne PE (2013) The RCSB Protein Data Bank: new resources for research and education. Nucleic Acids Res 41:D475–D482.
18. Gasteiger E, Hoogland C, Gattiker A. Protein identification and analysis tools on the ExPASy server. In: Walker JM, Ed. (2005) The proteomics protocols handbook. New York: Humana Press, pp 571–607.
19. Hubbard TJ, Murzin AG, Brenner SE, Chothia C (1997) SCOP: a structural classification of proteins database. Nucleic Acids Res 25:236–239.
20. Wagaman AS, Jaswal SS (2014) Capturing protein folding-relevant topology via absolute contact order variants. J Theor Comput Chem 13:1450005.
21. Plaxco KW, Simons KT, Baker D (1998) Contact order, transition state placement and the refolding rates of single domain proteins. J Mol Biol 277:985–994.
22. Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE (2004) UCSF Chimera - a visualization system for exploratory research and analysis. J Comput Chem 25:1605–1612.
23. Gong H, Rose GD (2008) Assessing the solvent-dependent surface area of unfolded proteins using an ensemble model. Proc Natl Acad Sci USA 105:3321–3326.
