Progress in Nuclear Magnetic Resonance Spectroscopy 82 (2014) 27–38


Journal homepage: www.elsevier.com/locate/pnmrs

NMR structure validation in relation to dynamics and structure determination

Wim F. Vranken ⇑

Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium
Department of Structural Biology, VIB, 1050 Brussels, Belgium
Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan, BC Building, 6th Floor, CP 263, 1050 Brussels, Belgium

Edited by David Neuhaus and Gareth Morris

Article info

Article history:
Received 1 July 2014
Accepted 14 August 2014
Available online 26 August 2014

Keywords:
Nuclear magnetic resonance
Protein structure validation
Protein dynamics and conformation
Protein structure calculation

Abstract

NMR spectroscopy is a key technique for understanding the behaviour of proteins, especially highly dynamic proteins that adopt multiple conformations in solution. Overall, protein structures determined from NMR spectroscopy data constitute just over 10% of the Protein Data Bank archive. This review covers the validation of these NMR protein structures, but rather than describing currently available methodology, it focuses on concepts that are important for understanding where and how validation is most relevant. First, the inherent characteristics of the protein under study have an influence on quality and quantity of the distinct types of data that can be acquired from NMR experiments. Second, these NMR data are necessarily transformed into a model for use in a structure calculation protocol, and the protein structures that result from this reflect the types of NMR data used as well as the protein characteristics. The validation of NMR protein structures should therefore take account, wherever possible, of the inherent behavioural characteristics of the protein, the types of available NMR data, and the calculation protocol. These concepts are discussed in the context of ‘knowledge based’ and ‘model versus data’ validation, with suggestions for questions to ask and different validation categories to consider. The principal aim of this review is to stimulate discussion and to help the reader understand the relationships between the above elements in order to make informed decisions on which validation approaches are the most relevant in particular cases.

© 2014 Elsevier B.V. All rights reserved.

Contents

1. Introduction
2. General concepts
   2.1. Protein characteristics
   2.2. Experimental data
   2.3. Structure calculation protocol
   2.4. Ensembles, models, residues and atoms: level of validation assessment
   2.5. Software, data formats and validation
3. Different situation, different validation?
   3.1. Well-folded, mostly unique conformation
      3.1.1. Many informative data
      3.1.2. Few informative data (sparse data)
   3.2. Some mobility, distinct conformations in fast exchange
   3.3. High mobility, many conformations
4. Conclusion
Funding
Acknowledgements
References

⇑ Address: Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium. E-mail address: [email protected]
http://dx.doi.org/10.1016/j.pnmrs.2014.08.001
0079-6565/© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Structural biology is essential for understanding the processes that underlie biology at an atomic level. The key experimental techniques in this field are X-ray diffraction of crystals, Nuclear Magnetic Resonance (NMR) in both solution and solid state, and Electron Microscopy (EM); they have provided and continue to provide extremely relevant information in unravelling biological processes. The desired outcome from the information produced by these methods is usually a full in silico three-dimensional description of the molecule(s) under investigation, traditionally in the Protein Data Bank (PDB) [1,2] or a related format, where Cartesian coordinates are provided for some or all of the molecular atoms at very high precision (to 10⁻¹³ m). The great advantage of providing molecular information in this way is that the physical relationships between atoms, such as the length of covalent bonds, can be correctly described even if the overall positions of the atoms are not well determined experimentally.

The process of translating data from any experimental technique into this precise atomic in silico representation requires a non-trivial transformation of information: experimental data specific to the biomolecule(s) under investigation necessarily has to be provided in an electronic form that can be used by software (e.g. [3]). This software, in essence, then introduces varying degrees of generic modelling based on theory and/or knowledge-based statistics, and attempts to find the best arrangement of atomic coordinates that matches the experimentally obtained results (the structure calculation process). The relationship between specific experimental data and generic modelling is highly complex and method- as well as software-dependent.
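The 10⁻¹³ m figure quoted above can be checked with a short sketch. It assumes the legacy fixed-column PDB layout, in which the x, y and z coordinates occupy columns 31 to 54 and are written with three decimals in ångström; the helper names and the example atom are hypothetical and serve only as illustration.

```python
# Illustrative sketch: write and re-read the coordinate fields of one
# fixed-column PDB ATOM record (assumed layout: x, y, z in columns 31-54,
# each 8 characters wide with 3 decimals, in angstrom).

ANGSTROM_IN_METRES = 1e-10
COORDINATE_STEP_M = 0.001 * ANGSTROM_IN_METRES  # smallest representable step


def format_atom_record(serial, name, res_name, chain, res_seq, x, y, z):
    """Return the first 54 columns of an ATOM record (hypothetical helper)."""
    return (
        f"ATOM  {serial:5d} {name:<4s} {res_name:>3s} {chain:1s}{res_seq:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}"
    )


def parse_coordinates(record):
    """Read x, y, z (in angstrom) back from columns 31-54 of the record."""
    return tuple(float(record[start:start + 8]) for start in (30, 38, 46))


# Invented example atom; any digits beyond the third decimal are lost on writing.
record = format_atom_record(1, " CA ", "ALA", "A", 1, 12.3456789, -4.0004, 0.12349)
coords = parse_coordinates(record)

print(record)
print(coords)
print(f"coordinate step: {COORDINATE_STEP_M:.1e} m")
```

The written precision is a property of the file format only; it says nothing about how well the corresponding atomic position is determined by the experimental data.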
In a conceptual oversimplification one can state that the more non-redundant and high precision experimental data are available, the less generic modelling is required, and vice versa, to obtain a meaningful atomic representation of a protein (Fig. 1) [4]. Depending on the typical non-redundancy and/or precision of experimental data, it is possible to place the different experimental techniques on the x-axis of Fig. 1, with the warning that this representation is only indicative, as
there is a wide variation in data precision within each technique. Important in this context is that over the course of the last decade hybrid methods, where data from multiple techniques are combined, are on the rise in structural biology [5–7]. Such approaches have the distinct advantage that they combine the strengths of different experimental techniques, and the non-redundancy between the data they produce, to generate the final atomic coordinates.

It is important to maintain the distinction between experimental data specific to the biomolecule(s) under investigation (e.g. peaks in a NOESY NMR spectrum) and generic modelling information (e.g. statistics on observed bond lengths, known protein structures), as each contributes differently to the structure calculation and validation process (Fig. 2).

Independent validation of the calculated structure(s) is the next essential step to ensure that the atomic coordinates are realistic and represent the experimental data. This validation can again be divided into ‘knowledge-based’ validation [8,9], where the atomic coordinates are compared to generic expected characteristics (e.g. Ramachandran plot for the protein backbone), or ‘model versus data’ validation [8,9], where they are compared to specific experimental data (e.g. NMR chemical shift values). Both ‘knowledge-based’ and ‘model versus data’ validation are highly relevant, but it is generally more difficult to perform the latter as specific experimental data are often limited (Fig. 1); the more generic modelling is required, the less independent protein-specific validation against experimental data is possible. Validation based on the experimental data that were used to generate the structures shows whether the data are internally consistent and whether the structure calculation process incorporated them correctly; only independent experimental data provide a real cross-validation measure (Fig. 2).
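The difference between validating against incorporated and independent data can be illustrated with a toy example. The sketch below assumes distance restraints expressed as upper bounds in ångström; the four-atom model, the restraint lists and all function names are invented for illustration and do not correspond to any particular validation package.

```python
import math

# Toy 'model versus data' check: compare a model against distance restraints
# that were used in the calculation (incorporated) and against restraints
# that were held back (independent). All values are invented.


def restraint_violations(coords, restraints):
    """Return by how much each atom pair exceeds its upper bound (0.0 if satisfied)."""
    return [
        max(0.0, math.dist(coords[first], coords[second]) - upper_bound)
        for first, second, upper_bound in restraints
    ]


def rms(values):
    """Root-mean-square of a list of values (0.0 for an empty list)."""
    return math.sqrt(sum(v * v for v in values) / len(values)) if values else 0.0


# Hypothetical model: four atoms with coordinates in angstrom.
model = {
    "A": (0.0, 0.0, 0.0),
    "B": (3.0, 0.0, 0.0),
    "C": (3.0, 4.0, 0.0),
    "D": (0.0, 4.0, 0.0),
}

# (atom 1, atom 2, upper distance bound in angstrom)
incorporated = [("A", "B", 3.5), ("B", "C", 4.5)]
independent = [("A", "C", 4.5), ("A", "D", 5.0)]

incorporated_violations = restraint_violations(model, incorporated)
independent_violations = restraint_violations(model, independent)

print("incorporated:", incorporated_violations, "RMS", rms(incorporated_violations))
print("independent: ", independent_violations, "RMS", rms(independent_violations))
```

In this invented case the incorporated restraints are all satisfied, while one of the independent restraints is exceeded by 0.5 Å, a discrepancy that a check against the incorporated data alone would not reveal.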
In the context of structure validation and interpretation it is important to remain aware of the complexity of the full structure calculation process, where the amount of experimental data, their transformation into a model and the choice of calculation protocol can strongly influence the resulting atomic structure representation [10,11]. The very high precision with which atoms are described in silico can therefore be misleading, as the uncertainties of this whole process are not immediately apparent. In X-ray crystallography, B-factors describe displacements of the atoms from their mean positions, with dynamic disorder the major contributor [12]. In NMR, ensembles of structural models are calculated, with each model typically assigned a score or energy that indicates how well it corresponds to the generic modelling and specific experimental data. The structural variation between the models should indicate the uncertainty in the experimental data, but this is not always the case [13].

Fig. 1. Conceptual representation of the relative proportion of specific experimental data and generic modelling required to obtain a model with precise atomic coordinates when using data from different experimental techniques.

Fig. 2. Overview of protein structure calculation and validation. The in silico modelling information necessary for structure calculation can be divided into a generic model based on general experimental data (top half), and a specific model based on protein specific experimental data (bottom half), with theoretical approaches central in interpreting and transforming the data. The modelled data itself can be divided into ‘incorporated’ information used to generate the protein structure model via various structure calculation protocols, and into ‘independent’ information, which is not included and can be used for cross-validation. The calculated protein structure model can then be validated in two ways: ‘knowledge based’ [8,9]/‘geometric validation’ [14] against generic expected characteristics (such as the Ramachandran plot) or ‘model versus data’ [8,9], the agreement of experimental data with the resulting structure [14] (such as NOE-derived distance restraints). Cross-validation against independent data not used in the structure calculation is generally more diagnostic; validation against incorporated data, on the other hand, mainly shows whether the data are internally consistent and if the structure calculation protocol was able to handle them correctly to produce a protein structure model.

This review covers the validation of protein structures derived from NMR spectroscopy data, which offers its own particular opportunities and challenges. NMR data are highly complex, not only because distinct types of data can be acquired from NMR experiments, but also because each distinct type of data has a different relation to the conformation and dynamics of the protein. For proteins studied by NMR, this results in a wide variation in the overall quality of the acquired data, in which types of data are available, and in how many data of each type were recorded. The size of the protein and its conformational behaviour also strongly influence these factors. Atomic coordinate structures calculated from NMR data reflect this diversity in information and protein characteristics, and NMR structure validation should account for such differences and make use of the available data in the best possible way. Several excellent reviews on the validation of NMR structures [8,14] as well as validation recommendations [9] have recently been published. This review therefore focuses on a more conceptual description of the relationship between the protein characteristics, the available specific experimental data, the calculation protocol, and the desired or required validation of the results, including a short overview of the most recent developments.
Understanding the relationship between these elements, and the problems and issues with each, should help the reader make informed decisions on which validation approaches are the most appropriate in each context.

2. General concepts

When validating an NMR protein structure, the modelled high precision atomic coordinates are related back to generic and/or protein-specific information (Fig. 2). This section describes the general concepts that are important to consider in protein structure validation: the characteristics of the protein being validated, the available experimental data, and the structure calculation protocol. All of these factors determine to a great extent the ‘model-versus-data’ validation [8,9], or how the calculated protein structure model relates to the modelled experimental information. The ‘knowledge-based’ validation [8,9], on the other hand, assesses how physically realistic the high precision atomic coordinates are, and is mostly dependent on the structure calculation approach.

2.1. Protein characteristics

Proteins, whether they have very short or extremely long sequences, are very versatile and exhibit a wide range of characteristics. They can assemble into large complexes (or not), adopt anything from a unique and well-defined conformation to a wide variety of dissimilar conformations [15], and display varying degrees of dynamics when changing between these conformations. Moreover, these characteristics are context-dependent (e.g. pH, membrane localization, etc.). In solution NMR a reductionist approach is normally adopted where the protein is typically isolated from its native environment in a sample, sometimes with other interaction partners such as ligands. The experimental observations therefore do not necessarily describe how the protein behaves in its native environment(s). Nevertheless, a protein structure model derived from experimental observations should ideally capture the conformation(s) the protein adopts in the sample as accurately as possible. The characteristics of the protein will have a direct influence on how this should best be done, and also subsequently on the protein structure validation. For example, for a protein with a very well-defined and stable fold, the experimental data will be generally consistent even if small fluctuations in its conformations are present; it adopts a single dominant conformation and is amenable to traditional structure calculation approaches. At the other extreme, some proteins adopt a wide variety of different conformations, meaning that the experimental data will be averaged and


Fig. 3. Overview figure describing the typically observed relationship for
