HHS Public Access Author manuscript Author Manuscript

Structure. Author manuscript; available in PMC 2017 February 02. Published in final edited form as: Structure. 2016 February 2; 24(2): 216–220. doi:10.1016/j.str.2015.12.010.

Safeguarding structural data repositories against bad apples Wladek Minor1,*, Zbigniew Dauter2, John R. Helliwell3, Mariusz Jaskolski4, and Alexander Wlodawer5 1Department

Author Manuscript

of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA 22903, USA 2Synchrotron Radiation Research Section, Macromolecular Crystallography Laboratory, National Cancer Institute, Argonne National Laboratory, Argonne, IL 60439, USA 3School of Chemistry, University of Manchester, Manchester M13 9PL, UK 4Center for Biocrystallographic Research, Institute of Bioorganic Chemistry, Polish Academy of Sciences and Department of Crystallography, Faculty of Chemistry, A. Mickiewicz University, 60-780 Poznan, Poland 5Protein Structure Section, Macromolecular Crystallography Laboratory, National Cancer Institute, Frederick, MD 21702, USA

Abstract

Author Manuscript

Structural biology research generates large amounts of data, some deposited in public databases/ repositories, but a substantial remainder never becoming available to the scientific community. Additionally, some of the deposited data contain less or more serious errors that may bias the results of data mining. Thorough analysis and discussion of these problems is needed in order to ameliorate this situation. This note is an attempt to propose some solutions and encourage both further discussion and action on the part of the relevant organizations, in particular the Protein Data Bank and various bodies of the International Union of Crystallography.

Introduction

Author Manuscript

During the last three years, a number of publications and editorials have raised the issue of the reproducibility of scientific results in a variety of disciplines. Although the ability to reproduce published results is generally recognized as a defining characteristic of science, in recent years the number of published biomedical research findings that cannot be repeated outside of the original laboratory has been rising at an alarming rate. For example, some pharmaceutical companies have revealed that they can independently verify less than 25% of preclinical cancer studies published in major journals (Begley and Ellis, 2012). Other sobering reports indicate that up to 75% of experiments performed on drug targets in a given laboratory cannot be repeated elsewhere (Prinz et al., 2011). It is tempting to assume that the problem is limited to academic researchers cutting corners because of grant applications, tenure decisions, and other forms of “publish or perish” pressure, but some papers published

*

Correspondence: [email protected]. Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Minor et al.

Page 2

Author Manuscript

by industrial researchers describe results that cannot be replicated elsewhere as well (Choi et al., 2008; Gillmor et al., 1997). The price tag of the irreproducibility of published results was recently estimated to be in the vicinity of 28 billion dollars per annum (Freedman et al., 2015), almost equivalent to the entire annual budget of the National Institutes of Health (NIH).

Author Manuscript Author Manuscript

The NIH (Collins and Tabak, 2014) and leading journals (McNutt, 2014) have been working on guidelines to improve the reproducibility of published research. The peer-review process sometimes requires the availability of the experimental data. For example, recognizing ‘bad’ macromolecular crystal structures is surely an important part of peer review of the submission of a structural biology research article. Since most of the biostructural research rests upon the atomic coordinates derived from X-ray diffraction data, peer review cannot divorce the experimental data from the words that are written. The chemical crystallography community recognized this many years ago and, largely via the International Union of Crystallography (IUCr) Journals Commission, reached a consensus in the early 1990s on community agreed standards for crystal structure articles and data. It was also agreed that all article submissions have to be accompanied by atomic coordinates and structure factor amplitudes. To make matters as manageable as possible, a checkcif report (Spek, 2009) was made available to the submitting authors, referees, and the co-editor. The report presented alerts of various level of seriousness, indicated via an A, B, C system of tags that needed to be considered by the authors. ‘A’ alerts were essential to be dealt with or responded to. This became then an exemplary approach, agreed to by the chemical crystallography community and driven forward by IUCr Journals (namely Acta Crystallographica Section C). JRH, at that time the Editor-in-Chief of IUCr Journals, proposed at the IUCr Congress in Geneva the adoption of a similar system by the macromolecular crystallography community. Unfortunately, the proposal was rejected on two grounds. It was stated, first, that confidentiality of the peer review of a biological crystal structure article submission’s data could not be guaranteed and thus conflicts with competition would be possible, and, second, that the commercial potential of a given research result could be compromised. The relatively recent adoption by most journals of a policy requiring the submission of a Protein Data Bank (PDB) validation report (Quesada et al., 2010) as an obligatory companion of a biological crystallography article submission represents a step forward. Whereas the validation report might be a guide to the authors, referees, editor, and finally to the readers and users, it is an inadequate substitute for the coordinates and diffraction data themselves made available at the time of submission of a structural article.

Need for community-agreed standards Author Manuscript

What are the crucial parameters that could be used to assess the correctness of macromolecular structures and help the reviewers in finding potential problems with structural data? A key parameter that has been consistently difficult to define is the resolution limit of useful diffraction data. Over time, a variety of criteria have been proposed, and subsequently abandoned, by the protein crystallography community. Examples include the resolution limit where Rmerge is lower than 20%, or where is less than 2.0, or, most recently, where CC½ becomes less than 0.5. Other situations where

Structure. Author manuscript; available in PMC 2017 February 02.

Minor et al.

Page 3

Author Manuscript

the liberty of ‘interpretation’ caused hiatus at different times in the field, include unrestricted addition of buckets of water molecules, or (over)interpretation of ever diminishing electron density features (not even on an absolute scale), and the challenge of electron density blobs for which a mix of community approaches of attempted interpretation or ignoring them altogether, ebb and flow. Definitive standards on these and other points need urgent debate and then consistent enforcement of the guidelines by scientific journals, and certainly at least by IUCr journals.

Journal publications versus deposits to repositories/databases

Author Manuscript

However, results (questionable or not) published in peer-review journals are only a fraction of the data that are never published but are deposited at increasing rates in various biomedical repositories and databases (Figure 1 shows an example taken from the PDB). The presence of inaccurate, incomplete, and/or self-contradictory information in these databases may have a disastrous effect on data mining (Dauter et al., 2014; Zheng et al., 2008). In addition, the underestimated ‘ripple effect’ of dissemination of questionable results may lead to ‘creeping entropy,’ i.e. to accumulation of inconsistent or plainly wrong data, undermining users’ confidence in the usefulness of scientific results. This problem is not unique to biology, but the far-reaching consequences of inaccurate or inconsistent data in biomedical research could be particularly painful and devastating.

Data repositories

Author Manuscript

Our experience with experimental data repositories for the biomedical sciences shows that the data are often incomplete, inconsistent, or even self-contradictory. Inconsistent and inaccurate data are not limited to science but are a general plague pervading modern life. For example, inaccurate and inconsistent information may be seen at many airports, in the case when two different monitors show two different departure times for the same airplane. In the end, the true departure time is verified by the actual departure of the plane. At this point the misleading information affects only the people who missed the plane and does not have any further consequences, except that human beings associate the error with a computer and assume that such errors are just an unavoidable part of modern life. This modern disease has propagated into science, leading some scientists to blame software for erroneous results (Chang, 2007) and accept, as inevitable, the inaccuracies in the experimental procedures that may subsequently affect publications and databases.

Author Manuscript

The reporting of negative results could significantly improve detection of ‘bad apples’. However, the reporting of negative results in structural biology databases is also far from satisfactory. For example, the PSI Structural Biology Knowledgebase (http://sbkb.org/) provides a database collecting experimental data for all protein targets studied by all consortia in the Protein Structure Initiative: Biology (PSI:Biology) (Gabanyi et al., 2011) and several other large-scale structure determination initiatives. An analysis of this database shows that fewer than 15% of all targets contain any information about unsuccessful trials. Protein crystallography, as a major foundation of our molecular understanding of life, is generally regarded as a source of unquestionable information. After all, we ‘see atoms‘, as Sir Lawrence Bragg so vividly described it, and the results provided by protein Structure. Author manuscript; available in PMC 2017 February 02.

Minor et al.

Page 4

Author Manuscript

crystallographers have thus established the ‘gold standard’ for structural information in modern biology. For example, many consumers of the structural information collated in the PDB (Berman et al., 2000) treat crystal structures as ‘set in stone.’ Because of this confidence, good practices and standards for the maintenance of the unscathed integrity of structural repositories and databases have to be preserved and continuously improved.

Author Manuscript

Recently, a number of problems with structures of proteins, their ligands, nucleic acids, carbohydrates, bound metals, etc. have been identified and discussed in several publications. One of the authors of this paper (JRH) received a valuable critique of a group of crystal structures from the other authors of this paper (Shabalin et al., 2015), to which a response was provided (Tanley et al., 2015). This back-and-forth process is leading to improvements in the associated PDB files and showcases the advantages of the availability of all the data (including raw diffraction images, processed structure factor amplitudes, and the final derived coordinates) underpinning the publications (Tanley et al., 2012) and the raw data (doi: 10.15127/1.215887 https://www.escholar.manchester.ac.uk/uk-ac-man-scw:215887). More widely, the fraction of questionable deposits, especially of protein-ligand complexes that are most important for drug discovery studies, is not decreasing with time. This is most probably a consequence in part of the flood of new structures determined each year, and in part of the availability of various sophisticated software tools that allow relatively untrained persons to determine a macromolecular crystal structure in a partially or fully automated manner.

Author Manuscript

The PDB has adopted several policies that have made it one of the most comprehensive and widely used repositories in the biomedical sciences. The most important of these policies is the requirement to deposit experimental structure factors, which allow for independent reevaluation or even re-refinement of structures if problems are detected and/or deeper analyses are warranted. The PDB has also adopted extensive tools for structure validation and error detection during the deposition process, and which, by their own statements, are still under active development. However, even the best policies will not improve papers that have already been published and results that have already been deposited. Prompt redress of data banks such as the PDB is essential, since even a single incorrect structure may greatly hinder data mining research. One prominent example is the case of PDB deposit 3FJ0, which contained hundreds of ordered water oxygen atoms mislabeled as Na+ ions. This structure almost singlehandedly biased all analyses of bond geometries in the PDB involving Na+ ions (Chruszcz et al., 2010). Deposit 3FJ0 was finally corrected four years after its original deposition, and two years after the problem had been identified. In the meantime, the incorrect structure was downloaded several thousand times and potentially polluted many subsequent analyses and databases.

Author Manuscript

Policies regarding corrections of existing deposits Although, in principle, the conversion of the PDB from a static repository towards a dynamic one (Dauter et al., 2014; Terwilliger and Bricogne, 2014) with continually revised deposits should be possible, in current practice this would require the involvement of the original depositors and is time consuming. However, such an approach would not be as time consuming as the original work, which may have involved a multi-disciplinary team that is

Structure. Author manuscript; available in PMC 2017 February 02.

Minor et al.

Page 5

Author Manuscript Author Manuscript

likely to have dispersed. The deposited data in all their forms are therefore incredibly valuable. Recognizing bad apples is not always easy. One person’s bad apple is also perhaps another’s triumph over a poorly diffracting crystal or effort to combine many biophysical techniques in order to interpret a difficult portion of a structure. Nevertheless, community agreement is needed that consistent macromolecular crystal structure standards are necessary and possible. Accompanying an article submission with actual coordinates, structure factors, and ideally raw diffraction images would properly re-ignite the discussion and quest for a much improved peer review. This point was already raised at the IUCr Congress in Geneva, but the question was not resolved. We believe that it is essential that database integrity trump any individual concerns regarding intellectual property or competitive disadvantage during peer review. If the work is to be published, it is therefore going to be in the public record and the community has a right to evaluate its quality from the primary data. Supporting the dynamic and updatable database should be high on the agenda of all biomedical oriented research.

Author Manuscript

There are, naturally, various possible forms of the evolution of the PDB and other databases into a dynamic resource with the capability to systematically improve molecular models. This is based on the very sound idea, indeed a fundamental principle, that the originally measured data are eternal but any interpretation is subject to evolution (Chruszcz et al., 2010; McNutt, 2015). For example, any new crystallographic model would be made available together with all precursor models even though the authors of the original data and atomic model may or may not agree with the new reinterpretations of their data (Shabalin et al., 2015). Regardless of the circumstances, the exact contributions of each interpreter of Nature’s Truth should be made very clear to the users of this repository. The dynamic form of the repository would indirectly enforce the requirement that experimental data include all measurements that are needed to reproduce the reported results; for example, anomalous scattering data, when they were utilized in structure determination, should be formally required. The ideal solution would be to link each PDB deposit to the experimental data that could be used for de novo structure determination, for example when the interpretation of e.g. ligands, or indeed any structural feature, is contested. There are also new opportunities to be expected when the usual Bragg diffraction intensities are supplemented by the diffuse scattering outside the Bragg spots. This promising new method of analysis, as well as other currently unforeseen types of analyses, will only be possible when the raw diffraction images are in open archives. Thus ‘full diffraction pattern’ analyses based on an ensemble of molecules, e.g. as initially derived from molecular dynamics simulations, could become much more frequent. For a recent example of such work, see (Wall et al., 2014).

Author Manuscript

We note, and welcome, the new initiatives to create pilot resources for archiving the original experimental data. Examples of such efforts are the NIH Big Data to Knowledge (BD2K) initiative and a number of other resources, such as http://www.proteindiffraction.org/ (USA), http://zenodo.org (Europe), https://store.synchrotron.org.au/public_data/ (Meyer et al., 2014) (Australia), and Structural Biology Data Grid https://data.sbgrid.org/ (USA). Two pioneering resources, the Joint Center for Structural Genomics (JCSG) depository and Joint Center for Structural Genomics of Infectious Diseases and Seattle Structural Genomics Center for Infectious Diseases resource, are now the part of the BD2K initiative. The BD2K

Structure. Author manuscript; available in PMC 2017 February 02.

Minor et al.

Page 6

Author Manuscript

initiative is by far the largest one and currently comprises almost 2900 indexable and searchable diffraction experiments. The raw diffraction images may be useful for instructional purposes, as training sets for methods developers, as well as challenges of recalcitrant, difficult structures. All these resources sustain a policy of generating a permanent digital object identifier (doi) as part of the process. Extensive detailed discussion can be found in the set of articles in the October 2014 issue of Acta Crystallographica D. The IUCrJ, J Appl Cryst, Acta Crystallographica D and F journals have recently started linking their publications with primary data sets.

Author Manuscript

We note here that the need for and the practical problems connected with archiving of the primary experimental images, is not only limited to crystallography. The newly emerging high-resolution cryo-EM methodology (Doerr, 2015; Egelman, 2015) also produces voluminous amounts of digitally-stored raw image information https://www.ebi.ac.uk/pdbe/ emdb/empiar. It can be hoped that the solutions worked out in one of these disciplines will stimulate in a mutual way similar advancements in the other.

Conclusions and recommendations Having defined the problems, we would like to propose as solutions:

Author Manuscript Author Manuscript



The Community at large is encouraged to instigate repeated validations of archived structural and other biomedical data and promote corrective actions when needed. Overall, this ‘crowd sourcing’ approach will further strengthen the correctness and thereby the usefulness of structural databases, prevent any diminishing trust in the deposited data and ameliorate the serious effects of irreproducibility. Such actions should be spearheaded by the data centers themselves (most notably the PDB), so that it is a coordinated process, but also by leading international organizations (e.g. IUCr), funding agencies (e.g. NIH), and by the leading journals (e.g. Nature, Science, Structure, Protein Science, etc.).



Deposition of experimental raw data by research groups should become a standard and should be performed via established repositories capable of assigning digital object identifiers (doi) (Guss and McMahon, 2014). These repositories may either be established as part of the PDB itself or via the journal initiatives described above. We note that links between the PDB and some of the experimental raw data repositories are currently being established.



The journals publishing biological structures should start to formally require that diffraction data (or chemical shifts in the case of NMR) and atomic coordinates should be made available to the reviewers and editors at the time of article submission. These data may be accessible via the PDB through a secure, passwordprotected channel with all accesses recorded to ensure the maximal safety of the potentially restricted data. The PDB submission should contain not only the diffraction data used for refinement, but also the data (e.g. anomalous data) used for structure determination.

Structure. Author manuscript; available in PMC 2017 February 02.

Minor et al.

Page 7



Author Manuscript

If appropriate, the journals publishing methodological advancements should require deposition of experimental data sets that were used to illustrate the advantage of new methodology.

Whereas it is true that the Protein Data Bank is rightly seen as an open access exemplar, the importance of primary data archiving is increasingly recognized as described above. As a minimum course of action, we therefore recommend that the IUCr Commission on Biological Macromolecules discusses our proposals at the upcoming international crystallographic conferences in different regions of the world, with a view to their formal adoption at the next IUCr Congress in Hyderabad, India, in August 2017.

References Author Manuscript Author Manuscript Author Manuscript

Begley CG, Ellis LM. Drug development: Raise standards for preclinical cancer research. Nature. 2012; 483:531–533. [PubMed: 22460880] Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000; 28:235–242. [PubMed: 10592235] Chang G. Structure of MsbA from vibrio cholera: a multidrug resistance ABC transporter homolog in a closed conformation (Retraction of vol 330, pg 419, 2003). Journal of Molecular Biology. 2007; 369:596–596. [PubMed: 17580380] Choi J, Chon JK, Kim S, Shin W. Conformational flexibility in mammalian 15S-lipoxygenase: Reinterpretation of the crystallographic data. Proteins. 2008; 70:1023–1032. [PubMed: 17847087] Chruszcz M, Domagalski M, Osinski T, Wlodawer A, Minor W. Unmet challenges of structural genomics. Curr Opin Struct Biol. 2010; 20:587–597. [PubMed: 20810277] Collins FS, Tabak LA. Policy: NIH plans to enhance reproducibility. Nature. 2014; 505:612–613. [PubMed: 24482835] Dauter Z, Wlodawer A, Minor W, Jaskolski M, Rupp B. Avoidable errors in deposited macromolecular structures: an impediment to efficient data mining. IUCrJ. 2014; 1:179–193. Doerr A. Cryo-Em Goes High-Resolution. Nat Methods. 2015; 12:598–599. [PubMed: 26339705] Egelman EH. The Current Revolution in Cryo--EM. Biophys. J. 2016; 110(5) In press. Freedman LP, Cockburn IM, Simcoe TS. The Economics of Reproducibility in Preclinical Research. PLoS Biol. 2015; 13:e1002165. [PubMed: 26057340] Gabanyi MJ, Adams PD, Arnold K, Bordoli L, Carter LG, Flippen-Andersen J, Gifford L, Haas J, Kouranov A, McLaughlin WA, et al. The Structural Biology Knowledgebase: a portal to protein structures, sequences, functions, and methods. J Struct Funct Genomics. 2011; 12:45–54. [PubMed: 21472436] Gillmor SA, Villasenor A, Fletterick R, Sigal E, Browner MF. The structure of mammalian 15lipoxygenase reveals similarity to the lipases and the determinants of substrate specificity. Nat Struct Biol. 1997; 4:1003–1009. [PubMed: 9406550] Guss JM, McMahon B. How to make deposition of images a reality. Acta crystallographica. Section D, Biological crystallography. 2014; 70:2520–2532. [PubMed: 25286838] McNutt M. Journals unite for reproducibility. Science. 2014; 346:679. [PubMed: 25383411] McNutt M. Data, eternal. Science. 2015; 347:7. [PubMed: 25554763] Meyer GR, Aragao D, Mudie NJ, Caradoc-Davies TT, McGowan S, Bertling PJ, Groenewegen D, Quenette SM, Bond CS, Buckle AM, et al. Operation of the Australian Store. Synchrotron for macromolecular crystallography. Acta Crystallogr D. 2014; 70:2510–2519. [PubMed: 25286837] Prinz F, Schlange T, Asadullah K. Believe it or not: how much can we rely on published data on potential drug targets? Nat Rev Drug Discov. 2011; 10:712. [PubMed: 21892149] Quesada M, Westbrook J, Oldfield T, Young J, Swaminathan J, Feng ZK, Velankar S, Zhuravleva M, Sahni G, Shao CH, et al. wwPDB common annotation and deposition tool project. Abstr Pap Am Chem S. 2010; 240

Structure. Author manuscript; available in PMC 2017 February 02.

Minor et al.

Page 8

Author Manuscript Author Manuscript

Shabalin I, Dauter Z, Jaskolski M, Minor W, Wlodawer A. Crystallography and chemistry should always go together: a cautionary tale of protein complexes with cisplatin and carboplatin. Acta Crystallographica Section D-Biological Crystallography. 2015; 71:1965–1979. Spek AL. Structure validation in chemical crystallography. Acta Crystallogr D Biol Crystallogr. 2009; 65:148–155. [PubMed: 19171970] Tanley SW, Diederichs K, Kroon-Batenburg LM, Levy C, Schreurs AM, Helliwell JR. Response from Tanley et al. to Crystallography and chemistry should always go together: a cautionary tale of protein complexes with cisplatin and carboplatin. Acta Crystallogr D Biol Crystallogr. 2015; 71:1982–1983. [PubMed: 26327388] Tanley SW, Schreurs AM, Kroon-Batenburg LM, Helliwell JR. Room-temperature X-ray diffraction studies of cisplatin and carboplatin binding to His15 of HEWL after prolonged chemical exposure. Acta Crystallogr Sect F Struct Biol Cryst Commun. 2012; 68:1300–1306. Terwilliger TC, Bricogne G. Continuous mutual improvement of macromolecular structure models in the PDB and of X-ray crystallographic software: the dual role of deposited experimental data. Acta Crystallogr D Biol Crystallogr. 2014; 70:2533–2543. [PubMed: 25286839] Wall ME, Van Benschoten AH, Sauter NK, Adams PD, Fraser JS, Terwilliger TC. Conformational dynamics of a crystalline protein from microsecond-scale molecular dynamics simulations and diffuse X-ray scattering. Proceedings of the National Academy of Sciences of the United States of America. 2014; 111:17887–17892. [PubMed: 25453071] Zheng H, Chruszcz M, Lasota P, Lebioda L, Minor W. Data mining of metal ion environments present in protein structures. J Inorg Biochem. 2008; 102:1765–1776. [PubMed: 18614239]

Author Manuscript Author Manuscript Structure. Author manuscript; available in PMC 2017 February 02.

Minor et al.

Page 9

Author Manuscript

Highlights Large amounts of data are generated by research in structural biology Inconsistencies in the quality of deposited data lead to errors in data mining and interpretation Clear rules and procedures for correcting errors in content of databases are needed Procedures for maintaining high reliability of the Protein Data Bank are being proposed

Author Manuscript Author Manuscript Author Manuscript Structure. Author manuscript; available in PMC 2017 February 02.

Minor et al.

Page 10

Author Manuscript Author Manuscript Author Manuscript

Figure 1.

The cumulative number of structures in the PDB with a primary citation “to be published”. Only structures that have been marked “to be published” for more than two years since deposition are taken into consideration. Structures deposited by structural genomics (SG) consortia (red) represent roughly 50% of the “to be published” category.

Author Manuscript Structure. Author manuscript; available in PMC 2017 February 02.

Safeguarding Structural Data Repositories against Bad Apples.

Structural biology research generates large amounts of data, some deposited in public databases or repositories, but a substantial remainder never bec...
345KB Sizes 2 Downloads 7 Views