commentary

Improved reproducibility by assuring confidence in measurements in biomedical research


© 2014 Nature America, Inc. All rights reserved.

Anne L Plant, Laurie E Locascio, Willie E May & Patrick D Gallagher

Anne L. Plant is chief of the Biosystems and Biomaterials Division, Laurie E. Locascio is director of the Material Measurement Laboratory, Willie E. May is the acting director and Patrick D. Gallagher is the former director of the US National Institute of Standards and Technology, Gaithersburg, Maryland, USA; Patrick D. Gallagher is currently chancellor at the University of Pittsburgh, Pittsburgh, Pennsylvania, USA. e-mail: [email protected]

‘Irreproducibility’ is symptomatic of a broader challenge in measurement in biomedical research. From the US National Institute of Standards and Technology (NIST) perspective of rigorous metrology, reproducibility is only one aspect of establishing confidence in measurements. Appropriate controls, reference materials, statistics and informatics are required for a robust measurement process. Research is required to establish these tools for biological measurements, which will lead to greater confidence in research results.

A number of recent reports in the peer-reviewed literature1,2 and in the popular press3,4 have discussed irreproducibility of results in biomedical research. Some of these articles suggest that the inability of independent research laboratories to replicate published results has a negative impact on the development of, and confidence in, the biomedical research enterprise. One outcome of these concerns is the effort to test the reproducibility of academic research using independent laboratories (http://reproducibilityinitiative.org/#/). But irreproducibility can arise from a number of factors in the measurement process, and it is not the only indicator of measurement quality. One reason for irreproducibility in attempts to replicate an experiment is that different experiments are inadvertently being performed (i.e., there are critical experimental differences that are not accounted for). In this case, irreproducibility can actually add a great deal of knowledge about the biological process under investigation, the analytical methods or the experimental system. In a rush to ensure reproducibility, we may overlook opportunities to recognize the underlying sources of irreproducibility, which may be inherently important for understanding the biological process and how to measure it.

It is also important to appreciate that reproducibility does not guarantee accuracy. A good measurement should meet a number of criteria: accuracy, precision, robustness, specificity, and a well-defined detection limit and response function (Table 1). Although attention to all of these criteria is required to fully establish confidence in any analytical measurement, biomedical research measurements often present additional unique challenges. These include the large number of parameters involved in many experiments, the difficulty of developing good reference materials, and the challenge of distinguishing biological variability from measurement uncertainty in the absence of knowledge of ‘ground truth’. These issues are discussed below.

Characterizing the experimental system
Biomedical research often uses complex starting materials and many manual experimental steps, leading to the reality that most experiments are poorly defined. In the absence of reported specifications, the value of the reagents, instrumentation and computational tools used is often taken for granted when it shouldn’t be. There can be many possible variables in an experiment, and these may not be noticed in the absence of specific characterization experiments, robustness testing or interlaboratory comparison studies.
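One of the Table 1 criteria, the limit of detection, is often operationalized as the blank (negative-control) mean plus three standard deviations of replicate blank measurements. A minimal sketch with hypothetical numbers (this illustrates the common "blank + 3σ" convention, not a procedure described in this article):

```python
from statistics import mean, stdev

# Hypothetical replicate readings of a blank (negative control)
blank = [102.0, 98.5, 101.2, 99.8, 100.5, 97.9]

# A common convention: limit of detection = blank mean + 3 * blank SD
lod = mean(blank) + 3 * stdev(blank)

# A sample reading is only meaningful if it clears the LOD
sample = 112.4
print(f"LOD = {lod:.1f}; sample is {'above' if sample > lod else 'below'} the detection limit")
```

The same replicate dispersion used here for the detection limit also serves as an estimate of measurement uncertainty, as noted in the table.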

Replicate measurements quantify the dispersion in analytical values, but other aspects of measurement (Table 1) are also critical for establishing confidence in the data. The accuracy of an assay, for example, can be tested by comparing different protocols and by employing orthogonal measurement methods, because different techniques are likely to be biased in different ways. A relevant study examined different fixation techniques, types of assays (imaging cytometry and western blotting) and reagents (different antibodies) to provide confidence in the accuracy of the measurement of myosin light chain diphosphorylation and its relationship to cell spreading in individual cells5.

Confidence in experiments involving cell lines should include authentication that the cell line is indeed what it is assumed to be. A documentary standard now exists that describes testing human cell lines for short tandem repeat (STR) markers to unambiguously assign the identity of the line6. NIST has recently reported STR markers that will allow the authentication of monkey7 and mouse8 lines as well. After decades of research and thousands of papers published on the basis of incorrect assumptions9,10, authentication of human lines is now required by some journals. Testing and reporting details about affinity reagents is also being recommended11. To encourage the reporting of critical information for particular experiments, communities have come together to develop consensus requirements. These efforts

nature methods | VOL.11 NO.9 | SEPTEMBER 2014 | 895


Table 1 | Key elements of a good measurement

Accuracy
  Description: The measurement delivers the true value of the intended analyte (i.e., the measurand).
  Best practice: Test your experimental observation using orthogonal analytical methods. Use well-defined reference materials to check instrument response and method validity.

Precision
  Description: Repeatability (replicates in series) and reproducibility on different days and in different labs.
  Best practice: Replicate the measurement in your own lab, perhaps with different personnel. Have another lab perform the experiment. Participate in an interlaboratory comparison study.

Robustness
  Description: Lack of sensitivity to unintended changes in experimental reagents and protocols.
  Best practice: Test different sources of reagents, fixation conditions, incubation times, cell densities and analysis software.

Limit of detection
  Description: Given the noise in the measurement, the level below which the response is not meaningful.
  Best practice: Use appropriate positive and negative controls to determine background signal, and use dispersion in replicate measurements to determine measurement uncertainty.

Response function
  Description: Dependence of signal on systematic change in experimental condition.
  Best practice: Systematically test concentration or activity with reference samples; determine the range in which the assay is sensitive.

Specificity
  Description: The analytical result is not confounded by sample composition or physical characteristics.
  Best practice: When testing samples from different sources, ensure that apparent response differences are not due to sample matrix differences by using spike-in controls.

include Minimum Information About a Microarray Experiment (MIAME)12 and similar efforts for genome sequencing (MIGS), metagenome sequences (MIMS), molecular interactions (MIMix), cell assays (MICA), proteomics (MIAPE), flow cytometry (MIFlowCyt) and T-cell assays (MIATA); an effort to harmonize these projects is Minimum Information about a Biomedical or Biological Investigation (MIBBI)13. Each of these efforts has helped a discrete community with a discrete kind of experiment. Unfortunately, because of the lack of incentive to create more globally useful, natural-language approaches to developing metadata vocabularies, these efforts are not easily adaptable to technological and scientific advances, easily combined with data efforts in allied fields, or easily usable by most practitioners. Approaches have been proposed by NIST and others for alternatives that involve natural-language and semantic methods for collecting and searching experimental metadata14,15. A tool to make it easier for researchers to record and share protocol steps has been under development for some time (http://protocolnavigator.org/index.html). Yet there is still an absence of useful tools for the collection, organization and mining of experimental details and protocol metadata so that they can be reported, compared and searched. The support of research for developing better tools of this type could help us achieve deeper biological insight by making it easier to identify why experiments are not reproducible.

Although performing and reporting the necessary control experiments is easily recognized as part of good

experimental practice, it can be argued that it is impractical to carry this out to the most rigorous level for every experiment. But data that are not demonstrated to be accurate, within limits of detection and sensitivity, and robust with respect to trivial experimental conditions have limited value. The strength of the conclusions drawn from them should reflect that lack of certainty.

Reference materials and reference data
We can think of measurement as a controlled way of comparing an unknown quantity to a known quantity. Although reproducibility in an individual laboratory is an important indicator of good measurement, interlaboratory comparisons frequently uncover sources of uncertainty that were not otherwise apparent. Often, interlaboratory studies show that competent laboratories can fail to achieve the same results even when the experimental systems are relatively simple. A key to establishing interlaboratory concordance is the use of reference materials.

Consider the relatively simple measurement of the concentration of lead in water. A reference sample with an accurately known amount of lead can be used to calibrate the measurement in any laboratory. Such a reference material can help identify uncertainty associated with dilution steps or instrumentation bias. The use of reference materials for biological measurements can be more challenging because biological materials are often less stable than inorganic materials and have more complex characteristics. The design of an appropriate biological reference material can be best addressed


by a community of experts. The creation of reference materials for microarrays and whole-genome sequencing (WGS) is an excellent example of the successful involvement of such a community, including NIST, in the thoughtful design of appropriate materials16,17. As a result of the External RNA Controls Consortium (ERCC), NIST supplies a standard reference DNA material (SRM 2374), which includes synthetic sequences designed to have no significant homology with known genomes and which can be transcribed into RNA in the user’s laboratory for benchmarking microarray results and for use as spike-in controls. Considerations that went into the design of this standard included the stability of DNA compared to that of RNA and the need to provide control sequences whose quantification would not be convoluted with that of sample sequences. A follow-on activity, organized by the Genome in a Bottle Consortium, will lead to a highly defined sequence for an entire human genome18.

Reference materials such as lead in water or RNA sequences for spike-in controls provide benchmarks for the technical performance of a measurement. Complex biological materials pose additional challenges in the design of reference materials, however. A complex measurand (the entity being measured), such as a cell-surface receptor, cannot be easily synthesized, so alternative approaches must be found. For instance, a fixed, stabilized cell line expressing CD4 has been developed to serve as a highly characterized reference material for calibrating CD4 cell-surface receptor expression by flow cytometry19. Many studies were necessary to


establish this material as reliably quantitative, including orthogonal measurements with mass spectrometry19 and establishing traceability to fluorescent beads through solutions with known fluorophore concentrations20. In addition, this material was determined to be biologically relevant by comparison to fresh lymphocytes19.

The design of reference materials for biological activity such as pluripotency or T-cell activation is perhaps still more challenging, and more emphasis on critical thinking about how to assure comparability of complex biological function is needed. For benchmarking a biological activity, it is tempting to compare a test response to the response from a ‘standard’ cell line. However, a living reference material, such as a cell line, is not immutable. It is likely to change over time and may provide different responses in different laboratories owing to subtle handling differences. Assuring the reliability of the response requires, at least in part, the ability to compare the responses of the instrumentation used for analyses. Even the lyophilized cell line expressing a determined amount of CD4 (ref. 19) as a benchmark for a flow cytometry experiment still requires evaluation of the linearity and dynamic range of the instrumentation used for the analysis, with a fluorescent reference material20. Similarly, confidence in microscopy measurements can be assured with reference materials and procedures that provide benchmarks of the linear response and dynamic range of the fluorescence microscope (http://www.nist.gov/mml/bbd/cell_systems/fluorescence-microscope-benchmarking.cfm). Other approaches to ensure comparability include the development of protocols that provide an internal reference, such as a ratiometric method under development at NIST to relate the intensity of antibody stain to total protein staining for the interlaboratory comparison of pluripotency markers.
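The linearity check described above amounts to fitting instrument response against reference samples of known value and confirming that a straight line explains the response over the working range. A minimal sketch with hypothetical numbers (this is an illustration of the general idea, not NIST's actual calibration procedure):

```python
# Hypothetical calibration: reference samples of known concentration
# (e.g., fluorophore solutions) versus measured instrument response.
conc = [0.0, 1.0, 2.0, 4.0, 8.0]            # known reference values
signal = [5.0, 104.0, 208.0, 398.0, 810.0]  # measured responses

# Ordinary least-squares fit of signal vs. concentration
n = len(conc)
mx, my = sum(conc) / n, sum(signal) / n
sxx = sum((x - mx) ** 2 for x in conc)
sxy = sum((x - mx) * (y - my) for x, y in zip(conc, signal))
slope = sxy / sxx
intercept = my - slope * mx

# Coefficient of determination: how well a line explains the response
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(conc, signal))
ss_tot = sum((y - my) ** 2 for y in signal)
r2 = 1 - ss_res / ss_tot

print(f"slope={slope:.1f}, intercept={intercept:.1f}, R^2={r2:.4f}")
```

A high R² over the tested range supports linearity; the range itself bounds the dynamic range within which measured samples can be trusted, and a nonzero intercept flags background signal to subtract.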
In addition to benchmarking physical measurements, researchers should evaluate the software used in analysis before reporting results. Test data of known characteristics can serve as ground truth and be used to verify accuracy. In biological imaging, for example, data sets that are accompanied by manual segmentation are publicly available for electron microscopy (http://www.ini.uzh.ch/~acardona/data.html) and for light microscopy (http://www.broadinstitute.org/bbbc/index.html). A NIST reference data set (http://sbd.nist.gov/image/cell_image.html)

of manually segmented cell images, collected as part of a study to compare the accuracy of common image-segmentation algorithms, is also available and provides replicate fields of view collected under various imaging conditions to assess the sensitivity of analysis to signal-to-noise ratio, contrast, resolution and other features21; this study showed large differences in the performance of different algorithms as compared to the differences in manual segmentation by two experts. Designing relevant reference materials and reference data that allow comparability of data in different laboratories through traceability of instrument function, and that allow benchmarking and validation of the accuracy of both the biological response and algorithms for data analysis, will be an ongoing research challenge for the next decade.

Valid interpretation of the data
Not having access to a known or ground-truth value adds a particular challenge to biomedical measurements, especially as data sets become very large and difficult to handle manually. It has thus become necessary to develop nontraditional ways of establishing measurement confidence. In the ongoing effort to establish criteria for confidence in variant calling from WGS data, for example, comparison of results with orthogonal platforms may prove insufficient for establishing accuracy because orthogonal platforms may be equally biased for some sequences; recent analysis of WGS has examined the use of many replicate sequence runs as well as data from different platforms to achieve a more reliable measurement of uncertainty18.

A uniquely important source of uncertainty in biomedical measurements is biological variability that results from real differences in response due to stochastic fluctuations in molecular events or to genetic differences.
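One simplified way to separate such biological variability from measurement noise is to estimate the technical variance from repeated measurements of the same specimens and subtract it from the total observed variance. The sketch below uses hypothetical numbers and the simplifying assumption that measurement noise is additive and independent of the biological signal; it is an illustration of the general idea, not a method from this article:

```python
from statistics import pvariance

# Hypothetical single-cell intensities: each cell in one field of view is
# measured twice (nondestructive imaging permits repeat measurement).
repeat_a = [10.2, 15.1, 8.9, 12.4, 20.3]   # first pass over the field
repeat_b = [10.6, 14.7, 9.3, 12.0, 19.8]   # second pass, same cells

# Instrument noise from paired repeats: Var(a - b) = 2 * var_tech
# when both repeats share the same underlying true value per cell.
diffs = [a - b for a, b in zip(repeat_a, repeat_b)]
var_tech = pvariance(diffs) / 2

# Observed cell-to-cell variance mixes biology with measurement noise
var_obs = pvariance(repeat_a)

# Under the additive-noise assumption, biology is the remainder
var_bio = var_obs - var_tech
print(f"technical={var_tech:.3f}, observed={var_obs:.3f}, biological={var_bio:.3f}")
```

Here the paired repeats show that almost all of the observed cell-to-cell dispersion is biological rather than instrumental; without the repeat measurements, the two contributions could not be distinguished.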
Measurement uncertainty and biological variability are convolved, and without a determination of the uncertainty in the measurement technology itself, biological variability cannot be determined. A nondestructive method, such as microscopic imaging of individual cells, allows multiple sampling of the same field of view to provide measurement of instrument noise. In the development of destructive methods such as single-cell transcript measurements, a number of approaches for determining

measurement uncertainty are being considered, including the use of the ERCC spike-in material22–25, Bayesian statistics of drop-out events22 and unique molecular identifiers in the primers for counting individual transcripts25. The potential of single-cell transcriptomics for studying biological heterogeneity and molecular pathways will no doubt drive further research in establishing confidence in these measurements.

In conclusion, establishing a process that allows us to evaluate the confidence we have in our data is critical for drawing appropriate conclusions, generating knowledge, achieving predictive models and reusing the data for further analysis. This is all the more crucial in the face of new technologies enabling the collection of large amounts of data at the molecular and phenotypic scales. The resulting large ‘omics’ data sets offer unprecedented opportunities for data-driven hypothesis building and for advancing our understanding of mechanisms in biological processes. But unless those data are described thoroughly and the process by which their meaning is derived is clearly articulated, they may not be very valuable, and their reuse may waste vast amounts of analytical time and lead to incorrect conclusions.

Irreproducibility may be the latest symptom of the challenges associated with translating basic research into products. At least part of the challenge of translation is being able to estimate the risk of investing in the commercial development of a research observation. This calculus is easier if observations are based on sufficiently careful measurements that enable the determination of whether the observations can be developed into a robust, well-understood and manufacturable product that can receive regulatory approval.
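The unique-molecular-identifier strategy mentioned earlier for single-cell transcript counting can be sketched briefly. Reads that share a gene and a UMI are collapsed to one molecule, making the count robust to amplification bias; the reads, gene names and UMI sequences below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical sequencing reads as (gene, UMI) pairs. PCR amplification
# produces duplicate reads, but each original transcript molecule carries
# a distinct UMI, so counting unique (gene, UMI) pairs counts molecules.
reads = [
    ("GAPDH", "AACGT"), ("GAPDH", "AACGT"), ("GAPDH", "AACGT"),  # 3 reads, 1 molecule
    ("GAPDH", "TGCAA"),                                          # 1 read, 1 molecule
    ("ACTB", "CCGTA"), ("ACTB", "CCGTA"),                        # 2 reads, 1 molecule
]

umis_per_gene = defaultdict(set)
for gene, umi in reads:
    umis_per_gene[gene].add(umi)

# Molecule counts per gene, independent of how many times each was read
counts = {gene: len(umis) for gene, umis in umis_per_gene.items()}
print(counts)  # {'GAPDH': 2, 'ACTB': 1}
```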
An experimental process that provides the supporting evidence for the meaning of a measurement will undoubtedly enhance the benefit of our investments in basic research as more of our findings can be translated into products, services and knowledge. It is important to recognize that there is no process that can entirely eliminate measurement uncertainty, but a good measurement process can lead to an appropriate interpretation of the data. Collecting and reporting the control experiments and the details of the experimental and computational process will add confidence


to experimental results, improve the efficiency of follow-up studies and establish a more reasonable basis for conclusions. Such practices can also add to our knowledge in ways that simply reproducing an experiment cannot. Research that reexamines published findings with improved characterization of the experimental details, or that investigates the validity of experimental methods, should be recognized as a critical contribution to the scientific enterprise even if it does not directly result in a new discovery.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.


1. Begley, C.G. & Ellis, L.M. Nature 483, 531–533 (2012).

2. Mobley, A., Linder, S.K., Braeuer, R., Ellis, L.M. & Zwelling, L. PLoS ONE 8, e63221 (2013).
3. Lehrer, J. The truth wears off. New Yorker (13 December 2010).
4. Naik, G. Scientists’ elusive goal: reproducing study results. Wall Street Journal (2 December 2011).
5. Bhadriraju, K., Elliott, J.T., Nguyen, M. & Plant, A.L. BMC Cell Biol. 8, 43 (2007).
6. ANSI/ATCC. Authentication of human cell lines: standardization of STR profiling (ASN-0002-2011) (ANSI/ATCC, 2011).
7. Almeida, J.L., Hill, C.R. & Cole, K.D. BMC Biotechnol. 11, 102 (2011).
8. Almeida, J.L., Hill, C.R. & Cole, K.D. Cytotechnology 66, 133–147 (2014).
9. Hughes, P., Marshall, D., Reid, Y., Parkes, H. & Gelber, C. Biotechniques 43, 575–586 (2007).
10. Nardone, R.M. Biotechniques 45, 221–227 (2008).
11. Anonymous. Nat. Methods 10, 367 (2013).
12. Brazma, A. et al. Nat. Genet. 29, 365–371 (2001).
13. Taylor, C.F. et al. Nat. Biotechnol. 26, 889–896 (2008).


14. Plant, A.L., Elliott, J.T. & Bhat, T.N. BMC Bioinformatics 12, 487 (2011).
15. Liu, K., Hogan, W.R. & Crowley, R.S. J. Biomed. Inform. 44, 163–179 (2011).
16. The External RNA Controls Consortium. Nat. Methods 2, 731–734 (2005).
17. Zook, J.M., Samarov, D., McDaniel, J., Sen, S.K. & Salit, M. PLoS ONE 7, e41356 (2012).
18. Zook, J.M. et al. Nat. Biotechnol. 32, 246–251 (2014).
19. Wang, L. et al. Cytometry A 81, 567–575 (2012).
20. Wang, L. et al. Cytometry B Clin. Cytom. 72, 442–449 (2007).
21. Dima, A.A. et al. Cytometry A 79, 545–559 (2011).
22. Kharchenko, P.V., Silberstein, L. & Scadden, D.T. Nat. Methods 11, 740–742 (2014).
23. Brennecke, P. et al. Nat. Methods 10, 1093–1095 (2013).
24. Wu, A.R. et al. Nat. Methods 11, 41–46 (2014).
25. Grün, D., Kester, L. & van Oudenaarden, A. Nat. Methods 11, 637–640 (2014).
