Nicotine & Tobacco Research, 2015, Vol. 17, No. 11, 1295–1296
doi:10.1093/ntr/ntv131
Advance Access publication June 15, 2015

Editorial

Guidelines on Statistical Reporting at Nicotine & Tobacco Research

Marcus R. Munafò PhD1, E. Paul Wileyto PhD2

1School of Experimental Psychology, University of Bristol, Bristol, United Kingdom; 2Department of Biostatistics and Epidemiology, University of Pennsylvania, Philadelphia, PA

Corresponding Author: Marcus Munafò, PhD, School of Experimental Psychology, University of Bristol, Bristol BS8 1TU, United Kingdom. Telephone: 44-117-9546841; Fax: 44-117-9288588; E-mail: [email protected]

© The Author 2015. Published by Oxford University Press on behalf of the Society for Research on Nicotine and Tobacco. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Despite the broad array of tools available, we in the biomedical sciences have become limited in the way we do statistical inference, with null hypothesis significance testing (NHST) the dominant approach and small P values the measure of our worth. However, there are a number of important limitations to NHST, at least as it is often implemented in practice, and these limitations are beginning to provoke a response across the literature in our field. One journal has even banned the use of NHST entirely.1 Here we describe some of these limitations, and offer recommendations for better reporting of statistical analyses that we would encourage our authors to adopt.

Tressoldi and colleagues2 describe three main limitations of NHST. First, NHST focuses on rejection of the null hypothesis at a prespecified level of probability (typically 5%, or .05). The implicit assumption, therefore, is that we are only interested in answering “Yes!” to questions of the form “Is there a difference from zero?”. What if we are interested in cases where the answer is “No!”? Since the null hypothesis is hypothetical and unobserved, NHST does not allow us to conclude that the null hypothesis is true. Second, P values can vary widely when the same experiment is repeated (eg, because the participants you sample will be different each time); in other words, the P value gives very unreliable information about whether a finding is likely to be reproducible. This is important in the context of recent concerns about the poor reproducibility of many scientific findings.3 Third, with a large enough sample size we will always be able to find something to write about, although it may be a clinically meaningless or theoretically uninteresting relationship. No observed distribution is ever exactly consistent with the null hypothesis, and as sample size increases, so does the likelihood of being able to reject the null. This means that trivial differences (eg, a difference in age of a few days) can lead to a P value less than .05 in a large enough sample, despite the difference having no practical or theoretical importance.

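Both the instability of P values under replication and the “significance” of trivial effects in very large samples are easy to demonstrate by simulation. The short Python sketch below is our illustration, not part of the analyses discussed here; it assumes only numpy and scipy are available, and all of the numbers (effect sizes, sample sizes, the seed) are arbitrary choices.

# Sketch: two limitations of NHST, illustrated by simulation.
# All numeric choices below are arbitrary and for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# 1. P values vary widely across exact replications of one experiment.
#    True effect d = 0.4, n = 30 per group (power roughly one-third).
print("P values from 10 exact replications:")
for _ in range(10):
    a = rng.normal(0.0, 1.0, 30)
    b = rng.normal(0.4, 1.0, 30)
    print(f"  P = {stats.ttest_ind(a, b).pvalue:.3f}")

# 2. With a large enough sample, a trivial difference is "significant".
#    True effect d = 0.02 (eg, a difference in age of a few days).
a = rng.normal(0.00, 1.0, 200_000)
b = rng.normal(0.02, 1.0, 200_000)
print(f"Trivial effect, n = 200,000 per group: "
      f"P = {stats.ttest_ind(a, b).pvalue:.4f}")
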
The last point is particularly important, and relates to two other limitations. Namely, the P value does not tell us anything about how large an effect is (ie, the effect size), or about how precise our estimate of the effect size is. Any measurement will include a degree of error, and it is important to know how large this is likely to be.

Fortunately, these limitations can be addressed, at least in part, by simple changes to the way we report and interpret the results of our statistical analyses. One is the routine reporting of effect sizes and confidence intervals. The confidence interval is essentially a measure of the reliability of our estimate of the effect size, and can be calculated for different ranges. A 95% confidence interval, for example, represents the range of values that contains the true effect size in the underlying population 95% of the time. Reporting the effect size and associated confidence interval therefore tells us both the magnitude of the observed effect and the degree of precision associated with that estimate. The reporting of effect sizes and confidence intervals is recommended by a number of scientific organizations, including the American Psychological Association and the International Committee of Medical Journal Editors.

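As a concrete illustration of this style of reporting, the Python fragment below computes the mean difference between two groups, its 95% confidence interval, and a standardized effect size (Cohen’s d). The data and group labels are invented for the example.

# Sketch: report an effect size and its 95% confidence interval,
# rather than a bare P value. Data are hypothetical.
import numpy as np
from scipy import stats

treatment = np.array([12.1, 9.8, 11.5, 13.2, 10.9, 12.7, 11.1, 10.4])
control = np.array([9.9, 10.2, 8.7, 11.0, 9.5, 10.8, 9.1, 9.6])

n1, n2 = len(treatment), len(control)
diff = treatment.mean() - control.mean()

# Pooled standard deviation and Cohen's d.
sp = np.sqrt(((n1 - 1) * treatment.var(ddof=1)
              + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
d = diff / sp

# 95% CI for the mean difference, from the t distribution.
se = sp * np.sqrt(1 / n1 + 1 / n2)
t_crit = stats.t.ppf(0.975, n1 + n2 - 2)  # two-sided, df = n1 + n2 - 2
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

print(f"Mean difference = {diff:.2f} "
      f"(95% CI {ci_low:.2f} to {ci_high:.2f}), d = {d:.2f}")
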
Another is to adopt a more Bayesian perspective in the interpretation of analyses (without necessarily fully adopting Bayesian methods of statistical inference). The conventional frequentist classification of results from statistical tests into “significant” and “nonsignificant” is based strictly on the null hypothesis, and is made without reference to the distribution under the alternative hypothesis. Under the null hypothesis, the P value is distributed as a continuous uniform random variable. Bayesians can become nearly apoplectic trying to explain that the P value only has meaning when interpreted using the prior distribution of the alternative hypothesis. Ioannidis4 has suggested using the positive predictive value as a measure of the importance of a positive finding. This is a measure of the fraction of positive results that are truly positive, derived from the epidemiological concepts of sensitivity and specificity, where we substitute 1−β (ie, power) for sensitivity and 1−α for specificity. Both a low prior probability for the alternative hypothesis (eg, exploratory research or unplanned subgroup analyses) and low statistical power (eg, small studies) reduce the positive predictive value, and therefore the likelihood that a statistically significant (ie, P < .05) result is actually true.5

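The arithmetic here is simple Bayes’ theorem, and is worth making explicit. The Python function below is our sketch of the positive predictive value calculation described by Ioannidis,4 substituting 1−β for sensitivity and 1−α for specificity; the prior probabilities and power values are hypothetical choices for illustration.

# Sketch: positive predictive value (PPV) of a "significant" finding,
# following Ioannidis (2005): power (1 - beta) plays the role of
# sensitivity, and 1 - alpha the role of specificity.

def ppv(prior: float, power: float, alpha: float = 0.05) -> float:
    """Fraction of P < alpha results that reflect a true effect."""
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)

# Well-powered confirmatory test of a plausible hypothesis:
print(f"prior = 0.50, power = 0.80: PPV = {ppv(0.50, 0.80):.2f}")  # ~0.94
# Small study (low power) of the same hypothesis:
print(f"prior = 0.50, power = 0.20: PPV = {ppv(0.50, 0.20):.2f}")  # ~0.80
# Underpowered, exploratory subgroup analysis (low prior):
print(f"prior = 0.05, power = 0.20: PPV = {ppv(0.05, 0.20):.2f}")  # ~0.17
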
We encourage our authors to take a more considered approach to statistical inference, and in particular to consider the use of measures of effect size and precision (eg, confidence intervals) when reporting the results of statistical tests, and to avoid reference to statistical “significance”. Another advantage of this approach is that it obviates the need for some of the tortured language we see used to describe P values in the .05 to .10 range (eg, https://mchankins.wordpress.com/2013/04/21/still-not-significant-2/). Instead, results can be reported with reference to effect size and precision, and interpreted in the context of not only the strength of evidence against the null hypothesis indicated by the P value obtained, but also the prior probability for the alternative hypothesis and the likely statistical power of the study. In our view, this approach will provide our readers with a more complete and nuanced interpretation of the results reported in our journal.

Declaration of Interests

None declared.

References

1. Trafimow D, Marks M. Editorial. Basic Appl Soc Psych. 2015;37(1):1–2. doi:10.1080/01973533.2015.1012991.
2. Tressoldi PE, Giofrè D, Sella F, Cumming G. High impact = high statistical standards? Not necessarily so. PLoS ONE. 2013;8(2):e56180. doi:10.1371/journal.pone.0056180.
3. Munafò M, Noble S, Browne WJ, et al. Scientific rigor and the art of motorcycle maintenance. Nat Biotechnol. 2014;32(9):871–873. doi:10.1038/nbt.3004.
4. Ioannidis JP. Why most published research findings are false. PLoS Med. 2005;2(8):e124. doi:10.1371/journal.pmed.0020124.
5. Button KS, Ioannidis JPA, Mokrysz C, et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci. 2013;14(5):365–376. doi:10.1038/nrn3475.
