Journal of Sport & Exercise Psychology, 2015, 37, 449–461 http://dx.doi.org/10.1123/jsep.2015-0015 © 2015 Human Kinetics, Inc.

ORIGINAL RESEARCH

Things We Still Haven’t Learned (So Far)

Andreas Ivarsson,1,2 Mark B. Andersen,1 Andreas Stenling,3 Urban Johnson,1 and Magnus Lindwall4

1Halmstad University; 2Linnaeus University; 3Umeå University; 4Gothenburg University

Null hypothesis significance testing (NHST) is like an immortal horse that some researchers have been trying to beat to death for over 50 years but without any success. In this article we discuss the flaws in NHST, the historical background in relation to both Fisher’s and Neyman and Pearson’s statistical ideas, the common misunderstandings of what p < .05 actually means, and the 2010 APA publication manual’s clear, but most often ignored, instructions to report effect sizes and to interpret what they all mean in the real world. In addition, we discuss how Bayesian statistics can be used to overcome some of the problems with NHST. We then analyze quantitative articles published over the past three years (2012–2014) in two top-rated sport and exercise psychology journals to determine whether we have learned what we should have learned decades ago about our use and meaningful interpretations of statistics.

Keywords: effect sizes, interpretation of statistics, null hypothesis significance testing, real-world meaning

Author note: Andreas Ivarsson is with the Center of Research on Welfare, Health and Sport (Centrum för Forskning om Välfärd, Hälsa och Idrott), Halmstad University, Halmstad, Sweden, and with the Department of Psychology, Linnaeus University, Växjö, Sweden. Mark B. Andersen is with the Center of Research on Welfare, Health and Sport (Centrum för Forskning om Välfärd, Hälsa och Idrott), Halmstad University, Halmstad, Sweden. Andreas Stenling is with the Department of Psychology, Umeå University, Umeå, Sweden. Urban Johnson is with the Center of Research on Welfare, Health and Sport (Centrum för Forskning om Välfärd, Hälsa och Idrott), Halmstad University, Halmstad, Sweden. Magnus Lindwall is with the Department of Food and Nutrition, and Sport Science & Department of Psychology, University of Gothenburg, Gothenburg, Sweden. Address author correspondence to Andreas Ivarsson at [email protected].

The Basic and Applied Social Psychology (BASP) 2014 Editorial emphasized that the null hypothesis significance testing procedure (NHSTP) is invalid, and thus authors would be not required to perform it (Trafimow, 2014). However, to allow authors a grace period, the Editorial stopped short of actually banning the NHSTP. The purpose of the present Editorial is to announce that the grace period is over. From now on, BASP is banning the NHSTP.
—Trafimow and Marks (2015, p. 1)

The editorial policy change in the above quote represents a radical move in one journal’s approach to the use (or in this case, nonuse) of a long-standing family of statistical procedures employed to determine the import of the results of quantitative studies in psychology. Such policy shifts may bring about, for some, images of defenestrating both the baby and the bath water. Calls for reassessments, and even discarding, of null hypothesis

significance testing (NHST) have been around for decades (e.g., Meehl, 1967), but the practice remains ubiquitous. Critiques of scientific shibboleths, such as NHST, over the last half-century have not been successful in changing research communities’ statistical practices. To effect substantial change, it will probably take initiatives by the power brokers and gatekeepers of research (e.g., journal editors, editorial boards, research funding agencies) to institute policies, such as the BASP example above, which nudge (or maybe shove) investigators in new directions in analyzing and interpreting the results of their studies. In this article, we supply some arguments as to why such nudging and change are needed in sport and exercise psychology quantitative research.

We have shamelessly appropriated and modified the title of Cohen’s (1990) classic article, and we have done so with great admiration and respect. The content of this article owes much to Cohen and his 50+ years of admonishing researchers in psychology to pay attention to the results of interest in quantitative research (e.g., how large is the difference?, how strong is the relationship?, what do the results mean?). We are not saying anything that has not been discussed many times in debates about statistical inference in psychology research going back to the time of William Gosset’s arguments with Sir Ronald Fisher


in the early part of the 20th century (for a summary, see Boland, 1984). We are repeating age-old debates because it seems that sport and exercise psychology researchers, in many cases, use statistics without acknowledging the philosophical as well as the practical assumptions that are related to various methods. Over the years, different statistical perspectives have been mixed together into the procedure we call NHST. The current form of NHST has many philosophical as well as practical problems, and in using this procedure a lot of potentially useful research has been misinterpreted as useless, or even cast aside completely, whereas a substantial amount of research with little or trivial relevance to the real world of sport and exercise gets published.

As one example, it is quite common that if a test statistic reaches the desired level of p < .05 then the conclusion is that there is an effect. If it doesn’t, then there is no effect. This decision making about the existence or nonexistence of an effect has been heavily criticized (e.g., Cohen, 1994; Gigerenzer, 2004; Goodman, 1999; Wilkinson, 2014) as missing the main questions in research: how much? how big? how strong? (i.e., effect sizes) are the differences or associations between groups or variables? But how common is this problem? For example, Fanelli (2010) showed that psychology is the scientific discipline that publishes the highest proportion of articles reporting support for rejecting null hypotheses (approximately 90% of published quantitative articles) of any scientific discipline, and that there is an increasing trend toward reporting support for hypotheses, based on NHST decisions, in psychology (Fanelli, 2012).

NHST versus Effect Sizes

We are the first to confess that we have also used the NHST approach in our own research. To provide an overview of what researchers in sport and exercise psychology have based their conclusions on, Andersen, McCullagh, and Wilson (2007) performed a critical survey of three of the major journals in the field. They found that of the 54 quantitative articles included in their study, 44 reported effect sizes (i.e., how much?), but reporting effect sizes is no more meaningful than providing p values if effect-size metrics are not interpreted as to what they may mean in terms of the real world of sport and exercise settings. Only seven (13%) of the 54 quantitative studies discussed metrics other than p values (i.e., effect sizes) when interpreting and discussing the meaningfulness or practical significance of their results.

The Publication Manual of the American Psychological Association (APA) in its last three iterations (APA, 1994, 2001, 2010) has moved from suggesting that authors report effect sizes, to demanding they do so, and in the latest version the manual states that the general principle to be followed “is to provide the reader with enough information to assess the magnitude of the observed effect” (p. 34, italics added). We interpret “assess the magnitude” to mean “to interpret the

size of the effect and what it might mean in the real world.” Also, the manual states that “Your interpretation of the results should take into account . . . (d) the effect size observed” (p. 35). Most of the researchers in the Andersen et al. (2007) study complied with the reporting of effect sizes (44 of 54 studies), but such reporting seemed like lip service to APA demands, and only seven of the studies interpreted the effect sizes found in terms of real-world meaning. Such incomplete analyses and interpretations of research findings leave readers and other consumers of research without enough information to determine whether the results of a study have any real import. In line with this argument, Seddon and Scheepers (2012) stated that authors need to fully interpret their results, and it is their obligation to do so.

A couple of years after Andersen et al. (2007), Hagger and Chatzisarantis (2009) reached similar conclusions, stating that, “Research reports still do not include effect size statistics as standard and confine the discussion of findings to statistical significance alone rather than commenting on ‘practical significance’” (p. 511). This existence/nonexistence of an effect decision based on p values and the incomplete reporting of effect sizes (often reported but with no interpretations of what the magnitudes of the effects might mean) have been the standard in sport and exercise quantitative investigations. This narrow-focused and incomplete reporting practice needs to be changed. Given the Andersen et al. (2007) and Hagger and Chatzisarantis (2009) findings, together with other issues concerning NHST and its applications, there seems to be a lot of room for improvement.

The aims of this article are as follows: (a) to review and discuss the limitations and problems associated with the use of NHST in evaluating the results in experimental and correlational studies in sport and exercise psychology research; (b) to determine the extent to which articles published in high-impact sport and exercise psychology journals use NHST (existence/nonexistence of an effect versus interpreting effect-size metrics) to draw conclusions about results; and (c) to critically discuss how researchers can work to come to reasonable decisions when discussing the size, meaningfulness, and practical significance of their results. We begin with some background on the convoluted path NHST has taken in the history of science.

The Rise of Significant Flaws in Statistical Reasoning

Even though the NHST procedure is taught in most university statistics courses, as well as presented in statistics textbooks, many researchers use it without acknowledging or questioning the philosophical assumptions upon which the test procedure rests. For example, the way that the p value is used in the NHST framework was not the way that it was originally meant to be used (Nuzzo, 2014). Because NHST is a hybrid of two fundamentally different statistical testing models, a number of different logical as well as practical problems are associated with its use (Hager, 2013; Perezgonzalez, 2015; Schneider,


2015). This limited understanding of NHST has had substantial negative influence on the interpretations and conclusions that are drawn from this test procedure (Hubbard & Bayarri, 2003; Ziliak & McCloskey, 2008).

The essence of the modern NHST can be summarized in a few steps (Gigerenzer, 2004). First, the researcher sets up a statistical null hypothesis (e.g., there is no difference between the means for the experimental and control groups on the dependent variable of interest). Second, the researcher chooses a conventional value for rejecting the null hypothesis (e.g., p < .05). Third, the results are usually reported as p < .05, p < .01, or p < .001, although the APA manual (2010, 6th ed.) encourages researchers to report the exact p value. Finally, the underlying assumption is that this procedure should always be performed in statistical testing to determine whether an effect exists or not.

As we implied earlier, what many researchers, as well as sport and exercise psychology students, do not know is that the modern NHST procedure is a hybrid of two different schools of statistics (Gigerenzer et al., 1989; Hager, 2013; Hubbard & Bayarri, 2003; Lehmann, 2011; Schneider, 2015). More specifically, the modern NHST is based on concepts from both Fisher’s null hypothesis testing and the Neyman–Pearson decision theory (for extended descriptions of the different schools of statistics see Fisher, 1922, 1925, 1935a, 1959; Neyman & Pearson, 1928a, 1928b, 1933). This mix of two different statistical procedures leaves the modern NHST procedure burdened with philosophical paradoxes that, in many cases, have influenced (often detrimentally) researchers’ interpretations of the results obtained from this hybrid testing procedure (Gigerenzer, 2004; Hubbard & Bayarri, 2003).

When comparing the Fisher and Neyman–Pearson perspectives on statistical testing, several substantial differences between the methods emerge. Perhaps the most striking is that Neyman–Pearson decision theory emphasizes deductive reasoning whereas Fisher’s null hypothesis testing is based on inductive inference (Fisher, 1955; Neyman, 1957). The more deductive reasoning of Neyman–Pearson refers to Neyman’s (1935) idea that it should be possible to construct a theory of mathematical statistics based solely upon a theory of probability. The fundamental act, in the case of hypothesis testing, consists of either rejecting a hypothesis or (provisionally) accepting it (Lehmann, 1993). Fisher (1935b), on the other hand, promoted the concept of inductive inference, and argued that the theory of probability Neyman (1935) promoted, in which the population is fully known, does not apply in many cases. Fisher (1935b) argued that researchers are often attempting to draw inferences from the particular (the sample) to the general (the population), and that such inferences are uncertain.

The second major difference between the two test procedures is that Fisher, in comparison with Neyman and Pearson, never discussed rejecting or accepting a hypothesis. Instead, Fisher stressed that the p value indicates the amount of evidence against the null hypothesis (H0). From the Neyman–Pearson perspective, the researcher should

set up two statistical hypotheses (H1, H2). In defining or determining the rejection region for each hypothesis, α, β, and sample size should be used. If data fall into the rejection region of H1, the researcher should accept H2.

The third major difference is that the Neyman–Pearson decision theory was designed only for a long sequence of experiment repetitions under constant conditions (Neyman, 1957). If one applies decision theory testing to one single study, the assumptions for Type I and Type II errors will not hold (Berger, 2003). For Fisher, on the other hand, null hypothesis testing was developed for use with single studies because the p value does not rely on multiple tests (Seidenfeld, 1979). Finally, whereas Fisher disagreed with using both preselected significance levels and a deductive procedure, Neyman and Pearson considered data-derived p values to be subjective and biased (Neyman & Pearson, 1933).

When reviewing these two statistical procedures that have been mixed into today’s NHST, it is clear that this modern test procedure, frequently applied in sport and exercise psychology, is based on several philosophical paradoxes. This hybrid formally follows Neyman–Pearson ideas, but philosophically rests on Fisher’s concepts (Johnstone, 1986). For example, many scholars teach Fisher’s idea of evidence against the null hypothesis side by side with Neyman and Pearson’s concepts of an alternative hypothesis, Type I and Type II errors, and power. To sum up, the creation of this hybrid has led to both researchers and students in social sciences misusing statistical procedures and drawing conclusions that are flawed because of misunderstandings about the central aspects of the test procedures they are using (Shaver, 1993).
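To make the hybrid procedure concrete, the following minimal sketch (our own illustration in Python, with simulated data, not taken from any study discussed here) shows the three ingredients that routinely get blended in practice: an exact p value in Fisher's sense, a fixed-alpha accept/reject decision in the Neyman–Pearson sense, and the effect size the APA manual asks for in addition.

```python
# Minimal sketch of the hybrid NHST routine described above (simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(loc=50.0, scale=10.0, size=30)      # simulated control-group scores
treatment = rng.normal(loc=55.0, scale=10.0, size=30)    # simulated treatment-group scores

res = stats.ttest_ind(treatment, control, equal_var=False)  # Welch t test

# Fisher-style reporting: the exact p value as graded evidence against H0
print(f"t = {res.statistic:.2f}, exact p = {res.pvalue:.3f}")

# Neyman-Pearson-style decision: reject or retain H0 at a preset alpha
alpha = 0.05
print("reject H0" if res.pvalue < alpha else "retain H0")

# What the APA manual asks for in addition: an effect size (Cohen's d here)
pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
d = (treatment.mean() - control.mean()) / pooled_sd
print(f"Cohen's d = {d:.2f}")
```

Neither the decision nor the exact p value answers the "how much?" question; the last two lines do, and they still need a substantive interpretation in context.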

What Does the p Value Really Tell Us?

Statistical significance at any level does not prove medical, scientific, or commercial importance. We all claim to know this but then we go and do the opposite: we base life decisions on a level of statistical significance.
—Ziliak (2010, p. 324)

One concern is the belief that p is the probability that H0 is true and that 1 – p is the probability that the alternative hypothesis is true (Stang, Poole, & Kuss, 2010). The p value is not the probability that H0 is true. To reject H0 at a significance level of .05 is only to say that, given H0 is true, the likelihood of obtaining the observed data is low, and that is exactly all we can say. We have no idea about the likelihood of H0 being true. This confusion highlights the misconceptions that have occurred due to the mixing of the two theories. Hubbard and Bayarri (2003) addressed the problem of researchers who discuss the evidence (ps) and errors (αs) as similar. More specifically, the p value will not indicate the risk of committing a Type I error. Because the α and p values were developed in two different theoretical frameworks, they should not be mixed together.


Nickerson (2000) stated that, “the p value is the probability of obtaining a value of test statistics, say, D, as large as the one obtained conditionally on H0 being true: p(D | H0)” (p. 247). This statement clearly points out that the p value is not based only on observed (empirical) results, but also on unrealistic assumptions. More specifically, “the p value depends on data that were never observed” (Wagenmakers, 2007, p. 783). The unobserved data are the distributions of specific test statistics (e.g., t, F, z) based on replicated data sets, which are generated under the null hypothesis. Wagenmakers also stated, “The p value is the sum . . . over values of the test statistic that are at least as extreme as the one that is actually observed” (p. 782). Because the p value is partly a function of data that were never observed, several researchers have argued that the p value violates the conditionality principle, which states that conclusions drawn from statistics should be based on observed data (see Berger & Berry, 1988).

Another problem related to the p value is that several researchers have shown that p values can have large variances under several different conditions (Boos & Stefanski, 2011; Gelman & Stern, 2006). Halsey, Curran-Everett, Vowler, and Drummond (2015) showed that p values have problematically high variability in conditions where statistical power is below .90. The high variability associated with the p value is related to substantial problems replicating findings in multiple studies.

Concerning the practical problems with the p value, Goodman (2008) showed that p often overstates the evidence against the null hypothesis. Furthermore, as Sellke, Bayarri, and Berger (2001) stated, “knowing that the data are ‘rare’ under H0 is of little use unless one determines whether or not they are also ‘rare’ under H1” (pp. 65–66). All these arguments can be summarized in a quote from Rodgers (2010): “rejecting the null [hypothesis] does not provide logical or strong support for the alternative,” and “failing to reject the null [hypothesis] does not provide logical or strong support for the null [hypothesis]” (p. 3). Fisher, who integrated the p value into his test method, never discussed rejecting or accepting the null hypothesis. Rather, he emphasized that the p value should be used to discuss evidence against the null hypothesis (Fisher, 1925).

A common misconception is that the .05 level is a dichotomous breaking point: the point upon which a yes–no decision is made (Cohen, 1990). Because the p value does not inform us about the magnitude or the meaningfulness of differences or associations, it is not a solid foundation to use when deciding whether results are significant outside of a purely statistical realm. Unfortunately, many research articles and conference presentations base decisions about whether effects were obtained only on p values, without critically discussing both effect sizes and the contexts in which the studies were conducted. Masicampo and Lalande (2012), who reviewed results presented in three of the most well-respected psychology journals, supported this argument and found that the most common range for the p values reported was between .045 and .050.

The result from this critical review might indicate that the institutionalized cutoff (p < .05) is a benchmark in the decision of what to publish (Masicampo & Lalande, 2012). The publication bias that arises when the probability of publication of a study depends on the p value is often referred to as “the file drawer effect” (i.e., “filing away” or tossing out studies with nonsignificant findings; Scargle, 2000).

Because the p value gives information only about “the probability of seeing something as weird as or weirder than you actually saw” (Christensen, 2005, p. 121), a fair question, raised by Rosnow and Rosenthal (1989), concerns why .05 was set to be the number that separates the existence versus nonexistence of an effect. The main point of their conclusion was that it is impossible to decide the meaning of the difference between two close p values (e.g., .05 and .06). This discussion turns out to be even more interesting if we consider the two statistical procedures from which the contemporary NHST has borrowed its components. Neither Fisher nor Neyman and Pearson believed in a fixed cutoff value that could be applied as a general rule. In his early work, Fisher emphasized 5% (i.e., p = .05) to be a conventional value for statistically significant evidence against the null hypothesis. Later in his life, however, he wrote that no scientific worker has a fixed level of significance at which he rejects hypotheses (Lehmann, 2011). Although Fisher argued that the exact p value should be used to guide the discussion about the meaning of the result, he also emphasized the importance of taking the context into consideration. Considering the context of where the study took place and the specific variables measured was also important in Neyman–Pearson decision theory. In that framework, the context and specific variables should be taken into consideration when deciding the rejection regions, based on a cost–benefit calculation done for the two hypotheses, which should be stated before conducting the study. In short, the notion of fixed cutoff values that can be generalized to almost all contexts does not seem to be grounded in any theoretical framework at all.

Another potential limitation of using the p value concerns the assumptions related to the generalization of results obtained from inferential statistics (e.g., t value, p value). Because inferential statistics are based on probability theory aimed at drawing inferences from the specific sample to the population, the assumption of representativeness of the data is fundamental (Seddon & Scheepers, 2012). To gain a representative sample from a population the selection of participants in a study should be random (Schneider, 2015). Even Fisher (1947) acknowledged the importance of random samples: “All our conclusions . . . rest on a process of random sampling, without it our tests of significance would be worthless” (pp. 435–436). It is therefore important that the researcher either shows the sample’s representativeness in relation to the population or informs the readers that there is limited basis for the claim of representativeness (Seddon & Scheepers, 2012).
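As a concrete illustration of the definition discussed above, the following simulation sketch (our own example with assumed numbers) builds the reference distribution of t statistics from data sets generated under H0 — exactly the "data that were never observed" — and compares the resulting p value with the usual analytic one. Nothing in either number refers to P(H0 | data).

```python
# Simulation sketch: a p value is P(statistic at least this extreme | H0), computed
# from hypothetical replications generated under the null hypothesis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 25
observed = rng.normal(loc=0.4, scale=1.0, size=n)              # one observed sample
t_obs = observed.mean() / (observed.std(ddof=1) / np.sqrt(n))  # one-sample t statistic

# "Data that were never observed": many samples drawn assuming H0 (true mean = 0)
null_samples = rng.normal(loc=0.0, scale=1.0, size=(20_000, n))
t_null = null_samples.mean(axis=1) / (null_samples.std(ddof=1, axis=1) / np.sqrt(n))

p_simulated = np.mean(np.abs(t_null) >= abs(t_obs))   # two-sided p from the simulation
p_analytic = 2 * stats.t.sf(abs(t_obs), df=n - 1)     # the textbook analytic p
print(f"simulated p = {p_simulated:.3f}, analytic p = {p_analytic:.3f}")
```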


Finally, another issue that has been raised concerning the null hypothesis and p values in the NHST framework is related to an implausible logic that is present in the procedure (Cohen, 1994). The logic of NHST rests on a concept called modus tollens. Modus tollens is the rule of logic illustrated in the following example: “(a) If A then B, (b) Not B observed, (c) Therefore not A” (Gill, 1999, p. 653). Translating this rule of logic into hypothesis testing, the statements would be “(a) If H0 is true then the data will follow an expected pattern, (b) The data do not follow the expected pattern, (c) Therefore H0 is false” (Gill, 1999, p. 653). The problem with using this logic in NHST is that the statements, which are certain in the example above, are replaced with probabilistic statements: “(a) If the null hypothesis is correct, then these data are highly unlikely, (b) These data have occurred, (c) Therefore the null hypothesis is highly unlikely” (Cohen, 1994, p. 998). The problem with this violation of the rule of logic is that “it accommodates both positive and negative outcomes, so that it loses its power for enabling a researcher to evaluate any hypothesis” (Schneider, 2015, p. 422).
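A small numerical sketch (with assumed, made-up probabilities; not from the original article) shows why the probabilistic syllogism fails: data can be "rare" under H0 and still leave H0 as the more probable hypothesis once its prior plausibility and the likelihood of the data under H1 are taken into account.

```python
# Assumed, illustrative numbers only: why "data unlikely under H0" does not imply "H0 unlikely".
p_d_given_h0 = 0.04   # the data are rare under H0 (the kind of result called "significant")
p_d_given_h1 = 0.10   # but the data are not much more likely under the alternative
prior_h0 = 0.80       # and H0 was quite plausible before the study

prior_h1 = 1.0 - prior_h0
posterior_h0 = (p_d_given_h0 * prior_h0) / (
    p_d_given_h0 * prior_h0 + p_d_given_h1 * prior_h1
)
print(f"P(H0 | D) = {posterior_h0:.2f}")  # about .62: H0 remains the more probable hypothesis
```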

The Paradox of Power Analysis

There has been an ongoing debate between editors and editorial boards of peer reviewed journals whether to accept articles with low statistical power. There have been calls by some journals to refuse articles that do not contain power analyses to determine the sample sizes required to show statistically significant differences.
—Hudson (2003, p. 105)

The above quotation illustrates the increasing discussion about sample sizes and power calculations within research. But is this discussion relevant? It depends on what question the researcher wants to investigate. The discussion of power (and Type I and Type II errors) leads us back to Neyman–Pearson decision theory. In this particular theory, the question of interest was to determine acceptable cutoffs for Type I and Type II errors over a long series of tests. If the researcher is interested in that type of question, the Neyman–Pearson procedure fits the bill. Looking into the area of sport and exercise psychology, researchers, in most cases, perform single studies. Therefore, the assumption of the Neyman–Pearson decision theory is not fulfilled, and a power calculation seems somewhat superfluous. Going back to Neyman and Pearson’s framework, it was developed for quality control with long-run random sampling procedures from the same population (Gigerenzer, 2004). This procedure is probably not congruent with most studies in sport and exercise psychology.

The paradox of calling for power analyses is that they are used to help researchers estimate sample and effect sizes to ensure that results reach p < .05. But, in line with the NHST procedure, we then use p values (arguably the least informative and most easily manipulated of inferential statistics), and they tell us little about the major questions of research (e.g., How much?). So why go to the trouble of a power analysis if

it helps ensure a result that offers limited information for answering our central questions? One solution for moving away from problematic power calculations is to use likelihood ratios or Bayes factors (Wagenmakers et al., 2014). By the use of Bayes factors, it is possible to “assess the extent to which a particular data set provides evidence for or against the null hypothesis” (Wagenmakers et al., 2014, Conclusion, para. 3). For an extended description of the Bayes factors see the section later in this article under the heading “So Where Do We Go From Here.” Another solution was presented by Gelman and Carlin (2014, p. 641), who emphasized that researchers should “design calculations in which: (a) the probability of an estimate being in the wrong direction (Type S [sign] error) and (b) the factor by which the magnitude of an effect might be overestimated (Type M [magnitude] error or exaggeration ratio) are estimated.” For elaboration about this procedure, see Gelman and Carlin (2014).
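The Gelman and Carlin (2014) idea can be illustrated with a short simulation (our own sketch, with assumed numbers for the true effect and the standard error of a noisy design): among the estimates that happen to reach p < .05, how often is the sign wrong (Type S), and by how much is the magnitude exaggerated (Type M)?

```python
# Design-calculation sketch in the spirit of Gelman and Carlin (2014); assumed numbers.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_effect = 0.1   # assumed small true effect (in the outcome's units)
se = 0.15           # assumed standard error of the estimate (an underpowered design)
n_sim = 100_000

estimates = rng.normal(loc=true_effect, scale=se, size=n_sim)
significant = np.abs(estimates / se) > stats.norm.ppf(0.975)   # the p < .05 estimates

power = significant.mean()
type_s = np.mean(np.sign(estimates[significant]) != np.sign(true_effect))   # wrong sign
type_m = np.mean(np.abs(estimates[significant])) / abs(true_effect)         # exaggeration

print(f"power = {power:.2f}, Type S error = {type_s:.3f}, exaggeration ratio = {type_m:.1f}")
# In a design like this, the estimates that reach p < .05 overstate the effect severalfold
# and occasionally point in the wrong direction.
```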

The Rise of Empty Effect Sizes

The sixth edition of the APA manual (APA, 2010) states, “it is almost always necessary to include some measure of effect size in the Results section” (p. 34). Effect sizes could be used as a first step to discuss the real-world meaning of the results (Grissom & Kim, 2012) because they indicate the magnitude of the effects that we are always interested in when investigating relationships between variables or differences between groups. But effect sizes are usually in metrics (e.g., d, r) that are not immediately interpretable. Some effect sizes, such as the mean difference between groups on a behavioral variable (e.g., how much faster, on average, an intervention group runs 800 m vs. a control group), are usually immediately understandable. For example, an average difference of .20 s between intervention and control groups of 800-m runners would most likely fall in the trivial range, but a mean difference of 5 s would probably get coaches interested in the intervention.

Reporting effect sizes has now become a practice of the majority of researchers within the field of sport and exercise psychology. For example, Andersen et al. (2007) found that 81% of experimental and correlational/descriptive studies published in three of the major sport and exercise psychology journals during 2005 reported effect sizes. But as we mentioned earlier, only 13% of the studies interpreted effect sizes in terms of the real-world meaning of the magnitude of those effects. To report just an effect size without interpretation adds, in principle, little to the results (Grissom & Kim, 2012). Cohen (1988) emphasized that researchers have to critically interpret their effect sizes, taking the specific context where the study was performed into consideration to be able to discuss the meaningfulness of the results. It is not enough to just report the effect sizes (and maybe add that “we used the small, medium, and large conventions suggested by Cohen”). The conclusion one can draw from Andersen et al. (2007) is that a lot of effect size reporting was going


on 10 years ago, but much of it was not interpreted so readers could make judgments about the meaning of the magnitudes of the effects.

When discussing effect sizes and how to interpret them, there are some practical issues that are important for researchers to consider. For example, some design issues have been identified that influence the potential sampling variance (Thompson, 2002). To obtain an effect size that is not an overestimate, it is important to adjust for the potential sampling variance (Thompson, 2006). There are a few design features that generally increase the sampling variance, which in turn will usually generate positively biased effect estimates. For example, small sample sizes, a large number of measured variables, and small population effect sizes will all be related to increased sampling variance and more positively biased effect sizes (Thompson, 2002). In line with this state of affairs, Ivarsson, Andersen, Johnson, and Lindwall (2013) showed that there can be large differences between adjusted and nonadjusted effect sizes (up to 68%) depending on sample sizes and other variables. To adjust effect sizes, several formulas have been proposed (for a practical guide on how to adjust an effect size, see Wang & Thompson, 2007). This issue is of particular interest for the field of sport and exercise psychology because many experimental studies have small samples.

Another important point stated in the APA manual (APA, 2010) is that researchers should report effect sizes with 1 degree of freedom (e.g., Cohen’s d) because they are generally more interpretable than omnibus effect sizes such as η2 and Cohen’s f. Nevertheless, the effect size that is most common for studies that include more than two means is η2 or ηp2 (Henson, 2006). In Andersen et al.’s (2007) review, they found that of the 44 studies reporting effect sizes, 12 (27%) used only omnibus effect indicators (e.g., η2). The problem with omnibus effect sizes (with more than 1 df) is that they do not answer the most important question: What is the magnitude of the effect between any combination of two groups? Researchers should complement their omnibus effect sizes with 1-df effect sizes (e.g., Cohen’s d) and then interpret them.

An additional issue concerning the interpretation of effect sizes in sport and exercise psychology is that most of the metrics we use in psychology research are arbitrary, meaning that we do not know how a given score (or unit of change) is related to the underlying psychological construct being measured (Blanton & Jaccard, 2006), such as motivation or self-confidence. Not being able to measure how, for example, the mean difference in scores between two groups is related to the actual construct of interest makes it challenging for researchers to draw conclusions based on their statistics (e.g., effect sizes). This problem is important for researchers to acknowledge when working with arbitrary metrics because it could have substantial impact on their conclusions. One potential solution to this problem is to link test scores (from arbitrary metrics) to meaningful real-world events, such as behaviors (Blanton & Jaccard, 2006).
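The small-sample bias issue can be illustrated with a short sketch (simulated data, our own example): an unadjusted omnibus estimate (eta squared) next to an adjusted one (omega squared), plus a 1-df Cohen's d for the pairwise comparison a reader actually cares about.

```python
# Sketch (simulated data): unadjusted vs. adjusted omnibus effect sizes, plus a 1-df d.
import numpy as np

rng = np.random.default_rng(11)
groups = [rng.normal(loc=m, scale=10.0, size=12) for m in (50.0, 52.0, 55.0)]  # three small groups

k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ss_total = ss_between + ss_within
ms_within = ss_within / (n_total - k)

eta_sq = ss_between / ss_total                                          # unadjusted omnibus
omega_sq = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)  # adjusted omnibus

# A 1-df effect size for the two groups of interest (here, group 3 vs. group 1)
g1, g3 = groups[0], groups[2]
pooled_sd = np.sqrt((g1.var(ddof=1) + g3.var(ddof=1)) / 2)
d = (g3.mean() - g1.mean()) / pooled_sd

print(f"eta^2 = {eta_sq:.3f}, omega^2 = {max(omega_sq, 0.0):.3f}, d (group 3 vs. 1) = {d:.2f}")
```

With only 12 participants per group, the unadjusted estimate is noticeably larger than the adjusted one, and neither number means anything for practice until it is translated back into the variables being measured.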

Can We Have Confidence in Confidence Intervals?

Even though the APA manual emphasizes the use of confidence intervals around various statistics, there has been some confusion about the meaning, as well as the interpretation, of these interval estimates (Morey, Hoekstra, Rouder, Lee, & Wagenmakers, in press). This issue is problematic because it is essential that researchers interpret statistics appropriately when drawing conclusions from their results (Hoekstra, Morey, Rouder, & Wagenmakers, 2014).

To understand the potential problems with the interpretation and use of confidence intervals it is important to look closely at their definition. The confidence interval was introduced in 1937 by Neyman (Morey et al., in press). The definition of the confidence interval is, “A X% confidence interval for a parameter θ is an interval (L[ower], U[pper]) generated by an algorithm that in repeated sampling has an X% probability of containing the true value of θ” (Morey et al., in press, para. 5). Given that the confidence interval is tied to the Neyman–Pearson framework, where repeated sampling is essential, the same assumptions that hold for the interpretation of Type I and Type II errors also hold for confidence intervals.

As stated earlier, there are mistakes in the use of confidence intervals that tend to be present in current research (Wagenmakers, Lee, Rouder, & Morey, 2015). One of the common mistakes is to believe that the “width of a confidence interval indicates the precision of our knowledge about the parameter” (Morey et al., in press, para. 13). Going back to the definition of the confidence interval, it is clear that this assumption is invalid. More specifically, the confidence interval contains no information about the distribution of parameter values, and therefore it is not appropriate to conclude that a parameter value placed in the center of the interval is more credible than a parameter value placed near the end of the interval (Kruschke, 2013). Another common misinterpretation related to confidence intervals (e.g., 95% CI), among both students and researchers, is that there is a 95% chance that the true parameter value lies somewhere between the lower and upper limits of that interval (Hoekstra et al., 2014). In addition, it is important to remember that the confidence interval was developed within the Neyman–Pearson perspective, and its definition and use are therefore valid only for a long sequence of replicated studies.

Interpretation Quagmires: The Map is Not the Territory

Data are imperfect indicators or representations of what the world is like. Just as a map is not the territory it describes, the statistical tables describing a program are not the program. That’s why they have to be interpreted.
—Patton (2012, p. 350)


In the previous paragraphs we emphasized the use of potentially informative statistics (such as effect sizes) to help us evaluate the meaningfulness of our results, but there is still more work to be done. More specifically, our own understanding and knowledge about the specific variables and the contexts where the study was performed, as well as the methodological underpinnings of our designs, are central factors guiding us toward reasonable answers to our research questions. The statistics we use are simplifications of the real world and may help us discuss the meanings of our results, but they cannot do the most important work for us.

Let us go back to the hypothetical performance-enhancement intervention experiment with 800-m runners mentioned before. It is a randomized controlled trial. It has one main dependent variable: running time for 800 m in real-world competition. After the intervention, the control group, on average, has not changed much on the dependent variable (i.e., pre- to post-mean difference = –.20 s), but the intervention group decreased, on average, running times by 5 s. We have a real-world variable here and an effect size (a mean difference score) that is easily interpretable. Most performance-enhancement sport psychologists (and coaches) would say, “This intervention looks quite effective; reject the null hypothesis, and accept the alternative hypothesis” (i.e., the performance-enhancement intervention was effective). Sounds good.

But the alternative hypothesis is only one of many possible hypotheses for behavioral changes stemming from taking part in a performance intervention. For example, one of the most robust findings in psychotherapy outcome research is that it is the quality of the therapeutic relationships between clients and practitioners that accounts for a substantial amount of variance in outcomes (Norcross, 2011). It may be that the helping relationship actually accounts for more variance in outcome than does the intervention. Athletes may have performed better because they felt cared for and appreciated by the person delivering the interventions, and those feelings led to down-regulating stress responses that then resulted in their being able to perform better. But we didn’t look at relationships. We often forget that the maps we use in research (e.g., psychological skills training, various dependent variables) are really types of palimpsests with other maps of different territories hidden beneath. In the world of human behavior and relationships, there are many lost, found, and undiscovered countries. Interpretation is almost always a best guess given the data.

Exploring Current NHST Use

The abovementioned discussions are not new; they have been around for many decades. For example, two of the giants in the field of psychology (Paul Meehl and William Rozeboom) back in the 1960s addressed many of the concerns and critiques that we have presented in this article (see Meehl, 1967; Rozeboom, 1960). But still most researchers use the NHST procedure without considering its shortcomings. For example, both Andersen

et al. (2007) and Hagger and Chatzisarantis (2009) found that the majority of researchers within the field of sport and exercise psychology did not use statistics in the ways they should when interpreting and discussing their findings. Since these two studies were conducted, even more researchers have addressed the problems with NHST (see, for example, Ivarsson et al., 2013; Wilkinson, 2014). For this study, we reviewed newly published articles in two top-rated journals in sport and exercise psychology to investigate whether researchers have gotten better at following the APA manual’s recommendations and applying proper procedures when interpreting their results.

Method

We reviewed all quantitative articles published in the 2012–2014 volumes of the two sport and exercise journals with the highest impact factors (IF) in 2014: the Journal of Sport & Exercise Psychology (JSEP; IF = 2.59) and Psychology of Sport and Exercise (PSE; IF = 1.77). Articles with traditional statistics, such as ANOVAs, t tests, and various correlation/regression analyses, were included in the review. Following the procedure applied by Andersen et al. (2007) of including only studies that apply classical hypothesis testing, we decided to not include papers using structural equation modeling (SEM) or multilevel modeling (MLM).

Data collected from each article were as follows: (a) reported power; (b) reported effect sizes; (c) reported omnibus effect sizes, or 1-df effect sizes, or both; (d) whether reported effect sizes were interpreted in terms of real-world meaning concerning the magnitudes of the effects or relationships, or authors instead relied on only NHST decisions (i.e., there is an effect, there is no effect) when discussing results; and (e) statistical perspective (i.e., frequentist or Bayesian). The frequentist paradigm associates probability with long-term frequency, does not allow for prior knowledge to be included in the analyses, and views parameters as fixed (i.e., there is one true population parameter) but unknown. The Bayesian paradigm associates probability with the subjective experience of uncertainty, allows for prior knowledge to be included in the analyses, and views parameters as unknown and random quantities that should therefore be described by a probability distribution (van de Schoot et al., 2014). In line with Andersen et al. (2007), we classified studies with only omnibus effect sizes as having reported effect sizes “even though the authors did not follow the general rules stated in the APA Manual” (p. 668).

Results

The articles included are the experimental and correlational studies from the two journals presented in the Method section. A total of 203 articles were found (JSEP, n = 81; PSE, n = 122). Of the studies included, 13% provided at least one power calculation, and 98% of the studies reported at least one type of effect size.


Concerning the type of effect sizes reported, 37% of the total number of studies included only omnibus effect sizes; 33% reported only 1-df effect sizes, and 30% reported both. Of the studies included in the review, 91% based the conclusions about the study results (and the effects) only on NHST procedures (yes/no, existence of an effect/nonexistence of an effect) and did not discuss the real-world meanings of their effect sizes. Furthermore, all studies except one (99.5%) applied a frequentist perspective in their statistical procedures. For specific details for each journal and year, see Table 1.


Discussion

We had no hypotheses, but we had hopes that researchers within the field of sport and exercise psychology would have become better at reporting results since the time of the Andersen et al. (2007) and Hagger and Chatzisarantis (2009) studies. What we found was that there has been little change. Even if the current use of NHST is philosophically confused (Gigerenzer, 2004) and does not supply answers to the questions most researchers are interested in (i.e., what is the real-world meaning of the results?; Goodman, 1999), a majority of the quantitative studies (91%) published in these 2012–2014 issues used only this procedure to draw conclusions about their results. This finding shows that even if we know that it is problematic to draw conclusions based on the NHST procedures, we still use them when it comes to interpreting and drawing conclusions about how our findings are relevant or meaningful. There is so much we all have potentially been taught, starting at least with Cohen (1962), but we still haven’t

learned what we should have learned and applied decades ago. It is almost as if it’s too much to ask researchers, who are the experts within their specific areas, to help the readers of their works understand their results based on sound assumptions and interpretations of the central question of “how much?” So what do we have to do to increase the probability that researchers will discuss the size and meaningfulness of their results? One of the most important tasks is to have gatekeepers of research (e.g., journal editors, editorial boards, research funding agencies) emphasize the importance of interpreting the results, in order to change long-standing practices in research reporting. One example of such a move is the radical policy shift mentioned at the start of this article in the journal Basic and Applied Social Psychology (Trafimow & Marks, 2015).

So Where Do We Go From Here?

We suggest the following guidelines that, we hope, will give us a push in the right direction toward breaking down the almost institutionalized wall of NHST use that has kept us from the real (and useful) world of statistical inference.

To answer the “how much?” question, researchers could, for example, use effect sizes. In working with effect sizes, it is important to state that on their own they cannot say much without the researchers’ interpretations of the effects in relation to contexts and real-world implications (e.g., Cohen, 1988). Lately, there has been a discussion related to the potential difficulties that researchers can have in understanding and interpreting traditional effect sizes, such as Cohen’s d and Hedges’ g (Brooks, Dalal, & Nolan, 2014).

Table 1  An Analysis of the Statistical Reporting, Decision Making, and Perspectives for JSEP and PSE by Years, 2012 Through 2014

                                      JSEP 2012   JSEP 2013   JSEP 2014   PSE 2012   PSE 2013   PSE 2014   Total
                                      (n = 18)    (n = 38)    (n = 25)    (n = 44)   (n = 42)   (n = 36)   (N = 203)
Power (%)                                  6           8          32           5         14         19         13
Effect sizes (%)                          95         100         100          95         95        100         98
  One degreeᵃ (%)                         22          26          32          39         45         30         33
  Omnibusᵃ (%)                            34          50          28          34         19         42         37
  Combinedᵃ (%)                           44          24          40          27         36         28         30
Decision based on NHST p value (%)        94          92          88          93         79         94         91
Perspectives:
  Frequentist (%)                        100         100         100          98        100        100       99.5
  Bayesian (%)ᵇ                            0           0           0           2          0          0        0.5

ᵃPercentages were calculated based on the articles that reported effect sizes.
ᵇNot included in the table above are three articles published in JSEP (i.e., Doron & Gaudreau, 2014; Jackson, Gucciardi, & Dimmock, 2014; Mahoney, Gucciardi, Ntoumanis, & Mallett, 2014) and one article published in PSE (i.e., Constantinou, Fenton, & Pollock, 2014) that used Bayesian statistics. These articles, however, used SEM or Bayesian networks analysis, and studies with SEM and MLM (or other nontraditional statistical analyses) were excluded from our analysis (see Method section). For more information about the selection process, contact the first author.


To solve potential problems, nontraditional effect indicators, such as common language effect sizes (CL; McGraw & Wong, 1992), have been developed. In the calculation of the CL, the effect is translated into the probability that “a score sampled at random from one distribution will be greater than a score sampled from some other distribution” (McGraw & Wong, 1992, p. 361). Because the nontraditional effect estimates have been found to be easier for researchers to interpret (Brooks et al., 2014; McGraw & Wong, 1992), we suggest that more researchers could use those statistics to increase understandability when interpreting the effects of their results.
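A minimal sketch of the CL statistic (simulated data loosely echoing the 800-m example used earlier; not the authors' code) computes it both under a normality assumption and directly from the proportion of pairwise comparisons.

```python
# Common language effect size (McGraw & Wong, 1992): P(random score from one group
# exceeds a random score from the other). Simulated data for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
intervention = rng.normal(loc=148.0, scale=4.0, size=40)  # hypothetical 800-m times (s)
control = rng.normal(loc=151.0, scale=4.0, size=40)

# Normal-theory CL: Phi((M1 - M2) / sqrt(s1^2 + s2^2)); here, P(control time > intervention time)
cl_normal = stats.norm.cdf(
    (control.mean() - intervention.mean())
    / np.sqrt(control.var(ddof=1) + intervention.var(ddof=1))
)

# Distribution-free version: proportion of all pairs in which the control runner is slower
cl_empirical = np.mean(control[:, None] > intervention[None, :])

print(f"CL (normal theory) = {cl_normal:.2f}, CL (empirical) = {cl_empirical:.2f}")
```

With these simulated values the CL comes out at roughly .70, which reads as "about a 7 in 10 chance that a randomly chosen intervention runner beats a randomly chosen control runner" — the kind of statement coaches can act on more readily than a d value.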

Second, statistics courses at universities have to go beyond calculations of statistical tests and the use of programs such as SPSS to determine whether results are meaningful or not. Statistical inference is only a small part within the larger area of scientific inference, and it is important for researchers to be able to discuss statistics in relation to, for example, methodological and design issues. More specifically, statistics should be presented as a toolbox with different tools to work with for different problems (Gigerenzer, 2004). But to be able to select the right tool and draw reasonable conclusions, it is critical that we understand the underlying assumptions of the tests and statistical perspectives we choose. When a person understands what a tool can (and cannot) do then he or she will be able to work with this tool in a proper, productive, and meaningful way.

Third, perhaps we should start to use Bayesian reasoning (and statistics) because it tells us one thing that we want to know, that is, the probability of the null (or alternative) hypothesis being accurate (Miles & Banyard, 2007). Even William Gosset (the inventor of the Student’s t test) discussed his results in the light of a “strong Bayesian undertone” (Hanley, Julien, & Moodie, 2008, p. 65). The differences between frequentist approaches (where NHST has been one of the most common statistical procedures) and Bayesian perspectives are grounded in both epistemological and practical differences (Berger, 2003). In Bayesian analyses it is—in comparison with frequentist analyses—possible to specify the prior distribution on the parameter of interest. The prior distribution represents knowledge (i.e., degree of belief based on previous empirical studies) about the parameter of interest before the new data are included in the analysis. In the next step of the analysis, the prior distribution is combined with the parameter value from the current data to produce the posterior distribution (e.g., Kruschke, 2013). The differences were clearly pointed out by Fisher (1925), who stated that, “the theory of inversed probability [today discussed as Bayesian statistics] is founded upon an error and must be wholly rejected” (p. 10). More specifically, by using Bayesian statistics (e.g., Bayes factors, credibility intervals), it is possible, in comparison with NHST, to compute the probability of a hypothesis conditionally on observed data (Rouder, Speckman, Sun, & Morey, 2009). In Bayesian inference, probability illustrates the degree of belief in, for example, a hypothesis (Andraszewicz et al., 2015).

One of the advantages of conducting hypothesis testing within the Bayesian perspective, in comparison with the frequentist framework, is that it is possible to compare the degrees of belief for two competing hypotheses (Wagenmakers, 2007). In Bayesian hypothesis testing, one statistical model or hypothesis could be more or less plausible before the data are collected (i.e., a priori). The prior plausibility can be translated into the prior model odds through the ratio p(H0)/p(H1), where p(H1) is the degree of belief that we have in H1 based on our prior knowledge (Andraszewicz et al., 2015). When researchers have collected and analyzed the data (D), the posterior odds in favor of the hypotheses (H0 vs. H1) are given by p(H0 | D) / p(H1 | D). By comparing these ratios, it is possible to calculate the change in odds from prior to posterior once the data have been collected (Wagenmakers, 2007). This change in odds can be illustrated by a quantity called the Bayes factor (Andraszewicz et al., 2015). The Bayes factor is used to compare an alternative hypothesis with a null hypothesis. More specifically, the Bayes factor can “indicate the relative strength of evidence for two theories” (Dienes, 2014, p. 4). The estimated Bayes factor can take on any number between 0 and ∞ (Dienes, 2014). A Bayes factor below 1.0 indicates greater evidence for H1, whereas Bayes factors above 1.0 represent increased evidence for H0. If the Bayes factor is 1.0, then neither hypothesis is favored (Wagenmakers, 2007).

There are several ways to calculate the Bayes factor. One of the most established calculation procedures is to use the Bayesian information criterion (BIC; Bollen, Ray, Zavisca, & Harden, 2012). BIC can be defined as “a measure of model fit, based on the likelihood of the observed data given an optimal set of parameter values” (Nathoo & Masson, in press, Practically useful Bayesian methods, para. 6). For more information about how BIC can be both determined and used to calculate the Bayes factor, see Bollen et al. (2012) and Wagenmakers (2007).
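The BIC route can be sketched in a few lines (simulated data, equal prior odds assumed; our own illustration, not the authors' code), using the approximation BF01 ≈ exp[(BIC_H1 − BIC_H0)/2] described by Wagenmakers (2007).

```python
# BIC approximation to the Bayes factor (Wagenmakers, 2007); simulated data, assumed priors.
import numpy as np

rng = np.random.default_rng(4)
n = 40
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)   # outcome with a modest simulated effect of x

def bic_normal(residuals, k_params):
    """BIC for a normal model: n*log(RSS/n) + k*log(n); constants cancel in differences."""
    n_obs = residuals.size
    rss = np.sum(residuals ** 2)
    return n_obs * np.log(rss / n_obs) + k_params * np.log(n_obs)

# H0: intercept-only model; H1: linear effect of x
resid_h0 = y - y.mean()
slope, intercept = np.polyfit(x, y, 1)
resid_h1 = y - (intercept + slope * x)

bic_h0 = bic_normal(resid_h0, k_params=1)
bic_h1 = bic_normal(resid_h1, k_params=2)

bf01 = np.exp((bic_h1 - bic_h0) / 2)   # > 1 favors H0, < 1 favors H1
prior_odds = 1.0                       # assumed: H0 and H1 equally plausible a priori
posterior_odds_h0 = bf01 * prior_odds
print(f"BF01 = {bf01:.2f}, posterior odds (H0:H1) = {posterior_odds_h0:.2f}")
```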


The Bayes factor can also be computed through the use of Markov chain Monte Carlo (MCMC) methods (for more information about these procedures see Lodewyckx et al., 2011; Morey, Rouder, Pratte, & Speckman, 2011). The MCMC method can be used to offer an alternative procedure to Bayesian hypothesis testing (Kruschke, Aguinis, & Joo, 2012). With the MCMC method, an approximation of the posterior distribution is performed through resampling of parameter values from the posterior distribution (Kruschke, 2013; Kruschke et al., 2012). In the MCMC procedure, a representative sample of credible parameter values is also generated from the posterior distribution (Kruschke, 2013). To evaluate the results from MCMC estimations, posterior predictive p values (PPP) can be calculated (Zyphur & Oswald, 2015). The PPP “reflect the proportion of times that the observed data are more probable than the generated data” (Zyphur & Oswald, 2015, p. 402). As a general rule, a PPP close to .50 indicates good model fit because the generated data are just as probable as the observed data (Asparouhov & Muthén, 2010). The parameter result from an MCMC procedure is a distribution mean accompanied by a 95% credibility interval. “The credibility interval shows the most probable range of values for the effect, allowing statements that are impossible with frequentist methods” (Zyphur & Oswald, 2015, p. 394). Bayesian statistics, however, will not do all, or even the most important part, of the work for researchers. When applying a Bayesian perspective, it is still important for researchers to discuss the real-world meanings of the results (for an extended introduction and description of Bayesian statistics see Kruschke, 2011).

When discussing the potential advantages of Bayesian perspectives over frequentist approaches, another issue that has been addressed concerns decisions about the success of a replication attempt based on a previous study (Verhagen & Wagenmakers, 2014). In the frequentist paradigm, the decision is often based on comparison of p values or effect sizes with corresponding confidence intervals. These procedures have both been criticized for various reasons related to, for example, power issues (for an extended discussion see Verhagen & Wagenmakers, 2014). To overcome these limitations, Verhagen and Wagenmakers (2014) suggested a Bayesian test designed to “quantify the success or failure of a replication attempt” (p. 1458). In this test two competing hypotheses (H0 and H1) are specified and compared. The second hypothesis (H1), in this test procedure, holds the assumption that the effect is consistent with the effect found in the original study (for more information see Verhagen & Wagenmakers, 2014).

As with all statistical approaches, Bayesian statistics have been criticized. Gelman (2008) highlighted two objections that have been frequently discussed regarding the Bayesian approach. First, the reproduction of data that is performed in, for example, the MCMC procedure has been questioned. To manufacture data instead of using only observed data has been identified as a shortcoming because the manufactured data may not reflect the importance of individual differences (Little, 2006). The second, and probably the most critical, objection against Bayesian statistics is related to the idea of using the prior and posterior distributions. The opponents of Bayesian statistics argue that the use of prior distributions makes the approach subjective because researchers could choose to include the prior values that are suitable for their purposes (e.g., Gelman, 2008; Jordan, 2011). It is therefore important that researchers, when including prior values, clearly specify where the prior estimates came from (van de Schoot et al., 2014). When investigators supply details about priors, the research consumer has the opportunity to take this information into consideration when reading the authors’ interpretations of the results from the study.

Another procedure that researchers can use to deal with the potential problem of subjective prior values is to apply a variant of Bayesian estimation called empirical Bayes (EB). In comparison with the traditional Bayesian estimation procedure, where the prior distribution is specified in advance, the EB procedure emphasizes that the prior distribution is specified based on calculations using the observed data (for more information see Atchadé, 2011). Bayesian-oriented researchers have, however, questioned the EB approach because it is a

Another way to deal with the potential problem of subjective prior values is to apply a variant of Bayesian estimation called empirical Bayes (EB). Whereas the traditional Bayesian procedure specifies the prior distribution in advance, the EB procedure derives the prior distribution from calculations on the observed data (for more information, see Atchadé, 2011). Bayesian-oriented researchers have, however, questioned the EB approach because it is a hybrid between frequentist and Bayesian methods and can therefore be seen as an approximation of Bayesian analysis (for more elaboration on this issue, see Bayarri & Berger, 2004). To stay closer to the foundations of the Bayesian perspective, an extension of EB, named Bayes empirical Bayes (BEB), has been developed. The main difference between the two methods is that BEB, unlike EB, bases the specification of the prior distribution on both subjective information and information from the observed data (Samaniego & Neath, 1996).

Fourth, researchers should internalize the following: "when used in statistics, significant does not mean important or meaningful, as it does in everyday speech" (Higgs, 2013, p. 6). A statistical significance test should not be used to decide whether a result is meaningful. Researchers, who are the experts, should use statistics to help explain to the nonexpert reader why the result is or is not meaningful and how the size of the effect translates to the real world. To do that properly, we need to go outside the NHST paradigm, because the tools within that approach will not help us with this task.

Fifth, journal editors should be encouraged to be more critical when evaluating the logic and reasoning about statistics in the manuscripts submitted to their journals. Because editors are among the most respected people in their academic disciplines, they have the power, and a great opportunity, to make a difference by helping researchers use their contextual knowledge instead of focusing only on whether an analysis produced a statistically significant result (see Trafimow & Marks, 2015).
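As one way of translating an effect into everyday language rather than leaning on a significance label, the following sketch computes Cohen's d (Cohen, 1988) and the common language effect size (McGraw & Wong, 1992; Brooks, Dalal, & Nolan, 2014) in Python. The scores, group labels, and function names are invented for illustration only.

import math
import statistics
from itertools import product

def cohens_d(group_1, group_2):
    # Standardized mean difference using the pooled standard deviation.
    n1, n2 = len(group_1), len(group_2)
    s_pooled = math.sqrt(((n1 - 1) * statistics.variance(group_1) +
                          (n2 - 1) * statistics.variance(group_2)) / (n1 + n2 - 2))
    return (statistics.mean(group_1) - statistics.mean(group_2)) / s_pooled

def common_language_effect_size(group_1, group_2):
    # Probability that a randomly drawn score from group 1 exceeds one from
    # group 2 (McGraw & Wong, 1992), counting ties as half.
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x, y in product(group_1, group_2))
    return wins / (len(group_1) * len(group_2))

# Hypothetical post-intervention anxiety scores for two groups.
intervention = [12, 14, 11, 13, 15, 12, 10, 13]
control = [14, 16, 12, 15, 13, 17, 11, 14]

print(f"Cohen's d = {cohens_d(intervention, control):.2f}")
print(f"P(control score > intervention score) = {common_language_effect_size(control, intervention):.2f}")

With these invented scores, the output supports a statement a nonexpert can grasp directly, for example, that about 7 times out of 10 a randomly chosen control athlete reports a higher score than a randomly chosen intervention athlete.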

Conclusion

Over the last five decades, there have been numerous critiques of, and concerns about, both the philosophy and the use of NHST procedures. Yet most researchers in the field of sport and exercise psychology still use NHST as the primary basis for discussing their results. Because our goals, which in most cases reside outside pure theory or model testing, are to show the practical meanings of our research, we have to change the way we draw conclusions. We therefore hope that more members of the research community will work together to increase the quality of our interpretations and discussions of the real significance of our research findings.

References

American Psychological Association. (2010). Publication manual of the American Psychological Association (6th ed.). Washington, DC: Author.
Andersen, M.B., McCullagh, P., & Wilson, G. (2007). But what do the numbers really tell us? Arbitrary metrics and effect size reporting in sport psychology research. Journal of Sport & Exercise Psychology, 29, 664–672.

Andraszewicz, S., Scheibehenne, B., Rieskamp, J., Grasman, R., Verhagen, J., & Wagenmakers, E.-J. (2015). An introduction to Bayesian hypothesis testing for management research. Journal of Management, 41, 521–543. doi:10.1177/0149206314560412
Asparouhov, T., & Muthén, B. (2010). Bayesian analysis of latent variable models using Mplus (Mplus Technical Report). Retrieved from http://www.statmodel.com
Atchadé, Y.F. (2011). A computational framework for empirical Bayes inference. Statistics and Computing, 21, 463–473. doi:10.1007/s11222-010-9182-3
Bayarri, M.J., & Berger, J.O. (2004). The interplay of Bayesian and frequentist analysis. Statistical Science, 19, 58–80. doi:10.1214/088342304000000116
Berger, J.O. (2003). Could Fisher, Jeffreys and Neyman have agreed on testing? Statistical Science, 18, 1–32. doi:10.1214/ss/1056397485
Berger, J.O., & Berry, D.A. (1988). Statistical analysis and the illusion of objectivity. American Scientist, 76, 159–165.
Blanton, H., & Jaccard, J. (2006). Arbitrary metrics in psychology. The American Psychologist, 61, 27–41. doi:10.1037/0003-066X.61.1.27
Boland, P.J. (1984). A biographical glimpse of William Sealy Gosset. The American Statistician, 38, 179–183. doi:10.1080/00031305.1984.10483195
Bollen, K.A., Ray, S., Zavisca, J., & Harden, J.J. (2012). A comparison of Bayes factor approximation methods including two new methods. Sociological Methods & Research, 41, 294–324. doi:10.1177/0049124112452393
Boos, D.D., & Stefanski, L.A. (2011). P-value precision and reproducibility. The American Statistician, 65, 213–221. doi:10.1198/tas.2011.10129
Brooks, M.E., Dalal, D.K., & Nolan, K.P. (2014). Are common language effect sizes easier to understand than traditional effect sizes? The Journal of Applied Psychology, 99, 332–340. doi:10.1037/a0034745
Christensen, R. (2005). Testing Fisher, Neyman-Pearson, and Bayes. The American Statistician, 59, 121–126. doi:10.1198/000313005X20871
Cohen, J. (1962). The statistical power of abnormal-social psychology research: A review. Journal of Abnormal and Social Psychology, 65, 145–153. doi:10.1037/h0045186
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312. doi:10.1037/0003-066X.45.12.1304
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003. doi:10.1037/0003-066X.49.12.997
Constantinou, A.C., Fenton, N.E., & Pollock, L.J.H. (2014). Bayesian networks for unbiased assessment of referee bias in association football. Psychology of Sport and Exercise, 15, 538–547. doi:10.1016/j.psychsport.2014.05.009
Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in Psychology, 5, 1–17. doi:10.3389/fpsyg.2014.00781
Doron, J., & Gaudreau, P. (2014). A point-by-point analysis of performance in a fencing match: Psychological processes associated with winning and losing streaks. Journal of Sport & Exercise Psychology, 36, 3–13. doi:10.1123/jsep.2013-0043
Fanelli, D. (2010). "Positive" results increase down the hierarchy of the sciences. PLoS One, 5(3), e10068. doi:10.1371/journal.pone.0010068
Fanelli, D. (2012). Negative results are disappearing from most disciplines and countries. Scientometrics, 90, 891–904. doi:10.1007/s11192-011-0494-7
Fisher, R.A. (1922). On the interpretation of χ2 from contingency tables, and the calculation of P. Journal of the Royal Statistical Society, 85, 87–94. doi:10.2307/2340521
Fisher, R.A. (1925). Statistical methods for research workers. Edinburgh, Scotland: Oliver & Boyd.
Fisher, R.A. (1935a). The design of experiments. Edinburgh, Scotland: Oliver & Boyd.
Fisher, R.A. (1935b). The logic of inductive inference. Journal of the Royal Statistical Society, 98, 39–54. doi:10.2307/2342435
Fisher, R.A. (1947). Development of the theory of experimental design. Proceedings of the International Statistical Conference, 3, 434–439. Retrieved from https://digital.library.adelaide.edu.au/dspace/bitstream/2440/15254/1/212.pdf
Fisher, R.A. (1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society: Series B (Methodological), 17, 69–78. Retrieved from http://www.jstor.org/stable/2983785
Fisher, R.A. (1959). Statistical methods and scientific inference (2nd ed.). New York, NY: Hafner.
Gelman, A. (2008). Objections to Bayesian statistics. Bayesian Analysis, 3, 445–450. doi:10.1214/08-BA318
Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science, 9, 641–651. doi:10.1177/1745691614551642
Gelman, A., & Stern, H. (2006). The difference between "significant" and "not significant" is not itself statistically significant. The American Statistician, 60, 328–331. doi:10.1198/000313006X152649
Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33, 587–606. doi:10.1016/j.socec.2004.09.033
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Kruger, L. (1989). The empire of chance: How probability changed science and everyday life. New York, NY: Cambridge University Press. doi:10.1017/CBO9780511720482
Gill, J. (1999). The insignificance of null hypothesis significance testing. Political Research Quarterly, 52, 647–674. doi:10.1177/106591299905200309
Goodman, S.N. (1999). Toward evidence-based medical statistics. 1: The P value fallacy. Annals of Internal Medicine, 130, 995–1004. doi:10.7326/0003-4819-130-12-199906150-00008
Goodman, S.N. (2008). A dirty dozen: Twelve P-value misconceptions. Seminars in Hematology, 45, 135–140. doi:10.1053/j.seminhematol.2008.04.003
Grissom, R.J., & Kim, J.J. (2012). Effect sizes for research: Univariate and multivariate applications (2nd ed.). New York, NY: Routledge.

Hager, W. (2013). The statistical theories of Fisher and of Neyman and Pearson: A methodological perspective. Theory & Psychology, 23, 251–270. doi:10.1177/0959354312465483
Hagger, M.S., & Chatzisarantis, N.L.D. (2009). Assumptions in research in sport and exercise psychology. Psychology of Sport and Exercise, 10, 511–519. doi:10.1016/j.psychsport.2009.01.004
Halsey, L.G., Curran-Everett, D., Vowler, S.L., & Drummond, G.B. (2015). The fickle P value generates irreproducible results. Nature Methods, 12, 179–185. doi:10.1038/nmeth.3288
Hanley, J.A., Julien, M., & Moodie, E.E.M. (2008). Student's z, t, and s: What if Gosset had R? The American Statistician, 62, 64–69. doi:10.1198/000313008X269602
Henson, R.K. (2006). Effect-size measures and meta-analytic thinking in counseling psychology research. The Counseling Psychologist, 34, 601–629. doi:10.1177/0011000005283558
Higgs, M.D. (2013). Do we really need the S-word? American Scientist, 101, 6–9. doi:10.1511/2013.100.6
Hoekstra, R., Morey, R.D., Rouder, J.N., & Wagenmakers, E.-J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21, 1157–1164. doi:10.3758/s13423-013-0572-3
Hubbard, R., & Bayarri, M.J. (2003). Confusion over measures of evidence (p's) versus errors (α's) in classical statistical testing. The American Statistician, 57, 171–182. doi:10.1198/0003130031856
Hudson, Z. (2003). The research headache—answers to some questions [Editorial]. Physical Therapy in Sport, 4, 105–106. doi:10.1016/S1466-853X(03)00080-4
Ivarsson, A., Andersen, M.B., Johnson, U., & Lindwall, M. (2013). To adjust or not adjust: Nonparametric effect sizes, confidence intervals, and real-world meaning. Psychology of Sport and Exercise, 14, 97–102. doi:10.1016/j.psychsport.2012.07.007
Jackson, B., Gucciardi, D.F., & Dimmock, J.A. (2014). Toward a multidimensional model of athletes' commitment to coach-athlete relationships and interdependent sport teams: A substantive-methodological synergy. Journal of Sport & Exercise Psychology, 36, 52–68. doi:10.1123/jsep.2013-0038
Johnstone, D.J. (1986). Tests of significance in theory and practice. The Statistician, 35, 491–504. doi:10.2307/2987965
Jordan, M.I. (2011). What are the open problems in Bayesian statistics? The ISBA Bulletin, 18(1), 1–4. Retrieved from http://www4.stat.ncsu.edu/~reich/st740/Bayesian_open_problems.pdf
Kruschke, J.K. (2011). Doing Bayesian data analysis: A tutorial with R and BUGS. Burlington, MA: Academic Press.
Kruschke, J.K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142, 573–603. doi:10.1037/a0029146
Kruschke, J.K., Aguinis, H., & Joo, H. (2012). The time has come: Bayesian methods for data analysis in the organizational sciences. Organizational Research Methods, 15, 722–752. doi:10.1177/1094428112457829
Lehmann, E.L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association, 88, 1242–1249. doi:10.1080/01621459.1993.10476404
Lehmann, E.L. (2011). Fisher, Neyman, and the creation of classical statistics. New York, NY: Springer. doi:10.1007/978-1-4419-9500-1
Little, R.J. (2006). Calibrated Bayes: A Bayes/frequentist roadmap. The American Statistician, 60, 213–223. doi:10.1198/000313006X117837
Lodewyckx, T., Kim, W., Lee, M.D., Tuerlinckx, F., Kuppens, P., & Wagenmakers, E.-J. (2011). A tutorial on Bayes factor estimation with the product space method. Journal of Mathematical Psychology, 55, 331–347. doi:10.1016/j.jmp.2011.06.001
Mahoney, J.W., Gucciardi, D.F., Ntoumanis, N., & Mallett, C.J. (2014). Mental toughness in sport: Motivational antecedents and associations with performance and psychological health. Journal of Sport & Exercise Psychology, 36, 281–292. doi:10.1123/jsep.2013-0260
Masicampo, E.J., & Lalande, D.R. (2012). A peculiar prevalence of p values just below .05. Quarterly Journal of Experimental Psychology, 65, 2271–2279. doi:10.1080/17470218.2012.711335
McGraw, K.O., & Wong, S.P. (1992). A common language effect size statistic. Psychological Bulletin, 111, 361–365. doi:10.1037/0033-2909.111.2.361
Meehl, P.E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103–115. doi:10.1086/288135
Miles, J., & Banyard, P. (2007). Understanding and using statistics in psychology: A practical introduction. London, England: Sage.
Morey, R.D., Hoekstra, R., Rouder, J.N., Lee, M.D., & Wagenmakers, E.-J. (in press). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review.
Morey, R.D., Rouder, J.N., Pratte, M.S., & Speckman, P.L. (2011). Using MCMC chain outputs to efficiently estimate Bayes factors. Journal of Mathematical Psychology, 55, 368–378. doi:10.1016/j.jmp.2011.06.004
Nathoo, F.S., & Masson, M.E.J. (in press). Bayesian alternatives to null-hypothesis significance testing for repeated-measures designs. Journal of Mathematical Psychology. doi:10.1016/j.jmp.2015.03.003
Neyman, J. (1935). [Peer commentary on journal article "The logic of inductive inference" by R. A. Fisher]. Journal of the Royal Statistical Society, 98, 73–76. Retrieved from http://www.jstor.org/stable/2342435
Neyman, J. (1957). "Inductive behavior" as a basic concept of philosophy of science. Review of the International Statistical Institute, 25, 7–22. doi:10.2307/1401671
Neyman, J., & Pearson, E.S. (1928a). On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, 20A, 175–240.
Neyman, J., & Pearson, E.S. (1928b). On the use and interpretation of certain test criteria for purposes of statistical inference: Part II. Biometrika, 20A, 263–294.

Neyman, J., & Pearson, E.S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A, 231, 289–337. doi:10.1098/rsta.1933.0009
Nickerson, R.S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241–301. doi:10.1037/1082-989X.5.2.241
Norcross, J.C. (Ed.). (2011). Psychotherapy relationships that work: Evidence-based responsiveness (2nd ed.). New York, NY: Oxford University Press. doi:10.1093/acprof:oso/9780199737208.001.0001
Nuzzo, R. (2014). Scientific method: Statistical errors. Nature, 506, 150–152. doi:10.1038/506150a
Patton, M.Q. (2012). Essentials of utilization-focused evaluation. Thousand Oaks, CA: Sage.
Perezgonzalez, J.D. (2015). Fisher, Neyman-Pearson or NHST? A tutorial for teaching data testing. Frontiers in Psychology, 6, e223. doi:10.3389/fpsyg.2015.00223
Rodgers, J.L. (2010). The epistemology of mathematical and statistical modeling: A quiet methodological revolution. The American Psychologist, 65, 1–12. doi:10.1037/a0018326
Rosnow, R.L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. The American Psychologist, 44, 1276–1284. doi:10.1037/0003-066X.44.10.1276
Rouder, J.N., Speckman, P.L., Sun, D., & Morey, R.D. (2009). Bayesian t test for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16, 225–237. doi:10.3758/PBR.16.2.225
Rozeboom, W.W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416–428. doi:10.1037/h0042040
Samaniego, F.J., & Neath, A.A. (1996). How to be a better Bayesian. Journal of the American Statistical Association, 91, 733–742. doi:10.1080/01621459.1996.10476941
Scargle, J.D. (2000). Publication bias: The "file-drawer" problem in scientific inference. Journal of Scientific Exploration, 14, 91–106.
Schneider, J.W. (2015). Null hypothesis significance tests. A mix-up of two different theories: The basis for widespread confusion and numerous misinterpretations. Scientometrics, 102, 411–432. doi:10.1007/s11192-014-1251-5
Seddon, P.B., & Scheepers, R. (2012). Towards the improved treatment of generalization of knowledge claims in IS research: Drawing general conclusions from samples. European Journal of Information Systems, 21, 6–21. doi:10.1057/ejis.2011.9
Seidenfeld, T. (1979). Philosophical problems of statistical inference. Boston, MA: D. Reidel.
Sellke, T., Bayarri, M.J., & Berger, J.O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55, 62–71. doi:10.1198/000313001300339950
Shaver, J. (1993). What statistical significance testing is, and what it is not. Journal of Experimental Education, 61, 293–316. doi:10.1080/00220973.1993.10806592

Stang, A., Poole, C., & Kuss, O. (2010). The ongoing tyranny of statistical significance testing in biomedical research. European Journal of Epidemiology, 25, 225–230. doi:10.1007/s10654-010-9440-x
Thompson, B. (2002). What future quantitative social science research could look like: Confidence intervals for effect sizes. Educational Researcher, 31, 25–32. doi:10.3102/0013189X031003025
Thompson, B. (2006). Foundations of behavioral statistics: An insight-based approach. New York, NY: Guilford Press.
Trafimow, D. (2014). Editorial. Basic and Applied Social Psychology, 36, 1–2. doi:10.1080/01973533.2014.865505
Trafimow, D., & Marks, M. (2015). Editorial. Basic and Applied Social Psychology, 37, 1–2. doi:10.1080/01973533.2015.1012991
Van de Schoot, R., Kaplan, D., Denissen, J., Asendorpf, J.B., Neyer, F.J., & Aken, M.A. (2014). A gentle introduction to Bayesian analysis: Applications to developmental research. Child Development, 85, 842–860. doi:10.1111/cdev.12169
Verhagen, J., & Wagenmakers, E.-J. (2014). Bayesian tests to quantify the result of a replication attempt. Journal of Experimental Psychology: General, 143, 1457–1475. doi:10.1037/a0036731
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779–804. doi:10.3758/BF03194105
Wagenmakers, E.-J., Lee, M.D., Rouder, J.N., & Morey, R.D. (2015). Another statistical paradox. Manuscript submitted for publication.
Wagenmakers, E.-J., Verhagen, J., Ly, A., Bakker, M., Lee, M.D., Matzke, D., . . . Morey, R.D. (2014). A power fallacy. Behavior Research Methods. Advance online publication. doi:10.3758/s13428-014-0517-4
Wang, Z., & Thompson, B. (2007). Is the Pearson r2 biased, and if so, what is the best correction formula? Journal of Experimental Education, 75, 109–125. doi:10.3200/JEXE.75.2.109-125
Wilkinson, M. (2014). Distinguishing between statistical significance and practical/clinical meaningfulness using statistical inference. Sports Medicine, 44, 295–301. doi:10.1007/s40279-013-0125-y
Ziliak, S.T. (2010). The Validus Medicus and a new gold standard. Lancet, 376, 324–325. doi:10.1016/S0140-6736(10)61174-9
Ziliak, S.T., & McCloskey, D.N. (2008). The cult of statistical significance: How the standard error cost us jobs, justice, and lives. Ann Arbor, MI: University of Michigan Press.
Zyphur, M.J., & Oswald, F.L. (2015). Bayesian estimation and inference: A user's guide. Journal of Management, 41, 390–420. doi:10.1177/0149206313501200

Manuscript submitted: January 23, 2015
Revision accepted: June 24, 2015
