Multiple Comparisons: Comparisonwise Versus Experimentwise Type I Error Rates and Their Relationship to Power¹

K. E. KEMP
Department of Statistics and Statistical Laboratory, Kansas State University, Manhattan 66506

ABSTRACT

Some statisticians contend that the experimentwise Type I error rate is the most important attribute of multiple comparison procedures to be used for making all possible pairwise comparisons among treatment means after an analysis of variance. That contention is challenged here. The importance of Type I errors is discussed, as well as the occurrence of Type I errors in biological experiments. Also considered is the effect of Type I error protection on power. The two methods of measuring Type I error rate, comparisonwise and experimentwise, are explained, and the reader may decide which kind he wishes to control. Several references cited support Fisher's least significant difference and Duncan's new multiple range test despite their higher-than-nominal experimentwise Type I error rates. These procedures control the comparisonwise Type I error rate and are considerably more powerful in finding differences among treatments than procedures that control the experimentwise Type I error rate.

INTRODUCTION

Gill (9) suggests that Duncan's new multiple range test (DMRT) (6) and Fisher's least significant difference test (LSD) (7) should be used no longer. However, evidence in the literature is sufficient to indicate that Gill's advice is unwarranted for each procedure. Let me define terms. A Type I error is rejecting H0: μi = μj when H0 is true.

Received November 1, 1973.
¹Contribution No. 212, Department of Statistics and the Statistical Laboratory of the Kansas Agricultural Experiment Station, Manhattan 66506.

There are two commonly used ways of measuring the Type I error rate of multiple comparison procedures. One is the comparisonwise error rate, defined as the ratio of the number of Type I errors to the total number of comparisons. For example, if we have four treatment means that we wish to compare, there are six comparisons to be made. If, in those six comparisons, one mean is declared significantly different from another when the two are really equal, then one Type I error has occurred and the comparisonwise Type I error rate is 1/6. The other method is the experimentwise error rate, defined as the proportion of experiments in which one or more Type I errors occur, that is, the ratio of the number of experiments in which at least one Type I error occurs to the total number of experiments analyzed. For example, if we have 10 experiments and we separate means in each, we would count the number of experiments in which at least one Type I error occurred and divide by 10 to compute the experimentwise Type I error rate. Thus, if at least one Type I error occurred in three experiments and none occurred in the other seven, the experimentwise Type I error rate would be 3/10. The power of a multiple comparison test is defined as the proportion of the time that H0: μi = μj is declared false when it is false. Power is defined on a comparisonwise basis because that seems to be the only meaningful way to measure it. It is doubtful that an experimentwise definition of power would be useful for comparing pairwise comparison procedures. That also may be true of the experimentwise Type I error rate.
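To make the two definitions concrete, here is a minimal sketch in Python. The 10-experiment, 6-comparison layout and the array of error indicators are illustrative inventions, not data from this paper:

```python
import numpy as np

# Hypothetical results: 10 experiments, each with 6 pairwise comparisons
# among 4 equal treatment means. An entry of 1 marks a Type I error
# (a pair declared different although the true means are equal).
rng = np.random.default_rng(0)
errors = (rng.random((10, 6)) < 0.05).astype(int)  # illustrative only

# Comparisonwise rate: Type I errors over total comparisons made.
comparisonwise = errors.sum() / errors.size

# Experimentwise rate: fraction of experiments with at least one error.
experimentwise = (errors.sum(axis=1) > 0).mean()

print(f"comparisonwise = {comparisonwise:.3f}")
print(f"experimentwise = {experimentwise:.3f}")
```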


Gill (9) cites papers by Boardman and Moffitt (3) and Anderson (1) as evidence against LSD and DMRT, but neither of those papers discusses situations typical for agricultural or biological researchers. The most important point is that in nearly all applications, in both agriculture and biology, the null hypothesis is not true, and few think it is. Boardman and Moffitt used LSD to make all possible comparisons among pairs of means for up to 10 treatments with no prior significant F ratio and with no differences among treatments. Fisher (7) warned not to use the procedure unless the F ratio is significant, indicating a false null hypothesis. Thus, at α = .05, the LSD should not even have been computed for 95% of the comparisons made by Boardman and Moffitt. That leaves their results relevant only to multiple t tests, not to LSD tests; the two are not the same. Another weakness of their results is that one could almost never design a meaningful biological experiment and have 10 treatments all the same. Anderson's investigation of LSD is based on the conditional probability that the null hypothesis has been rejected by the F test when H0: μ1 = μ2 = ... = μp = μ is true. Computing a Type I error rate based on such a conditional probability does not fairly evaluate the usefulness of LSD. Such an evaluation is based on only the 100α% of outcomes in which the F ratio is significant because of a Type I error. It eliminates from consideration those cases where the F test indicates acceptance of H0, and hence LSD would not be computed. If those cases were included, the experimentwise Type I error rate of LSD would be considerably lower.
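To make that step explicit, a one-line calculation (my own illustration from the definitions above, not anything in Anderson's paper) shows why conditioning inflates the apparent rate. For the F-protected LSD under the complete null hypothesis,

```latex
\Pr(\text{at least one pairwise Type I error})
  = \Pr(F\ \text{significant}) \times
    \Pr(\text{at least one pairwise error} \mid F\ \text{significant})
  \le \Pr(F\ \text{significant}) = \alpha .
```

The conditional factor may be large, but the unconditional experimentwise rate cannot exceed the nominal level of the F test.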

IMPORTANCE OF TYPE I ERROR

Before citing research results indicating that DMRT and LSD are not nearly as defective in Type I error rate as Gill suggested, I should acknowledge that Gill contends it is extremely important to avoid Type I errors. Are Type I errors that serious? And what about the power of the test? If μ1 = μ2 and the experimenter erroneously concludes that μ1 > μ2 or that μ2 > μ1, is great damage done? If the results were from comparing the effects of different rations on milk production, for example, whether you recommend ration 1 or ration 2 makes no difference if they are equally good and cost about the same. If the cost of the rations, their availability, etc., differ considerably, of course, the ration best justified economically should be used. However, if μ1 = μ2 and the experiment was properly designed, most of the time the difference between the sample means X̄1 and X̄2 would be too small to justify a more expensive ration to obtain a small increase in milk production, whether that difference is statistically significant or not.


In addition, before extension personnel or researchers recommend major changes to commercial operators, the results of a particular treatment must be repeated several times. It is hard to imagine them recommending major changes on the basis of one experimental outcome. The chance of μ1 being declared greater than μ2 in more than one experiment when μ1 = μ2 is extremely small when the experiments are properly designed. For these reasons, the Type I error rate is not the most important factor in determining the usefulness of a multiple comparison procedure, even though that apparently is the view of those who recommend extremely conservative procedures like Tukey's test (TSD) (17). To me, the power of a test is at least as important as its Type I error rate.

POWER VERSUS TYPE I ERROR RATE

The basic controversy about multiple comparison tests is whether we should control the comparisonwise or the experimentwise Type I error rate. Related to this, however, is the power of the test. Most readers are probably familiar with the relationship in the usual hypothesis testing situation: if we test H0: μ1 = μ2 at α = .05, we have a lower probability of making a Type I error than if we use α = .10, when H0 is true. At the same time, if H0 is false, using α = .05 gives us less power than using α = .10. We can never avoid this trade-off; as α decreases, so does the power of the test. Similarly, the choice we make concerning which type of error rate to control will affect the power of the test. The LSD procedure controls the comparisonwise Type I error rate. Tukey's test (TSD), on the other hand, controls the experimentwise Type I error rate. If we use LSD at α = .05, the theoretical comparisonwise error rate is .05. If we have five treatment means that we wish to compare, there are 10 comparisons to be made, and the probability of making at least one Type I error in those 10 comparisons, if H0: μ1 = μ2 = ... = μ5 is true, is 1 − (.95)^10 = .40. This, then, is the theoretical experimentwise Type I error rate for LSD in this situation.
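For m comparisons, the relationship used here is α_E = 1 − (1 − α_C)^m, treating the comparisons as independent, as the text does. A few lines of Python (my illustration; the independence assumption is only approximate for pairwise comparisons that share means) reproduce the figure:

```python
alpha_c = 0.05
m = 10  # pairwise comparisons among 5 means

# Experimentwise rate implied by a .05 comparisonwise rate.
alpha_e = 1 - (1 - alpha_c) ** m
print(f"experimentwise rate under comparisonwise control: {alpha_e:.2f}")  # ~0.40

# Comparisonwise rate implied by holding the experimentwise rate at .05.
alpha_c_implied = 1 - (1 - 0.05) ** (1 / m)
print(f"comparisonwise rate under experimentwise control: {alpha_c_implied:.4f}")
```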


If we were using TSD at α = .05 to compare 10 means, the theoretical experimentwise Type I error rate is .05, but the theoretical comparisonwise rate is .002. These examples show that controlling the comparisonwise error rate increases the experimentwise error rate, and controlling the experimentwise error rate reduces the comparisonwise rate. Orthogonal contrasts have the same theoretical experimentwise Type I error rate as DMRT, yet Gill recommended orthogonal contrasts while condemning DMRT for having too high an experimentwise Type I error rate.

The choice one makes of which type of error rate to control has a direct effect on the power of the test. As the foregoing discussion indicates, choosing to control the experimentwise Type I error rate in an experiment with 10 treatments is the same as choosing α = .002 on a comparisonwise basis. The effect on the power of the test is essentially the same as that of testing H0: μ1 = μ2 at α = .05 compared with α = .002 (a numerical illustration appears in the sketch at the end of this section). If H0 is indeed false, the power of the test will be considerably greater for α = .05 than for α = .002 as the comparisonwise Type I error rate. This situation appears repeatedly in the Monte Carlo simulations discussed in the next section of this paper.

Steel (15) pointed out that if an experimentwise error rate is fixed rigidly in advance, the comparisonwise error rate depends on the number of comparisons. Thus, the knowledge that previously unknown treatments might provide when included in an experiment is negated, because including them may cause otherwise significant results to become nonsignificant. Restricting our interest to experimentwise error rates, rather than also considering comparisonwise error rates, limits the utility of exploratory factorial designs, because the more treatments that are included, the larger the difference that will be considered significant becomes. Of course, that reduces the probability of finding real differences among treatments.

O'Neill and Wetherill (11) state, "No one, except perhaps some statisticians, would rigidly assign error rates and then state whether or not certain comparisons represented real effects. Error rates are to be used as guides and are conveniently considered as fixed for the purpose of certain power calculations, etc." This would be the understanding of researchers applying multiple comparison procedures if they understood enough about them to know what their choice between power and Type I error rate really was.

As Cox (5) points out, "The essence of the matter seems to me to be this. The fact that a probability can be calculated for the simultaneous correctness of a large number of statements does not usually make that probability relevant for the measurement of the uncertainty of one of the statements." Cox's view of the value of multiple comparisons is given in the statement, "The practical usefulness of the multiple comparison techniques then usually lies in giving a conservative bound for the effect of selection, rather than in giving an 'exact' solution."
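The power penalty mentioned above can be quantified with a standard two-sample power calculation. This sketch uses statsmodels; the effect size and sample size are arbitrary illustrations, not values from the paper:

```python
from statsmodels.stats.power import TTestIndPower

# Power of a two-sided two-sample t test for a one-standard-deviation
# difference with 6 observations per treatment (illustrative values).
analysis = TTestIndPower()
for alpha in (0.05, 0.002):
    power = analysis.power(effect_size=1.0, nobs1=6, alpha=alpha)
    print(f"alpha = {alpha:.3f}: power = {power:.2f}")
```

Under these assumptions, the power at α = .05 is several times that at α = .002, which is the point of the comparison above.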

SIMULATION RESULTS

The best evidence that LSD and DMRT do not, in fact, have extremely high Type I error rates is provided by several Monte Carlo simulations that measure both the Type I error rate and the power of most of the presently used multiple comparison procedures. The simulations were designed to test the performance of the procedures under conditions similar to those in actual applications. Balaam (2), comparing the LSD, DMRT, and Newman-Keuls (10) procedures, found that even with a significant F ratio, the Newman-Keuls procedure did not find differences in some cases. He also concluded that LSD had the greatest power and an acceptably low Type I error rate. Fryer (8) simulated experiments with 3 configurations of differences among 5 treatments, 5 configurations among 10 treatments, and 2 configurations among 20 treatments. He tested both LSD and DMRT and concluded that each had an acceptably low Type I error rate and adequate power. Petrinovich and Hardyck (12) tested Tukey's test (TSD), Scheffé's test (SSD) (13), and LSD with both unequal variances and non-normality in the data. Using between 2 and 10 treatments, they concluded that LSD did not give sufficient protection against Type I error; however, they were comparing it to the two most conservative tests. Waller (19), comparing multiple t tests, LSD, TSD, and a Bayesian decision rule (BSD) (18) for combinations of 5 and 9 treatments, found that TSD lacked power and that both LSD and BSD gave good power with sufficient protection against Type I error.
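For readers who want to reproduce this kind of comparison, the following is a minimal Monte Carlo sketch under the complete null hypothesis, contrasting the unprotected and F-protected procedures. Every choice in it (5 treatments, 6 replicates, 2,000 experiments) is illustrative, not a reconstruction of any cited study, and the pairwise t tests use only the two groups being compared, whereas LSD proper pools the error variance across all treatments:

```python
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p, n, alpha, n_sims = 5, 6, 0.05, 2000  # treatments, reps, level, experiments

unprotected_hits = protected_hits = 0
for _ in range(n_sims):
    groups = [rng.normal(0.0, 1.0, n) for _ in range(p)]  # H0 true throughout
    f_stat, f_pval = stats.f_oneway(*groups)
    any_pair = any(
        stats.ttest_ind(groups[i], groups[j]).pvalue < alpha
        for i, j in itertools.combinations(range(p), 2)
    )
    unprotected_hits += any_pair                     # pairwise tests, no F gate
    protected_hits += any_pair and (f_pval < alpha)  # F-protected procedure

print(f"unprotected experimentwise rate: {unprotected_hits / n_sims:.3f}")
print(f"protected experimentwise rate:   {protected_hits / n_sims:.3f}")
```

Under these assumptions the unprotected rate comes out well above the nominal .05 (though below the .40 independence approximation, since the comparisons share means), while the protected rate stays at or below .05, which is the point of the conditional-probability argument above.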

O'Neill and Wetherill (11) gave a complete review of existing techniques and a very thorough collection of references. They pointed out that the lack of suitable tables and the complexity of methods frequently limit the choice of method severely; such limitations do not affect LSD. After extensively reviewing the literature, they concluded that LSD is as good as any test available, all things considered. Carmer and Swanson (4) have reported the most complete simulation to date. They compared 10 techniques and measured their power, Type I error rate, and Type III error rate. A Type III error is defined as reversing the rank of the treatment means and calling the difference significant, e.g., concluding that μ2 > μ1 when in fact μ1 > μ2. They used 22 sets of treatment effects involving 5, 10, or 20 treatments, with assorted distributions of differences among treatments ranging from none, to equally spaced differences, to clustered differences. Each treatment configuration was run at four levels of replication (3, 4, 6, and 8), producing 88 different experiments, each replicated 1,000 times, for a total of 88,000 simulated experiments. They concluded that Type III error was negligible for all tests in their comparisons. They found LSD, DMRT, and BSD the most powerful and useful, and they concluded that TSD and SSD were too conservative to be practical. Thomas (16) compared 7 pairwise comparison procedures. He concluded that DMRT was the one to use because of its high power and sufficient protection against Type I error. He believes LSD should not be used because of its higher conditional Type I error rate. Again, it seems that the conditional error rate is not important for LSD, especially in cases such as Thomas simulated, where he had 5, 10, and 20 means all equal. He goes on to say that the other procedures he considered, including TSD, were too conservative to be practical.

CONCLUSION

The results cited indicate that we are not to the point where LSD and DMRT should be laid to rest and forgotten. The majority of evidence indicates they are the two most powerful procedures available. All indications are that tests like SSD and TSD, which control the experimentwise error rate, lack sufficient power to be useful in sorting out which treatments are different.


In Duncan's original work (6), he said his test could be used without a prior F test. To do so limits the experimentwise Type I error rate to 1 − (1 − α)^(p−1), where p is the number of treatments. In the simulation work, virtually all who included DMRT used it according to Duncan's recommendation, i.e., on every set of data regardless of the F test. However, many experimenters use DMRT only when the F ratio indicates differences among treatments, i.e., just as they would use LSD, TSD, etc. That condition markedly improves protection against Type I error while reducing power very little. Thus, using DMRT only when F is significant gives the researcher a procedure with a lower Type I error rate than LSD and with good power for finding differences among treatment means. Those concerned with Type I error could easily adopt that procedure. Only in cases where it is extremely important to avoid Type I errors should conservative procedures such as TSD and SSD be used. Even then, the experimenter would be better off using BSD, so that he can specify the relative importance of Type I to Type II error. Carmer and Swanson showed that procedure to be similar to LSD, but it gives better protection against Type I error when the F ratio is small and higher power when F is large. It is, however, more difficult to use than LSD.
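As a closing illustration, here is a minimal sketch of the F-protected strategy the conclusion recommends. DMRT's critical values require Duncan's special range tables, so the sketch substitutes the protected LSD, whose F-gate logic is identical; the pooled-variance construction follows the standard LSD, and the data are hypothetical:

```python
import itertools
import numpy as np
from scipy import stats

def protected_lsd(groups, alpha=0.05):
    """Run pairwise LSD comparisons only if the overall F test is significant."""
    f_stat, f_pval = stats.f_oneway(*groups)
    if f_pval >= alpha:
        return []  # F not significant: declare no pairwise differences
    k = len(groups)
    df_error = sum(len(g) for g in groups) - k
    # Pooled error mean square from the one-way ANOVA.
    mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / df_error
    t_crit = stats.t.ppf(1 - alpha / 2, df_error)
    significant = []
    for i, j in itertools.combinations(range(k), 2):
        se = np.sqrt(mse * (1 / len(groups[i]) + 1 / len(groups[j])))
        lsd = t_crit * se  # least significant difference for this pair
        if abs(groups[i].mean() - groups[j].mean()) > lsd:
            significant.append((i, j))
    return significant

# Hypothetical data: three treatments, six replicates each.
rng = np.random.default_rng(2)
data = [rng.normal(mu, 1.0, 6) for mu in (0.0, 0.0, 1.5)]
print(protected_lsd(data))  # expect pairs involving the shifted third group
```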

REFERENCES

1 Anderson, D. A. 1972. Overall confidence levels of the least significant difference procedure. Am. Stat. 26:30.
2 Balaam, L. N. 1963. Multiple comparisons, a sampling experiment. Aust. J. Stat. 5:62.
3 Boardman, T. J., and D. R. Moffitt. 1971. Graphical Monte Carlo Type I error rates from multiple comparison procedures. Biometrics 27:738.
4 Carmer, S. G., and M. R. Swanson. 1973. Evaluation of ten pairwise multiple comparison procedures by Monte Carlo methods. J. Am. Stat. Ass. 68:66.
5 Cox, D. R. 1965. A remark on multiple comparison methods. Technometrics 7:223.
6 Duncan, D. B. 1955. Multiple range and multiple F tests. Biometrics 11:1.
7 Fisher, R. A. 1935. The design of experiments. Oliver and Boyd, London.
8 Fryer, H. C. 1966. Concepts and methods of experimental statistics. Allyn and Bacon, Boston.
9 Gill, J. L. 1973. Current status of multiple comparisons of means in designed experiments. J. Dairy Sci. 56:973.
10 Keuls, M. 1952. The use of the studentised range in connection with an analysis of variance. Euphytica 1:112.
11 O'Neill, R., and G. B. Wetherill. 1971. The present state of multiple comparison methods. J. Royal Stat. Soc. (Series B) 33:218.
12 Petrinovich, L. F., and C. D. Hardyck. 1969. Error rates for multiple comparison methods. Psych. Bull. 71:43.
13 Scheffé, H. 1953. A method for judging all contrasts in the analysis of variance. Biometrika 40:87.
14 Steel, R. G. D., and J. H. Torrie. 1960. Principles and procedures of statistics. McGraw-Hill.
15 Steel, R. G. D. 1961. Query 163: Error rates in multiple comparisons. Biometrics 17:326.
16 Thomas, D. A. H. 1973. Error rates in multiple comparisons among means: results of a simulation exercise. Unpublished Master's Thesis. University of Kent, Canterbury, England.
17 Tukey, J. W. 1953. The problem of multiple comparisons. Unpublished notes. Princeton University, Princeton, NJ.
18 Waller, R. A., and D. B. Duncan. 1969. A Bayes rule for the symmetric multiple comparison problem. J. Am. Stat. Ass. 64:1484.
19 Waller, R. A. 1970. On the Bayes rule for the symmetric multiple comparisons problem. Unpublished notes. Kansas State University, Manhattan 66506.
