Acta Oncologica Vol. 31, No. 1, pp. 723-727, 1992

STATISTICAL TESTS IN MEDICAL RESEARCH


JACQUES JAMART

Various propositions have been made to improve the statistical quality of medical journals: using statistical referees, promoting better collaboration between statisticians and researchers, and teaching basic statistics to clinicians. The most frequent errors found in medical articles are misinterpretation of p-values or of non-significant results and confusion between statistical and clinical significance, inappropriate use of tests requiring precise assumptions, inappropriate or uncontrolled multiple testing and particularly testing of post hoc hypotheses, and overemphasis on p-values. As many errors arise from the misinterpretation of the results of statistical hypothesis tests, the basic principles of this methodology are emphasized, and the usual fallacies found in the medical literature are reviewed.

Since World War II, statistical methods have overrun biomedical as well as other scientific journals, confirming the title of Neyman's paper 'Statistics-servant of all sciences' (1), and biostatistics has become a discipline essential to clinical research, 'a pillar of medicine' (2). Unfortunately, the statistical quality of medical articles has been seriously questioned for more than twenty years in many papers which have assessed the statistical content of various journals, e.g. the British Medical Journal (3) or the New England Journal of Medicine (4). Their conclusion is that about one half of all published papers reporting numerical data contain statistical errors (5-7). The implications of such inadequate statistical analyses are dramatic, with some treatments erroneously being 'proved' to be effective, and other promising treatments being written off early in their development (8). Some clinicians probably do not realize that publishing incorrect or misleading results is unethical, since it exposes patients to unjustified inconvenience, wastes resources and time, and prompts unnecessary further work (9).

Received 10 February 1992. Accepted 2 July 1992.
From Consultation Biostatistique, Cliniques Universitaires de Mont-Godinne, Université Catholique de Louvain, Yvoir, Belgium.
Correspondence to: Dr Jacques Jamart, 18, rue Jules Adant, B-1950 Kraainem, Belgium.

Proposed improvements to cure this state have included: a) using statistical referees for reviewing submitted manuscripts; b) promoting better collaboration between statisticians and researchers; and c) teaching basic statistics to clinicians (6). Some journals have also provided statistical guidelines for their contributors (10-12). It has, however, been noted recently by Altman (13) that 'despite improvements in some areas, it is clear that the misuse of statistics in medical journals is still far too common'. The purpose of the present paper is to discuss briefly the above propositions but mainly to present some useful basic concepts of statistical hypothesis testing which should allow the reader to avoid the most frequent errors.

Proposals to improve the statistical quality of papers

Many journals have recruited statistical referees (8, 14, 15). It must, however, be noted that the works most in need of statistical review are generally not the manuscripts using advanced statistical methods, e.g. multivariate analyses, but rather those with the usual simple statistical procedures performed personally by clinicians unaware of basic statistical concepts (16, 17). It is hoped that the participation of statisticians in the reviewing process of manuscripts will improve the statistical quality of many journals (15). A recent study comparing the statistical quality of 45 manuscripts submitted to the British Medical Journal with their final published versions is encouraging in this respect, since the proportion of papers considered statistically acceptable rose from 11% to 82% through the refereeing process (18).




However, it must be emphasized that a number of manuscripts had to be definitively rejected because poor designs could no longer be remedied by the authors. It would thus be preferable that statisticians be considered by researchers as allies rather than worrisome censors (2). It is at the design stage of a work that a statistician has the greatest potential to contribute to the success of a research project. But the statistician's experience is that advice is rarely sought at this stage. More commonly, the first contact is made when the data collection has been completed or even when the paper has been returned by an editor with various statistical criticisms. The clinician then asks the statistician to perform an impossible salvage job (8). As written by Moses & Louis (19), proverbial among statisticians is the incoming telephone call that starts out 'I have a simple statistical problem that should take only a minute of your time'. This is of course not the best way to take advantage of a statistician's expertise. The statistician must collaborate closely in many aspects of a research work to be able to offer the most relevant solution to the clinician's problem (8, 20). Statistics is an art, and there is rarely a unique correct approach (8). However, statisticians should be mandatory only for large-scale collaborative research or relatively complex designs. Clinical investigators should learn enough statistical principles to perform elementary analyses correctly themselves and should be able to recognize situations that require help from a professional statistician (15, 21). Most of the errors found in medical journals are related to elementary statistical techniques, material discussed in introductory courses and textbooks. The most frequent errors can be summarized as follows (3, 5, 8, 21-23): a) confusion between the standard deviation and the standard error of the mean; b) misinterpretation of p-values or of non-significant results and confusion between statistical and clinical significance; c) inappropriate use of tests requiring precise assumptions, especially Student's t-tests; d) inappropriate or uncontrolled multiple testing and particularly testing of post hoc hypotheses; and e) overemphasis on p-values. It can thus be seen that almost all these usual errors are the consequence of the misuse of hypothesis tests or the misinterpretation of their results. It must also be emphasized that this particular problem of result interpretation does not only concern researchers or authors, but all practising physicians. Indeed, several reports have clearly shown that their knowledge of elementary statistical concepts is quite insufficient, fewer than half of them being able to interpret the meaning of a p-value (24-26), an expression they nevertheless read every day in their medical journals!
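The first of these errors is worth a concrete illustration. The following short Python sketch (added here for illustration; the sample values are fictitious) shows why the standard deviation and the standard error of the mean must not be interchanged: the former describes the spread of individual observations, the latter the precision of the estimated mean.

```python
import numpy as np

# Fictitious sample of 10 measurements (values chosen for illustration only)
x = np.array([1.14, 1.13, 1.11, 1.19, 1.15, 1.15, 1.09, 1.19, 1.12, 1.14])

sd = x.std(ddof=1)            # standard deviation: spread of the individual values
sem = sd / np.sqrt(len(x))    # standard error of the mean: precision of the estimate

print(f"mean = {x.mean():.3f}, SD = {sd:.4f}, SEM = {sem:.4f}")
# The SD stays roughly constant as n grows, while the SEM shrinks as 1/sqrt(n);
# reporting 'mean +/- SEM' therefore suggests far less variability than 'mean +/- SD'.
```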

Basic concept of hypothesis testing

A statistical hypothesis test is a decision rule designed to accept or reject a hypothesis concerning one or several unknown population parameters.

Figure. Relationship between the unknown truth and the conclusions of a statistical hypothesis test, with the comparison between two means as an example.

It requires defining two hypotheses which are mutually exclusive, that is, if one is true, the other is false. One of these is called the null hypothesis H0, the other the alternative hypothesis H1. For instance, for comparing two unknown population means μA and μB, we define 'μA = μB' as H0 and 'μA ≠ μB' as H1, both hypotheses being mutually exclusive. The statistical test uses the sample data to assess the plausibility of H0. If this plausibility is low, we consider H0 as unlikely and we reject it; if it is not low, we accept H0, or more precisely we do not reject it. The classical level for declaring that a probability is low has been arbitrarily fixed at 0.05. A hypothesis test is thus a proof ab absurdo: in the comparison between two means we try to show that μA = μB is not plausible in order to conclude that μA ≠ μB. An essential element of this theory is the asymmetrical character of the two hypotheses, since in fact only H0 is assessed by the test. Rejecting H0 thus allows us to accept H1, that is, in the example, to conclude that both means differ significantly; but accepting H0 does not allow us to reject H1. In this case it is said that the means do not differ significantly, that is, a difference may or may not exist. The relationship between the unknown truth and the two possible conclusions of a statistical test gives rise to the four situations described in the Figure. We draw a correct conclusion either when we reject H0 'μA = μB' and in truth both means differ, or when we accept H0 and both means do not actually differ. What we hope, of course, is to avoid erroneous conclusions. When we reject H0 although it is really true, we commit a so-called type I error; the probability α of this false conclusion of the existence of a difference between the two means is expressed by the p-value of the test. When we accept H0 although it is false, we commit another error, called a type II error; its probability β is the probability of not detecting a difference which really does exist. Unfortunately, this parameter is generally not a single value. Indeed, the probability of a type II error is computed under the situation where H1 is true, that is, μA ≠ μB; there is of course an infinity of values compatible with this inequality and, consequently, a β probability for each possible value of the real difference between the unknown means μA and μB.
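The following Python sketch (an illustration added to the paper's argument, with arbitrary sample sizes and an arbitrary true difference) simulates both situations for an unpaired t-test at the 0.05 level: when H0 is true, the rate of false positive conclusions estimates α; when a real difference exists, the rate of false negative conclusions estimates β for that particular difference.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sim, n, alpha = 10_000, 10, 0.05
false_pos = false_neg = 0

for _ in range(n_sim):
    # Situation 1: H0 is true (both groups drawn from the same population)
    a = rng.normal(1.15, 0.03, n)
    b = rng.normal(1.15, 0.03, n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_pos += 1                    # type I error

    # Situation 2: H1 is true (the means really differ by 0.03)
    c = rng.normal(1.15, 0.03, n)
    d = rng.normal(1.18, 0.03, n)
    if stats.ttest_ind(c, d).pvalue >= alpha:
        false_neg += 1                    # type II error

print(f"estimated alpha = {false_pos / n_sim:.3f}")  # close to 0.05 by construction
print(f"estimated beta  = {false_neg / n_sim:.3f}")  # depends on the true difference and n
```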




Some authors have pointed out the analogy between statistical hypothesis tests and diagnostic tests, type I and type II errors corresponding to false positive and false negative conclusions respectively (27, 28). Three usual misinterpretations of the results of a statistical test must be emphasized. 1) The p-value is not the probability of making a mistake. It is only the probability of a type I error or, in the example of the comparison between two means, the probability of obtaining a difference as large as or larger than the one observed in the sample under the null hypothesis, that is, if in reality there is no difference between the two means. 2) The absence of a significant difference between two means does not allow one to conclude that these parameters are equal, as is frequently stated in medical reports. Such an assertion is only possible when a low β probability has been computed for an acceptable clinical difference, which implies sufficiently large samples, rarely found in the medical literature. Freiman et al. (29) have shown that, among 71 controlled clinical trials published in various journals and concluding that there was no significant difference between two treatments, only four had a probability of less than 0.10 of missing a 25% difference in the studied variables. Every investigator must remember that 'the absence of a proof of efficacy is not the proof of an absence of efficacy' (28). 3) The p-value is not a measure of the magnitude of an effect. A statistically significant difference means that a difference between unknown population parameters is probable, not that it is large enough to be clinically significant. For instance, the Table shows the values of creatinemia in two groups of 10 subjects treated by two drugs A and B, with mean creatinemias of 1.14 and 1.17 respectively. The small difference between these values is statistically significant with Student's t-test (p = 0.020). It is, however, doubtful whether such a difference is clinically relevant.

Practical use of statistical tests

From a more technical viewpoint, the choice of statistical tests is often inappropriate in the medical literature, usually because the basic assumptions of the tests are not valid or because multiple tests are performed on the same data. Indeed, many statistical procedures require precise assumptions to be valid. For instance, the unpaired Student's t-test for comparing two means, which is probably the most used procedure in medical articles, has been derived by assuming that both sets of data are randomly sampled from the studied populations and that, in these populations, the variable follows a normal distribution with equal variances.
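These assumptions can at least be examined before the test is applied. The sketch below (an illustration, not part of the original paper) uses the Shapiro-Wilk test for normality and Levene's test for equality of variances on the creatinemia data of the Table that follows; with only 10 observations per group such checks have little power, so they complement rather than replace graphical inspection and subject-matter knowledge.

```python
import numpy as np
from scipy import stats

# Creatinemia values of the two groups from the Table that follows
a = np.array([1.14, 1.13, 1.11, 1.19, 1.15, 1.15, 1.09, 1.19, 1.12, 1.14])
b = np.array([1.13, 1.19, 1.16, 1.16, 1.15, 1.20, 1.19, 1.16, 1.19, 1.21])

# Rough checks of the t-test assumptions (little power with n = 10 per group)
w_a, p_norm_a = stats.shapiro(a)      # normality of group A
w_b, p_norm_b = stats.shapiro(b)      # normality of group B
stat, p_var = stats.levene(a, b)      # equality of variances

print(f"Shapiro-Wilk: p = {p_norm_a:.2f} (A), p = {p_norm_b:.2f} (B)")
print(f"Levene test for equal variances: p = {p_var:.2f}")
```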

Table
Fictitious data of subjects treated by drugs A and B

Creatinemia
  Drug A: 1.14  1.13  1.11  1.19  1.15  1.15  1.09  1.19  1.12  1.14
  Drug B: 1.13  1.19  1.16  1.16  1.15  1.20  1.19  1.16  1.19  1.21

Score
  Drug A and Drug B: arbitrary integer scores for the same subjects [individual values not recoverable from the scanned source]
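As an illustration (not code from the original paper), the creatinemia comparison in the Table can be reproduced with standard routines; the arbitrary scores are not reproduced here, so only the creatinemia analysis is shown. The unpaired t-test gives a p-value close to the 0.020 quoted in the text, while the rank-based Mann-Whitney (Wilcoxon rank sum) test illustrates the non-parametric alternative discussed below.

```python
import numpy as np
from scipy import stats

a = np.array([1.14, 1.13, 1.11, 1.19, 1.15, 1.15, 1.09, 1.19, 1.12, 1.14])  # drug A
b = np.array([1.13, 1.19, 1.16, 1.16, 1.15, 1.20, 1.19, 1.16, 1.19, 1.21])  # drug B

t = stats.ttest_ind(a, b)                                  # assumes normality, equal variances
w = stats.mannwhitneyu(a, b, alternative="two-sided")      # rank-based alternative

print(f"means: A = {a.mean():.3f}, B = {b.mean():.3f}")
print(f"t-test:    p = {t.pvalue:.3f}")     # about 0.02: statistically significant
print(f"rank test: p = {w.pvalue:.3f}")
# A p-value below 0.05 only says a difference is probable,
# not that a 0.03 difference in creatinemia matters clinically.
```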

Performing the test on paired samples, or on independent but small samples (say, fewer than 30 observations per group) when the variable does not respect both cited conditions at least approximately, is therefore misleading, and any conclusion drawn from its result is completely spurious. This misuse of the t-test can be found in many papers (3, 22), although a correct alternative, Wilcoxon's rank sum test, could easily be applied in most cases of independent samples, as shown by the following example. Suppose that we have to compare the values of the arbitrary scores observed in the two groups of 10 patients described before (see Table). This variable is not continuous and does not follow a normal distribution. Student's t-test is therefore a bad choice for testing the null hypothesis of equality of the two unknown population means after drugs A and B. Its result would give a p-value of 0.116, leading to the acceptance of the null hypothesis of equality, whereas this hypothesis would be rejected by a correct Wilcoxon's rank sum test, with p = 0.026. Another and probably even more common fallacy is the multiple use of statistical tests. Indeed, as discussed above, the p-value measures the probability of the false conclusion of rejecting the null hypothesis. When we perform a single test for comparing two means, for instance, the classical definition of statistical significance at p < 0.05 means that the probability of a false positive conclusion is smaller than 5% for this test. This probability inflates considerably with the number of tests. If an investigator applies 100 statistical tests in a given study, (s)he will on average obtain 5 so-called significant differences even if no differences exist in the studied populations. Ottenbacher (30) has shown that, in 5 randomly selected issues of the New England Journal of Medicine, the proportion of results labelled as statistically significant that are likely to be chance findings was near 20%. The incidence of type I errors, that is, the number of so-called statistically significant erroneous conclusions, is thus much higher than most researchers and clinicians realize (30).
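The inflation of the overall risk of a false positive conclusion can be made explicit for the idealized case of independent tests (an added illustration; real analyses on correlated data behave less predictably): with k independent tests each performed at the 0.05 level and all null hypotheses true, the probability of obtaining at least one 'significant' result is 1 - 0.95^k.

```python
# Family-wise probability of at least one false positive when all null
# hypotheses are true and k independent tests are each run at alpha = 0.05.
alpha = 0.05
for k in (1, 5, 10, 20, 100):
    familywise = 1 - (1 - alpha) ** k
    expected_false_pos = alpha * k
    print(f"k = {k:3d}: P(at least one false positive) = {familywise:.2f}, "
          f"expected false positives = {expected_false_pos:.1f}")
# With 100 tests, about 5 'significant' results are expected even if
# no real differences exist, as noted in the text.
```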




This problem of multiple testing becomes even more acute when the data used in the various tests are not independent, because of the complex relationship between the error rates in this situation. Usual methods like χ²- or t-tests must then be avoided. Three particular forms of this problem may be distinguished. 1) Comparisons between several group means must be made by procedures such as analysis of variance followed by appropriate tests for individual comparisons. The erroneous use of multiple t-tests has nevertheless been found in 44%, 54% and 61% of papers comparing several group means published in Circulation Research (21), the New England Journal of Medicine (4) and Circulation (21) respectively. 2) Multiple looks at the data to see whether a difference has already emerged, for instance in clinical trials, must also be forbidden; adequate methods for such a strategy are available. The investigator who performs a repeated t-test in this way uses the same logic as the bettor who declares that the race is over when his or her horse is ahead! (31). 3) The last form of the problem of multiplicity is the statistical testing of post hoc hypotheses, that is, hypotheses derived from the data. Usual procedures must always be performed to test a precise a priori hypothesis (11, 23), not to screen among a lot of results to discover something new, which is in fact equivalent to testing all possible hypotheses simultaneously. Clinicians must recognize that defining a hypothesis by scrutinizing all observations and then trying to test its plausibility on the same set of data is conceptually absurd. More generally, there is in the medical literature an excessive use of hypothesis tests with an overemphasis on p-values. Authors often give insufficient attention to estimating the magnitude of their results (proportions, means, indices, ...) or the differences between them, and to studying their variability and the risk of bias (32, 33). Only a few hypotheses specified in advance should be tested (33-35). As noted by Gardner & Altman (32), 'The implication of hypothesis testing-that there can always be a simple 'yes' or 'no' answer as the fundamental result from a medical study-is clearly false and used in this way is of limited value'. Moreover, the convention of the 0.05 level of significance, generally attributed to R. A. Fisher, is arbitrary. Considering this level as a cut-off barrier is nonsense, and any suggestion that a p-value of 0.04 denotes a real difference between two means whereas a value of 0.06 can be ignored is absurd.
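Gardner & Altman's recommendation to report estimates with confidence intervals rather than bare p-values can be illustrated on the creatinemia example (a sketch under the same normal-theory, equal-variance assumptions as the t-test, added here for illustration): the interval conveys both the size of the difference and its uncertainty.

```python
import numpy as np
from scipy import stats

a = np.array([1.14, 1.13, 1.11, 1.19, 1.15, 1.15, 1.09, 1.19, 1.12, 1.14])
b = np.array([1.13, 1.19, 1.16, 1.16, 1.15, 1.20, 1.19, 1.16, 1.19, 1.21])

diff = b.mean() - a.mean()
na, nb = len(a), len(b)
# Pooled-variance standard error of the difference (same assumptions as the t-test)
sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
se = np.sqrt(sp2 * (1 / na + 1 / nb))
tcrit = stats.t.ppf(0.975, df=na + nb - 2)

print(f"difference = {diff:.3f}, "
      f"95% CI = ({diff - tcrit * se:.3f}, {diff + tcrit * se:.3f})")
# About 0.03 with a 95% CI of roughly 0.006 to 0.060: statistically 'significant',
# but the whole interval corresponds to a clinically negligible difference.
```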

Philosophy of statistical analysis

Unfortunately, for many clinicians, statistics is not the science of analyzing data in the most objective way, but a demonstration exercise, intended merely to illustrate a preconceived truth and to convince others of its validity (36). Statistics thus becomes an obligatory and ritual process, closer to magic than to science, but with an 'aura of mathematical proof' (37). Fisher himself came to deplore how often his methods were applied thoughtlessly as cookbook solutions (38).

It is the 'hunting of p-values' (39) considered as 'passports to publication' (40). Where does this distortion originate? I think the responsibility must be shared between clinicians, journal editors and statisticians. Medical researchers are of course the main persons involved, because of the ignorance, at least in many of them, of basic statistical concepts and their stubbornness in privileging, in the words of Mainland (41), statistical arithmetic rather than statistical thinking. This fallacy is aggravated by editors of medical journals who insist that no paper is complete without a collection of asterisks denoting significance (42). But this distortion is also the consequence of some inconsistencies in the theory of hypothesis tests, which are the responsibility of statisticians. The problem of multiple testing discussed above is a good example of such an inconsistency. Suppose a researcher X conducts a clinical trial on treatments A, B and C while another scientist Y is comparing only A and B. Why should investigator X be punished with an adjustment to the p-value which is not imposed on his or her colleague? Why, with the same data, should X conclude that A and B do not differ significantly while Y could reach the 0.05 level of significance (35, 43)? In fact, the theory of hypothesis tests, developed mainly by Neyman and Pearson between the two World Wars, is not the only approach to statistical inference. Alternative theories exist, the most studied of these being the so-called Bayesian inference, which consists of calculating a posterior probability given the observed data and prior information. This approach does not have the same limitations as the classical theory, but its use entails other difficult problems. Despite some weaknesses, hypothesis testing remains today a useful approach for analyzing the results of a medical study and allowing a more objective look at the data. But it must be remembered that this purpose can only be reached by a proper use of the tests, which means, on the one hand, the choice of a valid method for the given problem, and on the other, a correct interpretation of the test result.
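As a minimal illustration of the Bayesian idea mentioned above (an addition with fictitious numbers and an assumed uniform Beta(1, 1) prior), the posterior distribution of a response rate after binomial data is available in closed form, and statements such as 'the probability that the response rate exceeds 50% given the data' follow directly from it.

```python
from scipy import stats

# Fictitious example: 14 responses observed among 20 treated patients.
responses, n = 14, 20

# Beta(1, 1) prior (uniform) combined with binomial data gives a Beta posterior.
posterior = stats.beta(1 + responses, 1 + n - responses)

print(f"posterior mean response rate: {posterior.mean():.2f}")
print(f"95% credible interval: {posterior.ppf(0.025):.2f} to {posterior.ppf(0.975):.2f}")
print(f"P(response rate > 0.5 | data): {1 - posterior.cdf(0.5):.2f}")
```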

REFERENCES

1. Neyman J. Statistics-servant of all sciences. Science 1955; 122: 401-6.
2. Editorial. A pillar of medicine. JAMA 1966; 195: 1145.
3. Gore SM, Jones IG, Rytter EC. Misuse of statistical methods: critical assessment of articles in BMJ from January to March 1976. Br Med J 1977; 1: 85-7.
4. Godfrey K. Comparing the means of several groups. N Engl J Med 1985; 313: 1450-6.
5. Altman DG. Statistics in medical journals. Stat Med 1982; 1: 59-71.
6. Johnson T. Statistical guidelines for medical journals. Stat Med 1984; 3: 97-9.
7. Murray GD. Statistical aspects of research methodology. Br J Surg 1991; 78: 777-81.



8. Murray GD. The task of a statistical referee. Br J Surg 1988; 75: 664-7.
9. Altman DG. Statistics and ethics in medical research. Misuse of statistics is unethical. Br Med J 1980; 281: 1182-4.
10. Altman DG, Gore SM, Gardner MJ, Pocock SJ. Statistical guidelines for contributors to medical journals. Br Med J 1983; 286: 1489-93.
11. Bailar JC III, Mosteller F. Guidelines for statistical reporting in articles for medical journals. Amplifications and explanations. Ann Int Med 1988; 108: 266-73.
12. Murray GD. Statistical guidelines for the British Journal of Surgery. Br J Surg 1991; 78: 782-4.
13. Altman DG. Statistics in medical journals: developments in the 1980s. Stat Med 1991; 10: 1897-913.
14. Gardner MJ, Altman DG, Jones DR, Machin D. Is the statistical assessment of papers submitted to the 'British Medical Journal' effective? Br Med J 1983; 286: 1485-8.
15. Morris RW. A statistical study of papers in the Journal of Bone and Joint Surgery [Br] 1984. J Bone Joint Surg [Br] 1988; 70-B: 242-6.
16. Schor S, Karten I. Statistical evaluation of medical journal manuscripts. JAMA 1966; 195: 145-50.
17. Marks RG, Dawson-Saunders ER, Bailar JC, Dan BB, Verran JA. Interactions between statisticians and biomedical journal editors. Stat Med 1988; 7: 1003-11.
18. Gardner MJ, Bond J. An exploratory study of statistical assessment of papers published in the British Medical Journal. JAMA 1990; 263: 1355-7.
19. Moses L, Louis TA. Statistical consulting in clinical research: the two-way street. Stat Med 1984; 3: 1-5.
20. Finney DJ. The questioning statistician. Stat Med 1982; 1: 5-13.
21. Glantz SA. Biostatistics: how to detect, correct and prevent errors in the medical literature. Circulation 1980; 61: 1-7.
22. White SJ. Statistical errors in papers in the British Journal of Psychiatry. Br J Psychiat 1979; 135: 336-42.
23. Bailar JC III. Science, statistics and deception. Ann Int Med 1986; 104: 259-60.
24. Weiss ST, Samet JM. An assessment of physician knowledge of epidemiology and biostatistics. Am J Med Educ 1980; 55: 692-703.
25. Berwick DM, Fineberg HV, Weinstein MC. When doctors meet numbers. Am J Med 1981; 71: 991-8.


26. Wulff HR, Andersen B, Brandenhoff P, Guttler F. What do doctors know about statistics? Stat Med 1987; 6: 3-10.
27. Diamond GA, Forrester JS. Clinical trials and statistical verdicts: probable grounds for appeal. Ann Int Med 1983; 98: 385-94.
28. Detsky AS, Sackett DL. When was a 'negative' clinical trial big enough? How many patients you needed depends on what you found. Arch Int Med 1985; 145: 709-12.
29. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial. Survey of 71 'negative' trials. N Engl J Med 1978; 299: 690-4.
30. Ottenbacher KJ. Scientific vs statistical inference: the problem of multiple contrasts in clinical research. Am J Med Sci 1988; 295: 172-7.
31. Silverman WA. Human experimentation: a guided step into the unknown. New York: Oxford University Press, 1985 (cited in 33).
32. Gardner MJ, Altman DG. Confidence intervals rather than p-values: estimation rather than hypothesis testing. Br Med J 1986; 292: 746-50.
33. Pocock SJ, Hughes MD, Lee RJ. Statistical problems in the reporting of clinical trials. A survey of three medical journals. N Engl J Med 1987; 317: 426-32.
34. Roebruck P. Explorative statistical analysis and the valuation of hypotheses. Rev Epidem Sante Publ 1984; 32: 181-4.
35. Brown GW. P-values. Am J Dis Child 1990; 144: 493-5.
36. Silverman WA. The seeds of dualism in medical research. Controlled Clin Trials 1990; 11: 4-6.
37. Mainland D. Statistical ritual in clinical journals: is there a cure? II. Br Med J 1984; 288: 920-2.
38. Box JF. R. A. Fisher: the life of a scientist. New York: Wiley, 1978 (cited in 40).
39. Salsburg DS. The religion of statistics as practiced in medical journals. Am Stat 1985; 39: 220-3.
40. Mainland D. Statistical ritual in clinical journals: is there a cure? I. Br Med J 1984; 288: 841-3.
41. Mainland D. Medical statistics-Thinking vs arithmetic. J Chron Dis 1982; 35: 413-7.
42. Finney DJ. Is the statistician still necessary? Biom Praxim 1989; 29: 135-46.
43. O'Brien PC. The appropriateness of analysis of variance and multiple-comparison procedures. Biometrics 1983; 39: 787-8.
