JOURNAL OF BONE AND MINERAL RESEARCH Volume 6, Number 2, 1991 Mary Ann Liebert, Inc., Publishers

Editorial

How Do We Know What We Know? The Randomized Controlled Trial Revisited

ROBERT P. HEANEY

When two factors in our clinical experience are temporally linked, there is a tendency to conclude that one is the cause of the other. Sometimes it is, but often it is not. If we mistakenly conclude that one of two associated events is the cause of the other, we have fallen victim to the post hoc fallacy. Conclusion drawing in certain forms of clinical investigation, specifically the case-control and cohort designs, is always prey to this fallacy. The double-blind, placebo-controlled, randomized trial has been developed precisely to circumvent this trap. It is the only design that permits strong causal inferences, and for that reason it is commonly referred to as a "strong design." It is not a new tool, but it has assumed new prominence in recent years. Both the Canadian Task Force on the Periodic Health Examination and the United States Preventive Services Task Force (1988) have formally ranked it ahead of all other investigative designs in terms of the weight it carries.(1) FDA practice now requires that claims of efficacy be based upon two distinct randomized controlled trials.(2) Journals and monographs are increasingly filled with reports of such studies and with calls for increased use of this design.(3-6)

Although any move toward scientific rigor must be considered positive overall, this emphasis on the randomized controlled trial carries some dangers that are apparently inadequately appreciated. The first is that we may be bullied into thinking that the randomized trial is the only way to know anything, when clearly this is not so. The second is a failure to appreciate that the potential inferential power of a randomized controlled trial can easily be dissipated, either by design mistakes or by untoward factors beyond investigative control. When this happens the results of a trial are no longer persuasive; they may, in fact, be less persuasive than results from an inherently weaker design. Students of study design recognize this,(6,7) but it is an uncomfortable truth and many clinical investigators seem unaware of it. Unfortunately, controlled trials can be so expensive and so draining that neither the sponsoring agency nor the investigative team may readily accept an inconclusive outcome from such a study for what it is.

Donald Mainland many years ago(8) provided a set of four criteria to help us both to design good trials and to decide whether we find their results persuasive. Two of these, the use of contrast groups and randomization throughout, are straightforward and present no great difficulty, even though they may not always be scrupulously employed. The other two present special problems, however. They are what Mainland called "avoidance of interference" and "complete follow-through." They deal with problems that, although not peculiar to randomized trials, may be most dangerous with this design.

Experienced clinical investigators recognize that investigation itself has a powerful effect both on human performance and on the course of clinical illness. Sometimes (often, actually) the interference is positive and usually goes by the name of the "placebo effect." This is most apt to be the case when the subject is sick and the trial involves potential relief. Sometimes the interference produces a negative change, however. (An analog is the malign effect of curses in cultures that we might consider primitive or superstitious.) The point is that the very process of looking, of making observations during a trial with all their attendant procedures and precautions, affects what we are trying to measure. It is a kind of clinical investigative analog of Heisenberg's uncertainty principle. What Mainland meant by "avoidance" of interference in investigation was not so much that we could totally eliminate such effects as that we must design the investigation to avoid adding them inadvertently, and most importantly the design must equalize the interference across the contrast groups. This is what the double-blind accomplishes. The same kind of investigation-related interference can occur in a concurrent cohort study as well, but it looms largest in the randomized controlled trial. This is a little-appreciated facet of what has become the gold standard of clinical investigation.

Creighton University, Omaha, Nebraska.


The consequence, however, seems hard to escape: by introducing random variation due to the investigation itself, the randomized controlled trial actually can make it harder to see real effects or discern real differences. In other words, there can be a trade-off: for a potential gain in inferential power we may have to pay with a potential loss of observational power. Clinical investigators exhibit surprising resistance to accepting the reality of this trade-off, even to the point of flatly asserting that any loss of observational power must mean that a study has been badly conducted. This resistance may be a reflection of the widespread ignorance in the health professions about the real power of the placebo effect.(9)

The second problem is loss of sampling units from a randomized controlled trial, what Mainland called "incomplete follow-through." Current and timely examples are found in the recently published results of the trials of fluoride and etidronate in the treatment of osteoporosis.(10-12) The fluoride trial was a 4-year prospective protocol, impeccably designed to equalize investigative interference. Unfortunately, by the end of the study better than 30% of the sampling units had been lost from both the treatment and the placebo groups. The etidronate study of Storm et al., lasting 3 years, had a loss of sampling units of nearly 40%. Losses from the study of Watts et al. were less severe; still they amounted to 14% of all patients starting the study, approximately evenly divided across all treatment groups. The smaller loss rate in this study is probably related to the shorter study length (2 years).

Such losses do more than reduce the power of a study to see real effects: they open the door for a variety of biases. For example, it would be entirely plausible to propose, from what is known both about fluoride and about osteoporosis, that the losses from the control group of the Mayo study were due to disease progression (thus with only the less severe cases staying in the study) and that the losses from the treatment group were due to toxic effects associated with therapeutic response (thus with only the less responsive cases staying in the study). This is speculation, of course, not very farfetched, but still speculation. One of the benefits of a properly executed randomized controlled trial is that it is supposed to remove grounds for such speculation. At the very least, therefore, the loss of sampling units destroys that power.

Mainland insisted that missing sampling units had to be counted both ways, as successes and as failures. If doing this affected the conclusions, then the study would have to be judged inconclusive. In the case of the Mayo study, depending upon how one counted the missing sampling units, the results could indicate either that fluoride was decidedly helpful or that it was harmful. Applying the same treatment to the data of Watts et al. shows that the 14% losses from the study were not severe enough to obliterate the apparent benefit on spine mineral density. However, if even a few of the patients lost from the etidronate treatment groups had developed further (but, of course, uncounted) fractures, the apparent fracture benefit of the agent could disappear.
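Mainland's both-ways accounting amounts to a simple bounding exercise. The sketch below uses purely hypothetical counts (not data from any of the trials discussed): each arm's dropouts are counted first as failures and then as successes, and the resulting range of possible treatment effects is reported. If that range spans zero, the conclusion depends entirely on how the missing units are counted and the trial must be judged inconclusive.

```python
# Illustrative sketch of Mainland's rule of counting missing sampling units
# both ways. All counts below are hypothetical.

def both_ways_bounds(successes, completers, dropouts):
    """Return (lowest, highest) possible success proportions for one arm,
    counting every dropout first as a failure, then as a success."""
    n = completers + dropouts
    worst = successes / n                  # all dropouts counted as failures
    best = (successes + dropouts) / n      # all dropouts counted as successes
    return worst, best

# Hypothetical arm counts: successes among completers, completers, dropouts
treated = both_ways_bounds(successes=40, completers=70, dropouts=30)
control = both_ways_bounds(successes=30, completers=65, dropouts=35)

# Extreme bounds on the treatment-minus-control difference in success rates
diff_low = treated[0] - control[1]    # treatment worst case vs. control best case
diff_high = treated[1] - control[0]   # treatment best case vs. control worst case

print(f"treated success rate: {treated[0]:.2f} to {treated[1]:.2f}")
print(f"control success rate: {control[0]:.2f} to {control[1]:.2f}")
print(f"possible difference:  {diff_low:+.2f} to {diff_high:+.2f}")

# By Mainland's criterion, if this interval spans zero, that is, if the
# conclusion flips depending on how the missing units are counted,
# the trial must be judged inconclusive.
```

With these made-up numbers the possible difference runs from about -0.25 to +0.40, so the same raw data could be read as showing benefit or harm, which is precisely the situation described above for the fluoride trial.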

The point of these comments is not to disparage any of these studies, each of which was well designed and conducted by competent investigators. At least two of the studies had, as well, ongoing design interactions with the regulatory agency (FDA) that would ultimately have to evaluate the evidence the studies produced. Rather, the point is to emphasize that even under these generally favorable conditions this most favored of designs may produce results that are less persuasive, less strong, than our naive confidence in the design had perhaps led us to expect. More to the point: a substantial loss of sampling units is an entirely predictable occurrence in studies that must extend over several years and that involve disabled, older persons, no matter what one does. Given this fact, then, we must be certain in planning investigations that we have designed ways to handle this problem. At the very least we ought to question whether a design whose strength is so dependent upon minimal losses (among other things) is actually the best choice for our research situation, particularly when it can also be predicted to add investigative noise.

What, if anything, are the available remedies? For one thing, within the randomized controlled trial the loss of sampling units can be minimized by such devices as maintaining close personal contact with the subjects, even to visiting them in their homes (or home towns) if necessary, and then, failing that, by attempting total ascertainment of the study outcomes in the units who dropped out. Both steps are necessary, and unfortunately both are costly. None of the three osteoporosis reports mentions whether these steps were taken in their respective studies.

More commonly, investigators analyze lost sampling units to see if they differ from those who stayed in the study on any of several grounds ascertained at entry, such as age, sex, socioeconomic status, comorbidity, and severity of disease on entry. On finding no differences, they conclude that the losses did not bias the study. This approach, although well intentioned, is seriously flawed on at least three counts. First, the numbers concerned are almost never large enough to permit finding a difference that might well be important. Second, the very selection of a no-difference value for the null hypothesis relating to these lost units is itself a bias. Finally, the features compared are not usually germane to the original hypothesis being tested. (One may as well have tested whether birth months were distributed similarly in both groups.) On one crucial point the sampling units lost from a controlled trial are incontrovertibly different: they dropped out, and the others stayed in. To assume that those decisions were random and unrelated to the study is nothing more than wishful thinking. No investigative design, the conclusions of which end up being based on wishful thinking, can possibly be considered "strong."
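The first of these three flaws, inadequate numbers, is easy to put in concrete terms. The sketch below uses purely hypothetical figures (30 dropouts compared with 70 completers on a single baseline variable) and a normal-approximation power calculation to show how often a real difference of a given size would actually be detected; a finding of "no difference" from such a comparison therefore carries very little reassurance.

```python
# Illustrative sketch with hypothetical numbers: the power of a typical
# dropout-versus-completer comparison on a baseline characteristic.
from math import erf, sqrt

def normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_sample_power(delta, sd, n1, n2):
    """Approximate power of a two-sided z-test (alpha = 0.05) to detect a true
    mean difference `delta` between groups of size n1 and n2 with common SD `sd`."""
    z_crit = 1.96
    se = sd * sqrt(1.0 / n1 + 1.0 / n2)
    return normal_cdf(delta / se - z_crit) + normal_cdf(-delta / se - z_crit)

# Hypothetical: 30 dropouts vs. 70 completers compared on baseline age (SD 8 years).
# Even a 3-year true difference in mean age is detected less than half the time.
for delta in (1, 3, 5, 8):
    p = two_sample_power(delta=delta, sd=8.0, n1=30, n2=70)
    print(f"true mean age difference {delta} y -> approximate power {p:.2f}")
```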


Nonconcurrent cohort designs, like randomized trials, sort the sampling units into contrast groups by exposure or treatment. Such studies have an advantage over randomized trials in that there is much reduced opportunity for investigative interference. Unlike experiments, however, the exposure is not randomly assigned, and that is a weakness. The reason it is a weakness is that treatment decisions are often conditioned by patient characteristics that may be importantly related to study outcomes, a problem that randomization normally handles. But randomization is not a magical incantation, not a black box. It serves two well-defined purposes which, to a substantial extent, can be dealt with in other ways. It is important to recognize that.

First, randomization limits the expression of the various forms of bias that might otherwise shift more of those subjects who will have better outcomes into one or another of the treatment groups. We can never know all those biases, but many of them have names, have been studied, and are reasonably well understood.(7,13) "Selection bias," "protopathic bias," "susceptibility bias," and "admission rate bias" are some of them. We can use knowledge of these effects to control admission of patients into the study samples, that is, selection of those members of the treated or untreated groups whose outcomes we shall study. In this way we can anticipate and thus thwart the biases concerned.

The second function of randomization is more problematic. Conclusion drawing from samples, or from differences between samples (for example, confidence limits), is based upon knowledge of random processes and the laws that describe them. Nonrandom assignments provide no such basis for inference. However, even truly random samples that have lost sampling units suffer precisely the same disability, so there is little basis for choosing between the two designs on this account.

When all is said and done there is no all-purpose, perfect design. All have weaknesses, and all involve compromises and risks. A design that seems weaker, a priori, may be the best that can realistically be employed in some situations. It would be well to recognize this, and to exercise some modest skepticism in regard to the reputed power of the randomized controlled trial. Standards of evidence need to be tailored to the situation in which efficacy is being evaluated.

There is a final, ironic twist to the story of the fluoride trial. Based on the evidence available when the trial was planned, the RFA called for a dose of sodium fluoride of 75 mg/day. In the more than 10 years since those specifications were written, a scientific consensus has developed to the effect that the optimal dosage is actually between 25 and 50 mg/day.(14) This amounts to just about half the dosage mandated in the NIH-sponsored trial. The 75 mg is now widely considered excessive, even toxic. How is it that we know this? There has been no randomized, controlled trial dealing with this issue. The answer is partly the shared collective experience of the community of clinical investigators working with fluoride for the past 20 years, as well as data from many other sources, such as pharmacokinetic studies. Apparently we are willing to accept other sorts of evidence after all.

REFERENCES

1. Battista RN, Fletcher SW 1988 Making recommendations on preventive practices: Methodological issues. Am J Prev Med 4:53-76.
2. 21 Code of Federal Regulations Ch. 1, 314.126 (4/1/89 edition).
3. Robin ED 1984 Matters of Life & Death: Risks vs. Benefits of Medical Care. W.H. Freeman and Company, New York.
4. Detsky AS 1989 Are clinical trials a cost-effective investment? JAMA 262:1795-1800.
5. Sackett DL 1986 Rules of evidence and clinical recommendations on the use of antithrombotic agents. Chest 89:2S-3S.
6. Fletcher RH 1989 The costs of clinical trials (editorial). JAMA 262:1842.
7. Feinstein AR 1989 Epidemiologic analyses of causation: The unlearned scientific lessons of randomized trials. J Clin Epidemiol 42:481-489.
8. Mainland D 1963 Elementary Medical Statistics, 2nd ed. W.B. Saunders, Philadelphia.
9. Goodwin JS, Goodwin JM, Vogel AV 1979 Knowledge and use of placebos by house officers and nurses. Ann Intern Med 91:106-110.
10. Riggs BL, Hodgson SF, O'Fallon WM, Chao EYS, Wahner HW, Muhs JM, Cedel SL, Melton LJ III 1990 The effect of fluoride treatment on vertebral fracture rate in osteoporotic women. N Engl J Med 322:802-809.
11. Watts NB, Harris ST, Genant HK, Wasnich RD, Miller PD, Jackson RD, Licata AA, Ross P, Woodson GC, Yanover MJ, Mysiw J, Kohse L, Rao MB, Steiger P, Richmond B, Chesnut CH III 1990 Intermittent cyclical etidronate treatment of postmenopausal osteoporosis. N Engl J Med 323:73-79.
12. Storm T, Thamsborg G, Steiniche T, Genant HK, Sørensen OH 1990 Effect of intermittent cyclical etidronate therapy on bone mass and fracture rate in women with postmenopausal osteoporosis. N Engl J Med 322:1265-1271.
13. Feinstein AR 1985 Clinical Epidemiology: The Architecture of Clinical Research. W.B. Saunders, Philadelphia.
14. Heaney RP, Baylink DJ, Johnston CC Jr, Melton LJ III, Meunier P, Murray TM, Nagant de Deuxchaisnes C 1989 Fluoride therapy for vertebral crush fracture syndrome: Status report 1988. Ann Intern Med 111:678-680.

Address reprint requests to:
Robert P. Heaney, M.D.
Creighton University
2500 California Street
Omaha, NE 68178

Received for publication April 30, 1990; in revised form August 15, 1990; accepted September 12, 1990.
