COMMENTARY

Statistics Commentary Series, Commentary #2: Noninferiority Trials

David L. Streiner, PhD, CPsych

From the Department of Psychiatry and Behavioral Neurosciences, McMaster University, Hamilton, and the Department of Psychiatry, University of Toronto, Toronto, Ontario, Canada. Reprints: David L. Streiner, PhD, CPsych, St. Joseph's Healthcare, West 5th Campus, 100 West 5th Street, Box 585, Room B-366, Hamilton, Ontario, Canada L8N 3K7.

When we think of a randomized controlled trial, what usually comes to mind is giving 1 group of patients some new wonder drug that promises to cure the world of cancer, warts, and the heartbreak of psoriasis, whereas the other group gets a placebo or treatment as usual. At the end of the study, we make sacrifices to the gods so that the statistics will show how much better the new agent is in terms of fewer deaths, fewer serious adverse effects, greater remission of symptoms, improved quality of life, and so on (not to mention a much greater cost). That is, most studies, including those assessing pharmaceuticals, are superiority trials, aimed at showing that one treatment is better than another treatment or than no treatment at all. There are times, though, when we would be content to demonstrate that the new agent does not differ too much from an already existing one. The primary arena in which this applies is the development of a generic or "me too" version of a drug, although the design has also been used for other treatments. For example, the aim of 1 study was to show that cognitive behavior therapy delivered via the Internet was no worse for treating bulimia nervosa than face-to-face therapy, which is much more expensive.1

Before we get into the details, let us differentiate between 2 types of trials that do not try to show superiority: equivalence studies and noninferiority studies. What is surprising about these terms is that they perfectly describe what they are trying to show (unlike other terms in research, such as "missing at random," which does not mean missing at random). An equivalence study is like Goldilocks' criterion for the perfect porridge: neither too hot nor too cold.b The aim is to demonstrate that the results from the new agent are neither higher nor lower than those from the standard agent by more than a trivial amount. This is most commonly done within the context of bioavailability studies, where we do not want serum levels of some new drug, such as a generic version of a serotonin-norepinephrine reuptake inhibitor, to be either too much higher or too much lower than those of the existing form: too high and the formulation can be toxic, too low and it would be ineffective. The purpose of a noninferiority study, as the name implies, is to show that the new form of the treatment is not any worse than the existing form. We do not care if it is superior; if it is, then that is marvelous, but it is not the aim of the study (which is all to the good, because that outcome occurs about as often as politicians keep campaign promises).

However, trying to show noninferiority (or equivalence, for that matter) raises a difficulty that is partly statistical and partly philosophical: how do we prove the nonexistence of something, in this case, the nonexistence of a difference? We are in essence trying to prove the null hypothesis (H0), and, as we learned in our introductory statistics class, we can never do that. In a traditional study, if the P level is greater than 0.05, we do not say that we proved H0, only that we failed to reject it. This bit of circumlocution is not just semantic hair splitting but goes to the heart of the philosophy of science. The Scottish philosopher David Hume (1711-1776) asserted that it is impossible to prove the nonexistence of something. No one has ever seen a unicorn, but that does not prove they do not exist2; one can walk out of the forest tomorrow, just as the olinguito was recently discovered in the forests of Ecuador and Colombia.
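To make the equivalence logic concrete, the sketch below implements the "two one-sided tests" (TOST) procedure that is commonly used for bioavailability questions of this kind. The commentary itself does not prescribe any particular procedure, so take this as one standard approach rather than the author's method; the sample sizes, serum-level distributions, and margin of 5 units are all illustrative assumptions.

```python
# A minimal sketch of the "two one-sided tests" (TOST) procedure for
# equivalence, assuming a continuous outcome (eg, serum level of a drug),
# normal theory, and a prespecified equivalence interval of +/- delta.
# All numbers below are illustrative, not taken from the commentary.
import numpy as np
from scipy import stats

def tost_equivalence(new, std, delta, alpha=0.05):
    """Declare equivalence only if BOTH one-sided tests reject at level alpha:
    H0a: mean(new) - mean(std) <= -delta   (new is too low)
    H0b: mean(new) - mean(std) >= +delta   (new is too high)
    """
    n1, n2 = len(new), len(std)
    diff = np.mean(new) - np.mean(std)
    # Pooled-variance standard error of the difference in means
    sp2 = ((n1 - 1) * np.var(new, ddof=1)
           + (n2 - 1) * np.var(std, ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    df = n1 + n2 - 2
    p_low = 1 - stats.t.cdf((diff + delta) / se, df)   # reject H0a if small
    p_high = stats.t.cdf((diff - delta) / se, df)      # reject H0b if small
    return max(p_low, p_high) < alpha

# Hypothetical generic-vs-standard serum levels, equivalence margin of 5 units
rng = np.random.default_rng(0)
generic = rng.normal(100, 15, 200)
standard = rng.normal(100, 15, 200)
print(tost_equivalence(generic, standard, delta=5.0))
```

Note the asymmetry with an ordinary t test: here, a small P value on both sides is evidence that the true difference lies inside the interval, not outside it.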
The lack of evidence of a difference is not the same as evidence of a lack of a difference. For example, there were many negative trials of oral β-blockers to prevent death after a myocardial infarction. Those did not prove that the drugs are not useful, only that the studies were flawed in some way; in this case, they were underpowered, as later meta-analyses showed the class to be highly effective.3

The resolution of this dilemma rests on 2 facts. First, not all differences are clinically important. A person can score 30 on a scale of depression one day and, a week later, have a score of 29. We would be hard-pressed to say that the person has improved and would be more inclined to attribute the change to measurement error of the instrument itself or to trivial fluctuations in the person's mood. Only when the difference exceeds some value, usually approximately one half of the SD,4,5 do we sit up and acknowledge it as important. The second fact is that H0 does not necessarily need to refer to a difference of zero. Instead, we can choose some margin, Δ, small enough that any difference within it is clinically trivial, and reverse the usual roles of the hypotheses: H0 now states that the new treatment is worse than the standard one by Δ or more, and H1 states that it is not.6,7 Rejecting this H0 is a perfectly legitimate statistical conclusion, and it is exactly what a noninferiority trial sets out to do.

If we set the α level at 0.05 and β at 0.10, how many participants would we need for a noninferiority trial? The answer depends on the value we select for Δ, and the results are shown in Figure 2. As you can see, as the interval gets smaller, the required sample size grows rapidly. This is the first place where some mischief can creep into the design of such trials. Because those who pay for trials naturally want to keep costs as low as possible, there is a great temptation to make Δ large to keep the sample size small. For example, 1 trial of warfarin used a "relatively generous" noninferiority margin of a 2% absolute risk difference and needed a sample size of more than 3000 patients; a narrower, more realistic margin would have required more than 8000.8 This means that the consumers of these studies must read them carefully and decide whether the size of the equivalence interval is clinically meaningful or whether it exceeds a true region of indifference.
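As a rough illustration of why small margins are so expensive, the sketch below computes the standard normal-approximation sample size for a noninferiority comparison of 2 means. This is a generic formula rather than the calculation behind Figure 2, and the SD of 10 is a placeholder, not a value from the commentary.

```python
# Sketch of the usual normal-approximation sample size for a noninferiority
# comparison of 2 means:
#   n per group = 2 * sigma^2 * (z_alpha + z_beta)^2 / Delta^2,
# assuming the true difference is zero and a one-sided test of the reversed
# H0. Because n grows as 1/Delta^2, halving the margin quadruples n.
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.90):
    z_alpha = norm.ppf(1 - alpha)  # one-sided, per the reversed H0
    z_beta = norm.ppf(power)       # power = 1 - beta = 0.90
    return 2 * (sigma * (z_alpha + z_beta) / delta) ** 2

sigma = 10.0  # placeholder SD of the outcome measure
for delta in (5.0, 2.5, 1.0, 0.5):
    print(f"Delta = {delta:4}: n per group ~ {n_per_group(delta, sigma):,.0f}")
```

Running this gives roughly 69, 274, 1713, and 6853 participants per group, which reproduces the qualitative behavior of Figure 2: shrinking the margin makes the trial dramatically larger.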

b From the internationally known children's tale "Goldilocks and the Three Bears" (http://www.youtube.com/watch?v=atipwymJk5I).


FIGURE 2. Sample size requirements in noninferiority trials as a function of Δ.

Another point to keep in mind is the interpretation of negative findings in both superiority and noninferiority trials. The issue is that in noninferiority trials, we have to sort of stand on our heads and look into a mirror to decipher what we mean by a type 1 error, a type 2 error, and power. To quickly review what they mean in traditional studies: a type 1 error is the probability of erroneously concluding that a difference exists between the groups when, in fact, there is none; a type 2 error is the probability of concluding there is no difference when there is one; and power is the probability of detecting a difference if it is there (ie, 1 − type 2 error). But remember that in noninferiority trials, we have reversed the meanings of H0 and H1. Consequently, we need to reverse the meanings of these 3 terms. Now, a type 1 error occurs when we conclude that the 2 treatments are equivalent when, in fact, the standard treatment is better; a type 2 error is concluding that the standard therapy is better when, in truth, the 2 are equivalent; and power is the probability of concluding that the treatments are equivalent when they actually are.7 Sufficiently confused yet?

This has 2 implications for interpreting trials with negative results. First, in a superiority trial, a lack of a significant difference can mean either that the new drug is not better or that it is better but we did not have sufficient power to detect the difference. This is turned around in a noninferiority trial, where a negative finding may mean either that the new drug is actually worse or that the 2 are really equivalent but we were again underpowered to detect the fact. Because noninferiority trials require very large samples when Δ is small, underpowering can be an even more prevalent problem here.

The second implication is also the second area where mischief can creep in: the (mis)interpretation of negative results in superiority trials. A study that begins as a superiority trial and finds no difference is not the same as a noninferiority study designed to demonstrate no difference, although the vast majority of trials claiming noninferiority are just such repurposed superiority trials.9 Again, the issue is one of sample size and power. One primary reason that superiority trials fail to find a difference is that they are underpowered.5 However, as we just discussed, noninferiority trials require even larger sample sizes than superiority trials to demonstrate a lack of difference, so not finding a difference is not equivalent to demonstrating no difference. This is yet another place where readers must be alert: if a study claims to have shown noninferiority, was it designed and powered to show this, or is it merely an underpowered superiority trial in disguise?
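One common way to operationalize the reversed hypotheses, although the commentary does not spell out an analysis method, is through a one-sided confidence bound: noninferiority is declared only if the lower bound for the new-minus-standard difference stays above −Δ. The pooled-t sketch below is an illustration under that assumption, for a continuous outcome where higher scores are better.

```python
# Sketch of a noninferiority decision via a one-sided confidence bound,
# assuming a continuous outcome where higher scores are better and a
# prespecified margin Delta. Illustrative only; the commentary does not
# prescribe an analysis method.
import numpy as np
from scipy import stats

def noninferior(new, std, delta, alpha=0.05):
    """Reject H0 (new is worse than standard by Delta or more) when the
    lower one-sided 100*(1 - alpha)% confidence bound for
    mean(new) - mean(std) lies above -delta. Wrongly declaring
    noninferiority when the standard really is better is the type 1 error
    in this reversed framework."""
    n1, n2 = len(new), len(std)
    diff = np.mean(new) - np.mean(std)
    # Pooled-variance standard error of the difference in means
    sp2 = ((n1 - 1) * np.var(new, ddof=1)
           + (n2 - 1) * np.var(std, ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    lower = diff - stats.t.ppf(1 - alpha, n1 + n2 - 2) * se
    return lower > -delta
```

The design choice here mirrors the text: a wide margin (large Δ) makes the bar easy to clear, which is precisely the mischief a careful reader should watch for.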


More generally, there are a host of other reasons that superiority studies may yield nonsignificant results, such as the following:

- The patient population may be poorly selected. We found, for example, that effect sizes were smaller (ie, the intervention was less effective) when patients were selected based on the clinician's opinion about the presence of depression rather than the objective criteria from a structured interview.10
- The outcome measure could be insensitive. Emslie et al11 found that children and adolescents with depression improved significantly on fluoxetine when the outcome measure was the Children's Depression Rating Scale-Revised, but this was less evident when self-report scales were used.
- The follow-up period may be too short. Many psychiatric disorders need months or even years to fully resolve, so a lack of improvement shortly after the termination of therapy is often not a good predictor of longer-term recovery.12

And we can go on: a high dropout rate, inappropriate medication or an inadequate dosage, poor data analysis, and so forth are all reasons a study may not find a difference between groups. In other words, a badly conducted study is less likely to find true differences between groups. This should not be taken as evidence of noninferiority but of incompetence, and it is another area to which the reader must be alert.

In summary, demonstrating noninferiority is possible, but a good noninferiority trial is one that begins and ends that way; it is not a superiority trial gone bad.

AUTHOR DISCLOSURE INFORMATION

The author declares no conflicts of interest.

REFERENCES

1. Bulik CM, Marcus MD, Zerwas S, et al. CBT4BN versus CBTF2F: comparison of online versus face-to-face treatment for bulimia nervosa. Contemp Clin Trials. 2012;33:1056-1064.
2. Streiner DL. Unicorns do exist: a tutorial on "proving" the null hypothesis. Can J Psychiatry. 2003;48:756-761.
3. Antman EM, Lau J, Kupelnick B, et al. A comparison of results of meta-analyses of randomized control trials and recommendations of clinical experts. Treatments for myocardial infarction. JAMA. 1992;268:240-248.
4. Norman GR, Sloan JA, Wyrwich KW. Interpretation of changes in health-related quality of life: the remarkable universality of half a standard deviation. Med Care. 2003;41:582-592.
5. Cohen J. A power primer. Psychol Bull. 1992;112:155-159.
6. Rogers JL, Howard KI, Vessey JT. Using significance tests to evaluate equivalence between two experimental groups. Psychol Bull. 1993;113:553-565.
7. Hatch JP. Using statistical equivalence testing in clinical biofeedback research. Biofeedback Self Regul. 1996;21:105-119.
8. Head SJ, Kaul S, Bogers AJ, et al. Non-inferiority study designs: lessons to be learned from cardiovascular trials. Eur Heart J. 2012;33:1318-1324.
9. Greene WL, Concato J, Feinstein AR. Claims of equivalence in medical research: are they supported by the evidence? Ann Intern Med. 2000;132:715-722.
10. Joffe R, Sokolov S, Streiner D. Antidepressant treatment of depression: a metaanalysis. Can J Psychiatry. 1996;41:613-616.
11. Emslie GJ, Rush AJ, Weinberg WA, et al. A double-blind, randomized, placebo-controlled trial of fluoxetine in children and adolescents with depression. Arch Gen Psychiatry. 1997;54:1031-1037.
12. Simon GE. Long-term prognosis of depression in primary care. Bull World Health Organ. 2000;78:439-445.

