J Clin Epidemiol Vol. 44, No. 8, pp. 839-849, 1991
Printed in Great Britain. All rights reserved

0895-4356/91 $3.00 + 0.00
Copyright © 1991 Pergamon Press plc

METHODOLOGICAL STANDARDS FOR ASSESSING THERAPEUTIC EQUIVALENCE

BRAM KIRSHNER

Peel Health Department, 199 County Court Boulevard, Brampton, Ontario, Canada L6W 4P3

(Received in revised form 4 February 1991)

Abstract - This paper reviews issues related to defining and demonstrating therapeutic equivalence. A set of guidelines is proposed for critically reviewing clinical trials to determine whether there is sufficient evidence to conclude that an experimental therapy is therapeutically equivalent to a standard one. These guidelines include criteria for assessing whether imprecision pertaining to the measurement of outcomes impinges on the validity of an equivalence test.

Keywords: Equivalence; Critical appraisal; Validity; Type I and II error; Clinical trial; Effectiveness; Precision

INTRODUCTION

The purpose of some clinical trials is to determine whether differences in outcomes associated with two or more interventions are small enough to be considered equivalent. The most common application of this type of trial is to test whether an experimental therapy with an administrative and/or clinical benefit is comparable to a standard therapy in terms of outcome. The experimental therapy may be less expensive, less invasive and/or have some other benefit in comparison to the standard which would make its use preferable if it performed as well as, but not necessarily better than, the one currently in use. In the past, the most frequent application of this type of study has been to determine whether there is bioequivalence between alternative drug regimens. In the usual case, a drug company is interested in demonstrating that the effect of a less toxic formulation or generic product is comparable to that of the standard. As governments place increasing emphasis on reducing health care costs, controlled trials are being carried out more frequently to assess whether a standard health care intervention can be replaced by a less costly alternative. Under these circumstances, the study must demonstrate that the therapies are equivalent in terms of their effectiveness (the degree to which they do more good than harm) when administered to the same population.

When reviewing the results of published studies in health care research, it may often be difficult to distinguish therapeutic equivalence trials from traditional ones which are set up to demonstrate that an experimental treatment is more effective than an alternate one. (I will refer to the latter studies as effectiveness trials.) The reasons for this are as follows. First, it is not always clear whether the intentions of the investigator were to demonstrate that the study groups are equivalent or different; when it is the former, investigators often test the wrong null hypothesis, one which specifies that the treatments are equal rather than one which specifies that the standard treatment performs better than the experimental one(s). Second, the meaning of therapeutic equivalence may not be defined in terms of the outcome measures considered in the study, making the interpretation of statistical tests uncertain. Third, confusion may result when multiple grouping factors or outcomes are considered in the context of the same study, such that some of the statistical tests should relate to establishing a difference while others relate to equivalence. Finally, the effectiveness of the standard therapy may be in doubt, or there may not be a consensus as to which therapies are standard. Thus, what might be viewed as an equivalence trial by one clinician might not be by another.

EFFECTIVENESS VS EQUIVALENCE

The question of whether a trial should deal with the issues of effectiveness or equivalence should be based upon how the results of the study will impact on current knowledge related to therapeutic practice. Many trials which test the null hypothesis of no difference between the study groups may be viewed as equivalence trials when one considers the treatment consequences associated with a conclusion that the study groups are the same.

In order to transcend ambiguities pertaining to the investigator's intentions and/or hypothesis testing, I will use the terms validity and efficiency to distinguish between the implications of a study's findings in a post hoc sense. When addressing the issue of validity, the question we are interested in is whether a study was sufficiently rigorous to be useful in helping clinicians to decide upon changing the way in which they treat patients. In practical terms, such decisions are likely to be based upon more than one study, but in the case of large, expensive multicenter trials, one study may comprise the sole basis for making these decisions. For the purpose of this discussion, I will define an "invalid result" as acceptance of a false conclusion which supports a change in current knowledge related to therapeutic practice. The question of efficiency in a post hoc sense pertains to whether a null finding (one which has no therapeutic consequences and thus is consistent with the status quo) should be accepted or the study repeated, subject to the availability of resources. The term "inefficient result" will refer to a null result which is false.

These broad definitions of error related to validity and efficiency will be used to distinguish between equivalence trials on the one hand, and effectiveness trials on the other. It follows that in equivalence trials:

- The probability of an invalid result increases when error decreases the estimate of the true difference between the study groups; and
- The probability of an inefficient result increases when error increases the estimate of the true difference between the study groups.

An effectiveness trial is one in which the above assumptions pertaining to validity and efficiency are reversed. (Note that terminology such as α and β error or type I and type II error may confuse the issue, since these terms are used to describe error related to rejection or acceptance of a given null hypothesis, regardless of the implications of these results in practice. Furthermore, while some methodologists discuss the issue of equivalence solely in the context of type II (β) error [1-3], others suggest that an analysis of equivalence should be carried out such that type I (α) error pertains to falsely rejecting the null hypothesis that the study groups are different [4, 5].)

In the following discussion, I will argue that a unique set of issues is involved in the design and analysis of therapeutic equivalence trials which is distinct from effectiveness trials. As a result, a trial may be valid for answering the question, "Are the study groups different?", but invalid for answering whether they are the same. As a consequence, I will suggest that guidelines for critical appraisal of an equivalence study should differ in some respects from those which have been proposed for effectiveness trials by Sackett and colleagues [5]. Before describing these guidelines, I will give an example which illustrates how equivalence and effectiveness trials differ in terms of issues related to the validity of the findings.

Consider a trial by Nyren et al. [6] comparing the therapeutic benefit from antacids, cimetidine and a placebo in non-ulcer dyspepsia. The investigators point out in their preamble that standard practice for many clinicians in treating this condition involves prescribing an antacid or an H2 receptor blocker. They feel that evidence on the effectiveness of these approaches is largely based on experiences in clinical practice rather than controlled clinical trials. The primary outcome measures in this study were a pain intensity score and a pain index score. The results of the study were as follows:

"The pain index was reduced by 31 per cent with placebo; antacid and cimetidine treatment caused a 5 and 3 per cent greater reduction, respectively. The 95 per cent confidence limits for the difference between antacid and placebo were -13 and +23 per cent, and the corresponding limits for the difference between cimetidine and placebo were -14 and +21 per cent."

Results of the pain intensity score were almost identical to those for the pain index.


Let us compare how two clinicians with different starting assumptions about the effectiveness of the therapies would view this study. Suppose that Clinician A thinks that antacids and/or H2 receptor blockers are effective in reducing pain associated with non-ulcer dyspepsia, based on previous observations in clinical practice. Clinician B, on the other hand, does not prescribe these drugs routinely and is reviewing this study as a means of assessing whether he should. According to our previous definition, an invalid result for Clinician A would mean concluding that the study groups are the same when they are not. Conversely, in the case of Clinician B, an invalid result would mean concluding that the study groups are not the same when they are.

Now let us examine how the different starting assumptions of the two clinicians might influence the way in which they would critically review the results of this study. Clinician A would focus on whether there was sufficient evidence to conclude that the study groups are the same. The scientific basis for such a conclusion would depend upon rejecting the (null) hypothesis that the study groups were qualitatively different. The first point to consider is that differences exceeding 20% in the pain index scores cannot be ruled out with 95% certainty. The investigators do not discuss the clinical implications of a 21 or 23% difference in these scores. Furthermore, they do not tell us anything about the distribution of scores within each group. When using confidence intervals in place of an inferential test in equivalence trials, the investigators must discuss the clinical importance of the upper bound, or else the results will not be interpretable by those unfamiliar with the measure. Thus, there may be limitations in the extent to which Clinician A can interpret the findings.

Clinician B's interest would focus on the lower bound of the confidence interval, which could not rule out a difference of 13% in favour of the placebo group. Thus for Clinician B, this study has no therapeutic implications. Furthermore, the study can rule out a difference of 25% in the pain index between the active treatments and the placebo, which suggests that it is efficient according to guidelines which have been proposed for assessing null results from effectiveness trials [1, 2, 5].

Over and above the actual results, let us consider in what other ways Clinicians A and B might differ in how they would assess issues related to the validity of a positive finding from this study. Even if Clinician A accepted that the upper bound of the confidence intervals excludes all clinically important treatment effects, he may still question whether the similarity of results between the study groups was an artifact of measurement rather than the treatments. This involves two issues. First, how responsive was the outcome measure? The pain intensity index had a maximum score of 6, indicating maximum pain, yet mean values for the total sample at the start of the study were less than 2, even though pain was an important motivator for many in the sample to seek treatment. The narrow range in the distribution of pain scores is an indication that the index may have lacked the necessary precision to detect small but clinically important treatment effects. A second concern is whether precision was further compromised in this study by a training effect associated with repeated applications of the index over a short period of time. During the course of the 3-week followup, the index was administered four times to each patient. This raises the question of whether repeated applications are associated with a systematic effect whereby subjects tend to rate their pain as being less severe.

Clinician B may recognize that many of Clinician A's concerns about the measurement of pain mean that the current study is not necessarily definitive. However, if the study had shown that the active treatments were more effective in reducing pain than the placebo, then the question of imprecision would not pose a threat to the validity of the findings: it would not increase the probability of rejecting the null hypothesis that the treatment groups were the same.

In terms of external validity, Clinician A may question what factors were involved in producing a 25% reduction in pain scores for the placebo group. Was it due to the natural history of non-ulcer dyspepsia or to factors involved in being part of an experiment? To the extent that this decrease in pain was due to the latter, it is unlikely that it could be replicated in practice on an ongoing basis. Prescribing a placebo routinely outside of an experiment is unlikely to be adopted as standard practice. In addition, the so-called "placebo effect" may encompass other factors related uniquely to the experimental setting (e.g. wanting to respond in a manner which would please the investigators and/or perceptions on the part of the patients that they were receiving better care). Therefore, from Clinician A's viewpoint, the validity of the study as an equivalence trial would have been strengthened if the design had included a control group who received no medication, placebo or otherwise. After all, this is what would occur in practice if the investigators' conclusions were accepted. In fact, had the study been clearly identified as an equivalence trial in the design stage, the no treatment group should have replaced the placebo group if a second control group was beyond the investigators' resources. This would have provided a more meaningful answer for Clinician A about what impact stopping routine prescribing would have on patient outcomes.

Clinician B would have had strong reservations about the study if the no treatment group had been substituted for the placebo group. Even in a management trial, his concern is not whether factors related to the intervention in the control group are reproducible in practice, but rather whether this group is an appropriate control on factors related to experimental error [7]. Under these circumstances, it would not be apparent from a study using the no treatment group as a control whether group differences were due to the active constituents of the drugs or to patients' perceptions about the drugs' efficacy in an experimental trial.
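Returning to the confidence intervals reported above, the short Python sketch below makes the implied decision rule concrete: conclude equivalence only if the whole interval lies within (-Δ, +Δ). The interval bounds are those reported by Nyren et al. [6]; the two candidate values of Δ are hypothetical benchmarks, since the investigators did not define a minimum effect.

```python
# Illustrative sketch: judging a reported confidence interval against a
# minimum effect Delta. The interval (-13, +23) is the antacid vs placebo
# result quoted above; the Delta values are hypothetical benchmarks.

def within_equivalence_bounds(ci_low: float, ci_high: float, delta: float) -> bool:
    """True if the entire confidence interval lies inside (-delta, +delta)."""
    return -delta < ci_low and ci_high < delta

ci_low, ci_high = -13.0, 23.0  # 95% CI, difference in pain index reduction (%)

for delta in (10.0, 25.0):  # hypothetical minimum effects
    ok = within_equivalence_bounds(ci_low, ci_high, delta)
    print(f"Delta = {delta:.0f}%: {'consistent with equivalence' if ok else 'equivalence not shown'}")

# With Delta = 10% the upper bound (+23%) exceeds Delta, so equivalence cannot
# be concluded; only a Delta as large as 25% is excluded by this interval.
```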

Table 1. Overview of critical appraisal guidelines for effectiveness trials*

Internal validity (Are the results likely to be true?): guidelines 1, 4 and 6. Applicability (Are the results likely to be useful?): guidelines 2, 3, 4 and 5.

1. Was the assignment of patients to treatments really randomized?
2. Were all clinically relevant outcomes reported?
3. Were the study patients recognizably similar to your own?
4. Were both clinical and statistical significance considered?
5. Is the therapeutic manoeuvre feasible in your practice?
6. Were all the patients who entered the study accounted for at its conclusion?

*These guidelines were proposed by Sackett et al. [5] and are currently in wide use for assessing whether a trial is both valid and useful for demonstrating that a therapy does more good than harm.

CRITICAL APPRAISAL GUIDELINES FOR THERAPEUTIC EQUIVALENCE TRIALS

As in the case of guidelines for reviewing effectiveness trials, those for therapeutic equivalence trials should deal with the issues of usefulness and validity by endeavouring to answer:

(1) How useful or applicable are the results for deciding between alternative therapies?
(2) Are results which show that the therapies are equivalent likely to be true?

Sackett and colleagues [5] propose only 6 guidelines for critical appraisal of effectiveness trials: 3 which apply to the usefulness of the study, 2 which apply to its validity, and 1 which applies to both (see Table 1). In the following discussion, I will propose 9 guidelines for equivalence trials: 5 which apply to the usefulness of the study and 4 which apply to its validity (see Table 2). The increase in the number of guidelines in Table 2 over Table 1 reflects the fact that all of the guidelines proposed for effectiveness trials apply in equal measure to therapeutic equivalence trials; there are, however, additional points to consider which are unique to the latter.

The discussion of these guidelines will be weighted unevenly. For the sake of completeness, I will only mention in passing those which are identical in terms of their application to effectiveness and equivalence trials; the rationales for these guidelines have been elucidated elsewhere and are generally well known. I will, however, discuss in detail those guidelines which are unique to equivalence trials or which differ in some important respect in their application to the two types of studies.

Table 2. Overview of critical appraisal guidelines for equivalence trials

Applicability (Are the results likely to be useful?): guidelines 1, 2, 5, 6 and 9. Internal validity (Are the results likely to be true?): guidelines 3, 4, 7 and 8.

1.* Is the standard therapy effective, and does the experimental therapy have an advantage which is generalizable?
2.* Can an acceptable minimum effect be defined in terms of the outcome measures?
3.* Can this effect be measured precisely?
4.† Was the assignment of patients to treatments really randomized?
5.‡ Was the timeframe for followup sufficient to conclude that the therapies are equivalent?
6.‡ Were factors which determined participation in the study identical to those for receiving the standard therapy?
7.‡ Is it probable that the groups differ by less than a minimum effect?
8.† Were all patients who entered the study accounted for at its conclusion?
9.‡ Were the treatments administered as they would be in practice?

*Applicable solely to equivalence trials.
†Identical to the one for effectiveness trials.
‡Similar but slightly altered from its counterpart for effectiveness trials.

1. Prior assumptions about the therapies

In most effectiveness trials, our starting assumption under the null hypothesis is that the study groups are the same in terms of outcomes. By comparison, the null hypothesis in equivalence trials usually states that the standard therapy is more effective than the experimental one(s) on at least one of the outcomes of interest. The purpose of the trial is to reject this null hypothesis and accept the alternative that the former does not perform better than the latter. Thus, a test of this null hypothesis only makes sense if we accept prior to the study that the standard therapy is effective (or that its use is sufficiently widespread so that it can be considered standard practice). If, however, the status of the standard therapy as a treatment of choice is in doubt, and there is no evidence from previous trials to indicate that it is effective, then the results of an equivalence trial are likely to be of dubious value.

The alternative hypothesis in an equivalence trial specifies that the standard treatment is no more effective than the experimental one. In most situations, this hypothesis will only be of interest if the experimental therapy has an important benefit over the standard which is generalizable to our target population. The first point to consider when assessing this supposed benefit is: Who accrues it: society, patients, drug companies or clinicians? The questions of whether these benefits are important, and if so, whether they are generalizable, can then be addressed.

For example, one of the first applications of a therapeutic equivalence trial in health care research compared whether using nurse practitioners in place of physicians as a first contact in a primary care setting impacted negatively on quality of care and/or patients' health status [8, 9]. The only practical reason for replacing physicians with nurse practitioners in a primary care setting was the assumption that they could provide the same care on a less costly basis. The question of who would accrue this benefit is anything but straightforward. In a system where there is an oversupply of primary care physicians, the benefits to society would have been of questionable value. This probably explains why the Ministry of Health in Ontario, Canada refused to fund the practice under the Ontario Health Insurance Plan except in areas of Northern Ontario where there is a shortage of physicians.

When the benefit accrues to the patient, it may only be generalizable for certain groups of patients. For example, the benefit used to promote many new drugs is that they improve compliance by reducing the number of times patients must medicate over a 24-hour period. If compliance with the drug is not a problem in a given group of patients, or the problem is unlikely to be resolved by the need to medicate less frequently (i.e. refusal to take the drug altogether rather than forgetfulness), then there will not be any benefit associated with the new drug for these patients. Consequently, the results of the study should not have therapeutic consequences for them.

2. Assessing minimum effects

The next guideline asks whether an acceptable minimum effect is, or can be, defined in terms of the outcome(s) considered in the study. This guideline also pertains to how useful the study is likely to be for treatment decisions rather than to validity, and involves two issues. First, were all relevant outcomes related to the meaning of therapeutic equivalence considered in the study? Second, was the size of the minimum effect a useful benchmark for determining therapeutic equivalence?

With respect to the first question, the definition of a minimum effect should be based on those outcomes where there is a basis for hypothesizing that the standard therapy may be of greater benefit than the experimental one(s). It would be inappropriate to restrict the definition of a minimum effect to outcomes which are likely to remain unaffected or may even improve when the experimental therapy is used in place of the standard. Consider a study by Topol et al. [10] comparing an experimental program of early release from hospital after 3 days vs conventional stay (7-10 days) for patients with a diagnosis of uncomplicated myocardial infarction (MI). The primary outcome measure referred to in the study was the average rate at which patients returned to work or their previous level of function. To the extent, however, that there is a difference in favour of conventional stay, it is more likely to concern differences in the timeframe for diagnosing and managing unexpected complications, and consequently the possibility of less favourable outcomes when these complications occur. If the definition of a minimum effect had been based on the management of unexpected complications, then it would have been apparent that a sample size of 40 patients in each of the two study groups was much too small to determine whether the two regimens of care were therapeutically equivalent.
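The sample-size point can be made concrete with a rough calculation. The sketch below uses a standard normal-approximation formula for an equivalence comparison of two proportions, n per group = (z(1-α) + z(1-β))² · 2p(1-p) / Δ², in the spirit of Blackwelder [4]. The complication rate p and minimum effect Δ are hypothetical values chosen only to show the order of magnitude.

```python
# A rough sample-size sketch for an equivalence comparison on a binary
# outcome, using a normal approximation. All inputs are hypothetical.

from math import ceil
from statistics import NormalDist

def n_per_group(p: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate patients per group needed to show equivalence within delta."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided test
    z_beta = NormalDist().inv_cdf(power)
    return ceil((z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / delta ** 2)

# Suppose 10% of conventionally managed patients have a poorly managed
# complication, and a 5-percentage-point excess would matter clinically.
print(n_per_group(p=0.10, delta=0.05))  # about 446 per group, not 40
```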

Given that the appropriate outcome(s) are considered, the next point to assess is whether the definition of a minimum effect is appropriate in terms of effect size. In order to interpret the results of an equivalence trial, we have to be able to operationally define the size of a minimum effect (Δ) for use in the null hypothesis. Under these circumstances, Δ refers to the smallest difference on the outcome measure which is considered to be of clinical or practical importance. The value of Δ must be acceptable as a point of demarcation between the null state, that there is a meaningful difference between the groups, and the alternative, that the groups can be considered the same. If Δ is set too high, then the probability of an invalid result will increase. If it is set too low, then the probability of an inefficient result will increase.

Perhaps one of the most contentious issues in therapeutic equivalence trials is how to select the size of a minimum effect in the absence of clinically relevant benchmarks. This problem is particularly inherent in, though not restricted to, the application of quality of life measures as outcomes in health care research, such as wellbeing, functional status, emotional function, etc. These outcomes are usually measured with indices which produce quantitative measures of qualitative health states. Consequently, these measures are usually not relevant in and of themselves, but only as they relate to one another.

One approach for defining minimum effects is to use the standard deviation of the observed measures to determine an appropriate value for Δ. Cohen [11] presents formulae for calculating the standardized values of small, medium and large effect sizes for various inferential tests. While Cohen's discussion is intended for power analysis, the definitions of small effect sizes can also be used in equivalence tests in place of a minimum effect. Use of this method in the absence of clinically relevant benchmarks may provide a number of advantages. First, if we can establish that two groups do not differ by more than a conventional definition of a small effect size, then this generally implies that there is very little of the distribution in one group which is not overlapped by that of the other. Second, this method makes it less likely that Δ will be defined in a post hoc fashion after eyeballing the data. Third, the use of less precise or unresponsive measures will not necessarily increase the probability that Δ will be rejected. Finally, use of this approach may make it easier to compare results between studies.
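As a minimal illustration of this approach, the sketch below sets Δ to Cohen's conventional small standardized effect (d = 0.2) multiplied by the pooled standard deviation. The sample scores are invented purely for illustration; only the d = 0.2 convention comes from Cohen [11].

```python
# Defining the minimum effect Delta from the pooled standard deviation,
# using Cohen's conventional small effect size d = 0.2. Data are invented.

from statistics import stdev

group_a = [1.8, 2.1, 1.5, 2.4, 1.9, 2.0, 1.7]  # hypothetical pain scores
group_b = [1.6, 2.2, 1.8, 2.3, 1.7, 2.1, 1.9]

def pooled_sd(x, y):
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return pooled_var ** 0.5

delta = 0.2 * pooled_sd(group_a, group_b)  # small effect size as the minimum effect
print(f"Delta = {delta:.3f} scale units")
```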


Therefore, Cohen's or some other benchmark of a small effect size should be kept in mind when minimum effects derive all or part of their meaning from artifacts related to the measurement of a sample rather than from generally accepted clinical criteria. Examples of the former would include:

- a percentage difference in the proportions having a given outcome;
- a numerical difference which may or may not be clinically relevant (e.g. differences in IQ scores, emotional function index).

Examples of the latter would include:

- a meaningful definition of a change in morbidity;
- a numerical difference in a clinical measure generally accepted as being clinically insignificant in terms of morbidity (e.g. a change of 3 points in diastolic blood pressure, a difference in weight of 3 lb).

Threshold effects can be used in place of minimum effects. Threshold effects refer to the point at which a difference in outcome in favour of the standard therapy is considered to be of equal value to the relative benefits of replacing it with the experimental therapy. Use of threshold effects in the null hypothesis gives an equivalence trial maximum efficiency. However, there are often a number of operational problems associated with using them. First, it may be difficult to perform an economic analysis which can establish parity between administrative benefits and outcomes. Second, we often do not have sufficient information prior to a study to quantify the relative benefits of using the experimental therapy in place of the standard, and thus to base sample size calculations and a priori hypotheses on threshold effects rather than minimum effects. Finally, the comparative benefits of administering the experimental therapy over the standard may be subject to greater variation from sample to sample than treatment effects; a threshold between these benefits and a difference in outcome in favour of the standard therapy may therefore not be generalizable. For example, the size of the benefit in many American studies involving less costly care will not necessarily be generalizable to Canada, where per capita expenditures on health care are roughly 66% of those in the U.S. Therefore, to the extent that the size of this benefit affects the interpretation of results, the conclusion from an American study will not necessarily be generalizable to Canada.

3. Assessing precision

Having defined a minimum or threshold effect which is useful, the next issue to consider is whether it can be measured precisely. This guideline deals with validity. Measurement error refers to the difference between the true values and the observed values of the sampling units. A distinction is generally made between measurement error which is random vs systematic. Although imprecision is often defined as being synonymous with random error [12, 13], the most frequent use of this term is to describe a structural or functional limitation of a measuring instrument in which a range of possible values is collapsed into a single category on the scale of measurement (structural), and/or the observed distribution does not adequately reflect variation between subjects on the variable being measured (functional). Under these circumstances, imprecision represents a special form of systematic error which always decreases the estimate of group differences. Loss of discriminatory power when measuring cross-sectional differences, lack of responsiveness when measuring change, and random misclassification error in a 2 x 2 table are all forms of imprecision.

It is important to point out that a measure can be both valid and reliable without being precise. Measures on a dichotomous classification of health (e.g. good health or poor health, improved or not improved) may be more reproducible but less precise than those involving a set of three or more ordinal classifications denoting a continuum of health. The importance of precision for inferential tests of equivalence is that as measures become more imprecise, the estimate of the true difference between the study groups will always decrease. In an effectiveness trial this will lead to inefficiency; in an equivalence trial it will increase the probability of an invalid result.
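The claim that imprecision always shrinks the estimated difference can be verified with a small calculation. For a dichotomous outcome, if each subject's classification is flipped at random with probability m (nondifferential misclassification), the expected observed rate is p' = p(1-m) + (1-p)m, so the observed group difference shrinks by the factor (1-2m). The rates below are hypothetical.

```python
# Numerical illustration: random misclassification of a dichotomous outcome
# attenuates the observed group difference toward zero, biasing an
# equivalence trial toward an invalid result. Rates are hypothetical.

p_standard, p_experimental = 0.60, 0.45  # true improvement rates

for m in (0.00, 0.10, 0.20):  # misclassification probabilities
    obs_s = p_standard * (1 - m) + (1 - p_standard) * m
    obs_e = p_experimental * (1 - m) + (1 - p_experimental) * m
    print(f"m = {m:.2f}: observed difference = {obs_s - obs_e:.3f}")

# Output: 0.150, 0.120, 0.090 -- the true 15-point difference attenuates
# by the factor (1 - 2m) as imprecision grows.
```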

There are two requirements for assessing precision. The first pertains to whether the scale of measurement is adequate for detecting treatment differences. A common practice when carrying out effectiveness trials in health care research is to measure group differences in terms of a relatively small number of qualitative categories, to facilitate the measurement process as well as the interpretation of results [10]. Collapsing over categories is a questionable practice, however, when the null hypothesis states that the groups are different rather than the same. If variation within the measurement categories obscures qualitatively important treatment effects, then the validity of the findings will be undermined. In the Burlington Randomized Control Trial, a dichotomous outcome denoting adequate and inadequate care was used to measure whether nurse practitioners provided health care comparable to that of physicians [8]. Clearly, there must have been a range in the quality of care experienced by patients within each of these two categories.

An assessment of precision must also consider whether the observed distribution adequately reflects expected differences between the subjects on the variable being measured. For example, in the study by Nyren et al. [6] on non-ulcer dyspepsia, it was pointed out that the distribution of pain scores was much more limited than what we would expect to find for the patient population in question, raising the possibility that the index may not have been responsive to all clinically important treatment differences. When the variation within treatment groups decreases, the effect size associated with rejecting a fixed value of Δ will increase. Therefore, the validity of an equivalence trial may be in doubt when the within group variation is less than expected, and/or the value of Δ seems inappropriately large in relation to the standard deviation of the observations.

4. Randomization

The assessment of validity must also consider whether the assignment of patients to the treatments was really randomized. The application of this guideline to equivalence trials is identical to that for effectiveness trials.

5. Timeframe for followup

The usefulness of an equivalence trial will also depend upon whether the timeframe for patient followup was sufficient to conclude that the therapies are therapeutically equivalent. In an effectiveness trial, when a new intervention has a therapeutic benefit which is subsequently shown to be of limited duration, this does not necessarily impinge on the original finding that the therapy does more good than harm, although it may not turn out to be cost effective. The same does not hold true for an equivalence trial.

If the conclusion that the study groups are therapeutically equivalent is only true for a limited period of time, then the original finding is likely to be flawed. A minimum requirement for equivalence trials is that the timeframe for following patients should equal (or exceed) that used to establish the efficacy of the standard therapy in previous studies. For example, balloon aortic valvuloplasty has been advocated as an alternative to cardiac surgery for elderly patients with aortic stenosis. During the initial post intervention phase, outcomes for the former treatment compare favourably to the latter due to the risk of surgical mortality in this patient group. Recent reports from trials which have followed up patients for periods of time comparable to those used to establish the efficacy of surgery have cast considerable doubt on the initial findings, since long-term survival rates in elderly patients undergoing valvuloplasty do not appear to be as good as those for elderly surgical patients [15].

6. Selection criteria

The next guideline is whether the sample is representative of the population for whom replacement of the standard with the experimental therapy is being considered. In theory, this issue is the same for effectiveness and equivalence trials, in that we carry out statistical tests based on the assumption that the sample prior to randomization is representative of a given population. In practice, however, we rarely have random samples from the population of all eligible patients. As a result, the issue of generalizability for both types of studies often focuses on whether the sample is similar to a group of patients in terms of clinical profiles and/or how and from where they were selected. When a study demonstrates that a therapy does more good than harm, application of the results to a group of patients is usually based on whether they are similar to the sample in terms of clinical profiles, especially prognosis (i.e. the probability that they will develop the outcome). The critical appraisal guideline which deals with this issue asks, "Were the study patients recognizably similar to your own?" [5].

It is much more difficult to use this approach in an equivalence trial. The reason for this is that we are usually interested in very small treatment differences between two active therapies. When generalizing these results to a similar group of patients, we must make the assumption that they are nearly identical to the sample in terms of how they would respond to the standard therapy.

The corresponding issue in a placebo controlled trial is whether a group of patients is comparable in terms of their risk of developing the outcome. The former is much more difficult to determine than the latter, since we rarely have precise knowledge of why the standard therapy works and exactly whom it will work for. The most important factors for assessing whether the results of an equivalence trial are generalizable have to do with the way in which patients were selected to take part in the study. There are already likely to be criteria in place for selecting patients to receive the standard therapy. If this is the population of interest, then it is essential that the factors which determine participation in the study conform to these criteria.

This discussion is not meant to imply that in assessing the generalizability of a study we should concentrate on one of the above factors to the exclusion of the other. Rather, it is meant to suggest that one of these factors may be more important than the other depending upon whether we are dealing with the issue of equivalence or a difference in treatment effects.

7. Statistical significance

The guideline dealing with statistical significance asks: "Is it probable that the groups differ by less than a minimum effect?" This question deals with validity and is answered by testing the following null hypothesis, H0: δ ≥ Δ (where δ = the true difference and Δ = a minimum effect). A number of inferential tests of equivalence have been proposed for different sampling distributions [4, 9, 16, 17].

In an effectiveness trial it is important to assess issues of clinical as well as statistical significance. However, this is not necessary for tests of equivalence: given that an acceptable definition of Δ is specified in the null hypothesis, there will never be a divergence between clinical and statistical significance.
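As a concrete sketch of such a test for two proportions, the code below computes a one-sided p-value for H0: p_s - p_e ≥ Δ using a normal approximation, in the spirit of Blackwelder [4]. All counts and the value of Δ are hypothetical.

```python
# Sketch of an inferential test of the equivalence null hypothesis
# H0: (success rate on standard) - (success rate on experimental) >= Delta,
# using a normal approximation. All inputs are hypothetical.

from statistics import NormalDist

def equivalence_p_value(x_s: int, n_s: int, x_e: int, n_e: int, delta: float) -> float:
    """One-sided p-value for H0: p_s - p_e >= delta."""
    p_s, p_e = x_s / n_s, x_e / n_e
    se = (p_s * (1 - p_s) / n_s + p_e * (1 - p_e) / n_e) ** 0.5
    z = (p_s - p_e - delta) / se  # how far the observed difference sits below Delta
    return NormalDist().cdf(z)   # small p-value -> reject H0, conclude equivalence

# Hypothetical: 150/200 successes on the standard, 144/200 on the experimental,
# with a minimum effect of 10 percentage points.
print(f"p = {equivalence_p_value(150, 200, 144, 200, delta=0.10):.3f}")  # p = 0.056
```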

Over the past decade, many commentaries have recommended the use of confidence intervals over inferential tests when summarizing the results from clinical trials [18-20]. With respect to tests of equivalence in particular, use of confidence intervals has been advocated as the preferred, and according to some authors, the only method of assessing equivalence in controlled trials [21]. In view of this, it may be worthwhile to examine under what circumstances, if any, inferential tests should take precedence over confidence intervals when assessing equivalence. In many circumstances, this debate may be somewhat academic, since investigators will frequently perform both types of analyses for the same study.

The major advantage of an inferential test is that it removes any ambiguity pertaining to the findings by forcing the investigator to define the meaning of equivalence and to formally test for it. When it is possible to derive an acceptable definition of Δ, this approach provides a precise probability statement regarding the likelihood of the study's results under a null hypothesis which is clinically relevant. It also enables the reviewer to assess whether the study had the necessary precision to detect a given value of Δ under the null hypothesis, by considering whether Δ is inappropriately large in relation to the standard deviation of the observations.

The disadvantage of an inferential test is that a minimum effect size cannot always be defined according to rigorous criteria. At the same time, readers may not be in a position to judge the appropriateness of these criteria. The definition of a threshold effect may also change as more becomes known about the administrative benefits of the experimental therapy in different populations. Under these circumstances, it is important for readers to know the range of values for estimating the true difference between the study groups, rather than merely whether this range excludes an arbitrary value of Δ.

The major advantage of confidence intervals is that an investigator does not necessarily have to interpret the results of the study in terms of a given value for Δ. Although Δ must be specified in the design stage of a trial to determine sample size, by using confidence intervals in the analysis stage an investigator can let readers decide for themselves whether the groups are equivalent. Presumably this would mean assessing whether the upper bound of a confidence interval was less than an important clinical benchmark for a minimum effect size.

Confidence intervals will not be useful when the scale of measurement associated with an outcome measure is not well understood or generalizable. For example, the proliferation of indices to measure concepts such as quality of life, emotional function, depression, etc. clearly puts the onus on the investigator to establish and subsequently test relevant benchmarks.

In a hypothesis testing situation, the reader will at least be able to determine whether a difference between the study groups is statistically significant, although this may be of limited use if the clinical relevance of this difference remains undefined. Use of confidence intervals alone has the added disadvantage that readers will not be able to draw any conclusion from the study if they cannot interpret the clinical significance of the upper bound. (This point was illustrated previously by the problems that Clinician A encountered in trying to determine whether the upper bound for the difference in pain scores excluded a minimum effect in the trial on non-ulcer dyspepsia.) By comparison, demonstrating that the lower bound of a confidence interval in an effectiveness trial is removed from 0 at least permits readers to infer that the groups are different, even though the relevance of this difference may also elude them if the investigator does not discuss the clinical significance of the results.

8. Accounting for all patients

One of the guidelines for assessing the validity of an effectiveness trial asks: "Were all patients who entered the study accounted for at its conclusion?" The application of this guideline to equivalence trials is identical to that in effectiveness trials.

9. Applicability of the interventions to practice

A study will only be useful if the treatments were carried out as they would be in practice. This involves similar issues to those covered by the guideline for effectiveness trials which asks: "Is the therapeutic manoeuvre feasible in your practice?" When carrying out trials in health care research, we stipulate that protocols for administering the interventions should conform to those which are likely to be implemented in practice. The demands of the experimental situation, however, usually require that changes be made to the setting and/or conditions under which subjects receive these interventions. In many cases, unbiased treatment effects may only be observable under artificially clean conditions.

As in the case of effectiveness trials, equivalence trials must be free of contamination of the experimental manoeuvre and co-intervention. It is also essential, however, that the results are not unduly affected by factors related uniquely to the experimental situation which may apply in equal measure to both groups and, in the process, obscure small but important treatment differences. These factors may include any of the following:

(1) Selection procedures which differ from those that would usually be carried out to determine who should receive a given intervention.
(2) Differences in patient management, ancillary services and/or treatment from what is generally provided in practice.
(3) Changes in compliance and/or cooperation on the part of patients related to the experimental situation.
(4) Placebo effects due to being part of a study, especially where outcomes are based in whole or in part on patient perceptions.
(5) Training effects associated with administering the outcome measure on multiple occasions over the course of the study.

When outcomes may be influenced by these factors, they should be validated by a post hoc analysis. One approach for carrying out this type of analysis is to assess whether the results for the standard therapy are consistent with those reported in previous effectiveness trials. Recently, Makuch and Johnson [22] provided an excellent example of the application of this guideline to a therapeutic equivalence trial. This type of post hoc analysis, however, is not always possible, since in many situations standard practice has evolved on the basis of clinical judgment which has not been tested through clinical trials. When clinical trials have been carried out, it may be difficult to ascertain the degree of comparability between samples. Furthermore, as mentioned previously, the issue of equivalence in health care research may focus on different outcome measures than the question of whether a therapy does more good than harm.

An alternative approach for carrying out a post hoc analysis is through construct validation procedures. A finding that the study groups are equivalent will be more credible when it can also be shown that the outcome measure appropriately differentiates between subgroups of patients with different prognoses.

REFERENCES

1. Freiman JA, Chalmers TC et al. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial. N Engl J Med 1978; 299: 690-694.
2. Detsky AS, Sackett DL. When was a "negative" clinical trial big enough? Arch Intern Med 1985; 146: 861-862.
3. Ciampi A, Till JE. Null results in clinical trials: the need for a decision-theory approach. Br J Cancer 1980; 41: 618-629.
4. Blackwelder WC. "Proving the null hypothesis" in clinical trials. Contr Clin Trials 1982; 5: 97-105.
5. Sackett DL et al. How to read clinical journals: V. To distinguish useful from useless or even harmful therapy. Can Med Assoc J 1981; 124: 1156-1161.
6. Nyren O et al. Absence of therapeutic benefit from antacids or cimetidine in non-ulcer dyspepsia. N Engl J Med 1986; 314: 339-343.
7. Wickramaratne PJ, Holford TR. Confounding in epidemiologic studies: the adequacy of the control group as a measure of confounding. Biometrics 1987; 43: 751-765.
8. Spitzer WO, Sackett DL et al. The Burlington Randomized Trial of the nurse practitioner. N Engl J Med 1974; 290: 251-256.
9. Dunnett CW, Gent M. Significance testing to establish equivalence between treatments, with special reference to data in the form of 2 x 2 tables. Biometrics 1977; 33: 593-602.
10. Topol EJ et al. A randomized controlled trial of hospital discharge three days after myocardial infarction in the era of reperfusion. N Engl J Med 1988; 318: 1083-1088.
11. Cohen J. Statistical Power Analysis for the Behavioral Sciences. New York: Academic Press; 1977.
12. Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiologic Research: Principles and Quantitative Methods. London: Wadsworth; 1982: Section 10.2.
13. Fletcher RH, Fletcher SW, Wagner EH. Clinical Epidemiology: The Essentials. Baltimore, MD: Williams and Wilkins; 1982: 21.
14. Kimball AW. Measures of and tests for net improvement in clinical trials. Contr Clin Trials 1988; 9: 6-10.
15. Block PC. Aortic valvuloplasty: a valid alternative? N Engl J Med 1988; 319: 169-171.
16. Rodda BD, Davis RL. Determining the probability of an important difference in bioavailability. Clin Pharmacol Ther 1980; 28: 247-252.
17. Patel HI, Gupta GD. A problem of equivalence in clinical trials. Biometrical J 1984; 26: 471-474.
18. Rothman KJ. A show of confidence. N Engl J Med 1978; 299: 1362-1363.
19. Wonnacott T. Statistically significant (Letter). Can Med Assoc J 1985; 133: 843.
20. Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. Br Med J 1985; 292: 746-750.
21. Makuch RW, Johnson M. Some issues in the design and interpretation of "negative" clinical studies. Arch Intern Med 1986; 146: 986-989.
22. Makuch R, Johnson M. Issues in planning active control equivalent studies. J Clin Epidemiol 1989; 42: 503-511.
