This article was downloaded by: [University Library Utrecht] On: 13 March 2015, At: 19:34 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Journal of Biopharmaceutical Statistics Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/lbps20

A Regulatory Perspective on Essential Considerations in Design and Analysis of Subgroups When Correctly Classified

Sue-Jane Wang (a) & H. M. James Hung (b)

(a) Office of Biostatistics, OTS/CDER, Food and Drug Administration, Silver Spring, Maryland, USA
(b) Division of Biometrics I, OB/OTS/CDER, Food and Drug Administration, Silver Spring, Maryland, USA

Published online: 06 Jan 2014.

To cite this article: Sue-Jane Wang & H. M. James Hung (2014) A Regulatory Perspective on Essential Considerations in Design and Analysis of Subgroups When Correctly Classified, Journal of Biopharmaceutical Statistics, 24:1, 19-41, DOI: 10.1080/10543406.2013.856022

To link to this article: http://dx.doi.org/10.1080/10543406.2013.856022


Journal of Biopharmaceutical Statistics, 24: 19–41, 2014 Copyright © Taylor & Francis Group, LLC ISSN: 1054-3406 print/1520-5711 online DOI: 10.1080/10543406.2013.856022

A REGULATORY PERSPECTIVE ON ESSENTIAL CONSIDERATIONS IN DESIGN AND ANALYSIS OF SUBGROUPS WHEN CORRECTLY CLASSIFIED

Sue-Jane Wang (1) and H. M. James Hung (2)

(1) Office of Biostatistics, OTS/CDER, Food and Drug Administration, Silver Spring, Maryland, USA
(2) Division of Biometrics I, OB/OTS/CDER, Food and Drug Administration, Silver Spring, Maryland, USA

This regulatory research provides possible approaches for improvement to conventional subgroup analysis in a fixed design setting. The interaction-to-overall effects ratio is recommended in the planning stage for potential predictors whose prevalence is at most 50% and its observed ratio is recommended in the analysis stage for proper subgroup interpretation if sample size is only planned to target the overall effect size. We illustrate using regulatory examples and underscore the importance of striving for balance between safety and efficacy when considering a regulatory recommendation of a label restricted to a subgroup. A set of decision rules gives guidance for rigorous subgroup-specific conclusions.

Key Words: Biomarker; Formation of subgroup; Interaction-to-overall effects ratio; Predictive of treatment effect; Subgroup-specific treatment effect hypothesis.

Received August 9, 2013; Accepted August 28, 2013. This article is not subject to US copyright law. Address correspondence to Sue-Jane Wang, PhD, Office of Biostatistics, OTS/CDER, FDA, 10903 New Hampshire Ave, Bldg 21, HFD-710, Mail Stop #3526, Silver Spring, MD 20993-0002, USA; E-mail: [email protected]. Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/lbps.

1. INTRODUCTION

On July 9, 2012, President Obama signed into law the bipartisan Food and Drug Administration Safety and Innovation Act (FDASIA), including a provision (Section 907) that will require the Food and Drug Administration (FDA) to report annually the extent to which applications submitted to the Agency include data on demographic subgroups, including sex, age, race, and ethnicity (“President signs,” 2012). In fact, following the release of the National Institutes of Health (NIH) Revitalization Act (effective March 9, 1994) and extending from NIH-supported clinical research, the guidelines for implementation further state that Phase 3 clinical trials must be designed to allow separate planning, conducting, and reporting of analyses of these groups when prior research has indicated that it may be important, and that preliminary trials must provide enough information to inform the design of subsequent Phase 3 trials (National Institutes of Health, 1994). The reporting of findings presented by sex, age, and race/ethnicity from randomized clinical trials has become a routine practice. These reports, submitted through new drug applications (NDAs) or biological license applications (BLAs) aiming for drug licensure, are also routinely reviewed by regulatory statistical scientists. More recently, routine subgroup analysis has been extended informally to geographical regions in multiregional clinical trials that may be predefined or defined via a post hoc grouping of regions (e.g., Wedel et al., 2001; Brilinta label, 2013).

In this regulatory research, we consider traditional subgroup analysis and the strategy of planning the trial on a subgroup-specific hypothesis and formally testing this hypothesis in a fixed-design all-comer controlled trial. Specifically, the subgroups are assumed to be correctly specified and the controlled trial includes those trials planned for Phase 2 and Phase 3 in a drug development program. In section 2, we show the range of probability of observing directionally or statistically consistent versus inconsistent treatment effects between two mutually exclusive subgroups. Assuming no missing data, an example is a subgroup of patients that are classified as high risk versus all others (or versus those classified as non-high risk). In section 3, we derive the interaction-to-overall effects ratio to articulate the likelihood of a baseline covariate that may be predictive of treatment effect in the positive subgroup without formal hypothesis testing performed in the negative subgroup. In sections 4 and 5, we give insights into the formation of a subgroup and prespecification of the subgroup-specific hypothesis in study designs for investigation and lay down a set of decision rules for labeling recommendation applicable to a patient population of all comers or a subgroup only.
In section 6, we highlight a few regulatory experiences with labeling for a subgroup only. Discussion follows in section 7.

2. CONVENTIONAL SUBGROUP ANALYSIS

Traditionally, confirmatory studies target an intent-to-treat (ITT) population of patients in which the effect of a study treatment is expected to have sufficient homogeneity in order to permit precise estimation of the treatment effect and for the effect to be interpretable (International Conference on Harmonization [ICH], 1998a).

2.1. Characteristics of Baseline Factors Relevant to Formation of Meaningful Patient Subgroups

Under the conventional framework, subgroups may be prespecified, for example, by age, gender, race/ethnicity, and geographical region (Hung et al., 2010; Wang and Hung, 2012). The subgroup analysis is conducted primarily to investigate whether the treatment effect is consistent or inconsistent between the mutually exclusive subgroups defined by the subgroup factor or among multiple subgroups. Thus, during the planning of a confirmatory trial, it is recommended to prespecify the baseline covariates and intrinsic/extrinsic factors that are suspected to have an important influence on the primary efficacy endpoint because these factors may be used to define subgroups. In addition, there is a need to consider how to account for these covariates in the primary analysis in order to improve precision and to mitigate any possible adverse impact that may be attributed to random imbalance between treatment groups.

Depending on the treatment indication and the study population, some of the baseline characteristics may be prognostic of disease state or predictive (or prognostic-predictive) of treatment effects in a randomized controlled trial (Wang et al., 2007; Wang, 2007). Prognostic factors are generally used to identify patients at relatively higher versus lower likelihood of having a disease-related event, or of having a substantial versus lesser worsening in outcome condition, independent of treatment intervention. When a linear regression model includes the prognostic factors for statistical analysis, these factors have little influence on the accuracy of the treatment effect estimate but can improve the precision of the estimate. In contrast, predictive factors are baseline characteristics that can be used to identify a subset of patients who are more likely to respond to treatment than other patients with the condition being treated (Food and Drug Administration, 2012). In general, predictive factors include a patient's physiology, disease pathology, or characteristics that are related in some manner to the mechanism of the study drug; usually, they are intrinsic factors described in ICH E-5 (1998b). Treatment effect may vary with subgroups defined by predictive or prognostic-predictive factors (Wang et al., 2007; Wang, 2007).

In the conventional paradigm, investigation of a factor that may be predictive of treatment effect or a subgroup that is of particular interest a priori is generally pursued via either a prespecified subgroup analysis or a statistical model including interactions as part of the planned confirmatory analysis (ICH, 1998a).
For the purpose of this article, we refer to the two mutually exclusive subgroups as the positive subgroup versus the negative subgroup or its complementary subgroup. When the interaction term is simply added to the main model without adequate sample size planning for the interaction test, the probability that neither the positive subgroup nor its complementary subgroup achieves statistical significance is not small, as we show in section 2.3.

2.2. Probability of Observing Directionally Consistent Versus Inconsistent Treatment Effects Between Mutually Exclusive Subgroups

Under the conventional subgroup analysis paradigm, treatment effects are expected to be homogeneous across subgroups. The hypotheses of interest are

$$H_0: \Delta = 0 \quad \text{vs.} \quad H_1: \Delta \neq 0 \tag{1}$$

where $\Delta$ is the size of the overall treatment effect. Consequently, subgroup analyses are mainly performed to explore the extent of inconsistency in treatment effects across subgroups. Denote by $\Delta_+$ the effect size in one subgroup (call it the positive subgroup) defined by a baseline covariate, and by $\Delta_-$ the effect size in the complementary subgroup (call it the negative subgroup). Let $\theta$ be the interaction effect between the two subgroups, that is, $\theta = \Delta_+ - \Delta_-$. The ICH E9 (ICH, 1998a) advocates that a treatment-by-subgroup interaction test for

$$K_0: \theta = 0 \quad \text{vs.} \quad K_1: \theta \neq 0 \tag{2}$$

should be performed first, before conducting a subgroup-specific test of treatment effect, for example,

$$H_{0+}: \Delta_+ = 0 \quad \text{vs.} \quad H_{1+}: \Delta_+ \neq 0 \tag{3}$$

with the intent of declaring a significant treatment effect in only one of the mutually exclusive subgroups. In the following, we introduce the probability of observing directionally consistent versus inconsistent treatment effects and statistically consistent versus inconsistent treatment effects between mutually exclusive subgroups. Then we discuss the implication of testing the treatment-by-subgroup interaction.

To simplify the argument, suppose that sample sizes are equal in the treated arm and the control arm within each subgroup. At the planning stage, when a homogeneous treatment effect ($\Delta$) is anticipated, the study can be designed to detect the common treatment effect $\Delta = \delta$ with $(1 - \beta)$ power at a one-sided significance level $\alpha/2$. Suppose $\hat{\Delta}_+$ and $\hat{\Delta}_-$ are the observed standardized effect size estimates with $n_+$ per arm in the positive subgroup and with $n_-$ per arm in the negative subgroup, respectively. Let $n$ be the total sample size per arm. Then $n = 2(z_{\alpha/2} + z_\beta)^2/\delta^2$, where $z_\nu$ is the $100(1 - \nu)$th percentile of the standard normal distribution. Let $\pi_+ = n_+/n$ be the proportion of patients in the positive subgroup. Similarly, $\pi_- = n_-/n$ in the negative subgroup.

Since the two subgroups are mutually exclusive, we can calculate the probability that subgroup-specific treatment effects are directionally consistent or inconsistent and the probability that subgroup-specific treatment effects are statistically significantly consistent or inconsistent. “Consistent” means that both subgroups show beneficial effects either directionally or statistically, or both subgroups show statistically nonsignificant treatment effects, whereas “inconsistent” means that one subgroup yields a beneficial treatment effect but the other subgroup does not, again either directionally or statistically. Technical details and insights on statistically significantly consistent versus inconsistent probability are provided in section 2.3.
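Note that

$$\Pr(\hat{\Delta}_+ > 0 \mid \Delta = \delta) = \Pr\left(Z_+ > -\sqrt{n\pi_+/2}\,\delta\right) = \Phi\left(\sqrt{\pi_+}\,(z_{\alpha/2} + z_\beta)\right) \tag{4}$$

$$\Pr(\hat{\Delta}_- > 0 \mid \Delta = \delta) = \Pr\left(Z_- > -\sqrt{n\pi_-/2}\,\delta\right) = \Phi\left(\sqrt{\pi_-}\,(z_{\alpha/2} + z_\beta)\right) \tag{5}$$

where $Z_+$ and $Z_-$ are standard normal variates and $\Phi$ is the standard normal distribution function. Denote by prpp the probability that numerically beneficial treatment effects in both subgroups are observed. Then the four probabilities of observing directionally consistent versus inconsistent subgroup effects are

$$\text{prpp} = \Pr(\hat{\Delta}_+ > 0 \text{ and } \hat{\Delta}_- > 0 \mid \Delta = \delta) = \Phi\left(\sqrt{\pi_+}\,(z_{\alpha/2} + z_\beta)\right) \times \Phi\left(\sqrt{\pi_-}\,(z_{\alpha/2} + z_\beta)\right) \tag{6}$$

$$\text{prnn} = \Pr(\hat{\Delta}_+ \le 0 \text{ and } \hat{\Delta}_- \le 0 \mid \Delta = \delta) = \left[1 - \Phi\left(\sqrt{\pi_+}\,(z_{\alpha/2} + z_\beta)\right)\right] \times \left[1 - \Phi\left(\sqrt{\pi_-}\,(z_{\alpha/2} + z_\beta)\right)\right] \tag{7}$$

$$\text{prpn} = \Pr(\hat{\Delta}_+ > 0 \text{ and } \hat{\Delta}_- \le 0 \mid \Delta = \delta) = \Phi\left(\sqrt{\pi_+}\,(z_{\alpha/2} + z_\beta)\right) \times \left[1 - \Phi\left(\sqrt{\pi_-}\,(z_{\alpha/2} + z_\beta)\right)\right] \tag{8}$$

$$\text{prnp} = \Pr(\hat{\Delta}_+ \le 0 \text{ and } \hat{\Delta}_- > 0 \mid \Delta = \delta) = \left[1 - \Phi\left(\sqrt{\pi_+}\,(z_{\alpha/2} + z_\beta)\right)\right] \times \Phi\left(\sqrt{\pi_-}\,(z_{\alpha/2} + z_\beta)\right) \tag{9}$$

Figure 1 Probability of observing directionally consistent vs. inconsistent subgroup effects given common effect size.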
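The four directional probabilities in equations (6)–(9) are simple to evaluate numerically. The following is a minimal Python sketch, not the authors' code: the function name `directional_probabilities` and its argument names are hypothetical, and the defaults (80% power, one-sided level 0.025) are assumptions chosen to match the planning scenario discussed in this section.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def directional_probabilities(prev_pos, power=0.80, one_sided_alpha=0.025):
    """Evaluate equations (6)-(9) for a positive-subgroup proportion prev_pos."""
    if not 0.0 < prev_pos < 1.0:
        raise ValueError("prev_pos must lie strictly between 0 and 1")
    # z_{alpha/2} + z_beta: the standardized effect the full trial is sized for
    z_sum = STD_NORMAL.inv_cdf(1.0 - one_sided_alpha) + STD_NORMAL.inv_cdf(power)
    # Equations (4) and (5): probability that each subgroup estimate is > 0
    pos_beneficial = STD_NORMAL.cdf(prev_pos ** 0.5 * z_sum)
    neg_beneficial = STD_NORMAL.cdf((1.0 - prev_pos) ** 0.5 * z_sum)
    return {
        "prpp": pos_beneficial * neg_beneficial,                  # equation (6)
        "prnn": (1.0 - pos_beneficial) * (1.0 - neg_beneficial),  # equation (7)
        "prpn": pos_beneficial * (1.0 - neg_beneficial),          # equation (8)
        "prnp": (1.0 - pos_beneficial) * neg_beneficial,          # equation (9)
    }


if __name__ == "__main__":
    for prevalence in (0.05, 0.10, 0.30, 0.50):
        probs = directional_probabilities(prevalence)
        print(prevalence, {name: round(value, 3) for name, value in probs.items()})
```

Because the two subgroups are mutually exclusive and the estimates are treated as independent, each joint probability is a product of the two marginal probabilities, and the four values sum to one for any prevalence.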

At the planning stage, when a homogeneous treatment effect ($\Delta$) is anticipated and the study is designed to achieve 80% power at a one-sided 2.5% significance level, the probability of observing directionally consistent beneficial subgroup effects is expected to be at least 80% when between 10% and 90% of the intent-to-treat patients belong to the positive subgroup. This probability is around 85% and higher when the study is designed to achieve 90% power and is the largest (at least 95%, if planned at 80% power, and at least 97.5%, if planned at 90% power) when the proportion of subjects is equal in the “positive” subgroup and the “negative” subgroup; see the top two curves (prpp80 and prpp90) in Fig. 1. Although the study expects a homogeneous treatment effect in both the positive and the negative subgroups, the probability of observing a numerically harmful treatment effect in the positive subgroup increases as the proportion of subjects in the positive subgroup decreases. This probability is a little over 25% for a study with 80% power and about 23% for a study with 90% power, when there are only 5% or fewer subjects in the positive subgroup (see the curves labeled as prnp80 and prnp90, reading the x-axis from right to left). The phenomenon is symmetrically reversed for the negative subgroup (see the curves labeled as prpn80 and prpn90). In all cases evaluated, the probability of observing a numerically harmful treatment effect in both subgroups is essentially zero across all prevalences considered (prnn, results not shown in Fig. 1).

2.3. Probability of Observing Statistically Significantly Consistent Versus Inconsistent Treatment Effects Between Mutually Exclusive Subgroups

Denote by $p_+$ the nominal one-tailed p-value of main interest from testing the hypothesis of treatment effect in the positive subgroup and, similarly, $p_-$ in the negative subgroup. The probability of observing statistically significantly consistent versus inconsistent treatment effects can be further derived. Corresponding to equations (4) and (5), we have

$$\Pr(p_+ < \alpha/2 \mid \Delta = \delta) = \Phi\left(\sqrt{\pi_+}\,(z_{\alpha/2} + z_\beta) - z_{\alpha/2}\right) \tag{10}$$

$$\Pr(p_- < \alpha/2 \mid \Delta = \delta) = \Phi\left(\sqrt{\pi_-}\,(z_{\alpha/2} + z_\beta) - z_{\alpha/2}\right) \tag{11}$$

The four probabilities similar to equations (6)–(9) of prpp, prnn, prpn, and prnp can be derived using equations (10) and (11); call these psg-nsg, pns-nns, psg-nns, and pns-nsg, respectively. In each part of these acronyms, the first letter is either “p” representing the “positive” subgroup or “n” representing the “negative” subgroup. The second and third letters are either “sg” indicating statistically significant or “ns” indicating not statistically significant.

We illustrate the probabilities of observing statistically significantly consistent versus inconsistent treatment effects, assuming a study is powered at 80% to detect a common effect size $\Delta$. From Fig. 2, we see that the probability of observing a statistically significantly beneficial treatment effect in both groups is low, ranging from as small as 7% at extreme proportions (5% vs. 95% positive subgroup size) to approximately 25% at equal subgroup sample sizes; see the psg-nsg curve with filled circles. Similarly, the probability that both subgroup-specific treatment effects are not shown to be statistically significant ranges from 20% for extreme subgroup sample size proportions to approximately 25% for equal subgroup sizes; see the pns-nns curve with open circles in Fig. 2. Thus, the probability of observing statistically significantly consistent treatment effects and the probability of observing statistically nonsignificantly consistent treatment effects together total at most approximately 50%.

We can also see from Fig. 2 that the probability of observing at least one statistically nonsignificant subgroup-specific treatment effect (abbreviated as prat1ns) is high, at least 75% (see the prat1ns curve), which is the mirror image of psg-nsg, that is, 1 − psg-nsg. In fact, this probability consists of three probabilities: the probability that both subgroups yield statistically nonsignificant effects (pns-nns) described in the previous paragraph, the probability that the treatment effect in the positive subgroup is not statistically significant but is in the negative subgroup (pns-nsg), and the probability that the treatment effect in the positive subgroup is statistically significant and is not so in the negative subgroup (psg-nns). The latter two curves showing inconsistent effects depend on the proportion of patients in the positive subgroup. The smaller the proportion of the positive subgroup size is, the higher is the probability that inconsistent effects will be observed between the two subgroups (pns-nsg). The reverse is true for the negative subgroup (psg-nns). That is, testing for a subgroup-specific treatment effect is generally underpowered for a low-prevalence subgroup. The prevalence factor is generally not considered in the design stage, especially when there is no materially suspected heterogeneity.

The evaluation just given assumes there is only one baseline covariate used to define the two mutually exclusive subgroups and that classification of subgroup membership is accurate. In practice, there is generally no way of knowing how many post hoc subgroup analyses were performed, including use of different cutoff values for the same baseline covariate to form subgroups and use of various baseline characteristics to group subjects into a small or a large number of subgroups. Therefore, under the premise that an overall treatment effect can reasonably be applicable to all subgroups, subgroup analyses ought to be considered as exploratory analyses. When a subgroup-specific treatment effect appears to be highly statistically significant, the probability of observing such a phenomenon can be as high as 70% for a 5%:95% or 95%:5% subgroup size ratio shown in Fig. 2; such an observed subgroup-specific treatment effect should be interpreted with extreme caution, given that there are no a priori suspected differential treatment effects between the positive and the negative subgroups.

Figure 2 Probability of observing statistically significantly consistent/inconsistent subgroup effects given common effect size.

2.4. Observations About the Interaction Test When Homogeneous Effect Is the Truth

Several observations relating to the interaction test from sections 2.2 and 2.3 can be summarized. From Fig. 1, one should expect more than 70% to close to 100% probability of observing directional consistency between the positive and negative subgroups in the prevalence range evaluated (5% to 95%) if the treatment effect is homogeneous across the subgroups. As shown in Fig. 2, in the scenario of homogeneous treatment effect, the probability of observing nonsignificant treatment effects in both subgroups ranges from 20% to 25% as the prevalence ranges from extremes (e.g., 5% and 95%) to 50%. The probability of observing that at least one subgroup shows a nonsignificant result is high: at least 75% when prevalence is 50%, and more than 90% if the prevalence is either 5% and smaller or 95% and higher.

Given that the study is not powered to detect a treatment-by-subgroup interaction effect, adding the interaction term to the main statistical analysis model to evaluate the interaction between positive and negative subgroups is almost certain to be misleading and can at best be exploratory. The risk of using an underpowered interaction test is that the probability of observing a significant treatment effect in the negative subgroup when the positive subgroup consists of only, say, 5% of the intent-to-treat patients can be as high as 70%, which can result in a perceived “one-subgroup-only” benefit when the truth is a homogeneous treatment effect. The same holds for the probability of observing a significant treatment effect in the positive subgroup when the negative subgroup consists of 5% of the intent-to-treat patients. The current practice of the treatment-by-subgroup interaction test is haphazard because neither the prevalence of the primary subgroup nor the necessary sample size for the interaction test is considered at the design stage.
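The probabilities quoted in sections 2.3 and 2.4 follow from equations (10) and (11) by the same product rule used for equations (6)–(9). The following is a minimal Python sketch under that reading; the function name `significance_probabilities`, the underscore spellings of the acronyms, and the default design values (80% power, one-sided level 0.025) are illustrative assumptions rather than anything specified in the article.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def significance_probabilities(prev_pos, power=0.80, one_sided_alpha=0.025):
    """Joint probabilities of (non)significance in the two subgroups.

    prev_pos is the proportion of patients in the positive subgroup; the
    trial is assumed to be sized for a common (homogeneous) effect.
    """
    if not 0.0 < prev_pos < 1.0:
        raise ValueError("prev_pos must lie strictly between 0 and 1")
    z_alpha = STD_NORMAL.inv_cdf(1.0 - one_sided_alpha)
    z_sum = z_alpha + STD_NORMAL.inv_cdf(power)
    # Equation (10): positive subgroup reaches one-sided significance
    pos_sig = STD_NORMAL.cdf(prev_pos ** 0.5 * z_sum - z_alpha)
    # Equation (11): negative subgroup reaches one-sided significance
    neg_sig = STD_NORMAL.cdf((1.0 - prev_pos) ** 0.5 * z_sum - z_alpha)
    probs = {
        "psg_nsg": pos_sig * neg_sig,
        "pns_nns": (1.0 - pos_sig) * (1.0 - neg_sig),
        "psg_nns": pos_sig * (1.0 - neg_sig),
        "pns_nsg": (1.0 - pos_sig) * neg_sig,
    }
    # At least one subgroup not significant: complement of psg_nsg
    probs["prat1ns"] = 1.0 - probs["psg_nsg"]
    return probs


if __name__ == "__main__":
    for prevalence in (0.05, 0.25, 0.50, 0.95):
        probs = significance_probabilities(prevalence)
        print(prevalence, {name: round(value, 3) for name, value in probs.items()})
```

Evaluating the sketch at a 5% and at a 50% positive-subgroup proportion gives values in line with the ranges read from Fig. 2 in the text, which is a useful check that the product rule is the intended construction.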


3. PRESPECIFICATION OF A SUBGROUP-SPECIFIC HYPOTHESIS

A philosophy that is different from the conventional subgroup analysis is to formally design the study allowing for a statistical conclusion on a subgroup-specific treatment effect. The underlying argument is that the study is normally not powered sufficiently to detect a treatment-by-subgroup interaction when the study is mainly designed to detect a common effect $\Delta$. Depending on the clinical question of primary interest, a study may be designed to detect a treatment effect in the intent-to-treat patient population or in the positive patient subgroup. In such cases, the primary null and alternative hypotheses are

$$L_0: \Delta = 0 \text{ and } \Delta_+ = 0 \quad \text{vs.} \quad L_1: \Delta \neq 0 \text{ or } \Delta_+ \neq 0 \tag{12}$$
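One simple way to test the two components of hypothesis (12) while controlling the family-wise type I error rate is an equal Bonferroni split of the one-sided level, which is also the adjustment used for the sample size illustration in section 3.2. The following is a minimal Python sketch of that decision rule only. The function name `bonferroni_dual_test` and its argument names are hypothetical, and the equal split is an assumption; the approaches cited in this section may allocate the level differently and be less conservative.

```python
def bonferroni_dual_test(p_overall, p_positive, one_sided_alpha=0.025):
    """Apply an equal Bonferroni split to the two hypotheses in (12).

    p_overall and p_positive are one-sided p-values for the overall effect
    and for the positive-subgroup effect. Each is compared with half of
    the one-sided level (0.0125 when the level is 0.025).
    """
    for p_value in (p_overall, p_positive):
        if not 0.0 <= p_value <= 1.0:
            raise ValueError("p-values must lie between 0 and 1")
    threshold = one_sided_alpha / 2.0
    reject_overall = p_overall < threshold
    reject_positive = p_positive < threshold
    return {
        "reject_H0": reject_overall,            # overall effect, hypothesis (1)
        "reject_H0_plus": reject_positive,      # positive subgroup, hypothesis (3)
        "reject_L0": reject_overall or reject_positive,
    }


if __name__ == "__main__":
    # Hypothetical p-values, for illustration only
    print(bonferroni_dual_test(p_overall=0.020, p_positive=0.010))
```

Rejecting $L_0$ through the subgroup component alone says nothing formal about the negative subgroup, which is the interpretive gap the remainder of this section addresses.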

The framework for testing hypothesis (12) with family-wise type I error rate control has been proposed in the literature (e.g., Simon and Wang, 2006; Freidlin and Simon, 2005). A summary of methodological approaches with strong control of the family-wise type I error rate can be found in, for example, Millen et al. (2012). Hypothesis (12) consists of hypothesis (1) and hypothesis (3). Note that hypothesis (12) does not formally address the clinical question of whether treatment is effective in the negative patient subgroup. When a statistically significant treatment effect in the positive patient subgroup is shown, demonstration of a statistically significant overall treatment effect cannot guarantee the conclusion that there is also a statistically significant treatment effect in the negative patient subgroup. However, prespecification of a futility rule to terminate early or to screen out an ineffective subgroup in an adaptive enrichment design may not fall into this category (e.g., Wang et al., 2007, 2009; Brannath et al., 2009; Mehta et al., 2009), as the intent via adaptation is generally different from subgroup analyses in a fixed design trial.

In other instances, a study may be designed to formally address the clinical question “Does treatment effect differ between the positive and negative patient subgroups?” This question leads essentially to hypothesis (2). It is worth noting that hypothesis (2), if pursued, is generally a secondary hypothesis of interest. That is, hypothesis (2) may be tested only if a study concludes H1 in hypothesis (1) that the treatment is effective in all comers (ICH, 1998a). However, it still appears ambiguous what a significant result of an interaction test can achieve.

The introduction of the interaction-to-overall effects ratio in section 3.1 and the likelihood of a baseline covariate predictive of treatment effect in section 3.2 is under the usual sample size planning for the intent-to-treat patients, which satisfies the sample size required for the interaction test when the subgroup of interest, say, the positive subgroup, is at most 50% of the intent-to-treat patients. This ratio aims for efficacy assessment.


3.1. Interaction-to-Overall Effects Ratio

Following the notation in section 2.2, denote by

$$\gamma = \frac{\Delta_+ - \Delta_-}{\Delta} = \frac{\theta}{\Delta} \tag{13}$$

the interaction-to-overall effects ratio. This ratio, $\gamma$ in equation (13), allows for assessing the sample size increase (or the total sample size, $n_{INT}$) needed to detect the interaction effect from the original planning of $n_{ITT}$ to detect the overall effect, $\Delta$. Note that this ratio is independent of the original intent-to-treat (ITT) sample size ($n_{ITT}$), the nominal power $1 - \beta$, the magnitude of the interaction effect, $\theta$, and the magnitude of the overall treatment effect, $\Delta$. See the technical derivation in section 3.2.

Smith and Day (1984) showed, and Brookes et al. (2004) confirmed, that a study has to be at least four times larger if it is aimed to detect interactions than if it is confined to detect the overall effect of the same magnitude assuming equal subgroup sample sizes. In other words, for $\pi_+ = 0.5$, $n_{INT} \approx 4n_{ITT}$ when $\gamma = 1$. Certainly, $n_{INT} \approx n_{ITT}$ when $\gamma = 2$ and $n_{INT} \approx 16n_{ITT}$ when $\gamma = 0.5$. Thus, it should be clear that the commonly used “rule of four” may not always be sufficient to detect the interaction effects.

It may be argued that when a sample size calculation is based on targeting an overall effect of $\Delta$ without distinguishing between the scenario of consistent subgroup-specific treatment effects and the scenario of differential subgroup-specific effects, prespecification of a subgroup-specific treatment effect hypothesis (3) of $H_{0+}$ vs. $H_{1+}$ seems haphazard. Namely, prespecification of $H_0: \Delta = 0$ and $H_{0+}: \Delta_+ = 0$ with family-wise error rate (FWER) control for $L_0$ (in equation (12)) alone may be insufficient to confidently conclude if $\Delta > 0$ ($H_1$ that treatment is also effective in the negative patient subgroup) (e.g., Rothman et al., 2012), when $H_0$ in $L_0$ is rejected, unless the prevalence in the positive subgroup of main interest is at most 50%.

3.2. Likelihood of a Baseline Covariate Predictive of Treatment Effect

Often, the motivation of $L_0$ in equation (12) hinges on some belief in “$\Delta_- = 0$” but with insufficient confidence during the planning stage.
That is, the subgroup indicator as a baseline covariate has a reasonable likelihood to be predictive of treatment effects such that the treatment is effective only in one subgroup and not in the other. A viable alternative hypothesis to null hypothesis $L_0$ could be

$$L_1^*: \Delta > 0\ (\Delta_+ > 0 \text{ and } \Delta_- > 0) \quad \text{or} \quad (\Delta_+ > 0 \text{ and } \Delta_- = 0) \tag{14}$$

In other words, existence of an overall treatment effect implies homogeneous treatment effects in both subgroups, and the beneficial treatment effect in the positive subgroup indicates null treatment effect in the negative subgroup. However, there is generally no interest in formally testing the hypothesis of a negative subgroup-specific treatment effect, namely,

$$H_{0-}: \Delta_- = 0 \quad \text{vs.} \quad H_{1-}: \Delta_- \neq 0 \tag{15}$$

for the alternative hypothesis $L_1^*$ of equation (14). Therefore, the negative subgroup-specific treatment effect hypothesis is only implicit in $L_1^*$ when concluding $\Delta > 0$ (implying beneficial effects in both subgroups) versus concluding $\Delta_+ > 0$ (implying no beneficial effect in the negative subgroup). In general, if the hypothesis in equation (15) is also of interest, this treatment effect is either studied through hypothesis (1) with the conjecture that the treatment is effective in both subgroups or is investigated in two separate studies anticipating the two subgroups are at differing risk levels or essentially two subpopulations (e.g., Martinez et al., 2005).

To investigate the alternative hypothesis of being predictive in the second scenario of the hypothesis given in equation (14), one can derive the interaction-to-overall effects ratio from equation (13). When $\Delta = \pi_+\Delta_+ + (1 - \pi_+)\Delta_-$ and $\Delta_- = 0$, we have

$$\gamma = \frac{\Delta_+}{\pi_+\Delta_+} = \frac{1}{\pi_+} \tag{16}$$

That is, if the baseline covariate is predictive of treatment effect (the true state of treatment effects), then irrespective of how large or how small the effect size in the positive patient subgroup $\Delta_+$ is, the interaction-to-overall effects ratio for a predictive baseline characteristic is a function only of the proportion of patients in the positive subgroup, as shown in equation (16). Given that the interaction-to-overall effects ratio is not affected by the effect size in the positive subgroup, we illustrate the implication of (16) by presenting one scenario ($\Delta_+ = 0.3$, $\Delta_- = 0$) summarized in Table 1, which is applicable to any $\Delta_+ > 0$ when the true state of the negative subgroup-specific treatment effect is zero, that is, $\Delta_- = 0$. When the sample size per arm, $n$, is planned based on an overall treatment effect, $\Delta$, one can plan the sample size for the positive patient subgroup, $n_+$, based on the postulated effect size, $\Delta_+ > \Delta$, at the same nominal significance level; for this scenario, $n_+ = 175$ per arm. Accounting for the usual necessary multiplicity adjustments for the two hypotheses, for example, at 0.0125 by Bonferroni adjustment, $n_+$ is 212 per arm.

Table 1 Interaction-to-overall effects ratio as a function of prevalence of positive subgroup

$\pi_+$   $\Delta$   $n$      $n_+$ from $n$   $n_-$ from $n$   $\gamma$
0.05      0.015      69789    3490             66299            20.00
0.1       0.03       17448    1745             15703            10.00
0.2       0.06       4362     873              3489             5.00
0.3       0.09       1939     582              1357             3.33
0.4       0.12       1091     437              654              2.50
0.5       0.15       698      350              348              2.00
0.6       0.18       485      292              193              1.67
0.7       0.21       357      250              107              1.43
0.8       0.24       273      219              54               1.25
0.9       0.27       216      195              21               1.11
0.95      0.285      194      185              9                1.05

Note. $\Delta_+ = 0.3$, $\Delta_- = 0$, $n_+ = 175$ per arm based on $\Delta_+$ only or 212 per arm with Bonferroni adjustment.
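The overall effect, the per-arm sample size, and the ratio reported in Table 1 can be recomputed from the planning formula of section 2.2 and equation (16). The following is a minimal Python sketch. The rounded normal quantiles 1.96 and 0.842 are an assumption made here because they appear to reproduce the tabulated $n$ column (exact quantiles give slightly smaller values); the function names `n_per_arm` and `table1_row` are hypothetical, and the rounding convention behind the "$n_+$ from $n$" column is not reconstructed.

```python
import math
from statistics import NormalDist

Z_ALPHA = 1.96   # assumed rounded quantile for a one-sided 0.025 level
Z_BETA = 0.842   # assumed rounded quantile for 80% power


def n_per_arm(effect_size, z_alpha=Z_ALPHA, z_beta=Z_BETA):
    """Per-arm sample size n = 2 (z_alpha + z_beta)^2 / effect_size^2, rounded up."""
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    return math.ceil(2.0 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


def table1_row(prev_pos, effect_pos=0.3):
    """Overall effect, n per arm, and ratio when the negative-subgroup effect is 0."""
    overall = prev_pos * effect_pos  # weighted average with a zero negative effect
    return {
        "prevalence": prev_pos,
        "overall_effect": overall,
        "n": n_per_arm(overall),
        "ratio": effect_pos / overall,  # equation (16): equals 1 / prevalence
    }


if __name__ == "__main__":
    for prevalence in (0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95):
        print(table1_row(prevalence))
    # Positive subgroup sized on its own effect, without and with Bonferroni
    print(n_per_arm(0.3))
    print(n_per_arm(0.3, z_alpha=NormalDist().inv_cdf(1 - 0.0125)))
```

The ratio column depends only on the prevalence, so changing `effect_pos` alters the overall effect and the sample sizes but leaves the ratio untouched, which is the point of equation (16).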


When multiplicity adjustment is accounted for, n+ may be a little short if more than 80% of subjects belong to the positive subgroup. This is generally not a concern, because when the positive subgroup is the majority, the interest to consider a subgroup-specific treatment effect would be lowered. Thus, we may focus on at most 50% of subjects in one subgroup, say, the positive subgroup, for evaluation of the likelihood that the baseline covariate is predictive of treatment effect (e.g., Wang et al., 2007; Wang, 2007). From Table 1, the interaction-to-overall effects ratio is at least 2 or higher for 50% or less subjects in the positive subgroup, that is,  ≥ 2 for + ≤ 05. From section 3.1, nINT ≈ nITT when  = 2 for equal subgroup sample size. For unequal subgroup sample sizes, Smith and Day (1984) show that up to 1.15-fold and 2.5-fold sample size of nITT would be needed for nINT to detect  = 2 if the percentage of positive subgroup is 30% and 10%, respectively. From Table 1,  is larger than 2 when the percentage of positive subgroup is less than 50%, and  = 333 for 30% and  = 10 for 10% positive subgroup, respectively. The larger the ratio  (> 2) is, the smaller the sample size is needed, nINT , to test the interaction effect. As a result, the sample size, nITT , planned for  would be sufficiently powered to also formally test the interaction effect in hypothesis (2) for  ≥ 2 aside from the fact that n+ is also sufficiently powered for detecting + . The usual argument that there is insufficient power to formally test the interaction effect would not be applicable for  ≥ 2 where the positive subgroup is at most 50%. In contrast, the nITT would be insufficient for testing the interaction when more than 50% of subjects are in the positive subgroup, making the interaction test underpowered. 
Although it is equivalent to a statistically powered interaction test, a subgroup-specific conclusion for drug licensure may also need to rely on appropriate benefit/risk assessment within each subgroup and on the trade-offs between the mutually exclusive subgroups.

4. DESIGN CONSIDERATION

Subgroup analysis is a common practice in randomized controlled trials, irrespective of prespecification of a subgroup-specific treatment effect hypothesis. From the design standpoint, the level of evidence needed for concluding a subgroup-specific treatment effect will likely depend on many criteria, such as those listed in Table 2.

Table 2 Criteria for consideration that may affect interpretability of a subgroup-specific finding

(i) Is subgroup indicator pathophysiologically plausible, prognostic of disease, and/or predictive of clinical outcome?
(ii) What are the study objective(s), primary efficacy endpoint, primary hypothesis or hypotheses?
(iii) When is the subgroup-specific hypothesis formed?
(iv) Is the subgroup unambiguously defined and when?
(v) Is study designed with a prespecified subgroup-specific treatment effect size, e.g., a positive-subgroup effect size and n+?
(vi) Is subgroup defined or newly developed via, e.g., prediction algorithm needing validation?
(vii) Is subgroup defined by phenotypic (clinical) or genomic (molecular, intrinsic) characteristics?
(viii) When and how are genomic samples collected if characteristics of subgroup are genomic?
(ix) Is diagnostic assay analytically validated?
(x) Are genomic samples essentially fully ascertained and when?
(xi) Are multiplicity adjustments for the dual hypotheses prespecified?


The criteria listed in Table 2 will at least guide us as to whether a randomized controlled trial is designed (I) prospectively with both the overall and the positive-subgroup effect sizes, (II) prospectively with the overall effect size alone, (III) prospectively–retrospectively (Wang et al., 2006), or (IV) retrospectively, where a subgroup-specific hypothesis may also be of eventual interest. Figure 3 depicts the schema of the approaches that may help to determine whether a study can be rendered adequate and well-controlled (FDA, 2013) for definitively concluding a positive treatment effect in a predefined subgroup or in all comers. Design consideration does not always have to rely on whether an overall treatment effect must be shown first. Instead, from the design standpoint, a well-defined subgroup that is biologically plausible or molecularly targeted can be worthy of credible and meaningful statistical inference when the subgroup factor is prespecified, though the subgroup-specific hypothesis may not always be prespecified. In this sense, a priori criteria are likely to be required for consideration in concluding a “subgroup-only” effect. Stratification by such a subgroup factor is generally preferable for randomization. Although a subgroup-specific hypothesis may not be clearly stated, we note that the commonly performed subgroup analyses, by age, gender, race/ethnicity, and geographical region, might be viewed as based on prespecified subgroup factors when there are plausible a priori rationales on pathophysiology, mechanism of treatment targeting, biological reasoning, or clinical support for a disease-specific indication under investigation. There is no doubt that level-1 evidence would be based on credible findings on both the overall and the positive-subgroup treatment effects from prospective study designs (FDA, 1998). The next level of evidence for a subgroup-specific treatment effect relies on the study’s primary objective or

Figure 3 Flow chart of design and analysis of subgroup-specific treatment effect vs. traditional subgroup analysis.


objective(s). For pharmacogenomics clinical trials, a fully specified prospective design would mean that the genomic samples are collected at study baseline, in addition to the overall study design characteristics (Wang, 2007). Pursuing a prospective/retrospective design and analysis, by “prespecifying” a genomic subgroup hypothesis prior to knowing the genomic status and “retrospectively” performing the genomic subgroup analysis, was popular earlier, but such a design may not fully ascertain genomic status from all comers (e.g., Wang et al., 2006, 2010). In fact, prospective design with complete ascertainment of genomic samples has been proven possible, for example, with PREDICT-1 (Hughes et al., 2008). Prospective design with the overall effect size, followed by analysis of the subgroup without incorporating the positive-subgroup effect size at the design stage, is a common prospective design scenario. We note that prospective design and analysis including the positive-subgroup effect size has also been considered in the medical literature. For instance, a pharmacogenomics trial can be designed with sufficient sample sizes, n and n+, aiming at a specific study power to detect an overall effect size in the intent-to-treat patients and a subgroup-specific effect size in subjects who are classified as (genomic) biomarker positive (e.g., Johnston et al., 2009).

5. DECISION RULES

To make a regulatory review recommendation based on the statistical conclusion with respect to the dual hypotheses of equation (12) in L0 from an adequate and well-controlled (FDA, 2013) confirmatory trial, the study design and analysis plan require prespecification of a multiplicity adjustment method that strongly controls the studywise type I error rate. Except for the last row of Table 3, when L0 is rejected, there are three possible outcomes summarized in column 1 (is treatment effective in the ITT population?) and column 2 (is treatment effective in the positive subgroup?). Aside from the studywise type I error rate control, testing for treatment-by-subgroup interactions is often viewed as a secondary or exploratory analysis that is not included in the chain or hierarchy of the studywise type I error rate control. The role of the interaction test is to facilitate assessment of the likelihood of a negligible or nonnegligible treatment effect in the negative subgroup before concluding that the corresponding subgroup indicator or classifier is predictive of treatment effects. We have provided probabilistic arguments on the caveats of the interaction test when a homogeneous treatment effect is the truth in section 2.4. As noted in the alternative hypothesis in equation (14) of L*1, when the treatment is shown to be statistically effective in the intent-to-treat patients, that is, the overall effect is greater than 0, a positive effect in the negative subgroup (the implicit hypothesis) is always possible, though it may or may not imply that the positive-subgroup, negative-subgroup, and overall effects are equal and greater than 0. Given that an overall effect greater than 0 is concluded, rejection of H0+ (no treatment effect in the positive subgroup) should not be immediately viewed as an indicator of no treatment effect in the negative subgroup.
Together, the test of interaction (2) in column 3, the subgroup-specific prevalence, and the observed interaction-to-overall effects ratio in this case may shed valuable light on the likelihood of a quantitative or a qualitative interaction and on whether it is feasible to rule out a beneficial treatment effect in the negative subgroup. Regulatory recommendation inclusive of efficacy and safety in a subgroup versus in all comers and the disease indication investigated


can be built on top of the statistical conclusion attained in one of the first four rows where H0 in the hypothesis in equation (1) is rejected. In the other scenario, where only H0+ (no treatment effect in the positive subgroup) of the hypothesis in equation (3) is rejected, the sample size planned for nITT is also sufficiently powered to test the interaction effect (2) when the interaction-to-overall effects ratio is at least 2 and the prevalence of the subgroup of main interest is at most 50%. Rejection of H0+ in the alternative hypothesis in equation (14) of L*1 might imply that there is no effect in the negative subgroup, but it is not a direct proof and can go either way: the negative-subgroup effect may be 0 or greater than 0. In the event that K0 (no treatment-by-subgroup interaction) is rejected, given a positive-subgroup effect greater than 0 and an observed ratio of at least 2, the likelihood that the corresponding prespecified subgroup classifier is predictive of treatment effect can be high if the prevalence of the positive subgroup is at most 50%. The predictiveness of a baseline factor or classifier should ideally be confirmed in a separate study before one can confidently conclude the predictive claim. However, for drug approval, it may suffice to replicate the observed significance, either on the subgroup-specific treatment effect or on the overall treatment effect, in a separate study. After a positive-subgroup effect greater than 0 is concluded, a statistically nonsignificant interaction does not clearly suggest that there is no treatment effect in the negative subgroup, especially when the observed ratio is less than 2. It is possible that the treatment effect is smaller in the negative subgroup than in the positive subgroup and therefore the study is insufficiently powered for detecting a smaller effect or a treatment-by-subgroup interaction.
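The screening condition used repeatedly in this section (an observed ratio of at least 2 together with a subgroup prevalence of at most 50%) can be written as a small check. This is only an illustrative sketch: the function name, argument names, and example values are hypothetical, and the check is a heuristic flag that does not replace the hypothesis tests in Table 3 or the benefit/risk assessment.

```python
def predictive_claim_plausible(observed_ratio: float,
                               prevalence: float,
                               min_ratio: float = 2.0,
                               max_prevalence: float = 0.5) -> bool:
    """Flag whether a predictive claim for the subgroup classifier looks plausible.

    observed_ratio: observed interaction-to-overall effects ratio.
    prevalence: observed proportion of subjects in the subgroup of interest.
    The thresholds default to the values discussed in the text.
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    return observed_ratio >= min_ratio and prevalence <= max_prevalence


# Hypothetical inputs chosen only to exercise each branch of the rule
print(predictive_claim_plausible(2.5, 0.30))  # large ratio, minority subgroup
print(predictive_claim_plausible(1.5, 0.30))  # ratio below 2
print(predictive_claim_plausible(2.5, 0.60))  # subgroup is the majority
```

Keeping the two thresholds as parameters makes it easy to examine how sensitive the flag is to the cutoffs, which matters because the observed prevalence and ratio are themselves estimates.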

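Table 3 Regulatory review recommendation on statistical conclusion of treatment effects

Column headings: H0 (overall effect = 0) | H0+ (positive-subgroup effect = 0) | K0 (interaction = 0) | Negative-subgroup effect > 0 | Regulatory statistics recommendation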

Reject

Reject

Reject

Possible

Not reject

Possible

Not reject

Reject

Possible

Reject

Not reject Reject

Possible Unlikely

Not reject

Possible

n/a

n/a

Positive subgroup claim
Negative subgroup claim: review issue, especially if observed ratio < 2
ITT claim possible
Positive subgroup claim possible and is disease dependent; insufficient evidence to exclude negative subgroup if observed ratio < 2
ITT claim if quantitative interaction, e.g., observed ratio < 2; or negative subgroup claim possible
Implicit: positive-subgroup effect > 0, negative-subgroup effect > 0*
ITT claim
May be predictive claim if observed ratio ≥ 2 and the prevalence for specific subgroup is at most 50%; when replicated in a separate study increases certainty as a predictive characteristic
Positive subgroup claim; the study may potentially be underpowered for the treatment effect in the negative subgroup if observed ratio < 2
No claims

Reject

Not reject

Not reject

Not reject

*Does not require consistent treatment effects across subgroups: the positive-subgroup effect may be larger than or equal to the negative-subgroup effect, with both greater than 0.


6. SOME REGULATORY REVIEW EXPERIENCES IN LABELING RECOMMENDATIONS

In regulatory review practices of NDAs and BLAs thus far, it is hardly an exaggeration to state that there is a subgroup problem behind every disease indication: “subgroup analysis” may have been prespecified, but a subgroup-specific treatment effect hypothesis is rarely prespecified.


6.1. Challenges Due to Full Reliance of a Single Controlled Trial

In many applications, the basis for regulatory approval of a test drug may have to rely primarily on a single adequate and well-controlled trial (FDA, 2013). This is already difficult, and it presents even more challenges in ruling out a potentially false-positive finding on a subgroup-specific treatment effect. In the following, we share two approval experiences of subgroup-specific labeling based mainly on one controlled trial in which the subgroup results may be mentioned but prespecification of a subgroup-specific treatment effect hypothesis is not in place. It could be argued that the regulatory recommendation seems post hoc in nature after a statistically significant treatment effect in all comers is demonstrated. In one case, the approval of short-term survival benefit was limited to the high-risk subgroup, noting that “efficacy has not been established in patients with low risk of death.” These results correspond to row 1 of Table 3, showing overall statistical significance and a significant treatment effect in a high-risk subgroup, with a nominally significant p-value (p < 0.018) for the interaction test. The subgroup indicator was thought to be prognostic of disease and is defined by clinical criteria (see (i) and (vii) of Table 2); there were debates on the cutoff threshold used to define the high-risk subgroup (see (iv) of Table 2); the subgroup-specific hypothesis was not prespecified (no to (v) and (xi) of Table 2), and the analysis was considered a retrospective analysis falling under traditional subgroup analysis in Fig. 3. A few years later, the label was updated to include supportive information from another trial that studied mostly low-risk patients and was terminated early for futility because little efficacy was observed (Xigris label, 2007). Interestingly, a few more years later, a follow-up study showed no short-term survival benefit in the high-risk subgroup.
The authors concluded that there was no benefit in the high-risk subgroup for which the treatment was approved and noted that they could not explain the inconsistency suggested by their finding. All three trials are large mortality trials (Rothman et al., 2012). This subgroup-specific labeling has been challenged. In hindsight, the observed interaction-to-overall effects ratio was only approximately 1.5 in the original study used as the basis for approval in the high-risk subgroup only, which is clearly not large enough to be certain about the predictive utility for the treatment effect. The observed prevalence of the high-risk subgroup was a little more than 50% of the intent-to-treat patients. It turned out that the treatment effect in the high-risk subgroup was not independently replicated. Eventually, a worldwide voluntary market withdrawal was announced (http://www.fda.gov/Safety/MedWatch/SafetyInformation/SafetyAlertsforHumanMedicalProducts/ucm277143.htm). In the other case, the approval of reduced morbidity is limited to approximately 7% of the intent-to-treat heart failure patients who are intolerant


of angiotensin-converting enzyme (ACE) inhibitors based on the ValHEFT trial (Diovan label, 2002). The subgroup is arguably based on the finding from the only controlled trial, without prespecification for evaluation of a subgroup-specific treatment effect (the subgroup being only those who did not receive an ACE inhibitor, a clinical characteristic; see (vii) of Table 2). In the drug label, it was stated, “It is not known if this is a reproducible effect or a chance occurrence. The use of a beta-blocker did not appear to influence the effect of valsartan in patients not receiving an ACE inhibitor.” The empirical observation with some inhibitor arguments (i.e., mechanism of action of drug predictive of outcome with subgroup defined; see (i) and (iv) of Table 2) was thought to be a plausible explanation for concluding this subgroup-specific benefit–risk assessment, which was not prospectively specified (no to (v) and (xi) of Table 2). In this latter case, the observed interaction-to-overall effects ratio was larger than 2 (ranging from more than 3 to more than 5) for the combined endpoint of heart failure morbidity and all-cause mortality. This ratio was more than 40 for all-cause mortality alone, which is one component of heart failure morbidity and is the coprimary endpoint. Here, the observed hazard ratios (and nominal p-values) were 0.87 (0.009) for the combined endpoint and 1.02 (0.80) for all-cause mortality in the intent-to-treat patients. That is, there was no significant risk reduction on all-cause mortality and there was a significant risk reduction on the combined endpoint in all comers. The treatment-by-receiving-ACE-inhibitor interaction tests were significant for the combined endpoint (nominal p < 0.003) and for all-cause mortality (nominal p < 0.015). Thus, the results on all-cause mortality fit row 5 of Table 3 and the results on the combined endpoint fit row 1 of Table 3. The analysis of the subgroup not receiving an ACE inhibitor is a traditional retrospective analysis (see Fig. 3).
Note that all-cause mortality was also viewed as a safety endpoint, and the interpretation of differential mortality effects between those receiving and those not receiving an ACE inhibitor should be made cautiously, given that the overall mortality benefit was not shown statistically and the subgroup analysis might not have been prespecified, especially if conducting another study to replicate the mortality finding is not feasible. Since FDA approval was limited to the heart failure subgroup not receiving an angiotensin-converting enzyme (ACE) inhibitor, the medical literature has continued to assess the benefit–risk profiles of the subgroups, mostly comparing subgroups receiving an ACE inhibitor and/or a beta-blocker (e.g., Louis et al., 2001).

6.2. Challenges Due to Belief in Subgroup-Specific Treatment Effect

Rarely is a prospectively planned subgroup-specific hypothesis in place. Instead, data on the same subgroup across multiple studies may be the only information available for health authorities to make decisions on drug licensure, for example, the approval of cetuximab for treatment of colorectal cancer in kRAS mutation-negative patients, with an observed prevalence of approximately 55% to 65% of colorectal cancer patients (Erbitux label, 2012). The cetuximab development consists of a few prospective/retrospective pharmacogenomics study designs and analyses with incomplete ascertainment of kRAS mutation status; see prospective/retrospective design and analysis in Fig. 3 and see (i) and (vii)–(x) of Table 2. Where feasible, a separate study that is prospectively designed should be conducted to allow for


assessing whether the subgroup-specific interpretation made in a post hoc fashion can be credibly replicated. Although desirable, this was not possible for the cetuximab development because ongoing trials could no longer accrue kRAS mutation-positive colorectal cancer patients once the clinical community required routine kRAS testing prior to use of anti-epidermal growth factor receptor (anti-EGFR) antibody therapy (McNeil, 2008). In other words, although it has not yet been proven effective, the clinical community may choose to be pragmatic on ethical grounds and not put EGFR mutation-positive colorectal cancer patients into experimentation. Another all-comer colorectal cancer trial was not an option. It may be possible under other settings where a targeted therapy has not been factored into clinical practice (e.g., Gregson et al., 2012). In the cetuximab case, it was difficult to properly assess the interaction-to-overall effects ratio because further enrollment was limited to the EGFR mutation-negative subgroup. Regarding the cetuximab development, regulatory reviews and labeling recommendations account for the available kRAS subgroup-specific trial data and another drug (panitumumab) in the same drug class, although there were some concerns about genomic convenience samples, which were later alleviated by noteworthy efforts to improve ascertainment rates from less than 50% to around 90%. It turned out that lack of cetuximab benefit, measured by progression-free survival and overall survival, was shown in several studies in the kRAS mutation-positive subgroup, including the first-, second-, and third-line treatments. In addition, there was a suggestion of harm (decrease in progression-free survival and decrease in overall survival) in the kRAS mutation-positive subgroup receiving cetuximab in combination with other agents.
Balancing the safety and efficacy of cetuximab observed in the kRAS mutation-positive subgroup, cetuximab was eventually approved in metastatic colorectal cancer patients who are classified as kRAS mutation negative (Erbitux label, 2012). It appears that the current practice for labeling mostly considers a favorable benefit/risk profile shown either in a subgroup or in an intent-to-treat patient set. Taking Tarceva as another example, the intended use section of the label states, “Tarceva is indicated for first-line treatment of patients with metastatic non-small cell lung cancer (NSCLC) whose tumors have epidermal growth factor receptor (EGFR) exon 19 deletions or exon 21 (L858R) substitutions as detected by an FDA-approved test.” The limitation of use stated that “Safety and efficacy of Tarceva have not been evaluated as first-line treatment in patients with metastatic NSCLC whose tumors have EGFR mutations other than exon 19 deletions or exon 21 (L858R) substitution” (Tarceva label, 2013).

7. DISCUSSION

There is a considerable literature on subgroup analysis. Many authors have suggested guidelines for determining whether the apparent differences in subgroup-specific treatment effects are real. This includes, for example, Yusuf et al. (1991), Oxman et al. (2002), Cui et al. (2002), Rothwell (2005), and Sun et al. (2012), and many references cited therein. In the conventional framework, irrespective of prespecification of a subgroup-specific hypothesis, subgroup analysis is generally performed as an exploratory analysis after the overall treatment effect is


demonstrated. It is arguable whether the prespecified subgroup-specific treatment effect hypothesis is only exploratory, especially when the multiplicity adjustments due to dual hypotheses (ITT and specific-subgroup) are made and the subgroup as defined is consistent with formation of a meaningful subgroup. In contrast, subgroup analysis if performed when the overall treatment effect cannot be shown conclusively is considered as hypothesis generation. Prespecification of dual hypotheses with multiplicity adjustment may arguably not be considered “purely” hypothesis generation if there are sound clinical or genomic rationales for the subgroup-specific hypothesis. In fact, subgroup testing can produce false-negative results in a trial with a clear overall treatment effect, due to chance finding and lack of statistical power (Schultz and Grimes, 2005). These authors argue that post hoc observations of a subgroup-specific treatment effect should be treated with scepticism irrespective of their statistical significance (e.g., Yusuf et al., 1991; Rothwell, 2005). These points seem to be in line with the probability of statistically significant consistency versus inconsistency shown in Fig. 2 when a homogeneous treatment effect across subgroups is the true state of nature. Dwelling on the potential of studywise type I error rate inflation due to the unknown number of subgroup hypotheses tested or simply requiring prespecification of a subgroup-specific hypothesis may not directly address the design and analysis issues about testing a subgroup-specific treatment effect hypothesis in an all-comer study. Our thought process begins with formation of an appealing subgroup based on pathophysiologic plausibility, the intrinsic and/or extrinsic characteristics that may set patients apart due to their prognoses on degree of underlying illness and/or their predictiveness of differential treatment effects. 
These developments follow the logical sequence of formulating and evaluating a subgroup-specific treatment effect hypothesis. The flow from these developments also ties in closely with the process of distinguishing various designs for a randomized controlled trial presented in Table 2 and Figure 3. For prospective planning, prespecification of a subgroup analysis in the context of the hypothesis in equation (12) involves formal hypothesis testing and evaluation of a subgroup-specific treatment effect. In general, randomization should be stratified by the subgroup indicator, especially in small controlled trials aiming at statistical inference of a subgroup-specific treatment effect. In addition, we advocate consideration of the interaction-to-overall effects ratio in equation (13) and the prevalence of the baseline factor as additional measures for study planning in investigating a predictive effect. As summarized in Table 1 and derived in equation (16), this ratio is independent of the subgroup-specific effect size, which embeds an a priori conjecture of a potential likelihood of little to no effect in the complementary subgroup at the planning stage. This is consistent with the recommendation that if important subgroup-specific treatment effects are anticipated, trials should be powered to detect them reliably (Cui et al., 2002). As much as one prefers that the subgroup hypothesis be prespecified, post hoc subgroup analyses continue. Thus, we give a set of decision rules for the analysis stage, which should provide guidance for rigorous subgroup-specific conclusions. We recommend joint consideration of Table 3 with the prevalence of the subgroup of interest and the observed interaction-to-overall effects ratio as statistical criteria for regulatory decision making on labeling if only for a specific subgroup.


The potential utility of this observed ratio is illustrated using regulatory NDA/BLA case studies in section 6 under the setting where a single controlled trial is the main basis for regulatory decision making. Use of the observed ratio may also be applicable to the setting where multiple trials all investigate the treatment effects in both the positive and the negative subgroups. Increasing evidence of a subgroup-specific treatment effect would likely be supported when the majority of the trials yield an observed ratio of at least 2 where the prevalence of the specific subgroup is at most 50%. Future research on the theoretical properties of the interaction-to-overall effects ratio will continue and may account for the uncertainty in the observed prevalence. It is certainly not wise to plan a trial to include patients for whom no treatment effect is anticipated. As such, enrichment design may be a natural choice. The many criteria, such as those listed in Table 2, for concluding a subgroup-specific treatment effect, and the logical flow about formation of a subgroup in categorizing the types of study designs involving subgroup analysis in Fig. 3, are irrelevant to enrichment design. When a subgroup and its complementary subgroup are included in a controlled trial, it is the general belief that either a well-controlled trial should have proper representation of patient subgroups, such as age, gender, and race, or there may be suspicion of potential differential treatment effects between subgroups. But when a study is not powered to detect a significant interaction, the interaction test results can be very misleading: the test may be underpowered and miss statistically significantly inconsistent subgroup effects, or it may carry a higher probability of a false-positive error in concluding inconsistent subgroup effects because of extreme prevalence, say, 5% and lower versus 95% and higher.
On the other hand, it is difficult to project whether the subgroup heterogeneity in treatment effect, if any, is quantitative or qualitative. In other words, the clinical utility of the subgroup indicator being prognostic of disease and/or predictive of treatment outcome cannot be assumed with certainty at the design stage. In controlled all-comer pharmacogenomics trials, treatment effects are generally evaluated in both biomarker-positive and biomarker-negative patients based on the presence or absence of a molecular or genomic biomarker. If there is some preliminary rationale suggesting that the biomarker-positive patients are likely to drive the most benefit of the test treatment, but potential benefit for the biomarker-negative patients cannot be ruled out at the design stage, the interaction-to-overall effects ratio could be a useful measure for assessing the predictive nature of the biomarker (a baseline subgroup factor) when the percentage of patients in the biomarker-positive subgroup of interest is at most 50% of the intent-to-treat patients, as articulated in section 3. From the design and analysis perspectives of subgroups, we recommend consideration of the following two questions:

• Is the study prospectively planned with the positive-subgroup effect size and n+ prespecified?
• Is the magnitude of the subgroup-specific treatment effect clinically relevant?

The importance of the first question in terms of its logical place in trial planning is shown in Fig. 3. As for the second question, it is worth noting that in the usual intent-to-treat study design and analysis, a statistically significant overall treatment effect may or may not be clinically meaningful. In the same vein, we should also


demand that the subgroup-specific treatment effect meet clinical relevance, which can be informally assessed using the observed interaction-to-overall effects ratio in addition to the integrated clinical benefit–risk assessment. In other words, it is important to also account for the subgroup-specific treatment effect and its corresponding sample size during trial planning so as to avoid studying a clinically irrelevant treatment effect in the specific subgroup of ultimate interest. The nonapproval of pleconaril is an example, where the observed treatment effect on treating the common cold was not large. It was argued from public health perspectives that there was no urgent need for this treatment, and that it might cause unwanted side effects (“Cold comfort,” 2002). In hindsight, the observed ratios were 1.4 in one study and less than 1.0 in the other study, implying unclear predictive benefit in the positive reverse-transcription polymerase chain reaction (RT-PCR) subgroup, whose observed prevalence ranged from 62% to 68%.

ACKNOWLEDGMENTS

We thank Dr. Lisa LaVange for her helpful comments. We also thank two anonymous referees for their insightful comments that led to a much improved article.

FUNDING

The regulatory research work presented here was supported by the RSR funds 05-02 and 05-14, provided by the Center for Drug Evaluation and Research of the U.S. Food and Drug Administration. This article reflects the views of the authors and should not be construed to represent the views or policies of the U.S. Food and Drug Administration.

REFERENCES

Brannath, W., Zuber, E., Branson, M., Bretz, F., Gallo, P., Posch, M., Racine-Poon, A. (2009). Confirmatory adaptive designs with Bayesian decision tools for a targeted therapy in oncology. Statistics in Medicine 28:1445–1463.
Brilinta label. (2013). Available at: http://www.accessdata.fda.gov/drugsatfda_docs/label/2013/022433s006lbl.pdf
Brookes, S. T., Whitely, E., Egger, M., Smith, G. D., Mulheran, P. A., Peters, T. J. (2004). Subgroup analyses in randomized trials: Risks of subgroup-specific analyses; power and sample size for the interaction test. Journal of Clinical Epidemiology 57:229–236.
Cold comfort. (2002). Lancet Infectious Diseases 2:385.
Cui, L., Hung, H. M., Wang, S. J., Tsong, Y. (2002). Issues related to subgroup analysis in clinical trials. Journal of Biopharmaceutical Statistics 12:347–358.
Diovan label. (2002). Available at: http://www.accessdata.fda.gov/drugsatfda_docs/label/2002/20665s16lbl.pdf
Erbitux label. (2012). Available at: http://www.accessdata.fda.gov/drugsatfda_docs/label/2012/125084s225lbl.pdf
Food and Drug Administration. (2013). Health Human Services, Code of Federal Regulation, 21CFR314.126. Available at: http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/CFRSearch.cfm?fr=314.126


Food and Drug Administration. (1998). U.S. Food and Drug Administration Guidance for Industry: Providing clinical evidence of effectiveness for human drug and biological products. Available at: http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm078749.pdf
Food and Drug Administration. (2012). Draft guidance for industry: Enrichment strategies for clinical trials to support approval of human drugs and biological products. Available at: http://www.fda.gov/downloads/drugs/guidancecomplianceregulatoryinformation/guidances/ucm332181.pdf (released December 14, 2012, for public comments).
Freidlin, B., Simon, R. (2005). Adaptive signature design: An adaptive clinical trial design for generating and prospectively testing a gene expression signature for sensitive patients. Clinical Cancer Research 11:7872–7878.
Gregson, B. A., Broderick, J. P., Auer, L. M., Batjer, H., Chen, X. C., Juvela, S., Morgenstern, L. B., Pantazis, G. C., Teernstra, O. P. M., Wang, W. Z., Zuccarello, M., Mendelow, A. D. (2012). Individual patient data subgroup meta-analysis of surgery for spontaneous supratentorial intracerebral hemorrhage. Stroke 43:1496–1504.
Hughes, S., Hughes, A., Brothers, C., Spreen, W., Thorborn, D., CNA106030 Study team. (2008). PREDICT-1: The first powered prospective trial of pharmacogenetic screening to reduce drug adverse events. Pharmaceutical Statistics 7:121–129.
Hung, H. M. J., Wang, S. J., O’Neill, R. (2010). Consideration of regional difference in design and analysis of multi-regional trials. Pharmaceutical Statistics 9:173–178.
International Conference on Harmonization. (1998a). International Conference on Harmonization (ICH) guidance, E9 Statistical Principles for Clinical Trials (ICH E9 guidance). February. Available at: http://www.ich.org/fileadmin/Public_Web_Site/ICH_Products/Guidelines/Efficacy/E9/Step4/E9_Guideline.pdf
International Conference on Harmonization. (1998b). International Conference on Harmonization (ICH) guidance, E5 (R1) Ethnic factors in the acceptability of foreign clinical data. February. Available at: http://www.ich.org/fileadmin/Public_Web_Site/ICH_Products/Guidelines/Efficacy/E5_R1/Step4/E5_R1_Guideline.pdf
Johnston, S., Pippen, J., Pivot, X., Lichinitser, M., Sadeghi, S., Dieras, V., Gomez, H. L., Romieu, G., Manikhas, A., Kennedy, M. J., Press, M. F., Malzman, J., Florance, A., O’Rourke, L., Oliva, C., Stein, S., Pegram, M. (2009). Lapatinib combined with letrozole versus letrozole and placebo as first-line therapy for postmenopausal hormone receptor-positive metastatic breast cancer. Journal of Clinical Oncology 27(33):5538–5546.
Louis, A., Cleland, J. G. F., Crabbe, S., Ford, S., Thackray, S., Houghton, T., Clark, A. (2001). Clinical trials update: CAPRICORN, COPERNICUS, MIRACLE, STAF, RITZ-2, RECOVER and RENAISSANCE and cachexia and cholesterol in heart failure. Highlights of the scientific sessions of the American College of Cardiology, 2001. European Journal of Heart Failure 3:381–387.
Martinez, F. J., Grossman, R. F., Zadeikis, N., Fisher, A. C., Walker, K., Ambruzs, M. E., Tennenberg, A. M. (2005). Patient stratification in the management of acute bacterial exacerbation of chronic bronchitis: The role of levofloxacin 750 mg. European Respiratory Journal 25:1001–1009.
Mehta, C., Gao, P., Bhatt, D. L., Harrington, R. A., Skerjanec, S., Ware, J. H. (2009). Optimizing trial design: Sequential, adaptive, and enrichment strategies. Circulation 119:597–605.
Millen, B. A., Dmitrienko, A., Ruberg, S., Shen, L. (2012). A statistical framework for decision making in confirmatory multipopulation tailoring clinical trials. Drug Information Journal. doi:10.1177/0092861512454116.
McNeil, C. (2008). K-Ras mutations are changing practice in advanced colorectal cancer. Journal of National Cancer Institute 100(23):1667–1669.


National Institutes of Health. (1994). NIH guidelines on the inclusion of women and minorities as subjects in clinical research. Federal Register 59:14508–14513.
Oxman, A., Guyatt, G., Green, L., Craig, J., Walter, S., Cook, D. (2002). When to believe a subgroup analysis. In: Guyatt, G., Rennie, D., eds. Users’ guides to the medical literature. A manual for evidence-based clinical practice. Chicago, IL: AMA Press, pp. 553–565.
President signs FDA user fee bill as industry begins watching for implementation issues. (2012). Sidley.com. Available at: http://www.sidley.com/President-Signs-FDA-UserFee-Bill-as-Industry-Begins-Watching-for-Implementation-Issues-07-09-2012
Ranieri, V. M., Thompson, B. T., Barie, P. S., Dhainaut, J. F., Douglas, I. S., Finfer, S., Gardlund, B., Marshall, J. C., Rhodes, A., Artigas, A., Payen, D., Tenhunen, J., Al-Khalidi, H. R., Thompson, V., Janes, J., Macias, W. L., Vangerow, B., Williams, M. D., for the PROWESS-SHOCK Study Group. (2012). Drotrecogin alfa (activated) in adults with septic shock. New England Journal of Medicine 366:2055–2064.
Rothman, M. D., Zhang, J. J., Lu, L., Fleming, T. R. (2012). Testing a prespecified subgroup and the intent-to-treat population. Drug Information Journal 46:175–179.
Rothwell, P. (2005). Subgroup analysis in randomized controlled trials: Importance, indications, and interpretation. Lancet 365:176–186.
Simon, R., Wang, S. J. (2006). Use of genomic signatures in therapeutics development in oncology and other diseases. Advance online publication, January 17, 2006. Pharmacogenomics Journal 6:166–173.
Smith, P. G., Day, N. E. (1984). The design of case-control studies: The influence of confounding and interaction effects. International Journal of Epidemiology 13:356–365.
Sun, X., Briel, M., Busse, J. W., You, J. J., Akl, E. A., Mejza, F., Bala, M. M., Bassler, D., Mertz, D., Diaz-Granados, N., Vandvik, P. O., Malaga, G., Srinathan, S. K., Dahm, P., Johnston, B. C., Alonso-Coello, P., Hassouneh, B., Walter, S. D., Heels-Ansdell, D., Bhatnagar, N., Altman, D. G., Guyatt, G. H. (2012). Credibility of claims of subgroup effects in randomized controlled trials: Systematic review. British Medical Journal 15:344. doi:10.1136/bmj.e1553.
Schultz, K. F., Grimes, D. A. (2005). Multiplicity in randomized controlled trials II: Subgroup and interim analyses. Lancet 365:1657–1661.
Tarceva label. (2013). Available at: http://www.accessdata.fda.gov/drugsatfda_docs/label/2013/021743s018lbl.pdf
Xigris label. (2007). Available at: http://www.accessdata.fda.gov/drugsatfda_docs/label/2007/125029s080LBL.pdf
Wang, S. J. (2007). Biomarker as a classifier in pharmacogenomics clinical trials: A tribute to 30th anniversary of PSI. Pharmaceutical Statistics 6:283–296.
Wang, S. J., Cohen, N., Katz, D. A., Ruano, G., Shaw, P., Spear, B. (2006). Retrospective validation of genomic biomarkers – What are the questions, challenges and strategies for developing useful relationships to clinical outcomes – Workshop summary. Pharmacogenomics Journal 6:82–88.
Wang, S. J., Hung, H. M. J. (2012). Ethnic sensitive or molecular sensitive beyond all regions being equal in multiregional clinical trials. Journal of Biopharmaceutical Statistics 22:879–893.
Wang, S. J., Hung, H. M. J., O’Neill, R. T. (2009). Adaptive patient enrichment designs in therapeutic trials. Biometrical Journal 51(2):358–374.
Wang, S. J., O’Neill, R. T., Hung, H. M. J. (2007). Approaches to evaluation of treatment effect in randomized clinical trials with genomic subset. Pharmaceutical Statistics 6:227–244.
Wang, S. J., O’Neill, R. T., Hung, H. M. J. (2010). Some statistical considerations in evaluating pharmacogenomics confirmatory clinical trials. Clinical Trials 7:525–536.


Wedel, H., DeMets, D., Deedwania, P., Fagerberg, B., Goldstein, S., Gottlieb, S., Hjalmarson, A., Kjekshus, J., Waagstein, F., Wikstrand, J., on behalf of the MERIT-HF Study Group. (2001). Challenges of subgroup analyses in multinational clinical trials: Experiences from the MERIT-HF trial. American Heart Journal 142:502–511.
Yusuf, S., Wittes, J., Probstfield, J., Tyroler, H. A. (1991). Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. Special communication. Journal of the American Medical Association 266(1):93–98.
