Journal of Biopharmaceutical Statistics
ISSN: 1054-3406 (Print) 1520-5711 (Online) Journal homepage: http://www.tandfonline.com/loi/lbps20
A Single Test for Rejecting the Null Hypothesis in Subgroups and in the Overall Sample Yunzhi Lin, Kefei Zhou & Jitendra Ganju To cite this article: Yunzhi Lin, Kefei Zhou & Jitendra Ganju (2016): A Single Test for Rejecting the Null Hypothesis in Subgroups and in the Overall Sample, Journal of Biopharmaceutical Statistics, DOI: 10.1080/10543406.2016.1148718 To link to this article: http://dx.doi.org/10.1080/10543406.2016.1148718
Accepted author version posted online: 18 Feb 2016.
Full Terms & Conditions of access and use can be found at http://www.tandfonline.com/action/journalInformation?journalCode=lbps20 Download by: [Gazi University]
Date: 23 February 2016, At: 07:30
A Single Test for Rejecting the Null Hypothesis in Subgroups and in the Overall Sample Authors: Yunzhi Lin1, Kefei Zhou2, Jitendra Ganju*3
*Corresponding author
Abstract
In clinical trials, some patient subgroups are likely to demonstrate larger effect sizes than other subgroups. For example, the effect size, or informally the benefit with treatment, is often greater in patients with a moderate condition of a disease than in those with a mild condition. A limitation of the usual method of analysis is that it does not incorporate this ordering of effect size by patient subgroup. We propose a test statistic which supplements the conventional test by including this information and simultaneously tests the null hypothesis in pre-specified subgroups and in the overall sample. It results in more power than the conventional test when the differences in effect sizes across subgroups are at least moderately large; otherwise it loses power. The method involves combining p-values from models fit to pre-specified subgroups and the overall sample in a manner that assigns greater weight to subgroups in which a larger effect size is expected. Results are presented for randomized trials with two and three subgroups.

Keywords: subgroups; minimum p-value; power
INTRODUCTION

Patients enrolled in randomized and blinded clinical trials comprise a heterogeneous group with respect to disease severity, medical history, demographics, and the like. Those that share similar characteristics at baseline (i.e., pre-treatment) form a subgroup who may respond to treatment differently compared to patients belonging to another subgroup. Examples of subgroups include disease severity (mild / moderate / severe), age (elderly / middle-aged / young), or number of medications prior to study entry (1 / 2). It is usually anticipated that the effect size (the treatment effect divided by the standard deviation) may vary systematically by subgroup and in some cases the ordering can be identified in advance. This article proposes a test statistic that incorporates this knowledge for testing the null hypothesis of no treatment effect in two-group trials.

Three examples of varying treatment effects by subgroup are shown in Table 1. Examples 1, 2 and 3 are from trials in type 2 diabetes mellitus (Canagliflozin Briefing Document 2013), heart failure (SOLVD Investigators 1992), and anemia (Corwin et al. 1997), respectively. In each case the treatment effect varies substantially by subgroup. For instance, in Example 1, the treatment effect in the subgroup with baseline HbA1c < 8% (the subgroup with less disease severity) is less than half of that in the > 9% subgroup (the subgroup with greater disease severity). The distance between the lower limit of the 95% confidence interval of the < 8% (-0.52%) and the upper limit of the ≥ 9% (-0.90%) subgroups is also very large, providing clear evidence of a much greater benefit with treatment in the subgroup with the greatest disease severity. This systematic change in treatment effect by subgroup is also expected in many other disease areas.
[Table 1 here]

The conventional approach to data analysis is to model the response variable as a function of treatment, the baseline characteristic (e.g. disease severity) and perhaps its interaction with treatment. The model, however, does not use the additional information, namely, that the effect size varies systematically by subgroup. This article presents a method that uses this knowledge, and proposes a single test statistic for simultaneously testing the null in one or more subgroups and in the overall sample. In outline, the method proposed works as follows: (a) Pre-specify the ordering of effect size by subgroup. (b) Obtain p-values from models fit to the overall sample and to the pre-specified subgroups. (c) For inference, derive a single p-value which is a function of the p-values from these analyses. Obtain the p-value of this single statistic using the permutation approach to control the type I error rate at its designated value. The choice of a p-value combining function to get a single p-value is discussed later. When one subgroup has more power than another subgroup, the test based on combining p-values can result in greater power than the usual test; otherwise there is loss in power.

The rest of the article is organized as follows. Section 2 describes the proposed method, Section 3 provides simulation results showing when the test gives more or less power, Section 4 illustrates the method with an example, and remarks are made in Section 5.
THE SINGLE TEST FOR JOINT TESTING

The notation used and the description of the permutation method will be kept similar to Edwards (1999). Random variables and their realizations will be denoted by upper case and lower case letters, respectively; matrices will use bold typeface. Let $N$ denote the total number of patients to compare two treatments, $\mathbf{Z}$ (coded 0 for one treatment and 1 for the other) denote the random treatment allocation, $\Omega$ the collection of all possible treatment allocations, and $\mathbf{Y}$ the response vector, where the dimensions of $\mathbf{Z}$ and $\mathbf{Y}$ equal $N \times 1$. Each $Y_i$ is assumed to have the same variance.

The null hypothesis $H$ of no treatment effect across all subgroups states that the response $Y_i$ of subject $i$ will be the same regardless of what randomized treatment is received. See Rosenbaum (2002, p. 39) for a preference for this null over the null that says that on average there is no effect. Thus, if the null is not true in just one subgroup, it is also not true in the overall sample. The converse, however, that the null is not true in the overall sample implies that the null is not true in a particular subgroup, does not hold.

The method is described for a single variable (e.g. HbA1c) or factor (e.g. disease severity) that can be split into subgroups. Let $J$ denote the number of subgroups associated with a single variable or single factor (e.g. $J = 3$ if the subgroups are mild, moderate and severe disease severity). We set the convention that $S_1$ denotes the subset of data associated with the subgroup with the largest assumed effect size, with p-value for treatment denoted $p_1$; $S_2$ denotes the subset of data associated with the subgroup with the next largest assumed effect size, with p-value $p_2$; and so on, until $S_J$, which denotes the subset of data associated with the subgroup with the smallest assumed effect size, with p-value $p_J$. The subsets $S_1, S_2, \ldots, S_J$ form a mutually exclusive and exhaustive partitioning of the data. Let $G_1, G_2, \ldots, G_J$ denote the subgroups associated with the data subsets $S_1, S_2, \ldots, S_J$, resp.
Let $S_{j,k}$ denote the subset of data which combines subgroups $j$ $(= 1, 2, 3, \ldots, J)$ and $k$ $(= 1, 2, 3, \ldots, J)$, where $j \neq k$, with p-value $p_{j,k}$. Similarly define $S_{j,k,l}$, where $j \neq k \neq l$, with p-value $p_{j,k,l}$, and so on such that $S_{1,2,3,\ldots,J}$, with p-value $p_{1,2,3,\ldots,J}$, refers to the entire sample. Thus, for example, with Example 1 of Table 1, $S_1$ refers to the subset of data defined by subjects with baseline HbA1c > 9%, $S_2$ refers to the subset of data defined by subjects with baseline HbA1c between 8 and 9%, $S_{1,2}$ refers to the ≥ 8% subgroup, and $S_{1,2,3}$ refers to the entire sample, with p-values $p_1$, $p_2$, $p_{1,2}$, and $p_{1,2,3}$, resp. Let $G_{j,k}, G_{j,k,l}, \ldots, G_{1,2,3,\ldots,J}$ denote subgroups associated with the data subsets $S_{j,k}, S_{j,k,l}, \ldots, S_{1,2,3,\ldots,J}$, resp.

The three-step procedure is as follows.

(1) The first step is to pre-specify subsets of data for inclusion in analysis based on an ordering of expected effect size. In its most general form the subsets could be any combination of mutually exclusive subsets and $S_{1,2,3,\ldots,J}$, or a sequence of nested subsets. We emphasize the latter because we expect it to be more powerful if the order of effect sizes is correct. Let $\tilde{S}$ denote the set of datasets. For example, one may choose $\tilde{S} = \{S_1, S_{1,2,3}\}$.

(2) The second step is to obtain p-values from models fit to the pre-specified datasets. Corresponding to $\tilde{S}$, let $\tilde{p}$ denote the associated set of p-values. For $\tilde{S} = \{S_1, S_{1,2,3}\}$, we have $\tilde{p} = \{p_1, p_{1,2,3}\}$.

(3) The final step is to combine the p-values using a p-value combining function and obtain its p-value. Let $T(\tilde{S}, \mathbf{Y}, \mathbf{Z})$ denote the p-value combining function with observed value denoted as $\tilde{p}_{obs}$. Because the p-values contained in $\tilde{p}$ are correlated, the permutation distribution, described next, is used to obtain the p-value of $\tilde{p}_{obs}$ (Edwards 1999; Dudoit et al. 2003; Ganju et al. 2013; Ganju and Ma 2014). The procedure ensures weak control of the type I error rate.
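As an illustration of step (1), the following minimal sketch builds a sequence of nested data subsets from subgroup labels listed in order of expected effect size. The function name `nested_subsets` and the label strings are hypothetical and not part of the article; subsets are represented simply as lists of subject indices.

```python
def nested_subsets(labels, order):
    """Build nested index lists for subgroups taken in the given order.

    labels: subgroup label of each subject (one entry per subject).
    order:  subgroup labels sorted from largest to smallest expected
            effect size (an assumption supplied by the analyst).
    Returns one index list per step: the first subgroup alone, then the
    first two pooled, and so on up to the entire sample.
    """
    subsets = []
    included = set()
    for label in order:
        included.add(label)
        subsets.append([i for i, g in enumerate(labels) if g in included])
    return subsets


# Hypothetical example with five subjects and three severity levels.
labels = ["severe", "mild", "moderate", "severe", "mild"]
order = ["severe", "moderate", "mild"]
subsets = nested_subsets(labels, order)
```

Here `subsets` is `[[0, 3], [0, 2, 3], [0, 1, 2, 3, 4]]`: the first list plays the role of the subset with the largest expected effect, and the last is the entire sample.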
Let $c_\alpha(\tilde{S}, \mathbf{Y}, \mathbf{Z})$ denote the $\alpha$th percentile of the distribution of $T(\tilde{S}, \mathbf{Y}, \mathbf{Z})$; note that $\mathbf{Z}$ is random under $T(\tilde{S}, \mathbf{Y}, \mathbf{Z})$. Under $H$, $T(\tilde{S}, \mathbf{Y}, \mathbf{Z})$ is a level $\alpha$ test if

$$\Pr\{T(\tilde{S}, \mathbf{Y}, \mathbf{Z}) \le c_\alpha(\tilde{S}, \mathbf{Y}, \mathbf{Z})\} \le \alpha.$$

The above probability can be rewritten as

$$\sum_{\mathbf{z}} \Pr\{\mathbf{Z} = \mathbf{z}\}\, I\{T(\tilde{S}, \mathbf{Y}, \mathbf{z}) \le c_\alpha(\tilde{S}, \mathbf{Y}, \mathbf{Z})\} \le \alpha,$$

where $I\{\cdot\}$ is the indicator function and the summation is over all $\mathbf{z}$ in $\Omega$.

The permutation-based p-value of $\tilde{p}_{obs}$ is

$$\Pr\{T(\tilde{S}, \mathbf{Y}, \mathbf{Z}) < \tilde{p}_{obs}\} = \sum_{\mathbf{z}} \Pr\{\mathbf{Z} = \mathbf{z}\}\, I\{T(\tilde{S}, \mathbf{Y}, \mathbf{z}) < \tilde{p}_{obs}\}. \tag{1}$$

$H$ is rejected when $\Pr\{T(\tilde{S}, \mathbf{Y}, \mathbf{Z}) < \tilde{p}_{obs}\} \le \alpha$ or, equivalently, when $\tilde{p}_{obs} \le c_\alpha(\tilde{S}, \mathbf{Y}, \mathbf{Z})$.

We choose $T(\tilde{S}, \mathbf{Y}, \mathbf{Z}) = \min(\tilde{p})$ as the p-value combining function. For simplicity in notation, we let minp denote $\min(\tilde{p})$, and $p_{minp}$ denote the p-value of $\tilde{p}_{obs}$. Under a repeated sampling framework and assuming independence between p-values, it is known that minp follows a $Beta(1, K)$ distribution where $K$ denotes the number of p-values included in $\tilde{p}$ (David 1980). However, the $Beta$ distribution does not hold when the p-values are correlated; hence the reliance on the permutation distribution.
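The permutation step can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch under stated assumptions, not the authors' implementation: each subset's p-value comes from a normal-approximation test of a difference in means (the article fits models and uses t or Wald tests), random permutations stand in for the full enumeration of allocations, and the p-value uses a non-strict comparison with an add-one correction, a common conservative convention that differs slightly from the strict inequality in (1). The names `two_sided_p`, `minp_permutation_test` and `independent_minp_pvalue` are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, variance


def two_sided_p(y, z):
    """Two-sided p-value for a difference in means (normal approximation)."""
    treated = [yi for yi, zi in zip(y, z) if zi == 1]
    control = [yi for yi, zi in zip(y, z) if zi == 0]
    if len(treated) < 2 or len(control) < 2:
        return 1.0  # too few subjects in one arm; treat as uninformative
    se = math.sqrt(variance(treated) / len(treated)
                   + variance(control) / len(control))
    if se == 0.0:
        return 1.0
    stat = (mean(treated) - mean(control)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(stat)))


def minp_stat(y, z, subsets):
    """Smallest p-value over the pre-specified subsets (lists of indices)."""
    pvals = [two_sided_p([y[i] for i in idx], [z[i] for i in idx])
             for idx in subsets]
    return min(pvals), pvals


def minp_permutation_test(y, z, subsets, n_perm=1000, seed=0):
    """Return (observed minp, subset p-values, permutation p-value of minp)."""
    rng = random.Random(seed)
    observed, pvals = minp_stat(y, z, subsets)
    z_perm = list(z)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(z_perm)  # permute treatment labels; responses stay fixed
        if minp_stat(y, z_perm, subsets)[0] <= observed:
            hits += 1
    return observed, pvals, (hits + 1) / (n_perm + 1)


def independent_minp_pvalue(minp, k):
    """Beta(1, K) tail for minp; valid only if the K p-values are independent."""
    return 1.0 - (1.0 - minp) ** k
```

Because the permutation distribution calibrates the test, the approximate subset p-values affect power rather than validity. The `independent_minp_pvalue` helper is included only for comparison with the independence case mentioned above; with nested subsets the p-values are correlated and it should not be used for inference.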
With hierarchically organized subsets of data, inference works as follows. If $p_{minp} \le \alpha$, then the null hypothesis associated with the subgroup that yielded $\tilde{p}_{obs}$ can be rejected. Because the subsets of data are incrementally hierarchical, the null hypotheses associated with subgroups containing the subgroup that yielded $\tilde{p}_{obs}$, which includes the overall sample, can also be rejected. If $p_{minp} > \alpha$, then no subgroup can be declared significant. For example, if the null hypothesis associated with $G_1$, $H(G_1)$, is not true, it implies that the null associated with $G_{1,2}$, $H(G_{1,2})$, is not true. However, $H(G_{1,2})$ not true does not imply $H(G_1)$ not true. To be specific, when $\tilde{S} = \{S_1, S_{1,2}, S_{1,2,3}\}$, then if:

(i) $p_{minp} \le \alpha$ and $p_1 = \min(p_1, p_{1,2}, p_{1,2,3})$, then $H(G_1)$, $H(G_{1,2})$, and $H(G_{1,2,3})$ can be rejected, even if $p_{1,2}$ and $p_{1,2,3} > \alpha$.

(ii) $p_{minp} \le \alpha$ and $p_{1,2} = \min(p_1, p_{1,2}, p_{1,2,3})$, then $H(G_{1,2})$ and $H(G_{1,2,3})$ can be rejected. However, $H(G_1)$ cannot be rejected.

(iii) $p_{minp} \le \alpha$ and $p_{1,2,3} = \min(p_1, p_{1,2}, p_{1,2,3})$, then $H(G_{1,2,3})$ can be rejected. However, $H(G_1)$ and $H(G_{1,2})$ cannot be rejected.

(iv) $p_{minp} > \alpha$, no null can be rejected.

SIMULATION RESULTS

In this section we show how the power of the proposed test for the overall sample compares with
conventional tests. It is convenient to demonstrate the power for linear models for which the true model for subgroup $j$ is

$$y = \mu + \theta_j z + \epsilon, \tag{2}$$

where $\mu$ denotes a constant, $\theta_j$ denotes the treatment effect (difference in means) for the $j$th subgroup, and $\epsilon$ denotes the error term. When a subset of data includes $J \ge 2$ subgroups, the model fit to the data may be

$$y = \mu + \theta z + \sum_{i=1}^{J-1} \beta_i x_i + \epsilon, \tag{3}$$

or

$$y = \mu + \theta z + \sum_{i=1}^{J-1} \beta_i x_i + \sum_{i=1}^{J-1} \gamma_i x_i z + \epsilon, \tag{4}$$

where $\theta$ denotes the weighted average treatment effect across the $J$ subgroups. The term $x_i$ identifies the subgroups with $J$ levels. It is parameterized so that it has $J - 1$ degrees of freedom (e.g. the factor "sex" has two levels, but has one degree of freedom; $x$ may take the value -1 for females and 1 for males). The coefficients, $\beta_i$ and $\gamma_i$, are associated with each of the $J - 1$ degrees of freedom of $x_i$ and $x_i z$, respectively.

Next we evaluate the power of minp compared to the power of the conventional test. The conventional test refers to the test statistic for treatment from fitting models (3) or (4), as appropriate. Under the assumption of normally distributed errors, this test statistic follows a t-distribution. Henceforth, we refer to the conventional test as the t test. When model (4) is fit, we use the test for treatment associated with the Type II method, i.e. the treatment effect is not adjusted for interaction but the error variance is estimated after removing the effect of interaction (Fleiss 1986). An explicit formula for the conventional t-test (using the Type II method) from fitting models (3) or (4) is given in Ganju and Mehrotra (2003). Under interaction and random subgroup sizes, the size of the t test is inflated (Ganju and Mehrotra 2003), but for the cases considered the inflation is slight. We thus treat the power from the interaction model as the correct estimated power.

We consider trials with 2 and 3 subgroups for illustration. Data are generated and power is calculated as follows: $M$ datasets are generated according to (4) with random treatment assignment and with sample size per treatment equal to $N/2$. Results are invariant to the choice of $\mu$, $\beta_i$, and $\gamma_i$. Within each dataset each randomly selected subject has an equal probability of belonging to any of the subgroups, so the subgroups are of the same size on average (a few cases with unequal probabilities are also included). The errors $\epsilon$ follow a unit normal distribution.

For each run of the simulation, calculate $\tilde{p}_{obs}$ from $\tilde{S}$. To calculate $p_{minp}$ as per (1) requires calculation of $\Pr\{\mathbf{Z} = \mathbf{z}\}$ via $\Omega$, which is computationally prohibitive. Similar results can be obtained while saving much computational time by selecting random samples from $\Omega$ from which $\Pr\{\mathbf{Z} = \mathbf{z}\}$ is estimated (Edwards 1999; Ganju and Ma 2014). Let $L$ denote the number of permutation-based datasets generated for each simulation run. For each of the $L$ datasets the treatment assignment $\mathbf{Z}$ is randomly permuted to generate the permutation distribution, and all else is held constant (Dudoit et al. 2003; Ganju et al. 2013). For the $i$th simulation run, let $p_{minp,i}$ denote the permutation-based p-value of minp. Expressed as a percentage, power for the procedure is estimated as $\sum_{i=1}^{M} I\{p_{minp,i} \le \alpha\} \times 100 / M$. Results shown in Tables 2 and 3 are based on $M = 2000$, $L = 1000$, two-sided p-values and $\alpha = 0.05$ (except for the null case in Table 2 for which $M = 5000$ and $L = 1000$).
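The simulation loop described above can be sketched as follows. This is a simplified illustration, not the code behind Tables 2 and 3: it sets the constant and the subgroup main effects to zero, uses a normal-approximation test of a difference in means within each subset instead of model-based t tests, always uses the nested sequence of subsets, and applies an add-one correction to the permutation p-value. The function names and default arguments are hypothetical, and the defaults are far smaller than the 2000 simulations and 1000 permutations reported in the article.

```python
import math
import random
from statistics import NormalDist


def p_value(y, z):
    """Two-sided p-value for a difference in means (normal approximation)."""
    t = [yi for yi, zi in zip(y, z) if zi == 1]
    c = [yi for yi, zi in zip(y, z) if zi == 0]
    if len(t) < 2 or len(c) < 2:
        return 1.0
    mt, mc = sum(t) / len(t), sum(c) / len(c)
    vt = sum((v - mt) ** 2 for v in t) / (len(t) - 1)
    vc = sum((v - mc) ** 2 for v in c) / (len(c) - 1)
    se = math.sqrt(vt / len(t) + vc / len(c))
    if se == 0.0:
        return 1.0
    return 2.0 * (1.0 - NormalDist().cdf(abs(mt - mc) / se))


def minp(y, z, subsets):
    """Smallest subset p-value for a given treatment assignment."""
    return min(p_value([y[i] for i in s], [z[i] for i in s]) for s in subsets)


def estimate_power(effects, n_total=80, n_sims=200, n_perm=200,
                   alpha=0.05, seed=0):
    """Percentage of simulated trials whose permutation p-value is <= alpha.

    effects[j] is the treatment effect in subgroup j, listed from the
    largest expected effect to the smallest; errors are unit normal.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        z = [1] * (n_total // 2) + [0] * (n_total - n_total // 2)
        rng.shuffle(z)  # random assignment, equal arms
        group = [rng.randrange(len(effects)) for _ in range(n_total)]
        y = [effects[g] * zi + rng.gauss(0.0, 1.0) for g, zi in zip(group, z)]
        # Nested subsets: subgroup 0, subgroups 0-1, ..., entire sample.
        subsets = [[i for i, g in enumerate(group) if g <= k]
                   for k in range(len(effects))]
        observed = minp(y, z, subsets)
        z_perm = list(z)
        hits = 0
        for _ in range(n_perm):
            rng.shuffle(z_perm)
            hits += minp(y, z_perm, subsets) <= observed
        rejections += (hits + 1) / (n_perm + 1) <= alpha
    return 100.0 * rejections / n_sims
```

A call such as `estimate_power([0.75, 0.25])` mirrors the layout of one two-subgroup configuration, but because of the simplifications above its output should not be expected to reproduce the tabulated values.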
Two Subgroups: Table 2 shows the power when the total sample size $N = 80$ (or $n = 40$ per treatment group) and $\tilde{S} = \{S_1, S_{1,2}\}$, $\tilde{S} = \{S_2, S_{1,2}\}$, and $\tilde{S} = \{S_1, S_2, S_{1,2}\}$, where, as noted earlier, $S_1$ is the subgroup with the largest effect size, $S_2$ is the subgroup with the next largest effect size and $S_{1,2}$ denotes the overall sample. For large values of $\theta_1 - \theta_2$, or equivalently, when the power of the test for $S_1$ is much larger than the power of the same test for $S_2$, $\tilde{S} = \{S_1, S_{1,2}\}$ is the desired set. However, for comparison purposes, power is also provided for $\tilde{S} = \{S_2, S_{1,2}\}$ and $\tilde{S} = \{S_1, S_2, S_{1,2}\}$. We compare the power of minp with the t test derived from fitting models (3) or (4) on the entire dataset. Also included in Table 2 is the power of the model based on $S_1$.

When $\tilde{S} = \{S_1, S_{1,2}\}$, data from individual subjects are included an unequal number of times in the analysis. Subjects who belong to $S_1$ are also included in $S_{1,2}$, and are thus counted twice, whereas subjects belonging to $S_2$ are only counted once, and that is in $S_{1,2}$. The procedure thus weights subject data unequally: those who belong to the subgroup with a larger effect size are given a larger weight than those who belong to the subgroup with a smaller effect size.

[Table 2 here]

Similarly, with $\tilde{S} = \{S_2, S_{1,2}\}$, subjects belonging to $S_2$ are counted twice, and those belonging to $S_1$ are counted once. However, this choice of $\tilde{S}$ is not advisable because it includes $S_2$ which has lower power than $S_1$. With $\tilde{S} = \{S_1, S_2, S_{1,2}\}$, each subject is counted twice.
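The implicit weighting can be made concrete by counting how many of the chosen datasets each subject appears in. The sketch below is illustrative only; the function name `inclusion_counts` and the encoding of each dataset as a set of subgroup labels are assumptions made for the example.

```python
def inclusion_counts(subgroup_of, datasets):
    """Count how many of the chosen datasets each subject appears in.

    subgroup_of: subgroup label of each subject.
    datasets:    one set of subgroup labels per dataset in the analysis,
                 e.g. {1} for subgroup 1 alone and {1, 2} for the overall sample.
    """
    return [sum(g in d for d in datasets) for g in subgroup_of]


# Four hypothetical subjects: two in subgroup 1, two in subgroup 2.
subgroup_of = [1, 1, 2, 2]
counts_s1_overall = inclusion_counts(subgroup_of, [{1}, {1, 2}])
counts_s2_overall = inclusion_counts(subgroup_of, [{2}, {1, 2}])
counts_all_three = inclusion_counts(subgroup_of, [{1}, {2}, {1, 2}])
```

The three results are `[2, 2, 1, 1]`, `[1, 1, 2, 2]` and `[2, 2, 2, 2]`, matching the counts described in the text for the three choices of datasets.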
When $\tilde{S} = \{S_1, S_{1,2}\}$, the power of minp increases as the magnitude of $\theta_1 - \theta_2$ increases, or equivalently as the power of the test applied to $S_1$ increases relative to $S_2$. When the effect sizes are the same between subgroups or if the difference in the effect size is small, there is loss in power. For example, when $\theta_1 = .75$ and $\theta_2 = .65$, the t test yields more power (86% vs 83%). However, for larger differences between $\theta_1$ and $\theta_2$ shown in the table, minp yields more power. For example, when $\theta_1 = .75$ and $\theta_2 = .25$, the t test has 59% power (under the interaction model), and minp has 65% power.

When the sample sizes per subgroup are unequal, the results depend to a large extent on whether the larger effect size is associated with the larger or smaller subgroup. For example, as shown in Table 2, when $\theta_1 = 0.75$, $\theta_2 = 0.25$ with sample sizes 54 and 26, resp., the power with the conventional test is 75%. With minp the power is 79% when $\tilde{S} = \{S_1, S_{1,2}\}$. When the sample sizes are reversed, the power with the conventional test and with minp equal 42% and 47%, resp.

To illustrate the loss in power when an incorrect ordering is pre-specified, consider what happens when $\tilde{S} = \{S_2, S_{1,2}\}$, $\theta_1 = .75$ and $\theta_2 = .25$. The power with minp is 49%, which compares very unfavorably with the 59% power from the t test. The loss in power with minp is greater for larger values of $\theta_1 - \theta_2$. Thus, identification of the right set of subgroups for inclusion in the analysis is important.

Power is also shown when $\tilde{S} = \{S_1, S_2, S_{1,2}\}$. Here each subject is counted twice. It is interesting to observe that the method based on counting each subject twice can give greater power than the method based on counting each subject once (i.e. the conventional test). For example, when $\theta_1 = 1$ and $\theta_2 = .2$, the proposed approach has 81% power whereas the t test has 74% power. For minp to have more power than the t test, the magnitude of $\theta_1 - \theta_2$ has to be large; otherwise it has less power.
Although the performance of different combining functions has been studied under independence (Birnbaum 1954; Loughin 2004) and dependence (Ganju and Ma 2014) of p-values, the conclusion is that no one method is uniformly the best. Next we briefly discuss the performance of one of the different combining functions, Fisher's combination test (FCT), which is a function of the product of the p-values contained in $\tilde{p}$. For instance, for $\tilde{S} = \{S_1, S_{1,2}\}$, FCT is $T(\tilde{S}, \mathbf{y}, \mathbf{z}) = -2\ln(p_1 \times p_{1,2})$. Under independence of p-values FCT follows a $\chi^2_{2K}$ distribution where $K$ denotes the number of p-values in $\tilde{p}$ (Fisher 1932). In our case, because of correlated p-values, power for FCT was obtained in the same way as for minp using the permutation method. For $\theta_1 = .75$ and $\theta_2 = .25$, and for $\tilde{S} = \{S_1, S_{1,2}\}$, $\tilde{S} = \{S_2, S_{1,2}\}$, and $\tilde{S} = \{S_1, S_2, S_{1,2}\}$, power (%) with FCT equals (for comparison, the power of minp is shown parenthetically), resp., 68 (65), 41 (49), 59 (61). As shown for this configuration and for others (results not included), neither statistic is uniformly better than the other.

Three Subgroups: Power for the test is shown when $N = 120$ (or $n = 40$ per treatment group)
and $\tilde{S} = \{S_1, S_{1,2}, S_{1,2,3}\}$ for various values of $\theta_1$, $\theta_2$, and $\theta_3$ in Table 3. As shown for the case with two subgroups, there is loss in power with the minp procedure compared to the t test if the treatment effect across subgroups is not much different. For example, when $\theta_1 = \theta_2 = \theta_3 = .55$ (i.e. no interaction), the t test has 84% power and the minp test has 78% power. Simulation results also indicate that if the power of the test associated with either $S_1$ or $S_{1,2}$ is larger than the power of the test associated with $S_{1,2,3}$, the power of minp is also larger. It is instructive to compare the power for configurations $\theta_1 = .9$, $\theta_2 = .4$, $\theta_3 = .2$, and $\theta_1 = 1.0$, $\theta_2 = .3$, $\theta_3 = .2$. Because the subgroups are equally sized, both configurations yield the same overall treatment effect of .5, and thus the same power for $S_{1,2,3}$ (77%). Likewise with $S_{1,2}$, the treatment effect is .65 and the power is the same (81%). However, with minp the latter configuration gives more power than the former (86% vs 83%), and that is because the power of the test associated with $S_1$ is also higher (86% vs 78%).

[Table 3 here]

EXAMPLE

Table 4 shows hypothetical data from a randomized clinical trial involving two treatments and two subgroups. The response variable is whether or not a patient has healed. A total of 200 patients are randomized in a 1:1 ratio to treatment or placebo. The data are from Lachin (2000; pages 90-91). The example has three subgroups; however, because of small sample sizes we combine the first two subgroups into one subgroup. We label the first subgroup "mild disease severity" and the second "moderate disease severity". We assume randomization was not stratified by disease severity.

[Table 4 here]
Let $S_1$ and $S_2$, resp., represent datasets associated with the moderate ($G_1$) and mild ($G_2$) subgroups. The one-sided p-values are 0.011, 0.358 and 0.033 for $S_1$, $S_2$, and $S_{1,2}$ from a logistic regression analysis, where for $S_{1,2}$ the analysis is adjusted for disease severity. The model including the interaction between treatment and subgroup gives a similar p-value as the model without interaction (0.034). All model-based p-values are from the Wald statistic. For declaring significance, one-sided p-values are compared to $\alpha = 0.025$. With the conventional method, treatment is not statistically superior to placebo because the Wald test p-value of 0.033 > 0.025.

The p-values for minp based on $\tilde{S} = \{S_1, S_{1,2}\}$, $\tilde{S} = \{S_2, S_{1,2}\}$, and $\tilde{S} = \{S_1, S_2, S_{1,2}\}$ equal, resp., 0.020, 0.065, and 0.036. Each p-value is based on 5000 permutation datasets. The conclusion, for example, with the choice of $\tilde{S} = \{S_1, S_{1,2}\}$ is that the p-value of 0.020 is small enough to suggest that treatment is statistically superior to placebo in the subgroup with moderate disease severity and in the overall sample. However, because the benefit with treatment in the mild subgroup is very weak, it is reasonable to ask if treatment is efficacious in the entire sample. That discussion is deferred to the next section. If $\tilde{S} = \{S_2, S_{1,2}\}$ or $\tilde{S} = \{S_1, S_2, S_{1,2}\}$, treatment cannot be judged statistically superior to placebo in any represented subgroup or in the overall sample.
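The hierarchical inference rule for nested datasets can be written as a small function and checked against the p-values quoted in this example. This is a sketch of the rule as described in Section 2 for nested subsets only; the function name `rejected_nulls` and the text labels are hypothetical.

```python
def rejected_nulls(nested_pvalues, p_minp, alpha):
    """Return the labels of the nulls that can be rejected.

    nested_pvalues: list of (label, p-value) pairs ordered from the
                    smallest nested subset to the overall sample.
    p_minp:         permutation-based p-value of the minimum p-value.
    alpha:          significance level.
    """
    if p_minp > alpha:
        return []  # no null can be rejected
    # Index of the subset that produced the minimum p-value.
    smallest = min(range(len(nested_pvalues)),
                   key=lambda i: nested_pvalues[i][1])
    # Reject that subset's null and the nulls of every subset containing it.
    return [label for label, _ in nested_pvalues[smallest:]]


# One-sided p-values and permutation p-values quoted in the example.
moderate_then_overall = rejected_nulls(
    [("moderate", 0.011), ("overall", 0.033)], p_minp=0.020, alpha=0.025)
mild_then_overall = rejected_nulls(
    [("mild", 0.358), ("overall", 0.033)], p_minp=0.065, alpha=0.025)
```

With the quoted values, `moderate_then_overall` is `["moderate", "overall"]` and `mild_then_overall` is `[]`, in line with the conclusions stated for those two choices of datasets.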
REMARKS

To avoid bias, it is necessary to emphasize that the subgroups should be pre-specified in the same way that analytical methods are fully described before unblinding the trial. Two limitations to the interpretation of results that apply to the traditional method and would also apply here are now mentioned. The first is when qualitative interaction (Gail and Simon 1985) between treatment and the subgroup factor is present. In such a case, regardless of the method of analysis, the results are ambiguous and difficult to interpret. The other concerns the appropriateness of making claims on the overall sample when some subgroups show no effect. In the context of biomarker positive and negative patients, Rothmann et al. (2012) correctly note that if only patients who are biomarker positive benefit from treatment and those who are biomarker negative do not, then it is not appropriate to claim benefit in the overall sample because of a statistically favorable result in the overall sample. A strong treatment effect in the biomarker-positive sample coupled with a null effect in the biomarker-negative sample can still give a small enough p-value in the overall sample to declare success. But it would not be correct to conclude that all patients, regardless of biomarker positive or negative status, benefit from treatment. Any claim of benefit should be limited to just the biomarker positive patients.

With other types of subgroups, similar restrictions in interpretation may or may not apply. Example 1 of Table 1 shows all subgroups benefitting from treatment but that is not the case in Examples 2 and 3. The estimated treatment effects in the last subgroup in Example 2, and in the last two subgroups in Example 3, are too small to suggest benefit. The implication with the last subgroup in Example 1 is that a trial done only in patients with low baseline HbA1c levels will demonstrate a meaningful reduction in HbA1c levels. With Examples 2 and 3, whether or not claims of benefit would generalize to the entire sample is often a subject-matter specific discussion which is out of the scope of the article. A successful result interpreted to include the overall sample rests on the point of view that a large enough sample size in the subgroups with weak effects would have demonstrated meaningful benefit. This judgment, however, is independent of the method of analysis. This consideration in interpreting results across all subgroups applies to the conventional test and to the test that we propose. The above discussion is when trials are interpreted after data are observed. For prospective evaluation of effect sizes in subgroups and the overall sample, see Millen et al. (2012 and 2014). Their framework requires designing the trial such that valid claims of benefit can be made across the subgroups and the overall sample.

The statistical literature on subgroups is replete with the admonition to not take the observed data at face value, particularly when the subgroups are not called out in advance (Wittes 2009; Yusuf et al. 1991). Methods for adjusting p-values have been proposed to avoid an excess of false positives (Bristol 1997; Koch 1997; Biesheuvel and Hothorn 2003). The context is usually the transition to interpreting subgroups after evaluating the overall result. This article presents a statistic for joint testing in subgroups and in the overall sample. It is premised on the observation that in many therapeutic areas the effect size varies systematically by patient subgroups, which the conventional method of analysis does not exploit. The proposed method explicitly recognizes the ordering in effect size by subgroup, and is shown to have moderate sized improvements in power for the combined sample when the differences in effect sizes between the subgroups are large. Conversely, and as expected, the method also loses power if the wrong subgroup gets selected. Our recommendation is to use the method as an alternative to the conventional method of analysis when there is joint interest in rejecting the null in the overall sample and in subgroups.
One criticism of the method might be the role played by the stronger subgroups in generalizing the result across the overall sample. A significant result in the overall sample might lead to the conclusion that treatment is efficacious in the weaker subgroups. Our response to this is two-fold. First note that the concern extends even to the traditional analysis. The method does not preclude the conclusion that treatment is not efficacious in a weaker subgroup even if the result is significant in the overall sample. The points made by Rothmann et al. (2012) for a traditional analysis, as noted earlier, apply to the proposed analysis as well. The second is to understand the context for use of the method. When treatment has the potential to provide meaningful benefit in all subgroups, the proposed statistic has the potential to yield more power than the conventional test statistic to reject the null hypothesis in the overall sample while simultaneously allowing for rejecting the null in pre-specified subgroups.

The proposed testing procedure does not provide an unbiased estimate of the treatment effect although it is bracketed by the estimated treatment effects observed with the individual datasets contained in $\tilde{S}$. While the article provided results for two or three subgroups for linear models, extensions to more than three subgroups and to non-linear models follow directly from the three-step procedure noted in Section 2.

Acknowledgments

We thank two reviewers for a careful review. One reviewer in particular provided us with excellent suggestions for which we are grateful.

REFERENCES
Biesheuvel, E.H.E., Hothorn, L.A. (2003). Protocol designed subgroup analyses in multiarmed clinical trials: multiplicity aspects. Journal of Biopharmaceutical Statistics 13: 663-673.

Birnbaum, A. (1954). Combining independent tests of significance. Journal of the American Statistical Association 49: 559-575.

Bristol, D.R. (1997). p-value adjustments for subgroup analyses. Journal of Biopharmaceutical Statistics 7: 313-321.

Canagliflozin briefing book for FDA Advisory Committee meeting (2013). http://www.fda.gov/downloads/AdvisoryCommittees/CommitteesMeetingMaterials/Drugs/EndocrinologicandMetabolicDrugsAdvisoryCommittee/UCM334551.pdf

Corwin, H.L., Gettinger, A., Fabian, T.C., May, A., Pearl, R.G., Heard, S., An, R., Bowers, P.J., Burton, P., Klausner, M.A., Corwin, M.J. (1997). Efficacy and safety of epoetin alpha in critically ill patients. New England Journal of Medicine 357: 965-976.

David, H.A. (1980). Order Statistics. Wiley.

Dudoit, S., Shaffer, J.P., Boldrick, J.C. (2003). Multiple hypothesis testing in microarray experiments. Statistical Science 18: 71-103.

Edwards, D. (1999). On model pre-specification in confirmatory randomized studies. Statistics in Medicine 18: 771-785.

Fisher, R.A. (1932). Statistical Methods for Research Workers, 4th edition. Oliver and Boyd: London.

Fleiss, J.L. (1986). The design and analysis of clinical experiments. Wiley.

Gail, M., Simon, R. (1985). Testing for qualitative interaction between treatment effects and patient subsets. Biometrics 41: 361-372.

Ganju, J., Ma, J. (2014). The potential for increased power from combining p-values testing the same hypothesis. Statistical Methods in Medical Research (e-publication ahead of print; June 11, 2014).

Ganju, J., Mehrotra, D.V. (2003). Stratified experiments re-examined with emphasis on multicenter trials. Controlled Clinical Trials 24: 167-181. Correction: 2003, page 830.

Ganju, J., Yu, X., Ma, J. (2013). Robust inference from multiple statistics via permutations: A better alternative to the single statistic approach. Pharmaceutical Statistics 12: 282-290.

Koch, G.G. (1997). Discussion of "p-value adjustments for subgroup analyses." Journal of Biopharmaceutical Statistics 7: 323-331.

Lachin, J.M. (2000). Biostatistical Methods: The assessment of relative risks. Wiley.

Loughin, T.M. (2004). A systematic comparison of methods for combining p-values from independent tests. Computational Statistics and Data Analysis 47: 467-485.

Millen, B.A., Dmitrienko, A., Ruberg, S., Shen, L. (2012). A statistical framework for decision making in confirmatory multipopulation tailoring clinical trials. Therapeutic Innovation and Regulatory Science 46: 647-656.

Millen, B.A., Dmitrienko, A., Song, G. (2014). Bayesian assessment of the influence and interaction condition in multipopulation tailoring clinical trials. Journal of Biopharmaceutical Statistics 24: 94-109.

Rosenbaum, P.R. (2002). Observational studies. Springer-Verlag.

Rothmann, M.D., Zhang, J.J., Lu, L., Fleming, T.R. (2012). Testing in a pre-specified subgroup and the intent-to-treat population. Drug Information Journal 46: 175-179.

SOLVD Investigators. (1992). Effect of enalapril on mortality and the development of heart failure in asymptomatic patients with reduced left ventricular ejection fractions. New England Journal of Medicine 327: 685-691.

Wittes, J. (2009). On looking at subgroups. Circulation 119: 912-915.

Yusuf, S., Wittes, J., Probstfield, J., Tyroler, H.A. (1991). Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. Journal of the American Medical Association 266: 93-98.
Table 1, Example 1. The estimator is the difference in means.

Baseline HbA1c    Estimate    95% CI
< 8%              -0.45       -0.52, -0.38
8 - < 9%          -0.78       -0.90, -0.66