A single test for rejecting the null hypothesis in subgroups and in the overall sample.

Journal of Biopharmaceutical Statistics

ISSN: 1054-3406 (Print) 1520-5711 (Online) Journal homepage: http://www.tandfonline.com/loi/lbps20

A Single Test for Rejecting the Null Hypothesis in Subgroups and in the Overall Sample

Yunzhi Lin, Kefei Zhou & Jitendra Ganju

To cite this article: Yunzhi Lin, Kefei Zhou & Jitendra Ganju (2016): A Single Test for Rejecting the Null Hypothesis in Subgroups and in the Overall Sample, Journal of Biopharmaceutical Statistics, DOI: 10.1080/10543406.2016.1148718

To link to this article: http://dx.doi.org/10.1080/10543406.2016.1148718

Accepted author version posted online: 18 Feb 2016.


A Single Test for Rejecting the Null Hypothesis in Subgroups and in the Overall Sample

Authors: Yunzhi Lin1, Kefei Zhou2, Jitendra Ganju*3

[email protected], [email protected]. *Corresponding author [email protected]

Abstract

In clinical trials, some patient subgroups are likely to demonstrate larger effect sizes than other subgroups. For example, the effect size, or informally the benefit with treatment, is often greater in patients with a moderate condition of a disease than in those with a mild condition. A limitation of the usual method of analysis is that it does not incorporate this ordering of effect size by patient subgroup. We propose a test statistic which supplements the conventional test by including this information and simultaneously tests the null hypothesis in pre-specified subgroups and in the overall sample. It results in more power than the conventional test when the differences in effect sizes across subgroups are at least moderately large; otherwise it loses power. The method involves combining p-values from models fit to pre-specified subgroups and the overall sample in a manner that assigns greater weight to subgroups in which a larger effect size is expected. Results are presented for randomized trials with two and three subgroups.

Keywords: subgroups; minimum p-value; power

INTRODUCTION

Patients enrolled in randomized and blinded clinical trials comprise a heterogeneous group with respect to disease severity, medical history, demographics, and the like. Those who share similar characteristics at baseline (i.e., pre-treatment) form a subgroup that may respond to treatment differently from patients belonging to another subgroup. Examples of subgroups include disease severity (mild / moderate / severe), age (elderly / middle-aged / young), or number of medications prior to study entry (1 / 2). It is usually anticipated that the effect size (the treatment effect divided by the standard deviation) may vary systematically by subgroup and in some cases the ordering can be identified in advance. This article proposes a test statistic that incorporates this knowledge for testing the null hypothesis of no treatment effect in two-group trials.

Three examples of varying treatment effects by subgroup are shown in Table 1. Examples 1, 2 and 3 are from trials in type 2 diabetes mellitus (Canagliflozin Briefing Document 2013), heart failure (SOLVD Investigators 1992), and anemia (Corwin et al. 1997), respectively. In each case the treatment effect varies substantially by subgroup. For instance, in Example 1, the treatment effect in the subgroup with baseline HbA1c < 8% (the subgroup with less disease severity) is less than half of that in the ≥ 9% subgroup (the subgroup with greater disease severity). The distance between the lower limit of the 95% confidence interval of the < 8% subgroup (-0.52%) and the upper limit of the ≥ 9% subgroup (-0.90%) is also very large, providing clear evidence of a much greater benefit with treatment in the subgroup with the greatest disease severity. This systematic change in treatment effect by subgroup is also expected in many other disease areas.

[Table 1 here]

The conventional approach to data analysis is to model the response variable as a function of treatment, the baseline characteristic (e.g. disease severity) and perhaps its interaction with treatment. The model, however, does not use the additional information, namely, that the effect size varies systematically by subgroup. This article presents a method that uses this knowledge, and proposes a single test statistic for simultaneously testing the null in one or more subgroups and in the overall sample. In outline, the method proposed works as follows: (a) Pre-specify the ordering of effect size by subgroup. (b) Obtain p-values from models fit to the overall sample and to the pre-specified subgroups. (c) For inference, derive a single combined p-value which is a function of the p-values from these analyses, and obtain the p-value of this combined statistic using the permutation approach to control the type I error rate at its designated value. The choice of a p-value combining function to get a single p-value is discussed later. When one subgroup has more power than another subgroup, the test based on combining p-values can result in greater power than the usual test; otherwise there is loss in power.

The rest of the article is organized as follows. Section 2 describes the proposed method, Section 3 provides simulation results showing when the test gives more or less power, Section 4 illustrates the method with an example, and remarks are made in Section 5.

THE SINGLE TEST FOR JOINT TESTING

The notation used and the description of the permutation method will be kept similar to Edwards (1999). Random variables and their realizations will be denoted by upper case and lower case letters, respectively; matrices will use bold typeface. Let $N$ denote the total number of patients to compare two treatments, $\mathbf{Z}$ (coded 0 for one treatment and 1 for the other) denote the random treatment allocation, $\boldsymbol{\Omega}$ the collection of all possible treatment allocations, and $\mathbf{Y}$ the response vector, where $\mathbf{Z}$ and $\mathbf{Y}$ each have dimension $N \times 1$. Each $Y_i$ is assumed to have the same variance.

The null hypothesis $\mathcal{H}$ of no treatment effect across all subgroups states that the response $Y_i$ for subject $i$ will be the same regardless of what randomized treatment is received. See Rosenbaum (2002, p. 39) for a preference for this null over the null that says that on average there is no effect. Thus, if the null is not true in just one subgroup, it is also not true in the overall sample. The converse, however, that the null is not true in the overall sample implies that the null is not true in a particular subgroup, does not hold.

The method is described for a single variable (e.g. HbA1c) or factor (e.g. disease severity) that can be split into subgroups. Let $J$ denote the number of subgroups associated with a single variable or single factor (e.g. $J = 3$ if the subgroups are mild, moderate and severe disease severity). We set the convention that $m_1$ denotes the subset of data associated with the subgroup with the largest assumed effect size with p-value for treatment denoted $P_1$, $m_2$ denotes the subset of data associated with the subgroup with the next largest assumed effect size with p-value $P_2$, and so on, until $m_J$ which denotes the subset of data associated with the subgroup with the smallest assumed effect size with p-value $P_J$. The subsets $m_1, m_2, \ldots, m_J$ form a mutually exclusive and exhaustive partitioning of the data. Let $M_1, M_2, \ldots, M_J$ denote the subgroups associated with the data subsets $m_1, m_2, \ldots, m_J$, resp.

Let $m_{j,k}$ denote the subset of data which combines subgroups $j$ ($= 1, 2, 3, \ldots, J$) and $k$ ($= 1, 2, 3, \ldots, J$), where $j \neq k$, with p-value $P_{j,k}$. Similarly define $m_{j,k,l}$, where $j \neq k \neq l$, with p-value $P_{j,k,l}$, and so on such that $m_{1,2,3,\ldots,J}$, with p-value $P_{1,2,3,\ldots,J}$, refers to the entire sample. Thus, for example, with Example 1 of Table 1, $m_1$ refers to the subset of data defined by subjects with baseline HbA1c ≥ 9%, $m_2$ refers to the subset of data defined by subjects with baseline HbA1c between 8% and 9%, $m_{1,2}$ refers to the ≥ 8% subgroup, and $m_{1,2,3}$ refers to the entire sample, with p-values $P_1$, $P_2$, $P_{1,2}$, and $P_{1,2,3}$, resp. Let $M_{j,k}, M_{j,k,l}, \ldots, M_{1,2,3,\ldots,J}$ denote the subgroups associated with the data subsets $m_{j,k}, m_{j,k,l}, \ldots, m_{1,2,3,\ldots,J}$, resp.

The three-step procedure is as follows. (1) The first step is to pre-specify subsets of data for inclusion in analysis based on an ordering of expected effect size. In its most general form the subsets could be any combination of mutually exclusive subsets and $m_{1,2,3,\ldots,J}$, or a sequence of nested subsets. We emphasize the latter because we expect it to be more powerful if the order of effect sizes is correct. Let $\tilde{m}$ denote the set of datasets. For example, one may choose $\tilde{m} = \{m_1, m_{1,2,3}\}$. (2) The second step is to obtain p-values from models fit to the pre-specified datasets. Corresponding to $\tilde{m}$, let $\tilde{P}$ denote the associated set of p-values. For $\tilde{m} = \{m_1, m_{1,2,3}\}$, we have $\tilde{P} = \{P_1, P_{1,2,3}\}$. (3) The final step is to combine the p-values using a p-value combining function and obtain its p-value. Let $p(\tilde{m}, \mathbf{y}, \mathbf{z})$ denote the p-value combining function with observed value denoted as $\tilde{p}_{obs}$. Because the p-values contained in $\tilde{P}$ are correlated, the permutation distribution, described next, is used to obtain the p-value of $\tilde{p}_{obs}$ (Edwards 1999; Dudoit et al. 2003; Ganju et al. 2013; Ganju and Ma 2014). The procedure ensures weak control of the type I error rate.
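Step (1) with a sequence of nested subsets can be sketched in a few lines of Python. This is only an illustration: the function name `nested_subsets`, the label strings, and the list-of-indices representation are assumptions of the sketch, not part of the article.

```python
def nested_subsets(labels, order):
    """Build the nested index sets m_1, m_{1,2}, ..., m_{1,...,J}.

    labels: subgroup label of each subject (hypothetical input format).
    order:  subgroup labels sorted from largest to smallest expected effect.
    """
    included = set()
    subsets = []
    for group in order:
        included.add(group)  # pool the next subgroup in the pre-specified order
        subsets.append([i for i, lab in enumerate(labels) if lab in included])
    return subsets


labels = ["mild", "severe", "moderate", "severe", "mild", "moderate"]
m_tilde = nested_subsets(labels, order=["severe", "moderate", "mild"])
print(m_tilde)  # [[1, 3], [1, 2, 3, 5], [0, 1, 2, 3, 4, 5]]
```

The last element always indexes the entire sample, so a model fit to each element in turn would give the p-values that make up the associated set of p-values.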

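Let $p_\alpha(\tilde{m}, \mathbf{y}, \mathbf{z})$ denote the $100\alpha$th percentile of the distribution of $p(\tilde{m}, \mathbf{y}, \mathbf{Z})$; note that $\mathbf{Z}$ is random under $p(\tilde{m}, \mathbf{y}, \mathbf{Z})$. Under $\mathcal{H}$, $p(\tilde{m}, \mathbf{y}, \mathbf{Z})$ is a level $\alpha$ test if

$$\Pr\{p(\tilde{m}, \mathbf{y}, \mathbf{Z}) \le p_\alpha(\tilde{m}, \mathbf{y}, \mathbf{z})\} \le \alpha$$

The above probability can be rewritten as

$$\sum_{\mathbf{z}} \Pr\{\mathbf{Z} = \mathbf{z}\}\, I\{p(\tilde{m}, \mathbf{y}, \mathbf{z}) \le p_\alpha(\tilde{m}, \mathbf{y}, \mathbf{z})\} \le \alpha$$

where $I\{\,\}$ is the indicator function and the summation is over all $\mathbf{z}$ in $\boldsymbol{\Omega}$.

The permutation-based p-value of $\tilde{p}_{obs}$ is

$$\Pr\{p(\tilde{m}, \mathbf{y}, \mathbf{Z}) < \tilde{p}_{obs}\} = \sum_{\mathbf{z}} \Pr\{\mathbf{Z} = \mathbf{z}\}\, I\{p(\tilde{m}, \mathbf{y}, \mathbf{z}) < \tilde{p}_{obs}\}. \qquad (1)$$

$\mathcal{H}$ is rejected when $\Pr\{p(\tilde{m}, \mathbf{y}, \mathbf{Z}) < \tilde{p}_{obs}\} \le \alpha$ or equivalently, when $\tilde{p}_{obs} \le p_\alpha(\tilde{m}, \mathbf{y}, \mathbf{z})$.

We choose $p(\tilde{m}, \mathbf{y}, \mathbf{z}) = \min(\tilde{P})$ as the p-value combining function. For simplicity in notation, we let $minP$ denote $\min(\tilde{P})$, and $p_{minP}$ denote the p-value of $\tilde{p}_{obs}$. Under a repeated sampling framework and assuming independence between p-values, it is known that $minP$ follows a $Beta(1, K)$ distribution where $K$ denotes the number of p-values included in $\tilde{P}$ (David 1980). However, the $Beta$ distribution does not hold when the p-values are correlated; hence the reliance on the permutation distribution.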
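A minimal Python sketch of the permutation p-value in (1) with the minimum as combining function follows. Several things here are assumptions made for illustration rather than the article's implementation: the per-dataset p-value uses a normal approximation to a two-sample test instead of a fitted model, the permutation distribution is approximated by random shuffles of the treatment vector, ties are counted with `<=` (slightly more conservative than the strict inequality in (1)), and all function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def subset_p_value(y, z, idx):
    """Two-sided p-value for treatment within the rows in idx.

    Assumption of this sketch: a normal approximation to the two-sample
    test stands in for the model-based p-value used in the article.
    """
    treated = [y[i] for i in idx if z[i] == 1]
    control = [y[i] for i in idx if z[i] == 0]
    if len(treated) < 2 or len(control) < 2:
        return 1.0  # too few subjects in an arm to estimate a variance
    se = sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    if se == 0:
        return 1.0
    stat = (mean(treated) - mean(control)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(stat)))


def min_p(y, z, subsets):
    """The combining function: smallest p-value over the chosen datasets."""
    return min(subset_p_value(y, z, idx) for idx in subsets)


def permutation_p_value(y, z, subsets, n_perm=1000, seed=0):
    """Monte Carlo estimate of the permutation p-value of the observed minP."""
    rng = random.Random(seed)
    observed = min_p(y, z, subsets)
    z_perm = list(z)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(z_perm)  # re-randomize treatment, hold y and subsets fixed
        if min_p(y, z_perm, subsets) <= observed:
            hits += 1
    return hits / n_perm


# Toy data: 40 subjects, the first 20 form the subgroup with the larger effect.
rng = random.Random(1)
z = [i % 2 for i in range(40)]
y = [rng.gauss(0.0, 1.0) + (3.0 * z[i] if i < 20 else 0.0) for i in range(40)]
m1, m12 = list(range(20)), list(range(40))
p_minp = permutation_p_value(y, z, [m1, m12], n_perm=500)
print(p_minp)
```

Because every permuted allocation is pushed through the same combining function, the correlation between the p-values of overlapping datasets is reflected in the reference distribution without being modelled explicitly.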

With hierarchically organized subsets of data, inference works as follows. If $p_{minP} \le \alpha$, then the null hypothesis associated with the subgroup that yielded $\tilde{p}_{obs}$ can be rejected. Because the subsets of data are incrementally hierarchical, the null hypotheses associated with subgroups containing the subgroup that yielded $\tilde{p}_{obs}$, which includes the overall sample, can also be rejected. If $p_{minP} > \alpha$, then no subgroup can be declared significant. For example, if the null hypothesis associated with $M_1$, $\mathcal{H}(M_1)$, is not true, it implies that the null associated with $M_{1,2}$, $\mathcal{H}(M_{1,2})$, is not true. However, $\mathcal{H}(M_{1,2})$ not true does not imply $\mathcal{H}(M_1)$ not true. To be specific, when $\tilde{m} = \{m_1, m_{1,2}, m_{1,2,3}\}$, then if:

(i) $p_{minP} \le \alpha$ and if $p_1 = \min(p_1, p_{1,2}, p_{1,2,3})$, $\mathcal{H}(M_1)$, $\mathcal{H}(M_{1,2})$, and $\mathcal{H}(M_{1,2,3})$ can be rejected, even if $p_{1,2}$ and $p_{1,2,3} > \alpha$.

(ii) $p_{minP} \le \alpha$ and if $p_{1,2} = \min(p_1, p_{1,2}, p_{1,2,3})$, $\mathcal{H}(M_{1,2})$ and $\mathcal{H}(M_{1,2,3})$ can be rejected. However, $\mathcal{H}(M_1)$ cannot be rejected.

(iii) $p_{minP} \le \alpha$ and if $p_{1,2,3} = \min(p_1, p_{1,2}, p_{1,2,3})$, $\mathcal{H}(M_{1,2,3})$ can be rejected. However, $\mathcal{H}(M_1)$ and $\mathcal{H}(M_{1,2})$ cannot be rejected.

(iv) $p_{minP} > \alpha$, no null can be rejected.

SIMULATION RESULTS

In this section we show how the power of the proposed test for the overall sample compares with conventional tests. It is convenient to demonstrate the power for linear models for which the true model for subgroup $j$ is

$$y = a + b_j z + \varepsilon \qquad (2)$$

where $a$ denotes a constant, $b_j$ denotes the treatment effect (difference in means) for the $j$th subgroup, and $\varepsilon$ denotes the error term. When a subset of data includes $J \ge 2$ subgroups, the model fit to the data may be

$$y = a + bz + \sum_{i=1}^{J-1} c_i x_i + \varepsilon, \qquad (3)$$

or

$$y = a + bz + \sum_{i=1}^{J-1} c_i x_i + \sum_{i=1}^{J-1} d_i x_i z + \varepsilon, \qquad (4)$$

where $b$ denotes the weighted average treatment effect across the $J$ subgroups. The term $x_i$ identifies the subgroups with $J$ levels. It is parameterized so that it has $J - 1$ degrees of freedom (e.g. the factor 'sex' has two levels, but has one degree of freedom; $x$ may take the value -1 for females and 1 for males). The coefficients, $c_i$ and $d_i$, are associated with each of the $J - 1$ degrees of freedom of $x_i$ and $x_i z$, respectively.

Next we evaluate the power of $minP$ compared to the power of the conventional test. The conventional test refers to the test statistic for treatment from fitting models (3) or (4), as appropriate. Under the assumption of normally distributed errors, this test statistic follows a t-distribution. Henceforth, we refer to the conventional test as the t test. When model (4) is fit, we use the test for treatment associated with the Type II method, i.e. the treatment effect is not adjusted for interaction but the error variance is estimated after removing the effect of interaction (Fleiss 1986). An explicit formula for the conventional t test (using the Type II method) from fitting models (3) or (4) is in Ganju and Mehrotra (2003), who denote it $t^{U}$. Under interaction and random subgroup sizes, the size of the t test is inflated (Ganju and Mehrotra 2003), but for the cases considered the inflation is slight. We thus treat the power from the interaction model as the correct estimated power.

We consider trials with 2 and 3 subgroups for illustration. Data are generated and power is calculated as follows: $S$ datasets are generated according to (4) with random treatment assignment and with sample size per treatment equal to $N/2$. Results are invariant to the choice of $a$, $c_i$, and $d_i$. Within each dataset each randomly selected subject has an equal probability of belonging to any of the subgroups, so the subgroups are of the same size on average (a few cases with unequal probabilities are also included). The errors $\varepsilon$ follow a standard normal distribution.

For each run of the simulation, $\tilde{p}_{obs}$ is calculated from $\tilde{m}$. Calculating $p_{minP}$ as per (1) requires evaluating $\Pr\{\mathbf{Z} = \mathbf{z}\}$ over all of $\boldsymbol{\Omega}$, which is computationally prohibitive. Similar results can be obtained while saving much computational time by selecting random samples from $\boldsymbol{\Omega}$ from which $\Pr\{\mathbf{Z} = \mathbf{z}\}$ is estimated (Edwards 1999; Ganju and Ma 2014). Let $L$ denote the number of permutation-based datasets generated for each simulation run. For each of the $L$ datasets the treatment assignment $\mathbf{Z}$ is randomly permuted to generate the permutation distribution, and all else is held constant (Dudoit et al. 2003; Ganju et al. 2013). For the $i$th simulation run, let $p_{minP_i}$ denote the permutation-based p-value of $minP$. Expressed as a percentage, power for the procedure is estimated as $\sum_{i=1}^{S} I\{p_{minP_i} \le \alpha\} \times 100 / S$. Results shown in Tables 2 and 3 are based on $S = 2000$, $L = 1000$, two-sided p-values and $\alpha = 0.05$ (except for the null case in Table 2 for which $S = 5000$ and $L = 1000$).
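The data-generation and power-estimation loop described above can be sketched as follows. To keep the sketch short it estimates the power of a simple overall test only; the permutation step for the combined statistic is left out. The assumptions are loud ones: $a$ and the subgroup main effects are set to 0, the overall test is an unadjusted two-sample comparison with a normal approximation rather than the t test from models (3) or (4), and the function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def simulate_trial(n, effects, rng):
    """One dataset: y = b_j * z + eps, with equally likely subgroups."""
    z = [1] * (n // 2) + [0] * (n - n // 2)
    rng.shuffle(z)  # random treatment assignment, n/2 per arm
    groups = [rng.randrange(len(effects)) for _ in range(n)]
    y = [effects[g] * zi + rng.gauss(0.0, 1.0) for g, zi in zip(groups, z)]
    return y, z, groups


def overall_p_value(y, z):
    """Two-sided p-value for treatment in the whole sample (normal approx.)."""
    treated = [yi for yi, zi in zip(y, z) if zi == 1]
    control = [yi for yi, zi in zip(y, z) if zi == 0]
    se = sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    stat = (mean(treated) - mean(control)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(stat)))


def estimate_power(n, effects, n_sims=2000, alpha=0.05, seed=0):
    """Percentage of simulated trials with p <= alpha: sum(I{p_i <= alpha}) * 100 / S."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        y, z, _groups = simulate_trial(n, effects, rng)
        if overall_p_value(y, z) <= alpha:
            rejections += 1
    return 100.0 * rejections / n_sims


power_alt = estimate_power(80, [0.75, 0.25], n_sims=300, seed=1)
power_null = estimate_power(80, [0.0, 0.0], n_sims=300, seed=2)
print(power_alt, power_null)
```

Swapping `overall_p_value` for a permutation p-value of the combined statistic would turn this into the power estimate of the proposed procedure, at the cost of an inner loop of $L$ permutations per simulated dataset.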

Two Subgroups: Table 2 shows the power when the total sample size $N = 80$ (or 40 per treatment group) and $\tilde{m} = \{m_1, m_{1,2}\}$, $\tilde{m} = \{m_2, m_{1,2}\}$, and $\tilde{m} = \{m_1, m_2, m_{1,2}\}$, where, as noted earlier, $m_1$ is the subgroup with the largest effect size, $m_2$ is the subgroup with the next largest effect size and $m_{1,2}$ denotes the overall sample. For large values of $b_1 - b_2$, or equivalently, when the power of the test for $m_1$ is much larger than the power of the same test for $m_2$, $\tilde{m} = \{m_1, m_{1,2}\}$ is the desired set. However, for comparison purposes, power is also provided for $\tilde{m} = \{m_2, m_{1,2}\}$ and $\tilde{m} = \{m_1, m_2, m_{1,2}\}$. We compare the power of $minP$ with the t test derived from fitting models (3) or (4) on the entire dataset. Also included in Table 2 is the power of the model based on $m_1$.

When $\tilde{m} = \{m_1, m_{1,2}\}$, data from individual subjects are included an unequal number of times in the analysis. Subjects who belong to $m_1$ are also included in $m_{1,2}$, and are thus counted twice, whereas subjects belonging to $m_2$ are only counted once, and that is in $m_{1,2}$. The procedure thus weights subject data unequally: those who belong to the subgroup with a larger effect size are given a larger weight than those who belong to the subgroup with a smaller effect size.

[Table 2 here]

Similarly, with $\tilde{m} = \{m_2, m_{1,2}\}$, subjects belonging to $m_2$ are counted twice, and those belonging to $m_1$ are counted once. However, this choice of $\tilde{m}$ is not advisable because it includes $m_2$ which has lower power than $m_1$. With $\tilde{m} = \{m_1, m_2, m_{1,2}\}$, each subject is counted twice.
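The implicit weighting can be made concrete by counting how many of the chosen datasets each subgroup's subjects enter. The short Python sketch below does only that; the representation of each dataset as a tuple of subgroup numbers and the name `inclusion_counts` are assumptions made for illustration.

```python
from collections import Counter


def inclusion_counts(chosen_sets):
    """Count how many analysis datasets each subgroup's subjects enter."""
    counts = Counter()
    for subset in chosen_sets:
        counts.update(subset)  # one count per subgroup pooled into this dataset
    return dict(counts)


# Each dataset is written as the tuple of subgroups it pools.
print(inclusion_counts([(1,), (1, 2)]))        # {1: 2, 2: 1}
print(inclusion_counts([(2,), (1, 2)]))        # {2: 2, 1: 1}
print(inclusion_counts([(1,), (2,), (1, 2)]))  # {1: 2, 2: 2}
```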

When $\tilde{m} = \{m_1, m_{1,2}\}$, the power of $minP$ increases as the magnitude of $b_1 - b_2$ increases, or equivalently as the power of the test applied to $m_1$ increases relative to $m_2$. When the effect sizes are the same between subgroups or if the difference in the effect size is small, there is loss in power. For example, when $b_1 = .75$ and $b_2 = .65$, the t test yields more power (86% vs 83%). However, for larger differences between $b_1$ and $b_2$ shown in the table, $minP$ yields more power. For example, when $b_1 = .75$ and $b_2 = .25$, the t test has 59% power (under the interaction model), and $minP$ has 65% power.

When the sample sizes per subgroup are unequal, the results depend to a large extent on whether the larger effect size is associated with the larger or smaller subgroup. For example, as shown in Table 2, when $b_1 = 0.75$, $b_2 = 0.25$ with sample sizes 54 and 26, resp., the power with the conventional test is 75%. With $minP$ the power is 79% when $\tilde{m} = \{m_1, m_{1,2}\}$. When the sample sizes are reversed, the power with the conventional test and with $minP$ equal 42% and 47%, resp.

To illustrate the loss in power when an incorrect ordering is pre-specified, consider what happens when $\tilde{m} = \{m_2, m_{1,2}\}$, and $b_1 = .75$ and $b_2 = .25$. The power with $minP$ is 49% which compares very unfavorably with the 59% power from the t test. The loss in power with $minP$ is greater for larger values of $b_1 - b_2$. Thus, identification of the right set of subgroups for inclusion in the analysis is important.

Power is also shown when $\tilde{m} = \{m_1, m_2, m_{1,2}\}$. Here each subject is counted twice. It is interesting to observe that the method based on counting each subject twice can give greater power than the method based on counting each subject once (i.e. the conventional test). For example, when $b_1 = 1$ and $b_2 = .2$, the proposed approach has 81% power whereas the t test has 74% power. For $minP$ to have more power than the t test, the magnitude of $b_1 - b_2$ has to be large, otherwise it has less power.
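As a rough cross-check of the t-test powers quoted in this subsection, the overall power can be approximated in closed form. The sketch below is not the article's calculation: it assumes unit error variance, a normal approximation to the t test, equal allocation to the two arms, and an overall effect equal to the sample-size-weighted average of the subgroup effects. The function name `approx_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effects, sizes, alpha=0.05):
    """Normal-approximation power of a two-sided overall test.

    effects: treatment effect b_j in each subgroup (unit error variance assumed).
    sizes:   number of subjects in each subgroup, both arms combined.
    """
    n_total = sum(sizes)
    avg_effect = sum(b * n for b, n in zip(effects, sizes)) / n_total
    n_per_arm = n_total / 2.0
    ncp = avg_effect / sqrt(2.0 / n_per_arm)  # effect divided by its standard error
    nd = NormalDist()
    crit = nd.inv_cdf(1.0 - alpha / 2.0)
    return nd.cdf(ncp - crit) + nd.cdf(-ncp - crit)


equal = approx_power([0.75, 0.25], [40, 40])
large_first = approx_power([0.75, 0.25], [54, 26])
small_first = approx_power([0.75, 0.25], [26, 54])
print(round(100 * equal), round(100 * large_first), round(100 * small_first))
```

Under these assumptions the three configurations land within a few percentage points of the 59%, 75% and 42% reported for the conventional test, which illustrates why the result depends on whether the larger effect sits in the larger subgroup: the weighted average effect moves from about 0.59 to about 0.41 when the sizes are reversed.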

Although the performance of different combining functions has been studied under independence (Birnbaum 1954; Loughin 2004) and dependence (Ganju and Ma 2014) of p-values, the conclusion is that no one method is uniformly the best. Next we briefly discuss the performance of one of the different combining functions, Fisher's combination test (FCT), which is a function of the product of the p-values contained in $\tilde{P}$. For instance, for $\tilde{m} = \{m_1, m_{1,2}\}$, FCT is $p(\tilde{m}, \mathbf{y}, \mathbf{z}) = -2\ln(p_1 \times p_{1,2})$. Under independence of p-values FCT follows a $\chi^2_{2K}$ distribution where $K$ denotes the number of p-values in $\tilde{P}$ (Fisher 1932). In our case, because of correlated p-values, power for FCT was obtained in the same way as for $minP$ using the permutation method. For $b_1 = .75$ and $b_2 = .25$, and for $\tilde{m} = \{m_1, m_{1,2}\}$, $\tilde{m} = \{m_2, m_{1,2}\}$, and $\tilde{m} = \{m_1, m_2, m_{1,2}\}$, power (%) with FCT equals (for comparison, the power of $minP$ is shown parenthetically), resp., 68 (65), 41 (49), 59 (61). As shown for this configuration and for others (results not included), neither statistic is uniformly better than the other.

Three Subgroups: Power for the test is shown when $N = 120$ (or 60 per treatment group) and $\tilde{m} = \{m_1, m_{1,2}, m_{1,2,3}\}$ for various values of $b_1$, $b_2$, and $b_3$ in Table 3. As shown for the case with two subgroups, there is loss in power with the $minP$ procedure compared to the t test if the treatment effect across subgroups is not much different. For example, when $b_1 = b_2 = b_3 = .55$ (i.e. no interaction), the t test has 84% power and the $minP$ test has 78% power. Simulation results also indicate that if the power of the test associated with either $m_1$ or $m_{1,2}$ is larger than the power of the test associated with $m_{1,2,3}$, the power of $minP$ is also larger. It is instructive to compare the power for configurations $b_1 = .9$, $b_2 = .4$, $b_3 = .2$, and $b_1 = 1.0$, $b_2 = .3$, $b_3 = .2$. Because the subgroups are equally sized, both configurations yield the same overall treatment effect of .5, and thus the same power for $m_{1,2,3}$ (77%). Likewise with $m_{1,2}$, the treatment effect is .65 and the power is the same (81%). However, with $minP$ the latter configuration gives more power than the former (86% vs 83%), and that is because the power of the test associated with $m_1$ is also higher (86% vs 78%).

[Table 3 here]

EXAMPLE

Table 4 shows hypothetical data from a randomized clinical trial involving two treatments and two subgroups. The response variable is whether or not a patient has healed. A total of 200 patients are randomized in a 1:1 ratio to treatment or placebo. The data are from Lachin (2000; pages 90-91). The example has three subgroups; however, because of small sample sizes we combine the first two subgroups into one subgroup. We label the first subgroup 'mild disease severity' and the second 'moderate disease severity'. We assume randomization was not stratified by disease severity.

[Table 4 here]

Let $m_1$ and $m_2$, resp., represent datasets associated with the moderate ($M_1$) and mild ($M_2$) subgroups. The one-sided p-values are 0.011, 0.358 and 0.033 for $M_1$, $M_2$, and $M_{1,2}$ from a logistic regression analysis, where for $M_{1,2}$ the analysis is adjusted for disease severity. The model including the interaction between treatment and subgroup gives a similar p-value (0.034) to the model without interaction. All model-based p-values are from the Wald statistic. For declaring significance, one-sided p-values are compared to $\alpha = 0.025$. With the conventional method, treatment is not statistically superior to placebo because the Wald test p-value of 0.033 > 0.025.

The p-values for $minP$ based on $\tilde{m} = \{m_1, m_{1,2}\}$, $\tilde{m} = \{m_2, m_{1,2}\}$, and $\tilde{m} = \{m_1, m_2, m_{1,2}\}$ equal, resp., 0.020, 0.065, and 0.036. Each p-value is based on 5000 permutation datasets. The conclusion, for example, with the choice of $\tilde{m} = \{m_1, m_{1,2}\}$ is that the p-value of 0.020 is small enough to suggest that treatment is statistically superior to placebo in the subgroup with moderate disease severity and in the overall sample. However, because the benefit with treatment in the mild subgroup is very weak, it is reasonable to ask if treatment is efficacious in the entire sample. That discussion is deferred to the next section. If $\tilde{m} = \{m_2, m_{1,2}\}$ or $\tilde{m} = \{m_1, m_2, m_{1,2}\}$, treatment cannot be judged statistically superior to placebo in any represented subgroup or in the overall sample.
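The decision rule for nested datasets can be written as a small function and applied to the p-values reported in this example. The sketch is illustrative only: the function name `rejected_nulls` and the convention of listing datasets from the smallest nested subset to the overall sample are assumptions, and the rule is applied only to the two nested choices of datasets.

```python
def rejected_nulls(names, p_values, p_minp, alpha=0.025):
    """Nulls rejected under the nested-subset rule.

    names, p_values: datasets ordered from the smallest nested subset to the
                     overall sample, with their model-based p-values.
    p_minp:          permutation-based p-value of the observed minimum.
    """
    if p_minp > alpha:
        return []  # nothing can be declared significant
    k = p_values.index(min(p_values))  # dataset that produced the minimum
    return names[k:]  # that dataset and every dataset containing it


# Moderate subgroup and overall sample.
print(rejected_nulls(["M1", "M1,2"], [0.011, 0.033], p_minp=0.020))  # ['M1', 'M1,2']
# Mild subgroup and overall sample.
print(rejected_nulls(["M2", "M1,2"], [0.358, 0.033], p_minp=0.065))  # []
```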

REMARKS

To avoid bias, it is necessary to emphasize that the subgroups should be pre-specified in the same way that analytical methods are fully described before unblinding the trial. Two limitations to the interpretation of results that apply to the traditional method that would also apply here are now mentioned. The first is when qualitative interaction (Gail and Simon 1985) between treatment and the subgroup factor is present. In such a case, regardless of the method of analysis, the results are ambiguous and difficult to interpret. The other concerns the appropriateness of making claims on the overall sample when some subgroups show no effect. In the context of biomarker positive and negative patients, Rothmann et al. (2012) correctly note that if only patients who are biomarker positive benefit from treatment and those who are biomarker negative do not, then it is not appropriate to claim benefit in the overall sample because of a statistically favorable result in the overall sample. A strong treatment effect in the biomarker-positive sample coupled with a null effect in the biomarker-negative sample can still give a small enough p-value in the overall sample to declare success. But it would not be correct to conclude that all patients, regardless of biomarker positive or negative status, benefit from treatment. Any claim of benefit should be limited to just the biomarker positive patients.

With other types of subgroups, similar restrictions in interpretation may or may not apply. Example 1 of Table 1 shows all subgroups benefitting from treatment but that is not the case in Examples 2 and 3. The estimated treatment effects in the last subgroup in Example 2, and in the last two subgroups in Example 3 are too small to suggest benefit. The implication with the last subgroup in Example 1 is that a trial done only in patients with low baseline HbA1c levels will demonstrate a meaningful reduction in HbA1c levels. With Examples 2 and 3, whether or not claims of benefit would generalize to the entire sample is often a subject-matter specific discussion which is out of the scope of the article. A successful result interpreted to include the overall sample rests on the point of view that a large enough sample size in the subgroups with weak effects would have demonstrated meaningful benefit. This judgment, however, is independent of the method of analysis. This consideration in interpreting results across all subgroups applies to the conventional test and to the test that we propose. The above discussion applies when trials are interpreted after data are observed. For prospective evaluation of effect sizes in subgroups and the overall sample, see Millen et al. (2012 and 2014). Their framework requires designing the trial such that valid claims of benefit can be made across the subgroups and the overall sample.

The statistical literature on subgroups is replete with the admonition to not take the observed data at face value, particularly when the subgroups are not called out in advance (Wittes 2009; Yusuf et al. 1991). Methods for adjusting p-values have been proposed to avoid an excess of false positives (Bristol 1997; Koch 1997; Biesheuvel and Hothorn 2003). The context is usually the transition to interpreting subgroups after evaluating the overall result. This article presents a statistic for joint testing in subgroups and in the overall sample. It is premised on the observation that in many therapeutic areas the effect size varies systematically by patient subgroups, which the conventional method of analysis does not exploit. The proposed method explicitly recognizes the ordering in effect size by subgroup, and is shown to have moderate sized improvements in power for the combined sample when the differences in effect sizes between the subgroups are large. Conversely, and as expected, the method also loses power if the wrong subgroup gets selected. Our recommendation is to use the method as an alternative to the conventional method of analysis when there is joint interest in rejecting the null in the overall sample and in subgroups.

One criticism of the method might be the role played by the stronger subgroups in generalizing the result across the overall sample. A significant result in the overall sample might lead to the conclusion that treatment is efficacious in the weaker subgroups. Our response to this is two-fold. First note that the concern extends even to the traditional analysis. The method does not preclude the conclusion that treatment is not efficacious in a weaker subgroup even if the result is significant in the overall sample. The points made by Rothmann et al. (2012) for a traditional analysis, as noted earlier, apply to the proposed analysis as well. The second is to understand the context for use of the method. When treatment has the potential to provide meaningful benefit in all subgroups, the proposed statistic has the potential to yield more power than the conventional test statistic to reject the null hypothesis in the overall sample while simultaneously allowing for rejecting the null in pre-specified subgroups.

The proposed testing procedure does not provide an unbiased estimate of the treatment effect although it is bracketed by the estimated treatment effects observed with the individual datasets contained in $\tilde{m}$. While the article provided results for two or three subgroups for linear models, extensions to more than three subgroups and to non-linear models follow directly from the three-step procedure noted in Section 2.

Acknowledgments

We thank two reviewers for a careful review. One reviewer in particular provided us with excellent suggestions for which we are grateful.

REFERENCES

Biesheuvel, E.H.E., Hothorn, L.A. (2003). Protocol designed subgroup analyses in multiarmed clinical trials: multiplicity aspects. Journal of Biopharmaceutical Statistics 13: 663-673.

Birnbaum, A. (1954). Combining independent tests of significance. Journal of the American Statistical Association 49: 559-575.

Bristol, D.R. (1997). p-value adjustments for subgroup analyses. Journal of Biopharmaceutical Statistics 7: 313-321.

Canagliflozin briefing book for FDA Advisory Committee meeting (2013). http://www.fda.gov/downloads/AdvisoryCommittees/CommitteesMeetingMaterials/Drugs/EndocrinologicandMetabolicDrugsAdvisoryCommittee/UCM334551.pdf

Corwin, H.L., Gettinger, A., Fabian, T.C., May, A., Pearl, R.G., Heard, S., An, R., Bowers, P.J., Burton, P., Klausner, M.A., Corwin, M.J. (1997). Efficacy and safety of epoetin alpha in critically ill patients. New England Journal of Medicine 357: 965-976.

David, H.A. (1980). Order Statistics. Wiley.

Dudoit, S., Shaffer, J.P., Boldrick, J.C. (2003). Multiple hypothesis testing in microarray experiments. Statistical Science 18: 71-103.

Edwards, D. (1999). On model pre-specification in confirmatory randomized studies. Statistics in Medicine 18: 771-785.

Fisher, R.A. (1932). Statistical Methods for Research Workers, 4th edition. Oliver and Boyd: London.

Fleiss, J.L. (1986). The design and analysis of clinical experiments. Wiley.

Gail, M., Simon, R. (1985). Testing for qualitative interaction between treatment effects and patient subsets. Biometrics 41: 361-372.

Ganju, J., Ma, J. (2014). The potential for increased power from combining p-values testing the same hypothesis. Statistical Methods in Medical Research (e-publication ahead of print; June 11, 2014).

Ganju, J., Mehrotra, D.V. (2003). Stratified experiments re-examined with emphasis on multicenter trials. Controlled Clinical Trials 24: 167-181. Correction: 2003, page 830.

Ganju, J., Yu, X., Ma, J. (2013). Robust inference from multiple statistics via permutations: A better alternative to the single statistic approach. Pharmaceutical Statistics 12: 282-290.

Koch, G.G. (1997). Discussion of "p-value adjustments for subgroup analyses." Journal of Biopharmaceutical Statistics 7: 323-331.

Lachin, J.M. (2000). Biostatistical Methods: The assessment of relative risks. Wiley.

Loughin, T.M. (2004). A systematic comparison of methods for combining p-values from independent tests. Computational Statistics and Data Analysis 47: 467-485.

Millen, B.A., Dmitrienko, A., Ruberg, S., Shen, L. (2012). A statistical framework for decision making in confirmatory multipopulation tailoring clinical trials. Therapeutic Innovation and Regulatory Science 46: 647-656.

Millen, B.A., Dmitrienko, A., Song, G. (2014). Bayesian assessment of the influence and interaction condition in multipopulation tailoring clinical trials. Journal of Biopharmaceutical Statistics 24: 94-109.

Rosenbaum, P.R. (2002). Observational studies. Springer-Verlag.

Rothmann, M.D., Zhang, J.J., Lu, L., Fleming, T.R. (2012). Testing in a pre-specified subgroup and the intent-to-treat population. Drug Information Journal 46: 175-179.

SOLVD Investigators. (1992). Effect of enalapril on mortality and the development of heart failure in asymptomatic patients with reduced left ventricular ejection fractions. New England Journal of Medicine 327: 685-691.

Wittes, J. (2009). On looking at subgroups. Circulation 119: 912-915.

Yusuf, S., Wittes, J., Probstfield, J., Tyroler, H.A. (1991). Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. Journal of the American Medical Association 266: 93-98.

Table 1 (partial)

Baseline HbA1c: estimator is difference in means    Estimate    95% CI
< 8%                                                -0.45       -0.52, -0.38
8 - < 9%                                            -0.78       -0.90, -0.66
