Journal of Biopharmaceutical Statistics

ISSN: 1054-3406 (Print) 1520-5711 (Online) Journal homepage: http://www.tandfonline.com/loi/lbps20

A Single Test for Rejecting the Null Hypothesis in Subgroups and in the Overall Sample

Yunzhi Lin, Kefei Zhou & Jitendra Ganju

To cite this article: Yunzhi Lin, Kefei Zhou & Jitendra Ganju (2016): A Single Test for Rejecting the Null Hypothesis in Subgroups and in the Overall Sample, Journal of Biopharmaceutical Statistics, DOI: 10.1080/10543406.2016.1148718

To link to this article: http://dx.doi.org/10.1080/10543406.2016.1148718

Accepted author version posted online: 18 Feb 2016.


Full Terms & Conditions of access and use can be found at http://www.tandfonline.com/action/journalInformation?journalCode=lbps20 Download by: [Gazi University]

Date: 23 February 2016, At: 07:30

A Single Test for Rejecting the Null Hypothesis in Subgroups and in the Overall Sample Authors: Yunzhi Lin1, Kefei Zhou2, Jitendra Ganju*3


[email protected], [email protected]. *Corresponding author [email protected]


Abstract

In clinical trials, some patient subgroups are likely to demonstrate larger effect sizes than other subgroups. For example, the effect size, or informally the benefit with treatment, is often greater in patients with a moderate condition of a disease than in those with a mild condition. A limitation of the usual method of analysis is that it does not incorporate this ordering of effect size by patient subgroup. We propose a test statistic which supplements the conventional test by including this information and simultaneously tests the null hypothesis in pre-specified subgroups and in the overall sample. It results in more power than the conventional test when the differences in effect sizes across subgroups are at least moderately large; otherwise it loses power. The method involves combining p-values from models fit to pre-specified subgroups and the overall sample in a manner that assigns greater weight to subgroups in which a larger effect size is expected. Results are presented for randomized trials with two and three subgroups.

Keywords: subgroups; minimum p-value; power

INTRODUCTION

Patients enrolled in randomized and blinded clinical trials comprise a heterogeneous group with respect to disease severity, medical history, demographics, and the like. Those who share similar characteristics at baseline (i.e., pre-treatment) form a subgroup that may respond to treatment differently from patients belonging to another subgroup. Examples of subgroups include disease severity (mild / moderate / severe), age (elderly / middle-aged / young), or number of medications prior to study entry (1 / 2). It is usually anticipated that the effect size (the treatment effect divided by the standard deviation) may vary systematically by subgroup, and in some cases the ordering can be identified in advance. This article proposes a test statistic that incorporates this knowledge for testing the null hypothesis of no treatment effect in two-group trials.

Three examples of varying treatment effects by subgroup are shown in Table 1. Examples 1, 2 and 3 are from trials in type 2 diabetes mellitus (Canagliflozin Briefing Document 2013), heart failure (SOLVD Investigators 1992), and anemia (Corwin et al. 1997), respectively. In each case the treatment effect varies substantially by subgroup. For instance, in Example 1, the treatment effect in the subgroup with baseline HbA1c < 8% (the subgroup with less disease severity) is less than half of that in the > 9% subgroup (the subgroup with greater disease severity). The distance between the lower limit of the 95% confidence interval of the < 8% subgroup (-0.52%) and the upper limit of the ≥ 9% subgroup (-0.90%) is also very large, providing clear evidence of a much greater benefit with treatment in the subgroup with the greatest disease severity. This systematic change in treatment effect by subgroup is also expected in many other disease areas.

[Table 1 here]

The conventional approach to data analysis is to model the response variable as a function of treatment, the baseline characteristic (e.g. disease severity) and perhaps its interaction with treatment. The model, however, does not use the additional information, namely, that the effect size varies systematically by subgroup. This article presents a method that uses this knowledge, and proposes a single test statistic for simultaneously testing the null in one or more subgroups and in the overall sample. In outline, the proposed method works as follows: (a) Pre-specify the ordering of effect size by subgroup. (b) Obtain p-values from models fit to the overall sample and to the pre-specified subgroups. (c) For inference, derive a single p-value which is a function of the p-values from these analyses. Obtain the p-value of this single statistic using the permutation approach to control the type I error rate at its designated value. The choice of a p-value combining function to get a single p-value is discussed later. When one subgroup has more power than another subgroup, the test based on combining p-values can result in greater power than the usual test; otherwise there is loss in power.

The rest of the article is organized as follows. Section 2 describes the proposed method, Section 3 provides simulation results showing when the test gives more or less power, Section 4 illustrates the method with an example, and remarks are made in Section 5.

THE SINGLE TEST FOR JOINT TESTING

The notation used and the description of the permutation method will be kept similar to Edwards (1999). Random variables and their realizations will be denoted by upper case and lower case letters, respectively; matrices will use bold typeface. Let $N$ denote the total number of patients to compare two treatments, $\boldsymbol{Z}$ (coded 0 for one treatment and 1 for the other) denote the random treatment allocation, $\boldsymbol{\Omega}$ the collection of all possible treatment allocations, and $\boldsymbol{Y}$ the response vector, where the dimensions of $\boldsymbol{Z}$ and $\boldsymbol{Y}$ equal $N \times 1$. Each $Y_i$ is assumed to have the same variance.

The null hypothesis $\mathcal{H}$ of no treatment effect across all subgroups states that the response $Y_i$ for subject $i$ will be the same regardless of which randomized treatment is received. See Rosenbaum (2002, p. 39) for a preference for this null over the null that says that on average there is no effect. Thus, if the null is not true in just one subgroup, it is also not true in the overall sample. The converse, however, that the null is not true in the overall sample implies that the null is not true in a particular subgroup, does not hold.

The method is described for a single variable (e.g. HbA1c) or factor (e.g. disease severity) that can be split into subgroups. Let $J$ denote the number of subgroups associated with a single variable or single factor (e.g. $J = 3$ if the subgroups are mild, moderate and severe disease severity). We set the convention that $m_1$ denotes the subset of data associated with the subgroup with the largest assumed effect size, with p-value for treatment denoted $P_1$; $m_2$ denotes the subset of data associated with the subgroup with the next largest assumed effect size, with p-value $P_2$; and so on, until $m_J$, which denotes the subset of data associated with the subgroup with the smallest assumed effect size, with p-value $P_J$. The subsets $m_1, m_2, \ldots, m_J$ form a mutually exclusive and exhaustive partitioning of the data. Let $M_1, M_2, \ldots, M_J$ denote the subgroups associated with the data subsets $m_1, m_2, \ldots, m_J$, respectively.

Let $m_{j,k}$ denote the subset of data which combines subgroups $j$ $(= 1, 2, 3, \ldots, J)$ and $k$ $(= 1, 2, 3, \ldots, J)$, where $j \ne k$, with p-value $P_{j,k}$. Similarly define $m_{j,k,l}$, where $j \ne k \ne l$, with p-value $P_{j,k,l}$, and so on such that $m_{1,2,3,\ldots,J}$, with p-value $P_{1,2,3,\ldots,J}$, refers to the entire sample. Thus, for example, with Example 1 of Table 1, $m_1$ refers to the subset of data defined by subjects with baseline HbA1c > 9%, $m_2$ refers to the subset of data defined by subjects with baseline HbA1c between 8% and 9%, $m_{1,2}$ refers to the ≥ 8% subgroup, and $m_{1,2,3}$ refers to the entire sample, with p-values $P_1$, $P_2$, $P_{1,2}$, and $P_{1,2,3}$, respectively. Let $M_{j,k}, M_{j,k,l}, \ldots, M_{1,2,3,\ldots,J}$ denote the subgroups associated with the data subsets $m_{j,k}, m_{j,k,l}, \ldots, m_{1,2,3,\ldots,J}$, respectively.

The three-step procedure is as follows. (1) The first step is to pre-specify subsets of data for inclusion in the analysis based on an ordering of expected effect size. In its most general form the subsets could be any combination of mutually exclusive subsets and $m_{1,2,3,\ldots,J}$, or a sequence of nested subsets. We emphasize the latter because we expect it to be more powerful if the order of effect sizes is correct. Let $\tilde{m}$ denote the set of datasets. For example, one may choose $\tilde{m} = \{m_1, m_{1,2,3}\}$. (2) The second step is to obtain p-values from models fit to the pre-specified datasets. Corresponding to $\tilde{m}$, let $\tilde{P}$ denote the associated set of p-values. For $\tilde{m} = \{m_1, m_{1,2,3}\}$, we have $\tilde{P} = \{P_1, P_{1,2,3}\}$. (3) The final step is to combine the p-values using a p-value combining function and obtain its p-value. Let $p(\tilde{m}, \boldsymbol{y}, \boldsymbol{z})$ denote the p-value combining function, with observed value denoted as $\tilde{p}_{obs}$. Because the p-values contained in $\tilde{P}$ are correlated, the permutation distribution, described next, is used to obtain the p-value of $\tilde{p}_{obs}$ (Edwards 1999; Dudoit et al. 2003; Ganju et al. 2013; Ganju and Ma 2014). The procedure ensures weak control of the type I error rate.
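Steps (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`two_sided_p`, `nested_subsets`, `subset_p_values`) and the toy data are made up here, and the p-value comes from a normal approximation to a difference in means rather than from the fitted models the article uses.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(y, z):
    """Two-sided p-value for the treated-minus-control mean difference.

    Assumption of this sketch: a normal approximation replaces the
    model-based p-values used in the article.
    """
    treated = [yi for yi, zi in zip(y, z) if zi == 1]
    control = [yi for yi, zi in zip(y, z) if zi == 0]
    if len(treated) < 2 or len(control) < 2:
        return 1.0  # too few observations to estimate a variance
    mean_t = sum(treated) / len(treated)
    mean_c = sum(control) / len(control)
    var_t = sum((v - mean_t) ** 2 for v in treated) / (len(treated) - 1)
    var_c = sum((v - mean_c) ** 2 for v in control) / (len(control) - 1)
    se = sqrt(var_t / len(treated) + var_c / len(control))
    if se == 0.0:
        return 1.0
    return 2.0 * (1.0 - NormalDist().cdf(abs(mean_t - mean_c) / se))


def nested_subsets(subgroup, order):
    """Step 1: index sets for m_1, m_{1,2}, ..., m_{1,...,J}."""
    included, subsets = set(), []
    for label in order:  # order lists the largest expected effect size first
        included.add(label)
        subsets.append([i for i, g in enumerate(subgroup) if g in included])
    return subsets


def subset_p_values(y, z, subsets):
    """Step 2: one p-value per pre-specified subset."""
    return [two_sided_p([y[i] for i in idx], [z[i] for i in idx])
            for idx in subsets]


# Made-up toy data (not from the article): 12 subjects, 3 subgroups.
subgroup = ["severe"] * 4 + ["moderate"] * 4 + ["mild"] * 4
z = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
y = [2.1, 0.3, 0.9, 1.0, 1.2, 0.4, 0.7, 0.9, 0.5, 0.4, 0.2, 0.6]

subsets = nested_subsets(subgroup, order=["severe", "moderate", "mild"])
p_values = subset_p_values(y, z, subsets)
print([len(s) for s in subsets], p_values)
```

The nested construction mirrors the emphasis on a sequence of nested subsets; any other pre-specified collection of subsets could be passed to `subset_p_values` in the same way.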

Let $p_\alpha(\tilde{m}, \boldsymbol{y}, \boldsymbol{z})$ denote the $\alpha$th percentile of the distribution of $p(\tilde{m}, \boldsymbol{y}, \boldsymbol{Z})$; note that $\boldsymbol{Z}$ is random under $p(\tilde{m}, \boldsymbol{y}, \boldsymbol{Z})$. Under $\mathcal{H}$, $p(\tilde{m}, \boldsymbol{y}, \boldsymbol{Z})$ is a level $\alpha$ test if

$$\Pr\{p(\tilde{m}, \boldsymbol{y}, \boldsymbol{Z}) \le p_\alpha(\tilde{m}, \boldsymbol{y}, \boldsymbol{z})\} \le \alpha.$$

The above probability can be rewritten as

$$\sum_{\boldsymbol{z}} \Pr\{\boldsymbol{Z} = \boldsymbol{z}\}\, I\{p(\tilde{m}, \boldsymbol{y}, \boldsymbol{z}) \le p_\alpha(\tilde{m}, \boldsymbol{y}, \boldsymbol{z})\} \le \alpha,$$

where $I\{\cdot\}$ is the indicator function and the summation is over all $\boldsymbol{z}$ in $\boldsymbol{\Omega}$.

The permutation-based p-value of $\tilde{p}_{obs}$ is

$$\Pr\{p(\tilde{m}, \boldsymbol{y}, \boldsymbol{Z}) < \tilde{p}_{obs}\} = \sum_{\boldsymbol{z}} \Pr\{\boldsymbol{Z} = \boldsymbol{z}\}\, I\{p(\tilde{m}, \boldsymbol{y}, \boldsymbol{z}) < \tilde{p}_{obs}\}. \tag{1}$$

$\mathcal{H}$ is rejected when $\Pr\{p(\tilde{m}, \boldsymbol{y}, \boldsymbol{Z}) < \tilde{p}_{obs}\} \le \alpha$ or, equivalently, when $\tilde{p}_{obs} \le p_\alpha(\tilde{m}, \boldsymbol{y}, \boldsymbol{z})$.

We choose $p(\tilde{m}, \boldsymbol{y}, \boldsymbol{z}) = \min(\tilde{P})$ as the p-value combining function. For simplicity in notation, we let $\mathit{minP}$ denote $\min(\tilde{P})$, and $p_{\mathit{minP}}$ denote the p-value of $\tilde{p}_{obs}$. Under a repeated sampling framework and assuming independence between p-values, it is known that $\mathit{minP}$ follows a $Beta(1, K)$ distribution, where $K$ denotes the number of p-values included in $\tilde{P}$ (David 1980). However, the $Beta$ distribution does not hold when the p-values are correlated; hence the reliance on the permutation distribution.
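For reference, the independent case can be worked out explicitly. The derivation below is the standard order-statistics result for the minimum of $K$ independent uniform p-values, not a formula taken from the article; the symbol $x_\alpha$ is notation introduced here for the resulting rejection threshold.

```latex
\Pr\{\mathit{minP} \le x\}
  = 1 - \prod_{k=1}^{K} \Pr\{P_k > x\}
  = 1 - (1 - x)^{K}, \qquad 0 \le x \le 1,
\qquad\text{so}\qquad
x_\alpha = 1 - (1 - \alpha)^{1/K}.

% Worked values for \alpha = 0.05:
%   K = 2:  x_\alpha = 1 - 0.95^{1/2} \approx 0.0253
%   K = 3:  x_\alpha = 1 - 0.95^{1/3} \approx 0.0170
```

With nested subsets the p-values share subjects and are typically positively correlated, so a threshold derived under independence would generally be too strict, which is consistent with calibrating $\mathit{minP}$ by permutation instead.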

With hierarchically organized subsets of data, inference works as follows. If $p_{\mathit{minP}} \le \alpha$, then the null hypothesis associated with the subgroup that yielded $\tilde{p}_{obs}$ can be rejected. Because the subsets of data are incrementally hierarchical, the null hypotheses associated with subgroups containing the subgroup that yielded $\tilde{p}_{obs}$, which includes the overall sample, can also be rejected. If $p_{\mathit{minP}} > \alpha$, then no subgroup can be declared significant. For example, if the null hypothesis associated with $M_1$, $\mathcal{H}(M_1)$, is not true, it implies that the null associated with $M_{1,2}$, $\mathcal{H}(M_{1,2})$, is not true. However, $\mathcal{H}(M_{1,2})$ not true does not imply $\mathcal{H}(M_1)$ not true. To be specific, when $\tilde{m} = \{m_1, m_{1,2}, m_{1,2,3}\}$, then if:

(i) $p_{\mathit{minP}} \le \alpha$ and $p_1 = \min(p_1, p_{1,2}, p_{1,2,3})$, $\mathcal{H}(M_1)$, $\mathcal{H}(M_{1,2})$, and $\mathcal{H}(M_{1,2,3})$ can be rejected, even if $p_{1,2}$ and $p_{1,2,3} > \alpha$.

(ii) $p_{\mathit{minP}} \le \alpha$ and $p_{1,2} = \min(p_1, p_{1,2}, p_{1,2,3})$, $\mathcal{H}(M_{1,2})$ and $\mathcal{H}(M_{1,2,3})$ can be rejected. However, $\mathcal{H}(M_1)$ cannot be rejected.

(iii) $p_{\mathit{minP}} \le \alpha$ and $p_{1,2,3} = \min(p_1, p_{1,2}, p_{1,2,3})$, $\mathcal{H}(M_{1,2,3})$ can be rejected. However, $\mathcal{H}(M_1)$ and $\mathcal{H}(M_{1,2})$ cannot be rejected.

(iv) $p_{\mathit{minP}} > \alpha$, no null can be rejected.

SIMULATION RESULTS

In this section we show how the power of the proposed test for the overall sample compares with

conventional tests. It is convenient to demonstrate the power for linear models for which the true model for subgroup $j$ is

$$y = a + b_j z + \varepsilon, \tag{2}$$

where $a$ denotes a constant, $b_j$ denotes the treatment effect (difference in means) for the $j$th subgroup, and $\varepsilon$ denotes the error term. When a subset of data includes $J \ge 2$ subgroups, the model fit to the data may be

$$y = a + bz + \sum_{i=1}^{J-1} c_i x_i + \varepsilon, \tag{3}$$

or

$$y = a + bz + \sum_{i=1}^{J-1} c_i x_i + \sum_{i=1}^{J-1} d_i x_i z + \varepsilon, \tag{4}$$

where $b$ denotes the weighted average treatment effect across the $J$ subgroups. The term $x_i$ identifies the subgroups with $J$ levels. It is parameterized so that it has $J - 1$ degrees of freedom (e.g. the factor 'sex' has two levels, but has one degree of freedom; $x$ may take the value -1 for females and 1 for males). The coefficients $c_i$ and $d_i$ are associated with each of the $J - 1$ degrees of freedom of $x_i$ and $x_i z$, respectively.

Next we evaluate the power of $\mathit{minP}$ compared to the power of the conventional test. The conventional test refers to the test statistic for treatment from fitting models (3) or (4), as appropriate. Under the assumption of normally distributed errors, this test statistic follows a t-distribution. Henceforth, we refer to the conventional test as the t test. When model (4) is fit, we use the test for treatment associated with the Type II method, i.e. the treatment effect is not adjusted for interaction but the error variance is estimated after removing the effect of interaction

(Fleiss 1986). An explicit formula for the conventional t-test (using the Type II method) from fitting models (3) or (4) is in Ganju and Mehrotra (2003), which they denote as $t_U$. Under interaction and random subgroup sizes, the size of the t test is inflated (Ganju and Mehrotra 2003), but for the cases considered the inflation is slight. We thus treat the power from the interaction model as the correct estimated power.

We consider trials with 2 and 3 subgroups for illustration. Data are generated and power is calculated as follows: $S$ datasets are generated according to (4) with random treatment assignment and with sample size per treatment equal to $N/2$. Results are invariant to the choice of $a$, $c_i$, and $d_i$. Within each dataset each randomly selected subject has an equal probability of belonging to any of the subgroups, so the subgroups are of the same size on average (a few cases with unequal probabilities are also included). The errors $\varepsilon$ follow a unit normal distribution.

For each run of the simulation, $\tilde{p}_{obs}$ is calculated from $\tilde{m}$. Calculating $p_{\mathit{minP}}$ as per (1) requires calculation of $\Pr\{\boldsymbol{Z} = \boldsymbol{z}\}$ via $\boldsymbol{\Omega}$, which is computationally prohibitive. Similar results can be obtained while saving much computational time by selecting random samples from $\boldsymbol{\Omega}$ from which $\Pr\{\boldsymbol{Z} = \boldsymbol{z}\}$ is estimated (Edwards 1999; Ganju and Ma 2014). Let $L$ denote the number of permutation-based datasets generated for each simulation run. For each of the $L$ datasets the treatment assignment $\boldsymbol{Z}$ is randomly permuted to generate the permutation distribution, and all else is held constant (Dudoit et al. 2003; Ganju et al. 2013). For the $i$th simulation run, let $p_{\mathit{minP}_i}$ denote the permutation-based p-value of $\mathit{minP}$. Expressed as a percentage, power for the procedure is estimated as $\sum_{i=1}^{S} I\{p_{\mathit{minP}_i} \le \alpha\} \times 100 / S$. Results shown in Tables 2 and 3 are based on $S = 2000$, $L = 1000$, two-sided p-values and $\alpha = 0.05$ (except for the null case in Table 2 for which $S = 5000$ and $L = 1000$).
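The simulation loop described above might be sketched as follows. Several simplifications are assumptions of this sketch rather than the article's implementation: p-values come from a normal approximation to an unadjusted difference in means instead of the t test from models (3) or (4), the constants $a$, $c_i$ are set to zero, and the function names are hypothetical. The strict inequality in `permutation_p_value` follows equation (1). The run sizes are deliberately small; the article uses $S = 2000$ and $L = 1000$.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_sided_p(y, z):
    """Normal-approximation p-value for a difference in means (sketch only)."""
    treated = [yi for yi, zi in zip(y, z) if zi == 1]
    control = [yi for yi, zi in zip(y, z) if zi == 0]
    if len(treated) < 2 or len(control) < 2:
        return 1.0
    mean_t = sum(treated) / len(treated)
    mean_c = sum(control) / len(control)
    var_t = sum((v - mean_t) ** 2 for v in treated) / (len(treated) - 1)
    var_c = sum((v - mean_c) ** 2 for v in control) / (len(control) - 1)
    se = sqrt(var_t / len(treated) + var_c / len(control))
    if se == 0.0:
        return 1.0
    return 2.0 * (1.0 - NormalDist().cdf(abs(mean_t - mean_c) / se))


def min_p(y, z, subsets):
    """Combining function: smallest p-value over the pre-specified subsets."""
    return min(two_sided_p([y[i] for i in idx], [z[i] for i in idx])
               for idx in subsets)


def permutation_p_value(y, z, subsets, n_perm, rng):
    """Monte Carlo estimate of equation (1) from n_perm random allocations."""
    observed = min_p(y, z, subsets)
    z_perm = list(z)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(z_perm)  # permute treatment labels, hold all else fixed
        if min_p(y, z_perm, subsets) < observed:
            hits += 1
    return hits / n_perm


def simulate_trial(n_total, effects, rng):
    """One dataset: 1:1 allocation, equiprobable subgroups, N(0, 1) errors."""
    z = [1] * (n_total // 2) + [0] * (n_total - n_total // 2)
    rng.shuffle(z)
    subgroup = [rng.randrange(len(effects)) for _ in range(n_total)]
    y = [effects[g] * zi + rng.gauss(0.0, 1.0) for g, zi in zip(subgroup, z)]
    return y, z, subgroup


def estimate_power(n_total, effects, n_sim, n_perm, alpha=0.05, seed=0):
    """Percentage of simulated trials in which the minP test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        y, z, subgroup = simulate_trial(n_total, effects, rng)
        # Nested subsets m_1, m_{1,2}, ...; subgroup 0 has the largest effect.
        subsets = [[i for i, g in enumerate(subgroup) if g <= j]
                   for j in range(len(effects))]
        if permutation_p_value(y, z, subsets, n_perm, rng) <= alpha:
            rejections += 1
    return 100.0 * rejections / n_sim


power = estimate_power(n_total=80, effects=[0.75, 0.25], n_sim=50, n_perm=200)
print(f"estimated power of minP: {power:.0f}%")
```

With so few simulated trials the estimate carries substantial Monte Carlo error, so it should not be expected to reproduce the tabulated values.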

Two Subgroups: Table 2 shows the power when the total sample size $N = 80$ (or $N = 40$ per treatment group) and $\tilde{m} = \{m_1, m_{1,2}\}$, $\tilde{m} = \{m_2, m_{1,2}\}$, and $\tilde{m} = \{m_1, m_2, m_{1,2}\}$, where, as noted earlier, $m_1$ is the subgroup with the largest effect size, $m_2$ is the subgroup with the next largest effect size and $m_{1,2}$ denotes the overall sample. For large values of $b_1 - b_2$, or equivalently, when the power of the test for $m_1$ is much larger than the power of the same test for $m_2$, $\tilde{m} = \{m_1, m_{1,2}\}$ is the desired set. However, for comparison purposes, power is also provided for $\tilde{m} = \{m_2, m_{1,2}\}$ and $\tilde{m} = \{m_1, m_2, m_{1,2}\}$. We compare the power of $\mathit{minP}$ with the t test derived from fitting models (3) or (4) on the entire dataset. Also included in Table 2 is the power of the model based on $m_1$.

When $\tilde{m} = \{m_1, m_{1,2}\}$, data from individual subjects are included an unequal number of times in the analysis. Subjects who belong to $m_1$ are also included in $m_{1,2}$, and are thus counted twice, whereas subjects belonging to $m_2$ are only counted once, and that is in $m_{1,2}$. The procedure thus weights subject data unequally: those who belong to the subgroup with a larger effect size are given a larger weight than those who belong to the subgroup with a smaller effect size.

[Table 2 here]

Similarly, with $\tilde{m} = \{m_2, m_{1,2}\}$, subjects belonging to $m_2$ are counted twice, and those belonging to $m_1$ are counted once. However, this choice of $\tilde{m}$ is not advisable because it includes $m_2$, which has lower power than $m_1$. With $\tilde{m} = \{m_1, m_2, m_{1,2}\}$, each subject is counted twice.
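The implicit weighting can be made concrete by counting how many of the chosen analyses each subgroup's subjects enter. The function name and the tuple encoding of subsets below are illustrative choices, not notation from the article.

```python
from collections import Counter


def inclusion_counts(labels, chosen_subsets):
    """Number of analyses that include each subgroup's subjects.

    labels:         subgroup labels, e.g. [1, 2]
    chosen_subsets: tuples of labels, e.g. (1, 2) stands for m_{1,2}
    """
    counts = Counter()
    for subset in chosen_subsets:
        for label in subset:
            counts[label] += 1
    return {label: counts[label] for label in labels}


print(inclusion_counts([1, 2], [(1,), (1, 2)]))        # {1: 2, 2: 1}
print(inclusion_counts([1, 2], [(2,), (1, 2)]))        # {1: 1, 2: 2}
print(inclusion_counts([1, 2], [(1,), (2,), (1, 2)]))  # {1: 2, 2: 2}
```

The three calls correspond to the three choices of subsets discussed above: subgroup 1 counted twice, subgroup 2 counted twice, and every subject counted twice.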

When $\tilde{m} = \{m_1, m_{1,2}\}$, the power of $\mathit{minP}$ increases as the magnitude of $b_1 - b_2$ increases, or equivalently as the power of the test applied to $m_1$ increases relative to $m_2$. When the effect sizes are the same between subgroups or if the difference in the effect size is small, there is loss in power. For example, when $b_1 = 0.75$ and $b_2 = 0.65$, the t test yields more power (86% vs 83%). However, for larger differences between $b_1$ and $b_2$ shown in the table, $\mathit{minP}$ yields more power. For example, when $b_1 = 0.75$ and $b_2 = 0.25$, the t test has 59% power (under the interaction model), and $\mathit{minP}$ has 65% power.

When the sample sizes per subgroup are unequal, the results depend to a large extent on whether the larger effect size is associated with the larger or smaller subgroup. For example, as shown in Table 2, when $b_1 = 0.75$, $b_2 = 0.25$ with sample sizes 54 and 26, respectively, the power with the conventional test is 75%. With $\mathit{minP}$ the power is 79% when $\tilde{m} = \{m_1, m_{1,2}\}$. When the sample sizes are reversed, the power with the conventional test and with $\mathit{minP}$ equal 42% and 47%, respectively.

To illustrate the loss in power when an incorrect ordering is pre-specified, consider what happens when $\tilde{m} = \{m_2, m_{1,2}\}$, $b_1 = 0.75$ and $b_2 = 0.25$. The power with $\mathit{minP}$ is 49%, which compares very unfavorably with the 59% power from the t test. The loss in power with $\mathit{minP}$ is greater for larger values of $b_1 - b_2$. Thus, identification of the right set of subgroups for inclusion in the analysis is important.
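A rough calculation helps show why the choice of subsets matters in the configuration $b_1 = 0.75$, $b_2 = 0.25$, $N = 80$. The sketch below uses a textbook normal approximation to the power of a two-sample comparison of means; it assumes two equally sized subgroups of fixed size and a known unit standard deviation, which differs from the simulation design (random subgroup sizes, t test), so the numbers are only indicative. The function name `approx_power` is introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approx_power(effect, n_per_arm, alpha=0.05, sd=1.0):
    """Normal-approximation power of a two-sided two-sample mean comparison."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    ncp = effect / (sd * sqrt(2.0 / n_per_arm))  # standardized mean difference
    return STD_NORMAL.cdf(ncp - z_crit) + STD_NORMAL.cdf(-ncp - z_crit)


b1, b2 = 0.75, 0.25
n_per_arm_overall = 40  # N = 80 randomized 1:1
n_per_arm_subgroup = 20  # assumes two equally sized subgroups of fixed size

power_m1 = approx_power(b1, n_per_arm_subgroup)
power_m2 = approx_power(b2, n_per_arm_subgroup)
power_overall = approx_power((b1 + b2) / 2.0, n_per_arm_overall)

print(f"m1: {power_m1:.2f}  m2: {power_m2:.2f}  overall: {power_overall:.2f}")
```

Under these assumptions the test on $m_1$ alone is somewhat more powerful than the overall test (whose approximate power is close to 60%), while the test on $m_2$ alone has very little power, which is in line with the gain from including $m_1$ and the loss from including $m_2$.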

Power is also shown when $\tilde{m} = \{m_1, m_2, m_{1,2}\}$. Here each subject is counted twice. It is interesting to observe that the method based on counting each subject twice can give greater power than the method based on counting each subject once (i.e. the conventional test). For example, when $b_1 = 1$ and $b_2 = 0.2$, the proposed approach has 81% power whereas the t test has 74% power. For $\mathit{minP}$ to have more power than the t test, the magnitude of $b_1 - b_2$ has to be large; otherwise it has less power.

Although the performance of different combining functions has been studied under independence (Birnbaum 1954; Loughlin 2004) and dependence (Ganju and Ma 2014) of p-values, the conclusion is that no one method is uniformly the best. Next we briefly discuss the performance of one of the different combining functions, Fisher's combination test (FCT), which is a function of the product of the p-values contained in $\tilde{P}$. For instance, for $\tilde{m} = \{m_1, m_{1,2}\}$, FCT is $p(\tilde{m}, \boldsymbol{y}, \boldsymbol{z}) = -2\ln(p_1 \times p_{1,2})$. Under independence of p-values FCT follows a $\chi^2_{2K}$ distribution, where $K$ denotes the number of p-values in $\tilde{P}$ (Fisher 1932). In our case, because of correlated p-values, power for FCT was obtained in the same way as for $\mathit{minP}$ using the permutation method. For $b_1 = 0.75$ and $b_2 = 0.25$, and for $\tilde{m} = \{m_1, m_{1,2}\}$, $\tilde{m} = \{m_2, m_{1,2}\}$, and $\tilde{m} = \{m_1, m_2, m_{1,2}\}$, power (%) with FCT equals (for comparison, the power of $\mathit{minP}$ is shown parenthetically), respectively, 68 (65), 41 (49), 59 (61). As shown for this configuration and for others (results not included), neither statistic is uniformly better than the other.

Three Subgroups: Power for the test is shown when $N = 120$ (or $N = 40$ per treatment group)

and $\tilde{m} = \{m_1, m_{1,2}, m_{1,2,3}\}$ for various values of $b_1$, $b_2$, and $b_3$ in Table 3. As shown for the case with two subgroups, there is loss in power with the $\mathit{minP}$ procedure compared to the t test if the treatment effect across subgroups is not much different. For example, when $b_1 = b_2 = b_3 = 0.55$ (i.e. no interaction), the t test has 84% power and the $\mathit{minP}$ test has 78% power. Simulation results also indicate that if the power of the test associated with either $m_1$ or $m_{1,2}$ is larger than the power of the test associated with $m_{1,2,3}$, the power of $\mathit{minP}$ is also larger. It is instructive to compare the power for configurations $b_1 = 0.9$, $b_2 = 0.4$, $b_3 = 0.2$, and $b_1 = 1.0$, $b_2 = 0.3$, $b_3 = 0.2$. Because the subgroups are equally sized, both configurations yield the same overall treatment effect of 0.5, and thus the same power for $m_{1,2,3}$ (77%). Likewise with $m_{1,2}$, the treatment effect is 0.65 and the power is the same (81%). However, with $\mathit{minP}$ the latter configuration gives more power than the former (86% vs 83%), and that is because the power of the test associated with $m_1$ is also higher (86% vs 78%).

[Table 3 here]

EXAMPLE

Table 4 shows hypothetical data from a randomized clinical trial involving two treatments and two subgroups. The response variable is whether or not a patient has healed. A total of 200 patients are randomized in a 1:1 ratio to treatment or placebo. The data are from Lachin (2000; pages 90-91). The example has three subgroups; however, because of small sample sizes we combine the first two subgroups into one subgroup. We label the first subgroup 'mild disease severity' and the second 'moderate disease severity'. We assume randomization was not stratified by disease severity.

[Table 4 here]

Let $m_1$ and $m_2$, respectively, represent datasets associated with the moderate ($M_1$) and mild ($M_2$) subgroups. The one-sided p-values are 0.011, 0.358 and 0.033 for $M_1$, $M_2$, and $M_{1,2}$ from a logistic regression analysis, where for $M_{1,2}$ the analysis is adjusted for disease severity. The model including the interaction between treatment and subgroup gives a similar p-value to the model without interaction (0.034). All model-based p-values are from the Wald statistic. For declaring significance, one-sided p-values are compared to $\alpha = 0.025$. With the conventional method, treatment is not statistically superior to placebo because the Wald test p-value of 0.033 > 0.025.

The p-values for $\mathit{minP}$ based on $\tilde{m} = \{m_1, m_{1,2}\}$, $\tilde{m} = \{m_2, m_{1,2}\}$, and $\tilde{m} = \{m_1, m_2, m_{1,2}\}$ equal, respectively, 0.020, 0.065, and 0.036. Each p-value is based on 5000 permutation datasets. The conclusion, for example, with the choice of $\tilde{m} = \{m_1, m_{1,2}\}$ is that the p-value of 0.020 is small enough to suggest that treatment is statistically superior to placebo in the subgroup with moderate disease severity and in the overall sample. However, because the benefit with treatment in the mild subgroup is very weak, it is reasonable to ask if treatment is efficacious in the entire sample. That discussion is deferred to the next section. If $\tilde{m} = \{m_2, m_{1,2}\}$ or $\tilde{m} = \{m_1, m_2, m_{1,2}\}$, treatment cannot be judged statistically superior to placebo in any represented subgroup or in the overall sample.
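The inference rules for nested subsets can be written as a small function and applied to the p-values reported in this example. The function name and the dictionary encoding are illustrative assumptions; the function presumes the subsets are listed from smallest to largest and form a nested chain.

```python
def rejected_hypotheses(subset_p, p_min_p, alpha):
    """Nulls that can be rejected for a nested chain of subsets.

    subset_p: dict of model-based p-values keyed by subset name, listed from
              the smallest subset to the largest (insertion order is used)
    p_min_p:  permutation-based p-value of minP
    alpha:    significance level
    """
    if p_min_p > alpha:
        return []  # no null can be rejected
    names = list(subset_p)
    smallest = min(names, key=lambda name: subset_p[name])
    # Reject the null for the subset attaining minP and for every subset
    # that contains it, which includes the overall sample.
    return names[names.index(smallest):]


# p-values reported in the example, one-sided, alpha = 0.025
print(rejected_hypotheses({"M1": 0.011, "M1,2": 0.033}, 0.020, 0.025))
# ['M1', 'M1,2']
print(rejected_hypotheses({"M2": 0.358, "M1,2": 0.033}, 0.065, 0.025))
# []
```

The first call reproduces the conclusion stated above for the moderate subgroup and the overall sample; the second shows that nothing is rejected when the mild subgroup is paired with the overall sample.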

REMARKS

To avoid bias, it is necessary to emphasize that the subgroups should be pre-specified in the same way that analytical methods are fully described before unblinding the trial. Two limitations to the interpretation of results that apply to the traditional method, and that would also apply here, are now mentioned. The first is when qualitative interaction (Gail and Simon 1985) between treatment and the subgroup factor is present. In such a case, regardless of the method of analysis, the results are ambiguous and difficult to interpret. The other concerns the appropriateness of making claims on the overall sample when some subgroups show no effect. In the context of biomarker positive and negative patients, Rothmann et al. (2012) correctly note that if only patients who are biomarker positive benefit from treatment and those who are biomarker negative do not, then it is not appropriate to claim benefit in the overall sample because of a statistically favorable result in the overall sample. A strong treatment effect in the biomarker-positive sample coupled with a null effect in the biomarker-negative sample can still give a small enough p-value in the overall sample to declare success. But it would not be correct to conclude that all patients, regardless of biomarker positive or negative status, benefit from treatment. Any claim of benefit should be limited to just the biomarker positive patients.

With other types of subgroups, similar restrictions in interpretation may or may not apply. Example 1 of Table 1 shows all subgroups benefitting from treatment but that is not the case in Examples 2 and 3. The estimated treatment effects in the last subgroup in Example 2, and in the last two subgroups in Example 3, are too small to suggest benefit. The implication with the last subgroup in Example 1 is that a trial done only in patients with low baseline HbA1c levels will demonstrate a meaningful reduction in HbA1c levels. With Examples 2 and 3, whether or not claims of benefit would generalize to the entire sample is often a subject-matter specific discussion which is out of the scope of the article. A successful result interpreted to include the overall sample rests on the point of view that a large enough sample size in the subgroups with

weak effects would have demonstrated meaningful benefit. This judgment, however, is independent of the method of analysis. This consideration in interpreting results across all subgroups applies to the conventional test and to the test that we propose. The above discussion applies when trials are interpreted after data are observed. For prospective evaluation of effect sizes in subgroups and the overall sample, see Millen et al. (2012 and 2014). Their framework requires designing the trial such that valid claims of benefit can be made across the subgroups and the overall sample.

The statistical literature on subgroups is replete with the admonition to not take the observed data at face value, particularly when the subgroups are not called out in advance (Wittes 2009; Yusuf et al. 1991). Methods for adjusting p-values have been proposed to avoid an excess of false positives (Bristol 1997; Koch 1997; Biesheuvel and Hothorn 2003). The context is usually the transition to interpreting subgroups after evaluating the overall result. This article presents a statistic for joint testing in subgroups and in the overall sample. It is premised on the observation that in many therapeutic areas the effect size varies systematically by patient subgroup, which the conventional method of analysis does not exploit. The proposed method explicitly recognizes the ordering in effect size by subgroup, and is shown to have moderate sized improvements in power for the combined sample when the differences in effect sizes between the subgroups are large. Conversely, and as expected, the method also loses power if the wrong subgroup gets selected. Our recommendation is to use the method as an alternative to the conventional method of analysis when there is joint interest in rejecting the null in the overall sample and in subgroups.

One criticism of the method might be the role played by the stronger subgroups in generalizing the result across the overall sample. A significant result in the overall sample might lead to the conclusion that treatment is efficacious in the weaker subgroups. Our response to this is two-fold. First, note that the concern extends even to the traditional analysis. The method does not preclude the conclusion that treatment is not efficacious in a weaker subgroup even if the result is significant in the overall sample. The points made by Rothmann et al. (2012) for a traditional analysis, as noted earlier, apply to the proposed analysis as well. The second is to understand the context for use of the method. When treatment has the potential to provide meaningful benefit in all subgroups, the proposed statistic has the potential to yield more power than the conventional test statistic to reject the null hypothesis in the overall sample while simultaneously allowing for rejecting the null in pre-specified subgroups.

The proposed testing procedure does not provide an unbiased estimate of the treatment effect, although it is bracketed by the estimated treatment effects observed with the individual datasets contained in $\tilde{m}$. While the article provided results for two or three subgroups for linear models, extensions to more than three subgroups and to non-linear models follow directly from the three-step procedure noted in Section 2.

Acknowledgments

We thank two reviewers for a careful review. One reviewer in particular provided us with excellent suggestions for which we are grateful.

REFERENCES

Biesheuvel, E.H.E., Hothorn, L.A. (2003). Protocol designed subgroup analyses in multiarmed clinical trials: multiplicity aspects. Journal of Biopharmaceutical Statistics 13: 663-673.

Birnbaum, A. (1954). Combining independent tests of significance. Journal of the American Statistical Association 49: 559-575.

Bristol, D.R. (1997). p-value adjustments for subgroup analyses. Journal of Biopharmaceutical Statistics 7: 313-321.

Canagliflozin briefing book for FDA Advisory Committee meeting (2013). http://www.fda.gov/downloads/AdvisoryCommittees/CommitteesMeetingMaterials/Drugs/EndocrinologicandMetabolicDrugsAdvisoryCommittee/UCM334551.pdf

Corwin, H.L., Gettinger, A., Fabian, T.C., May, A., Pearl, R.G., Heard, S., An, R., Bowers, P.J., Burton, P., Klausner, M.A., Corwin, M.J. (2007). Efficacy and safety of epoetin alpha in critically ill patients. New England Journal of Medicine 357: 965-976.

David, H.A. (1980). Order Statistics. Wiley.

Dudoit, S., Shaffer, J.P., Boldrick, J.C. (2003). Multiple hypothesis testing in microarray experiments. Statistical Science 18: 71-103.

Edwards, D. (1999). On model pre-specification in confirmatory randomized studies. Statistics in Medicine 18: 771-785.

Fisher, R.A. (1932). Statistical Methods for Research Workers, 4th edition. Oliver and Boyd: London.

Fleiss, J.L. (1986). The design and analysis of clinical experiments. Wiley.

Gail, M., Simon, R. (1985). Testing for qualitative interaction between treatment effects and patient subsets. Biometrics 41: 361-372.

Ganju, J., Ma, J. (2014). The potential for increased power from combining p-values testing the same hypothesis. Statistical Methods in Medical Research (e-publication ahead of print; June 11, 2014).

Ganju, J., Mehrotra, D.V. (2003). Stratified experiments re-examined with emphasis on multicenter trials. Controlled Clinical Trials 24: 167-181. Correction: 2003, page 830.

Ganju, J., Yu, X., Ma, J. (2013). Robust inference from multiple statistics via Permutations: A better alternative to the single statistic approach. Pharmaceutical Statistics 12: 282-290.

Koch, G.G. (1997). Discussion of “p-value adjustments for subgroup analyses.” Journal of Biopharmaceutical Statistics 7: 323-331.

Lachin, J.M. (2000). Biostatistical Methods: The assessment of relative risks. Wiley.

Loughin, T.M. (2004). A systematic comparison of methods for combining p-values from independent tests. Computational Statistics and Data Analysis 47: 467-485.

Millen, B.A., Dmitrienko, A., Ruberg, S., Shen, L. (2012). A statistical framework for decision making in confirmatory multipopulation tailoring clinical trials. Therapeutic Innovation and Regulatory Science 46: 647-656.

Millen, B.A., Dmitrienko, A., Song, G. (2014). Bayesian assessment of the influence and interaction condition in multipopulation tailoring clinical trials. Journal of Biopharmaceutical Statistics 24: 94-109.

Rosenbaum, P.R. (2002). Observational studies. Springer-Verlag.

Rothmann, M.D., Zhang, J.J., Lu, L., Fleming, T.R. (2012). Testing in a pre-specified subgroup and the intent-to-treat population. Drug Information Journal 46: 175-179.

SOLVD Investigators. (1992). Effect of enalapril on mortality and the development of heart failure in asymptomatic patients with reduced left ventricular ejection fractions. New England Journal of Medicine 327: 685-691.

Wittes, J. (2009). On looking at subgroups. Circulation 119: 912-915.

Yusuf, S., Wittes, J., Probstfield, J., Tyroler, H.A. (1991). Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. Journal of the American Medical Association 266: 93-98.

[Table fragment. HbA1c; estimator is difference in means. Columns: Baseline, Estimate, 95% CI. Recovered estimates (95% CI): -0.45 (-0.52, -0.38); -0.78 (-0.90, -0.66). Recovered subgroup label: 8 - < 9%.]
