MAIN PAPER (wileyonlinelibrary.com) DOI: 10.1002/pst.1609

Published online 5 February 2014 in Wiley Online Library

Sample size determination for the weighted log-rank test with the Fleming–Harrington class of weights in cancer vaccine studies

Takahiro Hasegawa*

In recent years, immunological science has evolved, and cancer vaccines are available for treating existing cancers. Because cancer vaccines require time to elicit an immune response, a delayed treatment effect is expected. Accordingly, the use of weighted log-rank tests with the Fleming–Harrington class of weights is proposed for evaluation of survival endpoints. We present a method for calculating the sample size under the assumption of a piecewise exponential distribution for the cancer vaccine group and an exponential distribution for the placebo group as the survival model. The impact of delayed effect timing on both the choice of the Fleming–Harrington weights and the increment in the required number of events is discussed. Copyright © 2014 John Wiley & Sons, Ltd.

Keywords: sample size; weighted log-rank test; delayed treatment effect; piecewise exponential distribution; Fleming–Harrington class of weights

1. INTRODUCTION


Cancer is usually treated using chemotherapy, radiotherapy, or surgery. In recent years, immunological science has evolved, and cancer vaccines are available for treating existing cancers. For example, sipuleucel-T (Provenge®) for metastatic hormone-refractory prostate cancer and ipilimumab (Yervoy®) for late-stage melanoma have been approved following clinical proof of success achieved in controlled randomized phase 3 trials meeting primary survival endpoints [1,2]. The mechanism of action of most cancer vaccines is considered to be mediated by the amplification of cytotoxic T lymphocytes. Because cancer vaccines require time to elicit an immune response, a delayed treatment effect is expected in patients who have received these vaccines according to the Food and Drug Administration draft guideline [3] and the guide of the Japan Society for Biological Therapy [4]. Because of this delayed effect, the overall survival curves of the trial results may show no effect for the initial portion of the study. If the treatment is effective, separation of the curves may occur later in the study after the vaccine’s effect has been established. This delayed treatment effect was in fact observed in phase 3 studies of both sipuleucel-T and ipilimumab [1,2].

This situation for cancer vaccines may violate the assumptions necessary for the Cox proportional hazards model and lead to substantial loss of the statistical power of conventional methods such as the log-rank test and generalized Wilcoxon test [3–5]. For this reason, the use of a weighted log-rank test with the Fleming–Harrington $G^{\rho,\gamma}$ class of weights [6] has been proposed as an alternative analysis method for survival endpoints [4,7]. When confirmatory studies of a cancer vaccine with primary survival endpoints are designed, the sample size should be determined based on a weighted log-rank test with the Fleming–Harrington class of weights.

In this sample size determination, we need to estimate the timing of delayed separation during measurement and construct a delayed treatment effect model. Fine [7] proposed the assumption of a piecewise exponential distribution for the cancer vaccine group and an exponential distribution for the placebo group as the survival model. The appropriate weight required to detect the treatment effect then depends on the delayed onset time relative to the median survival in the placebo group in terms of power. Although the weighted log-rank test with the Fleming–Harrington class of weights can be easily performed by the LIFETEST procedure in SAS/STAT® 9.1 or later [8] or the FHtest package in R [9,10], sample size determination for the $G^{\rho,\gamma}$ class of weights is not available as a built-in function, and this calculation method is not clearly defined in any report known to us. In addition, properties of the $G^{\rho,\gamma}$ class of weights for sample size determination may not have been sufficiently discussed.

The purpose of this paper is to show the sample size calculation method for the weighted log-rank test with the Fleming–Harrington $G^{\rho,\gamma}$ class of weights, extending Cantor’s calculation [11] and implementing the method described by Lakatos [12]. Section 2 describes the delayed treatment effect model proposed by Fine [7] and the weighted log-rank test with the Fleming–Harrington class of weights. Section 3 presents the details of sample size determination. Section 4 examines the average power under the sample size obtained in Section 3 via Monte Carlo simulation and compares it with the sample size based on the average hazard ratio. Section 5 explores the properties of the sample size under changed parameter values. Following this, we discuss several statistical issues in cancer vaccine studies.

Biostatistics Department, Shionogi & Co., Ltd., Osaka, Japan
*Correspondence to: Takahiro Hasegawa, Biostatistics Department, Shionogi & Co., Ltd., 12F, Hankyu Terminal Bldg., 1-4, Shibata 1-chome, Kita-ku, Osaka 530-0012, Japan. E-mail: [email protected]

Pharmaceut. Statist. 2014, 13 128–135

2. MODEL AND TEST

2.1. Delayed treatment effect model

We assume a clinical trial consisting of a cancer vaccine group and a placebo group and aim to compare the survival times of the two groups. Taking into account that cancer vaccines require time to elicit an immune response, Fine [7] proposed the assumption of a piecewise exponential distribution for the cancer vaccine group and an exponential distribution for the placebo group. We denote $\varepsilon$ as the time when the cancer vaccine elicits an immune response, $\lambda$ as the hazard of the placebo group and of the cancer vaccine group before time $\varepsilon$, and $\psi\lambda$ as the hazard of the cancer vaccine group after time $\varepsilon$, where $\psi < 1$ if the cancer vaccine shows efficacy. The probability density $f_j(t)$, survival $S_j(t)$, and hazard function $h_j(t)$ in the placebo group ($j = 1$) and the cancer vaccine group ($j = 2$) are as follows:

Placebo group:
$$f_1(t) = \lambda e^{-\lambda t}, \quad S_1(t) = e^{-\lambda t}, \quad h_1(t) = \lambda$$

Cancer vaccine group:
$$f_2(t) = \begin{cases} \lambda e^{-\lambda t}, & 0 \le t < \varepsilon, \\ c\,\psi\lambda e^{-\psi\lambda t}, & \varepsilon \le t, \end{cases} \quad S_2(t) = \begin{cases} e^{-\lambda t}, & 0 \le t < \varepsilon, \\ c\,e^{-\psi\lambda t}, & \varepsilon \le t, \end{cases} \quad h_2(t) = \begin{cases} \lambda, & 0 \le t < \varepsilon, \\ \psi\lambda, & \varepsilon \le t, \end{cases}$$

where $c = e^{-\varepsilon\psi\lambda(1/\psi - 1)} = e^{-\varepsilon\lambda(1-\psi)}$ so that $\int_0^\infty f_2(t)\,dt = 1$.

Median survival time is often employed to summarize survival data because it has in practice a more straightforward interpretation than the popular hazard function. It is thus convenient to express the aforementioned hazard functions by the median survival times. Let $m_j$ represent the median survival time in the $j$th group. The hazard $\lambda$ in the placebo group is $\ln(2)/m_1$. Similarly, the hazard $\psi\lambda$ in the cancer vaccine group after time $\varepsilon$ is

$$\psi\lambda = \frac{\ln(2)\,(m_1 - \varepsilon)}{m_1 (m_2 - \varepsilon)} \quad \text{for } m_2 > \varepsilon,$$

which is obtained by solving $S_2(m_2) = 1/2$. In contrast, when $m_2 \le \varepsilon$, the hazard for the cancer vaccine group is no longer expressed as a function of the median because $m_1 = m_2$.

To summarize the time-dependent hazard ratio over time, we introduce the average hazard ratio, which retains this interpretability under the non-proportional hazards of the delayed treatment effect model. The general average hazard ratio proposed by Kalbfleisch and Prentice [13] is defined as

$$\frac{\int_0^\infty \frac{h_2(t)}{h_1(t) + h_2(t)}\,dG(t)}{\int_0^\infty \frac{h_1(t)}{h_1(t) + h_2(t)}\,dG(t)},$$

where $h_1(t)$ and $h_2(t)$ denote the hazards of the placebo and cancer vaccine groups at time $t$, respectively, and $G(t)$ is some survivor function to reflect the relative importance attached to hazard ratios in different time periods. Kalbfleisch and Prentice [13] considered the following class as the weight function:

$$G(t) = S_1^p(t)\,S_2^p(t)$$

for $p > 0$. The average hazard ratio with this weight function can be written as

$$\frac{\int_0^\infty h_2(t)\,S_1^p(t)\,S_2^p(t)\,dt}{\int_0^\infty h_1(t)\,S_1^p(t)\,S_2^p(t)\,dt}.$$

In particular, in the delayed treatment effect model, this expression simplifies to

$$\frac{\dfrac{1 - e^{-2p\lambda\varepsilon}}{2p} + \dfrac{\psi\,e^{-\varepsilon\lambda(1 + p - \psi(1-p))}}{p(1+\psi)}}{\dfrac{1 - e^{-2p\lambda\varepsilon}}{2p} + \dfrac{e^{-\varepsilon\lambda(1 + p - \psi(1-p))}}{p(1+\psi)}}.$$

The choice $p = 1/2$, where the hazard ratio at time $t$ is weighted according to the geometric average of the two survivor functions, is more natural and has some efficiency advantage as discussed by Kalbfleisch and Prentice [13].

2.2. Weighted log-rank test

As the test statistic for survival endpoints, Fine [7] proposed the use of a weighted log-rank test using the Fleming–Harrington class of weights in cancer vaccine studies. Suppose that deaths are observed at times $T_1 < T_2 < \ldots < T_D$ and that the number of deaths at time $T_i$ is $d_i$ ($= d_{1i} + d_{2i}$) of a total number at risk at that time of $n_i$ ($= n_{1i} + n_{2i}$). The weighted log-rank statistic is

$$X = \frac{\sum_{i=1}^{D} W(T_i)\,\bigl(d_{1i} - E(d_{1i})\bigr)}{\sqrt{\sum_{i=1}^{D} W^2(T_i)\,\mathrm{Var}(d_{1i})}},$$

where

$$E(d_{1i}) = n_{1i}\,\frac{d_i}{n_i} \quad \text{and} \quad \mathrm{Var}(d_{1i}) = \frac{n_{1i}\,n_{2i}\,d_i\,(n_i - d_i)}{n_i^2\,(n_i - 1)}$$

under the null hypothesis. Under the null hypothesis, this statistic asymptotically follows the standard normal distribution. Fleming and Harrington [6] proposed the $G^{\rho,\gamma}$ class of weighted log-rank tests with a weight function equal to

$$W(T_i) = \bigl(\hat{S}(T_i)\bigr)^{\rho}\,\bigl(1 - \hat{S}(T_i)\bigr)^{\gamma} \quad \text{for } \rho \ge 0,\ \gamma \ge 0,$$

where $\hat{S}(T_i)$ is the Kaplan–Meier estimate of the survival function in the pooled sample at time $T_i$. Figure 1 displays the range of weight functions used in the Fleming–Harrington $G^{\rho,\gamma}$ class. When $\gamma = 0$, $\rho = 0$ and $\rho = 1$ correspond to the standard log-rank and Prentice statistics, respectively. Here, setting $\rho = 0$ and $\gamma = 1$ would place more weight on late events, emphasizing late differences in the hazard rates and/or the survival curves. It must be noted, however, that the test is based on the entire survival curves rather than only on late differences after time $\varepsilon$. Therefore, valid inference requires that appropriate values for $\rho$ and $\gamma$ be specified prior to data collection, given that this test may be overly sensitive to early differences in the survival curves.

[Figure 1: six panels plotting the weight W(t) against 1 − Ŝ(t): (a) ρ = 0, γ = 0; (b) ρ = 1, γ = 0; (c) ρ = 1, γ = 1; (d) ρ = 0, γ = 0.5; (e) ρ = 0, γ = 1; (f) ρ = 0, γ = 2.]

Figure 1. Weight functions in the Fleming–Harrington $G^{\rho,\gamma}$ class. Weights are uniform in (a) and emphasize early, middle, and late differences, respectively, in (b), (c), and (d)–(f).

3. DERIVATION OF SAMPLE SIZE

Our proposed sample size determination for the weighted log-rank test with the Fleming–Harrington class of weights is obtained by extending Cantor’s calculation. Accordingly, we present it as described in Cantor [11] (pp. 84–85). Suppose we plan to accrue patients during time $T$ at a constant rate to a clinical trial designed to compare survival time between two groups
($j = 1, 2$). After accrual ends, we will then follow the study patients for time $\tau$. That is, we will perform the final analysis at time $T + \tau$ after the start of enrollment, and the range of follow-up time for patients will be from $\tau$ to $T + \tau$. The study period $[0, T + \tau]$ is partitioned into $M$ subintervals of equal length $\{t_0 = 0, t_1, t_2, \ldots, t_M = T + \tau\}$ in calculations, where $M = \mathrm{floor}[(T + \tau)b]$ and $b$ is the number of subintervals per time unit. Floor[$x$] is defined as the largest integer not greater than $x$. Let $h_j(t_i)$ be the hazard function for group $j$ at time $t_i$. We need to calculate the expected number at risk in group $j$ at time point $t_i$ ($i = 0, \ldots, M-1$), which will be denoted by $N_j(i)$. For each subinterval $[t_i, t_{i+1})$, the conditional probability of death for a patient in group $j$ can be represented approximately by $h_j(t_i)/b$. Because uniform accrual is assumed, the probability of being censored in each subinterval is approximately $1/\{b(T + \tau - t_i)\}$ for $t_i > \tau$ and 0 for $t_i \le \tau$. To allow for unequal sample sizes, let $w_j$ be the proportion that we plan to assign to group $j$. These considerations lead to $N_j(i)$ as follows:

$$N_j(0) = n w_j, \qquad N_j(i+1) = N_j(i)\left[1 - \frac{h_j(t_i)}{b} - \frac{1}{b(T + \tau - t_i)}\,1\{t_i > \tau\}\right],$$

where $n$ is the total sample size, and the indicator function denoted by $1\{t_i > \tau\}$ is equal to 1 if $t_i > \tau$ and 0 otherwise; then the expected number of events for each subinterval $[t_i, t_{i+1})$ is calculated as follows:

$$D_i = \frac{1}{b}\bigl[h_1(t_i)\,N_1(i) + h_2(t_i)\,N_2(i)\bigr].$$

Now, let $S_j(t_i)$ be the survival function for group $j$ at time $t_i$, $\theta_i = h_2(t_i)/h_1(t_i)$, and $\phi_i = N_2(i)/N_1(i)$. Note that the hazards within each subinterval are assumed to be proportional. Under a fixed local alternative, Lakatos [12] has shown that the weighted log-rank statistic has, in general, a normal distribution with unit variance and an approximate expectation of

$$E = \frac{\sum_{i=0}^{M-1} D_i\,r_i \left[\dfrac{\phi_i\theta_i}{1 + \phi_i\theta_i} - \dfrac{\phi_i}{1 + \phi_i}\right]}{\sqrt{\sum_{i=0}^{M-1} D_i\,r_i^2\,\dfrac{\phi_i}{(1 + \phi_i)^2}}},$$

where $r_i$ is the weight function at time $t_i$. When the Fleming–Harrington $G^{\rho,\gamma}$ class of weights is applied, $r_i = \{S(t_i)\}^{\rho}\{1 - S(t_i)\}^{\gamma}$. Here, we propose that the weighted survival function $S(t_i) = w_1 S_1(t_i) + w_2 S_2(t_i)$ be used as a substitute for the Kaplan–Meier estimate of the survival function in the pooled sample, originally proposed by Fleming and Harrington [6]. A weighted combination of the survival functions is used because the unequal sample size needs to be taken into account. Note that $E$ can be expressed equivalently as

$$E = n^{1/2} E^{*} = n^{1/2}\left[\frac{\sum_{i=0}^{M-1} D_i^{*}\,r_i \left[\dfrac{\phi_i\theta_i}{1 + \phi_i\theta_i} - \dfrac{\phi_i}{1 + \phi_i}\right]}{\sqrt{\sum_{i=0}^{M-1} D_i^{*}\,r_i^2\,\dfrac{\phi_i}{(1 + \phi_i)^2}}}\right],$$

where

$$D_i^{*} = \frac{1}{b}\bigl[h_1(t_i)\,N_1^{*}(i) + h_2(t_i)\,N_2^{*}(i)\bigr], \qquad N_j^{*}(0) = w_j, \qquad N_j^{*}(i+1) = N_j^{*}(i)\left[1 - \frac{h_j(t_i)}{b} - \frac{1}{b(T + \tau - t_i)}\,1\{t_i > \tau\}\right].$$

Treating the weighted log-rank test statistic as $N\bigl(n^{1/2}E^{*}, 1\bigr)$ with power $1 - \beta$ and a one-sided significance level $\alpha$, we have

$$\bigl|n^{1/2}E^{*}\bigr| = z_\alpha + z_\beta,$$

where $z_\alpha$ is the standard normal deviate corresponding to an upper-tail probability of $\alpha$. Then the required total sample size is obtained as

$$n = \left(\frac{z_\alpha + z_\beta}{E^{*}}\right)^2.$$

Given a total sample size of $n$, the corresponding approximate power and the total expected number of events are $\Phi\bigl(\bigl|n^{1/2}E^{*}\bigr| - z_\alpha\bigr)$ and $\sum_{i=0}^{M-1} D_i$, respectively, where $\Phi$ denotes the cumulative distribution function of the standard normal distribution.
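The steps of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `fh_sample_size`, its argument names, and the default of `b = 30` subintervals per time unit are choices made here for illustration, and the survival curves are taken to be the exponential and piecewise exponential curves of Section 2.1 expressed through the medians.

```python
import math
from statistics import NormalDist


def fh_sample_size(m1, m2, eps, accrual, followup, rho=0.0, gamma=1.0,
                   w1=1/3, w2=2/3, alpha=0.025, power=0.90, b=30):
    """Total sample size and expected events for the G^{rho,gamma} test.

    m1, m2  : median survival in placebo (group 1) and vaccine (group 2), m2 > eps
    eps     : delayed onset time; accrual = T; followup = tau
    w1, w2  : allocation proportions; b: subintervals per time unit (assumed value)
    """
    lam = math.log(2) / m1                                  # hazard before eps
    psi_lam = math.log(2) * (m1 - eps) / (m1 * (m2 - eps))  # vaccine hazard after eps
    total = accrual + followup
    n_steps = math.floor(total * b)                         # M
    at_risk1, at_risk2 = w1, w2                             # N*_j(0) = w_j
    numer = denom = events = 0.0
    for i in range(n_steps):
        t = i / b
        h1 = lam
        h2 = lam if t < eps else psi_lam
        s1 = math.exp(-lam * t)
        s2 = s1 if t < eps else math.exp(-lam * eps - psi_lam * (t - eps))
        s = w1 * s1 + w2 * s2                               # weighted survival S(t_i)
        r = s ** rho * (1.0 - s) ** gamma                   # weight r_i
        d = (h1 * at_risk1 + h2 * at_risk2) / b             # D*_i
        theta = h2 / h1
        phi = at_risk2 / at_risk1
        numer += d * r * (phi * theta / (1 + phi * theta) - phi / (1 + phi))
        denom += d * r * r * phi / (1 + phi) ** 2
        events += d
        # Uniform accrual: censoring starts once t exceeds the minimum follow-up.
        cens = 1.0 / (b * (total - t)) if t > followup else 0.0
        at_risk1 *= 1.0 - h1 / b - cens
        at_risk2 *= 1.0 - h2 / b - cens
    e_star = numer / math.sqrt(denom)
    z = NormalDist().inv_cdf
    n = ((z(1 - alpha) + z(power)) / e_star) ** 2
    return math.ceil(n), n * events


# Medians of 21.7 and 25.8 months, onset at 6 months, 48 months of accrual,
# 18 months of follow-up, 2:1 allocation (vaccine:placebo), rho = 0, gamma = 1.
n_total, n_events = fh_sample_size(21.7, 25.8, 6, 48, 18, rho=0, gamma=1)
print(n_total, round(n_events))
```

With these inputs, which mirror the sipuleucel-T example of Section 4, the result should land near the sample size and number of deaths reported there; small differences are to be expected because the value of $b$ used in the paper is not stated in this section.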

4. EXAMPLE

As an example, we evaluated the determination of sample size using the result of a phase 3 study of sipuleucel-T [1] where a delayed treatment effect was observed. In the double-blind, randomized, placebo-controlled trial of sipuleucel-T, patients were randomly assigned in a 2:1 ratio to receive either sipuleucel-T or placebo administered intravenously every 2 weeks. The primary endpoint was overall survival. The median survival was 25.8 months in the sipuleucel-T group and 21.7 months in the placebo group. Visual separation of the overall survival curves by the Kaplan–Meier estimates occurred approximately 6 months after randomization.

We determined the sample size for a potential new trial by the proposed formula for detecting the difference in overall survival between the sipuleucel-T group and the placebo group using the weighted log-rank test with the Fleming–Harrington weights for $\rho = 0$ and $\gamma = 1$, which emphasize late differences. We assumed that patients are accrued to a clinical trial for 48 months at a constant rate and that the final analysis is performed 18 months later, that is, 66 months after the start of enrollment, so that the range of follow-up times for patients is from 18 to 66 months. We also assumed that the overall survival times for patients in the sipuleucel-T group follow a piecewise exponential distribution with a delayed onset time of 6 months and that those for the placebo group follow an exponential distribution. The corresponding hazard ratio after the delayed onset time is 0.79, as shown in Figure 2. The total sample size required for achieving the desired power of 90% at the one-sided significance level of 2.5% was 1974 patients for the analysis of 1322 deaths. The details for assumptions about distributions of overall survival, accrual and follow-up times, and the process of sample size calculation are provided in the Appendix.

To evaluate whether the desired power is attained under this sample size, we performed a Monte Carlo simulation with 10,000 samples and calculated the average power. The simulation with the required sample size of 1974 patients yielded an average power of 89.8%, a value quite similar to the desired power. This result suggests that the proposed sample size formula is applicable to the weighted log-rank test with the Fleming–Harrington class of weights.

The delayed onset time was only approximately 28% of the median survival time in the placebo group under this assumption. Fine [7] showed that the power of weights for $\rho = 1$ and $\gamma = 1$, which emphasize middle differences, is expected to be higher than that for $\rho = 0$ and $\gamma = 1$ in cases where the delayed onset time is below 35% of the median survival time in the placebo group. A total of 1833 patients was required when recalculated for the use of $\rho = 1$ and $\gamma = 1$. If, in contrast, the standard log-rank test is used for regulatory acceptability, a sample size of 2325 patients is needed.

The sample size calculated above will now be compared with the result of the sample size calculation of the standard log-rank test to detect the average hazard ratio. Given the same assumption for the survival curves as before, an average hazard ratio of 0.83 for $p = 1/2$ is obtained. When the median survival in the placebo group is assumed to be 21.7 months, 1884 patients are required with 1262 deaths to have at least 90% power to detect the average hazard ratio of 0.83 at the one-sided significance level of 2.5%. However, it is questionable whether the standard log-rank test, which does not emphasize late differences, retains the desired power because the delayed onset time actually exists. The power of the standard log-rank test for 1884 patients in fact falls to 83.1%. Therefore, the sample size calculation of the standard log-rank test leads to an erroneous result in this setting. Even if the Cox proportional hazards model is applied for 1884 patients, the geometric mean of the hazard ratio and the average power, obtained by simulation with 10,000 iterations, were 0.84 and 83.7%, respectively.
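The inputs to this example can be derived from the two medians and the delayed onset time alone. The following Python sketch is illustrative only: the helper names are hypothetical, the average hazard ratio is evaluated from its integral definition with a simple midpoint rule (an assumption made here for brevity, not the paper's closed form), and the sampler inverts the cumulative hazard, which is one common way to draw piecewise exponential survival times for a Monte Carlo study.

```python
import math
import random


def hazards_from_medians(m1, m2, eps):
    """Return (lambda, psi*lambda) for medians m1, m2 and onset eps (m2 > eps)."""
    lam = math.log(2) / m1
    psi_lam = math.log(2) * (m1 - eps) / (m1 * (m2 - eps))
    return lam, psi_lam


def average_hazard_ratio(lam, psi_lam, eps, p=0.5, t_max=2000.0, steps=200_000):
    """Midpoint-rule value of int h2 (S1 S2)^p dt / int h1 (S1 S2)^p dt."""
    dt = t_max / steps
    num = den = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        late = t >= eps
        s1 = math.exp(-lam * t)
        s2 = math.exp(-lam * eps - psi_lam * (t - eps)) if late else s1
        weight = (s1 * s2) ** p
        num += (psi_lam if late else lam) * weight   # h2(t) * weight
        den += lam * weight                          # h1(t) * weight
    return num / den


def sample_vaccine_time(lam, psi_lam, eps, rng):
    """Draw one vaccine-group survival time by inverting the cumulative hazard."""
    e = rng.expovariate(1.0)            # unit-exponential cumulative hazard
    if e < lam * eps:
        return e / lam                  # event before the delayed onset time
    return eps + (e - lam * eps) / psi_lam


lam, psi_lam = hazards_from_medians(21.7, 25.8, 6.0)
rng = random.Random(2014)               # arbitrary seed, for reproducibility only
draws = sorted(sample_vaccine_time(lam, psi_lam, 6.0, rng) for _ in range(100_000))
print(round(psi_lam / lam, 2))                            # hazard ratio after onset
print(round(average_hazard_ratio(lam, psi_lam, 6.0), 2))  # average hazard ratio, p = 1/2
print(round(draws[len(draws) // 2], 1))                   # empirical median
```

The first two printed values can be compared with the hazard ratio after onset and the average hazard ratio quoted in the text, and the empirical median of the simulated times should be close to the assumed 25.8 months.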

[Figure 2: overall survival (0.0–1.0) plotted against time from randomization (0–60 months) for the cancer vaccine group and the placebo group, annotated with HR = 1.0 and HR = 0.79.]

Figure 2. Assumed survival curves for the cancer vaccine and placebo groups in the example of Section 4. The overall survival times for patients in the sipuleucel-T group follow a piecewise exponential distribution with a median time of 25.8 months and a delayed onset time of 6 months, and those for the placebo group follow an exponential distribution with a median time of 21.7 months. HR denotes the hazard ratio.

5. PROPERTIES OF SAMPLE SIZE

5.1. Choice of weights

In practice, when the sample size of a confirmatory study is determined for the weighted log-rank test with the Fleming–Harrington class of weights, it is sensitive to two parameters. One is the delayed onset time of the cancer vaccine. It would be difficult to estimate this time with high accuracy using the results of prior studies with limited sample sizes; this paper supposes that the delayed onset time is known or has been estimated with satisfactory accuracy and does not discuss this issue further. The other parameter is the choice of weights in the Fleming–Harrington class. Weighting late survival differences yields more power than equally weighting all survival differences owing to the delayed separation of survival curves with piecewise constant hazard. Figure 1 shows the main choices for the Fleming–Harrington $G^{\rho,\gamma}$ class of weights. The minimum value of $\rho = 0$ and a value of $\gamma > 0$ are often reasonable to accentuate late survival differences. We evaluate the effect of the Fleming–Harrington $G^{\rho,\gamma}$ class of weights with $\rho = 0$ and a delayed onset time $\varepsilon$ on the sample size required for choosing an appropriate or robust value of $\gamma$.

Let us consider a randomized, placebo-controlled trial of a cancer vaccine where patients are accrued to the clinical trial for 48 months at a constant rate. They are randomized in a 2:1 ratio to receive a cancer vaccine or placebo, and the final analysis of overall survival with the weighted log-rank test using the Fleming–Harrington class of weights is performed 18 months after the end of enrollment. The value of $\gamma$ ranges from 0 to 2 in increments of 0.1. For comparison, the weighted log-rank tests with $\rho = 1$ and $\gamma = 1$, which emphasize middle differences, and $\rho = 1$ and $\gamma = 0$, which emphasize early differences, are performed. It is also assumed that the overall survival times for patients receiving the cancer vaccine follow a piecewise exponential distribution with a median of 26 months and a delayed onset time $\varepsilon$ of 0, 3, 6, 9, 12, or 15 months and that those for the placebo group follow an exponential distribution with a median of 20 months. Thus, the corresponding hazard ratios after the delayed onset time are 0.77 ($\varepsilon = 0$), 0.74 ($\varepsilon = 3$), 0.70 ($\varepsilon = 6$), 0.65 ($\varepsilon = 9$), 0.57 ($\varepsilon = 12$), and 0.45 ($\varepsilon = 15$). In this situation, the longer the delayed onset time, the stronger the treatment effect after the delayed onset time must be (that is, the smaller the hazard ratio), given that the assumed median survival times are fixed in both groups and the delayed onset time is the only change.

Figure 3 shows the total sample sizes required for achieving the desired power of 90% at the one-sided significance level of 2.5%. In Figure 3(a), as expected, the total sample size of 969 patients required with $\gamma = 0$, corresponding to the standard log-rank test, is the lowest when the treatment effect is immediate ($\varepsilon = 0$) because the hazards are proportional. The higher the value of $\gamma$, the greater the total sample size required. In contrast, when a delayed onset time is present, as in Figure 3(b)–(f), the value of $\gamma$ leading to the lowest total sample size depends on the delayed onset time. When $\varepsilon = 3$, values of $\gamma$ of 0.3 and 0.4 give the lowest total sample size of 912. When $\varepsilon = 6$, at least 801 patients are needed at values of $\gamma$ of 0.6 and 0.7. When $\varepsilon = 9$, values of $\gamma$ of 0.9 and 1.0 lead to a minimum of 660 patients. It may seem counterintuitive that the longer the delayed onset time, the smaller the required sample size. However, under this assumption, the longer the delay, the further below 1 both the hazard ratio after the delayed onset time and the average hazard ratio are. In contrast, when the delayed onset time is lower than 6 months, which is 30% of the median survival time in the placebo group, the required sample size with $\rho = 1$ and $\gamma = 1$ is lower than the minimum with $\rho = 0$. The sample size with $\rho = 1$ and $\gamma = 0$ is much higher whenever a delayed onset time is present; in this case, we should avoid choosing weights with $\rho = 1$ and $\gamma = 0$. The result suggests that we need to choose the Fleming–Harrington $G^{\rho,\gamma}$ class of weights appropriately and robustly, taking the delayed onset time into consideration.

[Figure 3: six panels plotting the total sample size required (500–2000) against γ (0.0–2.0): (a) ε = 0 month; (b) ε = 3 months; (c) ε = 6 months; (d) ε = 9 months; (e) ε = 12 months; (f) ε = 15 months.]

Figure 3. Relationships between total sample size required and $\gamma$ in the Fleming–Harrington $G^{\rho,\gamma}$ classes when $\rho = 0$ at a one-sided significance level of 2.5% with a power of 90% in each assumption about delayed onset time when the median survival time in the cancer vaccine group and placebo group is 26 and 20 months, respectively. $\varepsilon$ is defined as the delayed onset time. The horizontal broken and dashed lines show the total sample size required by the Fleming–Harrington $G^{\rho,\gamma}$ class with $\rho = 1$ and $\gamma = 1$, which emphasizes middle differences, and $\rho = 1$ and $\gamma = 0$, which emphasizes early differences, respectively.

5.2. Number of events required

The required number of patients in a survival study is generally calculated from the number of events required to be observed in the study. Because a delayed separation of the survival curves is expected in cancer vaccine studies, the events observed prior to the delayed onset time are considered not to contribute power to detect a difference in survival curves. In addition, the weighted log-rank test focuses on the entire survival curves rather than the late difference after the delayed onset time. For this reason, the required number of events in studies where the treatment effect is delayed will be greater than that in studies with an immediate treatment effect, even though the late survival difference is emphasized by the Fleming–Harrington class of weights. We illustrate the size of this increment in the required number of events.

Here suppose that, for simplicity, all patients are randomized in a 2:1 ratio to receive a cancer vaccine or placebo on the same calendar date and are followed up for the same period. It is assumed that the overall survival times for patients receiving the placebo follow an exponential distribution with a median of 20 months, the treatment effect of the cancer vaccine is delayed by 0, 3, or 6 months, and the hazard ratio after the delayed onset time is 0.7. The final analysis is assumed to be performed 30 months after the delayed onset time so that the follow-up periods to observe the events that would contribute to the power to detect a treatment effect are common. The weighted log-rank test with the Fleming–Harrington class of weights with $\rho = 0$ and $\gamma = 0.0$, 0.5, 1.0, or 2.0 is performed. Table I shows the total numbers of events required and the required numbers of events after the delayed onset time. When the treatment effect is immediate and $\gamma = 0.0$, the number of events required is the lowest, 347 events. This number denotes the number of events required to detect the treatment effect by the standard log-rank test. When the treatment effect is delayed, a greater number of events after the delayed onset time is required in addition to a greater total number of events. If an appropriate weight is chosen, the required number after the delayed onset time increases to slightly more than 347 events. However, the total number required is much greater than when the treatment effect is immediate.

Table I. Number of events required to detect a treatment effect by the weighted log-rank test with the Fleming–Harrington class of weights with $\rho = 0$ at a one-sided significance level of 2.5% with a power of 90%.

DOT (months)   Median OS, cancer vaccine (months)   Follow-up period (months)   $\gamma$   Total events required   Events required after DOT
0              28.6                                 30                          0.0        347                     347
0              28.6                                 30                          0.5        390                     390
0              28.6                                 30                          1.0        463                     463
0              28.6                                 30                          2.0        629                     629
3              27.3                                 33                          0.0        500                     419
3              27.3                                 33                          0.5        448                     375
3              27.3                                 33                          1.0        488                     408
3              27.3                                 33                          2.0        631                     528
6              26.0                                 36                          0.0        703                     497
6              26.0                                 36                          0.5        552                     391
6              26.0                                 36                          1.0        552                     391
6              26.0                                 36                          2.0        656                     465

DOT, delayed onset time; OS, overall survival.

5.3. Comparison based on average hazard ratio

The simulation in Section 5.2 assumed that the hazard ratio after the onset of the delayed treatment effect was fixed at 0.7. As an alternative assumption, let us consider that the average hazard ratio is fixed at 0.75 under the following conditions: a median survival time of 20 months in the placebo group and $p = 1/2$ in the formula for the average hazard ratio. Suppose that patients are accrued to the clinical trial for 48 months at a constant rate and randomized in a 2:1 ratio to receive a cancer vaccine or placebo, and the final analysis of overall survival is performed 18 months after the end of enrollment. It is assumed that the overall survival times in the placebo group follow an exponential distribution with a median of 20 months, and the treatment effect of the cancer vaccine is delayed by 0, 3, 6, 9, 12, or 15 months. Using the standard log-rank test with no consideration for the delayed onset time, 810 patients are required with 547 deaths to have at least 90% power to detect the average hazard ratio of 0.75 at the one-sided significance level of 2.5%. Given this average hazard ratio, the corresponding median survival times in the cancer vaccine group are 26.7 months ($\varepsilon = 0$), 26.4 months ($\varepsilon = 3$), 26.0 months ($\varepsilon = 6$), 25.4 months ($\varepsilon = 9$), 24.6 months ($\varepsilon = 12$), and 23.4 months ($\varepsilon = 15$). The actual powers of the standard log-rank test for 810 patients are then 90.3% ($\varepsilon = 0$), 86.7% ($\varepsilon = 3$), 82.6% ($\varepsilon = 6$), 77.5% ($\varepsilon = 9$), 72.2% ($\varepsilon = 12$), and 65.6% ($\varepsilon = 15$). Thus, the longer the delayed onset time, the more the actual power decreases below the desired power.

In contrast, when the delayed onset time is given due consideration, the total sample sizes required with the use of the weighted log-rank test with the Fleming–Harrington class of weights when $\rho = 0$ and the value of $\gamma$ ranges from 0 to 2 in increments of 0.1 are as shown in Figure 4. The 810 patients required by the standard log-rank test for the average hazard ratio were nearly equal to the minimum sample size by the weighted log-rank test with the Fleming–Harrington class of weights when $\varepsilon = 0$, 3, or 6. However, the required sample size for the average hazard ratio was higher when $\varepsilon = 9$, 12, or 15. Thus, if the treatment effect occurs relatively early, the sample size calculation of the standard log-rank test for the average hazard ratio could be reasonable with respect to the number of patients required. However, we suggest that, in the presence of a delayed treatment effect, the weighted log-rank test be used as the analysis method instead of the standard log-rank test in order to avoid a reduction in the statistical power.
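For readers who want to see how the recommended analysis is computed from trial data, the weighted log-rank statistic of Section 2.2 can be sketched as follows. This is a plain-Python illustration rather than the LIFETEST or FHtest implementation: the function name and the toy data are hypothetical, and the weight at each death time uses the pooled Kaplan–Meier value just before that time (the left-continuous version), which is an assumption made here so that the weight depends only on earlier events.

```python
import math


def fh_weighted_logrank(times, events, groups, rho=0.0, gamma=1.0):
    """Standardized weighted log-rank statistic X with G^{rho,gamma} weights.

    times  : follow-up times
    events : 1 for death, 0 for censoring
    groups : 1 or 2 (group 1 plays the role of d_{1i}, n_{1i})
    """
    data = sorted(zip(times, events, groups))
    n1 = sum(1 for g in groups if g == 1)        # at risk in group 1
    n2 = len(groups) - n1                        # at risk in group 2
    surv = 1.0                                   # pooled Kaplan-Meier before current time
    numer = var = 0.0
    i = 0
    while i < len(data):
        t = data[i][0]
        d1 = d2 = c1 = c2 = 0
        # Collect every subject whose time equals t (handles ties).
        while i < len(data) and data[i][0] == t:
            _, e, g = data[i]
            if e and g == 1:
                d1 += 1
            elif e:
                d2 += 1
            elif g == 1:
                c1 += 1
            else:
                c2 += 1
            i += 1
        d, n = d1 + d2, n1 + n2
        if d > 0 and n > 1:
            w = surv ** rho * (1.0 - surv) ** gamma
            numer += w * (d1 - n1 * d / n)                       # W (d_1i - E(d_1i))
            var += w * w * n1 * n2 * d * (n - d) / (n * n * (n - 1))
        if d > 0:
            surv *= 1.0 - d / n                                  # Kaplan-Meier update
        n1 -= d1 + c1
        n2 -= d2 + c2
    return numer / math.sqrt(var)


# Hypothetical toy data: four deaths alternating between the two groups.
toy_times, toy_events, toy_groups = [1, 2, 3, 4], [1, 1, 1, 1], [1, 2, 1, 2]
print(fh_weighted_logrank(toy_times, toy_events, toy_groups, rho=0, gamma=0))
print(fh_weighted_logrank(toy_times, toy_events, toy_groups, rho=0, gamma=1))
```

Setting `rho = 0` and `gamma = 0` gives the standard log-rank statistic, while `rho = 0` and `gamma = 1` down-weights the earliest deaths; the one-sided test compares the statistic with a standard normal quantile, with the sign depending on which group is labeled 1.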

[Figure 4: six panels plotting the total sample size required (800–1600) against γ (0.0–2.0): (a) ε = 0 month; (b) ε = 3 months; (c) ε = 6 months; (d) ε = 9 months; (e) ε = 12 months; (f) ε = 15 months.]

Figure 4. Relationships between total sample size required and $\gamma$ in the Fleming–Harrington $G^{\rho,\gamma}$ classes when $\rho = 0$ at a one-sided significance level of 2.5% with a power of 90% in each assumption about delayed onset time when the median survival time in the placebo group is 20 months, given the average hazard ratio of 0.75. $\varepsilon$ is defined as the delayed onset time. $m_2$ is defined as the median survival time in the cancer vaccine group corresponding to $\varepsilon$. The horizontal broken line shows the total sample size of 810 patients required by the standard log-rank test to detect the average hazard ratio of 0.75.

6. DISCUSSION

Cancer vaccines are used to treat cancers, and many clinical studies of cancer vaccines are currently being conducted. Although the log-rank test or Cox proportional hazards model has been applied as the primary statistical analysis on the assumption of proportional hazards, a delayed treatment effect is expected based on confirmatory study results [1,2] and some regulatory guidelines [3,4]. Accordingly, as Fine [7] proposed, a weighted log-rank test with the Fleming–Harrington class of weights may be used as the primary analysis in confirmatory studies of cancer vaccines focusing on a survival endpoint, with the purpose of avoiding a substantial loss of statistical power. Thus, sample size determination for the weighted log-rank test with the Fleming–Harrington weights is an essential part of planning confirmatory studies with consideration for the delayed onset time. The calculation procedure is easily obtained by extending Cantor’s procedure [11] and implementing Lakatos’ calculation [12], as described in Section 3.

An important goal is to establish the relationship between the number of events (or patients) required and parameters specifying assumed survival curves. It is particularly important to estimate the delayed onset time with satisfactory accuracy using the results of prior studies. Depending on the timing of the delayed effect, an appropriate or robust Fleming–Harrington class of weights should be chosen, and we can determine the sample size required and estimate the increment relative to the immediate effect assumption. With respect to the choice of weights, a minimum value of $\rho = 0$ is often reasonable to accentuate late survival differences. If the delay is approximately 15% or more of the median survival in the placebo group, $\gamma = 1.0$ could be selected as a robust value because the corresponding sample size is expected to be lower than that in the absence of weighting ($\gamma = 0.0$) and only slightly higher than the minimum sample size over a wide range of the delayed onset time. Furthermore, the choice of $\rho = 1$ and $\gamma = 1$ may also be considered reasonable over a wide range of the delayed onset time in terms of statistical power. This weight emphasizes middle differences where the survival rate is around 50%. The clinical or regulatory interpretation of accentuating middle differences may need further discussion.

With respect to the number of events required to be observed in a study, the weighted log-rank test needs more events to detect a treatment effect than the standard log-rank test, because the weighted log-rank test focuses on the entire survival curve rather than the late difference after the delayed onset time. For this reason, it may still be inefficient to perform the weighted log-rank test in cancer vaccine studies. If the timing of the delayed effect is known, the method proposed by Logan et al. [14] could be applied. However, it would be difficult to estimate the delayed onset time with satisfactory accuracy using the results of prior studies with limited sample size in addition to the assumption of the median survival time. Berry [15] similarly argued that it would be nearly impossible to determine in an empirical setting whether or not the delayed benefit is genuine. However, if the delayed treatment effect is expected from an immunological mechanism of action, we should assume it when determining the sample size and use the weighted log-rank test with the Fleming–Harrington class of weights as the primary analysis, because the required sample size by the standard log-rank test with no consideration for the delayed onset time is not less than the minimum sample size by the unequal-weighted log-rank test, and the use of the standard log-rank test leads to a reduction in the statistical power, as discussed in Section 5.3.
Although Hasegawa [16] proposed a way to estimate the delayed onset time by a profile likelihood approach, there are few methods whose properties are well known, especially for small sample sizes. Thus, in practice, we need to select the Fleming–Harrington class of weights and determine the sample size robustly with respect to the delayed onset time by assuming a variety of settings in advance.
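The weight choices discussed in this section can be illustrated with a short sketch. It assumes the standard form of the Fleming–Harrington weight, S^ρ (1 − S)^γ, evaluated at the pooled survival probability S; the function name `fh_weight` and the evaluation grid are hypothetical and chosen only for illustration.

```python
def fh_weight(surv, rho, gamma):
    """Fleming-Harrington weight S^rho * (1 - S)^gamma at pooled survival `surv`."""
    return surv ** rho * (1.0 - surv) ** gamma


# Evaluate the weights on a grid of pooled survival probabilities (illustrative only).
grid = [k / 100 for k in range(101)]

# rho = 0, gamma = 0: every event gets the same weight (standard log-rank test).
logrank_weights = [fh_weight(s, 0, 0) for s in grid]

# rho = 0, gamma = 1: the weight grows as survival falls, so late events dominate.
late_weights = [fh_weight(s, 0, 1) for s in grid]

# rho = 1, gamma = 1: the weight S(1 - S) is largest where survival is about 50%.
peak_middle = max(grid, key=lambda s: fh_weight(s, 1, 1))
```

In this sketch `peak_middle` marks the survival level that the ρ = 1, γ = 1 weight emphasizes most, which corresponds to the "middle differences" mentioned above.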

Acknowledgements

We are obliged to Dr. Hideaki Watanabe at Shionogi & CO., LTD. for his comments, which improved this paper. We would also like to thank three anonymous referees for their valuable suggestions.

REFERENCES

[1] Kantoff PW, Higano CS, Shore ND, Berger ER, Small EJ, et al. Sipuleucel-T immunotherapy for castration-resistant prostate cancer. The New England Journal of Medicine 2010; 363:411–422, DOI: 10.1056/NEJMoa1001294.
[2] Hodi FS, O'Day SJ, McDermott DF, Weber RW, Sosman JA, et al. Improved survival with ipilimumab in patients with metastatic melanoma. The New England Journal of Medicine 2010; 363:711–723, DOI: 10.1056/NEJMoa1003466.
[3] The FDA Website. Available at: http://www.fda.gov/downloads/BiologicsBloodVaccines/GuidanceComplianceRegulatoryInformation/Guidances/Vaccines/UCM182826.pdf (accessed 9 June 2013).
[4] The Japan Society for Biological Therapy Website (in Japanese). Available at: http://jsbt.org/guidance/ (accessed 9 June 2013).
[5] Hoos A, Eggermont AMM, Janetzki S, Hodi FS, Ibrahim R, et al. Improved endpoints for cancer immunotherapy trials. Journal of the National Cancer Institute 2010; 102:1388–1397, DOI: 10.1093/jnci/djq310.
[6] Fleming TR, Harrington DP. Counting Processes and Survival Analysis. John Wiley & Sons: New York, 1991.
[7] Fine GD. Consequences of delayed treatment effects on analysis of time-to-event endpoints. Drug Information Journal 2007; 41:535–539, DOI: 10.1177/009286150704100412.
[8] SAS Institute Inc. SAS/STAT® 9.1 User's Guide. SAS Institute Inc.: Cary, NC, 2004.
[9] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing: Vienna, Austria, 2012. Available at: http://www.R-project.org/ (accessed 9 June 2013).
[10] Oller R, Langohr K. FHtest: tests for right and interval-censored survival data based on the Fleming-Harrington class. R package version 0.85, 2012. Available at: http://CRAN.R-project.org/package=FHtest (accessed 9 June 2013).
[11] Cantor AB. Extending SAS Survival Analysis Techniques for Medical Research. SAS Institute Inc.: Cary, NC, 1997, pp. 84–85.
[12] Lakatos E. Sample sizes based on the log-rank statistic in complex clinical trials. Biometrics 1988; 44:229–241, DOI: 10.2307/2531910.
[13] Kalbfleisch JD, Prentice RL. Estimation of the average hazard ratio. Biometrika 1981; 68:105–112, DOI: 10.1093/biomet/68.1.105.
[14] Logan BR, Klein JP, Zhang MJ. Comparing treatments in the presence of crossing survival curves: an application to bone marrow transplantation. Biometrics 2008; 64:733–740, DOI: 10.1111/j.1541-0420.2007.00975.x.
[15] Berry DA. The hazards of endpoints. Journal of the National Cancer Institute 2010; 102:1376–1377, DOI: 10.1093/jnci/djq334.
[16] Hasegawa T. Estimation of delayed onset time assuming piecewise exponential distribution in cancer vaccine studies. XXVIth International Biometric Conference, Kobe, Japan, 26–31 August 2012.

APPENDIX A. DETAILS FOR EXAMPLE OF SAMPLE SIZE CALCULATION

In the example in Section 4, it is assumed that the overall survival times for patients in the sipuleucel-T group follow a piecewise exponential distribution with a median time of 25.8 months and a delayed onset time of 6 months, and that those for the placebo group follow an exponential distribution with a median time of 21.7 months. Figure 2 shows the assumed survival curves. In this instance, the hazard λ and the hazard ratio after the delayed onset time are 0.032 (/month) and 0.79, respectively. It is also assumed that patients are accrued to a clinical trial for 48 months at a constant rate and that the final analysis is performed 18 months later, that is, at 66 months after the start of enrollment, so that the range of follow-up times for patients is from 18 to 66 months. Provided that the number of subintervals per time unit b is 30, the study period of 66 months is partitioned into 1980 subintervals of equal length, {t_0 = 0, t_1, t_2, ..., t_1980 = 66 months}, in the calculations. Suppose that patients are randomly assigned in a 2:1 ratio to receive either cancer vaccine or placebo. Then, the hazard function h_j(t_i), the expected proportion at risk N_j(i), and the survival function S_j(t_i) for group j (j = 1 for placebo, 2 for cancer vaccine) at time t_i (i = 0, ..., 1980) are calculated. Further, the expected proportion of events D_i for each subinterval [t_i, t_{i+1}), the hazard ratio θ_i, the ratio of expected numbers at risk φ_i at time t_i, and the weights r_i for the Fleming–Harrington class of weights with ρ = 0 and γ = 1 are calculated. These sequences are shown in Table A1. The required total sample size of 1974 patients is obtained from Table A1.
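The sequences described above can be sketched in a few lines of Python. This is a minimal illustration and not the author's SAS or R implementation. The function name `fh_sample_size` is hypothetical, and several details are assumptions that the appendix text does not spell out: administrative censoring is modelled as an additive per-subinterval loss of 1/(number of remaining subintervals) once the minimum follow-up of 18 months is reached, the weight is evaluated at the allocation-weighted pooled survival, the Lakatos-type formula N = (z_α + z_β)² Σ D_i r_i² φ_i/(1 + φ_i)² / {Σ D_i r_i [φ_i θ_i/(1 + φ_i θ_i) − φ_i/(1 + φ_i)]}² is used, and a one-sided 2.5% level with 90% power is assumed.

```python
import math
from statistics import NormalDist


def fh_sample_size(m1=21.7, m2=25.8, eps=6.0, accrual=48.0, followup=18.0,
                   alloc=2.0, rho=0.0, gamma=1.0, b=30,
                   alpha=0.025, power=0.90):
    """Return (total sample size, rows of per-subinterval sequences).

    Assumed settings: one-sided alpha, additive administrative censoring,
    weights evaluated at the allocation-weighted pooled survival.
    """
    lam = math.log(2) / m1                                 # placebo hazard per month
    psi = (math.log(2) - lam * eps) / (lam * (m2 - eps))   # hazard ratio after the delay
    n_steps = round((accrual + followup) * b)              # 1980 subintervals
    first_cens = round(followup * b)                       # censoring starts at 18 months
    p1, p2 = 1 / (1 + alloc), alloc / (1 + alloc)          # allocation proportions
    n1, n2 = p1, p2                                        # expected proportions at risk
    s1 = s2 = 1.0                                          # survival functions
    num = den = 0.0
    rows = []
    for i in range(n_steps):
        t = i / b
        h1 = lam
        h2 = lam if t < eps else psi * lam
        d = (n1 * h1 + n2 * h2) / b                        # expected proportion of events
        theta = h2 / h1                                    # hazard ratio
        phi = n2 / n1                                      # ratio at risk (vaccine/placebo)
        s_pool = p1 * s1 + p2 * s2
        r = s_pool ** rho * (1 - s_pool) ** gamma          # Fleming-Harrington weight
        rows.append({"t": t, "h1": h1, "N1": n1, "S1": s1, "h2": h2, "N2": n2,
                     "S2": s2, "D": d, "theta": theta, "phi": phi, "r": r})
        num += d * r * (phi * theta / (1 + phi * theta) - phi / (1 + phi))
        den += d * r * r * phi / (1 + phi) ** 2
        # Move to the next subinterval: events plus (assumed) additive censoring loss.
        c = 1 / (n_steps - i) if i >= first_cens else 0.0
        n1 = max(n1 * (1 - h1 / b - c), 0.0)
        n2 = max(n2 * (1 - h2 / b - c), 0.0)
        s1 *= 1 - h1 / b
        s2 *= 1 - h2 / b
    z = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    return z * z * den / num ** 2, rows


n_total, rows = fh_sample_size()
print(math.ceil(n_total))   # total sample size under the assumptions above
print(rows[540])            # sequences at t = 18 months, for comparison with Table A1
```

Under these assumptions the selected rows should agree with Table A1 to about three decimal places, and the total should land close to the reported 1974 patients, although a different discretization or rounding convention could shift it slightly.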

Table A1. The sequences of sample size calculations for the example.

                       Placebo (j = 1)               Cancer vaccine (j = 2)
   i   t_i (months)    h_1(t_i)  N_1(i)  S_1(t_i)    h_2(t_i)  N_2(i)  S_2(t_i)    D_i     θ_i     φ_i     r_i
   0   0               0.032     0.333   1.000       0.032     0.667   1.000       0.001   1.000   2.000   0.000
 ...   ...             ...       ...     ...         ...       ...     ...         ...     ...     ...     ...
 179   5.97            0.032     0.275   0.826       0.032     0.551   0.826       0.001   1.000   2.000   0.174
 180   6               0.032     0.275   0.826       0.025     0.550   0.826       0.001   0.793   2.000   0.174
 ...   ...             ...       ...     ...         ...       ...     ...         ...     ...     ...     ...
 540   18              0.032     0.188   0.563       0.025     0.406   0.609       0.001   0.793   2.165   0.406
 ...   ...             ...       ...     ...         ...       ...     ...         ...     ...     ...     ...
1440   48              0.032     0.027   0.216       0.025     0.071   0.285       0.000   0.793   2.642   0.738
 ...   ...             ...       ...     ...         ...       ...     ...         ...     ...     ...     ...
1980   66              0.032     0.000   0.121       0.025     0.000   0.181       0.000   0.793   NA      0.839

Pharmaceut. Statist. 2014, 13 128–135
