STATISTICS IN MEDICINE, VOL. 9, 447-456 (1990)

APPLICATION OF NEW TWO-SAMPLE TESTS TO DATA FROM A RANDOMIZED PLACEBO-CONTROLLED HEART-FAILURE TRIAL

P. K. TANDON, HERBERT STANDER AND RICHARD P. SCHWARZ, Jr.
Sterling Research Group, Malvern, Pennsylvania 19355, U.S.A.

SUMMARY

In a randomized placebo-controlled double-blind trial of 230 congestive heart failure patients, four treatments were evaluated for efficacy, with exercise tolerance time (ETT) as the primary outcome. Various two-sample tests were applied to the analysis of the ETT data. It is shown in this paper that the conventional two-sample tests (t and rank-sum) are insensitive to situations where the effect of the experimental therapy is not consistent across a patient population. Tests recommended by O’Brien are more appropriate for these data. It is also shown that the application of the O’Brien tests led to the identification of subgroups where the observed effect of the experimental therapy was most pronounced.

1. INTRODUCTION

The statistical analysis of data from two treatment groups, where one group represents a continuation of standard therapy and the other an intervention with an experimental therapy, has focused primarily on differences between treatment means, with minimal emphasis on the differences between their respective distributions. Recent publications by O’Brien¹ and by Conover and Salsburg² assert that not only is a shift in location of obvious interest, but identifying a treatment whose effect is not consistent across a patient population is also important. This is particularly so when a shift in the distribution can be associated with meaningful characteristics of a particular subgroup of responders and/or non-responders (Good³).

In a clinical trial where the volunteer patients are maintained on the standard therapy, ‘poor’ and ‘good’ responders to treatment are usually not represented. The ‘poor’ responders are usually receiving other drugs, and the ‘good’ responders do not want their medications to be changed. Consequently, the population of volunteer patients, although symptomatic, seems to be neither improving nor deteriorating; however, this does not mean that the potential for either condition does not exist with the experimental therapy. The analyses presented in this paper will demonstrate that modifications in standard therapy can result in changes in study endpoints consistent with this potential.

The effects of three such treatment modifications have been investigated. In a randomized placebo-controlled clinical trial in congestive heart failure (CHF) patients, the first modification was to replace standard therapy (digoxin) with a placebo; the second, to replace it with experimental therapy (milrinone); and the third, to add milrinone to conventional therapy. Digoxin,⁴ one of the digitalis glycosides, is a positive inotropic agent; that is, it increases cardiac contractile force and improves the diminished cardiac output which is characteristic of congestive heart failure. Milrinone⁵ is a newer agent from the class known as the phosphodiesterase inhibitors, and also possesses positive inotropic properties. However, because its mechanism of action is distinctly different from that of digoxin, it also has a direct vasodilator effect and improves diastolic relaxation.

In this paper, it is shown that the conventional two-sample tests of location are insensitive to situations occurring in clinical trials where the effect of the experimental therapy is heterogeneous. The methodology proposed by O’Brien may be useful for hypothesis testing in these kinds of situations. It is also shown that this methodology could be useful in the identification of potential subgroups of patients for whom the experimental therapy might have most (or least) benefit. In this sense, the O’Brien method can be considered as hypothesis generating.

0277-6715/90/040447-10$05.00 © 1990 by John Wiley & Sons, Ltd.
Received October 1988 Revised June 1989

2. METHODOLOGY

O’Brien¹ has proposed a generalized t-test and a generalized rank-sum test. These tests are extensions of the conventional t and rank-sum tests which take into account the heterogeneous nature of the response due to the experimental therapeutic intervention. We use here the notation from O’Brien’s publication. Let $F_X$ and $F_Y$ represent the cumulative distribution functions for two samples X and Y, let $W_1, W_2, \ldots, W_N$ represent the ascending ordered data in the pooled sample, and define $Z_i = 1$ if $W_i$ corresponds to a member of the Y sample and 0 otherwise ($i = 1, 2, \ldots, N$). If $F_X$ and $F_Y$ are normal distributions differing only in location, then the conditional log-odds of group membership (given $w$) increases linearly with $w$. That is,

$$\Pr(Z = 1 \mid w) = 1/\{1 + \exp[-(\alpha_L + \beta_L w)]\},$$

where $\beta_L$ is the linear discriminant function coefficient. If $F_X$ and $F_Y$ are normal distributions differing in both location and scale, then the conditional log-odds is a quadratic function of $w$ with

$$\Pr(Z = 1 \mid w) = 1/\{1 + \exp[-(\alpha + \beta w + \gamma w^2)]\},$$

or equivalently

$$\text{log-odds}[\Pr(Z = 1 \mid w)] = \alpha + \beta w + \gamma w^2.$$
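As an illustration of the quadratic model, the following is a minimal sketch of the ordinary least squares version of the generalized test, assuming plain Python with no external libraries. The function names (`generalized_test`, `least_squares`, `solve_linear`, `midranks`) are hypothetical and are not taken from O’Brien’s paper. The sketch regresses the group indicator Z on W and W² (or on ranks and squared ranks) and returns F statistics for the quadratic term alone and for the linear and quadratic terms jointly. The pooled values are centred before squaring, which improves numerical conditioning and does not change either F statistic. Converting the statistics to p-values requires F reference distributions, which are not computed here.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def least_squares(columns, z):
    """Coefficients and residual sum of squares from regressing z on columns."""
    k, n = len(columns), len(z)
    xtx = [[sum(columns[i][t] * columns[j][t] for t in range(n))
            for j in range(k)] for i in range(k)]
    xtz = [sum(columns[i][t] * z[t] for t in range(n)) for i in range(k)]
    coef = solve_linear(xtx, xtz)
    fitted = [sum(coef[i] * columns[i][t] for i in range(k)) for t in range(n)]
    return coef, sum((zt - ft) ** 2 for zt, ft in zip(z, fitted))


def midranks(values):
    """Ranks 1..N of the values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def generalized_test(x_sample, y_sample, use_ranks=False):
    """Regress Z (1 for the Y sample) on W and W^2; needs >= 3 distinct values."""
    w = list(x_sample) + list(y_sample)
    z = [0.0] * len(x_sample) + [1.0] * len(y_sample)
    if use_ranks:                      # generalized rank-sum variant
        w = midranks(w)
    n = len(w)
    centre = sum(w) / n
    w1 = [v - centre for v in w]       # centred values (or centred ranks)
    w2 = [v * v for v in w1]
    ones = [1.0] * n
    _, rss_null = least_squares([ones], z)
    _, rss_linear = least_squares([ones, w1], z)
    coef, rss_quad = least_squares([ones, w1, w2], z)
    s2 = rss_quad / (n - 3)            # residual variance of the quadratic fit
    return {
        "gamma": coef[2],
        "f_quadratic": (rss_linear - rss_quad) / s2,      # 1 and N-3 d.f.
        "f_overall": (rss_null - rss_quad) / (2 * s2),    # 2 and N-3 d.f.
    }
```

For example, `generalized_test([-1, 0, 1], [-2, 2])` describes two samples with the same mean but different spread: the linear term explains nothing, while the quadratic term carries all of the association.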

One can estimate $\beta$ and $\gamma$ by regressing Z against W and W² using ordinary least squares or logistic regression. O’Brien recommends testing the linearity assumption by testing $\gamma = 0$ and, if one concludes $\gamma \neq 0$, performing an overall test of association by testing $\beta = \gamma = 0$.

In the generalized rank-sum test, the original data are replaced by their ranks. The usual motivation for transforming the original values to ranks is non-normality of the data. O’Brien gives another reason for using the rank transformation: when the two samples come from normal distributions differing in both location and scale, the distribution of W in the pooled sample will be skewed, a feature that will be even more pronounced with W². According to O’Brien, a quadratic model generalization of the rank-sum test, obtained by regressing Z on the ranks and squared ranks, might provide a powerful test for comparing two samples when one therapy has a heterogeneous response.

Additionally, O’Brien recommends three steps for the comparison of two samples:

1. Perform the t-test in the usual way. If unequal variances are observed, then this might indicate heterogeneous responses due to treatment.
2. Even if the t-test is not significant, apply the quadratic model, and if $\gamma$ is close to 0 then conclude that no heterogeneous responses are apparent. However, if $\gamma \neq 0$ then this suggests that a treatment effect may exist which operates differentially within the patient population.
3. If $\gamma \neq 0$, perform an overall test for association using the 2 d.f. test of the hypothesis $\beta = \gamma = 0$.

The methodology described above can be extended to the identification of subgroups. Because the experimental therapy may affect patient subgroups differently (for example, some patients may benefit greatly from the experimental therapy while others may not), O’Brien also suggests that the alternative hypothesis should incorporate the possibility of a heterogeneous response to the experimental therapy.

O’Brien performed Monte Carlo simulations to compare the type I error rates of the following tests: t; Welch’s t; generalized t; rank-sum; generalized rank-sum; Kolmogorov-Smirnov; Siegel-Tukey and maximal chi-squared. Welch’s t-test (Winer⁶) is based on adjusting the degrees of freedom for the usual t-test when the variances of the two samples are not equal. Lehmann⁷ describes the Siegel-Tukey test as being sensitive when the effect of the experimental treatment is to produce both large and small responses. If the placebo or control group produces this type of effect, then the ranks for the treatment group will be larger than the ranks for the placebo group. To accomplish this, the low ranks are assigned to the extreme observations and the ranks increase toward the centre. The scheme proposed by Lehmann is as follows: assign rank 1 to the smallest observation, rank 2 to the largest, rank 3 to the second largest, rank 4 to the second smallest and so on. The rank-sum test is performed on these ranks.

In the maximal chi-squared test as described by Miller and Siegmund,⁸ a point in the response variable is selected which best separates the two treatment groups.
The usual chi-squared test is performed at each point of the variable, forming a 2 x 2 table of counts, and the maximum chi-squared is selected from all the possible chi-squared values. The value of the maximum chi-squared is compared with the appropriate critical values given by Miller and Siegmund.

Conover and Salsburg² have proposed four locally most powerful tests for detecting treatment effects when only a subset of patients can be expected to respond to the treatment. The basic concept of locally most powerful rank tests is based on a statistic proposed by Hajek and Sidak.⁹ It is shown that the most powerful rank test against local alternatives is a linear rank test that rejects the null hypothesis for large values of the test statistic

$$T = \sum_{i=1}^{n} S(R_i),$$

where $R_i$ are the ranks in the combined sample and $S(i)$ are suitably chosen scores. Conover and Salsburg extended the statistic T into four locally most powerful tests ($S_1$, $S_2$, $S_3$ and $S_4$) based on four different models.
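Two of the comparison procedures described in this section are simple enough to sketch directly. The following is a minimal illustration, assuming plain Python and untied data for the Siegel-Tukey scores; the function names `siegel_tukey_ranks` and `max_chi_squared` are hypothetical. The first function assigns ranks by the scheme quoted from Lehmann, and the second scans every cut-point of the pooled data and returns the largest 2 x 2 chi-squared statistic. The critical values of Miller and Siegmund are not reproduced here, so the maximum must not be referred to the ordinary chi-squared distribution.

```python
def siegel_tukey_ranks(values):
    """Rank 1 to the smallest value, 2 to the largest, 3 to the second largest,
    4 to the second smallest, and so on (ties are not handled in this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    lo, hi = 0, len(order) - 1
    from_low, batch, rank = True, 1, 1   # one value first, then pairs
    while lo <= hi:
        for _ in range(batch):
            if lo > hi:
                break
            if from_low:
                ranks[order[lo]] = rank
                lo += 1
            else:
                ranks[order[hi]] = rank
                hi -= 1
            rank += 1
        from_low, batch = not from_low, 2
    return ranks


def max_chi_squared(x_sample, y_sample):
    """Largest 2 x 2 chi-squared statistic over all cut-points of the pooled
    data, together with the cut-point at which it occurs."""
    pooled = sorted(set(x_sample) | set(y_sample))
    n_x, n_y = len(x_sample), len(y_sample)
    n = n_x + n_y
    best_stat, best_cut = 0.0, None
    for cut in pooled[:-1]:              # the largest value leaves an empty column
        a = sum(v <= cut for v in x_sample)   # X values at or below the cut
        c = sum(v <= cut for v in y_sample)   # Y values at or below the cut
        b, d = n_x - a, n_y - c
        stat = n * (a * d - b * c) ** 2 / (n_x * n_y * (a + c) * (b + d))
        if stat > best_stat:
            best_stat, best_cut = stat, cut
    return best_stat, best_cut
```

With the Siegel-Tukey scores, the sample with the more extreme observations receives the smaller ranks, so an ordinary rank-sum statistic computed on these scores responds to differences in spread and not to differences in location.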

3. APPLICATIONS

As described earlier, the data from a randomized placebo-controlled clinical trial in CHF patients are used as an example for the application of the tests described above. As a study requirement, patients had to have documented clinical evidence of fatigue and dyspnoea prior to being randomized. During the baseline phase, symptomatic patients were stabilized on digoxin and diuretics. After completing the baseline phase, patients were randomized to one of four treatment groups, including a placebo which was a replacement therapy for digoxin. Because active therapy (digoxin) was withdrawn from some treatment groups, administration of other therapies for the treatment of CHF (co-intervention) was permitted for those patients whose clinical condition worsened.

Table I. Mean ± SD for baseline and change in ETT for four treatment groups

Treatment   Number of patients   Number of patients     Baseline ETT   Change in ETT
group       randomized           with analysable data   (seconds)      (seconds)
D + M       60                   53                     477 ± 134      48 ± 122
D           62                   60                     452 ± 162      65 ± 117
M           59                   47                     465 ± 152      83 ± 173
P           49                   40                     437 ± 163       1 ± 144

The primary measures of outcome in this study were:

1. Exercise tolerance time (ETT);
2. Co-intervention [treatment failure] rates.

ETT was selected as a primary outcome because the change from baseline in duration of maximum ETT, assessed by progressive multistage treadmill testing after three months of treatment, has been a standard endpoint for evaluating drug efficacy in congestive heart failure patients.

The study design involved randomization to possibly ineffective treatments; therefore, the investigator was advised to maintain close clinical surveillance of the patient in order to detect any potential deterioration in clinical status. If, in the judgement of the investigator, a patient’s heart failure worsened during the chronic phase such that therapeutic intervention became clinically necessary, the patient was declared a treatment failure. Patients who required co-intervention were subsequently followed according to the visit/evaluation schedule to completion of the study.

Two hundred and thirty CHF patients were randomized in this study. The patients’ mean age was 59 years with standard deviation 13.6 years. One hundred and ninety-six (70 per cent) were men. Two-thirds of the patients (67 per cent) were in New York Heart Association (NYHA) functional class III. The mean ejection fraction was 23.9 per cent with standard deviation 10.6 per cent, and the mean baseline ETT was 7.5 minutes, measured on a modified Naughton symptom-limited treadmill exercise test.

The digoxin group (D), in this trial, represents the control group, since no medication was changed from the baseline throughout the chronic phase. The other treatment groups, D + M, M and P, can be considered as the addition of milrinone, the substitution of milrinone for digoxin, and the withdrawal of digoxin, respectively. Therefore, three comparisons among the four treatment groups are of interest: D + M versus D, which tests the additive effect of milrinone; P versus D, which tests whether there is a difference between continuation and withdrawal of digoxin; and M versus D, which tests whether milrinone can be substituted for digoxin.

Table I presents the means and standard deviations of the change in ETT for the four treatment groups. The change in ETT was calculated as the last blinded observation minus the baseline observation. This analysis is referred to as ‘end point’ or ‘last observation carried forward’ analysis, and was pre-specified in the protocol as the primary analysis.
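The end-point calculation can be sketched as follows. This is a minimal illustration, assuming that missing visits are recorded as `None` and that a patient with no post-baseline observation contributes no analysable change; both are assumptions made here, and the function name `endpoint_change` and the ETT values are hypothetical, not trial data.

```python
def endpoint_change(baseline, blinded_visits):
    """Last blinded observation minus baseline; None if nothing was observed."""
    observed = [value for value in blinded_visits if value is not None]
    if baseline is None or not observed:
        return None
    return observed[-1] - baseline


# Hypothetical ETT values in seconds: (baseline, blinded follow-up visits).
patients = {
    "A": (450, [470, 510, 525]),      # completed all visits
    "B": (480, [465, None, None]),    # last observation is the first visit
    "C": (430, [None, None, None]),   # no post-baseline ETT: not analysable
}
changes = {pid: endpoint_change(b, v) for pid, (b, v) in patients.items()}
analysable = {pid: c for pid, c in changes.items() if c is not None}
```

Here patient B’s change is carried forward from the only available visit, and patient C is excluded from the analysable set.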

Figure 1. Frequency distribution of change in ETT (treatments D + M and D; horizontal axis: change in ETT (seconds); vertical axis: per cent frequency)

Figure 2. Frequency distribution of change in ETT (treatments D and P; horizontal axis: change in ETT (seconds); vertical axis: per cent frequency)

3.1. Comparison of Group D + M versus Group D

The distributions of ETT changes for the three comparisons are shown in Figures 1 to 3. Figure 1 shows the distribution of D + M versus D. It appears from the plot that there is no difference between these two treatment groups. This is confirmed in Table II, which shows the p-values for the various tests. The lack of statistical significance for this comparison suggests that there is little or no difference between D + M and D for either mean change or distribution of changes in ETT.

Table II. P-values from various tests for three comparisons

Statistical test                                D + M versus D   P versus D   M versus D
t-test                                          0.473            0.017        0.509
Generalized t-test (O’Brien), 2 d.f.            0.684            0.028        0.039
Generalized t-test (O’Brien), QT                0.630            0.270        0.029
Welch t-test                                    0.473            0.017        0.511
Rank-sum test                                   0.609            0.031        0.756
Generalized rank-sum test (O’Brien), 2 d.f.     0.848            0.051        0.025
Generalized rank-sum test (O’Brien), QT         0.804            0.280        0.009
Siegel-Tukey                                    0.798            0.389        0.009
Max χ²                                          0.400            0.240        0.071
S1                                              0.225            0.045        0.160
S2                                              0.676            0.171        0.136
S3                                              0.675            0.229        0.075
S4                                              0.652            0.075        0.241

QT: quadratic term; d.f.: degrees of freedom

Figure 3. Frequency distribution of change in ETT (treatments M and D; horizontal axis: change in ETT (seconds); vertical axis: per cent frequency)

3.2. Comparison of Group P versus Group D

Figure 2 shows the distribution of change in ETT for group D and group P. On comparing group D and group P, there is evidence of a shift in location. This shift in location is supported by significant p-values for the t, generalized t, Welch’s t and rank-sum tests (Table II). Note that the quadratic terms for the generalized t and generalized rank-sum tests are not significant, indicating no evidence of a non-linear effect.
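As a rough check on the P versus D comparison, the conventional and Welch t statistics can be recomputed from the rounded means, standard deviations and sample sizes in Table I. The sketch below assumes plain Python; the function names `pooled_t` and `welch_t` are hypothetical. Because the inputs are rounded summary statistics, the results only approximate an analysis of the individual patient data, and no p-values are computed since the t distribution is not available in the standard library.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with a pooled variance, and its degrees of freedom."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, n1 + n2 - 2


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch t statistic with adjusted (Satterthwaite) degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Change in ETT (seconds) from Table I: group P versus group D.
t_pooled, df_pooled = pooled_t(1, 144, 40, 65, 117, 60)
t_welch, df_welch = welch_t(1, 144, 40, 65, 117, 60)
```

Both statistics are negative because the mean change in group P is smaller than in group D; the Welch version has fewer degrees of freedom because the two standard deviations differ.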

3.3. Comparison of Group M versus Group D

The distribution of changes in ETT for group M versus group D is shown in Figure 3. It is clear from the plot that these two treatment groups have different distributions. Group M has a distribution which is asymmetric, and in the right tail of the distribution the proportion of patients showing improvement in ETT is higher in group M than in group D. Two-sample t and rank-sum tests gave no indication of a difference between the groups. However, on applying quadratic regression models (generalized t and generalized rank-sum tests) there was a significant effect. Using the generalized t-test, the overall test for association yielded p = 0.039, while a test for the quadratic term gave p = 0.029. Using ranks, the quadratic term was significant, p = 0.009, and the overall test yielded p = 0.025. The observed association between the log-odds of treatment membership and the change in ETT is U-shaped. This pattern could explain the insensitivity of the t and rank-sum tests. If one examines the distribution of change in ETT within the two treatment groups, the proportion of patients in the lower tail is comparable for the two groups, while group M is over-represented in the upper tail of the distribution. This comparison of groups M and D suggests that a test for only a shift-in-location model may be inappropriate. The non-linear analysis points to the possible existence of subgroups with different treatment effects. The other significant p-value (0.009) from Table II is from the Siegel-Tukey test; this test again suggests that the two distributions are not comparable.

Figure 4. Relationship between probability of co-intervention and ejection fraction (horizontal axis: ejection fraction (%); vertical axis: probability of co-intervention)

Additional analyses demonstrated a significant relationship between three-month survival and baseline ejection fraction (note that the trial was three months in duration). Ejection fraction is a medically acceptable surrogate for the severity of the patient’s underlying disease state, so this finding was expected.¹⁰⁻¹² The lower the ejection fraction, the worse the patient’s prognosis. It seemed appropriate to analyse the relationship of ejection fraction to the primary outcome measures. As expected from the survival data, patients whose ejection fraction was low showed a higher incidence of co-intervention (one of the primary outcome measures) than patients whose ejection fraction was higher (Figure 4).
We then stratified patients into two strata: patients with < 20 per cent ejection fraction and patients with > 20 per cent ejection fraction. A cutpoint of 20 per cent ejection fraction was selected for the following reasons: (i) the cutpoint of 20 per cent seemed to separate seriously ill patients from those who were less seriously ill; (ii) the value of 20 per cent is the value below which ejection fraction should be grouped because of inherent variability in the measurement, and (iii) the value of 20 per cent comes close to the median (21 per cent) of the distribution of ejection fraction for all patients.

Table III. Mean ± SD change in ETT (seconds) stratified by baseline ejection fraction

Treatment   Patients with EF < 20%   Patients with EF > 20%
D + M       29 ± 117 (N = 27)        68 ± 127 (N = 26)
D           58 ± 114 (N = 22)        68 ± 120 (N = 38)
M           37 ± 178 (N = 23)        128 ± 160 (N = 24)
P           10 ± 131 (N = 12)        -3 ± 151 (N = 28)

Table III shows the means and standard deviations of change in ETT for the two strata. If one compares the change in ETT for the two strata, it appears that milrinone (group M) has a pronounced effect in patients whose EF is > 20 per cent. Although the practical considerations which lead to dichotomizing EF can be appreciated, the effect of treatment is more likely on biological grounds to vary continuously with EF.
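One simple consistency check, which is not part of the original analysis, is to recombine the stratum means of Table III, weighted by stratum size, and compare them with the overall mean changes in Table I. The sketch below assumes plain Python; `combined_mean` is a hypothetical helper, and all inputs are the rounded values printed in the two tables.

```python
def combined_mean(strata):
    """Size-weighted mean of (mean, n) pairs, one pair per stratum."""
    total_n = sum(n for _, n in strata)
    return sum(mean * n for mean, n in strata) / total_n


# (mean change in ETT, N) for EF < 20% and EF > 20%, from Table III.
table_iii = {
    "D + M": [(29, 27), (68, 26)],
    "D": [(58, 22), (68, 38)],
    "M": [(37, 23), (128, 24)],
    "P": [(10, 12), (-3, 28)],
}
# Overall mean change in ETT from Table I.
table_i_change = {"D + M": 48, "D": 65, "M": 83, "P": 1}

overall = {group: combined_mean(s) for group, s in table_iii.items()}
# Difference between groups M and D within each stratum (seconds).
m_minus_d = [m[0] - d[0] for m, d in zip(table_iii["M"], table_iii["D"])]
```

The recombined means agree with Table I to within rounding, and the M minus D difference changes sign between the two strata, which is the pattern that motivates the subgroup hypothesis.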

4. DISCUSSION AND CONCLUSION

Clinical trials often include a study population of patients with chronic disease who are maintained on some active standard treatment and are neither improving nor deteriorating. The condition of these symptomatic patients may improve through experimental therapy. However, an experimental therapy may not have a beneficial effect on all patients exposed to the therapy. For example, in the trial we have described, milrinone may have had a selective activity affecting only certain patients: some patients may have benefitted greatly, some a little, and some not at all. Therefore, improvements in the endpoint (ETT) with milrinone may not necessarily be reflected as a simple shift in the mean, but rather as a more complex modification in the shape of the distribution. Consequently, an important consideration is to test for a shift in distribution based on the relationship between responders and non-responders in the various treatment groups.

Given that a significant difference between distributions is found, the question of its meaning has to be addressed. Is the difference a spurious finding or one reflecting a real biological effect? The finding may be real, as milrinone is a peak III phosphodiesterase inhibitor with both inotropic¹³ and vasodilator¹⁴ actions in heart failure patients. At the cellular level, both of these actions are effected by increases in the intracellular concentration of cyclic AMP in both cardiac and vascular smooth muscle.¹⁵ In advanced congestive heart failure (CHF), the inotropic action of phosphodiesterase inhibitors, including milrinone, may be impaired due to a depletion of cyclic AMP in the myocardium.¹⁶,¹⁷ Furthermore, it is well established that advanced CHF is accompanied by pronounced activation of endogenous peripheral vasoconstrictor systems, including the renin-angiotensin-aldosterone system, the beta-adrenergic system, and the arginine-vasopressin system.¹⁸ The degree of activation of these systems is proportional to the severity of the disease,¹⁹ and results in an impaired peripheral vasodilatory capacity and reduced exercise capacity in patients with advanced CHF.²⁰,²¹ Thus, the biochemical and physiological derangements which accompany advanced CHF may produce a situation in which both the inotropic and vasodilator actions of milrinone are impeded, resulting in less pronounced improvements in exercise tolerance time.


In this study, we used the generalized t-test and generalized rank-sum test (O’Brien) to test the hypothesis of changes in distribution. The analyses demonstrated a significant difference between the distributions of changes in ETT for milrinone and digoxin. ETT data were also analysed by other two-sample tests such as the Welch t-test, the maximum chi-squared and S1, S2, S3 and S4. The purpose of presenting the p-values from these tests is not to compare the tests but simply to show the performance of these tests with these data, employing an analysis aimed at generating hypotheses. Consequent to this analysis, we identified a baseline prognostic factor (ejection fraction) which may affect the size of the treatment difference.

It should also be noted that these methods are intended not as substitutes for standard analyses but as complements to them. It is clear from the example used here that tests for shift in location are necessary but not sufficient. Tests for changes in the distribution shape should also be carried out, and may lead to hypotheses that can only be addressed by additional studies. Based on this retrospective subgroup hypothesis, a prospective trial has been initiated to determine whether milrinone produces greater efficacy than digoxin in CHF patients with ejection fraction > 20 per cent.

REFERENCES

1. O’Brien, P. C. ‘Comparing two samples: extensions of the t, rank-sum and log-rank tests’, Journal of the American Statistical Association, 83, No. 401, 52-61 (1988).
2. Conover, W. J. and Salsburg, D. S. ‘Locally most powerful tests for detecting treatment effects when only a subset of patients can be expected to respond to treatment’, Biometrics, 44, 189-196 (1988).
3. Good, P. I. ‘Detection of a treatment effect when not all experimental subjects will respond to treatment’, Biometrics, 35, 483-489 (1979).
4. Doherty, J. E. ‘Conventional drug therapy in the management of heart failure’, in Cohn, J. N. (ed.) Drug Treatment of Heart Failure, Yorke Medical Books, New York, 1983.
5. Smith, T. W., Braunwald, E. and Kelly, R. A. ‘The management of heart failure’, in Braunwald, E. (ed.) Heart Disease: A Textbook of Cardiovascular Medicine, W. B. Saunders Co., Philadelphia, 1988.
6. Winer, B. J. Statistical Principles in Experimental Design, McGraw-Hill, 1971.
7. Lehmann, E. L. Nonparametrics, McGraw-Hill, 1975.
8. Miller, R. and Siegmund, D. ‘Maximally selected chi square statistics’, Biometrics, 38, 1011-1016 (1982).
9. Hajek, J. and Sidak, Z. Theory of Rank Tests, Academic Press, New York, 1967.
10. Califf, R. M., Bounous, P., Harrell, F. E., McCants, B., Lee, K. L., McKinnis, R. A. and Rosati, R. A. ‘The prognosis in the presence of coronary artery disease’, in Braunwald, E., Mock, M. B. and Watson, J. (eds.) Congestive Heart Failure: Current Research and Clinical Applications, Grune and Stratton, 1981.
11. Schwarz, F., Mall, G., Zebe, H., Schmitzer, E., Manthey, J., Scheurlen, H. and Kubler, W. ‘Determinants of survival in patients with congestive cardiomyopathy: quantitative morphologic findings and left ventricular hemodynamics’, Circulation, 70, No. 6, 923-928 (1984).
12. Unverferth, D. V., Magorien, R. D., Moeschberger, M. L., Baker, P. B., Fetters, J. K. and Leier, C. V. ‘Factors influencing the one-year mortality of dilated cardiomyopathy’, American Journal of Cardiology, 54, 147-152 (1984).
13. Ludmer, P. L., Wright, R. F., Arnold, J. M., Ganz, P., Braunwald, E. and Colucci, W. S. ‘Separation of the direct myocardial and vasodilator actions of milrinone administered by an intracoronary infusion technique’, Circulation, 73, (1), 130-137 (1986).
14. Cody, R. J., Muller, F. B., Kubo, S. H., Rutman, H. and Leonard, D. ‘Identification of the direct vasodilator effect of milrinone with an isolated limb preparation in patients with chronic congestive heart failure’, Circulation, 73, (1), 124-129 (1986).
15. Silver, P. J. ‘Biochemical aspects of inhibition of cardiovascular low (Km) cyclic adenosine monophosphate phosphodiesterase’, American Journal of Cardiology, 63, (2), 2A-9A (1989).
16. Feldman, M. D., Copelas, L., Gwathmey, J. K., Phillips, P., Warren, S. E. and Schoen, F. J. ‘Deficient production of cyclic AMP: pharmacologic evidence of an important cause of contractile dysfunction in patients with end-stage heart failure’, Circulation, 75, (2), 331-339 (1987).
17. Bohm, M., Diet, F., Kemkes, B., Erdmann, E. and Klinik, M. I. ‘Functional evidence for reduced cAMP-formation in the failing human heart’, Circulation, 78, (4), II-347 (1988).
18. Francis, G. S., Goldsmith, S. R., Levine, B. T., Olivari, M. T. and Cohn, J. N. ‘The neurohumoral axis in congestive heart failure’, Annals of Internal Medicine, 101, 370-377 (1984).
19. Levine, B. T., Francis, G. S., Goldsmith, S. R., Simon, A. B. and Cohn, J. N. ‘Activity of the sympathetic nervous system and renin-angiotensin system assessed by plasma hormone levels and their relation to hemodynamic abnormalities in congestive heart failure’, American Journal of Cardiology, 49, 1659-1666 (1982).
20. Zelis, R., Mason, D. T. and Braunwald, E. ‘A comparison of the effects of vasodilator stimuli on peripheral resistance vessels in normal subjects and in patients with congestive heart failure’, Journal of Clinical Investigation, 47, 960-970 (1968).
21. LeJemtel, T. H., Maskin, C. S., Lucido, D. and Chadwick, B. J. ‘Failure to augment maximal limb blood flow in response to one-leg versus two-leg exercise in patients with severe heart failure’, Circulation, 74, (2), 245-251 (1986).
