
Journal of Biopharmaceutical Statistics, 1(1), 121-132 (1991)

SITUATIONS WHERE FORMAL CONFIRMATORY ANALYSIS IS INAPPROPRIATE FOR CLINICAL RESEARCH


David Salsburg Pfizer Central Research Groton, Connecticut 06340

Keywords. Phase II clinical trials; Hypothesis testing; Estimation

Abstract. When new drugs are developed for indications where no effective therapies have been established, such as emphysema, useful measures of clinical activity have to be derived from a large number of proposed measures. It often happens that most of the proposed measures are of little use in tracking the disease and that the number of patients needed to establish whether the drug is effective is very large. Thus, the first study in man may be the size of the usual Phase III "confirmatory" study, and the expense and the time needed to complete it are so great that decisions about the future development of the drug will depend upon this one large study. The situation of such innovative therapies is examined, and methods for controlling the error rate when almost all analyses have to be exploratory are proposed.

A General Word About Data-Driven Statistical Analyses

Data-driven hypothesis testing and model formulation appear very frequently in the statistical literature. In fact, a cursory review of such statistical journals as the Journal of the Royal Statistical Society (Series A, B, or C), Biometrika,
Technometrics, Biometrics, Jurimetrics, Journal of the American Statistical Association, or even the Annals of Statistics will show that most of the articles that deal with data describe techniques where the nature of the data observed dictates the methods of analysis used. This has been true through most of this century. The original agricultural papers of R. A. Fisher (1, 2), papers by "Student" (3) dealing with yields from different strains of cereal, the analysis of the Federalist papers by Mosteller and Wallace (4), the Los Angeles air pollution data analysis of Tiao and Box (and students) (5): all of these derive models to be tested from the data and test those models on the same data.

So, from what source do we get the idea that such analyses are "nonstandard"? This idea is found primarily in the statistical/medical literature, in such journals as the Journal of the American Medical Association, the New England Journal of Medicine, and Controlled Clinical Trials. It is a result of the interaction between skeptical statisticians and medical researchers who are all too aware of a long history of "false positives" in medical science. These falsely effective therapies run from theriac and powdered mummy of the 1500s through the use of heroin to avoid addiction to morphine, lobotomies, lytic cocktails, mesmerism, DES to prevent loss of fetus, histamine injections to prevent the pain of childbirth, cigarettes to prevent cancer, DMSO, APCs, etc. To avoid this sort of error, there is general agreement within the medical/statistical community that clinical trials have to be examined under rigid rules, defined in advance of the study.

As a sidelight, I am intrigued to notice that almost all the false positives of medicine seem to have arisen from completely uncontrolled clinical experience, and that there are no cases I know of where a controlled trial produced a false positive, even when examined with data-driven techniques. But, be that as it may, the result of this consensus in the medical/statistical community is that we have put controlled clinical trials into a chastity belt. As a result, we are in danger of being like the medieval knight in the Balzac short story who has his wife locked into a chastity belt before he goes off to the wars. He returns several weeks later, hot and longing for the pleasures of the marriage bed. However, he can't get the belt unlocked. He struggles, she struggles, until they manage to break the key in the lock.

A Problem Where Data-Driven Analysis Must Be Run

Sometime in the next few years, a drug company is going to introduce a new antiemphysema drug into human trials. We know that three proteolytic enzymes are involved in the destruction of lung tissue that is the hallmark of emphysema. Several companies are working on inhibitors of one or more of
these enzymes. This drug will be known to inhibit the appropriate enzymes in vitro. It will have blocked or reduced the rate of deterioration in an animal model. How will we know if it is effective in humans?

There is no good single measure of emphysema such as blood pressure for hypertension. In fact, unambiguous diagnosis is possible only on autopsy. We have many indirect measures of lung function. Spirometry provides us with about 10 such measures. The use of a body plethysmograph provides another five. We can sometimes measure elastic recoil of the lung through the combined use of an esophageal balloon and forced expiration, but this is a very difficult maneuver. We will also have several patient rating scales of quality of life and specific symptoms. We will probably collect patient diaries. There will be physical examinations at each clinic visit covering some 20 signs and symptoms that are associated with the disease. We will need to collect ancillary information, dealing with smoking, alcohol consumption, oxygen usage, and so on, as possible covariates to any future analysis. The end result will be about 150 measures taken at each clinic visit. It will be possible to associate groups of these measures, but the grouping will be based on both anatomical and clinical constructs and so be overlapping and, to some extent, redundant. I estimate that there may be about 20 such "reasonable" groupings.

Since emphysema is a disease of slow deterioration, we will probably have to run the study for 12 months in order to detect any slowing down of the deterioration due to treatment. Over the 12-month period, we might have 6-20 clinic visits. When it comes to comparing visits, it will seem reasonable to compare each visit to the baseline, to run a trend test across visits, and to run several different contrasts comparing early visits to late visits. Finally, since the disease is not easily separated from other chronic lung conditions, there will be an a priori subdivision of patients by supposed etiology or the occurrence of concurrent illness.

Since each of the measures is probably not very good at detecting change and since the patients will have to be followed for a long period of time, it is not feasible to run small Phase II studies to generate hypotheses that will be tested in large Phase III studies. The company that seeks to investigate an antiemphysema drug will have to dive into a large, very costly study with little or no previous experience. This study will provide us with well over 2000 opportunities to run hypothesis tests comparing the effect of treatment to placebo. How will we know when a "significance" is real and when it is due to the random fall of data? If we try to reduce the number of hypothesis tests by combining measures, how will we know when we have found a reasonably powerful combination for detecting treatment effects?
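To make the multiplicity concrete, here is a minimal sketch, not from the paper, that assumes for illustration that the tests are independent and each run at the nominal 0.05 level; it shows how quickly the chance of at least one spurious "significance" approaches certainty.

```python
# Family-wise error rate when k independent tests are each run at level alpha
# and every null hypothesis is true (independence is an illustrative assumption).
def family_wise_error_rate(k: int, alpha: float = 0.05) -> float:
    return 1.0 - (1.0 - alpha) ** k

for k in (1, 20, 150, 2000):
    print(f"{k:5d} tests: P(at least one spurious 'significance') = "
          f"{family_wise_error_rate(k):.4f}")
# With 2000 nominal 0.05-level tests, at least one false positive is essentially certain.
```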

This first controlled trial in emphysema will be very expensive to run, in terms of money, time, and patients. Since we will be dealing with an elderly population and do not want the results confounded by large amounts of concomitant medications, the entrance criteria will probably be difficult to meet. In addition, the number of centers capable of running a complicated study like this will be limited. I suspect that the total patient population available for 12-month studies of emphysema may be less than 5000 in the United States. Recruitment for a 12-month study could extend over 1-2 years. And, the need for special measuring equipment, the overhead charged by large university centers, and so on, will cost the sponsoring company several tens of thousands of dollars per patient. Thus, a large placebo-controlled study with about 200 patients will take about 3 years to run, cost $4-5 million, and represent a major commitment of the available resources in this field. Such a study will be done only once, unless the study shows strong evidence of efficacy and reasonable safety. When it comes to deciding whether this is a useful treatment moiety, this one 200-patient study will be all that is available in the way of controlled clinical experience.

With this one study, we will have to:

1. Determine the "normal" pattern of disease over 12 months (since very little is known now about the short-term chronicity of emphysema)
2. Find the best measures or combinations of measures that will track the change in disease and be capable of detecting clinically useful treatment effects
3. Determine if the new drug "works"

Although I am using the coming emphysema trial as a paradigm, it should be noted that these problems are not unique to emphysema. Similar problems can be found in the development of

Congestive heart failure treatments
Antimetastatic (nontoxic) drugs
Disease-modifying antiarthritic drugs
Drugs for chronic lung inflammatory disease
Treatments for immunological disorders
Nonphenothiazine antipsychotic drugs
Steroid-sparing antiarthritic drugs
Chronic treatments for asthma
Osteoporosis treatments
Drugs for diabetic neuropathy and retinopathy
Drugs for intermittent claudication
Drugs for Paget's disease
Treatments of tardive dyskinesia
And many others

In fact, this situation occurs for every new exciting area of drug research. It fails to hold only for well-trodden paths, for new drugs that are slight molecular modifications of drugs and conditions already well studied.

"Looking at Data"

In this section, I bow to prior papers by Fisher (6), Cochran (7), Box, Tukey (8), Anscombe (9), and Neyman (10), all of whom have published examples of data dredging. Let me start with a quotation from Anscombe (11):

How much should we look at the data? Everyone agrees that we must sometimes, to some extent, look at the data. . . . The trouble is, it is easy to be puzzled and misled if we look at them very much. . . . Given adequate computing power, the question of how much to look, how far to go, outstrips available significance tests or any other critical apparatus. . . . To refrain from examining the data because we do not know how to evaluate what we see, that surely is foolish. To assume without evaluation that everything seen is important, is foolish too. Sometimes an unexpected feature is perceived in the data, it is very pronounced, and a plausible explanation immediately suggests itself . . . or perhaps, upon seeing the pronounced unexpected feature, we remember an obviously important fact . . . and we are convinced that our understanding of the subject . . . has taken a real step forward. . . . On other occasions an unexpected feature of the data . . . leaves us wondering how much attention to pay to it. . . . Is it an important clue or a will-o'-the-wisp?

In our emphysema trial, we will have over a thousand numbers collected for each patient. We do not know which of these numbers can be used to track changes in disease state, if any. We have to look at the data. So, how can we be protected from chasing Anscombe's will-o'-the-wisps? I suggest that we can protect ourselves from at least one error, the error of declaring the treatment effective when it is not, by something I'll call "protected data dredging."

Protected Data Dredging

We are looking for interesting patterns in the data that reflect changes in disease state. Eventually, we hope to compare the treatment groups with respect to those patterns. So, let us initially combine all treatment groups and search for interesting patterns in the combined data. Several tools are now available for such searches. One is projection pursuit (12), done either interactively or through a computer algorithm like the Grand Tour (13).
Another is Tukey's smear-and-sweep (14). I personally prefer the smear-and-sweep technique because it may be that the best combination of data is not a linear combination of the collected numbers (as would be found by projection pursuit), but some weighted combination of classified events. We might, for instance, want to look at the largest of a set of measures at each visit, or the median, or an average of the upper quartile, etc. The choices of nonlinear combinations are limited only by the imagination of the analyst. The main point is to be sure that we locate interesting patterns without reference to treatment assignment.

Once we have located interesting patterns, we need to be sure that those patterns have meaning to the practicing clinician. That is, we must be able to tell the doctor who is going to use the treatment what measures to watch for efficacy, what degree of effect he should expect to see on those measures, and how long he should expect to wait before the treatment effect manifests itself. Thus, it may turn out that the third derivative of the flow-volume loop taken at a point halfway between the MEFR and the MMF appears to track patient change, but, unless we can convert this arcane mathematical abstraction to something involving patient disease, this insight will have no value. As G. Udny Yule (15) once wrote:

Failing the possibility of measuring that which you desire, the lust for measurement may . . . merely result in measuring something else, and perhaps forgetting the difference. . . .

A third principle emerges when we consider the true dimensionality of the study. Although we may have over a thousand measurements on each patient, it is the patients who are the units of experimentation. The statistically independent events that give rise to the random variation we are trying to understand are what happens to the individual patients. Thus, a useful measure of change in disease will be a measure that reflects the entire patient's disease. It is less important to know that the ratio of FEV1 to FVC changed than to know that the patient's ability to breathe has changed.

Thus, there are three principles we can follow to avoid error:

1. Hunt for interesting patterns in the data when all patients are combined, regardless of treatment.
2. Produce derived measures of change that have clear clinical interpretation.
3. Find measures of change that reflect the entire patient.
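As a rough illustration of principle 1, here is a minimal sketch of hunting for derived measures on the pooled data, with treatment assignment never entering the computation. The column names, the particular nonlinear summaries, and the per-visit trend are all illustrative assumptions, not measures proposed in this paper.

```python
# Sketch of "protected data dredging": derive candidate measures on the pooled
# data (all treatment groups combined) before the blind is ever touched.
import pandas as pd

def derived_measures(visits: pd.DataFrame) -> pd.DataFrame:
    """One row per patient per visit, with nonlinear summaries of related measures."""
    spirometry = ["fev1", "fvc", "pefr"]        # assumed spirometry columns
    symptoms = ["dyspnea", "cough", "wheeze"]   # assumed 0-4 symptom scores
    out = visits[["patient_id", "visit"]].copy()
    out["spiro_median"] = visits[spirometry].median(axis=1)   # median of a block
    out["worst_symptom"] = visits[symptoms].max(axis=1)       # largest of a set
    return out

def per_patient_trend(derived: pd.DataFrame, column: str) -> pd.Series:
    """Within-patient slope of a derived measure across visits (change per visit)."""
    def slope(g: pd.DataFrame) -> float:
        x = g["visit"] - g["visit"].mean()
        y = g[column] - g[column].mean()
        return float((x * y).sum() / (x * x).sum()) if (x * x).sum() > 0 else float("nan")
    return derived.groupby("patient_id").apply(slope)

# Note that treatment assignment appears nowhere above: patterns are hunted on the
# combined data, and only later are the chosen derived measures compared by arm.
```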

Testing the Hypothesis of No Treatment Difference

If we are clever and lucky, we will find a way to reduce the 1050 measurements to a small set of measures of change, possibly 10-15 clusters or derivations that reflect slightly different aspects of disease or slightly different views of the disease.
For instance, we might have one derived measure to describe the patient's difficulties in breathing as determined by clinic measuring instruments, another that describes the same difficulties as perceived by the patient, another that describes changes in blood gases, another that describes the resiliency of the lung, and so on.

Now, can we test for treatment differences? Can we break the blind and separate the patients into treatment groups to see which derived measure detects a difference in effect? I would suggest not doing so, yet. Even if we have developed the best set of derived measures possible, each one will reflect a slightly different aspect of the disease. Patients may differ in the degree to which any one aspect changes. Since these are newly developed measures, they are probably not as fine-tuned as they can be, so they may, each one of them, be lacking in adequate power to detect differences with a mere 200-patient study. Besides, if we run 10-15 hypothesis tests, we still have Anscombe's problem of knowing when to believe significance levels. I suggest that we combine the 10-15 derived measures into a single measure of response, where we allow the derived measures to reinforce one another. Done properly, we can produce a single hypothesis test (eliminating the multiple-testing problem) with the best power.

I have used two methods for combining these measures. One is due to Peter O'Brien (16). In O'Brien's technique, we rank patients across treatments for each measure and reduce each rank to a standardized rank, so that if one patient has missing values for a given measure, all patients are equally comparable across measures. We then average the standardized ranks for each patient across measures and compare treatments with respect to this score. The major advantage of O'Brien's technique is that it automatically scales each measure to lie between its largest and smallest value. As a rank test procedure, it also downgrades the effect of outliers. Its major disadvantage is that the resulting average score has no clear-cut clinical meaning. Thus, we might be able to conclude that there is a "significant" difference in effect among treatments, but we will have no way of converting that "significance" into a clinically useful measure of difference.
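A minimal sketch of the rank combination just described. The standardization shown (each rank divided by the number of patients with a nonmissing value, plus one) is an assumption about the intended standardization, and the final two-sample rank comparison is only one way the per-patient scores might be compared; neither detail should be read as O'Brien's (16) exact prescription.

```python
# Sketch of an O'Brien-style combination: rank patients within each measure,
# standardize the ranks, average them per patient, then compare treatments on
# that single per-patient score.
import numpy as np
from scipy import stats

def obrien_scores(measures: np.ndarray) -> np.ndarray:
    """measures: patients x derived measures, larger = better, NaN = missing."""
    n_patients, n_measures = measures.shape
    std_ranks = np.full((n_patients, n_measures), np.nan)
    for j in range(n_measures):
        col = measures[:, j]
        ok = ~np.isnan(col)
        ranks = stats.rankdata(col[ok])              # ranks among nonmissing patients
        std_ranks[ok, j] = ranks / (ok.sum() + 1.0)  # assumed standardization
    return np.nanmean(std_ranks, axis=1)             # average across measures

def compare_treatments(measures: np.ndarray, treated: np.ndarray):
    """treated: boolean vector marking the active-treatment patients."""
    scores = obrien_scores(measures)
    return stats.mannwhitneyu(scores[treated], scores[~treated])
```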

Another approach, which at first glance appears to be a little less powerful than O'Brien's, since it does not use all the information, is to pick a cutoff point for each of the 10-15 derived measures that is called "response." In many cases, if the derived measures have clinical meaning, we can assign that cutoff point on the basis of clinical perception. If we cannot, and if there is a placebo group in the study, then I suggest we pick the "response/no response" cutoff at the point where 20-30% of the placebo patients would "respond." This gives enough room for treatment to produce a greater than 30% response rate but is sufficiently bounded away from a zero response rate to allow for some possible treatment effect. For each patient, we then assign a score that consists of the percentage of derived measures on which the patient "responds."

If events of "response" are statistically independent for both placebo and active treatment, then we are dealing with a Polya distribution (17), and it is well known that the discriminating power of the average increases with the number of "cells." On the other hand, it might be that the measures are reasonably independent unless an effective treatment is applied, at which point they tend to produce similar responses. Then, the power is increased because the variance in the treated group is reduced. Finally, this technique has the great advantage that the measure of effect is the percentage of derived measures on which the patient responded. This has a clear clinical interpretation. We can state that there is a "significant" difference between treatments and that XX% of the treated patients responded on most of the measures while only YY% of the placebo patients did so.

So, I add a fourth principle for protected data dredging:

4. Run a single hypothesis test comparing treatments, one that combines all the derived measures.

Now can we break the blind and run that one big significance test? Not yet.

Summing Across Subgroups

The one aspect of "nonstandard" analysis that most horrifies the guardians of statistical chastity in the medical/statistical literature is the use of statistical tests run on subgroups of patients. They all fear the study where no overall effect is seen but where the authors point out that there was a highly significant difference among left-handed females over 37 years of age. However, there is another way to look at subgroups. We can use subgroups to further protect us from type I error if we follow R. A. Fisher's (18) lead:

When is a result significant? . . . A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give [a 5%] level of significance.

In effect, Fisher believed in biological replication. A single experiment with even a 0.001 significance was inadequate. To Fisher, the scientist could rest assured that the significance meant something only when he could design a study that repeated it. The new study need not be the same as the previous one.
The scientist should be able to design a new study that is more powerful than the previous one, using insights derived from the previous analysis. Since we have only one study available to us on emphysema, we cannot follow Fisher's call for replication exactly. However, we can use naturally occurring subsets of the data to verify that what we think we have found is consistent across those subsets. Some naturally occurring subsets are those based on
sex
center (investigator)
age
etiological subgroup

We cannot afford to demand formal 5% significance in each subgroup, but we can use the subgroups to divide the data and accumulate the single test statistic across all the groups. The general technique for combining across groups is fairly standard. We consider a parameter that should be constant across groups if the treatment effects are different from controls. In the Mantel-Haenszel procedure this is the odds ratio of a (0, 1) variable. Other possibilities arise naturally as we consider the clinical nature of the disease. Suppose this parameter is θ and its estimator for the ith subgroup is Ti. Then, for each estimator, we compute its expectation under the null hypothesis of no effect and its variance under the same null hypothesis. Finally, we construct a single test statistic that is asymptotically distributed as a Normal [0, 1] variate.
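Assuming the intended construction is the usual one for combining a common parameter across strata (an assumption, since the formulas are not given explicitly here), the statistic takes the standard form

```latex
% Assumed standard form; T_i is the estimator of the common parameter in subgroup i.
\[
  Z \;=\; \frac{\sum_{i} \bigl( T_i - \mathrm{E}[\,T_i \mid H_0\,] \bigr)}
               {\sqrt{\sum_{i} \mathrm{Var}(\,T_i \mid H_0\,)}}
  \;\sim\; N(0,1) \quad \text{asymptotically under } H_0 .
\]
```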

This yields the fifth and final rule for protected data dredging:

5. Run the single overall test of significance by combining across subgroups of patients, so that consistency of pattern across subgroups can be used both as a check on the validity of the "significance" and as a means of increasing power.

Is There Life After "Significance"?

Let us suppose that the emphysema drug "works." We have learned something about the chronic pattern of disease over 12 months. We have derived useful measures of efficacy that can be used to track changes in patients' disease.
And, with those measures, we have been able to show that there is a "significant" difference in response between treatment and controls. What next? We have those 10-15 derived measures of specific aspects of the disease. I suggest we go back to them. Since we know the drug "works," we should have no qualms about trying to estimate the degree to which it works, the subsets of patients in whom it appears to work best, the amount of time it takes to work, etc. I suggest we now construct confidence bounds on mean changes in each of these measures, conditional on specific subsets of patients, on intervals of time, and so on.

As an example of how useful such methods can be, Figure 1 displays a graph of confidence bounds on subsets of patients. This is from a Phase II study of a vasodilator in congestive heart failure. As an experiment, we provided the patients with pedometers and had them record the number of miles walked each day. We then converted these daily measurements into linear slopes over the period of the study, so we had changes in miles walked per week. Patients were subdivided by treatment center and by categories based on baseline left ventricular ejection fraction. The figure displays 90% confidence intervals on changes in miles walked per week. It can be seen that the two groups produce different results for patients with very low ejection fractions.
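A minimal sketch of the kind of estimation just described: per-patient slopes converted to miles walked per week, then 90% confidence intervals on the mean change within ejection-fraction subgroups. The column names and the t-based intervals are illustrative assumptions, not the analysis actually used in that study.

```python
# Sketch: convert each patient's daily pedometer readings into a linear slope
# (change in miles walked per week), then form 90% confidence intervals on the
# mean slope within subgroups defined by baseline ejection fraction.
import numpy as np
import pandas as pd
from scipy import stats

def weekly_slope(days: np.ndarray, miles: np.ndarray) -> float:
    """Least-squares slope of miles versus day, rescaled to miles per week."""
    per_day = np.polyfit(days, miles, deg=1)[0]
    return 7.0 * per_day

def subgroup_intervals(slopes: pd.DataFrame, level: float = 0.90) -> pd.DataFrame:
    """slopes: one row per patient with columns 'slope' and 'ef_group'."""
    rows = []
    for group, g in slopes.groupby("ef_group"):
        n = len(g)
        mean = g["slope"].mean()
        se = g["slope"].std(ddof=1) / np.sqrt(n)
        half = stats.t.ppf(0.5 + level / 2.0, df=n - 1) * se   # two-sided t interval
        rows.append({"ef_group": group, "n": n,
                     "lower_90": mean - half, "mean": mean, "upper_90": mean + half})
    return pd.DataFrame(rows)
```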

[Figure 1. 90% confidence intervals on changes in miles walked per week, by treatment group and ejection fraction subgroup (one panel for ejection fraction < 25%). Overall significance test: Z = 3.988.]
