
Viewpoint

Ethics and epistemology of accurate prediction in clinical research

Spencer Phillips Hey

Correspondence to Dr Spencer Phillips Hey, Studies of Translation, Ethics, and Medicine (STREAM) Research Group, Biomedical Ethics Unit, McGill University, Montreal, Quebec, Canada H3A 1X1; [email protected]

Received 8 October 2013; Revised 5 May 2014; Accepted 3 September 2014; Published Online First 23 September 2014

ABSTRACT

All major research ethics policies assert that the ethical review of clinical trial protocols should include a systematic assessment of risks and benefits. But despite this policy, protocols do not typically contain explicit probability statements about the likely risks or benefits involved in the proposed research. In this essay, I articulate a range of ethical and epistemic advantages that explicit forecasting would offer to the health research enterprise. I then consider how some particular confidence levels may come into conflict with the principles of ethical research.

INTRODUCTION


All major research ethics policies assert that the ethical review of clinical trial protocols should include a systematic assessment of risks and benefits. As The Belmont Report makes explicit, this includes a determination of 'whether investigator's estimates of the probability of harm or benefits are reasonable, as judged by known facts or other available studies'.1 But despite this policy, protocols do not typically contain explicit probability statements (i.e., forecasts) about the likely risks or benefits involved in the proposed research. On the contrary, they are often written with unrealistically positive expectations,2 conveying optimism in conflict with the fact that the vast majority of trials are negative.3

This discordance between the optimism of protocols and the realities of clinical research could be explained in at least two ways. It could be that investigators are strategic in how they present the state of accumulating evidence in their protocols: they are well aware that the likelihood of a positive outcome is low, but because of the competition for funding or patient recruitment, they exaggerate estimates of effect size and downplay the risks. Alternatively, it could be that investigators are simply not well-calibrated forecasters of experimental outcomes: either they do not have a good read on 'the known facts or other available studies', or, if they do, they do not know how to convert the relevant knowledge into accurate predictions.

Neither explanation represents a desirable state of affairs. Strategic optimism would conflict with the language of research ethics policies (as is evident from the wording of the Belmont Report quoted above), and it would also take advantage of patient-subjects by leveraging their therapeutic misunderstandings or misestimations to the benefit of investigators.4 As for poor calibration, since explicit forecasting is not enforced, very little is actually known about forecasting ability in the realm of experimental medicine.

There have been studies investigating physicians' forecasting ability for patient mortality,5 as well as the accuracy of emergency room referrals to the intensive care unit,6 7 but to date, there have been no comprehensive studies of clinical investigators' abilities to forecast the outcomes of clinical trials. Given the sheer number of trials (some 15 000 are registered on clinicaltrials.gov and ongoing right now), there would seem to be ample opportunity to assess the ability of investigators to forecast morally relevant likelihoods.

Such likelihoods can be sorted into two broad categories. The first directly concerns risk/benefit outcomes. For example: how likely is it that the trial will terminate due to safety concerns or poor recruitment? How likely is it that the trial will provide an informative posterior interval for the effect estimate? The accuracy of predictions here has direct implications for risk minimisation, scientific knowledge gain and informed consent. The second category speaks to investigator judgments across programmes of research. For example: how likely is it that a new trial with drug x for condition y will have a positive result? The accuracy of predictions here has more indirect implications for risk/benefit outcomes, but it also bears on judgments of clinical equipoise and speaks to the widespread concerns about the low rate of successful translation.3

Across both categories, the base rate frequencies for these events (at least in the published literature) are either known3 8 9 or accessible in principle. These frequencies could provide a benchmark for evaluating investigator risk/benefit judgments. But so long as the forecasts remain implicit or unstated, it is very difficult to know whether investigators are well calibrated to these frequencies.

In what follows, I argue that a policy of explicit forecasting would help to address many of these concerns. In particular, explicit forecasting would offer the following benefits to the research system: (1) distinguishing honest from irresponsible errors in scientific prediction, (2) reducing the number of uninformative trials, (3) rendering risk/benefit judgments more transparent and providing opportunities for feedback and improvement, and (4) disincentivising strategic behaviour in protocol and trial design. Following that, I consider how some particular confidence levels may come into conflict with the principles of ethical research.
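To make the notion of calibration concrete, here is a minimal sketch (in Python, with entirely hypothetical data) of how explicit forecasts could be scored against observed outcomes using the standard Brier score and a coarse calibration table. Nothing in this sketch is prescribed by the policies discussed above; it simply shows the kind of feedback that explicit forecasting would make possible.

```python
# A minimal sketch of scoring explicit trial forecasts against observed
# outcomes. All forecasts and outcomes below are hypothetical.

def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes.
    0 is perfect; a constant forecast of 0.5 always scores 0.25."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def calibration_table(forecasts, outcomes, edges=(0.0, 0.2, 0.4, 0.6, 0.8, 1.01)):
    """Mean forecast versus observed frequency within coarse probability bins."""
    rows = []
    for lo, hi in zip(edges, edges[1:]):
        in_bin = [(f, o) for f, o in zip(forecasts, outcomes) if lo <= f < hi]
        if in_bin:
            mean_f = sum(f for f, _ in in_bin) / len(in_bin)
            obs = sum(o for _, o in in_bin) / len(in_bin)
            rows.append((lo, min(hi, 1.0), len(in_bin), mean_f, obs))
    return rows

# Hypothetical elicited probabilities of a positive trial, and what happened.
forecasts = [0.9, 0.8, 0.85, 0.7, 0.9, 0.75, 0.8, 0.6, 0.9, 0.7]
outcomes  = [1,   0,   0,    0,   1,   0,    0,   0,   0,   1]

print(f"Brier score: {brier_score(forecasts, outcomes):.2f}")  # ~0.43 here
for lo, hi, n, mean_f, obs in calibration_table(forecasts, outcomes):
    print(f"bin [{lo:.1f}, {hi:.1f}): n={n}, mean forecast={mean_f:.2f}, observed={obs:.2f}")
```

In this invented example the mean forecast is roughly 0.8 while only 3 of 10 trials are positive: the signature of optimism bias. A constant forecast of 0.5 would have scored 0.25, better than these confident predictions.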

THE CONSEQUENCES OF FORECASTING IN TRIAL DESIGN

When designing and planning a clinical trial, investigators must make judgments about what population of patients to enrol, how many patients, the likely effect size, the patient recruitment capabilities of their research team, the necessary duration of the study and so on.


Each of these judgments, at least implicitly, reflects a prediction—a forecast of some future set of risk/benefit events. If investigators are poorly calibrated to either the scientific or operational realities in their research domain, then these forecasts will be inaccurate. This can have undesirable consequences both for the patient-subjects enrolled and for the research enterprise as a whole. For instance, inaccurate forecasts of safety-related events, e.g., drug-related toxicities, can lead to serious patient harm and early termination of studies. Inaccurate forecasts of recruitment capabilities can lead to an uninformative study: a trial that enrols and exposes patients to the risks of study participation but, because it did not recruit enough patients, lacks the statistical power to draw a valid inference and produce generalisable knowledge. Much the same is true for an inaccurate estimate of the effect size or effect difference between study interventions. An overestimated effect size will lead to an underestimation of the sample size needed to draw a statistically valid inference.2 An underestimated effect size, conversely, can lead to over-recruitment and, therefore, unnecessary patient-subject exposure to risk. Trials initiated on the basis of inaccurate forecasts are thus more likely either to fail to minimise risk or to fail to deliver on the knowledge-benefit portion of the risk/benefit balance.
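To see how an optimistic effect size propagates into an underpowered design, consider a sketch using the textbook normal-approximation formula for a two-arm comparison of means; the effect sizes are assumed for illustration, and scipy is assumed to be available.

```python
# How an overestimated effect size shrinks the planned sample and guts the
# achieved power. Effect sizes here are assumed for illustration.
from scipy.stats import norm

def n_per_arm(delta, sigma=1.0, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm, two-arm comparison of means."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return 2 * ((z_a + z_b) * sigma / delta) ** 2

def achieved_power(n, delta, sigma=1.0, alpha=0.05):
    """Power actually attained with n per arm if the true effect is delta."""
    z_a = norm.ppf(1 - alpha / 2)
    return norm.cdf(delta / sigma * (n / 2) ** 0.5 - z_a)

optimistic, realistic = 0.5, 0.3              # standardised effect sizes
n_planned = n_per_arm(optimistic)             # ~63 per arm
print(f"Planned n per arm (assuming 0.5): {n_planned:.0f}")
print(f"Power if the true effect is 0.3:  {achieved_power(n_planned, realistic):.2f}")  # ~0.39
print(f"n per arm actually needed at 0.3: {n_per_arm(realistic):.0f}")                  # ~175
```

A trial planned around the optimistic estimate would run at well under half its nominal power, exactly the path to an uninformative result described above.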

Yet it is critically important to distinguish here between a negative study and an uninformative study. A negative study definitively rules out one of the pre-experimental possibilities, and in so doing, it still produces knowledge value. In contrast, an uninformative study rules out nothing—showing neither (1) that the experimental intervention is worse or no better than the control nor (2) that it is better or possibly equivalent. Figure 1 provides a useful illustration of this difference. The intervals A through F each represent a possible posterior effect estimate, with the range of clinical equivalence specified by δI–δS. Most of these intervals are more or less desirable depending on the study design. For example, in a non-inferiority study, the intervals C, D, E and F are all consistent with a positive outcome (although C is obviously the least positive of these), whereas, in a superiority study, only E and F are consistent with a positive outcome. But regardless of the design, interval A represents an uninformative outcome. Because it overlaps all three regions of the figure—new treatment superior, equivalent and old treatment superior—it does not rule any of them out.i Therefore, the patient-subjects enrolled in the study have undergone the risks and burdens of study participation and yet the trial has failed to adequately address its primary study question.
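The logic of Figure 1 can be made mechanical. The sketch below classifies a reported interval against assumed equivalence bounds of −0.1 and 0.1; the interval endpoints are invented and are not read off the figure itself.

```python
# Mechanising the logic of Figure 1: classify an effect-estimate interval
# against the range of clinical equivalence [delta_i, delta_s]. The bounds
# and interval endpoints are invented for illustration.

def classify(lower, upper, delta_i=-0.1, delta_s=0.1):
    """Which of the three regions does the interval rule in or out?"""
    if lower >= delta_s:
        return "new treatment superior"
    if upper <= delta_i:
        return "old treatment superior"
    if delta_i <= lower and upper <= delta_s:
        return "treatments equivalent"
    if lower < delta_i and upper > delta_s:
        return "uninformative: overlaps all three regions (like interval A)"
    return "partially informative: rules out at least one region"

examples = {"wide": (-0.25, 0.30), "narrow, central": (-0.05, 0.08),
            "narrow, high": (0.15, 0.40)}
for name, (lo, hi) in examples.items():
    print(f"({lo:+.2f}, {hi:+.2f}) [{name}]: {classify(lo, hi)}")
```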

VIRTUES OF EXPLICIT FORECASTING IN TRIAL PROTOCOLS

Explicit forecasting of outcomes could mitigate the likelihood of these undesirable trial outcomes—i.e., uninformative results or unnecessary patient risks—in a number of ways.

First, explicit forecasts would help to distinguish honest from irresponsible errors of prediction. Inaccurate investigator forecasting due to ignorance or negligence is ethically problematic. Inaccurate forecasting due to irreducible uncertainty, however, is still consistent with research ethics and scientific principles. The philosophical difficulty here lies in articulating the criteria for what counts as an adequate and reasonable review of the total evidence. Statistical meta-analysis provides one answer to this problem by pooling the results of many studies to arrive at an aggregated estimate of effect size. Similarly, systematic literature reviews can provide a thorough overview of the total evidence in a particular research area. However, both of these approaches suffer from the so-called 'file drawer problem': subjective selection or biased availability of the evidence included in the analysis can arbitrarily tip the conclusion one way or the other. Nevertheless, many clinical trials are justifiably well designed and well situated within the context of the surrounding evidence, even when the expert community is still divided over the relevance or importance of particular studies. There are even some new approaches to representing and evaluating the accumulating state of evidence in a research programme that can be used to elucidate differences in reasoning across the expert community more precisely and transparently.10 Such approaches provide at least a first step towards grounding judgments of culpability for failed trials, because they could be used to show that a more accurate summary of the total evidence was available and yet ignored. Conversely, they could also protect investigators from charges of error and culpability, since investigators could use such tools to demonstrate that their forecasts were consistent with a reasonable interpretation of the total evidence.

Second, explicit forecasting in protocols could help to reduce the number of uninformative results. There is already a small body of literature on expert elicitation in Bayesian trials. For such trials, a sample of clinicians and investigators are queried for an estimated probability distribution for the expected effect size.11 These predictions, along with a set of judgments about the range of clinical equivalence, are then used to construct a prior distribution reflecting 'subjective clinical opinion'. This prior distribution, in turn, guides the prospective decision about the necessary sample size to ensure that the posterior distribution, for all or most data sets, will fall into an informative range.12 13 If investigators routinely offered these kinds of probability distributions in their protocols, reviewers could then compare them with their own sense of what is a reasonable prediction, or with the so-called 'fixed' distributions representing an 'enthusiastic' prior (centred on the upper bound of clinical equivalence) or a 'skeptical' prior (centred on the lower bound of clinical equivalence).
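As a rough illustration of how elicited and fixed priors might drive the sample size decision, the sketch below uses a conjugate normal model and a simple informativeness criterion (a posterior 95% interval narrower than the range of clinical equivalence). This is a deliberate simplification of the predictive approach cited above, and every numerical value is an assumption.

```python
# A rough sketch of prior elicitation driving a sample size decision, using
# a conjugate normal model. This simplifies the predictive approach in the
# cited literature; every numerical value below is an assumption.
from scipy.stats import norm

delta_i, delta_s = -0.1, 0.1                       # range of clinical equivalence
priors = {
    "elicited":     norm(loc=0.15, scale=0.20),    # pooled clinical opinion
    "enthusiastic": norm(loc=delta_s, scale=0.15), # centred on upper bound
    "skeptical":    norm(loc=delta_i, scale=0.15), # centred on lower bound
}

def posterior_sd(prior_sd, sigma, n):
    """SD of the normal posterior mean after n outcomes with known SD sigma."""
    return (1 / prior_sd**2 + n / sigma**2) ** -0.5

# Find n so the posterior 95% interval is narrower than the equivalence
# range, i.e. the final estimate cannot straddle all three regions.
sigma, halfwidth = 1.0, (delta_s - delta_i) / 2
for name, prior in priors.items():
    n = 1
    while 1.96 * posterior_sd(prior.std(), sigma, n) > halfwidth:
        n += 1
    print(f"{name:>12} prior: n = {n} for an informative posterior")
```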

Figure 1 Posterior interval ranges.

i These kinds of underpowered studies can, of course, be pooled for a meta-analysis, and in this way, they do still make a contribution to the state of accumulating evidence.


As a complement to the virtue above, by making these distributions explicit and transparent, a reviewer at least has an opportunity to scrutinise the scientific honesty of these predictions. Furthermore, the range of predicted distributions bears a direct relationship to the principle of clinical equipoise, which demands that there exist a state of genuine disagreement or uncertainty in the expert medical community as to which treatment arm in a study is preferable. Insofar as the elicited probability distributions reflect community disagreement or uncertainty about the preferable treatment, clinical equipoise is arguably satisfied.

Third, explicit forecasting would facilitate more transparent risk/benefit judgments. Improving the quality of risk/benefit judgments is a long-standing problem in research ethics.14 Despite the fact that numerous evaluative frameworks have been proposed,15 ethical review boards continue to have difficulty with risk/benefit evaluation.16 This can be explained in part by the fact that the relevant likelihoods for harm and benefit remain implicit in protocols, forcing reviewers to interpret statements in natural language according to their intuitions or subjective experience. Explicit forecasting would take a step away from these merely intuitive judgments and towards a more systematic and transparent procedure. Feedback is also known to be a critical component of improving forecasting behaviour.17 So long as forecasts are implicit, there are precious few opportunities for investigators or protocol reviewers to get direct feedback about the quality of their predictions for risk and benefit. Explicit forecasting would thus also provide a greater opportunity for all the stakeholders involved in the design and review of trial protocols to think more critically about, and improve, their estimates of likely risk and benefit.

Finally, an increase in the transparency of risk/benefit judgment would disincentivise strategic behaviour in protocol and trial design. Were a policy of explicit forecasting enforced, optimistic investigators who consistently demonstrated poor calibration would be more likely to be marginalised, as funders could more wisely allocate their resources towards the more realistic investigators who demonstrate a better understanding of the evidence. This reallocation in accord with scientific realities could, in turn, improve the efficiency of the research system as a whole.

ETHICAL CONFIDENCE LEVELS

Supposing that these likelihoods were made explicit in trial protocols, how confident should investigators be that their study will be positive? This shifts the discussion from the virtues of predicting risk/benefit likelihoods to the second category of likelihoods—those concerning positive versus negative outcomes.

In much of the methodological literature, perfect concordance between the predicted and the observed outcome is held up as an ideal of clinical research. For example, it is argued that animal models should predict the results in phase I trials,18 and the results of phase II trials ought to predict the results of a subsequent phase III.19 Indeed, there is something intuitively appealing about a programme of clinical trials with uniformly predictable and positive results. No studies would need to be repeated and patient-subjects could be assured of therapeutic benefit. Obviously, this is not the reality of clinical research. As already noted, most trials are negative (or uninformative). But less obvious is the fact that perfect concordance and highly predictable results should not even be the ideal.20 If a programme of clinical trials were predictably positive with high confidence, this would actually make those trials unethical. The whole point of clinical trials is to dispel uncertainty about an intervention's effectiveness.

If there is no legitimate uncertainty, then there is no need to conduct trials. This leads to a somewhat counterintuitive conclusion: the principles of research ethics entail that some portion of trials must be negative.20 Indeed, if too many studies are positive, this raises questions about the validity of the research questions. Whereas if too few studies are positive, this represents a systematic underestimate of risk for patient-subjects, calling into question the evidence base underlying the study question. Some optimal balance must therefore be struck, whereby there are enough positive trials to furnish the healthcare system with the treatments it needs and enough negative trials to drive further research and theoretical development.21

This suggests that an investigator's prestudy confidence should never be near a 100% probability of a positive result. Such a forecast would undermine the rationale for conducting the experiment in the first place. At least for late phase, confirmatory trials, this much would seem to follow from the principle of clinical equipoise and its stipulation that investigators must 'recognize that their less favored treatment is preferred by colleagues whom they consider to be responsible and competent'.22 An investigator who believes with near certainty that one treatment arm of the study is better than another (a) cannot coherently claim that their dissenting colleagues are 'responsible and competent', in which case there is no clinical equipoise, (b) is not taking account of the total body of evidence, which only supports a more modest prediction of success, and is therefore running an inappropriately designed study, or (c) lacks some capacity for good judgment. None of these options is acceptable.

However, the relationship between clinical equipoise and an acceptable likelihood of a positive outcome is further complicated by differing interpretations of equipoise. As Alex London has put it, there is still controversy in the literature about 'whose uncertainty' and 'which equipoise' are ethically relevant.23 Must the state of uncertainty exist in the subjective beliefs of investigators and clinicians? Or is it instead an objective property of the state of accumulating evidence? If the principle of clinical equipoise refers to the subjective beliefs of medical experts, then the argument I offered above is sufficient as it stands. If an investigator believes with near 100% confidence that the trial will be positive, then this violates equipoise. Yet, if equipoise refers to the state of evidence, then it seems less problematic for an investigator to have high confidence in a positive result. Insofar as their risk/benefit predictions are well calibrated to the state of evidence, why should it be unethical to predict a high likelihood of a positive result?

There are at least two reasons why such high confidence is still ethically problematic. The first follows from the idea that an ethical trial demands an honest null hypothesis.22 An investigator predicting a near 100% chance of a positive outcome—even if they are well calibrated to the risk/benefit realities—clearly violates this principle. Similar to the argument above, if the investigator's confidence is justified by the evidence, and the outcome is near certain, then why is the trial being conducted?
And if the high confidence is not justified by the evidence, then this calls into question either the study design (which may lack an honest null hypothesis) or the integrity of the investigator. The second (related) reason is that a trial whose outcome is almost certain is less informative. In the strictest sense, a maximally informative trial is conducted when the pretrial odds of success are approximately 50%.21 The further one strays from 50%, the less informative is the result. Although such suboptimal information gain might not seem particularly problematic on a protocol-by-protocol basis, when looking across populations of trials and considering the efficiency of the research system as a whole, it becomes a serious problem.
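This claim about 50% can be given a precise, if idealised, reading: the expected information carried by a binary trial outcome is the Shannon entropy of the forecast probability, which peaks at even odds. A minimal illustration:

```python
# The expected information carried by a binary trial outcome, measured as
# Shannon entropy in bits: maximal at even odds, vanishing as the result
# becomes a foregone conclusion.
from math import log2

def entropy_bits(p):
    """Entropy of a Bernoulli(p) outcome; zero when p is 0 or 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

for p in (0.5, 0.7, 0.9, 0.99):
    print(f"P(positive) = {p:.2f} -> {entropy_bits(p):.2f} bits")
# 0.50 -> 1.00, 0.70 -> 0.88, 0.90 -> 0.47, 0.99 -> 0.08
```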


Every study carries an opportunity cost. Every suboptimally informative trial that is conducted represents another—perhaps more informative—trial that was not carried out. In light of the widespread concerns about the poor rates of successful translation,3 this informational inefficiency represents a serious cost. These reasons are sufficient to show that even if we interpret equipoise as referring to the state of evidence, near 100% confidence in a positive outcome is still ethically problematic.

But let us turn now to consider the opposite end of the testing trajectory from confirmatory trials: what are acceptable confidence levels for first-in-human and early phase trials? These trials are conducted under conditions of the greatest uncertainty. Particularly in realms where animal models are notoriously unreliable, the success rate for first-in-human and early phase trials can be exceedingly low. For example, in many neurological realms, somewhere on the order of 1 of every 10 new drugs tested will demonstrate positive efficacy.24 Thus, a well-calibrated investigator for a first-in-human trial of a new neurology drug should not have confidence much greater than 10%.

Such low confidence in the likelihood of a positive outcome may at first seem to conflict with the principle of risk minimisation. How can it be ethical to enrol patients in a study when investigators are almost certain that the trial will be negative and those patients will receive no benefit? Indeed, the ethics of early phase trials is a complex issue, given that there are irreducible risks and little or no guarantee of patient benefit.25 While it is not plausible to extend clinical equipoise to such studies, we might nevertheless articulate an equipoise-like principle that can guide the ethical review of early phase protocols. For an early phase study with a low likelihood (10%–20%) of a positive outcome to be ethically acceptable, we could assert that the following conditions should hold:
1. There must exist irreducible uncertainty due to the unreliability of animal models, poorly understood disease pathophysiology, or both.
2. The study must be designed such that, if it is successfully executed, it will produce an informative result.
This principle would clarify the epistemic conditions under which a low likelihood of success is consistent with risk minimisation (condition 1), as well as connect ethical judgment to the necessity of accurate forecasting (condition 2). Recall that accurate and explicit predictions are a critical component of protecting against an uninformative trial. Insofar as an early phase trial protocol is able to demonstrate that the available evidence does not warrant high confidence, but that the trial has nevertheless been designed to take the unreliability of the evidence into account, it can still be ethical and scientifically valid.
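One way to picture such a well-calibrated early phase forecast is as the historical base rate updated by study-specific evidence via Bayes' rule. In the sketch below, the likelihood ratio attached to strong preclinical support is an assumed value chosen purely for illustration.

```python
# Anchoring an early phase forecast to the historical base rate and updating
# on study-specific evidence with Bayes' rule. The likelihood ratio for
# 'strong preclinical support' is an assumed, illustrative value.

def update(prior_p, likelihood_ratio):
    """Posterior probability from a prior probability and a likelihood ratio."""
    prior_odds = prior_p / (1 - prior_p)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

base_rate = 0.10              # roughly 1 in 10 succeed in this realm
lr_strong_preclinical = 2.0   # assumed: such evidence is twice as likely
                              # given an eventual success
print(f"Calibrated forecast: {update(base_rate, lr_strong_preclinical):.2f}")  # ~0.18
```

Even favourable preclinical evidence only modestly raises a forecast anchored to the base rate, which is why confidence far above 10%–20% in these realms should invite scrutiny.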

CONCLUSION

There is already some evidence to suggest that the effect differences predicted in the design of trials are often overestimates.2 Moreover, even a superficial search of trial registries reveals that a large number of studies are terminated early for operational reasons, e.g., low recruitment. This evidence should lend further credence to my worries about the forecasting abilities of investigators and reviewers. Unfortunately, so long as investigator forecasts remain implicit, these issues are difficult if not impossible to resolve. Until investigators state their predictions explicitly, we simply will not know how they are thinking about many of the parameters in their protocols. This makes the work of ethical reviewers, who are charged with judging risk/benefit balance in a systematic and transparent way, that much harder.

But to close on a positive note, I have argued that the science of prediction and forecasting offers a fruitful way forward in addressing a range of concerns about risk, trial design and inefficiency across the research enterprise. By making the predictions of risk, benefit and trial success explicit in protocols, researchers and research stakeholders will have a more powerful set of tools at their disposal to analyse and improve their judgements.

Acknowledgements I would like to thank Jonathan Kimmelman, Julie Walsh, and an anonymous referee at this journal for their thoughtful comments on earlier drafts of this manuscript.

Competing interests None.

Funding This work was supported by an ethics grant from the Canadian Institutes of Health Research (EOG 201303).

Provenance and peer review Not commissioned; externally peer reviewed.

REFERENCES
1 Belmont Report. The Belmont report: ethical principles and guidelines for the protection of human subjects of research, 1979. http://www.hhs.gov/ohrp/humansubjects/guidance/belmont.html (accessed 15 Jan 2013).
2 Djulbegovic B, Kumar A, Magazin A, et al. Optimism bias leads to inconclusive results: an empirical study. J Clin Epidemiol 2011;64:583–93.
3 Hay M, Thomas DW, Craighead JL, et al. Clinical development success rates for investigational drugs. Nat Biotechnol 2014;32(1):40–51.
4 Kimmelman J. The therapeutic misconception at 25: treatment, research, and confusion. Hastings Cent Rep 2007;37(6):36–42.
5 MacKillop WJ, Quirt CF. Measuring the accuracy of prognostic judgments in oncology. J Clin Epidemiol 1997;50:21–9.
6 Green L, Mehr D. What alters physicians' decisions to admit to the coronary care unit? J Fam Pract 1997;45:219–26.
7 Gigerenzer G, Gaissmaier W. Heuristic decision making. Annu Rev Psychol 2011;62:451–82.
8 Arrowsmith J. Phase II failures: 2008–2010. Nat Rev Drug Discov 2011;10(5):1.
9 Arrowsmith J. Phase III and submission failures: 2007–2010. Nat Rev Drug Discov 2011;10(2):1.
10 Hey SP, Heilig CM, Weijer C. Accumulating Evidence and Research Organization (AERO) model: a new tool for representing, analyzing, and planning a translational research program. Trials 2013;14:159.
11 Chaloner K, Rhame FS. Quantifying and documenting prior beliefs in clinical trials. Stat Med 2001;20(4):581–600.
12 Spiegelhalter DJ, Freedman LS. A predictive approach to selecting the size of a clinical trial, based on subjective clinical opinion. Stat Med 1986;5(1):1–13.
13 Joseph L. Bayesian and mixed Bayesian/likelihood criteria for sample size determination. Stat Med 1997;16:769–81.
14 Meslin EM. Protecting human subjects from harm through improved risk judgments. IRB 1990;12(1):7–10.
15 Guo JJ, Pandey S, Doyle J, et al. A review of quantitative risk–benefit methodologies for assessing drug safety and efficacy—report of the ISPOR risk–benefit management working group. Value Health 2010;13(5):657–66.
16 van Luijn HEM, Aaronson NK, Keus RB, et al. The evaluation of the risks and benefits of phase II cancer clinical trials by institutional review board (IRB) members: a case study. J Med Ethics 2006;32:170–6.
17 Stone ER, Opel RB. Training to improve calibration and discrimination: the effects of performance and environmental feedback. Organ Behav Hum Decis Process 2000;83(2):282–309.
18 van der Worp HB, Howells DW, Sena ES, et al. Can animal models of disease reliably inform human studies? PLoS Med 2010;7(3):e1000245.
19 Seymour L, Ivy SP, Sargent D, et al. The design of phase II clinical trials testing cancer therapeutics: consensus recommendations from the clinical trial design task force of the national cancer institute investigational drug steering committee. Clin Cancer Res 2010;16:1764–9.
20 Hey SP, Kimmelman J. Ethics, error, and initial trials of efficacy. Sci Transl Med 2013;5:184fs16.
21 Djulbegovic B, Kumar A, Glasziou P, et al. Trial unpredictability yields predictable therapy gains. Nature 2013;500(7463):395–6.
22 Freedman B. Equipoise and the ethics of clinical research. N Engl J Med 1987;317(3):141–5.
23 London A. Clinical equipoise: fundamental requirement or fundamental error? In: Steinbock B, ed. The Oxford handbook of bioethics. Oxford University Press, 2007:571–96.
24 Schoenfeld DA, Cudkowicz M. Design of phase II ALS clinical trials. Amyotroph Lateral Scler 2008;9:16–23.
25 Kimmelman J. A theoretical framework for early human studies: uncertainty, intervention ensembles, and boundaries. Trials 2012;13:173.
