CLINICAL TRIALS

QUESTIONS AND PANEL DISCUSSION

Clinical Trials 2013; 10: 680–689

University of Pennsylvania 5th annual conference on statistical issues in clinical trials: Emerging statistical issues in biomarker validation (Morning Session)

David DeMets, Janet Wittes and Jay Siegel

Questions for Dan Sargent

William Mietlowski: How would you handle a phase II strategy where you might be interested in the activity of a treatment broadly, but expect a treatment effect only in a subgroup of patients?

Dan Sargent: I would recommend powering the trial based on just the treatment effect in the subgroup, but then enrolling patients regardless of biomarker status and, in secondary analyses, looking for efficacy more broadly. The optimal strategy depends on the prevalence of the marker. For a low prevalence marker (5% or 10%), the aforementioned is probably not a good strategy (in that case an enrichment design is likely the only option), but for a moderate prevalence marker (40% or 50%), I think such a strategy makes sense. This strategy allows a sufficiently strong signal of no activity from the biomarker-negative patients that the phase III strategy is clear, or conversely, it can demonstrate that the biomarker does not matter and you should do an unselected phase III trial.

Eric Rubin: One of the designs mentioned that I think is interesting is the Freidlin–Simon design [1], where you do a big study and you separate subjects into two groups, a test group and a confirmation group. Can you comment on your thoughts on that design?

Dan Sargent: These are designs that are intended to allow both an exploration of a biomarker and then a validation of that biomarker within the same trial. From a statistical perspective, these designs preserve the appropriate properties of type I and II error. In reality, however, I think many of us still have reservations about 'doing everything' within the context of one trial, because there are so many idiosyncrasies of any one particular trial. What is the specific patient population? What is the specific way that endpoints were defined? I don't think a biomarker identified through a single trial using such an approach has the same external validity as one identified from a prespecified enrichment or unselected design with a prespecified marker. The designs also carry some penalty in terms of sample size due to the need to preserve alpha in order to perform the cross-validation. The penalties, depending on how you split your alpha, can be small or large.
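To make the prevalence trade-off concrete, here is a minimal back-of-the-envelope sketch (not part of the discussion) using the standard normal-approximation sample-size formula for a two-arm comparison. The standardized effect size, alpha, power, and prevalence values are illustrative assumptions, as is the final loop showing how tightening alpha, in the spirit of an adaptive signature design that splits alpha between a discovery and a confirmation stage, inflates the per-arm sample size.

```python
# Illustrative only: how marker prevalence drives total accrual when a trial is
# powered on the marker-positive subgroup but enrolls all comers, and how
# splitting alpha (as in an adaptive signature design) inflates sample size.
from scipy.stats import norm

def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-arm comparison of a standardized effect
    (normal approximation, two-sided alpha)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

def total_accrual_all_comers(effect_in_positives, prevalence, alpha=0.05, power=0.80):
    """Total enrollment needed so that the marker-positive subgroup alone is powered."""
    positives_needed = 2 * n_per_arm(effect_in_positives, alpha, power)  # both arms
    return positives_needed / prevalence

for prev in (0.05, 0.10, 0.40, 0.50):
    total = total_accrual_all_comers(effect_in_positives=0.5, prevalence=prev)
    print(f"marker prevalence {prev:4.0%}: ~{total:5.0f} patients enrolled overall")

# Alpha-splitting penalty: sample size scales with (z_alpha + z_beta)^2.
for a in (0.05, 0.04, 0.01):
    print(f"two-sided alpha {a}: ~{n_per_arm(0.5, alpha=a):.0f} patients per arm")
```

Under these assumptions, an all-comers trial powered only on the marker-positive subgroup needs roughly 2,500 patients at 5% prevalence but only a few hundred at 40%-50% prevalence, which is the arithmetic behind reserving that strategy for moderate-prevalence markers.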

Questions for Lisa McShane

William Mietlowski: In studies involving paired biopsies, there was a marked genetic disparity between the primary tumor and the metastases. What kind of impact does that have on the development of targeted therapies in cancer? Do you need to do more studies in patients with early disease? Do those findings apply to patients with metastatic disease?

Lisa McShane: That's a very good question. It raises a lot of issues. First, our strategy in oncology, at least, has often been that we start with advanced stage patients and we try to find drugs that work there in our phase II trials. Then we take those drugs into the adjuvant setting if they look promising. If it turns out that the biology is really different (and I think in many cases it could be), this may not be a good strategy. In fact, that relates to one of the current questions about which patients benefit from anti-human epidermal growth factor receptor-2 (anti-HER-2) therapy such as Herceptin® (trastuzumab). Dan Sargent had mentioned how the trastuzumab trials used an enrichment design. The initial trials were run in metastatic disease. People thought that the biomarker was the right one and that it was really needed to determine who should get trastuzumab. One of the theories is that the biology may be different in the adjuvant setting. In the adjuvant B-31
and N9831 trials, it looks like those patients who got on the trial by mistake, who were actually HER-2 negative when retested in a central lab, may have benefited from trastuzumab. Possibly, it is a totally different biology at work, although those cases were not random negative cases, so it is hard to know what biases might be at work. Those cases had been found to be positive in a local laboratory in order to get on the trial. Another issue is that once you start talking about getting metastatic tumor samples, particularly depending on the disease site, it is hard to get specimens. In colon cancer, for example, you might need to obtain a biopsy of tumor that metastasized to the liver. This is not without risk and discomfort to the patient. It also has bearing on the endpoints you look at. It could be that different biomarkers predict for short-term endpoints compared to long-term endpoints, or that markers found in early disease are not the same ones that are found in the tumor once it metastasizes.

Questions for Gene Pennello

Susan Ellenberg: Gene, from what you've said, it sounds like all the bad examples that Lisa McShane described were things that you would never have approved had they come to you, and I just wanted to see whether that's right. Can people use these assays without getting them approved by the Food and Drug Administration (FDA)?

Gene Pennello: I suspect they would not be approved. I would like to point out that biomarker assay test kits can and have been approved on the basis of enrichment trials, in which only a subgroup of subjects are enrolled into the trial based on having a test result that putatively predicts treatment response. Unfortunately, with these trials, you can't evaluate whether the treatment effect is the same or smaller for those not selected for the trial, that is, those with a disqualifying test result. However, that doesn't necessarily mean that we are not going to approve the test kit. In fact, we have approved several now in this way. However, the labeling has to be crafted in a way that honestly reflects what was studied. Regarding laboratory-developed tests (LDTs), to date, we have been exercising regulatory discretion as to whether or not we will review some classes of these tests. The number of LDTs is very large; we don't have the resources to review them all. However, if you are going to sell your test to different labs (i.e., engage in interstate commerce), then they do have to come into FDA for review.

Songbai Wang: Gene, I believe you mentioned independent validation versus random split validation earlier in your talk and that you prefer
independent validation. How do you define random split validation and how does it differ from independent validation? Gene Pennello: I think random split validation is less robust than independent validation. When you divide the data set randomly into training and test sets, they are expected to have the same characteristics. Thus, the evaluation is less robust than getting an entirely new data set with possibly different characteristics. It is also difficult to prove to regulatory authorities (e.g., FDA) that data leakage did not occur between the training and test sets. You might have to provide a lot of documentation, including an audit trail. I think the whole enterprise could be difficult. It may be easier to lock down your test and get a new set of samples. I think that independent validation is the most robust way to evaluate the assay because that’s how the assay will be used in practice: it is going to be applied to new samples. Eric Rubin: Just a question about markers that are continuous. You talked about cut points and the rigor of the FDA in evaluating a test at the cut point selected. I think cut-point selection often ends up being somewhat empiric, with HER-2 as an example. Gene Pennello: I believe it is true that for some tests, like assays for troponin (used as an aid in the diagnosis of myocardial infarction), there can be a lot of difficulty in setting a cut point that is appropriate for all laboratories. The cut point that tends to be used is the 99th percentile (of measured troponin concentration) in samples from a healthy reference population. Unfortunately, the 99th percentile can vary depending on the population that is studied. (Additionally, troponin tests are not standardized in terms of the epitopes they target.) Most in vitro diagnostic tests that I have seen have predefined cut points. That’s not to say we should not be moving toward thinking about evaluating biomarkers more globally across the entire range of measurement. I think you would get a better overall picture of their performance. For evaluation, we have sometimes used receiver-operating characteristic (ROC) plots, which plot sensitivity versus 1 – specificity at all cut points.
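Gene Pennello's closing points, evaluating a continuous marker across its whole range with an ROC curve and fixing a reference-limit cut point such as the troponin 99th percentile, can be illustrated with a small simulation. This is a sketch on invented data; the distributions, sample sizes, and degree of discrimination are arbitrary assumptions chosen only to show the mechanics.

```python
# Illustrative only: an ROC curve (sensitivity vs 1 - specificity at all cut points)
# and a reference-limit cut point (99th percentile of a healthy population) for a
# simulated continuous biomarker. All numbers are invented for the sketch.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y = np.concatenate([np.zeros(500), np.ones(200)])       # 0 = healthy, 1 = diseased
x = np.concatenate([rng.normal(0.0, 1.0, 500),          # healthy reference population
                    rng.normal(1.2, 1.0, 200)])         # diseased population

fpr, tpr, thresholds = roc_curve(y, x)                  # performance at every cut point
print("AUC:", round(roc_auc_score(y, x), 3))

cut_99 = np.percentile(x[y == 0], 99)                   # 99th percentile of healthy values
sensitivity = np.mean(x[y == 1] > cut_99)
specificity = np.mean(x[y == 0] <= cut_99)
print(f"99th-percentile cut point {cut_99:.2f}: "
      f"sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
```

The ROC curve summarizes sensitivity and specificity at every possible cut point, whereas a reference-limit rule fixes one cut point from the healthy population alone, so its sensitivity depends entirely on how the diseased distribution happens to fall relative to that limit.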

Panel discussion I

David DeMets: Thank you. It is my first visit to this conference, although it is the fifth one. I think Jonas is still sore that I didn't come last year when I knew something about the topic, but here I am on a topic that I know very little about, which is typical, but it has never stopped me from talking. I want to just make a few comments. First, I had the privilege of
being part of the Institute of Medicine (IOM) report that Lisa McShane and others have alluded to [2]. Although I have never analyzed a set of genomic data in my life, it was an interesting experience, which gave me at least some insight into what some of the challenges are. I think that this IOM report does provide a road map from the discovery phase, all the way through the clinical application. When I was asked by a reporter, what's new, well, there is nothing really new. It's a lot of old ideas, but they are all in one place. You just have to start at the beginning and read your way through, and you will get some pretty good advice. Included in the discovery phase there are a lot of things, as Lisa McShane said, that are not done correctly, such as locking down the data, locking down the algorithm, and doing the validation correctly. Analytic validation is something that is given less attention in this genomics world. I used to work in the clinical chemistry world and there analytic validation, of course, is a big deal. Clinical utility is another area that is often not done well either.

As I looked at this, as a newcomer and outsider, it struck me that what we really are observing is a culture clash. The basic scientists are not used to this kind of statistical rigor. They are used to working in the lab and tinkering around and making discoveries, which always amazes me, but nevertheless, they are not used to tying things down and going through this careful process. On the other end of the spectrum, the clinical scientists have probably very little understanding of the magic that takes place in this genomic world and so have to accept some things at face value, whether they have been properly tested or not. I think about a couple other themes – I think that much of this activity, at least to date, has been done by academics who have very little understanding of the FDA and its regulations in this area. It is a mystery to them. They don't even know what they don't know, and they certainly don't know when they need to go talk to the FDA about getting advice and when an Investigational Device Exemption (IDE) is needed and when it is not. I think the opportunity to go consult the FDA, to the extent their staffing will allow it, is a tremendous opportunity that genomic researchers should take advantage of.

Another factor, which I found prominent in the Duke saga, was the issue of conflict of interest. It seems to pervade that story. There is a lot of pressure in academia to commercialize one's discoveries. At least at our place and for many others, the equivalent of the patent office is always looking for things they can patent and can commercialize, but it puts an incredible pressure on a number of players. I sit on our campus Conflict of Interest Committee, and we have had several cases where the inventor or the discoverer also wants to be the investigator and also has an investment opportunity. So they have it all
locked up. They are the inventor, investigator, and investor. That produces, obviously, conflicts. These types of conflicts were present in the Duke story and certainly contributed to the problems that occurred. Institutions have conflicts. There is a lot of prestige in these discoveries, and in some cases, the institutions themselves might be invested financially in the discoveries. For this particular meeting, I want to say that the biostatisticians have a terrific opportunity, but also an incredible responsibility. They need to weigh in early and often, as I say in the biostatistics department at Wisconsin, but they have to also be assured that they have the independence to say what is needed to be said. We were talking earlier in a previous presentation about how somebody doesn’t like the answers we give, but – tough. We are not supposed to be biased by the fact that our salary is being supported by the grant or, in some cases, that we’ve even been offered an investment opportunity in the spin-off of the company. The biostatistician has a fantastic opportunity and needs to be involved, but also with responsibility of giving advice, without the influence or bias of the salary support or financial investments. One final comment regards authorship. In reviewing the Duke saga and other cases in the IOM report, we clearly are now in an era where very few of us understand everything that is in the paper. I think that’s probably been true for a while, but it is really true now. Most of us who put our name on a paper may understand a piece of it, probably contributed to that piece, but we don’t have any idea about details for the rest of the experiment. It is very difficult to take authorship responsibility for the entire manuscript. But one thing that really was striking in the Potti–Nevins case [2], there were about 160 authors involved in 40 papers. At least half of those papers have been retracted or partially retracted, maybe more than that, but what struck me is that not one author, at least according to the testimony we were able to hear, not one author raised a question to anybody, to Duke or even to the publishers, until much later in the game when it became clear there were serious problems with these genomic predictors. I would say the authors clearly didn’t take their responsibility as seriously as they might have, even if they didn’t understand everything in these papers. When I was first asked to participate in this IOM committee, I talked to Lisa McShane. I said, ‘I don’t know beans about how to analyze genomic data’. She said, ‘You don’t need to, just follow the process’. So, to me, the genomic predictor was a black box. It was the same to me as if it was cholesterol or blood pressure or anything else. The validation process, I think, was pretty straight forward and clear, it is just that the black box is a little bit more complicated. One other issue, which I think
confused some of us on the IOM committee for a while, is the definition of risk assessment, prognostic algorithms, and predictive algorithms. What does prognostic mean? What's predictive? This is a source of confusion that I do not think helps anybody and certainly didn't help me. I know that a definition can be put on a slide, but as soon as you take the slide down, I forget the exact difference. This adds to the confusion of a very confusing area.

Janet Wittes: Congratulations to Drs Sargent, McShane, Pennello, and Cai. They did a great job introducing the broad topic. Their themes were very similar to each other, even though the four talks were really quite different. In some sense, I agree with Dave DeMets that we have been using biomarkers forever. It is a little bit like Molière's Bourgeois Gentilhomme (ca. 1670) who was so surprised and delighted to learn that he had been speaking prose all his life. Physicians have always used biomarkers routinely in treatment. We've used them in designing trials, especially in defining entry criteria. What is really new now is that we have the biological tools to develop targeted molecules, and so what we were all doing in our primitive ways before has suddenly become a huge industry. Lisa McShane's description of some of the really bad things that happen was very sobering to me, and I suspect to the rest of the audience. I was really interested to hear the discussion, by all four of you, of replication, and especially of independent replication because many people talk about being able to replicate internally, as if it were sufficient. But what we all recognize at some level, and what became crystal clear in today's talks, is that internal replication (i.e., cross-validation) validates the results of the study at hand, but it doesn't validate the answer to the entire question being asked. I want to follow up on some comments Dave DeMets made about our role as statisticians. At the end of her talk, Tianxi Cai spoke of the desire of statisticians to perform cross-validation in a study, but the non-statistician analysts resist doing it. I find it very hard when I am dealing with people who know how to use software to analyze data in a way that produces answers really quickly with tiny little p-values, and I say, 'Well, that's nice, but you can't infer that ... The comparison isn't valid ... There are no controls'. And they say, 'Yes, there are two groups so we have controls', but in many situations, the so-called controls are not really controls. What cross-validation tends to do, as these speakers have shown so eloquently today, is damp down p-values with lots of 0s in front of the 1 and those unbelievably high hazard ratios. Suddenly, what was really exciting is either not there at all or reduced to a small, not very convincing, little effect. As Bernie
Fisher, MD (a founder and long-time chair of the National Surgical Adjuvant Breast and Bowel Project (NSABP)), once said, in a different place in Pennsylvania, in a different context, ‘Statisticians are the terrorists of clinical trials’. Such an analogy is no longer politically correct, but if I may take the liberty to echo him, we are now the terrorists of biomarkers, because what we will be doing, if we perform our job correctly, is preventing people from publishing papers that they want to publish, preventing journals from getting great, exciting articles, preventing reporters from reporting these brand new findings, and then later when other results come out, saying, ‘Oh, look, it is not real’, or ‘You can’t trust the scientists anyhow’. Careful validation deflates much excitement, but being skeptical and careful is what our job is and it is not easy. But just think, if we do our job carefully, what will be exciting is that maybe the truth will be out. Jay Siegel: Picking up on Janet Wittes’ comment, one of the prevailing themes I heard in the four talks this morning is that there is a need for new methodology. It also comes through loud and clear that as much as there is a need for new methodology, there is also a need for more consistent application of basic principles and well-known and understood approaches and methodology. These basic principles and approaches include prespecification, randomization and replication, understanding the difference between clinical utility and statistical significance, handling specimens well, looking for batch effects, and caution regarding over-fitting. Clearly, there is a lot that is understood and not being consistently applied in this field. In that regard, statisticians must be the leaders in reaching solutions. There’s also a role for FDA guidance and FDA education. FDA has learned a lot of these lessons and has played a leadership role. One can see that in their approvals and their guidance documents, including recent guidance on companion diagnostics and older guidance on in vitro diagnostics. But there are a lot of issues for which there could be additional guidance and there are a lot of emerging issues we at Johnson & Johnson see, in light of our expertise in diagnostics as well as therapeutics, and our interest in developing targeted therapeutics. Even for issues where there is no consensus, for example, how to handle intermediate and equivocal results, guidance can provide insight into FDA expectations, and thus will remove some of the risk and resulting disincentive from working in this area. Another theme that some of the speakers spoke to was the case where you have a hypothesis that a product will work in a subset, is it sufficient to study just that subset or should a broader population be studied? We heard from Dan Sargent the example of
Herceptin. When I was at FDA, I headed regulation and approval of that product, so I can provide a little more background. There was no standardized test for HER-2 expression in the phase III trials of Herceptin (trastuzumab). In fact, each site used its own 'home-brew' test, often with different antibodies. There was no fluorescence in situ hybridization (FISH) test available at the time. Only immunohistochemistry was used, and it was not standardized at all. Many sites and patients did not have specimens banked for later testing for new assay development, and when specimens were available, there was limited information about how those specimens were handled and what their quality was. Those circumstances led to some really challenging problems. We had a life-prolonging treatment and pretty strong data that a marker could identify who would benefit from a treatment, and yet no reliable way to test for that marker. The Herceptin example raises a couple of issues on which I want to comment. First is the issue of when you should test more broadly than the anticipated target population. One situation for broader testing is when the optimal cut point is unclear. HER-2 expression is a continuum; it is broken down into discrete values, for example, 1+, 2+, and so on, if you use immunohistochemistry, but it is a continuum. It is interesting, but not surprising in light of some of the study conditions I just summarized, that 14 years later, there is still uncertainty regarding the optimal cut point and the optimal assay for predicting favorable response. A second issue, as Dan Sargent pointed out, is the importance of collecting a database that can be used for developing and validating future assays. In the case of trastuzumab, it would have been very valuable to have complete, quality tumor samples from the placebo-controlled trial, so that when FISH assays and other assays became available, those assays could have been directly validated as predictors of response to Herceptin in the original, placebo-controlled trial. In considering whether to limit testing to those patients testing positive on the candidate companion diagnostic, a factor to consider from a regulatory and public health standpoint is the likelihood that off-label use is going to occur in a population testing negative, or not tested. If it is likely, and particularly if the drug has risks that might outweigh benefits in such a population, then that use probably should be studied. As we heard from Dr Pennello, the FDA can label a test for selection of patients to receive a drug, but if test negative patients have not been studied on the drug, you really don't know whether the test is adding value. Dan Sargent commented that it may be hard to study the broad population of test negative and positive patients, based on a putative predictive test, where there is low prevalence of positivity, for
example, 5%. In those cases, it is worth thinking about partial enrichment where, instead of studying all comers or only test positive patients, you enrich enrollment for test positivity such that, for example, half of the patients enrolled are marker positive. I want to comment briefly on the very important concept Lisa McShane mentioned – specimen quality and batch effects. While her comments were made in the context of ‘omics’ trials, the concerns are very broadly applicable to many types of markers. A few other areas of important uses of biomarkers have received a little less attention, but might be appropriate areas for development of more rigorous methodological approaches. One of the most important uses of biomarkers in drug development is the use of effects on biomarkers in phase I and phase II studies to decide whether or not to proceed with the project and with what doses, treatment regimens, and target population to proceed. Often, biomarker data, even in not validated surrogates, are the best available indicators of potential efficacy that you have in phases I and II. The difference between success and failure in drug development often comes down to making the right decisions during and after these phases about which drugs to invest tens or hundreds of millions of dollars in and which doses to test. There may well be room for better methodology for analysis of data in that critically important situation. Another situation rising very rapidly in importance for use of biomarkers is in the clinical assessment of biosimilar candidates. We do not have generic biologics because analytic testing in vitro cannot ensure that products are the same. Therefore, some amount of clinical testing for candidate biosimilars is required by countries around the world including the United States, in order to ensure that clinically important differences do not occur. However, requiring clinical testing with validated surrogates or clinical endpoints to the same precision as for innovator products would make biosimilar testing very costly and might undermine a key goal of biosimilars policy, which is to improve affordability and access. Many people would argue that if clinical testing of a biosimilar is needed to exclude significant differences, it should be with validated surrogates or with clinically valid endpoints. An alternative view, however, is that given knowledge that a candidate biosimilar is highly similar to an innovator molecule, demonstration of highly similar effects on biomarkers may suffice to draw conclusions of efficacy even in some cases where those biomarkers would not be accepted as validated surrogates for efficacy of a new molecule. Regardless of whether biomarkers suffice in lieu of clinical endpoints in assessing clinical biosimilarity, they may have an important role in excluding
differences, as biomarkers often can be measured much more precisely than some clinical outcomes such as survival. The power to detect differences can therefore be much greater with biomarkers. Differences in effects on biomarkers may or may not be clinically meaningful, but often would raise enough concern to warrant further scrutiny or testing. So use of biomarkers in biosimilars testing is another area in which methodological development might be of value. In conclusion, I will note that our company is very interested in moving in a couple of directions that need to be powered by the type of work that many of you are doing. One is in the area of targeted medicines. We look at all our drug development projects for potential targeting. Another is in the area of disease prevention, secondary prevention, and interception. Many in public health believe prevention to be very important, but there is little drug development done in those areas (besides vaccines for infectious diseases) for a number of reasons. One is reimbursement; it may be hard to get payers to pay for prevention, especially if the benefits of prevention are realized years after treatment. Another is adherence; some healthy patients may not take preventive medications reliably. Other challenges in developing preventative therapies are that studies of healthy individuals to assess development of disease are likely to be of enormous size, duration, and cost. Biomarkers can help in identifying both populations at risk and early signs of progression. So that is an important area for improved methodology. Where there is an important public health need and scientific potential for better prevention, given the challenge of developing preventive therapies, regulatory decisions regarding how high to set standards for validation of markers that identify at-risk populations or that identify progression could have a major impact on development of preventive therapies.

Panel discussion I: discussion within and from the floor

Jason Liao: I thoroughly enjoyed the two talks by Dan Sargent and Lisa McShane. They clearly articulated the principles in biomarker validation from a frequentist point of view. We just heard from the panelists that statisticians can be the terrorists in clinical trials. We can be 'hard-line' and always think about the worst-case scenario and then we annoy our clients. That's the wrong way to do business. Another way is to try to give our nonstatistical colleagues some benefit of the doubt and be sympathetic, and at the same time, try to hold to our statistical principles. There are several ways that we can think about whether a claim of a discovery of a marvelous biomarker is really meaningful. How do
we help our colleagues to think about prior knowledge and the biological probability or the importance of this potential discovery? We need to be helpful, rather than just veto their research. For example, can we introduce Bayesian ideas so that we may be more flexible and more friendly? While I am not a hard-line Bayesian, I feel that if we always think like hard-line frequentists, we are limiting our ability to contribute to the process. Janet Wittes: I don’t think Bayesian statistics is going to cure the most important parts of the problem. The solution comes not from the flexibility of one methodology versus the other. You still need controls, whether you are a frequentist or a Bayesian. You still need to look at batch effect, whether you are a frequentist or a Bayesian. So whatever method you use, whatever philosophical school you come from, the basic principles of experimental design have to be there. I want to make another comment that is important in adaptive designs. In some of the adaptive designs that I have been involved in, the investigators don’t care about concurrent controls, so that the models are ignoring the possibility that over time, the patients may be different. The control group and the treatment group are recruited at different times. To me, that ignores possible batch effects and fundamental experimental design. You could design a Bayesian adaptive design that did take care of those issues. We need to think about the primary questions that are being asked and consider the basic principles of design. Then, whichever philosophy you like is fine. David DeMets: I would like to rethink our role as not being terrorists, but perhaps as a member of the team. Yesterday, I spent the day in Chicago at a conference on team science and I think in the area of clinical trials, we have been pretty successful being a member of the team that put together the protocol. I think in this area of biomarkers, we should also be part of the team, so that we don’t look like a terrorist, but part of the team struggling with a very difficult problem. Our job is to educate in that process, and the answers are not simple. There are some real challenges, and we have to help resolve them. Whatever environment we are in, we will always make statements that are based on imperfect data, imperfect analysis, and imperfect assumptions. As long as we keep that in mind, I think we will make progress, but if we come in the back door and just say no, then we will look like terrorists. I think the whole goal is to get involved early and often. In the IOM report, it was very clear that some of the problems that took place can be attributed to the fact that biostatisticians were not really integrated as
part of the team, or if they were there, they were not really independent. I think that's always been our challenge, but it is especially important here since it is so complicated and there may be lots of things we don't understand. I would say we need to approach this as a member of the team and that requires, by the way, us learning a little bit about the biology and the genetics and all that machinery, so that we can talk sensibly in their language and maybe along the way teach our colleagues a little bit about what it is we have to offer, so that we have communication that passes knowledge, not just data.

PK Tandon: In Genzyme, we deal with varied diseases where you are lucky if you have 3000 undiagnosed patients in the United States and maybe less than 10,000 worldwide. One of the challenges we face, as Dr Siegel just pointed out, is validating biomarkers as surrogate endpoints. What we have done in the past is to use animal models to test biomarkers in addition to using registries and natural histories. Is there an experience base for the process of identifying and validating biomarkers in rare diseases?

Jay Siegel: No. When I was at FDA, I worked on a number of rare disease projects where biomarkers were very important in the regulatory process, several of them enzyme replacement therapies that Genzyme was involved in. In that rare disease setting, formal validation of a biomarker for use as a surrogate, at least prior to being in the market, is not a realistic goal. Fortunately, FDA regulations recognize that problem and allow accelerated approval, which allows use of a not fully validated surrogate endpoint, providing it is shown to be reasonably likely to predict clinical benefit. Of course, that is a somewhat subjective standard; the agency has shown considerable flexibility with regard to therapies addressing unmet needs in serious disease, particularly rare diseases. In the regulatory setting, animal models can be of value in supporting a determination that a biomarker is reasonably likely to predict benefit, but only to the extent that they are really good models of human disease. For selected genetic deficiencies of specific enzymes, a reasonable case could often be made that animal models are relevant.

PK Tandon: For a new molecule, if it doesn't work in animal models as a biomarker, we say kill it.

Jay Siegel: Yes, lacking such an effect, you have too little to go on.

Anonymous: We see a lot of statistical challenges in the biomarker area, but data processing is also a big challenge. In an example from proteomics, only
preprocessed data were provided and others were not able to reproduce the original analytic results. Going back to the raw data, the results were yet again different. What suggestions can you give to handle the preprocessing stage before we enter into the data analysis?

David DeMets: While this is an area I do not know enough about, my instinct would be to keep the raw data and make it available because there are various ways to preprocess it. You can do all the kinds of serendipity you want, but once you think you've got something, start locking down both the data and the algorithm because otherwise we have a moving target. How can you analyze a moving target and make any claim of validity? So I think we are going to need to be much more rigorous. The IOM report, which I have referred to, tries to address this a little bit and makes three or four very important points. One of them is that you need to manage and lock the data down and have an auditing trail, so you know who changed what and when. Traditionally laboratory people stored the data on an Excel spreadsheet – you probably can't do it that way anymore, but that was the culture. Now when you are going to take things forward all the way to the review process and then to the clinic, you have got to nail it down. I think it is something we need to do and we know what to do.

Janet Wittes: And I think actually industry is much better at that than academia.

Houston Gilbert: I am interested in predictive biomarkers and getting companion diagnostics to market, as well as bringing multi-marker gene signatures to the market as a companion diagnostic. What are the challenges of bringing a multi-marker diagnostic or predictive marker to market?

Gene Pennello: One of the challenges of multi-marker diagnostics is the analytical validation part. In my talk, I addressed studies of precision. For a single analyte, you'd look at maybe a low concentration sample, then a medium, and then a high. Once you have multiple analytes that you are trying to combine, trying to span the space and look at the imprecision across replicates can be a challenging problem. You can look at it individually. You can look at the composite score and try to stay in that range, but that's one of the difficulties. Clinical validation, I think, is quite similar. You've got to lock the thing down and get an independent sample and clinical study to validate it.

Lisa McShane: The theme that came through repeatedly this morning was that there are basic
principles that should be followed – even in the multi-analyte situation, where you have a classifier that depends on even 100 or 200 different genes. We have found, in some of the clinical trial protocols that we get, that a simple question often goes unanswered: forget about how you got to that classifier, but if you took a tumor sample and you split it in two, and you ran your classifier on the one half and ran it again on the other half, how often did those two calls agree?
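That split-sample question reduces to a simple agreement calculation on paired calls. A minimal sketch on simulated data follows; the 10% discordance rate and the use of Cohen's kappa alongside raw percent agreement are illustrative assumptions, not figures from any trial.

```python
# Illustrative only: percent agreement and Cohen's kappa for a classifier run on the
# two halves of each split tumor sample. The 10% discordance rate is an assumption.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(1)
n_samples = 200
call_half_a = rng.integers(0, 2, size=n_samples)            # 0/1 call on first half
discordant = rng.random(n_samples) < 0.10                   # assumed discordance rate
call_half_b = np.where(discordant, 1 - call_half_a, call_half_a)

agreement = np.mean(call_half_a == call_half_b)
kappa = cohen_kappa_score(call_half_a, call_half_b)
print(f"Percent agreement: {agreement:.1%}   Cohen's kappa: {kappa:.2f}")
```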

That bypasses all of the multidimensional complexity. It's a simple question and that is the kind of stuff that is not occurring. With regard to predictive biomarkers, my personal bias is if you look for predictive markers, where you want to be doing your hard work is understanding the biology. I think it is naïve to believe that you are just going to do a massive data exploration and come up with some black box predictor that is going to work well as a predictive marker. Our big success stories over the last 10 years, in cancer at least, have been because someone has done the difficult biology. Once you get that figured out, a lot of the rest falls in line.

Jay Siegel: I have a follow-up question. When you are developing a multivariate assay for prediction of a therapeutic benefit, you are going to lock in prospectively (before the clinical trial) a way to combine those multiple variables into a predictive test. Similarly, if you use an assay with a single, continuous variable, you are going to lock in the cut point prospectively. After a large clinical validation trial, if safety and efficacy are found in the population defined by that prospective composite or cut point, and if a broader population was studied, it may well be the case that looking at the new, larger data set, you might find a slightly different composite or cut point that appears even more useful for predicting who responded better than the prospectively identified ones. The seemingly improved prediction might be real but might represent overfitting. What is the regulatory approach in a setting like that? Does one then label the product for use according to the prospectively planned analysis or does one label it according to an optimized analysis based on larger data that may be over-fit?

Gene Pennello: This may not be an exciting answer, but we like to see clean analyses instead of going back and trying to reassess the cutoff; but from some of the things that I have seen, if you can do a proper cross-validation of the cutoffs that maybe you are trying to optimize, it may not be all that biased compared to an external validation. It may not rise to the level of the trouble you get into if you don't cross-validate the feature selection. Actually
having locked down everything except the cutoff is relatively less of a problem. I also want to respond to the previous question about the importance of getting an honest estimate when you have this multiple feature classifier. I actually think Bayesian methods might be helpful in shrinking your performance estimates if you are looking at several different classifiers, because then it is important to get an honest estimate. Internal validation and cross-validation would also be good, but you want to have some trustworthy estimate of precision to plan the validation study, otherwise you will size it incorrectly.

Lisa McShane: I want to revisit the issue of whether one should cast the net wider initially and then go back and refine afterwards. I think that that is a hugely important philosophical decision and I think it really represents a potential tension between those who develop diagnostics and those who develop therapeutics. If I am developing a new drug, I frankly might not care if I get 10% or 20% of the cases called wrong, in terms of whether they are going to benefit or not. My goal is going to be to enrich the population enough so that when I try my therapy on that population, it is going to show a benefit. It is really hard to know when is the right place in the development process to say we need to really start honing in on the right assay. The incentives are very different for people who are developing diagnostics versus therapeutics. Diagnostic companies don't typically earn nearly the money that companies developing therapies do. If you ask someone in the early stages of drug development to spend 3 years fine-tuning an assay, doing all these preclinical studies, doing this and that, to figure out exactly what is the right cut point, that person would justifiably say, 'Well, we are not even sure how well this therapy is going to work. So why should we invest substantial resources in this assay for something that might not actually pan out?' And even if it does pan out, we are going to get a whole lot less money for our diagnostic than the company that developed the therapeutic will get. There are a lot of tensions here. These are very difficult decisions to make.

Richard Chappell: Coincidentally, my question relates to those tensions that Dr McShane just mentioned, except related to a phase III trial. Suppose you have a biomarker or composite biomarker reflected in the risk score and an associated treatment – a drug or some other treatment. I could personify that tension by considering representatives from the two companies in a room where we are designing a phase III trial. The company that developed the treatment may not even care about the biomarker. They are happy if the treatment works.
In fact, they may be happier without a biomarker because they can give it to everybody, although they might be happier with an enrichment trial because it is cheaper. But then, the company that developed the biomarker has a couple of things to worry about. First, the treatment has to work or else what good is a prognostic biomarker for a treatment – there is no such thing as a prognostic biomarker for an ineffective treatment. Second, of course, not only must the treatment work, but it must work better for one class of their biomarker than the other, and as Dr McShane pointed out, you should be able to use the biomarker to determine that some patients should not receive the treatment. Could the two representatives have very differing opinions as to the kind of design of the clinical trial? I would ask the speakers and the discussants to talk about how you might compromise or merge the interests of all parties involved in designing such a trial?

Jay Siegel: I want to comment on one aspect of that question. From the perspective of the pharmaceutical manufacturer, while use of a predictive biomarker might shrink the target population and might allow for a smaller trial, there is a third factor that can be even more important. In a large portion of the pharmaceutical market, particularly across Europe, the price and positioning of a drug is driven by Health Technology Assessments of its measurable effects, often in terms of quality-adjusted life years. In the United States, there is concern among payers about what benefit one gets for the dollar paid. Therefore, in fact, even from a pharmaceutical perspective, use of a biomarker that enriches the risk benefit profile of a drug, while narrowing the population, may not only increase the likelihood of medical and regulatory success, but may also improve the risk benefit profile and thereby favorably affect market penetration and reimbursement. So that factor, to some extent, makes perspectives converge upon interest in developing a companion diagnostic.

Janet Wittes: The situation you describe is a little artificial because there is often not just a single trial. In many cases, there are several trials. Moreover, time does not stand still. I am speaking out of turn because I have never been involved in a meeting like the one Jay Siegel describes, but I would guess that, as it becomes more and more likely that a drug is going to be successful and approved, there would be more interest in developing a better and better diagnostic. Moreover, it is not as if there is one diagnostic and that's it. Very often there is a diagnostic, then there is another one, and then there is yet another one. I also want to address the issue of missing values that Gene Pennello brought up. At least for the
studies that I have seen, this is a huge issue. Phase III studies that are looking at primary and secondary outcomes put their emphasis on those outcomes. On the other hand, the laboratory measurements and the long list of biomarkers don't get collected with the rigor, attention, and completeness paid to the primary and secondary outcomes. The implicit assumption so often appears to be 'Those that we don't have were unavailable for one reason or another. They don't really matter'. Yet when you are talking about a big chunk of missing data, often more than 30% of what should have been collected, it is really hard to believe that those are just missing at random.

Richard Chappell: Let me give a concrete example of the tension I was talking about. For optimal efficiency in testing the use of the biomarker, you might want to enroll equal numbers of biomarker-positive and biomarker-negative subjects (in the simplest case where there are just two categories). For testing whether the drug works, you may not care about the relative numbers in the biomarker-defined strata and they might be a big inconvenience because you would want to accrue patients as quickly as possible and not have to wait for the lower-accruing strata. So that is an example. Could anyone comment on that tension?

Jay Siegel: As I noted, there are several factors that influence interest in testing a biomarker for use as a companion diagnostic, but increasingly, pharmaceutical developers, like others, are showing interest in personalized therapies. Notwithstanding Janet Wittes' comment that interest in companion diagnostic development may increase as it looks more likely that a drug will be successful, work on developing a predictive biomarker must begin much earlier. The time it takes to codevelop a biomarker with a drug as a companion diagnostic is such that, unless one has a pretty good idea of what the test will be, one generally has to start thinking about what the biomarker is, or might be, during preclinical development. And putative biomarker testing often must be initiated in the first trials in patients. Otherwise, to develop the hypothesis, to identify the cut point to use, and to create a commercializable test all in time for validation in phase III clinical trials, is next to impossible. If biomarker development is not begun early, one might need to choose between unattractive alternatives. One would be to delay the development of the drug substantially, potentially years, for the biomarker development to catch up; the other would be to launch the drug alone and to develop the biomarker after the drug is on the market. But if the drug only has benefit when used with the biomarker, then the
drug cannot be launched alone; FDA guidance largely addresses this situation. From a business perspective, it is problematic to launch a product and then subsequently define the subpopulation in which it works best. In many areas, pricing is set on (and utilization is driven by) benefit risk profile. The financial advantage of targeting a subpopulation is that you get a better benefit risk profile and a better price, which may compensate for the narrower indication. But markets don't allow much increase of drug prices, with a few exceptions, after they are already launched. So if you determine after launch that your drug has a much improved profile when use is restricted to a subpopulation defined by a biomarker, you are stuck with the original price, even as the usage contracts. Therefore, if you believe that there is a reasonable likelihood that biomarkers will, in fact, improve the performance of your drug in a clinically meaningful way, then the incentives are actually pretty strong from the business as well as the medical perspective to test those hypotheses from very early in development and plan for a co-launching.

Dan Sargent: I actually have been in a lot of rooms like those described by Jay Siegel, because in my group, the Alliance for Clinical Trials in Oncology (a merger of the Cancer and Leukemia Group B (CALGB), the North Central Cancer Treatment Group (NCCTG), and the American College of Surgeons Oncology Group (ACOSOG)), we want to develop both therapeutics and biomarkers. I just have to second Jay Siegel's comments from a different perspective and that is the clinical uptake of a new therapy. Therapies that provide marginal benefit don't get adopted in the community in many cases, whereas the therapies that give hazard ratios of 0.5 and 0.4 do. There is an example in pancreatic cancer of an agent that is providing a modest survival benefit, but it has some toxicity and a hazard ratio of 0.8. It doesn't get used because it is costly and the benefit is marginal, so people are looking very hard for a biomarker for that agent, in order to, maybe, be able to change the price, but even if not, the biomarker would still provide some uptake of that agent; so I
think it is not only from a pricing perspective, it is also from an uptake perspective in the community. I haven’t actually seen the types of conflicts Dr Siegel described. It seems the interests are aligned in most cases to truly identify the right population to receive a treatment.

Participants

David DeMets: University of Wisconsin
Janet Wittes: Statistics Collaborative
Jay Siegel: Johnson & Johnson
William Mietlowski: Novartis Oncology
Dan Sargent: Mayo Clinic
Eric Rubin: Merck Oncology
Lisa McShane: NCI
Susan Ellenberg: University of Pennsylvania
Gene Pennello: FDA
Songbai Wang: Ortho Clinical Diagnostics
Jason Liao: Penn State Hershey Cancer Institute
PK Tandon: Genzyme Corporation
Houston Gilbert: Genentech
Richard Chappell: University of Wisconsin

Funding

The National Institutes of Health (NIH) (National Cancer Institute (NCI)) provided conference grant funding (5R13CA132565-05, GRANT 00249811). The Center for Clinical Epidemiology and Biostatistics at the University of Pennsylvania provided scientific and logistical support, and the American Statistical Association and the Society for Clinical Trials provided in-kind support for the conference.

References

1. Freidlin B, Simon R. Adaptive signature design: An adaptive clinical trial design for generating and prospectively testing a gene expression signature for sensitive patients. Clin Cancer Res 2005; 11(21): 7872–78.
2. Institute of Medicine (IOM). Evolution of Translational Omics: Lessons Learned and the Path Forward. The National Academies Press, Washington, DC, 2012.
