JNCI J Natl Cancer Inst (2015) 107(7): djv145. doi: 10.1093/jnci/djv145. First published online May 26, 2015.

Editorial

Projecting the Benefits and Harms of Mammography Using Statistical Models: Proof or Proofiness?

Barnett S. Kramer, Joann G. Elmore

Affiliations of authors: Division of Cancer Prevention, National Cancer Institute, Rockville, MD (BSK); University of Washington School of Medicine, Seattle, WA (JGE).

Correspondence to: Barnett S. Kramer, MD, MPH, National Cancer Institute, Division of Cancer Prevention, 9609 Medical Center Drive, Room 5E410, Rockville, MD 20852 (e-mail: [email protected]).

On numbers: "If you want to get people to believe something…just stick a number on it." (1)
-Charles Seife, Proofiness

On statistical modeling: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." (2)
-John von Neumann

Statistical models are often used in medicine and public health when there are important gaps in the body of empirical evidence regarding the impact of interventions on health outcomes. The models generally incorporate multiple parameters and variables with uncertain values. For example, lacking firm evidence that stage at diagnosis is a valid surrogate for a health outcome such as mortality, statistical models produce projections based on a chain of assumptions. Health outcomes are often projected beyond the available evidence from clinical trials, perhaps years or decades into the future—a classic "out of sample" problem (3). Such modeling requires assumptions, many of which are unobserved or even unobservable, such as progression rates of preclinical biological processes.

In this issue of the Journal, a team of very experienced modelers tackles an important question: What are the benefits and harms of mammography screening after the age of 74 years? (4) They conclude that the balance of benefits and harms of routine screening mammography is likely to remain positive until about age 90 years. To reach this conclusion, the authors employ three complex statistical microsimulation models, a necessity given that the well of reliable empirical evidence from randomized trials runs dry beyond the age of 74 years (5). The average reader will lack the time, patience, or skill to dissect the three models or their underlying assumptions, and so many will have to take on faith the model outputs emphasized in the abstract. This is despite the recognition by most modelers that identifying and studying the uncertainties in the assumptions that drive the output is as important as—and perhaps more important than—the output itself. As telegraphed in the title of the paper, the methods of estimating overdiagnosis are major drivers of the models.

A well-worn aphorism from statistician George E. P. Box is that "essentially, all statistical models are wrong, but some are useful" (6). Box's maxim raises two key questions for any model: 1) How wrong is it? and 2) How useful is it? Every model is worth examining through the lens of both.
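Von Neumann's quip, and the out-of-sample problem it warns against, can be made concrete with a minimal sketch of our own (illustrative synthetic data, not anything from the models under discussion): a noisy linear trend fit once with two parameters and once with ten. Both describe the observed period; only one survives projection beyond it.

```python
import numpy as np

# Illustrative synthetic data: a simple linear trend plus noise.
rng = np.random.default_rng(0)
x = np.arange(10, dtype=float)                # the "observed" period
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 10)    # noisy linear trend

line = np.polyfit(x, y, deg=1)     # 2 parameters: captures the trend
wiggle = np.polyfit(x, y, deg=9)   # 10 parameters: fits the noise exactly

x_future = 15.0                    # an "out of sample" projection
print(np.polyval(line, x_future))    # stays near the true trend (roughly 9.5)
print(np.polyval(wiggle, x_future))  # typically far off: it memorized the noise
```

The ten-parameter curve fits the observed points perfectly, yet its projection is an artifact of the noise it absorbed; in-sample fit says little about out-of-sample behavior.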

How Wrong Are the Models?

Every forecasting model is prone to three major components of uncertainty, as described by Nate Silver in The Signal and the Noise: Why So Many Predictions Fail—but Some Don't (7): 1) uncertainty in the initial conditions (eg, variability in baseline risk of breast cancer, drift in incidence trends); 2) structural uncertainty (eg, imprecise knowledge about outcome utilities, the dynamics of subclinical disease progression, and the validity of intermediate endpoints such as tumor size or stage); and 3) scenario uncertainty (eg, variation in screening mammography sensitivity and specificity among radiologists, and drifts in therapy patterns and efficacy). The uncertainty in the third category increases over time (see Figure 1, adapted from [7]). We have inverted and modified Figure 1 to draw an analogy with stepping from the terra firma of observed empirical evidence derived from clinical trials into a figurative lake of estimated data derived from statistical modeling (Figure 2). As one wades into the water from the shore, moving through longer time projections into the future lives of patients, there is progressively less support from firm evidence underfoot. Suddenly, one loses contact with the underlying empirical evidence; the swimmer can no longer touch bottom and does not even know whether the bottom is inches or many feet below.

Any statistical model of a biological system also bumps up against chaos theory, in which predictions of outcome are difficult because of hypersensitivity to starting conditions. Models can become chaotic when two criteria hold: 1) the system is dynamic (ie, there are feedback loops in which factors influence each other, including tumor-microenvironment interactions); and 2) the processes follow exponential rather than additive relationships (8). Most would agree that the intertwined processes of breast cancer pathogenesis, progression, detection, and treatment fulfill both criteria.
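The hypersensitivity at issue is easiest to see in a deliberately simple system (our illustration; none of the breast cancer models is this simple). The logistic map is a one-line feedback process with a multiplicative rather than additive relationship, so it satisfies both criteria above, and trajectories started a millionth apart soon bear no resemblance to each other.

```python
def logistic_trajectory(x0, r=3.9, steps=40):
    """Iterate the logistic map x_{t+1} = r * x_t * (1 - x_t): a dynamic
    feedback process with a multiplicative (nonadditive) relationship.
    r = 3.9 lies in the map's chaotic regime."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = logistic_trajectory(0.600000)
b = logistic_trajectory(0.600001)  # starting condition perturbed by 1e-6

for t in (0, 10, 20, 30, 40):
    print(f"step {t}: divergence = {abs(a[t] - b[t]):.6f}")
# The gap grows from one part in a million to order 1 within a few dozen steps.
```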








Figure 1. Sources of model uncertainty. Adapted with permission from Nate Silver, The Signal and the Noise: Why So Many Predictions Fail—but Some Don't (7).

Figure 2. Wading into deep water.

Van Ravesteyn and colleagues (4) tell the reader that all three models have been "validated," providing evidence in their Figure 2 that the models reproduce Surveillance, Epidemiology, and End Results (SEER) incidence data over the period 1975 to 2000. This is evidence of calibration rather than true validation. Leaving aside the fact that the most recent year of the comparison SEER data is about 15 years old, one of the models consistently predicts lower breast cancer incidence than SEER over the entire period, and another overpredicts for the first 10 years and then underpredicts considerably. It is risky to assume that even the third model, which approximates SEER incidence relatively well, is fully "validated." Even purposely mis-specified models can be shown to fit existing datasets well (9). John von Neumann's quote is apropos here. It is also well known that economic models "validated" by their close fit to previous downturns notoriously fall short in predicting the next one.

The van Ravesteyn article also reports substantial differences among the models in estimates of age-specific overdiagnosis (Figure 3, A-C, and Table 3 of [4]), a major pillar of the models. There is active debate in the field (9-11) about the appropriate methods of estimating overdiagnosis with models that attempt to adjust for lead times, as the three models used by van Ravesteyn et al. (4) do. These methods can lead to underestimation of the frequency of overdiagnosis, and any such underestimation would inflate the projected benefits of breast cancer screening over time in the van Ravesteyn et al. models.
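The mechanics of overdiagnosis, and why it matters so much at older ages, can be seen in a deliberately crude Monte Carlo sketch of our own. The onset, sojourn-time, and survival parameters below are illustrative assumptions, not estimates from any of the three models: a screen-detected cancer counts as overdiagnosed when its clinical surfacing would have come only after death from other causes.

```python
import random

random.seed(20150526)

def overdiagnosis_fraction(screen_age, n=200_000, p_onset=0.15,
                           mean_sojourn=4.0, mean_death=80.0, sd_death=9.0):
    """Toy estimate of the fraction of screen-detected cancers that are
    overdiagnosed. Preclinical onset age is uniform on 40-95 years, sojourn
    (lead) time is exponential, and other-cause death age is normal.
    All parameters are illustrative assumptions."""
    detected = overdiagnosed = 0
    for _ in range(n):
        if random.random() > p_onset:
            continue                                  # never develops cancer
        onset = random.uniform(40.0, 95.0)            # preclinical onset age
        sojourn = random.expovariate(1.0 / mean_sojourn)
        surfacing = onset + sojourn                   # would surface clinically
        death = random.gauss(mean_death, sd_death)    # death from other causes
        if death < screen_age:
            continue                                  # dies before the screen
        if onset <= screen_age < surfacing:           # in the detectable window
            detected += 1
            if surfacing > death:                     # would never have surfaced
                overdiagnosed += 1
    return overdiagnosed / detected

for age in (70, 75, 80, 85, 90):
    print(f"screen at {age}: {overdiagnosis_fraction(age):.0%} overdiagnosed")
# The overdiagnosed share of screen-detected cancers climbs steeply with age,
# because remaining life expectancy shrinks while lead times do not.
```

Even in this toy, the answer hinges on the assumed sojourn-time distribution, which is exactly the unobservable quantity the lead-time adjustment methods must estimate.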

How Useful Are the Models?

Policy makers and clinicians should use models only if they understand what goes on inside the "black box" and the potential limitations of extrapolating beyond observed, and observable, evidence. That is a tall order, given the issues raised above. The models are nonetheless useful in that they afford insights into the important role of informed decision-making in breast cancer screening. The biggest driver of personal decisions about cancer screening is likely to be personal values. Van Ravesteyn et al. (4) recognize this, stating that divergence of individual preferences from the values assumed in the models is their most important drawback. Given that challenge, an important area of future research is learning how best to incorporate patients' values into informed decision-making, with or without models.

In summary, the models presented in the van Ravesteyn study (4) provide important insights and new directions for research. However, direct application to policy and clinical practice remains a challenge. We need to better gauge the depth of the lake and to avoid the pseudo-precision that proofiness can convey.

Notes

Opinions expressed in this manuscript are those of the authors and do not necessarily represent the opinions or official positions of the US Department of Health and Human Services or the US National Institutes of Health.

Received: April 21, 2015; Accepted: April 23, 2015. Published by Oxford University Press 2015. This work is written by (a) US Government employee(s) and is in the public domain in the US.

References

1. Seife C. Proofiness: The Dark Arts of Mathematical Deception. New York, NY: Penguin Group; 2010:295.
2. Dyson F. Turning points: a meeting with Enrico Fermi (quotation attributed to John von Neumann). Nature. 2004;427:297.
3. Silver N. Out of sample, out of mind: a formula for failed prediction. In: The Signal and the Noise: Why So Many Predictions Fail—but Some Don't. New York, NY: Penguin Group; 2012:44.
4. van Ravesteyn NT, Stout NK, Schechter CB, et al. Benefits and harms of mammography screening after age 74 years: model estimates of overdiagnosis. J Natl Cancer Inst. 2015;107(7):djv103. doi:10.1093/jnci/djv103.
5. Gøtzsche PC, Jørgensen KJ. Screening for breast cancer with mammography (review). Cochrane Database Syst Rev. 2013;(6):1-81.
6. Box GEP, Draper NR. Empirical Model-Building and Response Surfaces. New York, NY: John Wiley & Sons; 1987.
7. Silver N. A climate of healthy skepticism. In: The Signal and the Noise: Why So Many Predictions Fail—but Some Don't. New York, NY: Penguin Group; 2012:370-411.
8. Silver N. The Signal and the Noise: Why So Many Predictions Fail—but Some Don't. New York, NY: Penguin Group; 2012:534.
9. Baker SG, Prorok PC, Kramer BS. Lead time and overdiagnosis. J Natl Cancer Inst. 2014;106(12).
10. Zahl P-H, Jørgensen KJ, Gøtzsche PC. Lead-time models should not be used to estimate overdiagnosis in cancer screening. J Gen Intern Med. 2014;29(9):1283-1286.
11. Zahl P-H, Jørgensen KJ, Gøtzsche PC. Overestimated lead times in cancer screening has led to substantial underestimation of overdiagnosis. Br J Cancer. 2013;109:2014-2019.



