GENERAL MEDICINE/EDITORIAL

The Sinking STONE: What a Failed Validation Can Teach Us About Clinical Decision Rules

Steven M. Green, MD*; David L. Schriger, MD, MPH

*Corresponding Author. E-mail: [email protected].

0196-0644/$-see front matter
Copyright © 2016 by the American College of Emergency Physicians.
http://dx.doi.org/10.1016/j.annemergmed.2015.11.022

A podcast for this article is available at www.annemergmed.com.

SEE RELATED ARTICLES, P. 423, 439, AND 449.

[Ann Emerg Med. 2016;67:433-436.] Volume 67, no. 4 : April 2016

Clinical decision rules are common and widely promoted in emergency medicine, yet their quality varies widely. Few decision rules undergo the rigorous external validation necessary to confirm their function, as should appropriately occur before their incorporation into clinical practice.1

This issue of Annals features 3 articles on urolithiasis, including 2 on a clinical decision rule called the STONE score. Wang et al2 report a large and compelling external evaluation, the original STONE investigators3 test whether ultrasonography enhances score performance,4 and Wang5 provides a short review of urolithiasis management. These articles, in particular the external evaluation by Wang et al,2 illustrate key principles about clinical decision rule validity, purpose, and function that can be applied when evaluating other rules. We will discuss 4 such issues here, referring readers interested in greater background discussion to Annals’ guidelines for the appraisal of clinical decision rules.1

IS THE STONE SCORE VALID?

The STONE score includes 5 separate elements identified during its derivation as statistically predictive of urolithiasis, which, when combined, spell the word STONE (Figure). Wang et al2 retested these components in an independent sample substantially larger than that in the original validation (845 versus 491 patients) and were able to confirm 4 of the 5 elements as statistically predictive, although notably less so than in the original study. More important, however, the score item “origin” (nonblack race) did not validate as predictive, representing a serious failure of the rule’s infrastructure. This indicates that up to 3 of the possible 13 score points are likely unhelpful and thus are misleading “noise,” rendering the overall rule unstable and effectively invalid.

A clinical decision rule whose core configuration has been invalidated should be abandoned, refined, or perhaps retested in an even larger sample. Wang et al2 explored refinement by eliminating the invalid item and found that the resulting “STNE” score exhibited predictive value similar to that of STONE. Thus, this decision rule could be relaunched in 4-item fashion; however, the validity and function of such repackaging would require independent confirmation in a new data set.

DOES THE STONE SCORE FULFILL AN IMPORTANT NEED?

How are we supposed to use the STONE score? This decision rule’s authors designed it to sort patients already selected to receive computed tomography (CT) scanning into general groups with low, moderate, or high probability of having a stone.3 They stopped short of recommending that these resulting risk categories be used for specific clinical decisionmaking, citing the need for “future investigations” to assess “our hope ... that this score can be incorporated into imaging decisions.”3

Is this valuable? Do we need further risk stratification if we already believe that a CT scan is warranted? Isn’t the point moot because the imaging will answer the question better than the score? Wouldn’t we instead prefer a clinical decision rule applicable to all patients for whom urolithiasis is a reasonable diagnostic consideration, not just the subset already selected for imaging? Unfortunately, the STONE score was not designed for this more clinically relevant question.

Even more fundamental, is the presence or absence of a ureteral stone the true clinical concern?
As noted in the clinical review in this issue,5 most clinicians obtain imaging not so much to confirm the stone’s presence but to assess whether a patient with a stone is likely to require care beyond pain management according to the stone’s location, size, or the presence of substantial hydroureter or hydronephrosis, and to exclude a dangerous alternative diagnosis. A score that accurately excludes indications for urgent urologic management and serious competing diagnoses would be useful; a score that merely predicted the presence or absence of stone, less so.

Being old-timers, we are comfortable with a far lower frequency of imaging than was apparently used in the STONE studies and as is recommended in the diagnostic algorithm by Wang.5 For decades, urolithiasis has been diagnosed through careful clinical judgment and managed expectantly, and the anecdotal recollection of our previous generation is that the clinically important misses were rare. Contemporary imaging trends appear driven by defensive medicine or the stubborn quest for diagnostic certainty.6

Figure. Calculating the STONE score.3

DOES THE STONE SCORE PROVIDE A CLEAR COURSE OF ACTION?

The STONE score sorts patients into low-, moderate-, or high-risk categories, which its authors propose might have a role in actual clinical decisionmaking.3 In their validation study, Wang et al2 affirm that indeed the score can successfully divide patients into these risk categories, although notably the resultant sorting was less accurate than in the original report. Thus, the STONE score can provide a general estimation of risk, as might analogously be achieved using a WBC count in suspected appendicitis.

Other readers would instead prefer the STONE score to have specific utility. The Ottawa Ankle Rules and the National Emergency X-Radiography Utilization Study cervical spine rule are popular and successful because they tell clinicians how to act.1 When these rules are applied to nontrivial-risk populations, patients should generally receive imaging when the rules are positive and should generally not receive it when they are negative.

It is generally accepted that useful tests or rules have a positive likelihood ratio of 10 or greater, a negative likelihood ratio of 0.1 or less, or, ideally, both. STONE falls substantially short of these marks in all scenarios tested (Table). For example (data from Wang et al2), using the high-risk category of the score rather than CT to rule in urolithiasis identified barely half of the patients with stones while diagnosing 13% of nonstone patients as falsely having calculi. Using the low-risk classification rather than CT to rule out the diagnosis would miss 8% of stones. If it truly were important to diagnose the presence or absence of stones, most clinicians would be uncomfortable with the frequency of diagnostic errors in any of the scenarios outlined in the Table. Wang et al2 reviewed multiple other score thresholds for high or low risk and could not identify any that appeared more promising. Thus, STONE lacks sufficient accuracy to reliably rule in or rule out a stone (Table) and does not show potential as a clinically feasible tool to reduce CT scanning.
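The likelihood-ratio benchmarks cited here can be checked by hand from any sensitivity and specificity pair. The following is a minimal Python sketch, not part of the cited analyses, that applies the standard definitions (positive likelihood ratio = sensitivity / (1 - specificity); negative likelihood ratio = (1 - sensitivity) / specificity) to the external validation figures reported in the Table. The function names, including the `meets_benchmark` helper, are illustrative assumptions, and the inputs are the rounded percentages from the Table, so results could differ slightly from values computed on raw counts.

```python
# Illustrative sketch: likelihood ratios from sensitivity and specificity.
# Inputs are the rounded external-validation percentages from the Table;
# function names are hypothetical and not taken from the cited studies.


def positive_lr(sensitivity: float, specificity: float) -> float:
    """LR+ = sensitivity / (1 - specificity)."""
    return sensitivity / (1.0 - specificity)


def negative_lr(sensitivity: float, specificity: float) -> float:
    """LR- = (1 - sensitivity) / specificity."""
    return (1.0 - sensitivity) / specificity


def meets_benchmark(lr_pos: float, lr_neg: float) -> bool:
    """True if LR+ >= 10 or LR- <= 0.1, the thresholds named in the text."""
    return lr_pos >= 10.0 or lr_neg <= 0.1


# External validation, high-risk category treated as a "rule-in" test.
high_sens, high_spec = 0.53, 0.87
high_lr_pos = positive_lr(high_sens, high_spec)
high_lr_neg = negative_lr(high_sens, high_spec)

# External validation, low-risk category: "sensitivity" here is the share
# of stone patients who land in the low-risk group (the missed stones).
low_sens, low_spec = 0.08, 0.65
low_result_lr = positive_lr(low_sens, low_spec)

print(f"High-risk category: LR+ = {high_lr_pos:.1f}, LR- = {high_lr_neg:.1f}")
print(f"Low-risk category: LR of a low-risk result = {low_result_lr:.1f}")
print("High-risk category meets benchmark:",
      meets_benchmark(high_lr_pos, high_lr_neg))
print("Low-risk result reaches 0.1 or less:", low_result_lr <= 0.1)
```

With the Table's rounded inputs, the sketch reproduces the tabulated values for the external validation (4.1 and 0.5 for the high-risk category, 0.2 for a low-risk result), none of which reaches the 10 or 0.1 marks.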
What may be needed is a rule for who receives imaging based not on the presence or absence of stones but on the possibility of a stone complication’s affecting outcome or the need to exclude competing dangerous diagnoses.

DOES THE STONE SCORE YIELD OUTCOMES SUPERIOR TO THOSE DERIVED WITH GESTALT?

The 4 remaining elements of the STONE score are hematuria, acute onset of pain, male sex, and nausea or vomiting. Wait! Aren’t those among the classic textbook factors that we already routinely use when clinically sizing up patients with possible stones? The fundamental purpose of a clinical decision rule is to improve clinical care, not just to mirror what we are already doing. The Ottawa Ankle Rules, for example, improve on clinical judgment because they are just as sensitive while being more specific. A score that replicates but does not improve on baseline gestalt has failed to add value.1,7,8

The creators of the STONE score did not contrast it with unstructured clinical judgment, and thus Wang et al2 appropriately extend their validation to include this vital metric. They found statistical superiority for the 13-point score versus a 5-level treating physician gradation of pretest probability, with areas under receiver operating characteristic curves of 0.78 versus 0.68. However, both such measures indicate weak predictive value, and this modest observed difference is unlikely to have clinical importance.

Furthermore, Wang et al2 graphically depict (their Figure 2) that any apparent superiority of the score over gestalt lies solely at the specificity end of the curve, ie, the score might rule in a stone slightly better than clinical judgment but does not appear any better for ruling one out. This potential advantage would appear to be in the least useful direction for potential imaging reduction because clinicians will remain prone to imaging higher-risk patients when they have concerns about stone emergencies5 and dangerous alternative diagnoses.

Table. Accuracy of the STONE score in predicting urolithiasis.

Risk        Sensitivity (95% CI)    Specificity (95% CI)    Positive likelihood ratio    Negative likelihood ratio

STONE score alone, original authors
Low         3% (2%–5%)              67% (62%–72%)           0.03                         1.4
Moderate    41% (37%–46%)           42% (37%–47%)           0.7                          1.4
High        55% (51%–60%)           91% (88%–94%)           6.1                          0.5

STONE score alone in external validation2
Low         8% (6%–12%)             65% (61%–69%)           0.2                          1.4
Moderate    38% (33%–43%)           48% (43%–52%)           0.7                          1.3
High        53% (48%–59%)           87% (84%–90%)           4.1                          0.5

STONE score with hydronephrosis on ultrasonography5
Low         64% (36%–86%)           87% (80%–92%)           4.9                          0.4
Moderate    60% (52%–67%)           71% (64%–76%)           2.1                          0.6
High        69% (63%–75%)           60% (42%–76%)           1.7                          0.5

CI, Confidence interval.

WHAT DOES THIS MEAN FOR THE STONE SCORE?

These new data confirm that the STONE score can stratify patients already selected for CT scanning into low-, moderate-, and high-risk groups and that the score is measurably more specific than unaided clinical judgment. However, the validation failure of one of its 5 components renders the score unstable and thus unreliable. The score’s fundamental intent of general risk sorting is of limited practical utility because it does not identify outcomes of clinical importance, and the 2 new reports in this issue2,4 were unable to identify any promise for it as a tool to defer CT. The observed statistical superiority of the score over gestalt is based on a modest margin that is likely below the threshold of clinical importance and lies in a direction with doubtful potential to reduce imaging.

Although the STONE score is well intentioned, we believe it is effectively sunk by these weaknesses and that it should therefore be abandoned in its current form. Researchers and clinicians can learn from this example of a rule validation failure and attempt to determine whether a urolithiasis score is even needed, and, if so, how it might be redesigned to have potential for clinical utility.

WHAT DOES THIS MEAN FOR CLINICAL DECISION RULES OVERALL?

The STONE score has been subjected to rigorous external evaluation, unlike most clinical decision rules in emergency medicine. How many other widely promoted and trendy rules are similarly invalid, but we just don’t know it? We believe that emergency physicians should maintain a healthy distrust of all rules that have not been convincingly demonstrated to retain validity in independent large samples, fill an important need, provide a clear course of clinical action, and yield outcomes superior to clinical judgment alone.1 Without such confirmation of success, they are simply not worth the trouble and may harm more than they help.

Supervising editor: Michael L. Callaham, MD

Author affiliations: From Loma Linda University Medical Center and Children’s Hospital, Loma Linda, CA (Green); and the University of California, Los Angeles, Los Angeles, CA (Schriger).


Funding and support: By Annals policy, all authors are required to disclose any and all commercial, financial, and other relationships in any way related to the subject of this article as per ICMJE conflict of interest guidelines (see www.icmje.org). The authors have stated that no such relationships exist and provided the following details: Dr. Schriger’s work on this editorial was funded in part by an unrestricted grant from the Korein Foundation. Dr. Callaham was the supervising editor on this article. Dr. Green did not participate in the editorial review or decision to publish this article.

REFERENCES

1. Green SM, Schriger DL, Yealy DM. Methodologic standards for interpreting clinical decision rules in emergency medicine: 2014 update. Ann Emerg Med. 2014;64:286-291.
2. Wang RC, Rodriguez RM, Moghadassi M, et al. External validation of the STONE score, a clinical prediction rule for ureteral stone: an observational multi-institutional study. Ann Emerg Med. 2016;67:423-432.
3. Moore CL, Daniels B, Luty S, et al. Derivation and validation of a clinical prediction rule for uncomplicated ureteral stone—the STONE score: retrospective and prospective observational cohort studies. BMJ. 2014;348:g2191.
4. Daniels B, Gross CP, Molinaro A, et al. STONE PLUS: evaluation of emergency department patients with suspected renal colic, using a clinical prediction tool combined with point-of-care limited ultrasonography. Ann Emerg Med. 2016;67:439-448.
5. Wang RC. Managing urolithiasis. Ann Emerg Med. 2016;67:449-454.
6. Kassirer JP. Our stubborn quest for diagnostic certainty. A cause of excessive testing. N Engl J Med. 1989;320:1489-1491.
7. Schriger DL, Newman DH. Medical decisionmaking: let’s not forget the physician. Ann Emerg Med. 2012;59:219-220.
8. Green SM. When do clinical decision rules improve patient care? Ann Emerg Med. 2013;62:132-135.

