Research

Original Investigation

Reliability of Risk-Adjusted Outcomes for Profiling Hospital Surgical Quality

Robert W. Krell, MD; Ahmed Hozain, BS; Lillian S. Kao, MD, MS; Justin B. Dimick, MD, MPH

IMPORTANCE Quality improvement platforms commonly use risk-adjusted morbidity and mortality to profile hospital performance. However, given small hospital caseloads and low event rates for some procedures, it is unclear whether these outcomes reliably reflect hospital performance.

OBJECTIVE To determine the reliability of risk-adjusted morbidity and mortality for hospital performance profiling using clinical registry data.

DESIGN, SETTING, AND PARTICIPANTS A retrospective cohort study was conducted using data from the American College of Surgeons National Surgical Quality Improvement Program, 2009. Participants included all patients (N = 55 466) who underwent colon resection, pancreatic resection, laparoscopic gastric bypass, ventral hernia repair, abdominal aortic aneurysm repair, and lower extremity bypass.

MAIN OUTCOMES AND MEASURES Outcomes included risk-adjusted overall morbidity, severe morbidity, and mortality. We assessed reliability (0-1 scale: 0, completely unreliable; 1, perfectly reliable) for all 3 outcomes. We also quantified the number of hospitals meeting minimum acceptable reliability thresholds (>0.70, good reliability; >0.50, fair reliability) for each outcome.

RESULTS For overall morbidity, the most commonly studied outcome, mean reliability depended on sample size (ie, hospital caseload) and event rate (ie, how frequently the outcome occurred). For example, mean reliability for overall morbidity was low for abdominal aortic aneurysm repair (reliability, 0.29; sample size, 25 cases per year; event rate, 18.3%). In contrast, mean reliability for overall morbidity was higher for colon resection (reliability, 0.61; sample size, 114 cases per year; event rate, 26.8%). Colon resection (37.7% of hospitals), pancreatic resection (7.1% of hospitals), and laparoscopic gastric bypass (11.5% of hospitals) were the only procedures for which any hospitals met a reliability threshold of 0.70 for overall morbidity. Because severe morbidity and mortality are less frequent outcomes, their mean reliability was lower, and even fewer hospitals met the thresholds for minimum reliability.

CONCLUSIONS AND RELEVANCE Most commonly reported outcome measures have low reliability for differentiating hospital performance. This is especially important for clinical registries that sample cases rather than collecting 100% of them, which can limit hospital case accrual. Eliminating sampling to achieve the highest possible caseloads, adjusting for reliability, and using advanced modeling strategies (eg, hierarchical modeling) are necessary for clinical registries to increase their benchmarking reliability.

Invited Commentary page 474

CME Quiz at jamanetworkcme.com and CME Questions page 496

JAMA Surg. 2014;149(5):467-474. doi:10.1001/jamasurg.2013.4249 Published online March 12, 2014.

Author Affiliations: Department of Surgery, University of Michigan Health System, Ann Arbor (Krell, Dimick); Department of Surgery, Michigan State University College of Human Medicine, East Lansing (Hozain); Department of Surgery, The University of Texas at Houston Medical School, Houston (Kao).

Corresponding Author: Robert W. Krell, MD, Department of Surgery, University of Michigan, 2800 Plymouth Rd, Bldg 16, Office 016-100N-13, Ann Arbor, MI 48109 ([email protected]).


Copyright 2014 American Medical Association. All rights reserved.



Clinical registries have had a prominent role in increasing transparency and accountability for the outcomes of surgical care. Many, if not all, of the preeminent surgical clinical registries use risk-adjusted outcomes feedback to benchmark performance and guide surgical quality improvement efforts.1-4 With the increasing practice of linking postoperative outcomes to reimbursement and quality improvement efforts, it is important that outcome measures be highly reliable to avoid misclassifying hospitals.1,5 However, a systematic evaluation of the statistical reliability of commonly used outcome metrics in surgery is lacking.6-8

Because of financial or personnel limitations, not all surgical registries capture 100% of cases from their participating hospitals.9 As a consequence, the yearly maximum number of cases reported by many hospitals in those programs can be limited. The combination of low caseload and low outcome rates reduces the ability of many outcomes to distinguish true quality differences among providers, which results in low reliability, analogous to power limitations in clinical trials.7 Several studies10,11 have called into question the reliability of certain complications for measuring quality in specific clinical populations. A better understanding of the reliability of commonly reported risk-adjusted outcomes, and of measures to counteract low reliability, will help to improve the accuracy of surgical outcome reporting.

In this context, we evaluated the statistical reliability of 3 commonly used outcomes (mortality, severe morbidity, and overall morbidity) for profiling hospital performance across multiple procedures. We used logistic regression modeling, a common risk-adjustment method, to calculate risk-adjusted mortality and morbidity rates following 6 different procedures. We then examined the reliability of those measures, first by investigating the effect of hospital caseload (ie, reported cases) on outcome reliability and then by assessing the number of hospitals that met 2 commonly accepted minimum reliability standards. We hypothesized that limited caseloads and rare event rates would result in low reliability for most commonly reported outcomes, even in clinically rich surgical registries.

Methods

Data Source and Study Population
We analyzed data from the 2009 American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP) clinical registry. Details of data collection and validation in ACS-NSQIP have been provided elsewhere.12 In brief, the registry includes more than 135 variables encompassing patient and operative characteristics, 21 postoperative complications, reoperation, and 30-day mortality. Using relevant Current Procedural Terminology codes, we identified patients undergoing colon resection, pancreatic resection, laparoscopic gastric bypass, open ventral hernia repair, abdominal aortic aneurysm (AAA) repair, or lower extremity bypass procedures.

Outcomes
Our primary outcomes of interest were risk-adjusted overall morbidity, severe morbidity, and mortality. Postoperative complications recorded by ACS-NSQIP include surgical (wound dehiscence, bleeding, graft failure, or superficial, deep, or organ-space surgical site infection), medical (cardiac arrest, myocardial infarction, deep venous thrombosis, pulmonary embolism, urinary tract infection, renal insufficiency, or acute renal failure), pulmonary (pneumonia, prolonged intubation, or unplanned intubation), nervous (coma, stroke, or peripheral nerve injury), and systemic (sepsis or septic shock) complications. In addition, ACS-NSQIP records reoperation and 30-day postoperative mortality rates. For the present study, we defined 30-day morbidity as any of the 21 possible complications. To define severe morbidity, we excluded superficial surgical site infection, deep venous thrombosis, urinary tract infection, peripheral nerve injury, and progressive renal insufficiency.
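The two morbidity definitions above are straightforward to operationalize. The sketch below is not the authors' code, and the 0/1 complication column names are hypothetical (ACS-NSQIP's real variable names differ); it only illustrates collapsing the 21 complication flags into overall and severe morbidity.

```python
# Sketch: collapsing per-patient complication flags into the paper's two
# morbidity outcomes. Column names are hypothetical placeholders.
import pandas as pd

COMPLICATIONS = [
    "superficial_ssi", "deep_ssi", "organ_space_ssi", "wound_dehiscence",
    "bleeding", "graft_failure", "cardiac_arrest", "mi", "dvt",
    "pulmonary_embolism", "uti", "renal_insufficiency", "acute_renal_failure",
    "pneumonia", "prolonged_intubation", "unplanned_intubation",
    "coma", "stroke", "peripheral_nerve_injury", "sepsis", "septic_shock",
]
# Complications excluded from the severe-morbidity definition (per the text).
MINOR = {"superficial_ssi", "dvt", "uti", "peripheral_nerve_injury",
         "renal_insufficiency"}

def add_morbidity_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Add overall_morbidity (any complication) and severe_morbidity flags."""
    df = df.copy()
    df["overall_morbidity"] = df[COMPLICATIONS].any(axis=1).astype(int)
    severe_cols = [c for c in COMPLICATIONS if c not in MINOR]
    df["severe_morbidity"] = df[severe_cols].any(axis=1).astype(int)
    return df
```

A patient whose only complication is, say, a urinary tract infection counts toward overall morbidity but not severe morbidity under these definitions.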

Statistical Analysis

Creation of Risk-Adjusted Outcome Rates
We entered patient demographics, comorbid conditions, and operative characteristics, when applicable, into a forward stepwise logistic regression model with each outcome (mortality, severe morbidity, and morbidity) as the dependent variable. Variables with coefficient P < .05 from the stepwise regression were then used in a logistic regression model for each outcome to generate each patient's probability of experiencing that outcome. We repeated the process across procedure types. To generate hospital risk-adjusted outcome rates, patient probabilities were summed for each hospital and compared with each hospital's observed outcome rate to generate hospital-level observed to expected ratios. Multiplying each hospital's observed to expected ratio by the mean outcome rate yielded its risk-adjusted rate.

Calculating Reliability
Reliability quantifies the proportion of provider performance variation explained by true quality differences (ie, statistical signal) and is measured on a scale of 0 (all differences attributable to measurement error) to 1 (all differences attributable to quality differences). A requisite for calculating reliability is the calculation of statistical "noise" for a particular outcome. Reliability is then defined as the ratio of signal to (signal + noise).13 To determine the reliability of each outcome measure, we used hierarchical logistic regression modeling. We defined signal as the variance of hospital random-effect intercepts in the logistic model after full adjustment for patient risk factors.10 We quantified a hospital's statistical noise by estimating that hospital's measurement error variance in the logistic regression model.6,14 The reliability of each hospital's risk-adjusted outcome rate was then calculated as signal/(signal + noise).

To assess the influence of caseload on reliability, we created hospital caseload (ie, cases reported by each hospital) terciles for each procedure. We then calculated the mean reliability of each outcome measure across caseload terciles. In further analysis, we quantified the number of hospitals with greater than 0.70 or 0.50 reliability for each outcome by procedure; a reliability of 0.70 is considered adequate for differentiating provider performance.6,15 Finally, we used the hospital-level random intercept variance in the hierarchical model, together with the total measurement error for each procedure group, to calculate the number of cases needed to achieve 0.70 and 0.50 reliability for each outcome across procedures.

We performed all statistical analyses using Stata, release 12 (StataCorp). The study protocol was reviewed and determined to be "not regulated" by the University of Michigan Institutional Review Board.

Results

There were 55 466 patients in 199 hospitals who underwent colon resection, pancreatic resection, laparoscopic gastric bypass, open ventral hernia repair, AAA repair, or lower extremity bypass procedures. Descriptive characteristics of the patients, unadjusted and adjusted outcome rates, and hospital caseload (ie, their collected cases) are presented in Table 1. Overall morbidity was the most frequent outcome across all procedures, ranging from 5.5% (laparoscopic gastric bypass) to 31.0% (pancreatic resection). Severe morbidity varied widely by procedure, ranging from 2.8% (laparoscopic gastric bypass) to 24.6% (pancreatic resection). Mortality was the least frequent outcome, ranging from 0.2% (laparoscopic gastric bypass) to 5.4% (AAA repair). Hospital caseload also varied widely across procedures (Table 1). Colon resection was the most commonly captured procedure, with hospitals averaging 114 cases per year, and pancreatic resection was the least commonly captured, with hospitals reporting a mean of 17 cases per year.

Table 1. Characteristics of Included Operations in American College of Surgeons National Surgical Quality Improvement Program, 2009

| Characteristic | Colon Resection | Pancreatic Resection | Laparoscopic Gastric Bypass | Ventral Hernia Repair | Abdominal Aortic Aneurysm Repair | Lower Extremity Bypass |
|---|---|---|---|---|---|---|
| No. of inpatient operations | 22 664 | 2804 | 8979 | 10 516 | 4746 | 5757 |
| No. of hospitals | 199 | 169 | 130 | 199 | 187 | 190 |
| Caseload, mean (range) | 114 (1-282) | 17 (1-205) | 69 (1-414) | 32 (1-185) | 25 (1-108) | 30 (1-124) |
| Age, mean (SD), y | 62.6 (15.4) | 62.2 (13.4) | 45.4 (11.3) | 56.4 (14.5) | 73.0 (9.4) | 67.4 (11.8) |
| Male sex, % | 46.8 | 48.0 | 21.6 | 41.7 | 79.3 | 63.8 |
| White race, % | 77.3 | 79.5 | 73.8 | 75.4 | 85.0 | 74.1 |
| >4 Comorbidities, % | 14.1 | 9.4 | 3.9 | 7.4 | 28.5 | 68.3 |
| Emergency operations, % | 17.2 | 1.3 | 0.0 | 6.7 | 11.7 | 6.4 |
| Open operations, % | 67.9 | 100.0 | 0.0 | 100.0 | 29.3 | 100.0 |
| 30-d mortality, unadjusted rate | 4.3 | 2.4 | 0.1 | 0.5 | 5.2 | 2.6 |
| 30-d mortality, risk-adjusted rate, mean (range) | 4.3 (0-20.1) | 4.9 (0-100) | 0.2 (0-10.8) | 0.5 (0-11.9) | 5.4 (0-89.1) | 2.8 (0-34.3) |
| 30-d severe morbidity, unadjusted rate | 18.2 | 23.8 | 2.7 | 5.7 | 14.8 | 16.4 |
| 30-d severe morbidity, risk-adjusted rate, mean (range) | 18.7 (2.6-97.7) | 24.6 (0-100) | 2.8 (0-18.8) | 5.9 (0-66.8) | 16.0 (0-98.8) | 16.0 (0-51.3) |
| 30-d morbidity, unadjusted rate | 26.8 | 32.1 | 5.4 | 9.7 | 18.3 | 23.5 |
| 30-d morbidity, risk-adjusted rate, mean (range) | 27.0 (5.5-89.7) | 31.0 (0-100) | 5.5 (0-30.2) | 10.0 (0-88.2) | 19.0 (0-91.8) | 23.7 (0-77.1) |

Mean reliability for each outcome across procedure types and hospital caseload is presented in Table 2 and graphically in the Figure. Mean reliability for overall morbidity, the most frequent outcome, ranged from 0.17 (lower extremity bypass) to 0.61 (colon resection). Mean reliability for severe morbidity ranged from 0.13 (laparoscopic gastric bypass) to 0.49 (colon resection). Mean reliability for mortality ranged from 0 (laparoscopic gastric bypass) to 0.39 (colon resection).

Reliability for each outcome depended on how frequently the event occurred, with more common outcomes having higher reliability (Figure). Mean reliability for infrequent events such as mortality was lower than that for more frequent events such as overall morbidity. For example, reliability for mortality following pancreatic resection (mean risk-adjusted mortality rate, 4.9%) was 0.06 and ranged from 0.01 in low-accrual hospitals to 0.13 in high-accrual hospitals. In contrast, reliability for overall morbidity following pancreatic resection (mean risk-adjusted overall morbidity rate, 31.0%) was 0.33 and ranged from 0.11 in low-accrual hospitals to 0.60 in high-accrual hospitals (Table 2). An exception to the trends we observed was lower extremity bypass, for which reliability for severe morbidity was higher than reliability for overall morbidity across hospital caseloads (Table 2).

Reliability was generally higher for more commonly captured procedures (Table 2). For example, mean reliability for overall morbidity was higher for common procedures such as colon resection (mean caseload, 114/y; mean reliability, 0.61) than for less commonly captured procedures such as AAA repair (mean caseload, 25/y; mean reliability, 0.29). This relationship persisted when comparing only the highest-volume hospitals: mean reliability for morbidity in high-volume hospitals was 0.75 for colon resection and 0.47 for AAA repair (Table 2). Moreover, reliability for all outcomes increased in a stepwise fashion as hospital caseload increased for all procedures (Figure). For example, reliability for overall morbidity following AAA repair (mean reliability, 0.29) ranged from 0.12 in low-caseload hospitals to 0.47 in high-caseload hospitals (Table 2). Pancreatic resection and laparoscopic gastric bypass showed the largest variation in outcome reliability across hospital caseloads. For example, mean reliability for severe morbidity following pancreatic resection ranged from 0.08 in low-caseload hospitals to 0.52 in high-caseload hospitals, and mean reliability for overall morbidity following laparoscopic gastric bypass ranged from 0.19 in low-caseload hospitals to 0.68 in high-caseload hospitals (Figure). An exception to this general trend was reliability for mortality following laparoscopic gastric bypass: all hospitals had reliability of zero for mortality regardless of caseload (Figure).

Table 2. Hospital Caseload Tercile Cutoffs and Mean Reliability for Outcomes by Procedure Type and Caseload Tercile

| Procedure (caseload, No. of cases per year) | Mean Reliability, 30-d Risk-Adjusted | All Hospitals | Low | Medium | High |
|---|---|---|---|---|---|
| Colon resection (145) | Mortality | 0.39 | 0.24 | 0.44 | 0.52 |
| | Severe morbidity | 0.49 | 0.31 | 0.53 | 0.62 |
| | Overall morbidity | 0.61 | 0.43 | 0.66 | 0.75 |
| Pancreatic resection (16) | Mortality | 0.06 | 0.01 | 0.04 | 0.13 |
| | Severe morbidity | 0.28 | 0.08 | 0.25 | 0.52 |
| | Overall morbidity | 0.33 | 0.11 | 0.32 | 0.60 |
| Laparoscopic gastric bypass (76) | Mortality | 0 | 0 | 0 | 0 |
| | Severe morbidity | 0.13 | 0.04 | 0.12 | 0.24 |
| | Overall morbidity | 0.45 | 0.19 | 0.48 | 0.68 |
| Ventral hernia repair (65) | Mortality | 0.09 | 0.03 | 0.09 | 0.13 |
| | Severe morbidity | 0.20 | 0.10 | 0.21 | 0.31 |
| | Overall morbidity | 0.29 | 0.16 | 0.30 | 0.43 |
| Abdominal aortic aneurysm repair (31) | Mortality | 0.02 | 0.01 | 0.02 | 0.03 |
| | Severe morbidity | 0.29 | 0.12 | 0.29 | 0.46 |
| | Overall morbidity | 0.29 | 0.12 | 0.29 | 0.47 |
| Lower extremity bypass (35) | Mortality | 0.15 | 0.06 | 0.15 | 0.26 |
| | Severe morbidity | 0.21 | 0.09 | 0.21 | 0.35 |
| | Overall morbidity | 0.17 | 0.07 | 0.16 | 0.29 |

Figure. Mean Reliability of Risk-Adjusted 30-Day Outcomes by Hospital Caseload Tercile and Procedure Type. A, Mortality; the mortality rate for laparoscopic gastric bypass was zero for all hospital caseloads. B, Severe morbidity. C, Any morbidity. [Three bar-chart panels showing mean reliability (0-0.8) for low-, medium-, and high-caseload terciles for each of the 6 procedures.]

Table 3 reports the proportion of hospitals that met 2 common reliability benchmarks for each outcome. For overall morbidity, the most frequent outcome, colon resection (37.7% of hospitals), pancreatic resection (7.1%), and laparoscopic gastric bypass (11.5%) were the only procedures for which any hospitals met a reliability threshold of 0.70, which is considered good.6 When assessing a reliability threshold of 0.50, which is considered fair, few hospitals met the benchmark for most procedures (Table 3). An exception was colon resection, for which 80.4% of hospitals met a 0.50 reliability threshold for overall morbidity. For lower-event-rate outcomes (ie, severe morbidity and mortality), fewer hospitals met reliability thresholds. Colon resection (2.5% of hospitals) and pancreatic resection (3.0% of hospitals) were the only procedures for which hospitals met a 0.70 reliability threshold for severe morbidity. Colon resection (1.5% of hospitals) was the only procedure for which hospitals met a 0.70 reliability threshold for mortality (Table 3). No hospitals met a reliability threshold of 0.70 for any outcome following ventral hernia repair, AAA repair, or lower extremity bypass.

Table 4 lists the calculated number of cases required to achieve reliability benchmarks for each outcome across procedures. In general, as outcomes became less frequent, hospitals would have to provide larger caseloads to achieve 0.50 or 0.70 reliability. For example, to meet 0.50 reliability for mortality, a hospital would have to perform 147 colon resections, 237 pancreatic resections, 520 ventral hernia repairs, 1342 AAA repairs, or 151 lower extremity bypass procedures (Table 4). With more frequent outcomes (overall morbidity), hospitals would require smaller caseloads to meet reliability thresholds.

Discussion

As quality measurement platforms are increasingly used for public reporting and value-based purchasing, it has never been more important to have reliable performance measures.5,16,17 Reliability is the most widely used indicator to assess an outcome's capability to detect differences in quality if they exist.13 This is analogous to a power calculation used to avoid type II errors (failure to detect a real difference between groups) in clinical trials. Just as a clinical trial needs a sufficient sample size and a large enough treatment effect to have adequate power, hospital outcome measurement requires both large enough caseloads and frequent enough adverse event rates to reliably capture quality differences.6 We have demonstrated that commonly used outcome measures have low reliability for hospital profiling for a diverse range of procedures. Hospital caseload was a strong driver of outcome reliability, with higher-caseload hospitals showing the most reliable outcomes. However, for infrequent outcomes, the number of submitted cases needed for adequate outcome reliability was much larger than most hospitals were able to provide.
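The signal-to-noise definition of reliability used throughout the study can be sketched numerically. The signal and caseload figures below are illustrative assumptions, not the study's estimates, and the noise term uses a standard binomial delta-method approximation on the log-odds scale rather than the paper's exact model-based estimator.

```python
# Sketch of reliability = signal / (signal + noise), with noise shrinking
# as caseload grows. Illustrative values only; not the study's estimates.
def reliability(signal_var: float, noise_var: float) -> float:
    """Reliability on a 0-1 scale: share of variation that is true signal."""
    return signal_var / (signal_var + noise_var)

def binomial_noise_var(event_rate: float, caseload: int) -> float:
    """Approximate per-hospital noise variance on the log-odds scale.

    Delta-method approximation for a binomial proportion; an assumption
    for illustration, not the paper's estimator.
    """
    return 1.0 / (caseload * event_rate * (1.0 - event_rate))

# A rare outcome (2% mortality) at a 25-case/year hospital vs a frequent
# outcome (27% morbidity) at a 114-case/year hospital, with a hypothetical
# between-hospital variance of 0.10 on the log-odds scale.
signal = 0.10
low = reliability(signal, binomial_noise_var(0.02, 25))
high = reliability(signal, binomial_noise_var(0.27, 114))
print(f"rare outcome, small caseload:   {low:.2f}")
print(f"common outcome, large caseload: {high:.2f}")
```

Even with identical true between-hospital variation, the rare-outcome/small-caseload case yields far lower reliability, mirroring the mortality-versus-morbidity pattern in Table 2.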


Table 3. Hospitals Meeting 0.70 and 0.50 Reliability for Postoperative Mortality, Severe Morbidity, and Morbidity^a

| Outcome | Colon Resection | Pancreatic Resection | Laparoscopic Gastric Bypass | Ventral Hernia Repair | Abdominal Aortic Aneurysm Repair | Lower Extremity Bypass |
|---|---|---|---|---|---|---|
| Mortality, hospitals with >0.70 reliability | 3 (1.5) | 0 | 0 | 0 | 0 | 0 |
| Mortality, hospitals with >0.50 reliability | 56 (28.1) | 0 | 0 | 0 | 0 | 0 |
| Severe morbidity, hospitals with >0.70 reliability | 5 (2.5) | 5 (3.0) | 0 | 0 | 0 | 0 |
| Severe morbidity, hospitals with >0.50 reliability | 114 (57.3) | 30 (17.8) | 0 | 0 | 20 (10.7) | 0 |
| Morbidity, hospitals with >0.70 reliability | 75 (37.7) | 12 (7.1) | 15 (11.5) | 0 | 0 | 0 |
| Morbidity, hospitals with >0.50 reliability | 160 (80.4) | 41 (24.2) | 59 (45.4) | 9 (4.5) | 19 (10.2) | 2 (1.1) |

^a Data are given as number (percentage).

Table 4. Hospital Caseload Requirements for Meeting 0.70 and 0.50 Reliability Thresholds for Overall Morbidity, Severe Morbidity, and Mortality

| Outcome | Colon Resection | Pancreatic Resection | Laparoscopic Gastric Bypass | Ventral Hernia Repair | Abdominal Aortic Aneurysm Repair | Lower Extremity Bypass |
|---|---|---|---|---|---|---|
| Mortality, No. of cases required for 0.70 reliability | 342 | 554 | ...^a | 1213 | 3133 | 352 |
| Mortality, No. of cases required for 0.50 reliability | 147 | 237 | ...^a | 520 | 1342 | 151 |
| Severe morbidity, No. of cases required for 0.70 reliability | 236 | 68 | 967 | 438 | 118 | 233 |
| Severe morbidity, No. of cases required for 0.50 reliability | 101 | 29 | 415 | 188 | 50 | 100 |
| Morbidity, No. of cases required for 0.70 reliability | 134 | 48 | 135 | 262 | 116 | 311 |
| Morbidity, No. of cases required for 0.50 reliability | 58 | 21 | 58 | 112 | 50 | 133 |

^a Unable to calculate from available data because of a lack of hospital-level random intercept variance.
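Caseload requirements like those in Table 4 follow directly from the reliability formula: if per-hospital noise falls as 1/n, solving reliability = signal/(signal + noise/n) for n gives the minimum caseload. A sketch with illustrative variance values (not the study's estimates):

```python
# Sketch: minimum caseload n satisfying signal/(signal + v/n) >= target,
# assuming noise variance scales as v/n. Variance inputs are illustrative.
import math

def cases_required(target_reliability: float, signal_var: float,
                   per_case_noise: float) -> int:
    """Smallest caseload n with signal/(signal + per_case_noise/n) >= target."""
    r = target_reliability
    n = (r / (1.0 - r)) * (per_case_noise / signal_var)
    return math.ceil(n)

# With hypothetical signal variance 0.1 and per-case noise 5.0:
print(cases_required(0.50, 0.1, 5.0))
print(cases_required(0.70, 0.1, 5.0))
```

One consequence is visible in Table 4: because r/(1 - r) equals 1.0 at a 0.50 threshold and about 2.33 at 0.70, the 0.70 requirement is roughly 2.3 times the 0.50 requirement in every cell (eg, 342 vs 147 colon resections for mortality).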

Our findings underscore the importance of carefully considering reliability when designing outcomes feedback programs for providers. There have been few studies6,7,10,15 assessing outcome measure reliability using claims or clinical registry data. Most have shown that many hospitals lack the caseloads to reliably detect differences in performance for certain outcomes in specific clinical populations. Dimick et al7 demonstrated that few hospitals met caseload requirements to detect meaningful differences from performance benchmarks following cardiovascular, pancreatic, esophageal, or neurosurgical procedures. In a study similar to ours, Kao et al10 used ACS-NSQIP data to evaluate the reliability of surgical site infection as a quality indicator following colon resection and found that only half of the hospitals examined had adequate caseloads to meet reliability benchmarks. The present study goes further, providing a comprehensive evaluation of the reliability of 3 commonly used outcomes across a collection of general and vascular procedures and highlighting the reliability problems that can occur with low caseloads and infrequent outcomes.

Outcomes with low reliability can mask both poor and outstanding performance relative to benchmarks. Hospitals with poor outcomes might assume they have no quality problems when they do (analogous to a type II error). Likewise, outcomes with low reliability may cause average (or well-performing) hospitals to be spuriously labeled as poor performers (analogous to a type I error: detecting a difference between groups when none exists). Without a formal assessment of outcome reliability, it is unclear whether a hospital's performance is the result of quality or whether it simply lacks an adequate caseload. When reporting outcomes, most quality reporting programs use P values and/or CIs to assign significance to a hospital's performance relative to benchmarks. However, these significance measures are often relegated to a footnote or dismissed. When hospitals act to investigate and amend a spuriously high outcome rate, they may direct resources to where they do not have a problem, a practice known as tampering in the quality improvement lexicon.18,19 Given the cost of maintaining and implementing quality improvement programs, hospitals have a vested interest in using highly reliable outcome measures to minimize misclassification and unnecessary spending.

There are 3 main strategies to improve the reliability of outcome measures. One approach is to increase the caseload by sampling 100% of certain procedures.20,21 An alternative approach gaining momentum is reliability adjustment. This technique has been discussed extensively elsewhere22 and is gaining traction in several statewide and national outcomes reporting programs. In brief, reliability adjustment uses empirical Bayes techniques to shrink a provider's risk-adjusted outcome rate toward the overall mean rate, according to the provider's caseload.23 Reliability adjustment


has been demonstrated11,24 to more accurately predict future hospital performance for both general surgical and vascular procedures. A third option is to increase reliability with composite quality indicators that combine quality signal from other measures and procedures within a hospital, such as outcomes from multiple related procedures, length of stay, and reoperation rate.23,25,26 Composite measures have been shown25 to more accurately predict future hospital performance than a single risk-adjusted outcome measure. Although these strategies are far from universal, they are gaining traction in some registries. For example, ACS-NSQIP has been among the leaders in implementing best practices to increase the reliability of outcome measures. Specifically, ACS-NSQIP now offers 100% sampling for certain procedures, uses hierarchical modeling and reliability adjustment for reporting outcomes, and has investigated composite measures for certain procedures for use in quality profiling.26

There are several important limitations to the present study. Our results may not be generalizable to clinical registries that already capture nearly 100% of their patients.22 However, even with 100% case capture, some hospitals that participate in clinical registries may not have the caseload for reliable benchmarking, especially for rare outcomes (eg, mortality) or uncommon procedures (eg, pancreatectomy). This underscores the importance of also using other methods for increasing reliability (eg, composite measures and reliability adjustment). Another limitation is that ACS-NSQIP may not be generalizable to all US hospitals because it oversamples larger teaching hospitals.
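The empirical Bayes shrinkage behind the reliability adjustment discussed above can be sketched in a few lines. The rates and reliability weights below are illustrative, not taken from the study.

```python
# Sketch of reliability adjustment (empirical Bayes shrinkage): a hospital's
# risk-adjusted rate is pulled toward the overall mean in proportion to
# (1 - reliability). Illustrative numbers only.
def reliability_adjust(hospital_rate: float, overall_mean: float,
                       reliability: float) -> float:
    """Weight the observed rate by its reliability; shrink the rest to the mean."""
    return reliability * hospital_rate + (1.0 - reliability) * overall_mean

# A low-caseload hospital with an extreme 12% rate against a 4% mean is
# pulled most of the way back; a high-reliability hospital keeps most of
# its observed rate.
print(reliability_adjust(0.12, 0.04, 0.2))  # low reliability, heavy shrinkage
print(reliability_adjust(0.12, 0.04, 0.8))  # high reliability, light shrinkage
```

The adjusted rate thus reflects how much of a hospital's apparent deviation from the mean is statistically trustworthy, which is why reliability-adjusted rates have been shown to better predict future performance.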

Conclusions

Currently, outcomes reported by many clinical registries may have low reliability for profiling hospital performance for most commonly performed general and vascular surgery procedures. Implementing procedure-targeted data collection and accounting for statistical reliability when reporting outcomes will better inform hospitals of where they stand relative to their peers. More broadly, providers and payers should consider strategies to improve reliability when using clinical registry data for performance profiling, such as 100% sampling of high-risk conditions, reliability adjustment for outcomes reporting, and use of composite measures. Such measures should give more insight into quality differences between providers and better target high-leverage areas for quality improvement.

ARTICLE INFORMATION

Accepted for Publication: July 15, 2013.

Published Online: March 12, 2014. doi:10.1001/jamasurg.2013.4249.

Author Contributions: Dr Dimick had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study concept and design: Krell, Dimick.
Acquisition, analysis, or interpretation of data: Krell, Hozain, Dimick.
Analysis and interpretation of data: All authors.
Drafting of the manuscript: Krell, Hozain, Dimick.
Critical revision of the manuscript for important intellectual content: All authors.
Statistical analysis: Krell, Hozain, Dimick.
Obtained funding: Dimick.
Administrative, technical, or material support: Dimick.
Study supervision: Dimick.

Conflict of Interest Disclosures: Dr Dimick has a financial interest in ArborMetrix, Inc, which had no role in the analysis herein. No other disclosures were reported.

Funding/Support: Dr Krell is supported by grant 5T32CA009672-22 from the National Institutes of Health.

Role of the Sponsor: The National Institutes of Health had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Disclaimer: The ACS-NSQIP and the hospitals participating in the ACS-NSQIP are the source of the original data and cannot verify or be held responsible for the statistical validity of the data analysis or the conclusions derived by the authors.

REFERENCES

1. Steinbrook R. Public report cards—cardiac surgery and beyond. N Engl J Med. 2006;355(18):1847-1849.

2. Cohen ME, Bilimoria KY, Ko CY, Hall BL. Development of an American College of Surgeons National Surgery Quality Improvement Program: morbidity and mortality risk calculator for colorectal surgery. J Am Coll Surg. 2009;208(6):1009-1016.

3. Daley J, Forbes MG, Young GJ, et al. Validating risk-adjusted surgical outcomes: site visit assessment of process and structure: National VA Surgical Risk Study. J Am Coll Surg. 1997;185(4):341-351.

4. Campbell DA Jr, Henderson WG, Englesbe MJ, et al. Surgical site infection prevention: the importance of operative duration and blood transfusion—results of the first American College of Surgeons–National Surgical Quality Improvement Program Best Practices Initiative. J Am Coll Surg. 2008;207(6):810-820.

5. Lindenauer PK, Remus D, Roman S, et al. Public reporting and pay for performance in hospital quality improvement. N Engl J Med. 2007;356(5):486-496.

6. Adams JL, Mehrotra A, Thomas JW, McGlynn EA. Physician cost profiling—reliability and risk of misclassification. N Engl J Med. 2010;362(11):1014-1021.

7. Dimick JB, Welch HG, Birkmeyer JD. Surgical mortality as an indicator of hospital quality: the problem with small sample size. JAMA. 2004;292(7):847-851.

8. Russell EM, Bruce J, Krukowski ZH. Systematic review of the quality of surgical mortality monitoring. Br J Surg. 2003;90(5):527-532.

9. Campbell DA Jr, Englesbe MJ, Kubus JJ, et al. Accelerating the pace of surgical quality improvement: the power of hospital collaboration. Arch Surg. 2010;145(10):985-991.

10. Kao LS, Ghaferi AA, Ko CY, Dimick JB. Reliability of superficial surgical site infections as a hospital quality measure. J Am Coll Surg. 2011;213(2):231-235.

11. Osborne NH, Ko CY, Upchurch GR Jr, Dimick JB. The impact of adjusting for reliability on hospital quality rankings in vascular surgery. J Vasc Surg. 2011;53(1):1-5.

12. Shiloach M, Frencher SK Jr, Steeger JE, et al. Toward robust information: data quality and inter-rater reliability in the American College of Surgeons National Surgical Quality Improvement Program. J Am Coll Surg. 2010;210(1):6-16.

13. Adams JL. The Reliability of Provider Profiling: A Tutorial. Santa Monica, CA: RAND Corp; 2009. http://www.rand.org/pubs/technical_reports/TR653. Accessed October 15, 2012.

14. Hosmer DW, Lemeshow S. Confidence interval estimates of an index of quality performance based on logistic regression models. Stat Med. 1995;14(19):2161-2172.

15. Scholle SH, Roski J, Adams JL, et al. Benchmarking physician performance: reliability of individual and composite measures. Am J Manag Care. 2008;14(12):833-838.

16. Calikoglu S, Murray R, Feeney D. Hospital pay-for-performance programs in Maryland produced strong results, including reduced hospital-acquired conditions. Health Aff (Millwood). 2012;31(12):2649-2658.

17. Faber M, Bosch M, Wollersheim H, Leatherman S, Grol R. Public reporting in health care: how do consumers use quality-of-care information? a systematic review. Med Care. 2009;47(1):1-8.

18. Wan TTH, Connell AM. Total quality management and continuous quality improvement. In: Wan TTH, Connell AM. Monitoring the Quality of Health Care: Issues and Scientific Approaches. New York, NY: Springer; 2003:143-158.

19. Cheung YY, Jung B, Sohn JH, Ogrinc G. Quality initiatives: statistical control charts: simplifying the analysis of data for quality improvement. Radiographics. 2012;32(7):2113-2126.

20. Birkmeyer JD, Shahian DM, Dimick JB, et al. Blueprint for a new American College of Surgeons: National Surgical Quality Improvement Program. J Am Coll Surg. 2008;207(5):777-782.

21. Hendren S, Fritze D, Banerjee M, et al. Antibiotic choice is independently associated with risk of surgical site infection after colectomy: a population-based cohort study. Ann Surg. 2013;257(3):469-475.

22. Birkmeyer NJ, Dimick JB, Share D, et al; Michigan Bariatric Surgery Collaborative. Hospital complication rates with bariatric surgery in Michigan. JAMA. 2010;304(4):435-442.

23. Dimick JB, Staiger DO, Hall BL, Ko CY, Birkmeyer JD. Composite measures for profiling hospitals on surgical morbidity. Ann Surg. 2013;257(1):67-72.

24. Dimick JB, Ghaferi AA, Osborne NH, Ko CY, Hall BL. Reliability adjustment for reporting hospital outcomes with surgery. Ann Surg. 2012;255(4):703-707.

25. Dimick JB, Staiger DO, Osborne NH, Nicholas LH, Birkmeyer JD. Composite measures for rating hospital quality with major surgery. Health Serv Res. 2012;47(5):1861-1879.

26. Merkow RP, Hall BL, Cohen ME, et al. Validity and feasibility of the American College of Surgeons colectomy composite outcome quality measure. Ann Surg. 2013;257(3):483-489.

JAMA Surgery May 2014 Volume 149, Number 5

Copyright 2014 American Medical Association. All rights reserved.

Invited Commentary

Understanding the Reliability of American College of Surgeons National Surgical Quality Improvement Program as a Quality Comparator Kim F. Rhoads, MD, MPH; Sherry M. Wren, MD

Related article page 467

Value-based purchasing and pay-for-performance programs are critical to the success of the Affordable Care Act.1 Despite the imminent implementation of the policy, robust indicators of hospital surgical quality have not been well described. Krell and colleagues2 elegantly assess the reliability of American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP) measures of morbidity, severe morbidity, and mortality for 6 major general and vascular surgical procedures. The investigators found that the measures fail to meet even the lowest threshold of reliability, resulting in part from inadequate case volumes, even when most hospitals would otherwise be considered high volume. The findings are cautionary for ranking systems that use observed to expected ratios as a surrogate for surgical quality.

Reliability describes the confidence with which a measure can distinguish one hospital from another. It depends on case volume, measurement error or statistical noise, and, most important, the presence of true differences between the entities being compared.3 In other words, if hospitals in the cohort are too similar with respect to the outcome of interest, it will be nearly impossible to detect differences between them without increasing case volumes. This explains the impossibly high volumes deemed necessary to reach an acceptable threshold of reliability for mortality in pancreatic resection, ventral hernia, and abdominal aortic aneurysm repairs.

Although the measures may fail to reliably detect differences in quality between the hospitals currently enrolled in ACS-NSQIP, dismissing the measures outright may be premature. In 2009, fewer than 5% (199 of more than 4000) of all US hospitals participated in ACS-NSQIP.4 In Wisconsin, for example, only 5 of 123 "general medical and surgical" hospitals participate in ACS-NSQIP.5 Therefore, until the hospital cohort reflects the well-documented variation that occurs across the country,6 quality as determined by ACS-NSQIP should be interpreted with healthy skepticism.

ARTICLE INFORMATION

Author Affiliations: Department of Surgery, Stanford University School of Medicine, Stanford, California (Rhoads, Wren); Department of Surgery, Veterans Affairs Palo Alto Health Care System, Palo Alto, California (Wren).

Corresponding Author: Sherry M. Wren, MD, Department of Surgery, Veterans Affairs Palo Alto Health Care System, 3801 Miranda Ave, Ste G112, Palo Alto, CA 94304 ([email protected]).

Published Online: March 12, 2014. doi:10.1001/jamasurg.2013.4253.

Conflict of Interest Disclosures: None reported.

REFERENCES

1. The Affordable Care Act: lowering Medicare costs by improving care. Centers for Medicare and Medicaid Services website. http://www.cms.gov/apps/files/aca-savings-report-2012.pdf. Accessed August 3, 2013.

2. Krell RW, Hozain A, Kao LS, Dimick JB. Reliability of risk-adjusted outcomes for profiling hospital surgical quality [published online March 12, 2014]. JAMA Surg. doi:10.1001/jamasurg.2013.4249.

3. Adams JL. The Reliability of Provider Profiling: A Tutorial. RAND Corporation website. 2009. http://www.rand.org/pubs/technical_reports/TR653. Accessed August 3, 2013.

4. State health facts: total hospitals. Henry J. Kaiser Family Foundation website. http://kff.org/other/state-indicator/total-hospitals/. Accessed August 3, 2013.

5. Wisconsin hospitals. Wisconsin Hospital Association Inc website. http://www.wha.org/wisconsin-hospitals.aspx. Accessed August 4, 2013.

6. Birkmeyer JD, Sharp SM, Finlayson SR, Fisher ES, Wennberg JE. Variation profiles of common surgical procedures. Surgery. 1998;124(5):917-923.
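The volume-reliability relationship the commentary invokes can be made concrete by solving the same signal-to-noise formula for the caseload needed to reach a reliability threshold. This is a sketch under assumed values: the between-hospital variances `tau2` are hypothetical, and only the 18.3% abdominal aortic aneurysm repair event rate comes from the original article's abstract.

```python
# Sketch: minimum annual caseload n to reach a reliability threshold R,
# solving R = tau2 / (tau2 + p*(1-p)/n) for n. The tau2 values below are
# hypothetical; the point is that required volume explodes as true
# between-hospital variation shrinks, as the commentary argues.
import math

def min_caseload(event_rate: float, tau2: float, threshold: float = 0.70) -> int:
    """Smallest caseload at which reliability reaches the threshold."""
    noise_per_case = event_rate * (1.0 - event_rate)  # binomial variance per case
    n = (threshold / (1.0 - threshold)) * noise_per_case / tau2
    return math.ceil(n)

p = 0.183  # AAA repair event rate from the abstract
for tau2 in (0.0025, 0.001, 0.0005):
    print(f"tau2={tau2}: >= {min_caseload(p, tau2)} cases/yr for reliability 0.70")
```

Halving the assumed true between-hospital variance roughly doubles the caseload required, which illustrates why hospitals that look too similar on an outcome would need implausibly high volumes before the measure could reliably separate them.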
