SPINE Volume 39, Number 16, pp 1311-1312 ©2014, Lippincott Williams & Wilkins

JOURNAL CLUB

Understanding the Statistics and Limitations of Large Database Analyses

Hiroyuki Yoshihara, MD, PhD,* and Daisuke Yoneoka, MS†

Many articles have been published analyzing large databases, such as national databases. In this report, we offer several points that readers should be aware of to correctly interpret articles analyzing large databases, using examples from Nationwide Inpatient Sample (NIS) studies. It has been reported that large databases have limitations, such as being restricted to in-hospital data and being subject to coding bias. It is true that many databases are limited to in-hospital events; consequently, the true incidence rates of complications and mortality may be underestimated. In addition, such databases do not include clinical outcomes such as neurological or functional assessments. With regard to coding, data extraction at the participating sites is performed by professionals who presumably have no "monetary risk" regarding patient care; thus, the accuracy of data extraction is likely to be high. However, coding bias can be more evident for minor and newly introduced procedures. For example, the authors analyzed the utilization of neuromonitoring in spine surgery using the NIS. Coding for this procedure began in mid-2007. The analysis revealed that neuromonitoring was coded in only 11.1% of spinal fusion procedures for pediatric patients (age ≤17 yr) with idiopathic scoliosis in 2008 and 2009. Spinal fusion for pediatric idiopathic scoliosis usually requires correction of the curve, and neuromonitoring is mandatory for such surgery. Therefore, these data clearly demonstrate coding bias, and the true incidence was underestimated.
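As an illustration of how such a utilization figure is obtained, the sketch below flags an NIS-style discharge extract by procedure and diagnosis codes and computes the share of cases in which a neuromonitoring code appears. This is a minimal sketch only: the file name, column layout, and ICD-9-CM code sets are hypothetical placeholders, not the codes used in the analysis described above.

```python
# Minimal sketch: estimating coded procedure utilization from a discharge-level
# administrative extract. The file name, column layout, and ICD-9-CM code sets
# below are hypothetical placeholders, not the codes used in the NIS analysis.
import pandas as pd

# One row per discharge; diagnosis and procedure codes flattened into
# DX1..DXn and PR1..PRn columns, as in NIS-style files.
df = pd.read_csv("discharges.csv", dtype=str)

FUSION_CODES = {"8107", "8108"}      # placeholder spinal fusion procedure codes
MONITORING_CODE = "0094"             # placeholder intraoperative neuromonitoring code
SCOLIOSIS_CODES = {"73730"}          # placeholder idiopathic scoliosis diagnosis codes

pr_cols = [c for c in df.columns if c.startswith("PR")]
dx_cols = [c for c in df.columns if c.startswith("DX")]

# Identify pediatric idiopathic scoliosis fusion cases.
is_fusion = df[pr_cols].isin(FUSION_CODES).any(axis=1)
is_scoliosis = df[dx_cols].isin(SCOLIOSIS_CODES).any(axis=1)
is_pediatric = pd.to_numeric(df["AGE"], errors="coerce") <= 17
cases = df[is_fusion & is_scoliosis & is_pediatric]

# Utilization = share of identified cases in which the monitoring code appears.
monitored = cases[pr_cols].isin({MONITORING_CODE}).any(axis=1)
print(f"Coded neuromonitoring utilization: {monitored.mean():.1%} of {len(cases)} cases")
```

The numerator in such a calculation counts only what was coded: if a routinely performed procedure is rarely coded, the computed utilization understates the true rate, which is exactly the coding bias described above.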

From the *Department of Orthopaedic Surgery & Rehabilitation Medicine, SUNY Downstate Medical Center, Brooklyn, NY; and †Department of Statistical Science, School of Advanced Sciences, The Graduate University for Advanced Studies, Tokyo, Japan.
Acknowledgment date: November 10, 2013. First revision date: March 16, 2014. Second revision date: March 19, 2014. Acceptance date: March 20, 2014.
The manuscript submitted does not contain information about medical device(s)/drug(s). No funds were received in support of this work. No relevant financial activities outside the submitted work.
Address correspondence and reprint requests to Hiroyuki Yoshihara, MD, PhD, SUNY Downstate Medical Center, 450 Clarkson Ave, Brooklyn, NY 11203; E-mail: [email protected]
DOI: 10.1097/BRS.0000000000000352

Furthermore, coding of the number of fused levels began at the end of 2003 and must be reported whenever spinal fusion is performed; nevertheless, the number of fused levels was reported in only 95.4% of spinal fusion procedures between 2004 and 2009.

How comorbidities and complications are defined from diagnosis codes is another important aspect of large database studies, because both are usually included in the same data set. For example, databases such as the NIS, Medicare, and the Kids' Inpatient Database do not have specifically isolated comorbidity data, whereas the National Trauma Data Bank does. Two established comorbidity scores are frequently used in large database studies based on International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes: the Elixhauser comorbidity index and the Charlson comorbidity index (Deyo index).1–3 The Elixhauser measure includes 30 comorbidities, and the Charlson measure includes 17. The Elixhauser comorbidity measures were developed for use with large administrative inpatient data sets, using the California Statewide Inpatient Database.1 Comorbidities that were infrequent or statistically unrelated to length of stay, total charges, or in-hospital mortality were excluded, and complications were omitted by excluding ICD-9-CM codes that reflect acute conditions. The Elixhauser measures were further verified by the federal Agency for Healthcare Research and Quality (AHRQ) as comorbidities rather than postoperative conditions; the AHRQ comorbidity measure contains more ICD-9-CM codes than the original Elixhauser coding algorithm but excludes cardiac arrhythmias from the list of comorbidities. In contrast, the original Charlson index was developed for the purpose of prospectively predicting 1-year mortality among patients being considered for breast cancer clinical trials2 and was later translated into a set of ICD-9-CM codes by Deyo et al.3 In its scoring, the codes for acquired immunodeficiency syndrome (AIDS) and metastatic solid tumor receive the greatest weight; however, given the advances in AIDS treatment, AIDS is clearly no longer the strongest predictor of mortality. In addition, the Charlson coding for myocardial infarction includes acute myocardial infarction, which can be a complication after surgery.
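To make the mechanics of these indices concrete, the sketch below derives comorbidity flags and a Charlson-style weighted score from a discharge's ICD-9-CM diagnosis codes. The code-to-comorbidity mapping is a small illustrative excerpt chosen for this example, not the full published Elixhauser or Deyo/Charlson coding algorithm, and the weights are shown only to convey how the weighting works.

```python
# Minimal sketch of comorbidity flagging from ICD-9-CM diagnosis codes.
# The prefix mapping and weights below are a tiny illustrative excerpt,
# not the full Elixhauser or Deyo/Charlson coding algorithms.
from typing import Dict, Iterable

COMORBIDITY_PREFIXES = {
    "congestive_heart_failure": ("428",),
    "uncomplicated_diabetes": ("2500", "2501", "2502", "2503"),
    "renal_disease": ("585",),
    "metastatic_solid_tumor": ("196", "197", "198"),
}

# Charlson-style weights (metastatic disease is weighted most heavily here).
WEIGHTS = {"congestive_heart_failure": 1, "uncomplicated_diabetes": 1,
           "renal_disease": 2, "metastatic_solid_tumor": 6}

def comorbidity_flags(dx_codes: Iterable[str]) -> Dict[str, bool]:
    """Flag a comorbidity if any diagnosis code starts with one of its prefixes."""
    codes = [c.replace(".", "") for c in dx_codes if c]
    return {name: any(code.startswith(prefixes) for code in codes)
            for name, prefixes in COMORBIDITY_PREFIXES.items()}

def charlson_style_score(flags: Dict[str, bool]) -> int:
    """Weighted sum of flagged comorbidities (Charlson-style). Elixhauser
    measures are typically kept as separate indicator variables instead."""
    return sum(WEIGHTS[name] for name, present in flags.items() if present)

dx = ["428.0", "250.00", "198.5"]              # example secondary diagnoses
flags = comorbidity_flags(dx)
print(flags)
print("score:", charlson_style_score(flags))   # 1 + 1 + 6 = 8
```

Whether a given code is treated as a comorbidity or as a potential in-hospital complication (the acute myocardial infarction example above) is a property of the chosen code list, which is why the provenance of the algorithm matters when interpreting such scores.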


With regard to complications, previous studies analyzed deep venous thrombosis (DVT) as a complication after spinal surgery.4–6 However, until recently it was not possible to distinguish by code between DVT present on admission and DVT that occurred during the hospitalization; therefore, the incidence of DVT as a complication was overestimated. In 2009, new ICD-9-CM codes were created and existing code titles were revised to help distinguish between acute and chronic DVT. Furthermore, additional issues may arise in future longitudinal studies that involve both ICD-9-CM and International Classification of Diseases, Tenth Revision, Clinical Modification, codes.

Another important subject is the meaning of the P value in large samples. It is still debatable to what degree a large sample size makes the P value meaningless for detecting clinically significant predictors, compared with the alternative of relying only on the practical importance of the results. Statisticians have long been aware of the pitfalls of the P value since Fisher7 adopted P = 0.05 as a reference point for rejecting the null hypothesis in 1925. Readers of previous NIS analyses will frequently encounter "P < 0.001." When large samples are analyzed, even small differences are highly likely to be detected with "P < 0.001" because statistical power increases with sample size. The confidence interval (CI) is likewise affected by sample size: as the sample size increases, the width of the CI decreases. Therefore, a larger sample leads to a smaller P value and a higher likelihood of rejecting the null hypothesis. Despite this tendency, statistical significance is usually still defined as "P < 0.05." When an association is significant under this conventional threshold in a large-sample analysis, it remains controversial whether it is clinically important. To address this issue, other interpretations have been used in previous NIS studies. Ma et al8 used a P value of 0.001 to define significant differences. Passias et al9 decided not to rely only on the conventional threshold of statistical significance (i.e., P < 0.05) to draw conclusions; instead, they used the 95% CI as a measure of effect size and let readers evaluate the significance of the findings. Effect size is largely robust to sample size and can therefore be regarded as the true magnitude of an effect; however, except in some cases in which test statistics are reported, it is difficult to justify converting a 95% CI into an effect size. Therefore, as recommended by many guidelines, authors should report both effect sizes and CIs. In addition, several other methods have been proposed in other fields. Yao et al10 rebuilt their statistical test using 50%, 10%, and 1% random samples of the main data and confirmed the robustness of the test; using this method, it is possible to test at sample sizes closer to those encountered at the clinical frontier.
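The following simulation, which uses fabricated data rather than any NIS extract, illustrates both points: with NIS-scale samples, a clinically trivial difference in complication rates produces a very small P value, whereas the risk difference with its 95% CI conveys how small the effect actually is; a Yao-style subsampling check then repeats the test on 50%, 10%, and 1% random subsamples. The group sizes and event rates are arbitrary assumptions.

```python
# Sketch: why very large samples make tiny effects "significant", and how an
# effect size with a 95% CI plus a subsampling check aid interpretation.
# Simulated data only; the rates and sample sizes are arbitrary assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500_000                      # discharges per group (NIS-scale assumption)
p_a, p_b = 0.030, 0.032          # complication rates differing by 0.2 percentage points

group_a = rng.binomial(1, p_a, n)
group_b = rng.binomial(1, p_b, n)

def two_proportion_summary(x, y):
    """Return the chi-square P value, the risk difference, and its 95% CI."""
    table = np.array([[x.sum(), len(x) - x.sum()],
                      [y.sum(), len(y) - y.sum()]])
    _, p_value, _, _ = stats.chi2_contingency(table)
    pa, pb = x.mean(), y.mean()
    diff = pb - pa
    se = np.sqrt(pa * (1 - pa) / len(x) + pb * (1 - pb) / len(y))
    return p_value, diff, (diff - 1.96 * se, diff + 1.96 * se)

p, diff, ci = two_proportion_summary(group_a, group_b)
print(f"Full sample: P = {p:.2g}, risk difference = {diff:.4f}, "
      f"95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")

# Subsampling check in the spirit of Yao et al: repeat the test on random
# 50%, 10%, and 1% subsamples to see whether "significance" survives at
# sample sizes closer to those seen in single-center clinical series.
for frac in (0.5, 0.1, 0.01):
    k = int(n * frac)
    sub_a = group_a[rng.choice(n, k, replace=False)]
    sub_b = group_b[rng.choice(n, k, replace=False)]
    p_sub, diff_sub, _ = two_proportion_summary(sub_a, sub_b)
    print(f"{frac:>4.0%} subsample: P = {p_sub:.3f}, risk difference = {diff_sub:.4f}")
```

With these inputs, the full-sample test reports P well below 0.001 even though the absolute risk difference is only about 0.2 percentage points and the 95% CI hugs that small value; the 1% subsample generally fails to reach conventional significance, which is the contrast the subsampling approach is meant to expose.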


More formally, a generalized χ2 goodness-of-fit test based on the central limit theorem and the normal approximation has also been proposed.11 This generalization can be extended to other statistical tests and allows the P value to be considered in the same context as with small samples. Although there are many methods of correcting the P value, the 95% CI as a measure of effect size is currently the most common way of conveying clinical relevance in large-sample analyses. Because the NIS does not contain experimental data amenable to a priori power analysis and has no procedure for determining an optimal sample size, researchers should always consider these techniques to report more meaningful results.

Although this piece focused on studies in which data were gathered from the NIS, the principles discussed are relevant to all large databases and registries. Such studies are important because they analyze large samples and include data from surgeons of all experience levels. We hope the points mentioned earlier will help readers more clearly understand large database studies.

References

1. Elixhauser A, Steiner C, Harris DR, et al. Comorbidity measures for use with administrative data. Med Care 1998;36:8–27.
2. Charlson ME, Pompei P, Ales KL, et al. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis 1987;40:373–83.
3. Deyo RA, Cherkin DC, Ciol MA. Adapting a clinical comorbidity index for use with ICD-9-CM administrative databases. J Clin Epidemiol 1992;45:613–9.
4. Kalanithi PS, Patil CG, Boakye M. National complication rates and disposition after posterior lumbar fusion for acquired spondylolisthesis. Spine 2009;34:1963–9.
5. Kalanithi PA, Arrigo R, Boakye M. Morbid obesity increases cost and complication rates in spinal arthrodesis. Spine 2012;37:982–8.
6. Oglesby M, Fineberg SJ, Patel AA, et al. The incidence and mortality of thromboembolic events in cervical spine surgery. Spine 2013;38:E521–7.
7. Fisher RA. Statistical Methods for Research Workers. Edinburgh, Scotland: Oliver & Boyd; 1925.
8. Ma Y, Passias P, Gaber-Baylis LK, et al. Comparative in-hospital morbidity and mortality after revision versus primary thoracic and lumbar spine fusion. Spine J 2010;10:881–9.
9. Passias PG, Ma Y, Chiu YL, et al. Comparative safety of simultaneous and staged anterior and posterior spinal surgery. Spine 2012;37:247–55.
10. Yao Y, Dresner M, Palmer J. Private network EDI vs. Internet electronic markets: a direct comparison of fulfillment performance. Manage Sci 2009;55:843–52.
11. McLaren CE, Legler JM, Brittenham GM. The generalized χ2 goodness-of-fit test. Statistician 1994;43:247–58.
