Accurate Identification of Fatty Liver Disease in Data Warehouse Utilizing Natural Language Processing.

Dig Dis Sci DOI 10.1007/s10620-017-4721-9

ORIGINAL ARTICLE

Joseph S. Redman1 • Yamini Natarajan1,2 • Jason K. Hou1,2 • Jingqi Wang3 • Muzammil Hanif2 • Hua Feng2 • Jennifer R. Kramer2 • Roxanne Desiderio2 • Hua Xu3 • Hashem B. El-Serag1,2 • Fasiha Kanwal1,2

Received: 14 June 2017 / Accepted: 10 August 2017
© Springer Science+Business Media, LLC (Outside the USA) 2017

Abstract
Introduction Natural language processing is a powerful technique of machine learning capable of maximizing data extraction from complex electronic medical records.
Methods We utilized this technique to develop algorithms capable of "reading" full-text radiology reports to accurately identify the presence of fatty liver disease. Abdominal ultrasound, computerized tomography, and magnetic resonance imaging reports were retrieved from the Veterans Affairs Corporate Data Warehouse from a random national sample of 652 patients. Radiographic fatty liver disease was determined by manual review by two physicians and verified with an expert radiologist. A split validation method was utilized for algorithm development.
Results For all three imaging modalities, the algorithms could identify fatty liver disease with >90% recall and precision, with F-measures >90%.

The views expressed in this article are those of the author(s) and do not necessarily represent the views of the Department of Veterans Affairs.

Correspondence: Yamini Natarajan [email protected]; Fasiha Kanwal [email protected]

1 Baylor College of Medicine, Houston, TX, USA
2 Clinical Epidemiology and Comparative Effectiveness Program, Center for Innovations in Quality, Effectiveness and Safety, Michael E. DeBakey VA Medical Center, John P. McGovern Campus, 2450 Holcombe Blvd., Suite 01Y, Houston, TX 77021, USA
3 School of Biomedical Informatics, UT Health, Houston, TX, USA

Discussion These algorithms could be used to rapidly screen patient records to establish a large cohort to facilitate epidemiological and clinical studies and examine the clinical course and outcomes of patients with radiographic hepatic steatosis.

Keywords Natural language processing • Electronic health records • Fatty liver • Nonalcoholic fatty liver disease • Triglycerides • Epidemiology

Introduction

Nonalcoholic fatty liver disease (NAFLD) is the most common chronic liver disease worldwide. The prevalence of NAFLD in the USA has more than doubled in the last twenty years in all age groups, occurring in about 11% of children and 20–35% of adults [1, 2]. Almost a quarter of patients with NAFLD develop progressive disease that is associated with nonalcoholic steatohepatitis (NASH), cirrhosis, and related complications, including hepatocellular carcinoma. Identifying patients with possible NAFLD in electronic health records is important for clinical practice as well as research purposes.

Medical records contain information with varying accuracy and completeness for the diagnosis of NAFLD. Liver biopsy, the longstanding gold standard for NAFLD detection, is associated with complications, sampling errors, and suboptimal reproducibility, and it is impractical to apply to large groups of patients. Given this, most clinicians rely on readily available noninvasive test criteria, such as evidence of steatosis on abdominal imaging with or without elevated blood levels of alanine aminotransferase (ALT), to detect NAFLD.

Cross-sectional abdominal imaging is commonly used in clinical practice to identify patients with fatty liver disease (FLD). Ultrasonography (US) is a well-established imaging modality with reported sensitivity of 84.8% and specificity of 93.6% for detecting moderate to severe steatosis or FLD [3]. Computed tomography (CT), although not recommended for routine screening for FLD, has reasonable accuracy for detecting FLD when performed for other indications, with reported sensitivity and specificity of 73–82 and 91–100%, respectively [4]. Magnetic resonance imaging (MRI) techniques are now regarded as the most accurate methods of measuring liver fat in clinical practice, with sensitivity and specificity of 96 and 93%, respectively [5]. Therefore, imaging tests are likely the most valid, complete, and reproducible of the FLD diagnostic tests available in the electronic medical record (EMR).

Natural language processing (NLP) is a computational linguistics technology capable of identifying specific phrases or findings from text reports and incorporating language cues and context in order to interpret its findings [6]. NLP is capable of mining large databases to rapidly identify desired clinical information. We have previously implemented NLP to successfully determine colorectal adenoma detection rates [7] and accurately interpret pathology reports to diagnose inflammatory bowel disease [8]. However, there have been limited efforts to apply this method to identify patients with radiographic evidence of NAFLD in electronic health records. Using NLP, we developed and validated algorithms to identify NAFLD in radiology reports of US, CT, and MRI scans.

Methods

Data Source

We used electronic medical record (EMR) data from the Veterans Affairs (VA) Corporate Data Warehouse (CDW), a national-level database that houses radiology reports from nearly 8 million veterans nationwide. CDW stores radiology raw data files that house all reports from radiology tests performed in the VA nationwide from 1984 (CDW's inception) to present. This database includes demographic information, type of study, date and time of study, report and impression, all linked through a unique identification number that can be used to identify different studies from an individual patient.

Study Subjects

Radiology reports of abdominal US, CT, or MRI for any indication in CDW were collected from a random national sample of 1000 unique VA patients who received abdominal imaging tests between 2000 and 2016. Radiology reports were identified by searching keyword pairings of "CT," "MRI," "magnetic resonance imaging," "ultrasound," and "sonogram" with "abdomen," "abd," and "liver." After removing duplicate reports generated by redundancy in search terms, we excluded 348 individuals because they did not have appropriate abdominal imaging or only had image reports that referenced imported non-VA images. Thus, a total of 1199 reports remained from 652 individuals (Fig. 1).

Manual Classification

The presence of FLD in each report was first determined by manual evaluation of all reports by two physicians (J.R., Y.N.) and verified with an expert radiologist (M.H.) in cases of ambiguity. The diagnosis of FLD was determined based on explicit criteria, defined a priori, and based on input from an experienced radiologist (M.H.) and hepatologist (F.K.). Specifically, we classified reports as having definite evidence of FLD if they documented (1) steatosis or fatty liver disease, (2) diffusely increased echogenicity on US without any other apparent pathology or evidence of cirrhosis, (3) diffusely decreased liver density or Hounsfield units <50 on CT, or (4) signal dropout greater than 15% in out-of-phase compared to in-phase sequences on MRI. Multiple reports from single patients were linked by unique EMR identifiers, enabling population-level FLD prevalence calculations and an assessment of report concordance.

We classified reports as "indeterminate but likely FLD" if they mentioned diffuse enhancement in the liver but also listed multiple etiologies as the possible underlying cause (e.g., fibrosis), or used uncommon descriptions such as enhanced echotexture, echoarchitecture, or coarsening. We classified reports as "indeterminate but unlikely FLD" when the liver showed some features consistent with fatty infiltration but the pattern of deposition or the context of the report made other diagnostic explanations more likely. Reports lacking mention of hepatic steatosis or any of the a priori archetypal features were deemed negative. All reports were included for algorithm training and validation.

Both reviewers reviewed a 10% sample of the above reports to ensure internal consistency of the review. Any disagreements were reviewed by a third reviewer and a radiologist (M.H., F.K.) and resolved by discussion. The two reviewers then separately reviewed the remaining samples. All indeterminate cases were adjudicated after discussion with the radiologist; approximately 150 reports were discussed, with less than 10% requiring revision in assessment, typically downgrading the FLD classification when reports were more suggestive of anatomical variance (e.g., local fatty infiltration) or fibrosis.
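The a priori criteria above can be illustrated as simple phrase rules. The Python sketch below is not the CLAMP pipeline used in the study: the regular expressions, the negation cues, and the function name `classify_report` are illustrative assumptions, and neither the indeterminate categories nor the exclusion of cirrhosis or other pathology is modeled.

```python
import re

# Hypothetical phrase patterns loosely based on the a priori criteria
# (steatosis / fatty liver wording, diffusely increased echogenicity on US).
FLD_PATTERNS = [
    r"steatosis",
    r"fatty (liver|infiltration)",
    r"diffuse(ly)? increased echogenicity",
]
# Hypothetical negation cues; a cue only counts if it precedes the finding
# within the same sentence.
NEGATION = r"\b(no|without|negative for)\b"


def classify_report(text: str) -> bool:
    """Return True if any sentence asserts an FLD phrase with no preceding negation cue."""
    for sentence in re.split(r"[.;\n]+", text.lower()):
        for pattern in FLD_PATTERNS:
            match = re.search(pattern, sentence)
            if match and not re.search(NEGATION, sentence[:match.start()]):
                return True
    return False
```

A rule set of this kind would still need to be tuned on a development set and scored against manual review before its output could be trusted.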

Fig. 1 Split validation method for NLP algorithm development and testing. Flow shown in the figure:
1000 veterans screened for abdominal image reports
348 individuals without readable reports; 1199 image reports retrieved from the remaining 652 patients
Presence of fatty liver disease determined manually by 2 physicians
Ultrasound (n = 407): 271 reports for algorithm development, 136 reports for testing
CT scan (n = 741): 494 reports for algorithm development, 247 reports for testing
MRI (n = 51): 34 reports for algorithm development, 17 reports for testing
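The split in Fig. 1 can be reproduced in outline with a seeded shuffle. This is only a sketch: the study does not describe its randomization procedure, and `split_reports`, the seed, and the default development fraction are hypothetical. The counts in Fig. 1 correspond to a development fraction of about two-thirds per modality, so that value is used here.

```python
import random


def split_reports(report_ids, dev_fraction=2 / 3, seed=0):
    """Shuffle report IDs reproducibly and split them into development and validation sets."""
    ids = list(report_ids)
    random.Random(seed).shuffle(ids)
    n_dev = round(len(ids) * dev_fraction)
    return ids[:n_dev], ids[n_dev:]


# Report counts per modality taken from Fig. 1; the IDs are placeholders.
for modality, n in {"US": 407, "CT": 741, "MRI": 51}.items():
    dev, val = split_reports(range(n))
    print(modality, len(dev), len(val))
```

With these inputs the set sizes match those in Fig. 1 (271/136, 494/247, and 34/17).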

NLP

The NLP software Clinical Language Annotation, Modeling, and Processing Toolkit (CLAMP) was used to develop algorithms that could identify FLD based on key radiographic findings paired with language rules and context. To develop the NLP algorithms, we performed a split validation study of randomly selected radiology reports. Specifically, radiology reports of each imaging modality were randomly split into two sets: 70% for algorithm training and 30% for validation. CLAMP was applied to the first set in order to teach the software key phrases suggestive of FLD. The algorithm was then tested on the remaining 30% in order to determine its accuracy for independently identifying findings suggestive of FLD.

Data Analysis

Algorithm performance characteristics for each imaging modality were calculated as recall (also known as sensitivity), precision (also known as positive predictive value), accuracy (proportion of tests that are either true positive or true negative), and F-measure (harmonic mean of precision and recall) for both the development and validation samples.
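These definitions translate directly into code. The sketch below computes the four measures from confusion-matrix counts; the function name `performance` is an assumption, and the example counts are those reported for the ultrasound validation cohort in Table 3.

```python
def performance(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Recall, precision, accuracy, and F-measure from confusion-matrix counts."""
    recall = tp / (tp + fn)                      # sensitivity
    precision = tp / (tp + fp)                   # positive predictive value
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of correct calls
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {
        "recall": recall,
        "precision": precision,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Ultrasound validation cohort counts from Table 3.
print({k: round(v, 3) for k, v in performance(tp=45, fp=4, fn=5, tn=82).items()})
```

The rounded values agree with the ultrasound validation row of Table 2 (0.900, 0.918, 0.934, and 0.909).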

Results

Of the 1199 imaging studies from 652 individuals included in the study, 741 were reports from CT scans (62%) from 358 patients, 407 were reports from US (34%) from 253 patients, and 51 were reports from MRI studies (4%) from 41 patients. Based on the manual review, the prevalence of FLD (percentage of individuals with any report demonstrating FLD) ranged from 17.6 to 45.5% across imaging modalities (Table 1). The F-measures and accuracies for each imaging modality algorithm were greater than 90% (Table 2).

Table 1 Individuals with fatty liver disease by imaging modality obtained for any reason (as determined by manual review)

Modality                     Number of individuals   Individuals with at least 1 report showing FLD   % prevalence
Ultrasound                   253                     115                                              45.5
Computerized tomography      358                     63                                               17.6
Magnetic resonance imaging   41                      8                                                19.5

Table 2 Recall, precision, accuracy, and F-measures

NLP algorithm             # Reports in set   Recall   Precision   Accuracy   F-measure
Ultrasound, development   271                0.951    0.933       0.956      0.942
Ultrasound, validation    136                0.900    0.918       0.934      0.909
CT scan, development      494                0.979    0.958       0.994      0.968
CT scan, validation       247                0.935    0.967       0.988      0.951
MRI, development          34                 1.00     1.00        1.00       1.00
MRI, validation           17                 1.00     1.00        1.00       1.00

Ultrasound

A total of 271 US reports were analyzed during the training phase. The NLP US algorithm performed with 95.1% recall (97/102) and 93.3% precision (97/104) for identifying FLD when compared to manual verification. In the validation cohort (136 files), the algorithm retained a 90.0% recall (45/50) and a 91.8% precision (45/49) (Table 3). Accuracy was 95.6% in development and 93.4% in validation (Table 2).

Computed Tomography

In the CT training phase, 494 CT scan reports were analyzed. The NLP CT algorithm performed with 97.9% recall (46/47) and 95.8% precision (46/48). In validation (247 files), the algorithm retained a 93.5% recall (29/31) and 96.7% precision (29/30) (Table 4). Accuracy was 99.4% in development and 98.8% in validation (Table 2).

Magnetic Resonance Imaging

In the MRI training phase, 34 MRI reports were analyzed. The NLP MRI algorithm performed with 100% recall (6/6) and 100% precision (6/6), and in validation (17 files) had 100% recall (5/5) and 100% precision (5/5) (Table 5). Accuracy was 100% in both development and validation (Table 2).

Table 3 Test characteristics of ultrasound algorithm for detecting fatty liver disease, validation cohort

                     FLD-positive (manual review)   FLD-negative (manual review)   Total reports
FLD-positive (NLP)   45                             4                              49
FLD-negative (NLP)   5                              82                             87
Total reports        50                             86                             136

Sensitivity = 90.0%, specificity = 95.3%, PPV = 91.8%, NPV = 94.3%
PPV positive predictive value, NPV negative predictive value

Concordance of Reports Within Individual Patients

A total of 51.7% of patients had one report, 20.6% had 2 reports, and 27.7% had 3 or more reports (range 1–41, median 1, mean 2.39). For patients with more than 1 report from the same imaging modality, FLD assessment agreed in 100% of MRI reports, 93.4% of CT reports, and 84.7% of US reports. Ninety-six patients had both CT and US reports, which agreed on the patient's FLD status in 55.2% of cases (53/96), with the majority of disagreements occurring when US identified FLD while CT did not (79.1%; 34/43). Ten patients had both CT and MRI reports, which agreed on the patient's FLD status in 70.0% of cases (7/10), with all disagreements occurring when CT reports identified FLD while MRI reports did not (3/3). Six patients had both US and MRI reports, which agreed on the patient's FLD status in 66.7% of cases (4/6), with both disagreements occurring when US reports identified FLD while MRI reports did not (2/2). Seventeen patients had reports from all three imaging modalities, which agreed in 52.9% of cases (9/17). In the majority of disagreements, CT reports did not identify FLD where one of the other two imaging modalities detected it.
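Patient-level concordance between two modalities can be computed by first collapsing each patient's reports into one FLD status per modality (positive if any report is positive, as in Table 1) and then comparing the patients who have both modalities. The sketch below assumes a hypothetical list of `(patient_id, modality, fld_positive)` tuples; the function names are illustrative, and the records are invented for the example and are not study data.

```python
from collections import defaultdict


def patient_status(reports):
    """Map (patient, modality) -> True if any report for that pair is FLD-positive."""
    status = defaultdict(bool)
    for patient_id, modality, fld_positive in reports:
        status[(patient_id, modality)] |= fld_positive
    return status


def concordance(reports, modality_a, modality_b):
    """Return (agreeing, total) over patients with reports from both modalities."""
    status = patient_status(reports)
    with_a = {p for p, m in status if m == modality_a}
    with_b = {p for p, m in status if m == modality_b}
    both = with_a & with_b
    agree = sum(status[(p, modality_a)] == status[(p, modality_b)] for p in both)
    return agree, len(both)


# Invented example records (not study data).
reports = [
    (1, "US", True), (1, "CT", False),
    (2, "US", False), (2, "US", True), (2, "CT", True),
    (3, "CT", False),
]
print(concordance(reports, "US", "CT"))  # (1, 2)
```

In this toy input, patient 1 is discordant, patient 2 is concordant once the two US reports are collapsed, and patient 3 is excluded for lacking a US report.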

Table 4 Test characteristics of CT scan algorithm for detecting fatty liver disease, validation cohort

                     FLD-positive (manual review)   FLD-negative (manual review)   Total reports
FLD-positive (NLP)   29                             1                              30
FLD-negative (NLP)   2                              215                            217
Total reports        31                             216                            247

Sensitivity = 93.5%, specificity = 99.5%, PPV = 96.7%, NPV = 99.1%


Table 5 Test characteristics of MRI scan algorithm for detecting fatty liver disease, validation cohort

                     FLD-positive (manual review)   FLD-negative (manual review)   Total reports
FLD-positive (NLP)   5                              0                              5
FLD-negative (NLP)   0                              12                             12
Total reports        5                              12                             17

Sensitivity = 100%, specificity = 100%, PPV = 100%, NPV = 100%


Discussion

We developed NLP-based algorithms which, when applied to EMR-based radiology reports, accurately identified patients with radiographic evidence of FLD. US reports demonstrated FLD in nearly half of patients, and the US NLP algorithm detected its presence 95% of the time with an accuracy exceeding 95%. CT and MRI reports had evidence of FLD in nearly 20% of patients, and the respective NLP algorithms could detect its presence 98 and 100% of the time with accuracies exceeding 98%. All algorithms retained their performance characteristics in the validation samples. Given these data, the algorithms could next be applied to larger samples of radiology reports to identify patients with moderate to severe FLD.

NLP technology has been increasingly implemented to discover elusive but clinically important information from EMR data sets. We previously reported the use of NLP in determining adenoma detection rates [7] and in diagnosing inflammatory bowel disease [8]. Others have implemented NLP in numerous contexts, including identifying patients' smoking status [9], extracting body weights from clinical notes [10], detecting homelessness [11], and even real-time catheter-associated urinary tract infection surveillance [12]. We believe that the use of NLP algorithms to identify patients with FLD is particularly relevant given the high prevalence of the condition, the lack of sensitive and specific ICD codes, and the impracticality of performing liver biopsies to identify these patients at the population level. NLP algorithms capable of evaluating existing radiology reports offer the promise of identifying a large cohort of patients with FLD for epidemiological and clinical studies.

NAFLD is under-recognized in the primary care setting [13]. Accordingly, we recognized that many individuals will have radiographic evidence of fatty liver disease that is not yet clinically apparent. The American Association for the Study of Liver Diseases (AASLD) guidelines for NAFLD [14] do not address either the prevalence or natural history of incidentally discovered hepatic steatosis, and there are no data comparing cohorts where the definition is based on laboratory versus imaging data. NLP may aid in identifying a cohort of such patients with radiographic steatosis not yet exhibiting symptoms and help to determine the natural history of clinically silent FLD.

We are cognizant of the limitations of our study. Estimating prevalence of FLD based on radiographic data is limited since not all FLD patients receive abdominal imaging, and patients with high-risk features may be more likely to undergo imaging than those at low risk for FLD or progressive disease. For example, we found that nearly half of the patients undergoing ultrasound of the liver had evidence of hepatic steatosis. However, CT and MRI of the abdomen were conducted for a myriad of reasons, and our estimates of steatosis identified in these studies (17.6 and 19.5%, respectively) are consistent with other reports on overall population prevalence of FLD [14–16]. Another limitation was the difficulty of classifying cases as FLD in the presence of concurrent cirrhosis. Cirrhosis alters the liver architecture, making it impossible to reliably determine the presence of fat on ultrasound or CT scan. As a screening tool, however, we favored identification of such mixed findings. We found that MRI reports could readily determine whether fatty infiltration and/or fibrosis were concurrently present and often reported on the presence of both.

In summary, we developed NLP-based algorithms that accurately identify patients with FLD. Initial evaluation provided support for the validity of these algorithms. Future studies will include the broader application of the algorithms to larger groups with abdominal imaging data. The resulting cohort of individuals with radiographic FLD can form the basis of many quality improvement initiatives as well as epidemiological and clinical studies to examine the clinical course of patients with hepatic steatosis.

Acknowledgments This study was approved by the Institutional Review Boards of Baylor College of Medicine and the Michael E. DeBakey VA Medical Center in Houston, Texas. Guarantor of the article: Fasiha Kanwal, MD.

Author's contribution Drs. Redman and Natarajan developed the NLP algorithms and performed the analysis of the data, with principal writing of the manuscript. Dr. Feng performed the random imaging search and compiled the associated reports. Dr. Hanif reconciled disagreements on imaging reports. Mss. Kramer and Desiderio facilitated IRB and VA approvals. Drs. Wang and Xu are CLAMP developers and helped troubleshoot our NLP algorithms. Drs. Hou and El-Serag reviewed the completed manuscript. Dr. Kanwal facilitated the project and made helpful edits of the manuscript.

Compliance with ethical standards

Conflict of interest None.

References

1. Rinella ME, Sanyal AJ. Management of NAFLD: a stage-based approach. Nat Rev Gastroenterol Hepatol. 2016;13:196–205.
2. Welsh JA, Karpen S, Vos MB. Increasing prevalence of nonalcoholic fatty liver disease among United States adolescents, 1988–1994 to 2007–2010. J Pediatr. 2013;162:496–500.
3. Hernaez R, Lazo M, Bonekamp S, et al. Diagnostic accuracy and reliability of ultrasonography for the detection of fatty liver: a meta-analysis. Hepatology. 2011;54:1082–1090.
4. Lee SS, Park SH. Radiologic evaluation of nonalcoholic fatty liver disease. World J Gastroenterol. 2014;20:7392–7402.
5. Mennesson N, Dumortier J, Hervieu V, et al. Liver steatosis quantification using magnetic resonance imaging: a prospective comparative study with liver biopsy. J Comput Assist Tomogr. 2009;33:672–677.
6. Hou JK, Imler TD, Imperiale TF. Current and future applications of natural language processing in the field of digestive diseases. Clin Gastroenterol Hepatol. 2014;12:1257–1261.
7. Imler TD, Morea J, Kahi C, et al. Multi-center colonoscopy quality measurement utilizing natural language processing. Am J Gastroenterol. 2015;110:543–552.
8. Hou JK, Chang M, Nguyen T, et al. Automated identification of surveillance colonoscopy in inflammatory bowel disease using natural language processing. Dig Dis Sci. 2013;58:936–941.
9. Wang L, Ruan X, Yang P, Liu H. Comparison of three information sources for smoking information in electronic health records. Cancer Inform. 2016;15:237–242.
10. Murtaugh MA, Gibson BS, Redd D, Zeng-Treitler Q. Regular expression-based learning to extract bodyweight values from clinical notes. J Biomed Inform. 2015;54:186–190.
11. Redd A, Carter M, Divita G, et al. Detecting earlier indicators of homelessness in the free text of medical records. Stud Health Technol Inform. 2014;202:153–156.
12. Branch-Elliman W, Strymish J, Kudesia V, Rosen AK, Gupta K. Natural language processing for real-time catheter-associated urinary tract infection surveillance: results of a pilot implementation trial. Infect Control Hosp Epidemiol. 2015;36:1004–1010.
13. Blais P, Husain N, Kramer JR, Kowalkowski M, El-Serag HB, Kanwal F. Nonalcoholic fatty liver disease is underrecognized in the primary care setting. Am J Gastroenterol. 2015;110:10–14.
14. Chalasani N, Younossi Z, Lavine JE, et al. The diagnosis and management of nonalcoholic fatty liver disease: practice guideline by the American Gastroenterological Association, American Association for the Study of Liver Diseases, and American College of Gastroenterology. Hepatology. 2012;55:2005–2023.
15. Lazo M, Hernaez R, Eberhardt MS, et al. Prevalence of nonalcoholic fatty liver disease in the United States: the third national health and nutrition examination survey, 1988–1994. Am J Epidemiol. 2013;178:38–45.
16. Kanwal F, Kramer JR, Duan Z, Yu X, White D, El-Serag HB. Trends in the burden of nonalcoholic fatty liver disease in a United States cohort of veterans. Clin Gastroenterol Hepatol. 2016;14:301–308.
