Multi-center colonoscopy quality measurement utilizing natural language processing.

ORIGINAL CONTRIBUTIONS

Multi-Center Colonoscopy Quality Measurement Utilizing Natural Language Processing

Timothy D. Imler, MD, MS1,2,3, Justin Morea, DO2,3, Charles Kahi, MD1,2,4, Jon Cardwell, MS4, Cynthia S. Johnson, MS5, Huiping Xu, PhD5, Dennis Ahnen, MD6, Fadi Antaki, MD7, Christopher Ashley, MD8, Gyorgy Baffy, MD9, Ilseung Cho, MD10, Jason Dominitz, MD11, Jason Hou, MD12, Mark Korsten, MD13, Anil Nagar, MD14, Kittichai Promrat, MD15, Douglas Robertson, MD16, Sameer Saini, MD17, Amandeep Shergill, MD18, Walter Smalley, MD19 and Thomas F. Imperiale, MD1,2,4,20

BACKGROUND: An accurate system for tracking of colonoscopy quality and surveillance intervals could improve the effectiveness and cost-effectiveness of colorectal cancer (CRC) screening and surveillance. The purpose of this study was to create and test such a system across multiple institutions utilizing natural language processing (NLP).

METHODS: From 42,569 colonoscopies with pathology records from 13 centers, we randomly sampled 750 paired reports. We trained (n=250) and tested (n=500) an NLP-based program with 19 measurements that encompass colonoscopy quality measures and surveillance interval determination, using blinded, paired, annotated expert manual review as the reference standard. The remaining 41,819 nonannotated documents were processed through the NLP system without manual review to assess performance consistency. The primary outcome was system accuracy across the 19 measures.

RESULTS: A total of 176 (23.5%) documents with 252 (1.8%) discrepant content points resulted from paired annotation. Error rate within the 500 test documents was 31.2% for NLP and 25.4% for the paired annotators (P=0.001). At the content point level within the test set, the error rate was 3.5% for NLP and 1.9% for the paired annotators (P=0.04). When eight vaguely worded documents were removed, 125 of 492 (25.4%) were incorrect by NLP and 104 of 492 (21.1%) by the initial annotator (P=0.07). Rates of pathologic findings calculated from NLP were similar to those calculated by annotation for the majority of measurements. Test set accuracy was 99.6% for CRC, 95% for advanced adenoma, 94.6% for nonadvanced adenoma, 99.8% for advanced sessile serrated polyps, 99.2% for nonadvanced sessile serrated polyps, 96.8% for large hyperplastic polyps, and 96.0% for small hyperplastic polyps. Lesion location showed high accuracy (87.0–99.8%). Accuracy for number of adenomas was 92%.

CONCLUSIONS: NLP can accurately report adenoma detection rate and the components for determining guideline-adherent colonoscopy surveillance intervals across multiple sites that utilize different methods for reporting colonoscopy findings.

Am J Gastroenterol advance online publication, 10 March 2015; doi:10.1038/ajg.2015.51

1Division of Gastroenterology and Hepatology, Indiana University School of Medicine, Indianapolis, Indiana, USA; 2Department of Medicine, Indiana University School of Medicine, Indianapolis, Indiana, USA; 3Department of Biomedical Informatics, Regenstrief Institute, LLC, Indianapolis, Indiana, USA; 4Center of Innovation, Health Services Research and Development, Richard L. Roudebush VA Medical Center, Indianapolis, Indiana, USA; 5Department of Biostatistics, Indiana University School of Medicine, Indianapolis, Indiana, USA; 6Division of Gastroenterology, University of Colorado, Denver, Colorado, USA; 7Division of Gastroenterology, Wayne State University, Detroit, Michigan, USA; 8Division of Gastroenterology, Albany Medical College, Albany, New York, USA; 9Department of Medicine, VA Boston Healthcare System, Boston, Massachusetts, USA; 10Division of Gastroenterology, New York University School of Medicine, New York, New York, USA; 11Division of Gastroenterology, University of Washington School of Medicine, Seattle, Washington, USA; 12Division of Gastroenterology and Hepatology, Baylor College of Medicine, Houston, Texas, USA; 13Division of Gastroenterology, Icahn School of Medicine at Mount Sinai, Bronx, New York, USA; 14Division of Digestive Diseases, Yale School of Medicine, New Haven, Connecticut, USA; 15Division of Gastroenterology, Brown Medical School, Providence, Rhode Island, USA; 16Division of Gastroenterology, The Dartmouth Institute, Lebanon, New Hampshire, USA; 17Division of Gastroenterology, University of Michigan, Ann Arbor, Michigan, USA; 18Division of Gastroenterology, University of California at San Francisco, San Francisco, California, USA; 19Division of Gastroenterology, Vanderbilt University, Nashville, Tennessee, USA; 20Health Services Research, Regenstrief Institute, Indianapolis, Indiana, USA.

Correspondence: Timothy D. Imler, MD, MS, Division of Gastroenterology and Hepatology, Research Scientist; Regenstrief Institute, 702 Rotary Circle, Suite 225, Indianapolis, Indiana 46202, USA. E-mail: [email protected]

Received 24 October 2014; accepted 28 January 2015

© 2015 by the American College of Gastroenterology

The American Journal of GASTROENTEROLOGY

ENDOSCOPY

see related editorial on page x


INTRODUCTION

Screening colonoscopy’s strength is its ability to identify and remove precancerous (adenomatous) polyps. Adenoma detection rate (ADR) (1–6), defined as the proportion of screening colonoscopies in which at least one adenoma is detected, multiplied by 100, is inversely related to the risk of interval CRC (cancer diagnosed after initial colonoscopy and before the next scheduled screening or surveillance exam) (7), advanced-stage disease, and fatal interval cancer in a dose-dependent manner (8). In a recent report, each 1% increase in ADR was associated with a 3% decrease in the risk for an interval cancer (8). ADRs vary widely among endoscopists (7.4–52.5%), making ADR an important quality and performance metric. However, ADR cannot easily be extracted from electronic data, limiting the ability to monitor and improve colonoscopy quality (9).

Once neoplastic tissue has been identified, follow-up colonoscopy is recommended. Surveillance colonoscopy is possibly overutilized among patients who need it least and underutilized among those who need it most (10,11). A system that could measure proper use of surveillance would enhance the effectiveness and cost-effectiveness of colonoscopy and could be utilized for a pay-for-performance system (12). NLP is a tool that may be used for such a system (13).

NLP is a computer-based linguistics technique that uses artificial intelligence to extract information from text reports (14). NLP has been utilized broadly across the medical field (15–19), but has been limited by accuracy, location, and context-specific utilization (15,20,21). Several reports from single sites have reported accuracies of NLP quality measurements, including ADR (12,22–27). However, it is not known whether linguistic variation (the way providers express the same concept or disease entity) across multiple centers is large enough to make NLP challenging to implement.

Using data from 13 Veterans Affairs (VA) endoscopy units, we sought to validate the performance of an NLP-based system for quantifying ADR and for identifying the requisite variables for providing guideline-based surveillance recommendations.
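The ADR definition given in the Introduction can be illustrated with a minimal sketch. The `Colonoscopy` record and its field names are hypothetical and chosen only for illustration; they are not part of the study's software.

```python
from dataclasses import dataclass


@dataclass
class Colonoscopy:
    # Hypothetical record layout, assumed for this example only.
    endoscopist: str
    is_screening: bool
    adenoma_count: int


def adenoma_detection_rate(exams):
    """ADR = 100 * (screening exams with >= 1 adenoma) / (screening exams)."""
    screening = [e for e in exams if e.is_screening]
    if not screening:
        raise ValueError("ADR is undefined without screening colonoscopies")
    detected = sum(1 for e in screening if e.adenoma_count >= 1)
    return 100.0 * detected / len(screening)


exams = [
    Colonoscopy("A", True, 0),
    Colonoscopy("A", True, 2),
    Colonoscopy("A", True, 1),
    Colonoscopy("A", True, 0),
    Colonoscopy("A", False, 3),  # not a screening exam, so it is ignored
]
adr = adenoma_detection_rate(exams)  # 2 of 4 screening exams -> 50.0
```

Only screening examinations enter the denominator, which is why the fifth record does not change the result.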

METHODS

Setting and data source

This study was approved by the VA Central Institutional Review Board. Data were obtained from 13 VA medical centers by means of electronic retrieval from the Computerized Patient Record System (28), the VA electronic medical record. Sites are listed in the acknowledgments section. Extracted data include colonoscopy and, when applicable, pathology reports from Veterans aged 40–80 years undergoing first-time VA-based colonoscopy between 2002 and 2009 for any indication except neoplasia surveillance. Extracted reports were linked using study-specific software to their corresponding pathology reports and were deidentified for NLP analysis. The colonoscopy reports were generated at each facility based on their local protocols (e.g., endowriter or dictation).

Exclusion criteria were as follows: previous VA-based colonoscopy for any indication within the 8-year interval; colonoscopy indication of neoplasia surveillance; previous colon resection; history of polyps or cancer of the colon or rectum; history of inflammatory bowel disease; and history of hereditary polyposis or non-polyposis CRC syndrome. All potentially eligible colonoscopies underwent preprocessing of the colonoscopy report using a text search of the indication field of the report with the terms “surveillance”, “history of adenoma”, and “history of polyp”, and were excluded if these terms were present. Associated International Classification of Diseases, ninth revision codes were then searched within the documents for V12.72 (personal history of colonic polyps), 211.3 (benign neoplasm of colon), 211.4 (benign neoplasm of rectum and anal canal), and 153* (malignant neoplasm of colon). Documents containing any of these codes were excluded.

Each patient-related report was given a unique ID for tracking and blinding the investigators to patient identity and VA location. Text reports were combined before NLP processing by merging the “Findings” and “Impression” sections and combining them with pathology.

Natural language processor
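Before any report reaches the NLP engine, it passes the input preparation described under Setting and data source: the indication text search, the ICD-9 code screen, and the merging of report sections. A minimal sketch of that preparation follows. The function names, the case-insensitive matching, and the regular expression used for the "153*" wildcard are assumptions made for illustration, not the study's actual implementation.

```python
import re

# Terms searched in the indication field, as listed in the text.
EXCLUDED_INDICATION_TERMS = ("surveillance", "history of adenoma", "history of polyp")

# ICD-9 codes listed in the text. "153*" is assumed to mean 153 with an
# optional decimal extension (e.g. 153.9). A bare "153" in free text would
# also match, which a real pipeline might need to guard against.
EXCLUDED_ICD9 = re.compile(r"\b(?:V12\.72|211\.3|211\.4|153(?:\.\d+)?)\b")


def is_excluded(indication, document_text):
    """Return True if the report should be dropped before NLP processing."""
    lowered = indication.lower()  # case-insensitive search is an assumption
    if any(term in lowered for term in EXCLUDED_INDICATION_TERMS):
        return True
    return EXCLUDED_ICD9.search(document_text) is not None


def combine_report(findings, impression, pathology=None):
    """Merge Findings and Impression, then append pathology when present."""
    parts = [findings, impression, pathology]
    return "\n".join(p.strip() for p in parts if p and p.strip())
```

A report with the indication "Surveillance colonoscopy" or containing the code 153.9 would be excluded under these assumptions; pathology is optional because not every colonoscopy has an associated pathology report.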

The Apache Software Foundation clinical Text Analysis and Knowledge Extraction System (cTAKES) (29) version 3.1.1 was utilized as the NLP engine for examination of colonoscopy and pathology reports. cTAKES is an open-source NLP system that uses rule-based and machine-learning methods with multiple components for customization (Apache License, Los Angeles, CA, USA, Version 2.0). Machine-learning methods included the following: sentence boundary detection, tokenization (dividing a sentence into unique words), named entity recognition using the Unified Medical Language System (UMLS) (30), and negation (e.g., recognizing “no adenoma” as the absence of an adenoma). A custom dictionary was created for synonyms not identified within UMLS and for additional postprocessing of common expressions.

Documents were stored within MySQL version 5.5.36 software, an open-source database released under the GNU General Public License (GPL), version 2.0. Using the MySQL RAND() function (MySQL, Mathematical Functions), we selected 750 combined reports from the 42,569 eligible for annotation to create a reference standard for training and testing. The 750 annotated documents were randomly split in a 1:2 ratio, allocating 250 documents to the training set (documents to be reviewed by the investigators for NLP refinement) and 500 documents to the test set.

Measures
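The measures in this study are scored by comparing NLP output with the annotated reference standard, at both the document level and the content-point level. The sketch below shows one way such bookkeeping could be done. It assumes each document is represented as a dictionary mapping a measure name to its value and that every reference document carries every measure; the function name, measure names, and data layout are hypothetical and do not describe the study's own code.

```python
def score_against_reference(nlp_docs, reference_docs):
    """Compare NLP output with the reference standard, document by document."""
    if len(nlp_docs) != len(reference_docs) or not reference_docs:
        raise ValueError("need equally sized, non-empty document lists")

    correct_by_measure = {}
    docs_with_error = 0
    wrong_points = 0
    total_points = 0

    for predicted, expected in zip(nlp_docs, reference_docs):
        wrong_here = 0
        for measure, expected_value in expected.items():
            is_correct = predicted.get(measure) == expected_value
            correct_by_measure[measure] = (
                correct_by_measure.get(measure, 0) + int(is_correct)
            )
            wrong_here += int(not is_correct)
        total_points += len(expected)
        wrong_points += wrong_here
        # A document counts as an error if any of its content points is wrong.
        docs_with_error += int(wrong_here > 0)

    n_docs = len(reference_docs)
    return {
        "document_error_rate": docs_with_error / n_docs,
        "content_point_error_rate": wrong_points / total_points,
        "accuracy_by_measure": {
            m: c / n_docs for m, c in correct_by_measure.items()
        },
    }


# Toy data: four documents, two measures, one disagreement on adenoma count.
reference = [
    {"crc": False, "adenoma_count": 0},
    {"crc": False, "adenoma_count": 2},
    {"crc": True, "adenoma_count": 1},
    {"crc": False, "adenoma_count": 0},
]
nlp = [
    {"crc": False, "adenoma_count": 0},
    {"crc": False, "adenoma_count": 1},
    {"crc": True, "adenoma_count": 1},
    {"crc": False, "adenoma_count": 0},
]
scores = score_against_reference(nlp, reference)
```

Because a single wrong content point marks the whole document as incorrect, the document-level error rate is always at least as high as the content-point error rate, which matches the pattern reported in the abstract.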

The primary outcome was NLP system accuracy in identifying the necessary components for high-quality, guideline-adherent surveillance recommendations from colonoscopy and pathology reports, including detection of adenomas. ADR among institutions was a secondary outcome. The terms for each concept were agreed upon by the authors a priori. Each unique colonoscopy report was categorized into nine categories as follows: (1) adenocarcinoma, (2) advanced adenoma (AA), (3) advanced sessile serrated polyp/adenoma (SSP), (4) nonAA, (6) nonadvanced SSP, (7) ≥10 mm hyperplastic polyp (HP), (8)
