Arthritis Care & Research Vol. 66, No. 11, November 2014, pp 1740-1748 DOI 10.1002/acr.22324 © 2014, American College of Rheumatology

ORIGINAL ARTICLE

Using Natural Language Processing and Machine Learning to Identify Gout Flares From Electronic Clinical Notes

CHENGYI ZHENG,1 NAZIA RASHID,2 YI-LIN WU,2 RIVER KOBLICK,2 ANTONY T. LIN,3 GERALD D. LEVY,4 AND T. CRAIG CHEETHAM2

Objective. Gout flares are not well documented by diagnosis codes, making it difficult to conduct accurate database studies. We implemented a computer-based method to automatically identify gout flares from electronic clinical notes using natural language processing (NLP) and machine learning (ML).
Methods. Of 16,519 patients, 1,264 and 1,192 clinical notes from 2 separate sets of 100 patients were selected as the training and evaluation data sets, respectively, and were reviewed by rheumatologists. We created separate NLP searches to capture different aspects of gout flares. For each note, the NLP search outputs became the ML system inputs, which provided the final classification decisions. The note-level classifications were grouped into patient-level gout flares. Our NLP+ML results were validated using a gold standard data set and compared with the claims-based method used in the prior literature.
Results. For 16,519 patients with a diagnosis of gout and a prescription for urate-lowering therapy, we identified 18,869 clinical notes as gout flare positive (sensitivity 82.1%, specificity 91.5%): 1,402 patients with ≥3 flares (sensitivity 93.5%, specificity 84.6%), 5,954 with 1 or 2 flares, and 9,163 with no flare (sensitivity 98.5%, specificity 96.4%). Our method identified more flare cases (18,869 versus 7,861) and more patients with ≥3 flares (1,402 versus 516) than the claims-based method.
Conclusion. We developed a computer-based method (NLP and ML) to identify gout flares from clinical notes. Our method was validated as an accurate tool for identifying gout flares, with higher sensitivity and specificity than previous studies.

INTRODUCTION

Gout is an inflammatory arthritic condition associated with the crystallization of monosodium urate within synovial joints (1-3). Gout flares are the most common clinical finding in gout patients, who have acute inflammation characterized by red, swollen, and painful joints (2,3).

Supported by Savient Pharmaceuticals, Inc. 1 Chengyi Zheng, PhD, MS: Kaiser Permanente Southern California, Pasadena; 2Nazia Rashid, PharmD, MS, Yi-Lin Wu, MS, River Koblick, BA, T. Craig Cheetham, PharmD, MS: Kaiser Permanente, Pharmacy Analytical Services, Downey, California; 3Antony T. Lin, MD: Southern California Permanente Medical Group, Kaiser Permanente Southern California, Fontana; 4Gerald D. Levy, MD: Southern California Permanente Medical Group, Kaiser Permanente Southern California, Downey. Address correspondence to Chengyi Zheng, PhD, MS, Department of Research and Evaluation, Kaiser Permanente Southern California, 100 South Los Robles Avenue, 2nd Floor, Pasadena, CA 91101. E-mail: [email protected]. Submitted for publication December 6, 2013; accepted in revised form March 18, 2014.


Since flares are common in gout patients, gout flares are important factors in gout-related clinical trials and observational studies (4). The frequency of flares usually increases over time and is highly associated with elevated serum uric acid (UA) levels (5). Previous studies have used diagnosis codes, laboratory data, and pharmacy and medical claims to identify gout flares (6,7). However, these clinical surrogates have not accurately identified gout flares and may underestimate or overestimate flare rates. The lack of a standard definition of gout flare in observational studies demonstrates the intrinsic difficulty of flare identification (8). Therefore, better methods for identifying gout flares need to be investigated.

Natural language processing (NLP) is a field of computer science and linguistics that aims to understand human (natural) languages and facilitate interactions between humans and computers. In the clinical domain, NLP has been used to identify and extract information from the unstructured "free text" that exists in chart notes. Compared to human chart review of medical records, NLP is more efficient and consistent (9-12).


Significance & Innovations

● We developed and validated a computer-based method to identify gout flares from the clinical notes in a large integrated health care system. We demonstrated that natural language processing (NLP) + machine learning (ML) can be used as an accurate tool for identifying gout flares with high sensitivity and specificity, based on the evidence presented in the clinical notes.

● Compared to the claims-based methods used in the prior literature to identify gout flares, our method identified more flare cases (18,869 versus 7,861) and more patients with ≥3 flares (1,402 versus 516).

● We demonstrated that NLP+ML can be used as a highly efficient and accurate tool to acquire clinically meaningful data from electronic health records, saving the substantial time and cost associated with manual chart review.

Previously, NLP has been used to identify possible lung cancer patients based on their radiology reports (13) and to extract disease characteristics for patients with prostate and breast cancer (14). Machine learning (ML) is a branch of computer science aimed at training computer systems with human-labeled data and then using that information to make decisions on new, unlabeled data (15). ML has been used in the biomedical domain, mainly for named entity recognition such as gene names (16) and for clinical entity extraction (17,18).

In this study, the goal was to use an NLP algorithm to capture evidence of gout flares from the clinical notes with high sensitivity and reasonable specificity, and then to use ML to achieve better specificity without significant loss of sensitivity. Both the NLP and ML algorithms were first developed using training data sets, then tested against the gold standard data set, and finally run against the study population's data. The objective of combining NLP and ML was to achieve high levels of specificity and sensitivity and then to compare our findings with the approaches used in previous observational studies.

MATERIALS AND METHODS

Study setting. Kaiser Permanente Southern California (KPSC) provides integrated, comprehensive medical services to 3.6 million members through its own facilities, which include 14 hospitals, 202 outpatient facilities, and a centralized laboratory. Every member receives a unique medical record number that they keep for life. This allows the member to be linked to various clinical and administrative databases, including member enrollment and benefits, inpatient and outpatient visits, laboratory test results, and drug dispensing. All aspects of care and interactions with the health care delivery system are captured in a comprehensive electronic medical record (EMR) system. In addition, care delivered outside KPSC is captured by a claims system.

The study schema is shown in Figure 1. Our initial step was to identify the base study population of gout patients and retrieve their clinical notes. We then created a beta version of the NLP algorithm and ran it against the clinical notes so we could generate 2 randomly selected sets of 100 patients each to form the training and evaluation (gold standard) patient groups. A rheumatologist (GDL) then reviewed the notes from the training set of patients to identify gout flares and allow us to refine and develop the NLP and ML algorithms, respectively. Two rheumatologists (GDL and ATL) then reviewed the evaluation set of patients to generate the gold standard data set. Analyses were then conducted on the gold standard data set and the entire base study population.

Base study population. This study was developed under a project with the objective of identifying gout patients receiving urate-lowering therapy (ULT) and determining the adequacy of control based on the frequency of gout flares. Using the KPSC system, we conducted a retrospective analysis between January 1, 2007 and December 31, 2010. The base study population included patients who filled a ULT prescription at any time during this period (the index date was defined as the date of the first ULT prescription), had an International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) code for gout (274.xx) at any time during the study period, and were age ≥18 years on the index date. Patients were also required to have continuous membership and drug benefits for 12 months before and after the index date. Patients with a history of cancer, dialysis, stage 5 chronic kidney disease, or a glomerular filtration rate of <15 ml/minute/1.73 m2 in the 12 months prior to the index date were excluded.

A total of 16,707 patients met the inclusion and exclusion criteria, of whom 16,519 (98.9%) had notes available during the 15-month time period (3 months before and 12 months after the index date). We looked back 3 months from the index date to include cases in which gout flares were documented before the index date. Certain types of clinical notes, such as telephone and e-mail notes, were excluded because they did not involve a face-to-face visit with a provider. A total of 599,317 clinical notes were retrieved and used in our NLP search. It is worth noting that each encounter in our EMR system could generate multiple notes; each version of a note creates a separate note, and duplicated notes can also occur. Encounters with multiple notes did not affect our results, since those notes share the same encounter date.

Creating the NLP search system (beta version). Despite recent advances in NLP, linguistic ambiguity and variation still pose challenges, since we use different terms and forms to communicate similar ideas. Those various terms are organized with other words, phrases, and punctuation in a variety of ways to form sentences and documents.



Figure 1. Flow chart of the study schema. NLP = natural language processing; ML = machine learning.

Compared with other free-text data sources such as publications and news articles, clinical notes are especially difficult to process because of the lack of formatting requirements, misspellings, and the widespread use of abbreviations and acronyms. In our NLP system, we used spelling correction and acronym recognition modules to capture misspellings and abbreviations, a negation module to handle negated expressions, and a temporal information module to identify the time of disease onset. Terminologies such as SNOMED CT (Systematized Nomenclature of Medicine-Clinical Terms) and MeSH (Medical Subject Headings) were used to create our medical ontology based on the gout flare-related keywords we developed.

We created our initial NLP search algorithms by searching for specific mentions of gout flares or gout-related signs and symptoms. A list of gout- or gout flare-related keywords was provided by our rheumatologists and used to create this beta version of the NLP system (Table 1). The beta version was used to create our training and evaluation data sets.

Creating the training data set. After the initial search using the beta version of the NLP system, we classified the patients into 3 categories: no flare, 1 or 2 flares, and ≥3 flares. We randomly selected 20 patients (no flare), 30 patients (1 or 2 flares), and 50 patients (≥3 flares) to obtain a total of 100 patients for our training data set. We chose this sampling approach to obtain enough positive instances. There were 3,816 notes for these 100 patients during our study period. The majority of these notes were not gout related, and including them would have artificially increased the specificity for our task as well as the review burden. These notes were therefore screened with a broad list of keywords mapped to several ontologies, applied as a case-insensitive string search; a note was selected for manual review if it contained at least one of the keywords (a minimal sketch of this kind of screen follows Table 1). We deemed this approach sufficient to include most of the gout flare-related notes. Finally, 1,264 notes were identified and reviewed blindly by a rheumatologist (GDL), who classified each note into 1 of 4 categories, i.e., yes (n = 264), no (n = 983), maybe (n = 2), or unknown (n = 15), based on the available information in the note.

Creating the evaluation (gold standard) data set. A second set of 100 different patients was randomly selected using the same selection process as for the training data set. A total of 1,195 notes were selected for manual review. Two rheumatologists (GDL and ATL) independently reviewed them, classifying each note into 2 categories of gout flare: yes or no. One hundred seventy-seven notes had discordant results. Kappa statistics were used to calculate the interrater agreement between the 2 reviewers: Cohen's kappa was 0.65 and the linear weighted kappa was 0.64. The second rheumatologist and the study team then re-reviewed these 177 notes together and finalized the categories. There were 3 notes for which a definitive decision could not be made, and they were removed from the data set, leaving 1,192 notes, with 318 annotated as yes and 874 annotated as no. At the patient level, there were 28 patients with no flare, 37 patients with 1 or 2 flares, and 31 patients with ≥3 flares; 4 patients did not have any notes available. These 1,192 annotated notes, together with their grouped patient-level flare counts, were used as our gold standard.


Table 1. Sample list of gout- or gout flare-related keywords*

Core keywords: Gout, Gouty, Podagra, Tophaceous, Tophi, Tophus.

Related keywords: Acute flare, Acute inflammatory process, Allopurinol, Arthritis, Attack, Big toe, Cellulitis, Chronic arthritis, Codeine, Colchicine, Corticosteroids, Diclofenac, Edema, Elevated levels of uric acid, Flare, Flare up, Flare-up, G6PD, Gonagra, High uric acid level, Hydrocodone, Hyperuricemia, Ibuprofen, Indomethacin, Inflammation of joint, Joint pain, Kidney stone, King's disease, Metacarpal, Metacarpophalangeal joint, Metatarsal phalangeal, Metatarsal-phalangeal, Naprosyn, Naproxen, NSAID, Oxycodone, Recurrent attacks, Red joint, Redness and swelling, Swelling, Swollen joint, Synovial biopsy, Synovial fluid analysis, Tender joint, Urate lower drugs, Urate-lowering therapy, Urate nephropathy, Uric acid, Uric acid crystals, Uric crystals, Voltarol, Zyloric.

* NSAID = nonsteroidal antiinflammatory drug.
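As referenced above, notes were screened with a case-insensitive keyword search before manual review. The following is a minimal illustrative sketch of such a screen with a crude negation check; it is not the study's implementation, and the keyword subset, negation cues, and 40-character look-back window are assumptions made only for illustration. The study itself relied on dedicated spelling-correction, acronym, negation, and temporal modules.

```python
import re

# Illustrative subset of the Table 1 keywords (a real list would be much larger).
KEYWORDS = ["gout", "gouty", "podagra", "tophus", "flare", "colchicine", "uric acid"]
# Crude negation cues; the study used a dedicated negation module instead.
NEGATION_CUES = ["no", "denies", "without", "negative for", "no evidence of"]

def flags_for_review(note_text: str, window: int = 40) -> bool:
    """Return True if the note mentions any keyword in a non-negated context."""
    text = note_text.lower()
    for kw in KEYWORDS:
        for match in re.finditer(r"\b" + re.escape(kw) + r"\b", text):
            # Check a short span of text preceding the keyword for a negation cue.
            preceding = text[max(0, match.start() - window):match.start()]
            negated = any(re.search(r"\b" + re.escape(cue) + r"\b", preceding)
                          for cue in NEGATION_CUES)
            if not negated:
                return True
    return False

print(flags_for_review("Acute gout flare of the left big toe, started 2 days ago."))  # True
print(flags_for_review("No evidence of gout; pain attributed to recent trauma."))      # False
```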



Figure 2. General architecture of the natural language processing (NLP) + machine learning system.

NLP and ML system development. One challenge of this study was to identify gout flare occurrences rather than simply to identify patients with gout. Gout is a chronic disease, and gout-related keywords often appear in patients' notes; caregivers also document gout flares with wide variation. The 1,264 notes from the training data set were used to refine the beta version of our NLP algorithms. We created separate NLP searches to capture different aspects of gout flares: symptoms, problems, complaints, clinical findings, tests, diagnoses, treatments, etc. For example, searches included identification of patients' symptoms of joint pain and patients' complaints of gout problems. All of these searches had to satisfy the same timing constraint: the finding had to be recent (ongoing or within 1 month). Each search could be an indicator of a gout flare, but no individual search was sufficient to achieve good sensitivity and specificity. Combining the outputs of these NLP searches achieved very good sensitivity but low specificity.

To achieve better specificity, we applied ML to make further classifications based on the NLP results. For each note, the NLP search outputs became the ML system inputs (features). The output of the ML system was a label of "yes" or "no" flare for each note. The 1,264 notes from the training data set were used to select and train the ML system. Several ML algorithms (naive Bayes, neural network, and decision trees) were evaluated before choosing the support vector machine because of its better performance. We used 10-fold cross-validation to select the best model parameters, with the positive and negative likelihood ratios and the F score as our performance metrics. The numbers of instances labeled unknown (n = 15) and maybe (n = 2) were insufficient to train a reliable model. We tested several options, including keeping their original category labels, converting them to the no category, and discarding these instances. We decided to convert both to the yes category because it offered the best results based on the F score, and identifying potential cases was considered clinically better than discarding them.
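As a rough sketch of how this classification step might be implemented (the paper does not name its software; scikit-learn, the synthetic data, and the feature set below are assumptions for illustration only), each note is represented by the binary outputs of the separate NLP searches, and a support vector machine is tuned by 10-fold cross-validation:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the training notes: each row holds the binary outputs of the
# separate NLP searches for one note (symptom, complaint, treatment, recent timing, ...);
# the real study used 1,264 rheumatologist-labeled notes.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 4))     # 300 notes x 4 hypothetical search outputs
y = (X.sum(axis=1) >= 2).astype(int)      # toy labels: 1 = flare, 0 = no flare

# Support vector machine selected by 10-fold cross-validation over a small parameter grid.
search = GridSearchCV(
    estimator=SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="f1",  # stand-in metric; the study scored models with likelihood ratios
                   # and an F score defined on sensitivity and specificity
)
search.fit(X, y)
note_labels = search.best_estimator_.predict(X)  # "yes" (1) / "no" (0) flare label per note
print(search.best_params_, note_labels[:10])
```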

Figure 2 shows the general architecture of our NLP+ML system.

Analysis method. We evaluated how well NLP+ML identified gout flares based on a single note and how well it identified unique gout flares for each patient. Besides comparing the NLP+ML results to the gold standard, we compared these results to the claims-based method, since it has been commonly used in previous studies. We also evaluated the initial review results from the 2 rheumatologists, since they were the best comparator for the performance of manual chart review. Lastly, we combined the results of NLP+ML and the claims-based method to gauge whether they carried complementary information that could improve sensitivity. The numbers of true-positives, false-positives, true-negatives, and false-negatives were calculated, and sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were derived from those numbers.

Identifying gout flares using NLP+ML. The output of the NLP+ML system was a "yes" or "no" flare label for each note, which we refer to as the note-level result.

Identifying gout flares using diagnosis codes. In previous studies (7,8), gout flares have been identified using ICD-9-CM codes and antigout therapies. In this study, we compared gout flares identified by 2 previously published algorithms with the NLP+ML findings. Patients who were diagnosed with gout were classified as having a gout flare if they met either of the following published criteria (6,7): 1) a medical claim with a diagnosis of gout (ICD-9-CM code 274.xx) followed within 7 days by claims for 1 or any combination of colchicine, a nonsteroidal antiinflammatory drug, a corticosteroid, or adrenocorticotropic hormone; a radiograph of the toe, heel, foot, ankle, knee, finger, hand, wrist, or elbow; magnetic resonance imaging of the upper or lower extremity; joint aspiration; microscopic evaluation of joint fluid; or a serum urate test; or 2) a medical claim with a diagnosis of joint pain (ICD-9-CM code 719.4x) followed within 7 days by a prescription for colchicine.
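Whichever method assigns the flare labels, the performance measures reported in Tables 2 and 3 follow from the same four confusion counts. The helper below is standard arithmetic rather than code from the study, and the example counts are arbitrary.

```python
def performance(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive the metrics reported in Tables 2 and 3 from the four confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "positive LR": sensitivity / (1 - specificity),
        "negative LR": (1 - sensitivity) / specificity,
        # F score as defined in the table footnotes: the harmonic mean of
        # sensitivity and specificity.
        "F score": 2 * sensitivity * specificity / (sensitivity + specificity),
    }

# Arbitrary illustrative counts, not data from the study.
print(performance(tp=80, fp=20, tn=170, fn=30))
```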



Table 2. Performance comparisons for clinical note-level gout flare identification*

              Sensitivity        Specificity        PPV                NPV                Positive LR       Negative LR        F score†
NLP+ML        0.821 (0.78-0.86)  0.915 (0.9-0.93)   0.779 (0.77-0.85)  0.934 (0.92-0.95)  9.7 (7.7-12.1)    0.2 (0.15-0.25)    0.87
Reviewer 1    0.767 (0.72-0.81)  0.952 (0.94-0.97)  0.853 (0.81-0.89)  0.918 (0.9-0.94)   16.0 (11.8-21.6)  0.24 (0.2-0.3)     0.88
Reviewer 2    0.899 (0.87-0.93)  0.973 (0.96-0.98)  0.923 (0.89-0.95)  0.964 (0.95-0.98)  32.8 (22-48.7)    0.1 (0.07-0.14)    0.93

Values are point estimates (95% confidence intervals).
* 95% CI = 95% confidence interval; PPV = positive predictive value; NPV = negative predictive value; LR = likelihood ratio; F score = harmonic mean of sensitivity and specificity (2 × [sensitivity × specificity]/[sensitivity + specificity]); NLP = natural language processing; ML = machine learning.
† The F score (F measure) is a weighted average of sensitivity and specificity with a range of 0 (worst) to 1 (best). It is commonly used in the fields of NLP and ML to evaluate classification or identification performance.

Consistent with prior studies, the date of the medical claim with a gout or joint pain diagnosis was defined as the gout flare date. Gout flares and their attendant care were assumed to last for a minimum of 30 days, and 2 flares within one 30-day period were counted as 1 flare.

Note-level gout flares. The 1,192 notes used to create the gold standard were used to perform the comparative analyses. The note-level analysis compared the "yes" or "no" flare label for each note against the annotated labels of the gold standard (318 yes, 874 no). Three sets of results were evaluated against the note-level gold standard: the NLP+ML system and the initial review results from the 2 rheumatologists.

Patient-level gout flares. The note-level results were grouped together to form the patient-level results. For each patient, the yes-labeled notes were sorted by their contact date, and 2 yes flares within one 30-day period were counted as 1 flare (a brief sketch of this grouping step appears at the end of this section). The total number of flares for each patient formed the patient-level result and was used to classify patients into no flare, has flare, and ≥3 flares groups. Five sets of results were evaluated against the patient-level gold standard: the NLP+ML system, the initial review results from the 2 rheumatologists, the results from the claims-based method, and the combined results of the NLP+ML and claims-based methods.

Comparison of NLP+ML with the claims-based method on the complete study cohort. We compared NLP+ML to the claims-based method on the complete study cohort (16,519 patients), since the claims-based method has been commonly used in retrospective studies. We were especially interested in whether NLP+ML could identify more gout flare cases or patients.

Basic statistical analysis. The 16,519 patients were classified into 6 groups based on their number of gout flares as defined by NLP+ML. To check whether this grouping was clinically meaningful, we calculated P values to determine whether there were statistically significant differences among these 6 groups in serum UA levels, comorbidities, and medications.
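As noted above, the patient-level grouping collapses yes-labeled notes that fall within one 30-day period into a single flare. The sketch below is one assumed reading of that rule (anchoring the window at the note that opened the current flare), not the study's program:

```python
from datetime import date, timedelta

def count_patient_flares(flare_note_dates: list[date], window_days: int = 30) -> int:
    """Collapse yes-labeled notes within one 30-day period into a single flare."""
    flares = 0
    flare_start = None  # date of the note that opened the current flare
    for d in sorted(flare_note_dates):
        if flare_start is None or d - flare_start > timedelta(days=window_days):
            flares += 1
            flare_start = d
    return flares

# Example: three flare-positive notes; the first two fall within one 30-day period.
notes = [date(2009, 3, 20), date(2009, 3, 1), date(2009, 6, 15)]
print(count_patient_flares(notes))  # 2 flares for this patient
```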

RESULTS

On the note level (Table 2), NLP+ML achieved 82.1% sensitivity, 91.5% specificity, 77.9% PPV, and 93.4% NPV for identifying gout flares. On the patient level, 5 sets of results (NLP+ML, the 2 rheumatologists' initial reviews, claim codes, and NLP+ML and claim codes combined) were compared to the gold standard on 2 tasks (Table 3). The first task was to identify patients with ≥1 flare and the second was to identify patients with ≥3 flares; the first task was essentially equivalent to identifying patients with or without a gout flare during the study period. For the first task, NLP+ML achieved the best results across the 5 measurements. The differences between NLP+ML and the 2 rheumatologist reviewers were not statistically significant. More importantly, except for PPV, NLP+ML proved to be significantly better than the claims-based method.

On the second task of identifying patients with ≥3 flares, NLP+ML had the highest sensitivity (93.5%) and NPV (96.5%), but lower specificity (84.6%) and PPV (74.4%). The sensitivity of the NLP+ML approach was much higher than that of the claims-based method for identifying flare patients (98.5% versus 88.2% for ≥1 flare and 93.5% versus 41.9% for ≥3 flares). As in the first task, NLP+ML was statistically significantly better than the claims-based method, except on PPV. Combining the results of the claims-based and NLP+ML methods did not improve sensitivity but reduced specificity, demonstrating no benefit to combining the claims-based method with NLP+ML.

Complete study cohort. The NLP+ML system processed all 599,317 notes for the 16,519 patients. On the note level, NLP identified 49,415 flare notes, which the ML system further narrowed down to 18,869 flare notes. On the patient level, NLP identified 3,954 patients without any evidence of a flare; among the remainder, an additional 5,209 patients had no flare based on the ML decisions. The final NLP+ML output comprised 9,163 patients with no flare, 5,954 with 1 or 2 flares, and 1,402 with ≥3 flares. As a comparison, the diagnosis codes method identified 7,861 flare cases: 11,061 patients with no flare, 4,942 with 1 or 2 flares, and 516 with ≥3 flares. The NLP+ML method identified significantly more patients with a flare (7,356 versus 5,458) and more patients with ≥3 flares (1,402 versus 516).



Table 3. Performance comparisons for identifying patients with ≥1 and ≥3 flares*

Patients with ≥1 flare
                      Sensitivity        Specificity        PPV                NPV                Positive LR        Negative LR          F score
NLP+ML                0.985 (0.91-1)     0.964 (0.8-1)      0.985 (0.91-0.99)  0.964 (0.8-1)      27.6 (4.0-189.1)   0.015 (0.002-0.11)   0.97
Reviewer 1            0.985 (0.91-1)     0.929 (0.75-0.99)  0.971 (0.89-0.99)  0.963 (0.79-1)     13.8 (3.6-52.5)    0.016 (0.002-0.11)   0.96
Reviewer 2            0.971 (0.89-0.99)  0.929 (0.75-0.99)  0.971 (0.89-0.99)  0.929 (0.83-1)     13.6 (3.6-51.7)    0.03 (0.008-0.12)    0.95
Claim codes           0.882 (0.78-0.94)  0.893 (0.71-0.97)  0.952 (0.86-0.99)  0.758 (0.57-0.88)  8.2 (2.8-24.1)     0.13 (0.07-0.25)     0.89
NLP+ML + claim codes  0.985 (0.91-1)     0.893 (0.71-0.97)  0.957 (0.87-0.99)  0.962 (0.78-1)     9.2 (3.2-26.8)     0.04 (0.002-0.22)    0.94

Patients with ≥3 flares
NLP+ML                0.935 (0.77-0.99)  0.846 (0.73-0.92)  0.744 (0.58-0.86)  0.965 (0.87-0.99)  6.1 (3.4-10.8)     0.07 (0.02-0.29)     0.89
Reviewer 1            0.742 (0.55-0.87)  0.923 (0.82-0.97)  0.821 (0.62-0.93)  0.882 (0.78-0.94)  9.6 (4.1-23)       0.28 (0.15-0.51)     0.82
Reviewer 2            0.839 (0.66-0.94)  0.954 (0.86-0.99)  0.897 (0.79-1)     0.925 (0.86-0.99)  18.2 (6.0-55.5)    0.17 (0.08-0.38)     0.89
Claim codes           0.419 (0.25-0.61)  0.954 (0.86-0.99)  0.813 (0.54-0.95)  0.775 (0.67-0.86)  9.1 (2.8-29.6)     0.61 (0.45-0.82)     0.58
NLP+ML + claim codes  0.935 (0.77-0.99)  0.738 (0.61-0.84)  0.63 (0.48-0.76)   0.96 (0.85-0.99)   3.6 (2.4-5.4)      0.09 (0.02-0.34)     0.83

Values are point estimates (95% confidence intervals).
* 95% CI = 95% confidence interval; PPV = positive predictive value; NPV = negative predictive value; LR = likelihood ratio; F score = harmonic mean of sensitivity and specificity (2 × [sensitivity × specificity]/[sensitivity + specificity]); NLP = natural language processing; ML = machine learning.

The 16,519 patients were classified into 6 groups based on the number of gout flares as identified by NLP⫹ML. Descriptive statistics were used to compare baseline characteristics between the 6 groups of patients, as shown in Table 4. Patients with more gout flares had higher serum UA levels, more comorbidities, and higher utilization of gout medications. Patients with more gout flares were prescribed medication by rheumatologists more frequently.

DISCUSSION

Previous studies have used various diagnosis codes and antigout therapies to identify gout flares. However, coding alone is insufficient to capture the complexity of the gout flare encounter. In this study, we used NLP+ML to identify gout flares from the clinical notes.

Table 4. Baseline patient characteristics by number of gout flares*

                                     NLP, no flares   NLP+ML, no flares   NLP+ML, 1 flare   NLP+ML, 2 flares   NLP+ML, 3 flares   NLP+ML, >3 flares   P†
                                     (n = 3,954)      (n = 5,209)         (n = 4,014)       (n = 1,940)        (n = 833)          (n = 569)

Comorbidities
  Arthropathy                        9 (0.2)          27 (0.5)            28 (0.7)          17 (0.9)           7 (0.8)            17 (3.0)            < 0.0001
  Cardiovascular                     407 (10.3)       800 (15.4)          606 (15.1)        356 (18.4)         186 (22.3)         140 (24.6)          < 0.0001
  Diseases related to renal function 501 (12.7)       1,060 (20.4)        977 (24.3)        555 (28.6)         285 (34.2)         233 (41.0)          < 0.0001
  Osteoarthritis                     490 (12.4)       1,022 (19.6)        759 (18.9)        425 (21.9)         195 (23.4)         165 (29.0)          < 0.0001
  RA                                 21 (0.5)         52 (1)              52 (1.3)          34 (1.8)           23 (2.8)           22 (3.9)            < 0.0001

Laboratory data
  Serum UA, mg/dl                    1,854 (46.9)     3,028 (58.1)        3,063 (76.3)      1,557 (80.3)       680 (81.6)         499 (87.7)
    Mean ± SD                        7.1 ± 2.0        7.5 ± 2.0           8.6 ± 1.9         8.8 ± 1.8          9.0 ± 1.8          9.3 ± 2.0           < 0.0001
  Serum UA not at goal (≥6 mg/dl)    1,284 (32.5)     2,286 (43.9)        2,841 (70.8)      1,476 (76.1)       655 (78.6)         483 (84.9)
    Mean ± SD                        8.1 ± 1.5        8.4 ± 1.5           8.9 ± 1.6         9.0 ± 1.6          9.1 ± 1.6          9.5 ± 1.8           < 0.0001
  Any symptomatic gout flare-related
    medication‡                      2,066 (52.3)     3,287 (63.1)        3,485 (86.8)      1,801 (92.8)       775 (93.1)         555 (97.5)          < 0.0001

Concomitant medications
  Antihypertensives                  3,325 (84.1)     4,315 (82.8)        2,959 (73.7)      1,413 (72.8)       635 (76.2)         435 (76.5)          < 0.0001
  Diuretics                          1,889 (47.8)     1,649 (31.7)        1,930 (48.1)      964 (49.7)         443 (53.2)         315 (55.4)          0.0002
  Antihyperlipidemics                2,324 (58.8)     3,002 (57.6)        1,890 (47.1)      883 (45.5)         376 (45.1)         272 (47.8)          < 0.0001
  Antidiabetics                      881 (22.3)       1,155 (22.2)        688 (17.1)        341 (17.6)         157 (18.9)         110 (19.3)          < 0.0001

Prescriber specialty
  Primary care physician             3,434 (86.9)     4,476 (85.9)        3,420 (85.2)      1,581 (81.5)       638 (76.6)         374 (65.7)          < 0.0001
  Rheumatologist                     80 (2.0)         145 (2.8)           198 (4.9)         157 (8.1)          107 (12.9)         132 (23.2)          -
  Other                              440 (11.1)       588 (11.3)          396 (9.9)         202 (10.4)         88 (10.6)          63 (11.1)           -

* Values are the number (percentage) unless indicated otherwise. NLP = natural language processing; ML = machine learning; RA = rheumatoid arthritis; UA = uric acid.
† P < 0.05 is significant. The methods used to generate the P values were chi-square or Fisher's exact test if the expected value was <5 (comorbidities), one-way analysis of variance (laboratory data), and chi-square test (gout flare-related medication, concomitant medications, and prescriber specialty).
‡ Colchicine, nonsteroidal antiinflammatory drug, corticosteroid, or adrenocorticotropic hormone.

Our method proved to be an accurate tool for identifying gout flares, with much higher sensitivity and specificity than the diagnosis codes-based method. On the task of identifying whether a patient had ≥3 flares, NLP+ML had 94% sensitivity versus 42% for the code-based method.

Compared with other studies using diagnosis codes to identify flares, the percentage of patients with ≥1 flare was 33% in our study, compared with 11% found by Primatesta et al (19), 35% by Sarawate et al (20), 40.9% by Wu et al (21), and 45.2% by Saseen et al (22). In our study, the diagnosis codes method identified 3% of the study population as having ≥3 flares, compared with 0.4% reported by Primatesta et al (19), 3% by Saseen et al (22), and 5.6% by Wu et al (21). It is worth noting that the study populations and definitions of gout flare differed among these studies. The base study populations in 2 studies (19,21) were required to have 2 diagnoses of gout, compared to 1 in our study. Saseen et al treated multiple flares within 14 days as a single flare, whereas the other studies (19,21) and our study used a 30-day window. The study by Saseen et al also used additional inclusion criteria compared to the other studies, which might explain its higher percentage of patients with flare. Primatesta et al did not use a diagnosis of joint pain (ICD-9-CM code 719.4x) as part of their definition of gout flare, which could contribute to their low rate of gout flares. The mean age of our study population was 62 years, compared to 55 years in Primatesta et al, 57 years in Sarawate et al, 58 years in Saseen et al, and 72 years in Wu et al. Our diagnosis code-based results appear to be in line with these published studies. In comparison, our NLP+ML method identified 8.5% of patients as having ≥3 flares and 44.5% as having ≥1 flare, demonstrating that the NLP+ML method identified more gout flares than a diagnosis codes approach. In other words, the diagnosis codes approach produces many false negatives and significantly underestimates the occurrence of gout flares.

EMR systems store digital records of patient encounters, where much of the information consists of unstructured, free-formatted clinical notes at volumes that make systematic manual review nearly impossible. Our study compared the results of 2 rheumatologists who identified gout flares from a limited set of clinical notes. The kappa statistic (0.65) shows that the 2 experienced rheumatologists had a fair amount of disagreement, which suggests that gout flare identification is not an easy task. There could be various reasons for their moderate discordance, including note fatigue after reading hundreds of notes, poorly documented timing of the attack in the note, the difficulty of determining the exact timing of a flare in cut-and-paste notes, the ambiguity of a "current attack," and incomplete or conflicting information documented in the note. Within the 2,456 manually reviewed notes, we identified 5 notes with possible cut-and-paste content. Three of these occurred in the history section, which was ignored by the NLP; the remaining 2 were inpatient notes. All 5 were cut and pasted from notes less than 30 days old, so they had no impact on the gout flare counts.

Identifying true gout flares can be difficult. High UA levels make gout flares more likely but are not temporally related to flares. Medications used to treat acute flares can be used for other conditions, and joints can be painful for reasons other than gout, making a diagnosis of gout flare complicated and uncertain.

This study was designed to look for patients with uncontrolled gout, defined as more than 3 flares in the study period. Based on this study, we believe that the NLP+ML method could be used to identify potentially undiagnosed gout patients or to identify known gout patients who might benefit from more intensive therapy. In the future, we will test whether our NLP+ML method is able to identify gout patients who do not have a gout diagnosis or ULT prescription. Identifying more gout patients who are misclassified by ICD coding will benefit research studies and clinical care, considering the challenges in gout diagnosis (23).

The majority of the NLP+ML errors on the patient level were off by ±1 flare (11 of 14), with most errors occurring between the 2-flare and 3-flare groups (9 of 11). This might explain why NLP+ML had relatively low specificity for identifying patients with ≥3 flares compared with identifying patients with ≥1 flare (84.6% versus 96.4%).

In conclusion, this is the first NLP+ML algorithm specifically designed to identify gout flares. NLP+ML algorithms can identify patients with poorly controlled gout based on the information documented in the clinical notes. NLP+ML is a valuable tool that can be used to screen large numbers of patients with chronic conditions electronically. Once fully implemented, the NLP+ML program can be used clinically to create automated patient-tracking systems to ultimately improve patient care, address treatment gaps, and help with quality of life while reducing health care resource use (24-26).

ACKNOWLEDGMENT

The authors thank Dr. Wayne S. Yee for his support during the chart review process when validating the algorithm.

AUTHOR CONTRIBUTIONS

All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be published. Dr. Zheng had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study conception and design. Zheng, Rashid, Levy, Cheetham.
Acquisition of data. Zheng, Rashid, Wu, Koblick.
Analysis and interpretation of data. Zheng, Rashid, Wu, Koblick, Lin, Levy, Cheetham.

ROLE OF THE STUDY SPONSOR

Savient Pharmaceuticals, Inc. had no role in the study design; in the collection, analysis, or interpretation of the data; in the writing of the manuscript; or in the decision to submit the manuscript for publication. Publication of this article was not contingent upon approval by Savient Pharmaceuticals, Inc.

REFERENCES
1. Schlesinger N. Management of acute and chronic gouty arthritis: present state-of-art. Drugs 2004;64:2399-416.
2. Suresh E. Diagnosis and management of gout: a rational approach. Postgrad Med J 2005;81:572-9.
3. Teng GG, Nair R, Saag KG. Pathophysiology, clinical presentation and treatment of gout. Drugs 2006;66:1547-63.
4. Gaffo AL, Schumacher HR, Saag KG, Taylor WJ, Dinnella J, Outman R, et al. Developing a provisional definition of flare in patients with established gout. Arthritis Rheum 2012;64:1508-17.
5. Shoji A, Yamanaka H, Kamatani N. A retrospective study of the relationship between serum urate levels and recurrent attacks of gouty arthritis: evidence for reduction of recurrent gout arthritis with antihyperuricemic therapy. Arthritis Rheum 2004;51:321-5.
6. Halpern R, Fuldeore MJ, Mody RR, Patel PA, Mikuls TR. The effect of serum urate on gout flares and their associated costs: an administrative claims analysis. J Clin Rheumatol 2009;15:3-7.
7. Wu EQ, Forsythe A, Guerin A, Yu AP, Latremouille-Viau D, Tsaneva M. Comorbidity burden, healthcare resource utilization, and costs in chronic gout patients refractory to conventional urate-lowering therapy. Am J Ther 2012;19:e157-66.
8. Taylor WJ, Shewchuk R, Saag KG, Schumacher HR Jr, Singh JA, Grainger R, et al. Toward a valid definition of gout flare: results of consensus exercises using Delphi methodology and cognitive mapping. Arthritis Rheum 2009;61:535-43.
9. Buchan NS, Rajpal DK, Webster Y, Alatorre C, Gudivada RC, Zheng C, et al. The role of translational bioinformatics in drug discovery. Drug Discov Today 2011;16:426-34.
10. Nadkarni PM, Ohno-Machado L, Chapman WW. Natural language processing: an introduction. J Am Med Inform Assoc 2011;18:544-51.
11. Clark C, Aberdeen J, Coarr M, Tresner-Kirsch D, Wellner B, Yeh A, et al. MITRE system for clinical assertion status classification. J Am Med Inform Assoc 2011;18:563-7.
12. Wilbur WJ, Rzhetsky A, Shatkay H. New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC Bioinformatics 2006;7:356.
13. Danforth KN, Early MI, Ngan S, Kosco AE, Zheng C, Gould MK. Automated identification of patients with pulmonary nodules in an integrated health system using administrative health plan data, radiology reports, and natural language processing. J Thorac Oncol 2012;7:1257-62.
14. Thomas AA, Zheng C, Jung H, Chang A, Kim B, Gelfond J, et al. Extracting data from electronic medical records: validation of a natural language processing program to assess prostate biopsy results. World J Urol 2013;17:1-5.
15. Witten IH, Frank E, Hall MA. Data mining: practical machine learning tools and techniques. 3rd ed. Burlington (MA): Morgan Kaufmann; 2011.
16. He Y, Kayaalp M. Biological entity recognition with conditional random fields. AMIA Annu Symp Proc 2008:293-7.
17. De Bruijn B, Cherry C, Kiritchenko S, Martin J, Zhu X. Machine-learned solutions for three stages of clinical information extraction: the state of the art at i2b2 2010. J Am Med Inform Assoc 2011;18:557-62.
18. Jiang M, Chen Y, Liu M, Rosenbloom ST, Mani S, Denny JC, et al. A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries. J Am Med Inform Assoc 2011;18:601-6.
19. Primatesta P, Plana E, Rothenbacher D. Gout treatment and comorbidities: a retrospective cohort study in a large US managed care population. BMC Musculoskelet Disord 2011;12:103.
20. Sarawate CA, Patel PA, Schumacher HR, Yang W, Brewer KK, Bakst AW. Serum urate levels and gout flares: analysis from managed care data. J Clin Rheumatol 2006;12:61-5.
21. Wu EQ, Patel PA, Mody RR, Yu AP, Cahill KE, Tang J, et al. Frequency, risk, and cost of gout-related episodes among the elderly: does serum uric acid level matter? J Rheumatol 2009;36:1032-40.
22. Saseen JJ, Agashivala N, Allen RR, Ghushchyan V, Yadao AM, Nair KV. Comparison of patient characteristics and gout-related health-care resource utilization and costs in patients with frequent versus infrequent gouty arthritis attacks. Rheumatology (Oxford) 2012;51:2004-12.
23. Wijnands JM, Boonen A, Arts IC, Dagnelie PC, Stehouwer CD, van der Linden S. Large epidemiologic studies of gout: challenges in diagnosis and diagnostic criteria. Curr Rheumatol Rep 2011;13:167-74.
24. American College of Rheumatology Pain Management Task Force. Report of the American College of Rheumatology Pain Management Task Force. Arthritis Care Res (Hoboken) 2010;62:590-9.
25. Singh JA, Hodges JS, Toscano JP, Asch SM. Quality of care for gout in the US needs improvement. Arthritis Rheum 2007;57:822-9.
26. Sundy JS. Gout management: let's get it right this time [editorial]. Arthritis Rheum 2008;59:1535-7.
