Symptom Names Using Multi-View Nonnegative Matrix Factorization.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TNB.2015.2422612, IEEE Transactions on NanoBioscience

effects of using medication/symptom names to improve clinical document clustering results; (3) we compare the performance of NMF and multi-view NMF on clinical document clustering. The rest of the paper is organized as follows. Section II gives an overview of extracting symptom names and medication names from clinical notes. Section III describes NMF and multi-view NMF. Section IV presents our experimental dataset, evaluation methodology, and preprocessing. Section V discusses the experimental results, and Section VI gives our conclusions.


Fig. 2. An overview of symptom/medical term extraction from Clinical Notes

II. MEDICATION/SYMPTOM NAME EXTRACTION

A. Clinical Documents

A clinical note is an important part of the patient record and is written in unstructured free text. An example of a clinical note with a few selected sections is shown in Fig. 1.

Fig. 1. An example of selected sections from Clinical Note

Three sections from a clinical note are included in this example: Principal Diagnosis, List of Problems/Diagnoses, and Medicines. The highlighted symptom and medication names are valuable information for physicians and patients. As shown in Fig. 1, they are embedded in multiple sections of unstructured or semi-structured text. We conducted a statistical analysis of our experimental dataset; the sections of clinical notes that most frequently contain medication or symptom names are shown in Table I.

TABLE I
MOST FREQUENT CLINICAL NOTES SECTIONS WITH MEDICATION/SYMPTOM NAMES

| Most Frequent Sections with Symptom Names | Most Frequent Sections with Medication Names |
| --- | --- |
| Amit Diagnosis | Discharge Medications |
| History Of Present Illness | Hospital Course |
| Hospital Course | History Of Present Illness |
| Past Medical History | Potentially Serious Interaction |
| Brief Resume Of Hospital Course | Medications On Admission |
| Discharge Medications | Brief Resume Of Hospital Course |
| Hpi | Medications |
| Physical Examination | Medications On Discharge |
| Hospital Course By System | Hospital Course By System |
| Hospital Course By Problem | Hospital Course By Problem |

Discharge Medications, History of Present Illness, Hospital Course, Brief Resume of Hospital Course, Hospital Course by System, and Hospital Course by Problem are the most frequent sections that contain both symptom names and medication names.

B. Name Extraction

An overview of extracting symptoms and medications from clinical notes is shown in Fig. 2. For example, from the clinical text “He was kept off aspirin given his GI bleeding. The patient also has hypertension and was on Isordil and Cardizem for that.” we extract the symptom name “hypertension” and the medication names “Isordil” and “Cardizem”.

First, we pre-process the clinical notes to identify words and sentences using the Stanford CoreNLP tool (http://nlp.stanford.edu/downloads/). During pre-processing, a section annotator identifies the sections of each clinical note, relying on the section headers in the notes. Negation sections, such as “ALLERGIES” or “Family History”, are excluded. For example, in “She is allergic to MORPHINE” from the section “ALLERGIES”, the medication name “MORPHINE” is a negated medication name, so we exclude it. We also use a negation annotator to remove negated symptom and medication names. For example, in “The patient was told to avoid taking aspirin or any other NSAIDs given his GI bleed”, we remove “aspirin” and “NSAIDs” because of the pre-negation word “avoid”. Pre-negation and post-negation terms are defined in NegEx (http://www.dbmi.pitt.edu/chapman/NegEx.html). Pre-negation terms are negation words that precede the name, such as avoid, deny, cannot, and without. Post-negation terms follow the name, such as free and was ruled out.

After pre-processing, we use a symptom annotator based on MetaMap [18] to extract symptom names from the clinical notes, and a medication annotator based on the MedEx system [19] to extract medication names. MetaMap (http://nls3.nlm.nih.gov) is a program that maps biomedical text to concepts in the UMLS Metathesaurus [18], [20].
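The section exclusion and negation filtering described above can be illustrated with a minimal Python sketch. It is not the annotator used in this work: the trigger lists are short, hypothetical excerpts of the kind of terms NegEx defines, the function name `keep_mentions` is an assumption, and the rule that any pre-negation term earlier in the sentence negates a mention is a deliberate simplification of NegEx scoping.

```python
import re

# Hypothetical, abbreviated trigger lists; NegEx defines far larger ones.
PRE_NEGATION = ("avoid", "deny", "denies", "cannot", "without")
POST_NEGATION = ("was ruled out", "free")
# Sections treated as negation sections in the text above.
EXCLUDED_SECTIONS = {"ALLERGIES", "FAMILY HISTORY"}


def _has_trigger(text, triggers):
    """True if any trigger occurs in `text` as a whole word or phrase."""
    return any(re.search(r"\b" + re.escape(t) + r"\b", text) for t in triggers)


def keep_mentions(section, sentence, mentions):
    """Return the mentions found in `sentence` that are not negated.

    Simplifying assumption: a pre-negation trigger anywhere before the
    mention, or a post-negation trigger anywhere after it, negates it.
    """
    if section.upper() in EXCLUDED_SECTIONS:
        return []
    text = sentence.lower()
    kept = []
    for mention in mentions:
        pos = text.find(mention.lower())
        if pos < 0:
            continue  # mention does not occur in this sentence
        before = text[:pos]
        after = text[pos + len(mention):]
        if _has_trigger(before, PRE_NEGATION) or _has_trigger(after, POST_NEGATION):
            continue  # negated mention, drop it
        kept.append(mention)
    return kept
```

With these toy lists, the “avoid taking aspirin or any other NSAIDs” sentence yields no mentions, while “hypertension”, “Isordil”, and “Cardizem” are all kept from the second example sentence.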
Since MetaMap returns concepts of all types, we keep only the concepts related to symptom names, such as those labeled “sosy”, which represents “sign or symptom”. The related concept types are {sosy, dsyn, neop, fngs, bact, virs, cgab, acab, lbtr, inpo, mobd, comd, anab}; see [21] for details. We use the MedEx system to extract medication names from clinical notes. MedEx is a natural language processing system that extracts medication information from

1536-1241 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Accuracy is the proportion of documents that are correctly classified with respect to the known class labels; higher accuracy means better performance. NMI (normalized mutual information) measures clustering performance; the higher, the better:

$$\mathrm{NMI} = \frac{\sum_{h,l} n_{h,l} \log \dfrac{n \cdot n_{h,l}}{n_h \, n_l}}{\sqrt{\left(\sum_{h} n_h \log \dfrac{n_h}{n}\right)\left(\sum_{l} n_l \log \dfrac{n_l}{n}\right)}}$$

where $n$ is the total number of documents, $n_h$ is the number of documents in standard class $h$, $n_l$ is the number of documents in predicted cluster $l$, and $n_{h,l}$ is the number of documents in both class $h$ and cluster $l$.
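As an illustration of the NMI definition above, the following is a minimal Python sketch that computes it directly from label counts. The function name `nmi` and the convention of returning 0 when either partition has a single group are assumptions made for this example, not part of the original evaluation code.

```python
import math
from collections import Counter


def nmi(true_labels, pred_labels):
    """Normalized mutual information between class labels and cluster labels.

    Follows the count-based definition: the numerator sums over (class,
    cluster) pairs, the denominator is the square root of the product of
    the two single-partition sums.
    """
    n = len(true_labels)
    n_h = Counter(true_labels)                     # documents per class h
    n_l = Counter(pred_labels)                     # documents per cluster l
    n_hl = Counter(zip(true_labels, pred_labels))  # documents in both h and l

    numerator = sum(
        c * math.log(n * c / (n_h[h] * n_l[l])) for (h, l), c in n_hl.items()
    )
    sum_h = sum(c * math.log(c / n) for c in n_h.values())
    sum_l = sum(c * math.log(c / n) for c in n_l.values())
    denominator = math.sqrt(sum_h * sum_l)
    # Assumed convention: NMI is 0 when one partition has only one group.
    return numerator / denominator if denominator > 0 else 0.0
```

A clustering that matches the classes up to a relabeling of cluster indices scores 1, and a clustering that is statistically independent of the classes scores 0.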

TABLE III
2009 DATASET RESULTS (NMF)

| # | Major Features: Symptom | Major Features: Medication |
| --- | --- | --- |
| 1 | Pain; meds (microcephaly, epilepsy, and diabetes syndrome); infections | Fluvastatin; nicardipine; methyldopa; amphotericin; thera; ammonia; hydroxyzine hcl |
| 2 | Congestive heart failure; coronary artery disease; secondaries (neoplasm metastasis); diabetes | Emtricitabine; potassium citrate; bicalutamide; mcp; dipyridamole |
| 3 | Ischaemia; nausea; congestive heart failure; symptoms | Procaine; hydroxyzine hcl; menthol; dextran 40; linezolid; clopidogrel bisulfate |
| 4 | Hypertension; obesity; asthmatics; pulmonary failure; gout; apnea, sleep apnea syndromes; mental depression; hepatitis b; diabetes mellitus; depressive disorder | - |
| 5 | Erythema; diarrhea; abdominal pain; haematocrit; obesity; wound; place (ocular myopathy with hypogonadism); vomiting | Beta blockers; emtricitabine |

V. EXPERIMENTAL RESULTS

A. 2009 Clinical Notes Dataset Results

TABLE IV
2009 DATASET RESULTS (MULTI-VIEW NMF)

| # | Major Features: Symptom | Major Features: Medication |
| --- | --- | --- |
| 1 | Hyperlipidaemia; hypercholesterolaemia; polycythaemia; gerd; hypertensive disease | Aspirin; lisinopril; furosemide; phencyclidine; metoprolol |
| 2 | Chest pain; constipation; facial hemiatrophy; pain; food-drug interactions | Heparin, porcine; digoxin; amiodarone; furosemide; warfarin |
| 3 | Place (ocular myopathy with hypogonadism); haematocrit; secondaries (neoplasm metastasis); pain; chest pain | Dextrose; insulin; metoprolol; aspirin; creatinine |
| 4 | Diabetes mellitus; glaucoma; hepatitis c; hepatitis c virus; congestive heart failure | Prednisone; insulin, aspart, human/rdna; acetaminophen; vancomycin; levofloxacin |
| 5 | Diabetes mellitus; depression; diabetes; sleep apnea, obstructive; asthma | Insulin glargine; albuterol; lisinopril; digoxin; furosemide |

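TABLE V
2014 DATASET RESULTS (K=3)

| Feature Type | Views | Accuracy (%) | NMI |
| --- | --- | --- | --- |
| Count | Words | 40.54 | 0.0228 |
| Count | Symptom/Medication | 52.03 | 0.1273 |
| Count | All 3 views | 53.38 | 0.1459 |
| TF-IDF | Words | 35.47 | 0.0020 |
| TF-IDF | Symptom/Medication | 52.36 | 0.1606 |
| TF-IDF | All 3 views | 52.36 | 0.1711 |

TABLE VI
2014 DATASET RESULTS (K=2)

| Feature Type | Views | Accuracy (%) | NMI |
| --- | --- | --- | --- |
| Count | Words | 57.77 | 0.0198 |
| Count | Symptom/Medication | 55.07 | 0.0924 |
| Count | All 3 views | 59.80 | 0.1751 |
| TF-IDF | Words | 53.38 | 0.0034 |
| TF-IDF | Symptom/Medication | 73.31 | 0.1844 |
| TF-IDF | All 3 views | 75.00 | 0.2283 |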

We choose 𝑘 = 5 to cluster the documents into five groups. For each document cluster, the top 10 features with the highest weights are listed in Table III (NMF results) and Table IV (multi-view NMF results). In Table III, all the major features in component 4 are symptom names, whereas multi-view NMF yields a balanced set of symptom names and medication names for each cluster. This provides a way to observe intrinsic patterns between symptom names and medication names in each cluster.

B. 2014 Clinical Notes Dataset Results

We choose 𝑘 = 3 and 𝑘 = 2. With 𝑘 = 3, patients are clustered into three groups: patients who develop coronary artery disease (CAD), patients who have CAD in their first records, and patients who never develop CAD. The results are shown in Table V. With 𝑘 = 2, patients are clustered into two groups: patients who develop CAD or have CAD in their records, and patients who never develop CAD. The results are shown in Table VI.

In both Table V and Table VI, we use word counts and TF-IDF as features to generate the feature matrices. Using symptom names and medication names gives a higher NMI than using words alone in every setting, and a higher accuracy in every setting except word counts with 𝑘 = 2 (Table VI). Using all three views (words, symptom names, and medication names) together achieves the highest performance.

A comparison of NMF and multi-view NMF using all three views is shown in Fig. 4. When 𝑘 = 3, multi-view NMF achieves about 12% higher accuracy than NMF with word counts as features, and 14% higher accuracy with TF-IDF as features. When 𝑘 = 2, multi-view NMF has the same accuracy as NMF with word counts as features, and 24% higher accuracy with TF-IDF as features. Overall, multi-view NMF performs better than NMF.
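To make the notion of "top features with the highest weights" per cluster concrete, the following is a minimal NumPy sketch of single-view NMF with the standard multiplicative updates for the Frobenius-norm objective, followed by reading off the top features of each component. It is an illustration under stated assumptions only: it is plain NMF rather than the multi-view variant, and the function names, iteration count, random initialization, and toy matrix are hypothetical rather than taken from the experiments.

```python
import numpy as np


def nmf(X, k, n_iter=500, seed=0, eps=1e-9):
    """Factor a nonnegative documents-by-features matrix X as W @ H.

    Uses multiplicative updates for the squared Frobenius error; W is
    documents-by-k and H is k-by-features, both kept nonnegative.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_feats = X.shape
    W = rng.random((n_docs, k)) + eps
    H = rng.random((k, n_feats)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def top_features(H, names, top=10):
    """For each component (row of H), list the `top` feature names by weight."""
    return [[names[j] for j in np.argsort(row)[::-1][:top]] for row in H]


# Toy data (hypothetical): two groups of documents using disjoint features.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [4.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 2.0, 6.0],
])
W, H = nmf(X, k=2)
clusters = W.argmax(axis=1)  # assign each document to its largest component
features = top_features(H, ["pain", "nausea", "aspirin", "insulin"], top=2)
```

Assigning each document to the component with the largest entry in its row of W is one common way to turn the factorization into a clustering; the rows of H then describe each cluster by its most heavily weighted features.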



