Extracting Concepts Related to Homelessness from the Free Text of VA Electronic Medical Records

Adi V. Gundlapalli, MD, PhD, MS1,2, Marjorie E. Carter, MSPH1,2, Guy Divita, MS1,2, Shuying Shen, MStat1,2, Miland Palmer, MPH2, Brett South, MS1,2, B.S. Begum Durgahee, MS1,2, Andrew Redd, PhD1,2, Matthew Samore, MD1,2
1VA Salt Lake City Health Care System and 2University of Utah School of Medicine, Salt Lake City, UT

Abstract
Mining the free text of electronic medical records (EMR) using natural language processing (NLP) is an effective method of extracting information not always captured in administrative data. We sought to determine if concepts related to homelessness, a non-medical condition, were amenable to extraction from the free text of Veterans Affairs (VA) electronic medical records. As there were no off-the-shelf products, a lexicon of terms related to homelessness was created. A corpus of free text documents from outpatient encounters was reviewed to create the reference standard for NLP training and testing. V3NLP Framework was used to detect instances of lexical terms and was compared to the reference standard. With a positive predictive value of 77% for extracting relevant concepts, this study demonstrates the feasibility of extracting positively asserted concepts related to homelessness from the free text of medical records.

Introduction
Homelessness, especially among Veterans, is a matter of national importance to the United States. The US Department of Veterans Affairs (VA) has committed to ending homelessness among Veterans [1]. Providing appropriate services for those who are experiencing homelessness and, of equal importance, early interventions to those at risk are key components of this initiative. An essential element of preventing homelessness among Veterans is a systematic approach to identifying those at risk.
Current methods of identification of Veterans with any of these risk factors from within the VA consist of mining administrative databases for ICD-9-CM codes associated with these diagnoses [2, 3]. The reliability and validity of administrative data are domain specific, and such data have been shown to be useful in several domains [4-8]. However, this method is unlikely to be either timely or complete enough to be useful for planning interventions. Additionally, many factors of a social and behavioral nature are not captured completely by ICD-9-CM codes and so these may offer only a limited view of a patient’s needs or risk factors.

References to risk factors as well as evidence for homelessness are often found only in the free text of medical records written by VA providers and possibly precede the formal identification of Veterans as being homeless [9]. Homelessness also serves as an ideal use case for identifying a wide range of psychosocial risk factors that may be documented in the free text during visits to the health care system [10]. The value of these text data has been shown in several clinical and biomedical domains including bio-surveillance, adverse event detection, and quality improvement [11-14].

Mining the free text of electronic medical records requires informatics-based methods such as natural language processing (NLP) to reliably extract appropriate and relevant information of interest, often referred to as concepts [12, 14]. These concepts can be single words, phrases, or larger spans of text. In general, NLP systems function by parsing sentences, identifying words or phrases, and mapping those to standardized vocabularies or ontologies such as the Unified Medical Language System (UMLS) Metathesaurus. In the absence of a formal ontology or in a situation where it is not known how domain-specific concepts are represented in clinical documents, an intermediate step would be to develop a lexicon or vocabulary of relevant concepts in the domain of interest.
The lexicon could then be used as a dictionary in an NLP system for identification and extraction of appropriate concepts. With no existing literature, reference standard, or lexicon to refer to in the specialized domain of homelessness, the goal of this study was to develop an NLP pipeline to extract positively asserted concepts related to homelessness

from the free text of VA medical records. The primary objective was to describe the positive predictive value (PPV) of the extraction process. We first describe the development of a lexicon of concepts related to homelessness through an iterative process. This lexicon was then used in an NLP pipeline to extract the concepts of interest. The performance of the extraction process was evaluated against a reference standard of free text clinical documents that were classified at the document level by human review to have evidence of homelessness along with a control group of documents with no evidence of homelessness. The prevalence of different categories of concepts in this reference standard is described using the NLP pipeline.

Methods

Setting and data sources
This study was performed using the VA Informatics and Computing Infrastructure (VINCI), a central repository of VA-wide databases available for research including administrative, pharmacy and laboratory data, and free-text clinical documents [2]. For this study, a cohort of Veterans was identified who had at least one encounter within the VA health care system during the calendar year 2009 in two of four VA national regions. A corpus of free text clinical documents from outpatient encounters for this cohort of Veterans was extracted from VINCI databases and made available for this concept extraction project.

Developing a lexicon for concepts related to homelessness
A review found that coverage of concepts and variant lexical terms for risk factors for homelessness and homelessness status was sparse and incomplete within available and frequently used controlled clinical vocabularies. It was determined that the traditional NLP extraction using UMLS-based terminology needed to be augmented with the concepts and terms that were missing. There were several stages involved in lexicon development.
The initial list of terms and phrases was constructed from a literature review for known risk factors for homelessness in general and Veterans in particular [15]. Next, a think-aloud session was held, where the members of the research team, many of whom have experience with health care for the homeless, reviewed charts from known homeless Veterans and enriched the lexicon with concepts identified from the free text notes written by various VA providers including medical, mental health and social work providers. The lexicon was refined using an iterative process. A panel of domain experts reviewed documents pre-annotated by NLP to determine if relevant concepts were identified, make modifications to annotations, add missing annotations, or reject annotations found to be incorrect or irrelevant. Concepts found within the notes that were relevant but not currently in the lexicon were added to it. An additional 154 variant forms of the terms were added through the use of NLM’s lexical variant generation tools [16] coupled with additional human curation. The final list was adjudicated by a group of clinicians, social workers, and homeless shelter providers to ensure generalizability in both the VA and external settings. A further refinement of the lexicon was performed by manual review of the documents for false negatives after processing by the NLP pipeline. The resulting set of 356 terms was categorized into eight broad categories: direct evidence, “doubling up”, mentions of mental health diagnoses, behavioral factors, social stressors, sexual and other trauma, medical comorbidities, and other risk factors.

Establishing a human-reviewed reference standard of text documents for homelessness
The premise for choosing text documents was based on discussions with VA clinicians and service providers who care for Veterans experiencing homelessness.
The rationale was that barring a self-declaration of homeless status by the Veteran, the most direct reference to a Veteran experiencing homelessness is likely to be in the free text notes of VA providers. From the text documents generated from VA facility encounters during the calendar year 2009, a random sample of 500 documents was extracted without regard to VA facility or any Veteran characteristic. Another random 500 documents were extracted with the word ‘homeless’ in the note title. This step was taken to enrich the text document

corpus for homelessness as a random sample of documents is likely to have a low prevalence of homelessness and concepts of interest. Using the official US Government definition for homelessness [17], a written guideline was developed for human review of text documents. Reviewers were instructed to read a document and then classify the document as either (1) having positive documentation for evidence of current or past homelessness or risk factors associated with homelessness (ever homeless/at risk) or (2) no evidence of homelessness. Three reviewers were recruited to perform the classification. As part of the training to establish and achieve high agreement, each reviewer was instructed to classify 4 sets of 20 documents (total 80) with review and discussion among reviewers of concordantly and discordantly classified documents. Once the set of 80 ‘human training’ documents was completed, reviewers were assigned sets of 20 documents until the corpus was exhausted (total corpus of 862 documents). The classification task was performed using a general purpose annotation tool called eHOST, in which documents were displayed in human-readable format similar to the VA electronic medical record [18]. The file format for the documents was based on Knowtator, a widely used annotation tool [19].

Researching available electronic resources to map concepts related to homelessness
In an effort to map the terms in our lexicon to existing ontologies, two steps were taken. First, a search for concepts related to homelessness was conducted using the NCI Metathesaurus browser [20]. This resource is described as a wide-ranging biomedical terminology database that covers most terminologies used by NCI for clinical care, translational and basic research, and public information and administrative activities. It is an extensive dictionary of concepts from over 75 sources including the Unified Medical Language System (UMLS) Metathesaurus.
In addition, an effort was made to map the lexical concepts to established ontologies by utilizing BioPortal (http://bioportal.bioontology.org) to ensure uniformity and to avoid duplication of efforts in converting unstructured free text to structured data. BioPortal is an open source Web repository enabling access to terms from about 372 biomedical ontologies and data sources [21, 22]. However, in addition to clinically-related ontologies, BioPortal includes genomic-associated ontologies, such as cell line ontology, as well as other organism ontologies, such as mouse genome. The 372 ontologies available as of 2013 were manually curated in order to limit our lexical search to clinically-related data sources. A total of 98 ontologies and terminologies were considered for concept mapping and two searches were performed on the lexicon created for this project: an exact match term search and an “at least one” term search (wildcard search strategy). The results of both search types were retrieved and manually curated by three people, with one adjudicating the review results of the other two. UMLS Concept identifiers were also generated for each term in the lexicon via term to concept mapping tools [23] and human judgment for those that did not map. UMLS concept ids are provided as attributes for the 84 terms that were mapped. These efforts provide an additional semantic locality entry point that could be used for classification from different perspectives.

Natural language processing (NLP) pipeline: V3NLP
The basic premise of an NLP pipeline is to take unstructured free text and convert the text to machine-readable ‘structured’ elements. The V3NLP Framework, used for this project, is a UIMA [24] based set of tools, annotation label guidelines, annotators, readers and writers designed to aid VINCI NLP developers to build out applications.
The V3NLP Framework evolved initially from HITEx [25], cTAKES [26], and MetaMap [23], which are NLP systems that have been applied across a wide variety of use cases. The pipeline created for this task involved term lookup utilizing the lexicon developed for this project along with negation detection. Additional modules were developed and employed for this task to address boilerplated text such as Slot: Value, check boxes, and question structures. The slot:value pipeline component and the question pipeline component were employed to identify concepts that fall within these structures so that the concept assertion status is updated [27]. Concept assertion semantics are different for each of these boilerplated structures. For example, checkbox structures of the form Homeless: Y N were observed within Homelessness Survey records. It is necessary to recognize that this is a check box structure, and that the assertion status for the homelessness concept, which appears to the left of the delimiter, rests on the dependent content that follows it. If only Y appears, or any positive variant, the concept is asserted. Any other dependent content value would cause the concept to be tagged as negated. A tail end module was created to turn the homelessness terms into annotations in 8 named categories as described above.
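The checkbox rule described above can be illustrated with a minimal Python sketch. This is not the V3NLP implementation: the three-term lexicon (terms borrowed from Table 1), the set of "positive variants", the regular expression, and the function names `find_terms` and `annotate_line` are all assumptions made for illustration. Negation detection in free text is deliberately left out.

```python
import re

# Hypothetical mini-lexicon (term -> category); terms are borrowed from Table 1.
LEXICON = {
    "homeless": "Direct Evidence",
    "couch surfing": "Doubling up",
    "job loss": "Social Stressors",
}

# Assumed set of "positive variants" for a checkbox value.
POSITIVE_VALUES = {"y", "yes", "x"}

# A Slot: Value line, e.g. "Homeless: Y N" (assumed pattern).
SLOT_VALUE = re.compile(r"^\s*(?P<slot>[^:\n]+?)\s*:\s*(?P<value>.*)$")


def find_terms(text):
    """Case-insensitive lexicon lookup; returns (term, category) in text order."""
    lowered = text.lower()
    hits = []
    for term, category in LEXICON.items():
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            hits.append((match.start(), term, category))
    return [(term, category) for _, term, category in sorted(hits)]


def annotate_line(line):
    """Return (term, category, status) tuples for one line of a note."""
    match = SLOT_VALUE.match(line)
    if match:
        slot_terms = find_terms(match.group("slot"))
        if slot_terms:
            # The concept sits in the slot; the dependent value decides its status.
            value = match.group("value").strip().lower()
            status = "asserted" if value in POSITIVE_VALUES else "negated"
            return [(t, c, status) for t, c in slot_terms]
    # Free text: every lexicon hit is treated as asserted (no negation handling).
    return [(t, c, "asserted") for t, c in find_terms(line)]


if __name__ == "__main__":
    for example in ["Homeless: Y", "Homeless: Y N", "Veteran has been couch surfing."]:
        print(example, "->", annotate_line(example))
```

In this sketch an unfilled template line such as `Homeless: Y N` is tagged as negated because its value is not exactly one positive variant, which mirrors the rule that any other dependent content negates the concept.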

Error analysis for training NLP pipeline
False positive error analyses were performed by two human reviewers on a subset of 50 documents that were classified by the human reference standard as “positive” evidence of homelessness and 50 documents with “no” evidence of homelessness. The task was to review all the concepts extracted by the NLP pipeline (true and false positives) and classify each concept as either positively asserted or not, thus establishing the proportion of true positives among extracted concepts, which is the positive predictive value (PPV). This was performed using eHOST and reviewing text documents highlighted with concepts identified by the NLP pipeline [18]. The false positive analysis was used to identify problem areas in the text such as templates (such as question/answer formats) and refine the NLP pipeline in an iterative manner. As false positives in the initial review were noted to be due to negation and boilerplated text such as check boxes and questions, modules to address these issues were added to the NLP pipeline (NegEx for negation and slot:value and question for boilerplated text). Further, false negative analyses were performed by reviewing the text of the documents to identify additional concepts related to homelessness that were not already identified by the NLP algorithm. These words, terms or phrases were then included in the lexicon to be used in the next iteration of the NLP algorithm.

Positive predictive value of the NLP pipeline for extracting positively asserted concepts
An agile review of 50 documents each from the homeless and non-homeless categories was used to improve the overall PPV of concept extraction in an iterative manner. The final, refined iteration of the NLP pipeline was used to determine the instances of positively asserted concepts in eight categories in the 862 documents of the reference standard.
After excluding the 100 documents used above for the error analysis and training of the pipeline, the remaining 762 documents were reviewed by two human reviewers to determine the true positivity or PPV of the concepts identified by the NLP pipeline. PPV (precision) alone was determined, rather than both precision and recall (sensitivity) as is often reported in the informatics literature, because of the low prevalence of concepts of interest and the resulting high number of documents that are expected to be negative with regard to concepts. Data extraction and analyses were performed using SAS (Version 9, SAS Institute Inc., Cary, NC) and R software (Version 2.15.0, The R Foundation for Statistical Computing, Vienna, Austria).

This study was approved by the Institutional Review Board of the University of Utah and Research Review Committee of the VA Salt Lake City Health Care System. The research protocol was granted waiver of authorization and waiver of consent to access existing electronic medical records in VA research databases.

Results

Lexicon of Concepts Related to Homelessness
The final version of the manually created lexicon contained 202 high-level psychosocial and homelessness related concepts, divided into the eight categories already mentioned. Table 1 lists the categories and examples of the terms and concepts included in each one.

Reference standard of free text documents for concept extraction
Starting with a corpus of 1000 documents, our annotators reviewed a total of 862 documents for the reference standard after removing 138 documents that were used for training or excluded for logistical reasons. Of these, 424 were classified as having evidence of homelessness at the document level (Homeless) and 438 as having no evidence of homelessness (Not Homeless). Reviewers had 98% overall inter-rater agreement in classifying the documents.
The top 5 document types in terms of frequency as determined by the note title in the “Homeless” group were: HCHV Healthcare for Homeless Veterans, Homeless Program Initial Assessment (X), Homeless Program/OPC/SOAP/Social Work, Homeless Program Intake, and Healthcare for Homeless Veterans. The top 5 most frequent document types in the “Not Homeless” group were: Addendum, Discharge Summary, Primary Care, Ambulatory Outpatient Care Note, and Primary Care Clinic Notes.
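The PPV and inter-rater agreement figures used throughout this study are simple proportions. The following Python sketch shows how such proportions could be computed; the function names and the example counts are hypothetical and are not taken from the study data. Agreement is shown for a single pair of reviewers, which is an assumption, since the text does not state how the overall agreement across three reviewers was calculated.

```python
def positive_predictive_value(true_positives, false_positives):
    """PPV = true positives / all concepts extracted by the pipeline."""
    extracted = true_positives + false_positives
    if extracted == 0:
        raise ValueError("no extracted concepts to evaluate")
    return true_positives / extracted


def percent_agreement(labels_a, labels_b):
    """Fraction of documents on which two reviewers gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


if __name__ == "__main__":
    # Hypothetical counts: 3 extracted concepts confirmed, 1 rejected on review.
    print(positive_predictive_value(3, 1))
    # Hypothetical document-level labels from two reviewers.
    reviewer_1 = ["Homeless", "Not Homeless", "Homeless", "Not Homeless"]
    reviewer_2 = ["Homeless", "Not Homeless", "Not Homeless", "Not Homeless"]
    print(percent_agreement(reviewer_1, reviewer_2))
```

Note that PPV needs only the concepts the pipeline extracted, which is why it can be estimated by reviewing NLP hits alone, whereas recall would require reviewers to find every missed concept in every document.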

Figure 1: Homelessness psychosocial risk factors NLP pipeline

Table 1. Lexicon concept categories and examples of terms.
Concept Categories | Examples
Behavioral Health | Addiction, alcohol/drug abuse or dependence, detox, DUI, pathological gambling, violent behavior
Direct Evidence | Homeless, living on streets, sleeping outdoors, lack of housing, V60.0
Doubling up | Doubled up, couch surfing, lives with parents/significant other/sibling, crashing at friend’s house
Medical Comorbidities | Hepatitis C, frostbite, HIV, gangrene
Mental Health | Axis II, poor coping skills, social isolation, paranoia, schizophrenia, depression, anger
Other Risk Factors | Emergency medical services, at risk ethnic or racial groups, no ID, needs ID
Sexual & Other Trauma | Sexual abuse/trauma, childhood abuse/trauma, domestic violence, military sexual trauma (MST)
Social Stressors | No or poor family support, recent divorce, death of close family member, job loss, legal issues

UMLS and BioPortal searches
The search for ‘homeless’ and ‘homelessness’ on the NCI Metathesaurus revealed one unique concept of homelessness with several synonyms (concept unique identifier, CUI C0237154). A review of the relationships of this concept to others in the form of parent, child and sibling relationships (representing hierarchies of relationships between associated concepts) reveals mapping to several risk factors related to homelessness. For example, parent concepts include social problems; child concepts include temporary shelter arrangements; sibling concepts include social behavior and substance abuse disorders. The challenges in directly applying NLP algorithms that map free text to unique concepts in the Metathesaurus are the problem of one term mapping to several different related concepts and the lack of knowledge of how these concepts are represented in the electronic medical record by diverse types of providers in a healthcare system.
The BioPortal search resulted in 648 terms or phrases that were mapped to the initial 201 high-level concepts. These included similar terms, mostly from SNOMED, LOINC and Medical Subject Headings. The results were manually reviewed by two reviewers and ultimately, 185 concept unique identifiers (CUIs) that are psychosocial and homelessness related were retrieved. However, about 62% of the initial concepts did not yield any mapping from BioPortal.
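The two search strategies used against the ontologies (exact match and "at least one" term) can be sketched in Python over a local list of labels. This sketch does not call BioPortal or any real ontology service; the label list, the function names, and the whitespace tokenization are assumptions made purely to illustrate why the looser search returns more candidates that then need manual curation.

```python
def exact_match(term, labels):
    """Labels identical to the lexicon term, ignoring case."""
    wanted = term.casefold()
    return [label for label in labels if label.casefold() == wanted]


def any_word_match(term, labels):
    """Labels sharing at least one whitespace-separated word with the term."""
    words = set(term.casefold().split())
    return [label for label in labels if words & set(label.casefold().split())]


def unmapped_fraction(lexicon_terms, labels):
    """Share of lexicon terms with no exact-match label."""
    if not lexicon_terms:
        raise ValueError("lexicon is empty")
    unmapped = [t for t in lexicon_terms if not exact_match(t, labels)]
    return len(unmapped) / len(lexicon_terms)


if __name__ == "__main__":
    # Hypothetical ontology labels, not retrieved from any real source.
    labels = ["Homelessness", "Homeless", "Job loss", "Loss of appetite"]
    print(exact_match("job loss", labels))
    print(any_word_match("job loss", labels))
    print(unmapped_fraction(["homeless", "job loss", "couch surfing"], labels))
```

The looser search pulls in unrelated labels such as "Loss of appetite" for the term "job loss", which illustrates why both result sets were reviewed by hand.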

Prevalence of concepts in document corpus
The iterative development of the NLP pipeline based on an agile review of false positives in 50 documents each of homeless and non-homeless resulted in improvement of the positive predictive value (PPV) of concept extraction (Table 2). While several iterations resulted in improvement in PPV of extracted concepts, some iterations resulted in degradation of PPV in certain concept categories. Table 2 shows three representative steps of the iterative process with corresponding numbers of hits from NLP and associated PPV. The final iteration of the NLP pipeline (version 10) was then applied to the entire reference standard corpus (Table 3). A total of 4251 concepts were identified from the 862 documents. Among the documents with ‘homeless’ in the note title, 403 out of 417 (97%) contained at least one concept. Based on a human review of the false positive concepts, the overall positive predictive value (PPV) for extracting homeless-related concepts from this set of documents was 76%. Among the random sample of documents, only 150 out of 455 (33%) contained at least one concept, with an overall PPV of 77%.

Table 2. True positive (Positive Predictive Value) during NLP pipeline iterations based on error analysis

Columns: NLP version; Error Analysis Comments; Documents with evidence of homelessness (N=50 documents); Documents with no evidence of homelessness (N=50 documents).
Concept Categories within each document group: Direct; Mental Health and Behavioral; Social Stressors; All others. Cell values: Number of True + False Positives (PPV%).

NLP Version 6 Key word; Negation with phrasal span

False positives noted in

1323

templated questions;

(85)

Social

All

Stressors

others

109 (85)

46 (80)

20 (65)

50 (68)

20 (70)

10 (40)

239 (80)

79 (95)

49 (69)

Behavioral 558 (87)

57 (72)

80 (73)

acceptable PPV Decrease in overall

850

Templated question/answer

positivity due to ignoring

(72)

sections ignored

template question/answer;

NLP Version 8

881 (84)

Mental Health And

344 (72)

208 (82)

16 (63)

36 (44)

decrease in PPV NLP Version 10 Recognize template

total all 3690

1557

1005 (61)

(76)

681 (91)

66 (79)

14 (71)

question/answer; also template with colon (:) and[]

In addition, the “homelessness” documents had an average of 8 concepts per document, as compared to 2 concepts per document in the random sample. Table 3 contains additional information related to the prevalence of concepts, including concepts per category, PPV by category, and most frequent concepts from each category.

False positive analysis
The human review of documents for false positive and false negative analyses was conducted using eHOST (Figure 2). The most common reason for false positivity was that the concept was part of a templated question, such as “How long have you been homeless?” In addition, the context of the word or alternate meanings of a word or abbreviation led to several false positives. For example, the abbreviation SA is often used to refer to substance abuse and was included in the lexicon. However, in the context of a medication list, SA TAB refers to sustained action. The phrase rule out is commonly used in medical records to literally indicate that the providers were trying to rule out a particular condition or diagnosis. This had to be added as a term to ensure the negation module picked it up as one unit within one phrase to appropriately negate the rest of the phrase.
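Two of the false positive patterns above, the multi-word trigger "rule out" and the ambiguous abbreviation SA, can be illustrated with a short Python sketch. This is a simplified stand-in and not the NegEx algorithm used in the study: the trigger list, the five-token look-back window, the medication-form tokens, and the function names are all assumptions.

```python
import re

# Assumed negation triggers; "rule out" is kept as one multi-word unit.
NEGATION_TRIGGERS = ["rule out", "denies", "no"]
WINDOW = 5  # assumed number of tokens to look back before a concept

# Assumed tokens that mark "SA" as a medication form (sustained action).
MEDICATION_FORMS = {"tab", "tabs", "tablet", "cap"}


def tokenize(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9']+", text.lower())


def is_negated(sentence, term):
    """True if a negation trigger appears shortly before any mention of term."""
    tokens = tokenize(sentence)
    term_tokens = tokenize(term)
    n = len(term_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == term_tokens:
            # Pad with spaces so triggers match whole tokens only.
            preceding = " " + " ".join(tokens[max(0, i - WINDOW):i]) + " "
            if any(f" {trigger} " in preceding for trigger in NEGATION_TRIGGERS):
                return True
    return False


def sa_means_substance_abuse(sentence):
    """True if 'SA' occurs and is not followed by a medication-form token."""
    tokens = tokenize(sentence)
    for i, token in enumerate(tokens):
        if token == "sa":
            following = tokens[i + 1] if i + 1 < len(tokens) else ""
            if following not in MEDICATION_FORMS:
                return True
    return False


if __name__ == "__main__":
    print(is_negated("Rule out substance abuse.", "substance abuse"))
    print(is_negated("History of substance abuse.", "substance abuse"))
    print(sa_means_substance_abuse("Morphine 15 mg SA TAB twice daily"))
    print(sa_means_substance_abuse("Pt reports hx of SA and depression"))
```

Treating "rule out" as a single trigger matters because neither "rule" nor "out" alone signals negation; matching on padded token strings keeps the two words together as one unit.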

Table 3. Prevalence of concepts related to homelessness in reference standard document corpus (Total N=862 documents, Total positively asserted concepts = 4251)
Documents with evidence of Homelessness (N=403 documents)*: Total concepts = 3309; average 8 concepts per document, range 1 to 34, median 8.
Documents with no evidence of Homelessness (N=150 documents)**: Total concepts = 320; average 2 concepts per document, range 1 to 13, median 1.
Columns for each document group: Number of Concepts (Positive Predictive Value %); Most Frequent Concepts; Most Frequent Document Types.

Direct Evidence

1557 (76)

shelter,

homeless, social

shelter,

psychiatry nursing

homelessness, hchv,

work, mental health

rehabilitation,

assessment,

homeless program,

program notes

halfway house,

mental health

20 (70)

housing program

rehabilitation Mental Health and

1016 (62)

Behavioral Factors

detox, heroin, drug

homeless, social

use, etoh, alcohol

work, discharge

abuse

etoh, alcohol use,

discharge

withdrawal,

summaries,

summary and mental

alcohol abuse,

general mental

health program notes

substance abuse,

195 (77)

health note, follow-up primary care

Social Stressors

681 (91)

divorced,

homeless, social

unemployed,

work, discharge

unemployed, lives

discharge

summary and mental

alone, prison, jail

summaries,

separated, prison, jail

66 (95)

divorced,

health program Other

55 (75)

trauma, hep c, mst

homeless, medical progress and social work program

mental health, ,

history & physical 39 (67)

mst, trauma,

primary care,

hep c

medical clinic, mental health note

* 14 of 417 (3%) documents with evidence of homelessness had 0 concepts
** 305 of 455 (67%) documents with no evidence of homelessness had 0 concepts
hchv = health care for homeless veterans; etoh = alcohol; hep c = hepatitis c; mst = military sexual trauma

False negative analysis
In the first round of review, the most common terms that represented concepts related to homelessness that were found in the free text and were missing from the lexicon were terms related to alcohol use. For example, in a template used in screening for alcohol use, the phrase used was ‘how many drinks containing alcohol do you consume each day?’; similarly, several creative variations on the theme of heavy drinking were noted in the free text, including the term ‘problem drinker’. Other examples of missed concepts were due to misspelling and the use of different cases (capital vs. lower case) in a non-standard fashion. For example, typically the usage for the Housing and Urban Development, VA Assisted Housing is represented as HUD-VASH; in some instances, the provider used a nonstandard mix-up of upper and lower case letters such as Hud-Vash which was not recognized by the NLP program as a variant (even though NLP programs have a case-insensitive algorithm built in).

Discussion
Using the VA electronic record and informatics methods, this study demonstrates the feasibility of extracting positively asserted concepts related to homelessness from the free text of medical records. Sub-optimal mapping of concepts to existing ontologies demonstrated the necessity for developing a customized lexicon for concepts related to homelessness. Homelessness may be considered a ‘non-medical’ condition and as such this represents an interesting use of NLP to extract non-medical concepts from medical records. It is important to note that the established risk factors for homelessness (as denoted by the different categories in Table 1) are very much part of the medical domain and are expected to be found in medical records. The potential applications of the lexicon and NLP algorithm are to study the prevalence of these concepts among those who have evidence of homelessness versus those who do not. These analyses may also play a role in

determining associations of concepts with the onset of homelessness and in the ideal situation, provide a means of identifying early warning indicators of risk of homelessness, prior to the formal ‘diagnoses’ through ICD-9-CM codes for homelessness or known risk factors. This could be achieved by applying the algorithm on longitudinal electronic medical records of Veterans. With the use of available ontologies and vocabularies, this work could be extended to other medical domains.

Figure 2. Synthetic record containing mentions of homelessness as displayed in eHOST, the annotation tool used.

The issue of templates in VA medical records has been shown to be a major factor in poor performance of information extraction algorithms [27], as demonstrated by the use of a version of this pipeline in a study of extracting psychosocial factors from large electronic medical record corpora [10]. This study demonstrates the feasibility of successfully processing question/answer and other formats of templates to extract positively asserted concepts to increase yield (hit rate of concepts) and PPV. This has broad applicability in the field of NLP as templated medical records are used in many settings to document patient-related information both by providers (forms and checklists in the EMR) and by patients (in-clinic, through kiosks and on-line surveys and questionnaires).

The iterative process by which the lexicon was developed and made available for use by the NLP algorithm reinforces the challenges of ‘teaching’ computers to read human language. The myriad variations possible in terms of spelling, meaning and abbreviations, the variation noted in text written by different types of providers and regional variations lead one to conclude that this likely represents a sub-language of its own.

We note several limitations. Based on our experience, the lexicon for homelessness concepts is neither exhaustive nor complete. Thus we would consider the current lexicon to be a living document. We did not attempt to add to the lexicon using automated or ‘self-learning’ methods. It is likely that there are variations in use of concepts related to homelessness based on sub-language differences among various clinical providers or based on geography (by state or region of the US). Thus, it would be important to study such variations to look at iso-semantic concepts.
A more detailed mapping of our lexical terms to available vocabularies and ontologies is necessary and ongoing as it is desirable to convert all NLP-derived phenotypes to structured data. We applied the NLP pipeline to a small subset of VA records in this study. A version of the pipeline has been applied to a large corpus with acceptable performance [10]. While this pipeline has been developed using VA medical records, the principles are generalizable and adaptable to the EMR of other large health care systems. Errors in concept assertion status (asserted, negated, hypothetical, historical, or not relating to the patient) are an ever-present source of failure within the NLP process. NegEx2 [28] was used for this study to determine asserted/negated status. Future iterations will incorporate a refined version of ConText to determine assertion status with more granularity and precision [29].

Concepts of interest extracted from the NLP pipeline are not sufficient to make a homelessness classification for a given patient. On-going work involves using concepts of interest extracted from reference documents as features to train a machine learning predictive model.

Conclusion
Extracting information related to a non-medical condition from the free text of medical records can be a challenge. In this use case involving concepts related to homelessness, it required the development of a specialized lexicon, as the concepts of interest were poorly covered by existing ontologies. This method can be applied to other areas of interest and relevance while mining the EMR.

Acknowledgements
This work is supported by VA HSR&D Merit Review Award # HIR 10-002 (PI: AVG). We would like to express our gratitude to the administration and staff of the VA Informatics and Computing Infrastructure (VINCI) for their support of our project. The project benefited immensely from an active advisory role by the VA SLC Health Care System’s homeless service coordinator, Mr. Aldo Hernandez. We are deeply appreciative of the expertise and active participation in this project by our homeless service community partners in Salt Lake City, Utah: Fourth Street Clinic (Mr. Monte Hanks); The Road Home (Ms. Michelle Flynn and Ms. Michelle Vasquez), Volunteers of America (Ms. Jessica Fleming, Ms. Jamie Jones) and the State of Utah (Ms. Kathleen Moore). We would like to thank our colleagues at the National Center on Homelessness Among Veterans (Drs. Dennis Culhane, Steven Metraux and Jamison Fargo) for their discussions and advice. We gratefully acknowledge the support of our research team members: Mr. Thomas Ginter, Ms. Sarah Craig and Ms. Natalie Kelly. We also acknowledge other resources and facilities provided by the VA Salt Lake City Health Care System.
The views expressed in this article are those of the authors and do not necessarily reflect the position or policy of the Department of Veterans Affairs or the United States government. References

1. United States Interagency Council on Homelessness, Opening Doors: Federal Strategic Plan to Prevent and End Homelessness, United States Interagency Council on Homelessness, Editor. 2010, United States Interagency Council on Homelessness: Washington, DC.
2. US Department of Veterans Affairs. VA Information Resource Center. 2012 [cited 2012]; Available from: http://www.virec.research.va.gov/DataSourcesName/DataNames.htm.
3. U.S. Department of Veterans Affairs Office of Inspector General, Homeless Incidence and Risk Factors for Becoming Homeless in Veterans, VA Office of Inspector General, Editor. 2012, VA Office of Inspector General: Washington, DC.
4. Kashner, T.M., Agreement between administrative files and written medical records: a case of the Department of Veterans Affairs. Med Care, 1998. 36(9): p. 1324-36.
5. Schneeweiss, S., et al., Veteran's affairs hospital discharge databases coded serious bacterial infections accurately. J Clin Epidemiol, 2007. 60(4): p. 397-409.
6. Roumie, C.L., et al., Validation of ICD-9 codes with a high positive predictive value for incident strokes resulting in hospitalization using Medicaid health data. Pharmacoepidemiol Drug Saf, 2008. 17(1): p. 20-6.
7. Banerjea, R., et al., Co-occurring medical and mental illness and substance use disorders among veteran clinic users with spinal cord injury patients with complexities. Spinal Cord, 2009. 47(11): p. 789-95.
8. Tracy, L.A., et al., Predictive ability of positive clinical culture results and International Classification of Diseases, Ninth Revision, to identify and classify noninvasive Staphylococcus aureus infections: a validation study. Infect Control Hosp Epidemiol, 2010. 31(7): p. 694-700.
9. Redd, A., et al., Detecting earlier indicators of homelessness in the free text of medical records. Stud Health Technol Inform, 2014. 202: p. 153-6.
10. Gundlapalli, A.V., et al., Validating a strategy for psychosocial phenotyping using a large corpus of clinical text. J Am Med Inform Assoc, 2013. 20(e2): p. e355-64.


11. Meystre, S.M., et al., Extracting information from textual documents in the electronic health record: a review of recent research. Yearb Med Inform, 2008: p. 128-44.
12. Chapman, W.W., Closing the gap between NLP research and clinical practice. Methods Inf Med, 2010. 49(4): p. 317-9.
13. Jha, A.K., The promise of electronic records: around the corner or down the road? JAMA, 2011. 306(8): p. 880-1.
14. Nadkarni, P.M., L. Ohno-Machado, and W.W. Chapman, Natural language processing: an introduction. J Am Med Inform Assoc, 2011. 18(5): p. 544-51.
15. Balshem, H., et al., A Critical Review of the Literature Regarding Homelessness Among Veterans, in A Critical Review of the Literature Regarding Homelessness Among Veterans, US Department of Veterans Affairs, Editor. 2011: Washington (DC).
16. McCray, A.T., S. Srinivasan, and A.C. Browne, Lexical methods for managing variation in biomedical terminologies. Proc Annu Symp Comput Appl Med Care, 1994: p. 235-9.
17. U.S. Department of Housing and Urban Development. Federal Definition of Homelessness. 2011 [cited 2011 December 3]; Available from: http://portal.hud.gov/hudportal/HUD?src=/topics/homelessness/definition.
18. South, B., et al. A Prototype Tool Set to Support Machine-Assisted Annotation. in BioNLP 2012. 2012. Montreal, Canada.
19. Ogren, P.V. Knowtator: A Protégé plug-in for annotated corpus construction. in Proceedings of the Human Language Technology Conference of the NAACL. 2006.
20. National Cancer Institute. NCImetathesaurus. 2012 [cited 2012]; Available from: http://ncim.nci.nih.gov/ncimbrowser/.
21. Noy, N.F., et al., BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Res, 2009. 37(Web Server issue): p. W170-3.
22. Whetzel, P., et al., BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Res, 2011. 39(Web Server issue): p. W541-5.
23. Aronson, A.R., Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp, 2001: p. 17-21.
24. Ferrucci, D. and A. Lally, UIMA: an architectural approach to unstructured information processing in the corporate research environment. Nat. Lang. Eng., 2004. 10(3-4): p. 327-348.
25. Zeng, Q.T., et al., Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Med Inform Decis Mak, 2006. 6: p. 30.
26. Savova, G.K., et al., Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc, 2010. 17(5): p. 507-13.
27. Divita, G., et al., Recognizing Questions and Answers in EMR Templates Using Natural Language Processing. Stud Health Technol Inform, 2014. 202: p. 149-52.
28. Chapman, W.W., et al., A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform, 2001. 34(5): p. 301-10.
29. Chapman, W.W., D. Chu, and J.N. Dowling, ConText: an algorithm for identifying contextual features from clinical text, in Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing. 2007, Association for Computational Linguistics: Prague, Czech Republic. p. 81-88.

