An Automatic Indexing Method for Medical Documents Michael M. Wagner Section of Medical Informatics, University of Pittsburgh School of Medicine Intelligent Systems Program, University of Pittsburgh This paper describes Metalndex, an automatic indexing program that creates symbolic representations of documents for the purpose of document retrieval. MetaIndex uses a simple transition network parser to recognize a language that is derivedfrom the set of main concepts in the Unified Medical Language System Metathesaurus (Meta-1). Metalndex uses a hierarchy ofmedical concepts, also derived from Meta-1, to represent the content of documents. The goal of this approach is to improve document retrieval performance by better representation of documents. An evaluation method is described, and the performance of MetaIndex on the task of indexing the Slice of Life medical image collection is reported.

1. Introduction In this paper, we describe a method for automatically creating representations of documents for use in document retrieval systems. The process of constructing such representations is called automatic indexing. The earliest description of an automatic indexing method is attributed to Luhn [11]. Luhn's basic idea, which is now called the vector-space model, was that simple text features (e.g., words or phrases) could be identified automatically in documents and used to represent the documents for indexing and retrieval. The properties of the vector-space model have been extensively investigated [4, 13-15]. One of the principle results of this body of research is that there is an upper bound on performance of document retrieval systems using this method. Partly because of this performance limitation, there has been recent interest in the use of symbolic representations of documents in document retrieval systems [5]. We use the term symbolic representation to mean a representation of the objects in some domain (and their inter-relationships) that has the property of cognitive correspondence, i.e., the symbols correspond to some natural human conceptualization of the domain. One example of a symbolic representation for indexing is the Medical Subject Heading (MESH) thesaurus used by MEDLARS1. MESH subject headings correspond to medical concepts and the inter-relationships between subject headings in the MESH trees correspond to medical taxonomic relations. Symbolic representations may lead to improved document retrieval performance because they can potentially represent the contents of documents better than the vector-space model (e.g., "treatment of endocarditis" can be represented in MESH, but not in a vector space model). However, a key

drawback of symbolic indexing is that it is difficult to automate the process, in the general case [5]. The obstacles to automatic generation of symbolic representations are significant, being similar to the problems of automatic natural language understanding. These obstacles can broadly be described as problems with ambiguities (e.g., word sense, pronoun reference) that may require vast amounts of general and specific knowledge to resolve. Although currently there is no general solution to the problem of natural language understanding, for selected classes of medical documents it has been possible to create reasonably accurate representations of the content automatically from text [3, 9, 12]. This paper describes a Meta-l based method that can do this for a medical image collection. 2. The Slice of Life Medical Image Collection

The document collection used in this work is the Slice of Life (SOL) medical image collection, a collection of more than 31,000 medical images [20, 21] on a videodisc. Each image in SOL is represented by a record in a flatfile database that contains both coded and free text fields. The coded fields indicate the type of image, the stain (if applicable) and magnification (if applicable). The text fields contain the title of the slide and a brief description (Figure 1). The free text descriptions in the SOL database are linguistically simple. The descriptions consist of brief phrases that usually describe one central object (e.g., a heart valve) and some property of the object to which attention is being drawn (e.g., endocarditis). The language of the descriptions is often elliptical (missing words, e.g., "pituitary" instead of "pituitary gland"). The descriptions also contain non-standard abbreviations and misspellings. However, there is very little word sense ambiguity (e.g., foot does not mean twelve inches or the base of a cliff). Expressions involving negation, quantification, or uncertainty are rare. The language is devoid of metaphor, anaphora and other computationally problematic features.

1 Medical Literature Access and Retrieval Service

0195-4210/91/$5.00 © 1992 AMIA, Inc.

1011

-

Esophagitis, Candida, Mult Diffuse Plaq., 6 mo painful 1 . dysphagia, 40 lb wgt loss, no predisposition Menstrual Cycle days 6-10 endometrial cells; cervical 2. smears: gynecologic cytology

Mitral Insufficiency, Floppy Mitral valve with Ruptured 3. chordae Figure 1. Three image descriptions from SOL database (the title and comment fields are merged in this figure). _.

3. Metalndex

quently used inter-concept link type is the narrower-than type (NT). The NT link type is a standard link type used in

MetaIndex is a program that processes the free text SOL descriptions to generate a symbolic representation of SOL images. Since Metalndex makes extensive use of the information in Meta-1, a description of Meta-1 is first presented. Meta-1 is also described by its documentation [1], and in a series of papers by Sherertz and Tuttle [16-18, 2224]. 3.1 The Unified Medical Language System Metathesaurus (Meta-1) In this project, we used a pre-release version of Meta-1, which we denote as Meta-1p, because only the pre-release version was available when we began our investigation. To avoid confusion, we will use the term Meta-1 in this paper when comments apply to the general Metathesaurus schema, and reserve the term Meta-lp for comments that pertain just to Meta-Ip. For clarity, we use three figures to describe the organization of information in Meta-1. As a starting point, we view Meta-1 as a very large set of unrelated strings of medical language (Figure 2) from different source nomenclatures. These strings are predominantly simple noun phrases, e.g., Mitral Valve Prolapse. Heart Disease/MESH

Disease; heartACD

Endocardiis/MESH

Cardiac Diseases/MESH

Prolapsed; mitral valve/lCD Diseased; hear/lCD

HEART DIS/MESH Disease Syndromes Heart/SNOMED Mitral click-murmur syndrome/MESH Prolapse; mitral valve/lCD

Heart Diseases/MESH

Heart Diseases/MESH Heart Disease/MESH

Disease; hear/lCD Cardiac Diseases/MESH Diseased; heart/ICD HEART DIS/MESH Disease Syndromes HearVSNOMED HEART DISEASE NOS#ICD

ECrditi/ME

Mitral valve grolaose/MESH

Systolic click-murmur syndrome/MESH Floppy Mitral Valve/MESH Prolapsd; mitral valve/lCD

Prolapse; mitral valve/lCD Mitral click-murmur syndrome/MESH

Figure 3. Four of the -30,000 concepts in Meta-lp. The string chosen to be the name of the concept is underlined.

the field of information retrieval that is equivalent to the combination of two standard semantic network relation types, part-of and is-a. A NT link means that anything classified under the left member of the relation is an instance of, or a part of, the the right member. For example, heart diseases is-a diseases. In Meta-1, the set of NT links between concepts was created by a team of Meta-I editors using the set of inter-concept links in the MESH hierarchy to identify candidate relationships. A variety of more specific link types (e.g., is-a and manifestation-of) are also included on an experimental basis in Meta-1.2 This level of organization can be interpreted as a semantic network, and used to make inferences. Therefore, we refer to this as a domain model.

Mitral Valve/MESH

HEART DISEASE NOS/ICD

Floppy Mitral Valve/MESH Diseases/MESH

Mitral valve prolapse/MESH

Systolic cdick-murmur syndrome/MESH

Figure 2. Seventeen of -97,700 strings in Meta-lp with Abbreviations: MESH- Medical Subject Headings, ICD- International Classification of Diseases, SNOMED- Systematic Nomenclature of Medicine. source.

This set of strings is clustered into conceptual equivalence classes based on two types of links (binary relationships) between strings in Meta-1: lexical variant and synonymy links (Figure 3). Two strings are lexical variants if one is a reordering of the words of the other, e.g., "Huntington's disease" and "disease; Huntington's" or if they differ only in capitalization or other minor typographical ways, e.g., "Huntingtons Disease." Two strings are synonyms if they are synonyms in one of the Meta-I source nomenclatures. In Meta-1p, there are 19,748 lexical variant relations and 11,700 synonymy relations. For convenient reference, each concept is given the name of one of its constituent strings. The highest epistemological level in Meta-I is defined by a set of links between concepts (Figure 4). The most fre-

3.2 MetaIndex Components MetaIndex consists of three logical components, a preprocessor, a parser, and a domain model. Because the parser and domain model use Meta-1, they will be described first.

Parser The parser recognizes a language similar to the language defined by the set of strings in Meta-1p. These strings are found in the main concept and lexical variant files (i.e., the MRMC and MRLV files) in Meta-1p. In MetaIndex, we 2 The links are in file MRCTX.;1 in the 9/90 Meta-1 release.

1012

consider each of these strings to be a legal expression in a "activated" and placed on a search queue. Nodes are then medical language, and that the set of these strings taken to- taken off the queue in turn and their children are visited. gether completely defines the language3. We justify the ad- Whenever a child is visited, its list of words is marked equacy of this language of noun phrases for the natural language I~~~~~~~~~~~ processing of SOL descriptions B. by pointing out that the SOL A. language is also basically a i; 1ilrai Valve Prdapse I _ simple noun phrase language. E~ ~ itrA Vabe Prdapsel Because this language is finite (i.e., every legal expression can / ./ \ \ be listed), a simple parser for it Bactera Endocardits Vav pilral Badena Endocardim Ropit could be based on a dictionary 4 data structure with a member / / \/ operation. However, we use a w it slightly more complex parser, Fndocaritis PAP for reasons that we will discuss. We organize the set of Meta-1 "Endocarditis in a floppy itral valve" 'EndocadJtls In a lloppymitraI Volvo" strings into a graph in which i each node represents a string from the Meta-1 language. To Figures 5A and 5B. Transition network before and after parsing the phrase "Endocarditis in a this structure we add another set floppy Mitral valve." Strings in the "Meta-lp language" are in boxes; unboxed strings are not of nodes that represent strings in the language. Shading indicates recognition of a string. Transition conditions are not that are not in the language, but labelled but can be inferred in a straightforward manner (e.g., transition from "Valve" to "Mitral are necessary to convert the Valve" requires the word "Mitral") structure into a simple transition network (Figure 5A). based on the words in the parent node. For example, when Specifically, the transition netw(ork is constructed by first the node Floppy Mitral Valve is visited from the node downcasing and removing most puinctuation marks from the Floppy Mitral, the words Floppy and Mitral are marked. set of strings in the main concept file. Then,, wegroup the When all of a child's words are marked, it is itself terms from the main concept file into size classes by the "activated" and put on the search queue. The goal of this number of words in the terms. Let n be the size of the algorithm-recognition of legal expressions in the lanlargest class. The network is then constructed by adding a guage-is achieved whenever a node representing a Meta-Ip pair of parent nodes for each node in sizetclass n. Thefirst expression is "activated." This approach is similar to that parent consists of the first n-i woi rds and the second parent described by Shoval [19]. consists of words 2 through n. For example, the term Note that this parser does not use the information about coronary artery disease generates tl eparentscoronary artery lexical variants contained in the Meta-lp MRLV files. and artery disease. Pointers frorn parents to children are Instead, word-order variants are recognized because the algomaintained, and duplicate nodes in size class n-i are merged rithm that traverses the transition network is word-order inThis process is repeated for each smaller size class until sensitive. Thus the node a b c will be activated by the inn=1. Note that we do not generate all n-i word parents of a put a c b. We chose this design because unusual word node; the two parents described ar e sufficient for the word- orderings, perhaps not in the MRLV file, are common in SOL. Punctuation and capitalization lexical variants are order independent phrase recogniticin that we desire. recognized because we remove punctuation marks and downnhawedeir.@ The parser uses a queue-based sp)reading-activation search case capital letters from Meta-1 terms before network algorithm to traverse the transitic n network structure in a construction and from documents before indexing. word-order independent manner. As an example of the beThis parser implementation has several advantages over a havior of this algorithm, consider how the SOL description simple dictionary look-up. First, it ignores word order, Endocarditis in a Floppy Mitral V'alve is processed (Figure thus enlarging the set of word-order lexical variants recog5B). First, each of the individuail words in the phrase is nized. Second, it can potentially be used to resolve elliptipresented to the network. Words t,hat do not match nodes in cal usages in the SOL database. For example, the term the entry level of the network (theD set of single word roots Hashimoto's only has one meaning-Hashimoto's thyroidiof the transition network) are not irecognized and, in effect, tis. If the word Hashimoto's alone was found in the text of eliminated. Those words that are rrecognized (Endocarditis, a SOL database record, it would be reasonable to infer that Floppy, Mitral, Valve) cause the ecorresponding node to be Hashimoto's thyroiditis was the intended meaning. Structurally, this corresponds to a network in which the 3 Because of our handling of lexical variants, the actual set is node representing "Hashimoto's" has only one child, and that child is "Hashimoto's thyroiditis." Note that this is a somewhat different. This will be exjplained. heuristic method which we have not yet implemented. N

V#es

1013

Domain Model The second MetaIndex component, the domain model, is taken directly from Meta-1 as the set of all Meta-l concepts with their NT (is-a, part-of) inter-concept relationships (Figure 4). This domain DomainModel model is a very simple model of medicine, representing only a fraction of the relationships Seaw that exist between medical conIcepLt.

before they are presented to the parser. The preprocessor generates both plurals and singular forms because Meta-l contains a mixture of plural and singular usages.

rUo eAxaIIIpi, in1

reality, endocarditis and mitral valve are related by the fact that endocarditis is a pathologic process that can involve the mitral valve, but this relationship is not encoded in the domain model. We choose to represent images with concepts drawn from this domain model

"Endocarditls In a floppy mitral valve"

IMAMA;fI-M l"nr%IszlMAfT 1h40r-11101 OCCHUSC thFA UIC MCUICal

KHUWICUgO I

6u- A---.,- --A-l uomain moaei in tne encouea :----A-A

------------------

Figure 6. Connection between parser and domain model. The arrows from the language model to the domain

is potentially useful for model indicate that the presence of the string in the SOL free text is sufficient evidence to conclude that the retrieval. For example, the image is about that concept. Shading indicates that a node has been "activated." The output of the parser is the Meta-1 concept pregnancy set of shaded, boxed nodes, e.g., Floppy Mitral Valve, Mitral Valve, Endocarditis, Valve. The representation in complications is broader than the domain model is the two shaded nodes, Endocarditis and Mitral Valve Prolapse, as well as Valve and Mitral abruptio placenta. This Valve (not shown due to lack of space). domain model can potentially support the inference that The preprocessor uses substitution tables to expand abimages relevant to abruptio pLacenta are of interest to breviations and complete ellipses. We developed substitusomeone interested in pregnancy (complications. tion tables for ellipses and abbreviation by inspecting the input and output of MetaIndex on a random 10% sample of to Parser Don from The Mapping zain Model We have described the parser anId the domain model. Next SOL descriptions. Candidate substitutions for inclusion in we show the connection betweein the two. Automatically substitution tables (e.g., thyroid disease for thyroid) were deriving a representation in a doniain model from the output identified. Keyword in context (KWIC) listing of every ocof a parser can be a difficult prc blem. Consider the SOL currence of a candidate substitution were reviewed to ensure description Endocarditis in a Floppy Mitral Valve. One that the term was only used in one sense in SOL before it cannot conclude that the image shiould be represented by the was added to the substitution table. Thirty-four ellipses and terms Endocarditis and Mitral V'alve Prolapse without as- 30 abbreviations were identified and handled in this manner. suming that the text descriptions of SOL images are decla- Note that these elliptical usages differed from those previrations of the content of the imatge. In indexing SOL, we ously discussed because they only became unambiguous in categorically make this assumptLion. We assume that the the context of the SOL collection. For example, thyroid is only expressions that the parser will recognize are state- a highly ambiguous term in Meta-lp (it has fifteen children ments about the medical concelpts that the image can be in the network, e.g., thyroid storm and thyroid effects). used to illustrate. In Meta[Index, this assumption However, in SOL thyroid only denotes the thyroid gland. corresponds to a node in the parse-r being linked to a node in Therefore, this substitution was permitted. the domain model if and only if the text string represented 4. Evaluation of MetaIndex by the node is a member of the conceptual class that is named by the concept in the dlomain model. Figure 6 shows this parser to domain model mapping. The parser 4.1 Methods Five physicians with specialty training in Internal Medicine node representing the text strinigg Floppy Mitral Valve is Mitral Valve Prolapse (2), Pediatrics (1), Neurosurgery (1), or Pathology (1), were relinked to the domain model concisept mitr Valve Mita I cruited to serve as judges. A study set of 126 SOL images was because Floppy Mitral Valve is member of the Mitral selected at random from the collection of 31,500 images. fields for each image record in the SOL database The free Valve Prolapse equivalence class of strings (see Figure 3). ,

a

text

concatenated with the contents of the "organ" field and the resulting string was used as input to MetaIndex, which generated a set of indexes for the record. The "organ" field was inwere

Preprocessor MetaIndex handles plurals, infllections, abbreviations and some elliptical forms by preproi,cessing SOL descriptions

1014

cluded because it contained text that also described the subject of the image. Each judge was given detailed instructions and asked to judge the MetaIndex indexes of 25-35 image records for accuracy and completeness. For each image, the judge was given the SOL database description and the MetaIndex-generated terms. Note that the judges were provided with the SOL database description of the image, not the image itself. The judge was asked to complete two tasks. In the first task, s/he decided whether each index was acceptable relative to the slide description. In the second task, the judge was asked to write in any additional indices that s/he felt were missed by MetaIndex. No time limit was placed on the judging. To assess the reliability of the method, inter-judge variability was measured for a set of ten image descriptions. This set was evaluated by all the judges.

4.2 Results On average, a SOL image was described by 8 words of text and MetaIndex generated 407 indexing terms for the 126 image sample (3.3 terms/image).

Accuracy of Indexing Of a total of 407 MetaIndex generated terms, 372 (91%4) were judged acceptable. For the task of judging accuracy, inter-judge agreement, defined as the proportion of terms generated by MetaIndex for which at least four of the five judges agreed on the acceptability of the term, was 94%

(34/38).

We analyzed the 35 inaccurate indexes to determine the cause of the errors (Table 1). Most (57%) errors resulted from our assumption that the presence of a text string was sufficient evidence for a concept ("wrong term sense"). For example, the index patients was generated from the description Huntington's disease patient with old cva. A smaller number of errors resulted from properties of the algorithm that could be changed easily. For example, for the description Multiple Sclerosis, the indexes sclerosis

Table 1. Analysis of Inaccurate Indexes (N=35) N % Cause 20 (57) wrong term sense 10 (29) term part of longer index term 2 (6) word order effect 1 (3) missed negation (6) 2 codererr terms were listed by at least four of five judges. Since most of these indexes were reasonable indexes, 64% should be interpreted as an upper bound on the completeness of the indexing of the algorithm. This methodological problem is discussed below. We analyzed the 207 indexes that were missed by MetaIndex to determine the cause of the misses (Table 2). We divided the missed terms into two categories, refractory and tractable, based on whether MetaIndex, with modifications, could be expected to generate the missed term. The categorization was done manually by the author by looking for the connection between an index suggested by the judge and the SOL database description. In 34% (71/207), the missed term was considered refractory to algorithmic identification. Of these 71 refractory terms, 3 were missed due to misspellings in SOL (e.g., the word opstructive in the description obstructive lung disease). Seven were unrecognized abbreviations (e.g., nplsm for neoplasm). Seven were elliptical usages that were either missed by the sampling method used to build preprocessor substitution tables, or were ambiguous in the SOL database (e.g., cortex meaning either adrenal, auditory or cerebral cortex). Two missed terms were word variants (e.g., arterial instead of arteries). The remaining 52 refractory terms were missed because they could not be readily identified algorithmically. These terms were the product of sophisticated inference on the part of the physician. For example, from the text "endometrium, 7-10 days, normal" a physician produced the indexes proliferative endometrium and endometrial dating technique.

and Multiple Sclerosis were generated. This tye or error is counted in the category "term part of longer index term." Two errors resulted from ignoring word order. For example, the index cell cycle was generated for the description Menstrual ck Table 2. ALnalysis of indexes that were missed by Metalndex (2O2) N % %Tractable N Refractory days 6-10 endometrial clls; cervical smears: sophisticalted inference 52 (25) missing concept gynecologic cytology. Ignoring negative 66 (32) abbreviati4bns constructions caused one error. 46 (22) 7 (3) missing is-a link link 16 (8) part-of missing (3) 7 ellipses Completeness of Indexing 4 (2) 2 coder error (1) ants varia word To the set of 372 acceptable MetaIndex 4 (2) error 3 program (1) igs 207 misspellin the an additional added judges terms, index terms for a total of 579 correct terms. TOTALS 136 (66) 71 (34) l Thus, the algorithm identified 372/579 (64%5) of the total set of terms that were acceptable to, or volunteered by, the judges. Sixty-six percent (136/207) were considered amenable to However, inter-judge agreement was poor for the task of algorithmic identification, given an augmentation of listing missing terms. For the ten image descriptions Meta-1p. Sixty-six indexes were literally present in the indexed by all five judges, only 6% (3/53) of the additional SOL text and could have been identified if the string had been part of the Meta-1 language. Sixty-two terms would have been identified by simple one-step or chained 4 the 95% confidence interval is 85-94% deductions from the text through is-a (46), or part-of (16) inferences, but the necessary inter-concept links were not S the 95% confidence interval is 56-72%

1015

present in Meta- lp6. The method used to make these determinations involved writing down the inference chain starting with the phrase in the SOL description and checking that each link in the chain was a simple is-a or part-of inference or a synonomy transformation. 5. Related Work

Use ofMeta-1 in Automatic Indexing In 1968, the MESH thesaurus (the principal Meta-1 source) was used experimentally in the SMART system to index MEDLARS documents automatically [14]. More recently, Meta-I has been used in this manner in the SAPHIRE and CLARIT systems [6, 10]. Vries [25] uses thesaurus narrower-than links to identify the most specific indexes for documents (the standard vector-space approach uses thesaurus transformations to find broader-than indexes). Natural Language Understanding (NLU) and Automatic Indexing NLU requires many types of knowledge about language and the world to constrain the interpretation of natural language. These types of knowledge include morphological, syntactic, and semantic knowledge. Prior work related to automatic indexing and NLU can be organized by the types of knowledge used in the systems. Morphologic knowledge is knowledge about the substructure of words (e.g., roots and affixes). Morphologic knowledge is sometimes useful in NLU because the meaning of a medical word, such as appendicitis, can be determined from the meanings of its root and suffix. In NLU, the term syntax is used to refer to knowledge about the lexical types of words (e.g., noun), and the allowed grammatical relationships between lexical types in a language. Syntactic knowledge can be used to constrain the set of possible meanings for words with multiple senses. Syntax also can be used to identify potentially meaningful clusterings of words, e.g., noun phrases. Syntactic parsing has been used to recognize noun phrases in the SMART and CLARIT systems. Semantic knowledge alone is often effective for NLU in technical domains because syntactic knowledge is not needed to constrain the meanings of words. For example, the word foot has a single meaning in most medical sublanguages. Canfield used a purely semantic approach to understand echocardiogram reports [2, 3]. Haug also used a semantic approach to understand chest x-ray reports [9]. Semantic approaches to NLU problems in technical domains have become sufficiently common that the term sublanguage analysis was coined to refer to the process [8]. Semantic knowledge can be used in combination with syntactic knowledge. For example, semantic types can be used as selectional restrictions in basically syntactic parsers. Sager's Medical Narrative Processor [12] is an example of a mixed approach. 6 A subsequent analysis using the inter-concept relations in Meta-1 showed that 17/62 would be identified.

Our approach can be viewed as a semantic approach in which each Meta-1 concept is a semantic type and there is one combination rule stating that a legal SOL description is any combination of zero or more types. Within this framework, it is easy to see that this approach might be extended by adding semantic combination rules that define meaningful ways that types combine in the language. 6. Discussion

Performance ofMetaIndex Our evaluation measured the ability of MetaIndex to generate indexes similar to those generated by, or accepted by physicians. Thus, the gold standard in this evaluation was the set of indexes either generated by or accepted by physicians for the SOL image descriptions. This evaluation design embodies several assumptions: First, we assume that similarity to physician-generated/accepted indexes relates to the real outcome of interest, namely retrieval performance. We think this is a reasonable assumption because there is no reason to believe that physicians use a different language for indexing than they do for querying. However, it is possible that some of the terms physicians accepted are not terms that they would use for retrieval. This effect would tend to inflate the measured completeness of indexing. This design also assumes that the SOL database is an accurate and complete description of the images for information retrieval, i.e., for the needs of the users of the collection. Using the above evaluation methodology, we found that 91% of the MetaIndex generated indexes were judged acceptable. The majority of the unacceptable indexes were a result of our assumption that the presence of a text string was sufficient evidence to conclude that a concept was present. We believe that this rate of false positive indexes is acceptable in an automatic indexing application. Because of methodological problems discussed in the previous section, we were unable to measure the completeness of indexing. The high inter-judge variability suggests that each image could be indexed by many more terms than a single physician would produce. A second problem stems from the fact that we provided the judges with the Metalndex-generated indices and let them mark them as appropriate or not. We do not know whether physicians would have generated all of these terms spontaneously. Both of these problems might increase the measured completeness of indexing spuriously. However, the study identified the principal factors that determine the completeness of indexing. Completeness of indexing depended roughly equally on the completeness of the conceptual coverage of Meta-1, completeness interconcept links in Meta-1, and basic limitations of the MetaIndex approach. Generalization of the Method The performance of NLU methods is determined by the domain of discourse and the linguistic characteristics of the natural language. The accuracy of our approach depends on the simple semantics of SOL descriptions. The semantics allows a simple phrase recognition procedure and a

1016

mapping from text to concepts defined by the synonomy relations in Meta-1, to generate accurate indexes. NLU performance also depends on the granularity of the target meaning representation. Our approach works as well as it does because the target representation was coarse. The physician indexers did not use coordinate indexing terms, e.g., endocarditis, treatment of. We could not achieve such a level of representation (similar to that of MEDLARS indexing, with sub-heading and focus modifiers) using our current approach. Knowledge-based approaches tend to be brittle and fail to generalize to other domains. The current version of MetaIndex is no exception. This approach is applicable to other document collections provided they have the following characteristics: (1) the domain is biomedicine, (2) the semantics are simple (i.e., the presence of a phrase signifies that the document is about the phrase), (3) the representation requirements are coarse, i.e., at the level of MESH subject headings. To index a non-medical domain, a different thesaurus would be required. To handle more linguistic complexity, a different parser and a different method for mapping from parser output to the symbolic representation would be required. The kinds of knowledge required by a parser for a more complex language are not in Meta-1 at the current time. The problem of mapping from parser output to a knowledge representation is a hard problem for which there is currently no general solution. However, there are new approaches to this problem in which the mapping between text features and conceptual representation are learned from examples [7]. 6. Summary and Conclusions

MetaIndex demonstrates a method for the automatic generation of symbolic representations of documents. An evaluation showed that MetaIndex indexing was accurate, but incomplete. Completeness depended equally on the conceptual coverage of Meta-1, the inter-concept links coverage of Meta-1, and the limitations of our approach. The completeness of indexing also depends on the completeness of image descriptions, but this effect was not measured. We have partly defined the strengths and weaknesses of Meta-1 as a knowledge source for automatic indexing. These results also apply to the more conventional use of Meta-1 as the thesaurus component in a vector-space information retrieval system. ACKNOWLEDGMENTS- I wish to thank my advisor, Greg Cooper, and my master's project committee members: Randolph A. Miller, Johanna Moore, and Steven Small. Suzanne Stensaas supplied the Slice of Life Database. Stuart Weinberg and others contributed as judges. Dario Giuse commented on a draft of this paper. This work was supported by National of Library of Medicine training grant T15 LM 07059.

7. References

[2]Canfield K, et al.: Representation and Database Design for Clinical Information. Proc 14th Ann SCAMC. IEEE Comp Soc Press, pp 350-3, 1990. [3]Canfield K, et al.: Database Capture of Natural Language Echocardiographic Reports: A Unified Medical Language System Approach. Proc 13th Ann SCAMC. IEEE Comp Soc Press, pp 559-63, 1989. [4]Cleverdon C, et al.: Factors Determining the Performance of Indexing Systems, Aslib-Cranfield Research Project. Cranfield College of Aeronautics, England, 1966. [5]Croft W: Automatic Indexing. Proc of Am Soc Indexers. pp 85-100, 1988. [6]Evans D, et al.: Automatic Indexing Using Selective NLP and First-Order Thesauri. Proc RIAO-91. 1990. [7]Fung R, et al.: An Architecture for Probabilistic ConceptBased Information Retrieval. Proc. 13th Int. Conf. on Res. Dev. in Inf. Retr. pp 392-404, 1990. [8]Grishman R, Kittredge R: Analyzing Language in Restricted Domains: Sublanguage Description and Processing. Hillsdale, NJ: Lawrence Erlbaum Assoc, 1986. [9]Haug P, et al.: Computerized Extraction of Coded Findings from Free-Text Radiologic Reports. Rad. 174: 543-8, 1990. [10]Hersh W, et al.: Adaptation of Meta-1 for SAPHIRE, A General Purpose Information Retrieval System. Proc 14th Ann SCAMC. IEEE Comp Soc Press, pp 156-60, 1990. [11]Luhn H: A New Method of Recording and Searching Information. Am Doc. 4: 14-16, 1953. [12]Sager N, et al.: Medical Language Processing: Computer Management of Narrative Data. Reading, MA: AddisonWesley, 1987. [13]Salton G: Automatic Information Organization and Retrieval. New York: McGraw-Hill, 1968. [14]Salton G: The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, NJ: Prentice-Hall, 1971. [15]Salton G, McGill M: Introduction to Modern Information Retrieval. New York: McGraw-Hill, 1983. [16]Sherertz D, et al.: Source Inversion and Matching in the UMLS Metathesaurus. Proc 14th Ann SCAMC. IEEE Comp Soc Press, pp 141-5, 1990. [17]Sherertz D, Tuttle M: Lexical Mapping in the UMLS Metathesaurus. Proc 13th Ann SCAMC. IEEE Comp Soc Press, pp 494-9, 1989. [18]Sherertz D, et al.: Intervocabulary Mapping Within the UMLS: The Role of Lexical Matching. Proc 12th Ann SCAMC. IEEE Comp Soc Press, pp 201-6, 1988. [19]Shoval P: An Expert Consultation System for a Retrieval Data-Base with a Semantic-Network of Concepts. University of Pittsburgh Ph.D. Dissertation, 1981. [20]Stensaas S: Slice of Life IV Database. Available from Spencer S. Eccles Health Sciences Library, U of Utah, 1989. [21]Stensaas S: Slice of Life IV Videodisk. Available from Spencer S. Eccles Health Sciences Library, U of Utah, 1989. [22]Tuttle M, et al.: Toward a Biomedical Thesaurus: Building the foundation of the UMLS. Proc 12th Ann SCAMC. IEEE Comp Soc Press, pp 191-5, 1988. [23]Tuttle M, et al.: Implementing Meta-1: the first version of the UMLS Metathesaurus. Proc 13th Ann SCAMC. IEEE Comp Soc Press, pp 483-7, 1989. [24]Tuttle M, et al.: Using Meta-1: the first version of the UMLS Metathesaurus. Proc 14th Ann SCAMC. IEEE Comp Soc Press, pp 131-5, 1990. [25]Vries J, et al.: An Automated Indexing System Utilizing Semantic Net Expansion. submitted for publication. 1990.

[1]UMLS Knowledge Sources Experimental Edition. National Library of Medicine, 1990.

1017

An automatic indexing method for medical documents.

This paper describes MetaIndex, an automatic indexing program that creates symbolic representations of documents for the purpose of document retrieval...
1MB Sizes 0 Downloads 0 Views