An application-independent subsystem for free-text analysis.

COMPUTERS AND BIOMEDICAL RESEARCH 9, 159-167 (1976)

An Application-Independent Subsystem for Free-Text Analysis*

ROBERT R. FENICHEL† AND G. OCTO BARNETT

Laboratory of Computer Science, Massachusetts General Hospital, Boston, Massachusetts 02114

Received March 12, 1975

The analysis of free English text is a major subproblem in medical computing. A free-text-analysis subsystem, written entirely in the MUMPS language, is in operation at the Laboratory of Computer Science, Massachusetts General Hospital. This subsystem, built around a numerical taxonomy of anticipated input, services multiple application programs. Facilities are provided for matching, spelling correction, disinflection, and phrase analysis. A variety of administrative programs are provided for logging uninterpreted input and for dynamically incrementing the data-base so as to expand the range of correctly interpreted input.

* This work was supported in part by research grant HS 00240 from the Bureau of Health Services Research, and in part by contract Number N01-LM4716 from the National Library of Medicine. Dr. Fenichel was supported in part by a Special Postdoctoral Fellowship (SPF-4) from the American Cancer Society.
† Current address (from 1 June 1976): Department of Internal Medicine, Center for the Health Sciences, UCLA, Los Angeles, California 90024.
Copyright © 1976 by Academic Press, Inc. All rights of reproduction in any form reserved. Printed in Great Britain.

Facilities for collection and interpretation of user-supplied data are a critical part of any information system. In medical information processing, data are often nonnumeric and expressed with essential use of large natural-language vocabularies. Nevertheless, several computer-based medical-information systems (1-12) have provided facilities for input and analysis of such data. Most of these systems have attempted the analysis of information-rich, declarative text, written with little or no thought to subsequent computer analysis. The text has included operative notes (1), pathology reports (2-7), and autopsy reports (8). A surprising amount of information can be gleaned from such corpora using only the most elementary analysis (3, 4). Useful analysis, however, may require highly sophisticated linguistic models (1, 2). To avoid the difficulties of such analysis, some systems have avoided natural language altogether, capturing the original data using multiple-choice input. Narrative output text, if later required, is generated by insertion of stock phrases (13). In all of these systems, the primary purpose of the original data input has been the construction of a data-base and the subsequent provision of information retrieval facilities.

In other systems (9-12, 14) free text has been accepted only for the queries, with the underlying data obtained by other means. Queries are often much easier to analyze than general declarative text. Because queries are explicitly directed to a computer, users may adapt their language to observed computer capabilities. Independently of users’ adaptation, some queries (e.g., “What were the intraoperative findings?”) will inevitably be expressed using only a limited vocabulary and structure, even though the range of possible answers may be vast. Essentially all of the investigations of an ordinary diagnostic workup are queries of this sort. At least two documented systems (9-12), utilizing the simplicity of these queries, have presented computer-simulated patients for interactive diagnostic evaluation by medical students. Students using these programs specify proposed investigations and therapies; after each student input, the program may respond with the results of tests, results of therapy, or other simulated clinical developments.

Similar clinical simulations have been developed by the Laboratory of Computer Science of the Massachusetts General Hospital (15, 16). These simulations had, however, allowed only numeric input, with tests and diagnoses specified from printed manuals of codes. This mode of operation allowed students to profit from various kinds of testmanship and from recognition memory in addition to recall. It was decided in mid-1971 to provide a free-text-analysis subsystem for use by these instructional programs. Other, noninstructional, applications were also envisioned, so that the free-text-analysis subsystem would not be allowed to make strong assumptions (as in II, 11) about the nature of the analyzed text. Finally, an early decision was to provide a single linguistic data-base to cover all applications.

NUMERICAL TAXONOMY

In general, all natural-language input is ultimately translated into numerical codes taken from a single numerical taxonomy. To maximize potential interaction with other systems, certain standard taxonomies are subsumed; for example, the portion of our taxonomy devoted to diagnoses is an information-preserving copy of the ICDA taxonomy (17). Other portions are devoted to anatomy (subsuming the SNOP topographic taxonomy (18)), laboratory tests, physical examination, therapies, and so on.

Each category in our taxonomy may have any number of distinct character strings (“terms”) associated with it. For example, category 912.7 corresponds to ICDA code 127; among the terms in this category are “ROUND WORMS” and “ASCARIASIS.” Any term may optionally be accompanied by a specification of context. For example, the term “CA” means cancer in the diagnostic context, but “CA” means serum calcium concentration if no context is specified. If each context-dependent interpretation of a term be regarded as a new term, then no term may appear in more than one category of the taxonomy.

The procedures relating input data to taxonomic codes are here grouped into techniques of exact match, disinflection, phrase analysis, and application service.

EXACT MATCH

An input string is most simply translated into a taxonomic code by verifying that the string exactly matches some term in the code’s category. This translation is efficiently performed using standard techniques (19).

In certain other systems, less rigorous criteria are used for identification of input strings. For example, an input string I might be considered to match a stored term S if I and S have an identical initial substring that is not present in any other stored term. If, for example, there were only one stored term beginning with “Q,” then any input string beginning with “Q” would, according to this criterion, be accepted as a spelling of this term. Such looseness would allow a wide range of inflected and misspelled input to be correctly identified, but fantastic undetected misunderstandings (e.g., see (12, pp. 264ff)) would also occur.

The range of correctly-interpreted input can always be piecemeal expanded by simply adding terms. For example, the test called “T4” by the MGH Chemistry Laboratory is so entered in the taxonomy, but the term “THYROXINE” also appears. Similarly, the chlorothiazide diuretics are so often miscalled “CHLORTHIAZIDES” that both terms are in the chlorothiazide category.

DISINFLECTION

The grammatical niceties of case, tense, and number need not be preserved in any of our present or envisioned applications. For example, the various forms

    PERFORATE      PERFORATION     PERFORATED
    PERFORATING    PERFORATES      PERFORATIONS

need not be distinguished. All of these inflected forms might simply be added as additional terms in the same category as the root term, “PERFORATE,” but such proliferation of terms would be costly. Instead, using a hybrid of the algorithms of Pratt and Pacak (5) and Winograd (20), any nonmatchable input with a recognized inflectional ending is disinflected, and a new exact match is attempted. Of input not exactly matched in its original form, about 15%¹ is successfully interpreted by this process. The potential yield is actually a bit higher, since a few common inflected forms are present as terms to gain some speed at the cost of increased data storage space.

¹ Note added in proof: Now about 22%.

At least in principle, some input may be mistranslated by false disinflection. For example, we have not yet had reason to store the term “LACTATION;” the term “LACTATE” is present as a reference to CH₃CHOHCOO⁻, the organic anion. If the former term were presented as input, it would be interpreted as an inflected form of the latter. For that matter, if “LIVER” were not available for exact match, it might be recognized in input as a comparative (or agentive) form of the adjective (or verb) “LIVE.” We do not believe that any such misdisinflections are occurring in practice.

SPELLING CORRECTION

False-positive interpretation will be a much more serious problem in any noninteractive scheme to handle spelling errors other than those (like “CHLORTHIAZIDE” above) specifically anticipated. For example, what could one ever make of “ENDOMETRIISIS?” Neither of the interpretations “ENDOMETRIOSIS” and “ENDOMETRITIS” is especially compelling against the other. Faced with an input form that is neither exactly matched nor interpretable by disinflection, we proceed to a two-phase process consisting of (1) identification of similarly-spelled terms stored in the data-base, and (2) interaction with the user to determine which of these stored terms (if any) was intended.

Grouping similarly-spelled strings is a familiar operation in hospitals where old patient records must be located by patient name. The scheme in widest use appears to be the Soundex algorithm, which transforms each input string to a code of one letter and three decimal digits. Strings likely to be phonetically similar in English (for example, strings containing the same consonants in the same order, or differing only in a doubling of one letter) are transformed into the same code. While the Soundex scheme was designed to retrieve words after error-prone oral-aural transmission by English-speakers, there is evidence that intraspeaker written spelling errors are often misconstructed from the correct forms by the same phonetics-based blunders (21).

Accordingly, we adopted a Soundex-like scheme. Each stored term is accompanied by an “abbreviated” form in which vowels have been elided, double letters have been made single, and letters from any class of similar-sounding consonants (e.g., the class whose members are “M” and “N”) have been replaced by a canonical member of that class. To be considered as the possible intention of an unrecognizable input string I, a stored term S need not have an identical abbreviated form. Instead, we require that S pass the following four tests.
(1) The initial few characters of the abbreviated form of S must be identical to the corresponding characters in the abbreviated form of I. After some experimentation, this number of characters has been set at one more than 40% of the number of characters in the input string I.

(2) The length of S must be within 20% of that of I (once again, the number is the result of crude empiricism).

(3) An application may specify that S must come from a specific portion of the taxonomy (e.g., an application may effectively require that S be the name of a drug).

(4) The term S must not be among those permanently flagged as ineligible for nomination to the user. The flagged terms include (a) common misspellings (“CHLORTHIAZIDE,” etc.) stored as terms but, to avoid reinforcing misspelling, never typed out to users; (b) variant spellings (“RBC INDEXES,” etc.) so similar in spelling to canonical forms (“RBC INDICES,” etc.) that the latter will be suggested as possible interpretations whenever the former might be; (c) medically deviant forms (“WOMB,” “SWOLLEN GLANDS,” etc.) stored as terms but, to avoid reinforcing ill usage, never typed out to users; and (d) socially unacceptable forms ([expletive deleted], etc.) stored as terms (primarily to facilitate ignoring them) but never typed out to users.
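The abbreviated form and the four tests can be illustrated with a minimal Python sketch. This is not the MUMPS implementation. Several details are assumptions made only for illustration: the vowel set is taken to be A, E, I, O, U; the only consonant class is the M/N class named in the text; doubled letters are collapsed before vowels are elided; the 40% figure is rounded down; and the 20% length tolerance is measured against the length of I. The names `StoredTerm`, `section`, and `nominable`, and the stored entries themselves, are hypothetical.

```python
from dataclasses import dataclass

VOWELS = set("AEIOU")       # assumed vowel set
CANONICAL = {"N": "M"}      # the one consonant class named in the text


def abbreviate(term: str) -> str:
    """Canonicalize consonants, collapse doubled letters, elide vowels."""
    out = []
    prev = ""
    for ch in term.upper():
        if not ch.isalpha():
            continue
        ch = CANONICAL.get(ch, ch)
        if ch == prev:          # a doubled letter is made single
            continue
        prev = ch
        if ch not in VOWELS:
            out.append(ch)
    return "".join(out)


@dataclass(frozen=True)
class StoredTerm:
    text: str
    section: str            # hypothetical label for a portion of the taxonomy
    nominable: bool = True  # False for terms flagged as never shown to users


def candidates(user_input, terms, section=None):
    """Return the stored terms that pass all four tests for this input."""
    n = len(user_input)
    prefix_len = 1 + (4 * n) // 10          # one more than 40% of len(I)
    key = abbreviate(user_input)[:prefix_len]
    found = []
    for term in terms:
        if abbreviate(term.text)[:prefix_len] != key:        # test 1
            continue
        if 5 * abs(len(term.text) - n) > n:                  # test 2 (within 20%)
            continue
        if section is not None and term.section != section:  # test 3
            continue
        if not term.nominable:                               # test 4
            continue
        found.append(term)
    return found


TERMS = [
    StoredTerm("THYROXINE", "lab"),
    StoredTerm("CHLOROTHIAZIDE", "drug"),
    StoredTerm("CHLORTHIAZIDE", "drug", nominable=False),  # stored misspelling
]
```

With these entries, the input “CHLORTHIAZID” nominates only “CHLOROTHIAZIDE”: the stored misspelling passes the first three tests but is excluded by the fourth.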


In practice, one or more stored strings S passing all four tests are found for about half of those input strings I that are recognizable by neither exact match nor disinflection. Usually, only one such candidate S is found, but sometimes as many as 10 or 12 will appear. As these possible interpretations of I are located, the user at the teletypewriter is asked which (if any) of them was intended. About 60% of the times that candidates are presented to a user, one of them is acceptable to him; we have no way of knowing how many candidates are accepted as novel suggestions, actually different from the user’s original intent. Overall utilization review during a 4-month period gave the results diagrammed in Fig. 1.

An input string I that cannot be interpreted to the satisfaction of the user (that is, a term that is neither exactly matched, nor recognizable by disinflection, nor acceptably interpreted by the spelling-correction process) is logged in a special file. From time to time, this file is printed out for human analysis. Most of its contents are usually true gibberish, the results of students’ mischief or of electrical mistransmission. Often, however, especially just after installation of new applications, the file contains words that are useful additions to the medical vocabulary for the data-base.

PHRASE ANALYSIS

When input includes spaces or other word-separators, additional techniques may be necessary. Of course, some multiword input may be handled just as single-word input is. For example, “NORMAL SALINE” is stored in the same way (and in the same category) as “NS.”

A multiword input may, on the other hand, refer to two or more categories. For example, there is an ICDA category for ureterolithiasis, but there are no separate categories to indicate disease that is left-sided, bilateral, recurrent, etc. Given the input “ACUTE LEFT URETEROLITHIASIS,” we can return an array of the three codes corresponding to its three words. The techniques of disinflection and spelling correction can be applied to each word in turn, as needed; noncontributory words (articles, partitives, and so forth) can be ignored; and number-unit concatenations (“2LITERS” etc.) can be divided.

This piecewise encoding is easy to implement, but it leaves much of the analysis undone. For example, the input “GASTRIC CANCER” should be translated to the single code of the appropriate (ICDA) category. The techniques already described will produce this single code only if “GASTRIC CANCER” is stored as a term; it might appear necessary that all of the terms

    GASTRIC CANCER
    GASTRIC CA
    GASTRIC MALIGNANCY
    CANCER OF THE STOMACH
    MALIGNANT NEOPLASM OF THE STOMACH

etc., be stored.
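The piecewise encoding described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the subsystem's code: the lexicon entries and their codes are hypothetical placeholders rather than real taxonomy codes, the list of noncontributory words is invented, and the ending-stripping table is a crude stand-in for the hybrid of the Pratt-Pacak and Winograd algorithms that the paper cites. Words that remain unmatched are simply returned, as they would be handed on to spelling correction.

```python
import re

# Hypothetical mini-lexicon: stored term -> placeholder code.
TERMS = {
    "ACUTE": "QUAL:ACUTE",
    "LEFT": "QUAL:LEFT",
    "URETEROLITHIASIS": "DX:URETEROLITHIASIS",
    "PERFORATE": "FIND:PERFORATE",
    "LITER": "UNIT:LITER",
}
NONCONTRIBUTORY = {"A", "AN", "THE", "OF", "SOME"}  # assumed list

# (ending, replacement) pairs tried in order; an illustrative stand-in only.
ENDINGS = [("ATIONS", "ATE"), ("ATION", "ATE"), ("ING", "E"),
           ("ED", "E"), ("ES", "E"), ("S", "")]


def split_number_unit(word):
    """Divide a number-unit concatenation such as '2LITERS'."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Z]+)", word)
    return [match.group(1), match.group(2)] if match else [word]


def lookup(word):
    """Exact match first; failing that, disinflect and retry the match."""
    if word in TERMS:
        return TERMS[word]
    for ending, replacement in ENDINGS:
        if word.endswith(ending):
            root = word[: len(word) - len(ending)] + replacement
            if root in TERMS:
                return TERMS[root]
    return None


def encode(text):
    """Return (codes, unknown_words) for a multiword input string."""
    codes, unknown = [], []
    for raw in re.split(r"[\s,;]+", text.upper()):
        for word in split_number_unit(raw):
            if not word or word in NONCONTRIBUTORY:
                continue
            if word[0].isdigit():       # numbers are passed through as text
                codes.append(word)
                continue
            code = lookup(word)
            if code is None:
                unknown.append(word)
            else:
                codes.append(code)
    return codes, unknown
```

Under these assumptions, `encode("ACUTE LEFT URETEROLITHIASIS")` yields three codes, one per word, and `encode("2LITERS")` yields the number followed by the code for the unit.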



Intuitively, one should be able to avoid this proliferation of terms. The stomach category needs “GASTRIC” and “STOMACH;” the malignant neoplasm category needs “CANCER,” “CA,” and a few more. But once these components are recognized, their joint import (that is, gastric cancer) should be specified in a way that is independent of the combinatorial multiplicity of possible phrases.

We use data known as rules to identify combinations like “GASTRIC CANCER.” In this case, a rule associated with the category stomach is

    CANCER = “GASTRIC CANCER.”
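As a rough picture of how a rule of this shape might be held as data and applied to the codes already found in one input string, consider the Python sketch below. The paper does not give the stored representation, so the `Rule` fields, the category labels, and the choice to put the combined code where the earlier of the two codes stood are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    home: str     # category the rule is associated with
    partner: str  # category that must also appear in the same input
    result: str   # single category that replaces the pair


# Hypothetical labels standing in for taxonomy codes.
RULES = [Rule("STOMACH", "MALIGNANT NEOPLASM", "GASTRIC CANCER")]


def apply_rules(codes, rules=RULES):
    """Replace each co-occurring (home, partner) pair by the rule's result."""
    out = list(codes)
    for rule in rules:
        if rule.home in out and rule.partner in out:
            first = min(out.index(rule.home), out.index(rule.partner))
            out = [c for c in out if c not in (rule.home, rule.partner)]
            out.insert(first, rule.result)
    return out
```

Here the single `Rule` entry plays the part of CANCER = “GASTRIC CANCER” attached to the stomach category; because it operates on category codes rather than on words, it is indifferent to which stored term produced each code.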

This rule specifies that however cancer may have been expressed (“CANCER,” “CA,” “MALIGNANCY,” etc.), if it appears in the same input string as stomach (however expressed), then the two codes should be replaced in the analysis’ output by the single code of gastric cancer. A slightly different rule would require that the specifications of stomach and cancer be adjacent, in a particular order; a related mechanism digests multiword references (e.g., “TWO MILLION”) to numbers.

Rules are the last of the purely linguistic facilities of the free-text-analysis subsystem. Certain extralinguistic facilities are provided to minimize redundant effort within classes of applications.

APPLICATION SERVICES

For example, one potential inconvenience of our unified, application-independent data-base is that any given application may wish to ignore certain distinctions that the data-base as a whole must maintain. Thus, while an orthopedic application might utilize the several ICDA codes allocated to the various fractures of the upper limb, an application accepting diagnoses of abdominal pain would be easier to implement if these several codes could be ignored as a group. A special interface program is provided to allow convenient, application-dependent recoding of entered diagnoses.

Several applications allow laboratory tests to be “ordered” on the simulated patient. Most of these applications have some interest in the costs, risks, discomforts, and delays of these tests; these data are maintained in application-independent fashion with the data-base.

Because the data-base includes essentially all of the tests routinely available at the MGH, it is possible, for any given application, to order tests for which the application is not specifically prepared. For example, an application concerned with acute management of diabetic ketoacidosis is prepared to provide values for the results of assaying serum electrolytes, blood gases, and so forth. This application refuses to perform certain tests (e.g., biopsies) known not to be indicated, but others, especially those of potential screening value (e.g., sickle-cell testing) are allowed. So that such applications need not be prepared to provide plausible results for all tests that they allow, an application-independent program is available to generate such values. Similar programs provide plausible (not necessarily negative) physical findings and histories of disease. Every generation of a plausible result is logged into a special file that, from time to time, is printed out for human analysis. The contents of the file occasionally suggest application changes, either disallowal of certain tests or provision of application-dependent values for others.

Drug orders are entered in several applications. After the rules have simplified

    50 UNITS OF CZI IV IN HALF A LITER OF HALF AND HALF,

there is still a substantial amount of application-independent analysis to be done. A special program parses complex drug orders into one or more simple orders; each simple order contains no more than one agent, dose, or route, and the dose is specified in regularized units (drug-dependent transformations, such as units-to-milligrams, are not performed, but a dose expressed in any weight unit is reexpressed in milligrams, all volume-specified doses are reexpressed in milliliters, etc.).

IMPLEMENTATION

The free-text-analysis package here described has been implemented as approximately 3300 lines of code in the MUMPS (22) language. Of these, only about 20% are used dynamically by applications. The remainder are necessary for such utility functions as listing the data-base, adding or deleting terms, and so forth. As of mid-1974, the data-base contained about 4000 terms in about 1600 categories.

CONCLUSION

This free-text-analysis subsystem provides, in an application-independent manner, linguistic analysis comparable to that described in various published application-specific systems. This generality allows applications designers to be freed from the non-medical distractions of free-text analysis. In addition, the generality of the linguistic data-base means that the subsystem can potentially be used in a wide variety of application activities.

REFERENCES

1. SHAPIRO, P. A. ACORN-an automated coder of report narrative. Methods Inform. Med. 6, 153 (1967).
2. GELL, R. L. AND BECKER, H. Klartextanalyse pathologischer Biopsiebefunde mit Bildschirmabfrage. Methods Inform. Med. 12, 10 (1973).
3. WONG, R. L. AND GAYNON, P. An automated routine for diagnostic statements of surgical pathology reports. Methods Inform. Med. 10, 168 (1971).
4. PRATT, A. W. AND THOMAS, L. B. An information processing system for pathology data. In “Pathology Annual.” Appleton, New York, 1966.
5. PRATT, A. W. AND PACAK, M. Identification and transformation of terminal morphemes in medical English. Methods Inform. Med. 8, 84 (1969).
6. LAMSON, B. G. AND DIMSDALE, B. A natural-language information retrieval system. Proc. IEEE 54, 1636 (1966).
7. JACOBS, H. A natural-language information retrieval system. Methods Inform. Med. 7, 8 (1968).
8. SMITH, J. C. AND MELTON, J. Manipulation of autopsy diagnoses by computer technique. JAMA 188, 958 (1964).
9. WEBER, J. C. AND HAGAMEN, W. D. Medical education-a challenge for natural language analysis, artificial intelligence, and interactive graphics. Proc. Amer. Fed. Inform. Proc. Soc. Conf. 35, 307 (1969).
10. WEBER, J. C. AND HAGAMEN, W. D. ATS: A new system for computer-mediated tutorials in medical education. J. Med. Educ. 47, 637 (1972).
11. HARLESS, W. G., DRENNON, G. G., MARXER, J. J., ROOT, J. A., WILSON, L. L., AND MILLER, G. E. CASE-a natural language computer model. Comput. Biol. Med. 3, 227 (1973).
12. HARLESS, W. G., DRENNON, G. G., MARXER, J. J., ROOT, J. A., WILSON, L. L., AND MILLER, G. E. GENESYS-a generating system for the CASE natural language model. Comput. Biol. Med. 3, 247 (1973).
13. BARNETT, G. O., GREENES, R. A., AND GROSSMAN, J. H. Computer processing of medical text information. Methods Inform. Med. 8, 177 (1969).
14. ROBINSON, R. E., III AND MESCHAN, I. Computerized radiological reporting with word retrieval using MT/ST. Diagn. Radiol. 101, 323 (1971).
15. BARNETT, G. O., BAILLIEUL, J. B., AND FARQUHAR, B. B. The testing of clinical judgment-an experimental computer-based measurement of sequential problem-solving ability. In “Diagnostic Process” (J. Jacquez, Ed.), pp. 191-202. C. C. Thomas, Springfield, Illinois, 1972.
16. HOFFER, E. P., BARNETT, G. O., FARQUHAR, B. B., AND PRATHER, P. Computer-aided instruction in medicine. In “Annual Review of Biophysics and Bioengineering,” Vol. 4. Yearbook Medical Publishers, Palo Alto, California, pp. 103-118 (1975).
17. “Eighth Revision International Classification of Diseases.” United States Public Health Service Publication No. 1693 (no date).
18. “Systematized Nomenclature of Pathology.” College of American Pathologists, Chicago, 1965.
19. KNUTH, D. E. “The Art of Computer Programming,” Vol. 3, Section 6.4. Addison-Wesley, Reading, Massachusetts, 1973.
20. WINOGRAD, T. “Understanding Natural Language.” Academic Press, New York, 1972.
21. ALBERGA, C. N. String similarity and misspellings. Commun. Assoc. Comput. Mach. 10, 302 (1967).
22. BARNETT, G. O. The modular hospital information system. In “Computers in Biomedical Research” (B. Waxman and R. W. Stacy, Eds.), Vol. 4, pp. 243-266. Academic Press, New York, 1974.
