Q & A: A Query Formulation Assistant

Rulane B. Merz*, Christopher Cimino, M.D.†, G. Octo Barnett, M.D., Dyan Ryan Blewett, John A. Gnassi, M.D., Robert Grundmeier and Laurie Hassan

Laboratory of Computer Science
Massachusetts General Hospital
Boston, Massachusetts

ABSTRACT

Inexperienced users of online medical databases often do not know how to formulate their queries for effective searches. Previous attempts to help them have provided some standard procedures for query formulation, but depend on the user to enter the concepts of a query properly so that the correct search strategy will be formed. Intelligent assistance specific to a particular query often is not given. Several systems do refine the initial strategy based on relevance feedback, but usually do not make an effort to determine how well-formed a query is before actually performing the search. As part of the Interactive Query Workstation (IQW), we have developed an expert system, Questions and Answers (Q&A), that assists in formulating an initial strategy given concepts entered by the user and that determines if the strategy is well-formed, refining it when necessary.

*Author also affiliated with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.
†Now at Albert Einstein College of Medicine.

0195-4210/92/$5.00 ©1993 AMIA, Inc.

INTRODUCTION

Much of the medical literature has been organized and stored as citations in large retrieval files, or databases. However, users face many problems when attempting to obtain information from these databases. No common interface between databases exists; each has its own protocol and query language. Furthermore, the query languages are difficult to learn; they are often complex and unfriendly, and failure to use them correctly frequently causes relevant citations to be missed in a search. Results are returned in many different formats, which can cause confusion as to what they mean. Finally, there is no clear procedure for determining which database is most appropriate to search for a given query.

We developed "Questions and Answers" (Q&A) as an intelligent assistant to the Interactive Query Workstation (IQW), a project at Massachusetts General Hospital's Laboratory of Computer Science [2]. IQW provides a common interface to several databases. It chooses an appropriate database for the user's query and formulates an initial search strategy. Q&A helps IQW form the initial strategy, converts the strategy to the database query language, and estimates how many articles will be retrieved, refining the strategy if necessary. Both Q&A and IQW depend upon the knowledge sources provided by the National Library of Medicine's Unified Medical Language System® (UMLS) project. These sources, the Metathesaurus (Meta), the Semantic Network and the Information Sources Map (ISM), have been described previously

[7, 9].

Previous Work

Various attempts have been made to facilitate access to medical databases. Yip extended CONIT (Connector for Networked Information Transfer) to include an expert intermediary for the inexperienced user [8, 14]. His program, EXPERT-1, constructed a strategy based on concepts entered by the user and helped to reformulate the strategy based on index terms selected by the user from the relevant articles that were retrieved. However, it did not contain any real subject knowledge to help the user enter a query, nor did it attempt to examine the input.

Powsner and Miller developed a front-end to Grateful Med which helps to construct queries from terms in clinical records [10]. While this is a useful tool for clinicians, it does not provide assistance in combining the terms in a strategy, nor does it make use of available co-occurrence data to determine if the strategy is likely to retrieve any citations.

"Animal Welfare TOME.SEARCHER" (AWTS), developed by TOME Associates of London, UK [12], is an intelligent system that allows the user to enter a query as free text. Before searching, it estimates the number of references that will be retrieved and if necessary modifies the strategy with the user's help. However, it depends on a dictionary of terms and a classification hierarchy that must be built for each subject domain. This would be prohibitive for a large database covering many different subjects, or for handling many separate databases. It also does not obtain feedback about the relevance of retrieved articles.

SAPHIRE [6], the SMART system [11], and IRX [4] are information retrieval systems which automate the indexing as well as the retrieval of documents in their databases. Queries may be entered as free text. The index terms in the documents and the concepts in the queries are assigned weights, which then become the basis for determining the relevance and corresponding ranking of retrieved citations.

Interactive Query Workstation (IQW)

IQW, a part of the UMLS project, provides a uniform, easy-to-use interface for several online medical databases [2]. The user is allowed to enter a query either as free text or as a list of concepts from a selected patient record. Once the query has been entered, IQW chooses an appropriate database and converts the query into the format expected by that database. It then performs the actual search.

METHODS AND PROCEDURES

"Questions and Answers" (Q&A) is a rule-based system written in Smalltalk/V, an object-oriented programming language which runs under Microsoft® Windows. Smalltalk/V does not currently support a rule-based system, so we built a simple, forward-chaining inference engine. Q&A works with IQW to formulate an initial strategy, then estimates the number of articles that will be retrieved. Based on the estimate, it attempts to improve the strategy before an actual search. To accomplish these tasks, it uses information from Meta and the Semantic Network as well as feedback from the user. In addition, it employs knowledge derived from physicians who use the online medical databases and from professional librarians who perform online searches. This knowledge is available as rules in the knowledge base.

Query Refinement

Currently, the rules for query refinement depend on a rough estimate of the number of articles that will be retrieved. If too few relevant articles will be found, the query must be expanded. This can be done in several ways. Restrictive terms that are unimportant to the user can be dropped. Tag terms, which specify gender, age, or "human" or "animal", are examples of such restrictive terms. Synonyms and associated expressions can be added. If an essential concept retrieves no articles or is not in Meta, it should be searched as a text word. Concepts can be replaced by their more general parents in the hierarchical tree structure of Meta. If the intersection of two or more concepts contains no articles, the union of the concepts can be found instead. In MEDLINE, the scope of the concepts can be broadened so that an article will be retrieved even when the concepts are not the primary focus of the article. Terms can also be exploded in MEDLINE; the search strategy will incorporate them and their more specific children as given in MEDLINE's hierarchical tree structure.

If too many articles are found, the query must be reduced. Restrictive terms like those described previously can be added. Concepts can be replaced by their more specific children in the Meta hierarchy tree. In MEDLINE, qualifiers can be added to concepts to focus the search on a specific area of interest.

We currently accept strategies whose estimates range from 15 to 30 articles. This range was chosen so that enough articles would be retrieved for an evaluation of the search results, while not flooding the user with information.

Retrieval Estimation

The calculation of the rough estimate used in query refinement requires information found in Meta. Meta includes some statistics, which at present are mostly for MEDLINE, about the number of papers in which a concept occurs; this number is called the frequency of occurrence of the concept in a source. Additional statistics count the number of articles in MEDLINE in which two terms co-occur as primary concepts; this is called the co-occurrence frequency of the pair of terms. The fact that the co-occurring terms must be indexed as main concepts for an article to be counted does not usually pose a problem; when users select a term for searching they are probably interested primarily in those articles which focus on that term or some aspect of it. However, tag terms are not often specified as main concepts in an article, and therefore little or no co-occurrence data is provided about them. Since they still may be important to a query, we have chosen not to include them in estimate calculations. Instead, we account for the fact that they will decrease the amount that will be retrieved by adjusting the range of acceptable estimates, increasing the bounds to 30 and 70. The wider range also reflects the added uncertainty. Strategies containing terms without occurrence data are modified only if the estimate for the rest of the terms is too low. When the terms have occurrence but no co-occurrence data, a very rough estimate of the number of articles that contain all of the terms is


Estimate = (freq1 / T1) * (freq2 / T1) * ... * (freqn / T1) * T2

where

n = number of terms in the query
freqn = frequency of occurrence of the nth term
T1 = total number of articles in the source when the occurrence data were tabulated
T2 = total number of articles currently in the source

Table 1: Occurrence Data

Concept                   Occurrence Frequency
Smoking                         2371
Myocardial Infarction           3946
Pneumonia                        971
Bronchitis                       482
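As a sketch of how this formula might be applied, the following Python function (our own illustration, not Q&A's Smalltalk/V code) computes the independence-based estimate. The example frequencies come from Table 1, and the values of T1 and T2 are the MEDLINE totals quoted later in the paper.

```python
def independence_estimate(freqs, t1, t2):
    """Rough retrieval estimate assuming the terms occur independently.

    freqs: frequency of occurrence of each term when the data were tabulated
    t1:    total articles in the source when the occurrence data were tabulated
    t2:    total articles currently in the source
    """
    p_all = 1.0
    for f in freqs:
        p_all *= f / t1  # probability that a given article contains this term
    return p_all * t2    # expected number of articles containing every term


# Example: "Smoking" and "Pneumonia" from Table 1, with the MEDLINE totals
# given later in the text (T1 = 730,259 and T2 = 1,031,534).
estimate = independence_estimate([2371, 971], t1=730_259, t2=1_031_534)
```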

This equation is produced by calculating the probability that a concept will appear in a particular article (by dividing the number of articles in which the concept appears by the total number of articles in the source when the concept's frequency was determined), multiplying the probabilities for each concept in the strategy, then multiplying the result by the total number of articles currently in the source to get the expected number of articles. The calculation above depends on the assumption that the terms' occurrences are statistically independent. A more accurate retrieval estimate is possible for terms for which co-occurrence data are available. In the simplest case, the strategy contains only two terms, allowing the estimate to be obtained directly. Predicting the result of a search is also simple when the frequency of co-occurrence is so low for one or more pairs of terms that Q&A can immediately conclude that the query must be expanded. Otherwise, an estimate must be calculated. We have developed a procedure which uses co-occurrence frequencies to help minimize the error in the calculation. First we find the pair of terms in the query that exhibit the greatest statistical dependence. This is the pair for which the use of occurrence data would give the worst estimate of the probability that the terms will occur together. Therefore we use the co-occurrence data to determine the actual probability that an article will contain the pair. The occurrence probabilities of the rest of the terms are calculated using their occurrence data. We then multiply these probabilities to find the probability that all of the terms will co-occur. Since this requires an assumption of independence, we include a correction factor in the calculation. Our retrieval estimate is the estimate of the probability that an article will have all of the terms multiplied by the total number of articles in the database.
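The procedure just described can be sketched in Python as follows. This is our own reconstruction, not Q&A's implementation, and the data-structure and function names are ours. Applied to the nifedipine/coronary disease/hypertension data in Tables 3 and 4, it reproduces the estimate of about six articles derived in the worked example below.

```python
def cooccurrence_estimate(freq, cofreq, t1, t2):
    """Retrieval estimate using co-occurrence data, per the procedure above.

    freq:   {term: occurrence frequency}
    cofreq: {frozenset({a, b}): co-occurrence frequency}
    t1:     total articles when the statistics were tabulated
    t2:     total articles currently in the source
    """
    def ratio(pair):
        # Observed co-occurrences vs. the count predicted under independence.
        a, b = pair
        predicted = (freq[a] / t1) * (freq[b] / t1) * t1
        return cofreq[pair] / predicted

    def absolute_ratio(pair):
        r = ratio(pair)
        return 1 / r if r < 1 else r   # reciprocal of ratios below one

    # Start with the most dependent pair and use its true joint probability.
    start = max(cofreq, key=absolute_ratio)
    prob = cofreq[start] / t1
    included = set(start)
    remaining = set(freq) - included

    while remaining:
        # Pick the term linked to an included term by the smallest ratio,
        # i.e. the least positively / most negatively correlated candidate.
        r, term = min((ratio(frozenset({x, y})), y)
                      for x in included for y in remaining
                      if frozenset({x, y}) in cofreq)
        # Multiply in its occurrence probability, using that ratio as the
        # correction factor for the independence assumption.
        prob *= (freq[term] / t1) * r
        included.add(term)
        remaining.remove(term)

    return prob * t2


# Data from Tables 3 and 4 of the worked example.
freq = {"N": 775, "C": 5766, "H": 6189}
cofreq = {frozenset(p): n for p, n in
          [(("N", "C"), 65), (("N", "H"), 168), (("C", "H"), 179)]}
estimate = cooccurrence_estimate(freq, cofreq, t1=730_259, t2=1_031_534)
# estimate is about 6.9; the text reports this, truncated, as 6 articles
```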
RESULTS

To show how a query may be refined before a search, assume that the user wants to know how smoking may be related to myocardial infarction, pneumonia, and bronchitis in the elderly and has entered the concepts "smoking", "myocardial infarction", "pneumonia", "bronchitis" and "elderly". Since most of the occurrence data in Meta currently is available only for MEDLINE, assume for the sake of illustration that the user is doing a literature search, for which MEDLINE is the best database. The occurrence and co-occurrence data that Meta contains for the first four terms are shown in Tables 1 and 2, respectively.

Table 2: Co-occurrence Data

Concept Pair                           Co-occurrence Frequency
Smoking, Myocardial Infarction               14
Smoking, Pneumonia                            3
Smoking, Bronchitis                          12
Myocardial Infarction, Pneumonia              4
Myocardial Infarction, Bronchitis             0
Pneumonia, Bronchitis                        36

"Elderly" is not used in MEDLINE; IQW uses Meta to provide the appropriate synonym "aged". "Aged" is a tag term, so its occurrence data is not used to predict retrieval, although it does cause the range of estimates for which a strategy will be accepted to be adjusted. It is clear from Table 2 that no articles will be retrieved using a strategy that conjoins all of the concepts; "myocardial infarction" and "bronchitis" do not co-occur. Simply replacing the intersection of "myocardial infarction" and "bronchitis" with their union will not significantly improve retrieval, since the concept pair "smoking" and "pneumonia" reduces the number of possible articles to at most three. One option is to explode the query terms. However, the concept data in Meta indicates that "myocardial infarction", "bronchitis" and "smoking" have only one child, so taking the intersection of the exploded terms is not likely to adequately improve search results. Another option is to formulate a strategy consisting of the union of the concepts, but this will retrieve over 7500 articles. An alternate strategy is to take the union of each concept pair in the table that occurs in MEDLINE, thus retrieving the articles for each pair. This will retrieve up to 69 articles, which is within the acceptable range for a query that includes a tag term. Q&A therefore accepts it as the initial strategy.
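The 69-article bound on the union-of-pairs strategy follows directly from Table 2, as this quick check illustrates (the variable names are ours):

```python
# Co-occurrence frequencies from Table 2; the pair with zero co-occurrence
# ("myocardial infarction" / "bronchitis") contributes nothing and is omitted.
pair_counts = {
    ("smoking", "myocardial infarction"): 14,
    ("smoking", "pneumonia"): 3,
    ("smoking", "bronchitis"): 12,
    ("myocardial infarction", "pneumonia"): 4,
    ("pneumonia", "bronchitis"): 36,
}

# Each pair strategy retrieves at most its co-occurrence count, so the union
# retrieves at most the sum (some articles may satisfy more than one pair).
upper_bound = sum(pair_counts.values())  # 69, inside the 30-70 tag-term range
```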
In the previous query, no calculations were necessary to predict the search results. Another example will illustrate the calculation of a retrieval estimate. This time, Q&A is given the concepts "coronary disease", "nifedipine" and "hypertension". The occurrence and co-occurrence data are given in Tables 3 and 4.

Table 3: Occurrence Data for Retrieval Estimation

Concept                  Occurrence Frequency
Nifedipine (N)                  775
Coronary Disease (C)           5766
Hypertension (H)               6189

Table 4: Co-occurrence Data for Retrieval Estimation

Concept Pair     Co-occurrence Frequency
N and C (NC)             65
N and H (NH)            168
C and H (CH)            179

First, for each possible pair of terms in the query, we calculate the ratio of the actual number of articles that will be retrieved (as given by the co-occurrence data) to the number of articles that would be predicted if the terms in the pair were assumed to be independent. Thus for N and C the ratio is

Ratio = NC / ((N / T1) * (C / T1) * T1)

If the ratio is less than one, the two terms are negatively correlated; if it is equal to one, they are independent; and if it is greater than one, they are positively correlated. In order to compare the strength of correlation, whether positive or negative, that exists within each pair of terms, we take the reciprocal of those ratios that are less than one; we call this the absolute ratio of the terms. MEDLINE had 730,259 articles when the occurrence data were compiled. Taking this as our value of T1, the ratios are 10.62, 25.58 and 3.66 for NC, NH and CH respectively. Since all of the ratios are greater than one, these values are the absolute ratios for the terms as well.

We determine which two terms are most dependent on each other by finding the pair with the largest absolute ratio, in this case, N and H. We use the co-occurrence data for these terms to calculate the probability that both will appear in the same article; this avoids making an estimate based on an independence assumption for the two most dependent terms. The probability of occurrence for each of the remaining terms is then calculated using the occurrence data.

To determine the retrieval estimate, we begin with P(NH), the probability that N and H, the most highly correlated terms, will occur in the same article. Since we can never retrieve more than the smallest co-occurrence value of the terms, we next look for the pair of terms with the smallest (non-absolute) ratio that also includes one of the terms in the current estimate. This gives us the term that is the least positively/most negatively correlated with a term in the estimate. In the example, CH has a smaller ratio than NC, and H has already been included in our calculation. The estimate P(NHC) is then P(NH) * P(C). However, this assumes that C is independent of N and H, so we must add a correction factor. The factor we have chosen is the ratio just described, which was used to determine the next term to include in the estimate. We therefore multiply our original estimate P(NHC) by 3.66 to correct for the error in assuming independence. If we had more than three concepts, we would remove C from the list of terms still unaccounted for, find the pair of terms with the next smallest ratio, and repeat the calculations above. Once we have accounted for all of the query's concepts, we multiply our new P(NHC) by T2 to get an estimate of the total number of articles this strategy will retrieve. As of February 19, 1992, the current version of MEDLINE (1989-92) had 1,031,534 articles. Substituting this value for T2, the retrieval estimate is (168 / T1) * (5766 / T1) * 3.66 * T2 = 6 articles. As this estimate is too small, Q&A will now proceed with the process of query refinement.

DISCUSSION AND CONCLUSION

We have described a procedure for performing retrieval estimation and query refinement before a search. While analysis of its effectiveness has not been completed, preliminary results are promising. Strategies which would retrieve little or nothing or, almost as frustrating, would retrieve too much, can be detected and refined before the initial search is even started. Further research is necessary to learn how much a query should be refined before the search.
Because of inaccuracies inherent in any estimate, a compromise exists between acting on the estimate and performing the search. Refining a query based on an inaccurate estimate could unnecessarily reduce the number of relevant articles retrieved. On the other hand, performing the search without any prior refinement may require an excessive number of searches, and convergence on an efficient strategy may be slower. As an extreme example, a strategy with an invalid term will retrieve no articles, providing no useful information even for modification. Finding the proper balance between modifying the query and performing the search can reduce the cost of this compromise. Additionally, the estimation algorithm should be investigated to see how it may be improved.


Q&A currently does not incorporate relevance feedback for query modification; however, relevance feedback has been shown to significantly improve the performance of information retrieval systems, at least for small test collections [e.g., 1, 3, 5, 13]. Most of the relevance feedback methods that have been developed so far require an initial search to generate a set of documents known to be relevant to the user. While we have an interface for obtaining results from the resources, the tools for extracting information from the results have not yet been developed; it is therefore still unknown if information can be efficiently collected and successfully applied in query modification. However, the current estimation and refinement process provides a natural starting point for investigating the use of relevance feedback methods. Another area for future work is the incorporation of a natural language interface. Such an interface would be more natural for the user, and thus would be a better vehicle for query expression. It would also open up interesting possibilities for query disambiguation.

Acknowledgments

The authors would like to express their thanks for the technical assistance of Ms. Cindy Schatz, Reference Librarian, Countway Medical Library, Harvard Medical School. This work was supported in part by NLM contract N01-LM-8-3513 and in part by an educational grant from the Hewlett Packard Corporation. MEDLINE and the Metathesaurus are trademarks of the National Library of Medicine. Windows is a trademark of Microsoft®. Smalltalk/V is a trademark of Digitalk, Inc.

References

[1] Aalbersberg, IJsbrand Jan, "Incremental Relevance Feedback", Proceedings of the 15th International Conference on Research and Development in Information Retrieval (SIGIR 92), June 1992; 11-22.

[2] Cimino, Christopher, Barnett, G. Octo, Hassan, Laurie, Blewett, Dyan Ryan, and Piggins, Judith L., "Interactive Query Workstation: Standardizing access to computer-based medical resources", Computer Methods and Programs in Biomedicine, 1991; 35:293-299.

[3] Dillon, Martin and Desper, James, "The Use of Automatic Relevance Feedback in Boolean Retrieval Systems", J Doc, September 1980; 36(3):197-208.

[4] Harman, Donna, Benson, Dennis, Fitzpatrick, Larry, Huntzinger, and Goldstein, Charles, "IRX: An Information Retrieval System for Experimentation and User Applications", SIGIR Forum, 1988; 22:2-10.

[5] Harper, D. J. and Van Rijsbergen, C. J., "An Evaluation of Feedback in Document Retrieval Using Co-occurrence Data", J Doc, September 1978; 34(3):189-216.

[6] Hersh, William, Hickam, David H., Haynes, R. Brian, and McKibbon, K. Ann, "Evaluation of SAPHIRE: An Automated Approach to Indexing and Retrieving Medical Literature", Proceedings of the 15th Annual SCAMC, IEEE Computer Society Press, 1991; 808-812.

[7] Lindberg, Donald A. B. and Humphreys, Betsy L., "The UMLS Knowledge Sources: Tools for Building Better User Interfaces", Proceedings of the 14th Annual SCAMC, IEEE Computer Society Press, 1990; 121-125.

[8] Marcus, Richard and Reintjes, Francis, "Experiments and Analysis on a Computer Interface to an Information-Retrieval Network", MIT Laboratory for Information and Decision Systems, LIDS-R-900, April 1979.

[9] McCray, Alexa T. and Hole, William T., "The Scope and Structure of the First Version of the UMLS Semantic Network", Proceedings of the 14th Annual SCAMC, IEEE Computer Society Press, 1990; 126-130.

[10] Powsner, Seth M. and Miller, Perry L., "From Patient Reports to Bibliographic Retrieval: A Meta-1 Front-End", Proceedings of the 15th Annual SCAMC, IEEE Computer Society Press, 1991; 526-530.

[11] Salton, Gerard, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.

[12] "TOME.SEARCHER on Animal Welfare", TOME Associates Ltd., Report and User Guide, June 1990.

[13] Wu, Henry and Salton, Gerard, "The Estimation of Term Relevance Weights Using Relevance Feedback", J Doc, December 1981; 37(4):194-214.

[14] Yip, Man-Kam, "An Expert System for Document Retrieval", MIT Department of Electrical Engineering and Computer Science, Master's Thesis, 1981.
