A History-Taking System That Uses Continuous Speech Recognition

Kevin Johnson, Alex Poon, Smadar Shiffman, Richard Lin, and Lawrence Fagan

Section on Medical Informatics and Division of Clinical Pharmacology
Stanford University School of Medicine
Medical School Office Building (MSOB X215)
Stanford, CA 94305-5479

ABSTRACT

Q-MED is an automated history-taking system that uses speaker-independent continuous speech as its main interface modality. Q-MED is designed to allow a patient to enter her basic symptoms by engaging in a dialog with the program. Error-recovery mechanisms help to eliminate findings resulting from misrecognitions or incorrect parses. An evaluation of the natural-language parser that Q-MED uses to map user utterances to findings showed an overall semantic accuracy of 87 percent. Q-MED asks more specific questions to capture findings that were not volunteered, or that could not be parsed in their initial, open-ended form.

HISTORY-TAKING SYSTEMS

Automated history taking has been the subject of research in medical informatics since the late 1960s (e.g., [1-7]). History-taking systems are designed to facilitate the process of conducting the medical interview; patients waiting to be seen by a health-care provider can spend that time recording their symptoms. The affordability of powerful personal computers has made automated history taking an attractive alternative to the paper-based questionnaires that were occasionally used in the 1970s and 1980s [2-5].

History-taking systems are well accepted by patients, who appreciate the pace of computer-based history taking and the unbiased viewpoint that machines take toward personal information. However, a study by Quaak and associates[3] showed that 21 percent of the patients who used their system found the allowable answers too restrictive to be accurate. To develop an interface that enables inexperienced users to record detailed historical information is difficult, even with graphical technology. Ideally, it would be preferable to let the patient speak in her own words, while the history-taking system extracts information from the patient's utterances and uses that information to determine which diagnoses to consider and which additional questions to ask. Continuous-speech recognition technology (CSRT) provides a mechanism to allow the patient to use more natural language during the history-taking process.

DESIGN GOALS

We wanted to design a history-taking system that could interpret spoken input from patients and use this interpretation to guide the interviewing process. One approach that has been used in computer systems that use CSRT, or spoken-language systems, has been to control the process of data entry through a dialog between the user and the computer[9-12]. The use of a dialog is familiar to patients, and is, in fact, the way many history-taking systems have been constructed[4, 5]. In systems that engage in a dialog, the context of the interview can be used to predict the types of responses that a user is likely to furnish[11, 12]. In addition, the structure of the question that is asked by the computer may make the patient aware of the range of responses that the program can understand. The latter two points are important because, in any spoken-language system, there is a large, but constrained, vocabulary that the system is able to recognize. The use of dialog helps spoken-language systems convey the limitations of their language model, or grammar, without needing to rely upon visual representations of that model.

It is incumbent upon any spoken-language system to provide mechanisms for recovering from speech-recognition errors. In the context of a medical interview, where patients describe their symptoms, misrecognizing an utterance may cause a loss of important information.


Misrecognitions may also generate incorrect findings. Therefore, one of our main design goals was to develop mechanisms that minimize the number of incorrect symptoms in the summary report.

SYSTEM DESIGN

Q-MED consists of an interviewing engine, a speech recognizer, a parser, and a text generator. Error-recovery mechanisms are built into these modules to ensure that the system captures most of the findings that arise during the dialog with the patient.

Speech Recognizer

The speech-recognition system that we are using, the DS200 by Speech Systems, Inc. (SSI), requires a predefined grammar of words and ways in which they can be combined to form sentences. Because the SSI system has a vocabulary in excess of 30,000 words, it is possible to develop large, complex grammars that represent most of what a user might say. As the size of a grammar increases, however, the probability of misrecognizing words in the user's responses also increases. The solution to this problem is to create small grammars that model what the user might say in any given context, and to change grammars as the context changes; this approach meshes nicely with the approach of asking questions that have varying levels of specificity. We can create different grammars for each of these contexts, and switch grammars whenever necessary. Furthermore, we can develop robust grammars for open-ended questions, and grammars with smaller vocabularies for multiple-choice or yes-no questions.
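To make the grammar-switching scheme concrete, here is a minimal sketch; the toy grammars and the recognize() stand-in are our invention for illustration, and do not reflect the DS200 programming interface:

```python
# A minimal sketch of context-dependent grammar switching. The grammar
# contents and the recognize() stand-in are invented for illustration.

GRAMMARS = {
    # Small, constrained grammar for yes-no questions.
    "yes-no": {"yes", "no", "i'm done", "ignore that"},
    # Larger grammar for open-ended questions about back pain.
    "open-ended": {
        "my lower back hurts",
        "the pain goes down my left leg",
        "i have trouble sleeping because of the pain",
    },
}

def recognize(utterance: str, context: str):
    """Stand-in for the recognizer: match against the active grammar only.

    The smaller the active grammar, the fewer candidate sentences the
    recognizer must discriminate among, so misrecognitions are rarer.
    """
    return utterance if utterance in GRAMMARS[context] else None

print(recognize("yes", "yes-no"))                  # "yes"
print(recognize("my lower back hurts", "yes-no"))  # None: out of context
```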

Interviewing Engine

We have built an interviewing engine that simulates the interviewing style of health-care providers, and that controls the set of processes, described later in this section, that are necessary for a spoken-language system using a dialog model. A full description of the interviewing engine can be found in an article by Poon and associates[13].

The process of medical interviewing has been studied and described extensively[14, 15]. Experienced health-care providers ask open-ended questions based on concepts related to the diagnoses the interviewer is considering. When answers to those questions fail to furnish insight into the patient's problems, the health-care provider uses more specific questions, both to clarify the intent of the original question and, as a side effect, to suggest example responses that would be of use to the interviewer. Interviewers sometimes ask yes-no questions to screen for specific diagnoses, or to capture potentially useful information that the patient has not volunteered. Q-MED's interviewing engine simulates the interviewing style of health-care professionals. The process of asking questions with different levels of specificity is especially useful for recovering from misrecognized speech, incorrect parses, or ambiguous questions. In any of these cases, if the answer does not cause Q-MED to pursue other information, Q-MED asks more specific questions, in the same context as the more general questions, until useful information is obtained. As questions become more specific, Q-MED selects grammars that are more constrained, so that the probability of misrecognizing an utterance is decreased.
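The escalation strategy can be summarized schematically; in the sketch below, the ask() and parse() helpers and the level names are hypothetical placeholders rather than Q-MED internals:

```python
# Schematic sketch of the escalation strategy described above; ask(),
# parse(), and the question levels are hypothetical placeholders.

QUESTION_LEVELS = ["open-ended", "specific", "yes-no"]

def elicit(concept, ask, parse):
    """Ask about one concept, moving to more specific questions (and
    more constrained grammars) until useful information is obtained."""
    for level in QUESTION_LEVELS:
        utterance = ask(concept, level)   # question plus matching grammar
        findings = parse(utterance)
        if findings:                      # useful information obtained
            return findings
    return []                             # nothing captured for this concept
```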

Parser

Q-MED uses a parser, based on techniques outlined by Lin[16] and Shiffman[17], to map a user's utterance to one or more predefined findings. The parser uses a concept-matching algorithm to map sets and sequences of keywords to these findings. The parser defines a canonical form for each keyword in a finding, and builds a canonical representation of the user's utterance by concatenating these canonical forms. The canonical representation of the user's utterance is compared to the set of canonical forms, and the matching findings, along with the words from the user's utterance that relate to them, are returned to the Q-MED interviewing engine.
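A toy rendering of this concept-matching step may help; the canonical-form table and finding names below are invented, and the sketch is not the CAPIS implementation[16]:

```python
# Toy reconstruction of concept matching: keywords are reduced to
# canonical forms, concatenated, and compared against the canonical
# representations of predefined findings. The vocabulary is invented.

CANONICAL = {  # keyword -> canonical form
    "hurts": "pain", "ache": "pain", "aches": "pain", "pain": "pain",
    "lower": "low", "low": "low", "back": "back",
}

FINDINGS = {  # canonical representation -> finding
    "low back pain": "LOW-BACK-PAIN",
}

def parse(utterance: str):
    """Map an utterance to (finding, matched words) pairs."""
    words = utterance.lower().split()
    matched = [(w, CANONICAL[w]) for w in words if w in CANONICAL]
    canonical = " ".join(c for _, c in matched)
    results = []
    for representation, finding in FINDINGS.items():
        if representation in canonical:
            results.append((finding, [w for w, _ in matched]))
    return results

print(parse("my lower back aches all day"))
# [('LOW-BACK-PAIN', ['lower', 'back', 'aches'])]
```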

Finding Database

As Q-MED creates the augmented transition network (ATN)[13], a finding database is developed, with nodes corresponding to the concepts of the interview. One concept is predefined as the primary key for the data, and frames the context in which other findings are elicited. Concepts essentially correspond to properties of each finding. Throughout the interview, as the program parses utterances, the matched findings are stored according to the context in which they are obtained. In Q-MED for back pain (Q-MED BACK), the location of the pain serves as the primary key of a finding.


The interviewing engine can query the finding database to determine whether, for example, in the setting of unilateral lower-back pain, there was radiation of the pain to the patient's foot. If the answer to this question has already been obtained, the interviewing engine can skip questions related to this concept.
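A minimal sketch of such a query, with invented field names and an assumed dictionary representation of the finding database:

```python
# Illustrative sketch of the finding database, keyed (as in Q-MED BACK)
# on the location of the pain. Field names and values are invented.

finding_db = {
    "unilateral lower back": {          # primary key: pain location
        "radiation": "to the left foot",
        "quality": "sharp",
    },
}

def already_answered(location: str, concept: str) -> bool:
    """Let the interviewing engine skip questions whose concept has
    already been captured in this context."""
    return concept in finding_db.get(location, {})

if already_answered("unilateral lower back", "radiation"):
    print("Skipping the radiation question; answer already recorded.")
```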

Text Generator

Q-MED uses basic text-generation capabilities to paraphrase user responses, to construct questions that confirm previously obtained responses, and to create a note that summarizes the patient's complaints. Q-MED generates text by combining phrases and words in an order specified by text-generation templates.
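As a rough illustration, template-based generation can be as simple as slot filling; the template strings below are invented examples rather than Q-MED's actual templates:

```python
# Minimal sketch of template-based text generation. The templates and
# slot values are invented examples.

TEMPLATES = {
    "paraphrase": "You said that you have {finding}.",
    "confirm": "Earlier you mentioned {finding}. Is that correct?",
    "summary": "The patient reports {finding}, located in the {location}.",
}

def generate(kind: str, **slots) -> str:
    """Combine phrases and words in the order the template specifies."""
    return TEMPLATES[kind].format(**slots)

print(generate("confirm", finding="pain radiating to the left foot"))
# Earlier you mentioned pain radiating to the left foot. Is that correct?
```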

Recovery Mechanisms

It was imperative that we create mechanisms that allow the user to specify when an error has occurred. A variety of such mechanisms have been proposed for spoken-language systems[9, 10]; we have implemented two mechanisms that fit the interviewing paradigm: an "ignore that" feature, and explicit confirmation of specific findings.

The "ignore that" feature refers to the ability of a user to retract the last utterance that she stated during the interview. In some cases, utterances may trigger incorrect findings. One solution would be to allow the user to see the results of the speech-recognition process, but the danger is that users might choose to have the system retract utterances even if the misrecognized words do not affect the semantic interpretation of the utterance. For this reason, we display only a paraphrase of the user's utterance, without showing the actual recognized response. The user can then tell the system to ignore any utterance that results in an incorrect finding.

The second error-prevention mechanism that we use is explicit confirmation. Whenever the focus of the interview is about to change based upon unconfirmed information, Q-MED asks the patient to confirm that the finding was stated. Again, a paraphrase of the finding is displayed as part of the question that is asked.
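The following sketch illustrates both mechanisms under simplified assumptions; the class and its bookkeeping are our invention, not Q-MED's internal design:

```python
# Sketch of the two recovery mechanisms; data structures are invented.

class Interview:
    def __init__(self):
        self.findings = []          # confirmed and unconfirmed findings
        self.last_utterance = []    # findings from the most recent utterance

    def record(self, findings):
        """Store the findings triggered by the latest utterance."""
        self.last_utterance = findings
        self.findings.extend(findings)

    def ignore_that(self):
        """Retract every finding triggered by the last utterance."""
        for f in self.last_utterance:
            self.findings.remove(f)
        self.last_utterance = []

    def confirm(self, finding, patient_says_yes: bool):
        """Explicit confirmation before the interview focus changes."""
        if not patient_says_yes:
            self.findings.remove(finding)
```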

EVALUATION

Our first Q-MED application interviews patients who have back pain. We conducted a preliminary evaluation of the extent to which Q-MED is able to map an utterance to the correct finding. We had men and women in our laboratory read 200 sentences to Q-MED. These sentences were derived from predefined grammars, so that the users' utterances were guaranteed to match one or more Q-MED findings if all of the words in the utterance were recognized correctly. We calculated the utterance-recognition accuracy and the word-recognition accuracy for each sentence. When utterances were partially misrecognized, we examined the findings that our parser returned, and a physician determined whether the findings were semantically equivalent to the initial utterance. We then determined the overall semantic accuracy for each type of grammar.
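In simplified form, the three measures can be computed as sketched below; these helpers are stand-ins that assume aligned transcript pairs, not the study's exact scoring procedure:

```python
# Simplified stand-ins for the evaluation measures; they assume aligned
# (recognized, spoken) transcript pairs rather than full error alignment.

def utterance_recognition(recognized, spoken):
    """Percentage of utterances recognized exactly."""
    exact = sum(r == s for r, s in zip(recognized, spoken))
    return 100.0 * exact / len(spoken)

def word_recognition(recognized, spoken):
    """Percentage of words recognized correctly (crude positional match)."""
    correct = total = 0
    for r, s in zip(recognized, spoken):
        r_words, s_words = r.split(), s.split()
        correct += sum(a == b for a, b in zip(r_words, s_words))
        total += len(s_words)
    return 100.0 * correct / total

def semantic_accuracy(parsed, intended):
    """Percentage of utterances whose findings were judged semantically
    equivalent to the spoken sentence (a physician judged this in the study)."""
    hits = sum(p == i for p, i in zip(parsed, intended))
    return 100.0 * hits / len(intended)
```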

Results

The results of the experiment are outlined in Table 1. Each type of grammar has a specific number of possible semantic categories to which utterances can be classified. For example, yes-no responses can be interpreted as "yes" or "no", but the grammar also includes the language models for utterances that translate to "exit", "testing", and "ignore that". The false-positive finding rate (FPR) is defined as the number of incorrect findings triggered by 100 utterances.

Table 1. Recognition Rate and Semantic Accuracy of Q-MED

Type of grammar | Average number of words in grammar | Number of semantic categories | Number of possible sentences | Utterance recognition (percent) | Word recognition (percent) | Semantic accuracy (percent) | FPR (per 100 utterances)
yes-no          | 36  | 5  | 98           | 88   | 86   | 99   | 0.3
specific        | 83  | 18 | 2 billion    | 82.8 | 87.7 | 89.4 | 0.67
open-ended      | 242 | 59 | > 20 billion | 72.5 | 89   | 85.2 | 10

As expected, grammars that generate a small number of possible sentences, like the yes-no grammar, performed quite well. Although the utterance-recognition rate was fairly low, the parser was able to classify the utterance correctly 99 percent of the time. There was one false-positive finding, which occurred when the phrase "uh huh" was recognized as "I'm done," which matches the semantic category "exit." Grammars used with specific finding categories had a less impressive difference between the recognition and semantic accuracies. This may be explained by the fact that the number of words used to discriminate between findings in these grammars is small (1.7 words, on average), which makes the system vulnerable to single-word omissions or insertions. With the most complex grammars, the semantic accuracy averaged 85 percent, which was an improvement over the utterance-recognition accuracy. However, the false-positive rate of 10 percent is high. This implies that, on average, the user would have to tell the system to ignore 1 out of every 10 utterances when grammars of this size are used.

The overall semantic accuracy was 87 percent; this implies that about 13 percent of the user's responses would cause Q-MED to ask specific questions. This will lead to some increase in the overall time of the interview, but should result in a more complete summary report for the physician.

DISCUSSION

One of our main goals in developing Q-MED was to create a spoken-language system for history taking that could be field tested. To meet this goal, we had to ensure that mechanisms were in place to capture accurately and completely the information patients stated. Although our most general grammars have a fairly high false-positive rate, this application of Q-MED relies upon less open-ended grammars for most of the interview. We believe that the overall false-positive rate of this application is sufficiently low that we can test this system using patients who have back-pain symptoms.

Limitations

Although we are encouraged by preliminary reviews of Q-MED, we recognize that it has limitations. One such limitation is the time required to develop the knowledge base used by Q-MED. Q-MED applications currently require the following knowledge sources (sketched schematically after the list):

A list of findings
A list of synonyms for words in findings
A list of questions to ask
A list of text-generation templates
A set of speech-recognition grammars
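A skeletal example of these five knowledge sources, with every entry invented for illustration:

```python
# Skeletal sketch of the five knowledge sources a Q-MED application
# requires; every entry shown here is an invented example.

knowledge_base = {
    "findings": ["LOW-BACK-PAIN", "PAIN-RADIATION-TO-FOOT"],
    "synonyms": {"hurts": ["aches", "is sore"]},
    "questions": {"LOW-BACK-PAIN": "Tell me about your back pain."},
    "templates": {"confirm": "You mentioned {finding}. Is that correct?"},
    "grammars": {"yes-no": ["yes", "no", "ignore that"]},
}
```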

Many of these lists require a large amount of fine-tuning, in an effort to represent the user's language accurately and thereby improve speech recognition. We are currently investigating ways to decrease the knowledge-acquisition task for this system. Our parser uses only minimal syntactic parsing. We are able to parse natural language using a keyword parser and carefully designed semantic categories. However, it is not currently possible to have any single word in multiple categories; "turn" cannot mean "change" in some contexts, yet mean "rotate" in others. The inflexibility of these semantic categories could potentially cause problems when we attempt to use Q-MED for larger sets of complaints.

Another potential limitation of Q-MED is that, in its current form, all the information it obtains is predefined. Words that are not listed in the parser's lexicon cannot be recognized. Therefore, the summary does not have the robustness of a questionnaire that leaves space for the user to annotate answers or provide information about topics not covered in the questionnaire. One approach that we can take is to generate a finding using Q-MED, and then to allow the user to speak freely into the microphone, recording his speech. The speech can be saved as part of the database, and the health-care provider can choose to listen to the recordings that she thinks are relevant to the patient's possible diagnoses.


CONCLUSION

Q-MED BACK is a functional spoken-language system that uses speaker-independent continuous-speech-recognition technology to obtain information about a patient's symptoms. Our study demonstrates that the Q-MED architecture achieves very good semantic accuracy, with a low false-positive rate when smaller grammars are used.


ACKNOWLEDGMENTS

We thank Les Lenert and Christopher Lane for assistance in the design and implementation of initial versions of this system. Lyn Dupré provided assistance with the style and content of the paper. We thank Edward Shortliffe for providing an environment that supports this type of research.

Primary support for this research is provided under contract #213-89-0012 from the Agency for Health Care Policy and Research. Additional support is provided by the Department of Defense and the National Library of Medicine under grants LM-04864 and LM-07033. Computer facilities were provided by the SUMEX-AIM Resource (LM-05208) and through an equipment grant from Speech Systems, Inc. Speech Systems™ is a trademark of Speech Systems, Inc.

REFERENCES

1. Hilf, F.D., et al. Machine-Mediated Interviewing. Stanford University, 1970.
2. Pynsent, P., Fairbank, J.C.T. Computer interview system for patients with back pain. J. Biomed. Eng., 1989. 11: p. 25-29.
3. Quaak, M., et al. Patient appreciations of computerized medical interviews. Med. Inform., 1986. 11(4): p. 339-350.
4. Slack, W. A computer-based medical-history system. NEJM, 1966. 274(4): p. 194-198.
5. Slack, W. A history of computerized medical interviews. Clinical Computing, 1984. 1(5): p. 53-59.
6. Stead, W. Computer-assisted interview of patients with functional headache. Arch Intern Med, 1972. 129: p. 950-952.
7. Waddell, G. An approach to backache. British Journal of Hospital Medicine, 1982.
8. Collen, M. Multiphasic Health Testing Services. New York: John Wiley and Sons, 1977.
9. Rudnicky, A., Hauptmann, A.G. Conversational interaction with speech systems. Carnegie Mellon University, 1989.
10. Rudnicky, A. The design of spoken language interfaces. Carnegie Mellon University, 1990.
11. Young, S., Hauptmann, A.G., Ward, W.H., Smith, E.T., Werner, P. High level knowledge sources in usable speech recognition systems. Communications of the ACM, 1989. 32(2): p. 183-193.
12. Young, S. Using semantics to correct parser output for ATIS utterances. In Proceedings of the Speech and Natural Language Workshop. 1991. Pacific Grove, CA: Morgan Kaufmann Publishers, Inc.
13. Poon, A., Johnson, K., Fagan, L. Augmented transition networks as a representation for knowledge-based history-taking systems. Submitted to SCAMC, 1992. Baltimore, MD.
14. Cohen-Cole, S. The medical interview: the three function approach. St. Louis: Mosby Year Book, Inc., 1991.
15. Cassell, E.J. Talking With Patients. 1st ed. MIT Press Series on the Humanistic and Social Dimensions of Medicine, ed. S.J. Reiser. Vol. 1 and 2. 1985, Cambridge, MA: MIT Press.
16. Lin, R., et al. A free-text processing system to capture physical findings: Canonical Phrase Identification System (CAPIS). In Fifteenth Annual Symposium on Computer Applications in Medical Care (SCAMC). 1991. Washington, DC: McGraw-Hill, Inc.
17. Shiffman, S., et al. The integration of a continuous-speech-recognition system with the QMR diagnostic program. Submitted to SCAMC, 1992.
