A method for measuring the intelligibility of uninterrupted, continuous speech (L)

Alexandra MacPherson(a) and Michael A. Akeroyd

MRC/CSO Institute of Hearing Research - Scottish Section, Glasgow Royal Infirmary, Alexandra Parade, Glasgow G31 2ER, United Kingdom

(Received 2 October 2013; revised 17 December 2013; accepted 16 January 2014)

Speech-in-noise tests commonly use short, discrete sentences as representative samples of everyday speech. These tests cannot, however, fully represent the added demands of understanding ongoing, linguistically complex speech. Using a new monitoring method to measure the intelligibility of continuous speech and a standard trial-by-trial, speech-in-noise test, the effects of target duration and linguistic complexity were examined. For a group of older hearing-impaired listeners, significantly higher speech reception thresholds were found for continuous, complex speech targets than for syntactically simple sentences. The results highlight the need to sample speech intelligibility in a variety of everyday speech-in-noise scenarios. © 2014 Acoustical Society of America. [http://dx.doi.org/10.1121/1.4863657]

PACS number(s): 43.66.Yw, 43.71.Gv, 43.71.Lz [ICB]

I. INTRODUCTION

(a) Author to whom correspondence should be addressed. Also at: School of Psychological Sciences & Health, University of Strathclyde, Graham Hills Building, 40 George Street, Glasgow, G1 1QE, UK. Electronic mail: [email protected]

J. Acoust. Soc. Am. 135 (3), March 2014, Pages: 1027–1030

The speech we encounter on a daily basis varies greatly both in its duration and in its linguistic and semantic complexity. Utterances can range from a few short words to a string of sentences, each with a unique syntactic structure. Speech from the radio and the television is often continuous, with an extensive and varied vocabulary, and if information is missed there may be no opportunity for the message to be paused or reiterated. Continuous speech is likely to create distinctive challenges for listening, as the speech units must be retrieved, integrated, and understood while new information simultaneously continues to arrive (Pichora-Fuller et al., 1995). It is plausible that difficulties in understanding speech in noise become particularly evident in these uninterrupted, continuous listening situations, as there are fewer opportunities for the system to “catch up” should extra processing or resources be momentarily required (Shinn-Cunningham and Best, 2008).

Most standard speech-in-noise tests use short sentences, often with simple syntactic structures, presented one at a time, with pauses after each trial during which the participant responds. It is quite likely that such speech tests are not complete measures of the complexity encountered in some everyday speech-in-noise scenarios. We have, therefore, designed a test of continuous speech.

The intelligibility of continuous speech is inherently harder to measure experimentally than that of short utterances. Standard word or sentence tests usually express intelligibility as the percentage of test items repeated correctly; long passages of connected speech cannot, however, be recalled in the same way. The few experiments that have used continuous speech targets have used methods including shadowing, i.e., repeating the speech as it is heard (e.g., Treisman, 1964), asking comprehension questions either during (Hafter, Xia, and Kalluri, 2012) or after listening to a passage of speech (Giolas and Epstein, 1963), or asking listeners to subjectively adjust the level of the speech to a specific threshold (Hawkins and Stevens, 1950). The speech reception thresholds (SRTs) measured by such methods are likely to be biased by post-perceptual processes such as memory or comprehension skill, or by subjective measures of intelligibility (Speaks et al., 1972).

We propose a new method for measuring the intelligibility of continuous speech. The method requires participants to listen to a 4-min segment of continuous speech, in noise, while simultaneously monitoring a written transcript of the same speech. The written transcript contains deliberate mistakes (e.g., the audio may be “he had pale blue eyes,” but the text would read “he had light blue eyes”). The listeners’ task is to mark these substitutions as they occur. The rationale behind this new method is that, if the speech is intelligible, listeners should be able to identify the word substitutions, but detection of the changes will become harder as the signal-to-noise ratio (SNR) is decreased. The number of substitutions detected is taken as the measure of speech intelligibility. Similar audio/visual monitoring tasks have been used previously to measure speech quality (Huckvale et al., 2010) and reading skill (McMahon, 1983), but this is believed to be the first time such a method has been used to measure the intelligibility of speech. The method has the benefits that (i) speech intelligibility is based on listeners’ ability to detect changes in the stimuli and as such does not require listeners to make subjective assessments of intelligibility, (ii) the speech does not need to be paused in order to record listener responses, and (iii) as intelligibility is assessed in an online manner, the influence of post-perceptual processes is diminished.
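As a concrete illustration of how such a monitoring task can be scored, the sketch below counts how many of a known set of transcript substitutions a listener marked. The word positions and marks here are invented for illustration and are not data from the study.

```python
# Sketch of scoring the monitoring task: the printed transcript differs from
# the audio at a known set of substituted word positions; a listener's marks
# are scored as the percentage of those substitutions that were detected.
# All positions below are hypothetical, purely for illustration.

substituted_positions = {12, 40, 77, 103}     # word indices changed in the transcript
listener_marks = {12, 40, 98, 103}            # word indices the listener marked

hits = substituted_positions & listener_marks  # correctly detected substitutions
pct_detected = 100.0 * len(hits) / len(substituted_positions)
print(f"{pct_detected:.0f}% of substitutions detected")  # 75% here
```

Marks at unsubstituted positions (false alarms, such as position 98 above) are simply not counted as hits in this sketch; how the study treated false alarms is not specified here.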
Furthermore, the task, while somewhat distinct from everyday listening, is parallel to watching television or a film with subtitles (Romero-Fresco, 2011). The aims of the current study were to (i) establish a proof of concept for the proposed measure of continuous speech and (ii) consider whether the duration and linguistic complexity of target speech can affect its intelligibility. Psychometric functions were measured for both the new monitoring task and a trial-by-trial, speech-in-noise test. The slope and the 50%


SRT were both determined from the psychometric function. Variations in test material have been shown to affect both SRT and slope (for a review, see MacPherson, 2013). It was hypothesized that the distinctive challenges associated with understanding continuous, linguistically complex speech should result in different intelligibility measurements compared to relatively short and simple speech tokens. The current experiment was carried out with older listeners with mild to moderate hearing impairments. It was expected that, due to increased difficulties understanding speech in noise (Pichora-Fuller et al., 1995), any effects of target duration and linguistic complexity would be easier to detect in this listener group.

II. EXPERIMENT 1

Seventeen listeners aged between 60 and 73 (mean age = 68) took part in the experiment. Their better-ear average pure-tone thresholds (BEAs; based on 500, 1000, 2000, and 4000 Hz) ranged from 25 to 50 dB hearing level (HL), with a mean of 39 dB HL.

Both linguistically complex targets presented continuously (“complex/continuous”) and syntactically simple targets presented on a trial-by-trial basis (“simple/trial-by-trial”) were used in the current experiment. The complex/continuous targets were extracts taken from a commercial audiobook, an unabridged version of “The Memoirs of Sherlock Holmes” by Arthur Conan Doyle. The extracts were spoken by a British-English male and were edited to durations of 4 min (or to the nearest word). Eight such segments were used in total. Segments were on average 647 words long and had an average speaking rate of 161 words per minute. The speech was relatively linguistically complex: sentences were of different lengths and syntactic structures, semantic information spanned several sentences, narration and dialog were mixed, and both high- and low-frequency words were used. The simple/trial-by-trial targets were taken from the Audiovisual Sentence Lists (ASL) and were spoken by a British-English male (Macleod and Summerfield, 1990). These targets were relatively linguistically simple: they consisted of sentences of equal length that had simple syntactic structures and contained many high-frequency words.

The stimuli were presented in a speech-shaped, unmodulated International Collegium for Rehabilitative Audiology noise (Dreschler et al., 2001). The target speech was presented at an average level of 70 dB sound pressure level (measured using a B&K (Montebello, CA) artificial ear type 4189 with a 1/2 in. microphone type 4134). The level of the noise masker was adjusted relative to the target to give seven different SNRs, spaced equally at 2 dB intervals.
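Because the target level is fixed, each SNR is realized by setting the masker level to the target level minus the desired SNR. A minimal sketch of this arithmetic, assuming a hypothetical listener whose pre-test threshold is -2 dB SNR and assuming the seven conditions sit three 2-dB steps either side of that threshold (the study's exact placement rule is not specified here):

```python
# Masker levels for a fixed 70 dB SPL target at seven SNRs, 2 dB apart,
# centred on a hypothetical pre-test threshold of -2 dB SNR.
target_level = 70.0   # dB SPL, as used in the experiment
pretest_srt = -2.0    # hypothetical listener threshold, dB SNR (assumption)

snrs = [pretest_srt + 2 * (i - 3) for i in range(7)]    # three steps either side
noise_levels = [target_level - snr for snr in snrs]     # masker level per condition

print(snrs)          # [-8.0, -6.0, -4.0, -2.0, 0.0, 2.0, 4.0]
print(noise_levels)  # [78.0, 76.0, 74.0, 72.0, 70.0, 68.0, 66.0]
```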
These SNRs were chosen individually for each listener on the basis of thresholds attained in a pre-test. The SNRs tested ranged from -12 to +12 dB. SNR was fixed over each 4-min segment.

In the trial-by-trial task, a random ASL sentence was presented in noise and listeners were asked to repeat back as much of the sentence as they could. Listeners were given as much time as they needed to respond, and the next trial was not initiated until a response had been made. On each trial the number of correctly identified keywords (out of a possible three) was recorded. The SNR at which the target speech was presented was randomized across trials. Listeners completed 30 trials at each of the seven SNRs. Psychometric functions were constructed by calculating the percentage of keywords correctly identified at each of the seven SNRs.

For the continuous task, typed transcripts of each 4-min segment were provided. In each transcript, 50 words were changed so that they no longer matched those of the audio recording. The words to be changed were selected in a pseudorandom fashion while adhering to several rules: no changes were made in the opening line of each transcript, no function words were changed (e.g., “the,” “his”), changed words needed to be at least 4 words apart, and no word was changed if it would alter the context of the text. Wherever possible, substituted words retained the same length and approximate word frequency as the original word. The transcript for each segment was presented in size-14 font with double line spacing on four sheets of A4 paper. Listeners were simply instructed to mark any word on the transcript that did not exactly match those they had just heard; they were not informed of the precise substitution rules outlined above. Listeners were not able to pause the speech within a trial. They completed eight continuous trials in total, one at each of the seven SNRs and one in quiet. The first continuous trial was always completed in quiet in order to measure listeners’ baseline performance on the task. The seven SNRs were randomized across the subsequent trials. Psychometric functions were constructed by calculating the percentage of substitutions correctly identified (out of a possible 50) at each of the seven SNRs.

The data points for all measured psychometric functions were fitted with a standard logistic equation (see Wichmann and Hill, 2001), from which the 50% SRT (in dB) and the peak slope at 50% (in % per dB) were determined. The experiment was carried out in a sound-treated booth.
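The fitting step can be sketched as below. The percent-correct data are invented for illustration, the fit is a coarse grid search rather than the maximum-likelihood procedure of Wichmann and Hill (2001), and the peak slope of a logistic spanning 0-100% is 100k/4 in % per dB, where k is the steepness parameter.

```python
import math

def logistic(snr, srt, k):
    # Percent correct vs SNR (dB): srt is the 50% point, k the steepness.
    return 100.0 / (1.0 + math.exp(-k * (snr - srt)))

# Hypothetical percent-correct scores at seven SNRs (illustrative only).
snrs = [-12, -8, -4, 0, 4, 8, 12]
pct  = [5, 15, 35, 60, 80, 92, 98]

# Least-squares fit by brute-force grid search over (srt, k).
best = None
for srt10 in range(-80, 81):        # candidate SRTs: -8.0 .. +8.0 dB in 0.1 dB steps
    for k100 in range(5, 201):      # candidate steepness: 0.05 .. 2.00
        cand_srt, cand_k = srt10 / 10.0, k100 / 100.0
        err = sum((logistic(x, cand_srt, cand_k) - y) ** 2
                  for x, y in zip(snrs, pct))
        if best is None or err < best[0]:
            best = (err, cand_srt, cand_k)

_, srt, k = best
slope = 100.0 * k / 4.0             # peak slope at the 50% point, % per dB
print(f"SRT = {srt:.1f} dB, peak slope = {slope:.1f} %/dB")
```

In practice a maximum-likelihood fit with lapse-rate parameters (as in Wichmann and Hill, 2001) would be preferred; the grid search simply keeps the sketch dependency-free.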
The stimuli were digitally generated on a personal computer and presented using an RME (Haimhausen, Germany) DIGI96/8 PAD soundcard and an Arcam (Cambridge, UK) A80 amplifier. Stimuli were presented diotically using Sennheiser (Old Lyme, CT) HD580 precision headphones. Listeners who wore hearing aids were tested unaided.

III. RESULTS

The across-listener mean SRT was higher for the complex/continuous condition, at 1.0 dB (standard deviation, SD = 4.5), than for the simple/trial-by-trial condition, at -3.9 dB (SD = 1.4). Figure 1(A) shows the by-listener scatterplot of the complex/continuous vs simple/trial-by-trial condition SRTs. The dotted line represents 1:1, where the SRT values for the two tasks are equivalent. The majority of data points lie below this line, indicating that for individual listeners the SRTs measured in the complex/continuous condition were consistently higher than they were in the simple/trial-by-trial condition. There is also a wider spread of SRT values for the complex/continuous task than for the simple/trial-by-trial task, suggesting that the complex/continuous task may be more sensitive to individual differences between the listeners. An effect of condition on SRT was confirmed by a one-way repeated-measures analysis of variance (ANOVA) [F(1,16) = 9.5, p < 0.01]. No significant correlation was found between listeners’ BEAs and SRTs for either the complex/continuous condition [r = 0.17, p = 0.53] or the simple/trial-by-trial condition [r = 0.02, p = 0.94].

FIG. 1. Scatter plots of individual listener data for the complex/continuous vs the simple/trial-by-trial task. (A) compares SRTs for the two tasks while (B) compares slope values. The dotted line in each panel represents 1:1, i.e., where measures on the two tasks are equivalent. The vertical and horizontal lines indicate the location of the mean values.

The across-listener mean slope was shallower for the complex/continuous condition, at 5.1% per dB (SD = 1.9% per dB), than for the simple/trial-by-trial condition, at 12.3% per dB (SD = 3.1% per dB). Figure 1(B) shows that this pattern was also seen at the individual-listener level, with nearly all data points lying above the dotted line, indicating that for individual listeners the slopes measured for the complex/continuous condition were consistently shallower than those for the simple/trial-by-trial condition. This was confirmed by a one-way repeated-measures ANOVA [F(1,16) = 63.8, p < 0.001]. Greater variation in the slope values was seen for the simple/trial-by-trial task, suggesting that in terms of slope the simple/trial-by-trial task was better at differentiating between listeners than the complex/continuous task. There was no evidence of a link between slope and hearing loss; no correlation was found between listeners’ BEAs and the slopes in either the complex/continuous [r = 0.25, p = 0.33] or simple/trial-by-trial condition [r = 0.22, p = 0.39]. The results suggest that understanding continuous, linguistically complex speech in noise was more challenging for the listeners than understanding short, syntactically simple sentences.

IV. EXPERIMENT 2

It was not possible to establish from Experiment 1 how much of the observed intelligibility difference could be attributed to differences in the duration of the targets and how much to differences in their linguistic complexity. Experiment 2 therefore exchanged the stimuli between the two speech-in-noise tasks, with the aim of separating out these two factors. The syntactically simple ASL sentences previously used in the trial-by-trial task were concatenated to form 4-min segments of speech and presented using the continuous monitoring method (i.e., a simple/continuous condition). Conversely, the 4-min segments of continuous speech taken from the audiobook were edited into shorter, yet still syntactically complex, sentences,(1) which were presented on a trial-by-trial basis (i.e., a complex/trial-by-trial condition). The procedure for the follow-up experiment was similar to that of the main experiment. A further group of 17 listeners was tested, with an equivalent average age (mean = 68 yr) and well-matched audiometric thresholds (mean hearing loss = 36 dB HL) relative to the previous group of listeners.

Table I reports the results along with the mean values from the first experiment. In the follow-up experiment, SRTs were found to be significantly higher for the complex/trial-by-trial condition than for the simple/continuous condition [F(1,16) = 105.5, p < 0.001], but there was no significant difference in the slope values for the two conditions [F(1,16) = 0.57, p = 0.46]. The results suggest that more complex speech utterances, even when presented on a trial-by-trial basis, were more challenging for the listeners than short, syntactically simple sentences presented continuously.

V. DISCUSSION

The aims of the current study were (i) to consider whether it is possible to measure objectively the intelligibility of continuous speech using an online method and (ii) to consider whether continuous, linguistically complex speech would yield different intelligibility values compared to a standard speech-in-noise test in which syntactically simple sentences were presented in a trial-by-trial procedure. Experiment 1 found intelligibility differences between the two tasks, with higher SRTs and shallower psychometric functions for the continuous, complex speech targets than for the isolated sentences. In an attempt to disentangle the effects of target duration and target complexity, the stimuli for the two speech-in-noise tasks were exchanged in Experiment 2. While syntactically simple, continuous speech and more complex trial-by-trial sentences gave the same-shaped psychometric functions, higher SNRs were required to understand the complex trial-by-trial sentences.

As different listeners were used in the two experiments, a direct comparison of their results cannot be made, but the general trends resulting from increasing either target duration (i.e., from a single sentence in the trial-by-trial task to 4 min of ongoing speech in the continuous task) or linguistic complexity (i.e., from the syntactically standardized ASL sentences to the more syntactically complex audiobook stimuli) will be briefly discussed. In terms of the slope of the psychometric function, the effects of complexity and duration were essentially additive. The slopes were shallowest when speech was both linguistically complex and continuous, but the slope increased by 3.6%-4.2% per dB as either complexity or target duration was reduced. For SRTs, however, target duration and complexity appear to interact.
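The additivity of the slope effects can be checked directly against the Table I slope values; the snippet below just transcribes those values (in % per dB) and takes the relevant differences.

```python
# Slopes of the psychometric functions, in % per dB, as reported in Table I.
slopes = {
    ("simple", "trial-by-trial"): 13.3,
    ("simple", "continuous"): 8.7,
    ("complex", "trial-by-trial"): 9.3,
    ("complex", "continuous"): 5.1,
}

base = slopes[("complex", "continuous")]  # shallowest case: complex AND continuous
gain_simpler = slopes[("simple", "continuous")] - base        # reduce complexity only
gain_shorter = slopes[("complex", "trial-by-trial")] - base   # reduce duration only

print(round(gain_simpler, 1), round(gain_shorter, 1))  # 3.6 4.2
```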
TABLE I. A summary of mean SRT and slope values (in parentheses) for all conditions: simple/trial-by-trial, simple/continuous, complex/trial-by-trial, and complex/continuous.

Linguistic complexity          Target duration
of target                Trial-by-trial        Continuous
Simple                   -3.9 dB (13.3%/dB)    -8.4 dB (8.7%/dB)
Complex                   0.7 dB (9.3%/dB)      1.0 dB (5.1%/dB)

While increasing target complexity resulted in an increase in SRT regardless of whether speech was presented continuously or on a trial-by-trial basis, increasing the duration of the target only had an effect on SRT when the speech was linguistically simple. These trends would seem to indicate that, at least in terms of SRT, the linguistic complexity of a target had a greater effect on intelligibility than its duration.

Exchanging the stimuli used in the two speech tasks could not account for all differences between the two stimulus sets. SNR was varied differently for the two speech tests, and different talkers were used for the linguistically complex and the more linguistically simple stimuli. While the use of different talkers could introduce a confound if the talkers were not of equal intelligibility, it should be noted that differences in SRT and slope of the same order as those seen across stimulus sets were also seen within a stimulus set (i.e., when the same talker was used) when the type of speech test was manipulated (see Table I).

While the current method makes some steps toward objectively measuring ongoing, linguistically complex speech, it has several drawbacks. The task requires listeners not just to identify speech, but also to read the text and then compare information from the two sources (McMahon, 1983). The reading difficulty of the stimuli and the participants’ reading ability are, therefore, important concerns for such a method. While all continuous stimuli in the current experiments were classed as being between standard difficulty and very easy to read on the Flesch Reading Ease scale (Flesch, 1948), participants’ reading ability may be a considerable factor in their performance on the task and deserves further consideration in future task validation. Another drawback of the task is that, by providing a written transcript of the presented speech, the amount of top-down information available to the listener is greatly increased compared to that available in the trial-by-trial task.
Additional top-down information, whether in the form of a prime or supporting context, can greatly aid speech understanding and substantially reduce the SNR needed to reach a given level of intelligibility (e.g., Freyman et al., 2004; Kalikow et al., 1977). It is possible that in the current study some of the increased difficulty associated with understanding continuous speech was masked by the top-down advantage the task offers. For example, SRTs were found to be considerably lower in the simple/continuous condition than in the, in principle, less demanding simple/trial-by-trial condition.

The current study indicates that different styles of communication are likely to result in differences in intelligibility. Compared to short and linguistically simple utterances, linguistically complex and continuous speech is generally more challenging, requiring both a greater increase in SNR and a more favorable SNR overall to reach the same level of intelligibility. We deliberately tested a group of older, mildly-to-moderately hearing-impaired listeners, as it was expected that these listeners would experience the greatest difficulties understanding ongoing, complex speech (Pichora-Fuller et al., 1995). Some comparison with a young normal-hearing control group is still needed, but the results were consistent with the idea that intelligibility suffered when the task was more taxing and opportunities for speech understanding to “catch up” were reduced (Shinn-Cunningham and Best, 2008).

In conclusion, the monitoring method described here requires further development, but the results indicate that it is feasible to objectively measure the intelligibility of ongoing speech without the need for response pauses. Such tests, when used in conjunction with standard trial-by-trial tasks, will help to sample a wider variety of everyday speech-in-noise scenarios and may, in turn, lead to further insights into the difficulties that some listeners experience there.

ACKNOWLEDGMENTS

The first author was funded by a Ph.D. studentship from the Medical Research Council, which was hosted by The University of Strathclyde. The Scottish Section of IHR is supported by the Medical Research Council (Grant No. U135097131) and by the Chief Scientist Office of the Scottish Government.

(1) These sentences were 4-12 words long (average of 7 words) and had 3-6 keywords (with an average of 4 keywords), which made them slightly longer than the sentences used in the simple/trial-by-trial condition.

Dreschler, W. A., Verschuure, H., Ludvigsen, C., and Westermann, S. (2001). “ICRA noises: Artificial noise signals with speech-like spectral and temporal properties for hearing instrument assessment,” Audiology 40(3), 148–157.
Flesch, R. (1948). “A new readability yardstick,” J. Appl. Psychol. 32(3), 221–233.
Freyman, R. L., Balakrishnan, U., and Helfer, K. S. (2004). “Effect of number of masking talkers and auditory priming on informational masking in speech recognition,” J. Acoust. Soc. Am. 115(5), 2246–2256.
Giolas, T. G., and Epstein, A. (1963). “Comparative intelligibility of word lists and continuous discourse,” J. Speech Hear. Res. 6(4), 349–358.
Hafter, E., Xia, J., and Kalluri, S. (2012). “A naturalistic approach to the cocktail party problem,” paper presented at the International Symposium on Hearing, St John’s College, Cambridge.
Hawkins, J. E., and Stevens, S. S. (1950). “The masking of pure tones and of speech by white noise,” J. Acoust. Soc. Am. 22(1), 6–13.
Huckvale, M., Hilkhuysen, G., and Frasi, D. (2010). “Performance-based measurement of speech quality with an audio proof-reading task,” paper presented at the 3rd ISCA Workshop on Perceptual Quality of Systems, p. 88. Available at http://www.academia.edu/3510726/Transmitted_Speech_Quality_versus_Perceptual_Annoyance_and_Service_Acceptability_Thresholds (Last viewed 1/28/2014).
Kalikow, D. N., Stevens, K. N., and Elliot, L. L. (1977). “Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability,” J. Acoust. Soc. Am. 61, 1337–1351.
Macleod, A., and Summerfield, Q. (1990). “A procedure for measuring auditory and audiovisual speech-reception thresholds for sentences in noise: Rationale, evaluation, and recommendations for use,” Br. J. Audiol. 24, 29–43.
MacPherson, A. (2013). “The factors affecting the psychometric function for speech intelligibility,” Ph.D. thesis, The University of Strathclyde, Glasgow. Available at http://www.ihr.mrc.ac.uk/publications (Last viewed 12/18/2013).
McMahon, M. L. (1983). “Development of reading-while-listening skills in the primary grades,” Read. Res. Quart. 19(1), 38–52.
Pichora-Fuller, M. K., Schneider, B. A., and Daneman, M. (1995). “How young and old adults listen to and remember speech in noise,” J. Acoust. Soc. Am. 97(1), 593–608.
Romero-Fresco, P. (2011). “Subtitling through speech recognition: Respeaking,” in Translation Practices Explained, edited by D. Kelly and S. Laviosa (St. Jerome Publishing, Manchester, UK), pp. 162–176.
Shinn-Cunningham, B. G., and Best, V. (2008). “Selective attention in normal and impaired hearing,” Trends Amplif. 12(4), 283–299.
Speaks, C., Parker, B., Kuhl, P., and Harris, C. (1972). “Intelligibility of connected discourse,” J. Speech Hear. Res. 15(3), 590–602.
Treisman, A. M. (1964). “The effect of irrelevant material on the efficiency of selective listening,” Am. J. Psychol. 77(4), 533–546.
Wichmann, F. A., and Hill, N. J. (2001). “The psychometric function: I. Fitting, sampling, and goodness of fit,” Percept. Psychophys. 63(8), 1293–1313.

