Ardoint et al.: JASA Express Letters

[http://dx.doi.org/10.1121/1.4895096]

Published Online 11 September 2014

The intelligibility of interrupted speech depends upon its uninterrupted intelligibility

Marine Ardoint,a) Tim Green, and Stuart Rosen
Speech Hearing and Phonetic Sciences, University College London, Chandler House, 2 Wakefield Street, London WC1N 1PF, United Kingdom
[email protected], [email protected], [email protected]

Abstract: Recognition of sentences containing periodic, 5-Hz, silent interruptions of differing duty cycles was assessed for three types of processed speech. Processing conditions employed different combinations of spectral resolution and the availability of fundamental frequency (F0) information, chosen to yield similar, below-ceiling performance for uninterrupted speech. Performance declined with decreasing duty cycle similarly for each processing condition, suggesting that, at least for certain forms of speech processing and interruption rates, performance with interrupted speech may reflect that obtained with uninterrupted speech. This highlights the difficulty in interpreting differences in interrupted speech performance across conditions for which uninterrupted performance is at ceiling.
© 2014 Acoustical Society of America

PACS numbers: 43.71.Es, 43.71.An, 43.71.Ky [SGS]
Date Received: May 2, 2014; Date Accepted: August 8, 2014

1. Introduction

The perception of speech that is periodically interrupted by silence has similarities to the situation where speech is masked by a fluctuating background noise, and so has been used to investigate the top-down processes involved in speech recognition in adverse circumstances (e.g., Nelson and Jin, 2004; Gilbert et al., 2007; Jin and Nelson, 2010). In normal hearing (NH) listeners, quite high levels of speech perception are possible even with large proportions of the speech signal replaced by silence (Nelson and Jin, 2004). In contrast, both hearing-impaired (HI) listeners (Gordon-Salant and Fitzgibbons, 1993; Baskent, 2010; Jin and Nelson, 2010) and cochlear implant (CI) users (Nelson and Jin, 2004; Chatterjee et al., 2010; Gnansia et al., 2010) appear much more sensitive to interruptions. For NH listeners presented with simulations of cochlear implant processing, performance with interrupted speech has been found to be affected both by the degree of spectral resolution and by the elimination of temporal fine structure (TFS). This has led to suggestions that fine spectral and temporal detail may be critical for the process of integrating fragments of speech into a coherent whole (Nelson and Jin, 2004; Gilbert and Lorenzi, 2010).

Fundamental frequency (F0) information might also be expected to contribute to the reconstruction of interrupted speech, given its important role in perceptual grouping and segregation of sound sources (e.g., Bregman, 1990; Micheyl and Oxenham, 2010). However, Chatterjee et al. (2010) compared recognition of interrupted noise-vocoded sentences which either had natural F0 variation or an artificially flattened F0 contour and found that while performance was affected by the number of frequency channels, the F0 contour had no effect.
It should be noted, however, that while temporal periodicity pitch cues would have been available in the noise-vocoded speech, such cues are substantially weaker than the pitch cues typically available via normal hearing (Green et al., 2002). Thus, it remains possible that F0 cues may contribute to recognition of interrupted speech in some conditions. Perhaps consistent with this, Baskent and Chatterjee (2010) found that recognition of noise-vocoded, interrupted sentences was improved by the addition of speech that was low-pass filtered at 500 Hz but otherwise unprocessed. When the number of channels in the vocoder was small, the improvement in performance was greater than would have been predicted based on Articulation Index calculations of the extra speech information available, suggesting a possible contribution from the better representation of voice pitch information provided by the low-pass filtered speech.

More generally, however, the interpretation of much of the previous evidence is open to considerable question. To identify factors that contribute specifically to processes involved in the reconstruction of interrupted speech, it is necessary to show that such factors affect the perception of interrupted speech differently than they affect uninterrupted speech. While factors such as spectral resolution and the availability of temporal fine structure have been shown to affect the perception of interrupted speech, performance with uninterrupted speech has typically remained at ceiling for different levels of the manipulated factor. Thus, it is not clear whether such manipulations are specifically affecting the reconstruction of interrupted speech, or if their effects on uninterrupted speech would be similar in the absence of ceiling effects. The present study, in contrast, tests performance with interrupted speech in different processing conditions which yield similar, below-ceiling performance for uninterrupted speech. This was achieved by employing different combinations of spectral resolution and F0 information.

a) Author to whom correspondence should be addressed.

J. Acoust. Soc. Am. 136 (4), October 2014

2. Method

2.1 Listeners

Twelve native British English speakers, aged between 19 and 32, participated. All had normal audiometric thresholds [up to 20 dB hearing level (HL)] between 125 Hz and 8 kHz.
The study was approved by the UCL Research Ethics Committee and each listener provided informed consent before participation.

2.2 Stimuli

Speech material consisted of IEEE sentences (IEEE, 1969) recorded from a male speaker of Standard Southern British English with a median F0 of 115 Hz. All sentences included five keywords.

Three different forms of vocoder processing that produced similar performance for uninterrupted speech were implemented. In each case, analysis filters (sixth-order Butterworth) spanned the range from 70 Hz to 10 kHz with cutoff frequencies based on Greenwood (1990). Envelopes were extracted by full-wave rectification and low-pass filtering at 30 Hz (fourth-order Butterworth) and used to modulate the amplitude of a carrier. The resulting waveforms were restricted to the appropriate frequency band using the analysis filters and summed to obtain the processed signals, which, finally, were low-pass filtered at 10 kHz (sixth-order elliptic). The number of analysis filters (i.e., frequency channels) and the nature of the carrier varied across conditions. Other studies in our laboratory allowed identification of the number of channels for each type of vocoding that would produce similar performance for uninterrupted speech.

In condition NzVoc8 the carrier was white noise and there were eight channels. Since the cutoff frequency of the envelope filter was below the F0 range, no F0 cues were available in this condition. In condition FxNx8 there were also eight channels, but F0 information was well represented since the carrier during voiced speech segments was a pulse train following the F0 contour of the original speech signal (Dudley, 1939). The contour was estimated using the ProsodyPro Praat script (Xu, 2013), a semi-automatic technique which involves manual checking. During unvoiced segments a noise carrier was used.
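As an illustration of how analysis-filter cutoffs "based on Greenwood (1990)" might be derived, the sketch below spaces band edges equally in cochlear place between 70 Hz and 10 kHz. The paper does not give the exact spacing rule, so the equal-place spacing, the human constants used (A = 165.4, a = 2.1, k = 0.88), and all function names are assumptions for illustration, not the authors' implementation.

```python
import math

# Greenwood (1990) human frequency-position function: F = A * (10**(a*x) - k),
# where x is the proportional distance along the basilar membrane (0 = apex, 1 = base).
# These constants are assumed here; the paper does not list them.
A, ALPHA, K = 165.4, 2.1, 0.88


def freq_to_place(freq_hz):
    """Map frequency (Hz) to proportional cochlear place x."""
    return math.log10(freq_hz / A + K) / ALPHA


def place_to_freq(x):
    """Map proportional cochlear place x back to frequency (Hz)."""
    return A * (10 ** (ALPHA * x) - K)


def band_edges(n_channels, f_low=70.0, f_high=10000.0):
    """Return n_channels + 1 cutoff frequencies equally spaced in cochlear place."""
    x_low, x_high = freq_to_place(f_low), freq_to_place(f_high)
    step = (x_high - x_low) / n_channels
    return [place_to_freq(x_low + i * step) for i in range(n_channels + 1)]


if __name__ == "__main__":
    # Channel counts used in the paper: 8 (NzVoc8, FxNx8) and 16 (Fx16).
    for n in (8, 16):
        print(n, [round(f) for f in band_edges(n)])
```

Under this spacing rule, every second edge of the 16-channel bank coincides with an edge of the 8-channel bank, so the two banks differ only in how finely each region is subdivided.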
In condition Fx16, a pulse train carrier was used throughout the signal, with F0 values interpolated across unvoiced and silent speech segments using piecewise cubic Hermite interpolation in logarithmic frequency. Perhaps the most unusual aspect of this processing is that, especially for sibilant fricatives, it results in periodic excitation at high frequencies only, producing a percept that is very “buzzy.” Also, because parts of the F0 contour are synthetic, they can sound somewhat unnatural in places. A similar technique was used by Faulkner et al. (2000) but with a fixed F0 value. This manipulation was found in pilot work to reduce performance compared to the use of a noise carrier for unvoiced speech, with the consequence that 16 channels were used to yield similar performance for uninterrupted speech. A summary of the three processing conditions is provided in Table 1.

For each processing condition, and also for unprocessed speech (UnProc), sentence recognition was tested with stimuli interrupted at a rate of 5 Hz. To minimize spectral splatter, 5-ms cosine ramps were applied to the onset and offset of the speech portions of the interrupted stimuli. The duty cycle (DC, the proportion of each 200 ms period in which the speech signal was present) was either 0.7 or 0.85. Performance was also tested for uninterrupted speech (DC = 1), resulting in a total of 12 conditions.

2.3 Procedure

Stimuli were delivered to NH listeners via Sennheiser HD 600 headphones at a presentation level of approximately 75 dB sound pressure level (SPL). Each test block comprised 20 sentences which listeners were asked to repeat as accurately as possible. Prior to the 12 main test blocks, listeners completed two training runs. The first comprised 30 uninterrupted sentences in each of the three processed conditions. The second comprised five sentences for each of the NzVoc8, FxNx8, Fx16, and UnProc conditions interrupted with a DC of 0.775, midway between the values used in the experiment proper.
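The interruption scheme described in Sec. 2.2 (5-Hz gating, a duty cycle DC, and 5-ms cosine ramps) can be sketched as a gain envelope that multiplies the speech waveform. This is a minimal sketch under stated assumptions: each 200-ms cycle is assumed to start with the speech portion, the ramps are assumed to lie inside the speech portion, and the function name and sample rate are hypothetical. None of these details are specified in the paper.

```python
import math


def gate(n_samples, fs, rate_hz=5.0, duty_cycle=0.7, ramp_ms=5.0):
    """Return per-sample gains (0..1) for periodic silent interruption.

    Assumes 2 * ramp length <= speech portion length, which holds for the
    paper's values (5-ms ramps, speech portions of 140 ms or 170 ms).
    """
    period = fs / rate_hz             # samples per on/off cycle (200 ms at 5 Hz)
    on_len = duty_cycle * period      # samples of speech in each cycle
    ramp = ramp_ms * fs / 1000.0      # ramp length in samples
    gains = []
    for n in range(n_samples):
        pos = n % period              # position within the current cycle
        if duty_cycle >= 1.0:
            g = 1.0                   # uninterrupted speech (DC = 1)
        elif pos >= on_len:
            g = 0.0                   # silent portion of the cycle
        elif pos < ramp:
            # raised-cosine onset ramp
            g = 0.5 * (1.0 - math.cos(math.pi * pos / ramp))
        elif pos > on_len - ramp:
            # raised-cosine offset ramp
            g = 0.5 * (1.0 - math.cos(math.pi * (on_len - pos) / ramp))
        else:
            g = 1.0                   # steady speech portion
        gains.append(g)
    return gains


if __name__ == "__main__":
    fs = 16000  # hypothetical sample rate
    for dc in (0.7, 0.85, 1.0):
        g = gate(fs, fs, duty_cycle=dc)  # one second = five cycles
        print(dc, sum(1 for v in g if v > 0) / len(g))
```

An interrupted stimulus would then be obtained by multiplying each speech sample by the corresponding gain, for example `[s * g for s, g in zip(speech, gate(len(speech), fs, duty_cycle=0.85))]`.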
For both training and the main tests, the order in which conditions were completed and the allocation of sentence lists to conditions were random for each participant.

Table 1. Summary of processing conditions.

Condition   Channels   Carrier                                                         Envelope filter cutoff (Hz)   F0 cues
NzVoc8      8          White noise                                                     30                            None
FxNx8       8          Voiced: F0 pulse train; Unvoiced: white noise                   30                            Strong
Fx16        16         Voiced: F0 pulse train; Unvoiced: interpolated F0 pulse train   30                            Strong

3. Results

As shown in Fig. 1, intelligibility of unprocessed speech was at or near ceiling for all DCs. Importantly, performance with each type of speech processing was below ceiling and above floor for all DCs. At each DC, performance was similar across the processing types. With no interruptions, mean performance was highest for NzVoc8 (84.9%) and lowest for Fx16 (80.3%). The decrement in mean performance with a DC of 0.85, relative to no interruptions, was 31 percentage points for all three processing types. Further reducing DC to 0.7 resulted in further decrements in mean performance of 33 percentage points for FxNx8, 35 percentage points for NzVoc8, and 27 percentage points for Fx16.

FIG. 1. Percent of keywords correctly identified for the four processing conditions (UnProc, black stars; FxNx8, black boxes; NzVoc8, dark gray boxes; and Fx16, light gray boxes) and the three duty cycles. Limits of the boxplot represent the upper and lower quartiles, horizontal bars the median, and symbols the mean. For clarity, boxes are offset along the x axis.

Data were subjected to a repeated measures analysis of variance (ANOVA) with factors of processing condition (excluding UnProc) and DC. As an indication of effect sizes, generalized η², as recommended by Bakeman (2005) for repeated measures designs, was calculated in addition to the more commonly used partial η². The main effect of DC was significant [F(2,22) = 816.0, p < 0.001, partial η² = 0.987, generalized η² = 0.908], but that of processing condition was not [F(2,22) = 1.26, p = 0.303, partial η² = 0.103, generalized η² = 0.016]. Despite the somewhat smaller decrement in performance for the change in DC between 0.85 and 0.7 for the Fx16 condition, there was no significant interaction [F(4,44) = 2.27, p = 0.077, partial η² = 0.171, generalized η² = 0.035].

4. Discussion

Consistent with the pattern of previous evidence across various studies, while sentence recognition with unprocessed speech remained at or close to ceiling when interruptions were introduced, for degraded speech there was a substantial drop in performance with decreasing duty cycle. Changes in performance with duty cycle were very similar for the three different types of processing, despite differences in spectral resolution and F0 cues. For example, the FxNx8 and NzVoc8 conditions provided the same number of frequency channels but very different degrees of F0 information. The difference in performance between these two conditions, however, was no more than 2 percentage points at any duty cycle. The only hint of a difference in the effect of 5-Hz silent interruptions across processing type was that, compared to FxNx8 and NzVoc8, there was a smaller decline in performance as duty cycle decreased from 0.85 to 0.7 for Fx16, the condition providing the greatest spectral detail. However, this difference was less than 10 percentage points and the interaction between processing type and duty cycle was not significant, so that this cannot be regarded as evidence of a special role for spectral resolution.

A key aspect of the present experiment was that performance with uninterrupted speech was equated at below ceiling levels across the different processing types.
Comparing the effect of interruptions on performance in a processed condition (or from actual CI users or HI listeners) with that on performance with unprocessed speech, which is at ceiling when uninterrupted, runs the risk of giving a misleading impression that particular speech features affected by the processing play a special role in the reconstruction of interrupted speech. Here, where performance was sufficiently degraded so as to be below ceiling for uninterrupted speech, the effect of interruptions was very similar across the three processing conditions despite their differences with respect to features such as spectral resolution and F0 information.

Of course, the absence of differences between the specific conditions implemented here does not rule out the possibility that certain speech features may be particularly important in the reconstruction process. In addition, it should be noted that the present results were obtained with only a single interruption rate (5 Hz). Recognition of interrupted speech varies with interruption rate (e.g., Miller and Licklider, 1950), perhaps reflecting both the impact of the duration of the remaining speech portions on the speech cues conveyed and differences in the extent to which interruptions introduce spurious speech cues. It is, therefore, quite possible that the reconstruction process may differ across interruption rates. Nonetheless, the possibility that performance with interrupted speech is determined primarily by the intelligibility of uninterrupted speech, rather than by the availability of any particular speech feature, would appear to warrant further investigation. In particular, the apparently severe effects of interruption for CI users compared to NH listeners may, at least in part, reflect the comparatively poorer representation of uninterrupted speech via a CI, rather than any difference in the reconstruction process itself.
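As a consistency check on the effect sizes reported in Sec. 3, partial eta squared can be recovered from each F ratio and its degrees of freedom using the standard identity partial eta squared = F·df_effect / (F·df_effect + df_error). The sketch below is not part of the original analysis, and the function name is hypothetical. Generalized eta squared cannot be checked this way, because it requires the individual sums of squares, which the paper does not report.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (F, df_effect, df_error, partial eta squared as reported in Sec. 3)
reported = {
    "duty cycle": (816.0, 2, 22, 0.987),
    "processing condition": (1.26, 2, 22, 0.103),
    "interaction": (2.27, 4, 44, 0.171),
}

if __name__ == "__main__":
    for name, (f, df1, df2, eta) in reported.items():
        computed = partial_eta_squared(f, df1, df2)
        print(f"{name}: computed {computed:.3f}, reported {eta}")
```

All three recomputed values agree with the reported ones to three decimal places, which suggests the F ratios, degrees of freedom, and partial effect sizes in the text are mutually consistent.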

Acknowledgments

This work was supported by the MRC (UK) Grant No. G1001255. Pilot work was carried out by Kurt Steinmetzger as part of his Ph.D. thesis.

References and links

Bakeman, R. (2005). “Recommended effect size statistics for repeated measures designs,” Behavior Res. Methods 37, 379–384.
Baskent, D. (2010). “Phonemic restoration in sensorineural hearing loss does not depend on baseline speech perception scores,” J. Acoust. Soc. Am. 128, EL169–EL174.
Baskent, D., and Chatterjee, M. (2010). “Recognition of temporally interrupted and spectrally degraded sentences with additional unprocessed low-frequency speech,” Hear. Res. 270, 127–133.
Bregman, A. S. (1990). Auditory Scene Analysis (MIT Press, Cambridge, MA).
Chatterjee, M., Peredo, F., Nelson, D., and Baskent, D. (2010). “Recognition of interrupted sentences under conditions of spectral degradation,” J. Acoust. Soc. Am. 127, EL37–EL41.
Dudley, H. (1939). “Remaking speech,” J. Acoust. Soc. Am. 11, 169–177.
Faulkner, A., Rosen, S., and Smith, C. (2000). “Effects of the salience of pitch and periodicity information on the intelligibility of four-channel vocoded speech: Implications for cochlear implants,” J. Acoust. Soc. Am. 108, 1877–1887.
Gilbert, G., Bergeras, I., Voillery, D., and Lorenzi, C. (2007). “Effects of periodic interruptions on the intelligibility of speech based on temporal fine-structure or envelope cues,” J. Acoust. Soc. Am. 122, 1336–1339.
Gilbert, G., and Lorenzi, C. (2010). “Role of spectral and temporal cues in restoring missing speech information,” J. Acoust. Soc. Am. 128, EL294–EL299.
Gnansia, D., Pressnitzer, D., Pean, V., Meyer, B., and Lorenzi, C. (2010). “Intelligibility of interrupted and interleaved speech for normal-hearing listeners and cochlear implantees,” Hear. Res. 265, 46–53.
Gordon-Salant, S., and Fitzgibbons, P. J. (1993). “Temporal factors and speech recognition performance in young and elderly listeners,” J. Speech Hear. Res. 36, 1276–1285.
Green, T., Faulkner, A., and Rosen, S. (2002). “Spectral and temporal cues to pitch in noise-excited vocoder simulations of continuous-interleaved-sampling cochlear implants,” J. Acoust. Soc. Am. 112, 2155–2164.
Greenwood, D. D. (1990). “A cochlear frequency-position function for several species—29 years later,” J. Acoust. Soc. Am. 87, 2592–2605.
Institute of Electrical and Electronics Engineers (1969). “IEEE recommended practice for speech quality measurements,” IEEE Trans. Audio Electroacoust. 17, 225–246.
Jin, S. H., and Nelson, P. B. (2010). “Interrupted speech perception: The effects of hearing sensitivity and frequency resolution,” J. Acoust. Soc. Am. 128, 881–889.
Micheyl, C., and Oxenham, A. J. (2010). “Pitch, harmonicity and concurrent sound segregation: Psychoacoustical and neurophysiological findings,” Hear. Res. 266, 36–51.
Miller, G. A., and Licklider, J. C. R. (1950). “The intelligibility of interrupted speech,” J. Acoust. Soc. Am. 22, 167–173.
Nelson, P. B., and Jin, S. H. (2004). “Factors affecting speech understanding in gated interference: Cochlear implant users and normal-hearing listeners,” J. Acoust. Soc. Am. 115, 2286–2294.
Xu, Y. (2013). “ProsodyPro—A Tool for Large-scale Systematic Prosody Analysis,” in Proceedings of Tools and Resources for the Analysis of Speech Prosody (TRASP 2013), Aix-en-Provence, France.
