Intelligibility of Emotional Speech in Younger and Older Adults

Kate Dupuis1,2,3 and M. Kathleen Pichora-Fuller1,3,4

1Department of Psychology, University of Toronto, Toronto, Ontario, Canada; 2Baycrest Health Sciences, Toronto, Ontario, Canada; 3Toronto Rehabilitation Institute, Toronto, Ontario, Canada; 4Rotman Research Institute, Toronto, Ontario, Canada

Objectives: Little is known about the influence of vocal emotions on speech understanding. Word recognition accuracy for stimuli spoken to portray seven emotions (anger, disgust, fear, sadness, neutral, happiness, and pleasant surprise) was tested in younger and older listeners. Emotions were presented in either mixed (heterogeneous emotions mixed in a list) or blocked (homogeneous emotion blocked in a list) conditions. Three main hypotheses were tested. First, vocal emotion affects word recognition accuracy; specifically, portrayals of fear enhance word recognition accuracy because listeners orient to threatening information and/or distinctive acoustical cues such as high pitch mean and variation. Second, older listeners recognize words less accurately than younger listeners, but the effects of different emotions on intelligibility are similar across age groups. Third, blocking emotions in a list results in better word recognition accuracy, especially for older listeners, and reduces the effect of emotion on intelligibility because as listeners develop expectations about vocal emotion, the allocation of processing resources can shift from emotional to lexical processing.

Design: Emotion was the within-subjects variable: all participants heard speech stimuli consisting of a carrier phrase followed by a target word spoken by either a younger or an older talker, with an equal number of stimuli portraying each of seven vocal emotions. The speech was presented in multi-talker babble at signal-to-noise ratios adjusted for each talker and each listener age group. Listener age (younger, older), condition (mixed, blocked), and talker (younger, older) were the main between-subjects variables. Fifty-six students (Mage = 18.3 years) were recruited from an undergraduate psychology course; 56 older adults (Mage = 72.3 years) were recruited from a volunteer pool. All participants had clinically normal pure-tone audiometric thresholds at frequencies ≤3000 Hz.

Results: There were significant main effects of emotion, listener age group, and condition on the accuracy of word recognition in noise. Stimuli spoken in a fearful voice were the most intelligible, while those spoken in a sad voice were the least intelligible. Overall, word recognition accuracy was poorer for older than younger adults, but there was no main effect of talker, and the pattern of the effects of different emotions on intelligibility did not differ significantly across age groups. Acoustical analyses helped elucidate the effect of emotion and some intertalker differences. Finally, all participants performed better when emotions were blocked. For both groups, performance improved over repeated presentations of each emotion in both blocked and mixed conditions.

Conclusions: These results are the first to demonstrate a relationship between vocal emotion and word recognition accuracy in noise for younger and older listeners. In particular, the enhancement of intelligibility by emotion is greatest for words spoken to portray fear and presented heterogeneously with other emotions. Fear may have a specialized role in orienting attention to words heard in noise. This finding may be an auditory counterpart to the enhanced detection of threat information in visual displays. The effect of vocal emotion on word recognition accuracy is preserved in older listeners with good audiograms, and both age groups benefit from blocking and the repetition of emotions.

Key words: Adult aging, Pitch, Vocal emotion, Word recognition. (Ear & Hearing 2014;35;695–707)

INTRODUCTION

In traditional tests of speech understanding, the listener hears and repeats words or phrases. Although widely used in labs and clinics for over half a century, these tests do not provide a highly accurate index of everyday communication performance, possibly because the stimuli do not have many of the variations that characterize speech produced in naturalistic situations (Killion et al. 2004; Taylor 2007; for a discussion see Bunton & Keintz 2008). Traditionally, professional talkers have recorded test stimuli by speaking in a neutral voice in artificial studio conditions. While it has long been recognized that speech carries linguistic as well as social and personal information (Ladefoged & Broadbent 1957), traditional test materials are typically designed to emphasize the linguistic information carried in the stimuli and to minimize emotional information. Such an emotionally neutral style of speech production is not representative of everyday speech, in which emotional information is conveyed through prosodic variations in acoustical cues such as pitch, duration, and intensity (Cutler et al. 1997; Pell 2001). Furthermore, it is well known that, in emotionally neutral speech, linguistic prosodic cues such as pitch contour and variation can affect intelligibility (e.g., Wingfield et al. 1989; Cutler et al. 1997). Further evidence of the important role of vocal pitch in speech understanding comes from studies showing that word recognition is better when sentences are spoken naturally than when artificial alterations of speech acoustics, such as flattened or exaggerated pitch contours, are introduced (Laures & Weismer 1999; Laures & Bunton 2003; Miller et al. 2010). Thus, it is possible that emotional prosodic cues may also influence how well speech is understood in noise.

Although much is known about the effects of linguistic prosody on speech understanding, and some research has examined the effects of emotional prosody on speech perception in quiet, less research has examined how vocal emotion affects word recognition accuracy in adverse listening conditions. Mullennix et al. (2002) found that latencies for two tasks performed in quiet, matching judgment and phoneme identification, were negatively affected by variability in emotional prosody, suggesting that listeners benefit from repeated presentation of the same emotions. Nygaard and Queen (2008) demonstrated that the latency of word repetition in quiet was facilitated by congruency between semantic and prosodic emotional information and that word repetition was faster for stimuli spoken to portray sadness compared to happiness or neutral emotion.




Gordon and Hibberts (2011) found that younger adults repeated sentences presented in noise more accurately when speech was spoken to portray happiness compared to sadness or neutral emotion. These authors suggested that the advantage conferred by positive affective prosody may reflect listeners' motivation to gather approach information from stimuli or, alternatively, it may be related to the way that portrayals of happiness in speech are articulated. It is curious that happy vocal emotion was advantageous in the study by Gordon and Hibberts but not in the study by Nygaard and Queen (2008). One possible explanation for the discrepancy between these two studies is that vocal emotion has different effects on speech heard in quiet compared to speech heard in noise. Furthermore, in both studies only a small subset of emotions (happiness, sadness, and neutral) was tested, and, based on research using visual emotional stimuli, this subset of emotions may have weaker effects on performance than other emotions that were not tested.

Mounting evidence, mainly from the visual domain, suggests that the emotional valence of a stimulus can affect how it is detected, attended to, and/or remembered. For example, participants are faster at detecting fear-relevant stimuli (e.g., spiders or snakes) than fear-irrelevant stimuli (e.g., flowers or mushrooms) in a complex visual array (Öhman et al. 2001a). There is also a detection advantage for threatening faces in arrays of neutral faces or other emotional faces (Öhman et al. 2001b). In addition to the effects of certain emotions on visual detection, attention to a visual stimulus can be influenced by emotion; for example, positive mood states can lead to a broadening in the scope of visual attention, while negative mood states can lead to a narrowing (Rowe et al. 2007; Schmitz et al. 2009; see Huntsinger 2013 for a recent review). In auditory studies in which the task is to judge the sex of a speaker, emotional prosody, specifically portrayals of anger, can lead to changes in behavioral (e.g., reaction times) and physiological (e.g., skin conductance) responses (Aue et al. 2011), as well as activation of specific brain regions (e.g., the superior temporal sulcus) that respond selectively to emotional compared to neutral prosody (Grandjean et al. 2005).

The emotional valence of a visual stimulus (positive or negative vs. neutral) and the physiological arousal elicited by a stimulus (strong vs. weak) can also enhance memory for information (Dolcos et al. 2004; Kensinger & Corkin 2004; see Hamann 2001 and Kensinger 2009 for reviews). One study found that emotional nonlinguistic vocalizations (sad crying and fearful screams) were remembered better than neutral stimuli (Armony et al. 2007). However, it is not yet known whether there are similar effects of vocal emotion on how spoken words are recognized, attended to, or remembered when they are presented in noise.

It seems reasonable to hypothesize that the ability to respond quickly to stimuli to maximize safety and survival in the presence of danger has evolutionary significance and should not be modality-specific. In particular, the vocal emotions important for sharing information about threats could render speech more intelligible than speech spoken with emotions other than fear, while emotions such as anger could be easier to recognize (Dupuis & Pichora-Fuller 2010) and could enhance talker identification (Aue et al. 2011).
In addition to the effects of emotional valence on detection, attention, and memory for visual stimuli, researchers have also shown that the consistency with which emotions are presented can affect memory. Grühn et al. (2005, 2007) conducted two studies on the effects of emotion on memory for visual materials presented in lists that were homogeneous or heterogeneous in terms of semantic emotion. Using printed words and pictures with emotionally positive, negative, and neutral valences, these authors reported that recall was better for both younger and older participants in the homogeneous (emotion blocked) than in the heterogeneous (emotion mixed) condition. Interestingly, compared to the heterogeneous condition, in the homogeneous condition participants rated printed words as less negative or positive and recalled more neutral than positive or negative stimuli. The authors suggested that participants may allocate more processing resources to remembering the items and less to processing the emotional information when emotion is consistent and predictable within a block. Blocking may be helpful insofar as consistency in emotion can be thought of as a kind of predictable context. In general, a key finding in cognitive aging is that supportive context plays an important role in enabling older adults to compensate when performing cognitively challenging tasks (for a review see Craik & Bialystok 2008). More specifically, older adults have been shown to benefit from the use of supportive context to improve performance on tasks involving listening to speech (see Pichora-Fuller 2008). Since the use of a blocked (homogeneous) emotion presentation in a memory test can benefit the performance of both younger and older adults (e.g., Grühn et al. 2005, 2007), it may be that this type of presentation can also improve word recognition accuracy and reduce age-related differences in performance.

Older adults are known to have poorer ability than younger adults to recognize vocal emotions (e.g., Mitchell 2007; Dupuis & Pichora-Fuller 2010; Ryan et al. 2010). Nevertheless, it is unknown how the expression of vocal emotion in stimuli used in a word recognition test might alter age-related differences in speech understanding. On traditional tests of word recognition in noise, older adults typically underperform compared to their younger counterparts. Older adults also report greater everyday difficulties understanding speech in the presence of noise or distractions compared to younger adults (e.g., Banh et al. 2012). Such difficulties are reported even by older adults with hearing thresholds that are considered to be clinically normal for most of the speech range, with the most likely explanation for their difficulties being age-related declines in supra-threshold auditory temporal processing (for reviews see Pichora-Fuller & Souza 2003; Fitzgibbons & Gordon-Salant 2010; Humes & Dubno 2010).

Age-related declines in at least two types of auditory temporal processing are potentially relevant to understanding emotional speech in noise: periodicity (synchrony) coding and envelope coding (Ezzatian et al. 2012; Schvartz & Chatterjee 2012). Reduced periodicity coding in older adults may account for age-related differences in both behavioral (e.g., Abel et al. 1990; Pichora-Fuller et al. 2007) and physiological studies (e.g., Anderson et al. 2011). Importantly, periodicity coding is believed to be important in discriminating voices based on pitch differences (e.g., Vongpaisal & Pichora-Fuller 2007). There is also behavioral and physiological evidence for age-related differences in how well fluctuations in the amplitude envelope are detected and discriminated. For example, there are age-related differences in the ability to detect gaps in both nonspeech and speech stimuli (e.g., Gordon-Salant et al. 2006; Pichora-Fuller et al. 2006). Older adults also have more difficulty discriminating changes in amplitude modulation, with associated difficulties in envelope following (Purcell et al. 2004).




Furthermore, when vocal fine structure is removed by vocoding speech, leaving only amplitude envelope cues in each of the frequency bands used in vocoding, older adults require a greater number of bands to achieve the same level of word recognition performance as younger adults (e.g., Souza & Boike 2006; Sheldon et al. 2008; Grose et al. 2009). Thus, age-related declines in supra-threshold auditory temporal processing could hamper the ability of older adults to extract periodicity, gap, and ongoing envelope fluctuation cues that are relevant for speech perception and that may vary with emotion.

Three main hypotheses were tested. The first hypothesis is that vocal emotion affects word recognition accuracy in noise; specifically, portrayals of fear may enhance word recognition accuracy because listeners orient to threatening information and/or because the acoustical cues (e.g., pitch mean and variation) that are important for producing and identifying fear (e.g., Pittam & Scherer 1993; Scherer 2003) are also important for speech intelligibility in noise (e.g., Miller et al. 2010). The second hypothesis is that older listeners recognize words less accurately than younger listeners and that the pattern of the effects of different emotions on intelligibility may be attenuated or may differ across age groups because of age-related differences in auditory temporal processing. The third hypothesis is that presenting emotions in homogeneous blocks reduces the effect of emotion on word recognition accuracy because listeners develop expectations about vocal emotion, so that processing resources can be allocated away from emotional processing and toward lexical processing of the information conveyed by the stimuli. Accordingly, word recognition accuracy should be higher for stimuli presented in the homogeneous blocked condition than in the heterogeneous mixed condition, and accuracy should increase with repeated exposure to a given emotion, especially for older listeners.
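The "pitch mean and variation" cues referred to in the first hypothesis are typically quantified as the mean and standard deviation of the fundamental frequency (F0) over the voiced portions of an utterance. The following is a minimal sketch of one way such measures could be computed; it is an illustration only, not the authors' acoustical analysis pipeline, and it assumes Python with the librosa and NumPy packages and a hypothetical file name.

```python
# Minimal sketch (not the authors' procedure): estimate F0 mean and variation
# for a recorded utterance, as a proxy for the "pitch mean and variation" cues
# discussed in the text. Assumes librosa >= 0.8 and a hypothetical file name.
import numpy as np
import librosa

def f0_mean_and_sd(wav_path, fmin=75.0, fmax=500.0):
    """Return (mean F0, SD of F0) in Hz over the voiced frames of a recording."""
    y, sr = librosa.load(wav_path, sr=None)           # keep the native sampling rate
    f0, voiced_flag, voiced_prob = librosa.pyin(      # probabilistic YIN F0 tracker
        y, fmin=fmin, fmax=fmax, sr=sr)
    voiced_f0 = f0[~np.isnan(f0)]                     # pyin marks unvoiced frames as NaN
    if voiced_f0.size == 0:
        return np.nan, np.nan                         # no voiced frames detected
    return float(np.mean(voiced_f0)), float(np.std(voiced_f0))

# Example (hypothetical stimulus file):
# mean_f0, sd_f0 = f0_mean_and_sd("fear_target_word.wav")
# print(f"F0 mean = {mean_f0:.1f} Hz, F0 SD = {sd_f0:.1f} Hz")
```

Higher F0 means and larger F0 variation are among the cues reported for fearful speech in the acoustical literature cited above, which is why summary statistics of this kind are a natural starting point for relating vocal emotion to intelligibility.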

MATERIALS AND METHODS

Participants

Fifty-six university students (Mage = 18.31 years, SD = 1.33, 84% female) and 56 older adults (Mage = 72.34 years, SD = 5.28, 63% female) from an existing volunteer pool were tested. All participants had acquired English by the age of 5 years and were, on average, in good self-reported health, as indicated by a mean score of 3.56 on a four-point scale ranging from 1 (poor) to 4 (excellent). All participants had completed at least Grade 10, and the majority of the older adults (82%) had undertaken post-secondary education (see Table 1 for participant characteristics). All participants had clinically normal pure-tone air conduction thresholds of no greater than 25 dB HL in the frequency range most important for speech (250 to 3000 Hz) in the better ear and no significant interaural threshold asymmetry (no more than 15 dB interaural difference at more than two adjacent test frequencies up to and including 3000 Hz; see Table 2 for audiometric thresholds). The younger participants received course credit and the older participants received $10/hr. There was one testing session that lasted approximately 1 hr. This work was conducted in accordance with human ethics standards and received approval from the Social Sciences, Humanities, and Education research ethics board of the University of Toronto. Participants provided informed consent and were tested individually.
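As a concrete reading of the audiometric inclusion criteria just described, the sketch below checks one participant's pure-tone thresholds against the two rules: thresholds no greater than 25 dB HL from 250 to 3000 Hz in the better ear, and interaural differences exceeding 15 dB at no more than two adjacent test frequencies up to and including 3000 Hz. The function name, data layout, and the run-length interpretation of the asymmetry rule are assumptions offered only to make the criteria explicit, not part of the study protocol.

```python
# Minimal sketch (hypothetical helper, not part of the study protocol): apply the
# audiometric inclusion criteria described above to one participant's thresholds.
FREQS_HZ = [250, 500, 1000, 2000, 3000]   # test frequencies up to and including 3000 Hz

def meets_inclusion_criteria(left_db_hl, right_db_hl,
                             max_threshold=25, max_asymmetry=15):
    """left_db_hl / right_db_hl: thresholds (dB HL) at FREQS_HZ, in order."""
    # Rule 1: better-ear thresholds no greater than 25 dB HL at 250-3000 Hz.
    better_ear = [min(l, r) for l, r in zip(left_db_hl, right_db_hl)]
    if any(t > max_threshold for t in better_ear):
        return False

    # Rule 2 (one reading of the criterion): the interaural difference may exceed
    # 15 dB at no more than two adjacent test frequencies.
    exceeds = [abs(l - r) > max_asymmetry for l, r in zip(left_db_hl, right_db_hl)]
    run = longest_run = 0
    for flag in exceeds:
        run = run + 1 if flag else 0
        longest_run = max(longest_run, run)
    return longest_run <= 2

# Example with made-up thresholds (passes both rules):
# print(meets_inclusion_criteria([10, 15, 20, 25, 25], [15, 20, 25, 30, 35]))
```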

TABLE 1. Summary of participant characteristics (means and SEs)

Participant Characteristic      Younger Adults (N = 56)    Older Adults (N = 56)
Years of education              12.46 (0.13)               15.52 (0.42)
Vocabulary score (0–20)         11.93 (0.22)               15.45 (0.25)
Health rating score (1–4)       3.56 (0.08)                3.20 (0.09)
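The Design summary above notes that speech stimuli were presented in multi-talker babble at signal-to-noise ratios adjusted for each talker and each listener age group. As a rough illustration of what presenting a target "at a given SNR" entails, the sketch below scales a babble waveform so that a target recording is mixed at a specified SNR. It is a generic illustration, not the authors' calibration or presentation procedure, and it assumes NumPy arrays of equal length containing the target and babble waveforms at the same sampling rate.

```python
# Minimal sketch (not the study's presentation software): mix a target utterance
# with multi-talker babble at a specified signal-to-noise ratio (SNR, in dB).
import numpy as np

def mix_at_snr(target, babble, snr_db):
    """Scale `babble` so that target + babble has the requested SNR.

    target, babble: 1-D NumPy arrays of equal length (same sampling rate).
    """
    target_rms = np.sqrt(np.mean(target ** 2))
    babble_rms = np.sqrt(np.mean(babble ** 2))
    # Choose a babble RMS so that 20*log10(target_rms / new_babble_rms) == snr_db.
    desired_babble_rms = target_rms / (10 ** (snr_db / 20))
    scaled_babble = babble * (desired_babble_rms / babble_rms)
    return target + scaled_babble

# Example with synthetic signals (a tone "target" in Gaussian "babble") at -5 dB SNR:
# sr = 44100
# t = np.arange(sr) / sr
# target = 0.1 * np.sin(2 * np.pi * 220 * t)
# babble = np.random.randn(sr)
# mixture = mix_at_snr(target, babble, snr_db=-5.0)
```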
