International Journal of Audiology 2014; 53: 710–718

Original Article

Assessing variability in audiovisual speech integration skills using capacity and accuracy measures

Nicholas Altieri & Daniel Hudock

Department of Communication Sciences and Disorders, Idaho State University, Pocatello, USA

Abstract
Objective: While most normal-hearing listeners rely on the auditory modality to obtain speech information, research has demonstrated the importance of non-auditory modalities for language recognition during face-to-face communication. Efficient utilization of the visual modality becomes increasingly important in difficult listening conditions, especially for older and hearing-impaired listeners with sensory or cognitive decline. First, this report will quantify audiovisual integration skills using a recently developed capacity measure that incorporates speed and accuracy. Second, to investigate sensory factors contributing to integration ability, high- and low-frequency hearing thresholds will be correlated with capacity, as well as with gain measures from sentence recognition. Design: Integration scores were obtained from a within-subjects design using an open-set sentence recognition experiment and a closed-set speeded word classification experiment designed to examine integration (i.e. capacity). Study sample: A sample of 44 adult listeners without a self-reported history of hearing loss was recruited. Results: Results demonstrated a significant relationship between measures of audiovisual integration and hearing thresholds. Conclusions: Our data indicated that a listener's ability to integrate auditory and visual speech information in the domains of speed and accuracy is associated with auditory sensory capabilities and possibly other sensory and cognitive factors.

Key Words: Speech perception; noise; behavioral measures; aging

The effects of visual speech, or being able to obtain cues from seeing a talker's face during speech recognition, have been documented over the past several decades (e.g. Bergeson & Pisoni, 2004; Erber, 1969; McGurk & Macdonald, 1976; Sumby & Pollack, 1954; Summerfield, 1987). Visual speech provides benefits above auditory-only performance, and these "gain" effects have been observed repeatedly in terms of accuracy and also reaction times (RT) for both young and older listeners. This is especially true in noisy listening environments, in which speech-reading can provide the equivalent of a 15-dB gain in the auditory signal (Sumby & Pollack, 1954). Speech recognition has thus been described as a multimodal phenomenon requiring the listener to extract and combine information from separate sources. Audiovisual integration not only requires the successful extraction of auditory and visual cues from a time-variable signal, but also the real-time conjoining of cues that contain both redundant and complementary features (complementary cues include visual place of articulation and auditory manner plus voicing) (Grant, 2002; Grant et al, 1998). Integration necessitates the successful combination of separate auditory and visual bottom-up sensory encoding, the combination of complementary features using top-down processes,

and even the involvement of higher cognitive functions such as working memory (see Feld & Sommers, 2009; Zekveld et al, 2013, for studies on unisensory perception). This paper will investigate the relationship between measures of integration ability (audiovisual gain and capacity) and lower-level sensory abilities (pure-tone thresholds). While the benefits associated with audiovisual speech perception have been reported across age groups and listeners with different hearing acuity, there is considerable variability between individuals in the ability to integrate multimodal information sources (Grant & Seitz, 1998; Grant et al, 1998). Significantly, individual variability in integration persists even when auditory- and visual-only recognition scores are taken into consideration. As an example, Grant and Seitz (1998) reported variability in audiovisual gain scores among hearing-impaired participants in nonsense syllable and sentence recognition tasks. The authors reported that hearing-impaired listeners not only differed from one another in their ability to integrate auditory and visual information, but also showed higher gain compared to normal-hearing listeners. Second, integration measures using nonsense syllables were correlated with the measures obtained from sentence recognition.

Correspondence: Nicholas Altieri, Department of Communication Sciences and Disorders, Idaho State University, 921 S. 8th Ave. Stop 8116, Pocatello, ID, 83209 USA. E-mail: [email protected] (Received 19 September 2013; accepted 23 March 2014 ) ISSN 1499-2027 print/ISSN 1708-8186 online © 2014 British Society of Audiology, International Society of Audiology, and Nordic Audiological Society DOI: 10.3109/14992027.2014.909053

Abbreviations
AV: Audiovisual
C(t): Capacity
RT: Reaction time

Grant et al (1998) also observed that each individual in a group of 29 hearing-impaired listeners showed some audiovisual benefit compared to auditory-only syllable and sentence recognition accuracy. Variability in integration skills was examined by comparing obtained audiovisual accuracy to scores predicted by the Pre-Labeling Integration Model (Braida, 1991) and the Fuzzy Logical Model of Perception (Massaro, 2004). The results showed that "better integrators" achieved greater audiovisual benefit compared to model predictions. These observations indicate that audiovisual accuracy depends not only on a listener's ability to extract unisensory cues, but also on their ability to integrate these cues. Erber (2003), for instance, argued that listeners who are capable of hearing low-frequency sounds (250–1000 Hz), which convey information about vowels, voicing, and manner of articulation, typically benefit more from complementary visual cues than listeners who hear little to no low-frequency information. Erber further pointed out that older listeners with "sloping audiograms", indexing high- rather than low-frequency hearing loss, generally benefit from visual speech. This appears to be bolstered by research showing that when normal-hearing listeners are presented with low-frequency bands of speech plus visual speech cues, they show higher audiovisual accuracy than when they are presented only with high-frequency bands (Grant & Walden, 1996). These results were ostensibly due to the fact that high-frequency components of the signal contain information about place cues, which provide redundant rather than complementary cues to the visual signal. Tye-Murray and colleagues (2007), and also Sommers and colleagues (2005), nonetheless reported that hearing-impaired listeners showed higher visual-only accuracy while showing integration abilities similar to those of normal-hearing older adults. Despite research showing that hearing-impaired listeners show different integration potential than normal-hearing listeners (Bergeson & Pisoni, 2004; Erber, 2003), the extent to which low- and high-frequency hearing predict integration remains unclear. We propose that the "capacity" approach described in the following section provides a methodology for addressing issues that have not been thoroughly examined using accuracy-only approaches (see Altieri et al, 2013). This study differs from that of Tye-Murray et al (2007) in that they used two groups of listeners (normal-hearing vs. hearing-impaired) while we used a single sample of listeners with variable hearing ability. We also correlated audiogram measures with integration skills, while implementing measures of integration ability that compared obtained AV scores to model-based predictions. The benefits of being able to see a talker's face in noisy listening environments have typically been quantified by comparing word or sentence recognition accuracy when both auditory and visual information are available to accuracy in auditory- or visual-only experimental conditions. The following difference score provides one way to compute audiovisual gain: AV Gain = p(AV) − max{p(A), p(V)}, where p(AV) denotes the probability correct on audiovisual trials, and p(A) and p(V) denote the probability correct on auditory- and visual-only trials, respectively (Altieri & Wenger, 2013).


This measure shows the multisensory benefit a listener achieves compared to their best unisensory modality. One potential disadvantage of audiovisual difference scores such as these is that they include unimodal performance. This is problematic because AV gain will likely be negatively correlated with auditory-only identification, which in turn is negatively associated with hearing thresholds. Some remedies involve using a normalized measure, which assesses visual gain relative to a listener's auditory capabilities (e.g. Tye-Murray et al, 2007). Alternatively, we suggest comparing obtained AV accuracy to independence predictions, given by P(AV) = p(A) + p(V) − p(A)·p(V) (see Altieri et al, 2013; Tye-Murray et al, 2007). This latter formula yields the probability of correctly recognizing auditory or visual information, assuming that the sources are processed independently. Regardless, these accuracy approaches are incomplete since they ignore a crucial component of perception: processing speed. More germane to our present study, research has recently begun investigating the contribution of RTs to assessing audiovisual integration (Altieri & Townsend, 2011; Winneke & Phillips, 2011). Specifically, research has emphasized an RT-based measure of efficiency, known as "capacity" or C(t), that measures processing speed as workload changes. Within the context of audiovisual speech, "efficiency as a function of workload" refers to the change in processing efficiency as a function of the number of modalities present in the signal (AV vs. A- or V-only; Altieri & Townsend, 2011; see Appendix A, to be found online at http://informahealthcare.com/doi/abs/10.3109/14992027.2014.909053). In other words, does having visual information available contribute to more efficient use of cognitive-linguistic resources? We propose that capacity, C(t), can be advantageous over and above accuracy-only measures because it takes into account processing speed, a significant variable affecting face-to-face communication, and also because data from the audiovisual trials can be compared to the well-defined statistical benchmark of parallel independent race models (Altieri et al, 2013; Miller, 1982). Parallel independent race model predictions provide a sensible null hypothesis in which auditory and visual information do not interact, and are processed independently and therefore not integrated. According to these models, audiovisual trials are processed faster than either auditory- or visual-only trials for statistical reasons, simply because more modalities are present in the signal. C(t) has also recently proven advantageous because it constitutes a more sensitive measure of neurocognitive function than accuracy or mean RT (Altieri & Wenger, 2013; Wenger et al, 2010). In short, capacity provides a non-parametric index of a listener's ability to combine auditory-visual cues in the processing-time domain. Recent research has provided the theoretical basis and algorithms for a modified capacity assessment function, known as C_I(t) (Altieri et al, 2013; Townsend & Altieri, 2012), that takes both speed and accuracy into account, although it has not yet been applied to a large sample of listeners. This modified capacity measure was designed to provide a more complete summary of a listener's integration skills, whereas the older measure, C(t), captures multisensory benefit in terms of processing speed only. However, computing both functions can be advantageous.
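As a concrete illustration of the two accuracy benchmarks just described, the short sketch below (Python; the proportion-correct values are invented for illustration and are not data from this study) computes the difference-score gain and the independence prediction for a single listener.

```python
# Minimal sketch of the two accuracy benchmarks described above.
# Values are illustrative only, not data from this study.

def av_gain(p_av, p_a, p_v):
    """Difference-score gain: p(AV) - max{p(A), p(V)}."""
    return p_av - max(p_a, p_v)

def independence_prediction(p_a, p_v):
    """Probability correct if auditory and visual cues are processed
    independently: p(A) + p(V) - p(A)*p(V)."""
    return p_a + p_v - p_a * p_v

p_av, p_a, p_v = 0.95, 0.76, 0.14   # hypothetical proportion-correct scores

print("AV gain:", round(av_gain(p_av, p_a, p_v), 3))                         # 0.19
print("Independence prediction:", round(independence_prediction(p_a, p_v), 3))  # 0.794
print("Obtained AV minus prediction:", round(p_av - independence_prediction(p_a, p_v), 3))
```

The capacity measures discussed above apply the same independence logic to processing times rather than to accuracy alone.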
To illustrate the utility of using both C_I(t) and C(t), imagine a normal-hearing listener with ceiling-level audiovisual and auditory-only accuracy, but with poor audiovisual benefit in terms of speed. The C_I(t) measure may still indicate "good" integration skills because it credits the high accuracy scores that the RT-only C(t) measure omits. Suppose, on the other hand, that a listener shows poorer-than-predicted audiovisual accuracy. For this participant, C_I(t) should be lower than C(t). Thus, C_I(t) should prove useful for listeners with hearing loss, poor A-only or V-only skills, in difficult listening environments, or when speed-accuracy tradeoffs occur. Associations between accuracy-based gain scores and hearing ability (e.g. Grant et al, 1998; Tye-Murray et al, 2007) can be bolstered by the capacity approach because of its theoretical utility. Crucially, the concept of measuring performance relative to the benchmark of independent processing avoids many of the disadvantages of accuracy-based gain measures. Specifically, gain scores may prove problematic because they may be contaminated by ceiling effects, and because they are often correlated with unisensory performance.

The following study investigated the relationship between low- and high-frequency hearing thresholds and audiovisual integration. The first experiment required participants to identify the content of spoken sentences in audiovisual, auditory-only, and visual-only formats (CUNY sentences, Boothroyd et al, 1985). First, we predicted a positive correlation between AV gain and low- and high-frequency hearing thresholds: listeners with poorer hearing should show higher gain due to an increased reliance on visual speech cues, and also because there is more substantial room for improvement (i.e. inverse effectiveness; Stein & Meredith, 1983). Second, we hypothesized that visual-only recognition scores would be correlated with hearing ability, based on previous research showing that hearing ability may also predict visual-only recognition (Bernstein et al, 2000; although see also: Owens & Blazek, 1985). The second experiment required participants to identify, as quickly and as accurately as possible, monosyllabic words in audiovisual, auditory-only, and visual-only formats (speeded word recognition). We predicted a positive correlation between capacity and auditory pure-tone thresholds. We also predicted a similar relationship between hearing ability and visual-only speech recognition. This latter hypothesis was motivated by findings showing that listeners with progressive hearing loss (Bergeson et al, 2003) or hearing loss occurring at a younger age (Bernstein et al, 2000) are better lip-readers than normal-hearing listeners. Finally, we predicted a positive correlation between capacity measures of integration and gain scores from the open-set sentence recognition experiment.

Methods

Participants
Forty-four participants were recruited from the Idaho State University campus and the Pocatello, ID, metropolitan area (12 males and 32 females). Data from one male participant were excluded due to computer error. All participants were native speakers of American English. The mean participant age was 32 years (SD = 13.70) at the time of testing. Audiometric pure-tone thresholds were obtained for each listener immediately prior to the experiment in a sound-attenuated room to establish mean high- and low-frequency hearing thresholds. All participants reported normal or corrected vision at the time of testing. The same group of volunteers participated in both Experiments 1 and 2. This study was approved by the Idaho State University Human Subjects Committee, and participants were paid 10 dollars per hour.

Audiometric testing
Pure-tone thresholds were obtained using an Ambco 1000 Audiometer in a sound-attenuated chamber. Thresholds were obtained for 250, 500, 1000, 2000, 4000, and 8000 Hz tones separately in each ear using headphones. For each frequency, thresholds were obtained via the presentation of a continuous tone for approximately 1–2 seconds. The following staircase procedure, based on the modified Hughson-Westlake method, was used: when the listener correctly identified the tone by button press, the sound was reduced by 10 dB HL; if they failed to indicate the presence of the tone, the sound was raised by 5 dB HL on the subsequent presentation. For the purposes of this study, the low-frequency range consisted of the 250, 500, and 1000 Hz tones, while the high-frequency range consisted of the 2000, 4000, and 8000 Hz tones (Erber, 2003). The ceiling for audiometric thresholds was set to 0 dB HL.
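As a rough illustration of the down-10/up-5 rule just described, the following Python sketch simulates the staircase for a hypothetical listener. The listener model and the simplified threshold criterion (the lowest level that elicited a response) are assumptions for demonstration and do not reproduce the full clinical protocol.

```python
def simulate_listener(level_db, true_threshold_db):
    # Hypothetical deterministic listener: responds whenever the tone is
    # presented at or above their true threshold (no lapses or guesses).
    return level_db >= true_threshold_db

def down10_up5_staircase(true_threshold_db, start_db=40, floor_db=0, n_trials=30):
    """Simplified sketch of the down-10/up-5 staircase described above.

    The returned estimate is the lowest level that ever elicited a response,
    a simplification of the full Hughson-Westlake criterion (which requires
    responses on a proportion of ascending presentations).
    """
    level = start_db
    hit_levels = []
    for _ in range(n_trials):
        if simulate_listener(level, true_threshold_db):
            hit_levels.append(level)
            level = max(level - 10, floor_db)  # response: reduce by 10 dB HL (0 dB HL floor)
        else:
            level += 5                          # no response: raise by 5 dB HL
    return min(hit_levels) if hit_levels else None

print(down10_up5_staircase(true_threshold_db=15))  # -> 15 for this listener model
```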

Experiment 1

STIMULI
The sentence stimuli used in Experiment 1 consisted of 75 sentences obtained from the CUNY sentence database (Boothroyd et al, 1985). Each of the sentences was spoken by a female talker. The stimuli included 25 audiovisual, 25 auditory-only, and 25 visual-only sentences. The stimuli were obtained from a laser video disc and rendered into a 720 × 480 pixel video, digitized at a rate of 30 frames per second. Each stimulus was displayed on a flat-screen Dell computer monitor with a refresh rate of 60 Hz. The audio files were sampled at 48 000 Hz (16 bits). The auditory track was removed from each of the sentences using Adobe Audition for the visual-only block, and the visual component was removed for the auditory-only block. The sentences were subdivided into the following word lengths: 3, 5, 7, 9, and 11 words, with five sentences of each length in each stimulus set (Altieri et al, 2011). This was done because sentence length naturally varies in conversational speech. Sentences were presented in random order for each participant, and we did not provide cues regarding sentence length or semantic content. The sentences are listed in Appendix B, to be found online at http://informahealthcare.com/doi/abs/10.3109/14992027.2014.909053. To avoid ceiling performance, the auditory signal was degraded using an 8-channel sinewave cochlear implant simulator (AngelSim: http://www.tigerspeech.com/). Simulating cochlear implants using this number of channels leads to accuracy levels of approximately 70% words correct for normal-hearing listeners in sentence recognition, and yields accuracy levels similar to those obtained with multi-talker babble background noise (Bent et al, 2009). One motivation for using a cochlear implant simulator instead of babble is that the variation in phonemic cues in babble can selectively mask cues in the auditory signal, which may render some monosyllabic words significantly more intelligible than others.

PROCEDURE
Accuracy data were obtained from each participant for the 75 sentences (25 audiovisual, 25 auditory-only, and 25 visual-only). Trials were presented in separate blocks of 25 audiovisual, 25 auditory-only, and 25 visual-only trials, played at a comfortable level over Beyer-Dynamic 100 headphones. A blank gray screen was shown on all auditory-only trials. The order of audiovisual, auditory-only, and visual-only block presentation was randomized across participants in an effort to avoid order effects.


The stimuli in both experiments were presented using E-Prime 2.0 software (http://www.pstnet.com/eprime.cfm). Participants were seated approximately 24 inches from the computer monitor. Each trial began with the presentation of a black dot on a solid gray background, which cued the participant to press the space bar to begin the trial. Stimulus presentation began with the female talker speaking one of the sentences. After the talker finished speaking the sentence on each trial, a dialog box appeared in the center of the monitor instructing the participant to type in the words they thought the talker said. Each sentence was presented to the participant only once, and feedback was not provided on any of the trials. Scoring was carried out in a manner identical to the protocol described by Altieri et al (2011). Whenever the participant correctly typed a word, that word was scored as correct, and the proportion of words correct was computed for each sentence. Word order was not a criterion for a word to be scored as accurate, and typed responses were manually corrected for misspellings. As an example, for the sentence "Is your sister in school," if the participant typed "Is the…" only the word "Is" would be scored as correct, making the total proportion correct equal to 1/5 or .20. Upon inspection of the data, participants rarely switched word order in their typed responses.
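A minimal sketch of this order-free word scoring is given below (Python; it assumes misspellings have already been corrected in the typed response, as described above).

```python
from collections import Counter

def proportion_words_correct(target_sentence, typed_response):
    """Score a typed response against the target sentence.

    A target word counts as correct if it appears anywhere in the response
    (word order is ignored); repeated words are only credited as many times
    as they were typed.
    """
    target = target_sentence.lower().split()
    response = Counter(typed_response.lower().split())
    correct = 0
    for word in target:
        if response[word] > 0:
            correct += 1
            response[word] -= 1
    return correct / len(target)

# Example from the text: only "Is" matches, so the score is 1/5 = 0.20.
print(proportion_words_correct("Is your sister in school", "Is the"))
```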


Experiment 2

STIMULI
The stimulus materials consisted of audiovisual movie clips of two female talkers. The stimuli were obtained from the Hoosier Multi-Talker Database (Sherffert et al, 1997). Two recordings of each of the following monosyllabic words were obtained from the two female talkers: Mouse, Job, Tile, Gain, Shop, Boat, Page, and Date. These stimuli were drawn from similar studies carried out by Altieri and Townsend (2011) and Altieri and Wenger (2013). We selected the eight words to obtain a stimulus set that included words that were distinct in the auditory domain, others in the visual domain, and some in both modalities. The auditory, visual, and audiovisual movies were edited using Adobe After Effects C4. Each of the auditory files was sampled at a rate of 48 000 Hz (16 bits). Each movie was digitized and rendered into a 720 × 480 pixel clip at a rate of 30 frames per second. The duration of the auditory, visual, and audiovisual files ranged from 800 to 1000 ms. As in Experiment 1, the auditory signal was degraded using the eight-channel sinewave CI simulator.

PROCEDURE
The audiovisual, auditory-only, and visual-only trials were presented randomly in one block. There were a total of 128 audiovisual trials (8 words × 2 talkers × 2 recordings × 4 repetitions), 128 auditory-only trials, and 128 visual-only trials, for a total of 384 experimental trials. A blank gray screen was present on auditory-only trials. This experiment required 25–30 minutes to complete. Trials began with a white dot on a gray background appearing in the center of the monitor. Auditory stimuli were played at a comfortable level (~70 dB SPL) over Beyer Dynamic-100 headphones. Responses were collected via button press using the computer keyboard. Each of the buttons, 1–8, was arranged linearly on the keyboard and labeled with a word from the stimulus set. The position of the labels was identical for all participants. Participants were instructed to press the button corresponding to the word that they judged the talker to have said "as quickly and accurately as possible". Responses were timed from the onset of the stimulus on each trial. Inter-trial intervals varied randomly on a uniform distribution between 750 and 1000 ms. On auditory-only trials, participants were required to base their response only on auditory information, and on visual-only trials, participants were required to lip-read. Auditory-only trials were played with a blank computer screen; likewise, visual-only trials were played without any sound from the headphones. Participants received 48 practice trials at the onset of the experiment to assist with learning the response mappings. Feedback was provided ("Correct" vs. "Incorrect") during practice, but not during the experimental trials.

Results

Sentence recognition
The purpose of Experiment 1 was to investigate the extent to which traditional measures of audiovisual gain using accuracy were correlated with hearing ability. The measure of gain was: AV Gain = p(AV) − max{p(A), p(V)} (Altieri & Wenger, 2013). The results from this portion of the study indicate a systematic relationship between audiovisual gain in CUNY sentence recognition and both high- and low-frequency hearing thresholds.¹ The ears used for low- and high-frequency pure-tone averages were selected independently. Low-frequency pure-tone averages were computed for each listener by averaging across the obtained thresholds for 250, 500, and 1000 Hz in their best ear. An analogous procedure was used to obtain the average high-frequency thresholds; we averaged across the obtained thresholds for 2000, 4000, and 8000 Hz in the ear with the lowest threshold. Table 1, A, shows the average auditory, visual, and audiovisual accuracy levels and standard deviations across the participants.
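The best-ear averaging just described can be sketched as follows (Python; the threshold values are hypothetical).

```python
import numpy as np

# Hypothetical thresholds (dB HL) for one listener: rows = ears (right, left),
# columns = 250, 500, 1000, 2000, 4000, 8000 Hz.
thresholds = np.array([[10, 10, 15, 20, 30, 35],
                       [ 5, 10, 10, 25, 35, 40]])

low_per_ear  = thresholds[:, :3].mean(axis=1)   # 250-1000 Hz average for each ear
high_per_ear = thresholds[:, 3:].mean(axis=1)   # 2000-8000 Hz average for each ear

# The best (lowest-threshold) ear is selected independently for each range.
low_pta, high_pta = low_per_ear.min(), high_per_ear.min()
print(round(low_pta, 1), round(high_pta, 1))    # 8.3 28.3
```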

Table 1. A: CUNY sentence recognition scores (mean percent correct and standard deviation, SD) and average thresholds (dB HL); and B: Pearson correlations between hearing thresholds and sentence recognition (df = 42).

A: CUNY sentence recognition scores (mean percent correct and SD) and average thresholds (dB HL).

        AV       A        V        Gain     Low Th. (dB HL)   High Th. (dB HL)
Mean    94.70    76.30    14.30    18.40    10.00             10.00
SD       3.20     9.30     7.40     8.50     7.50             12.00

B: Pearson correlations between hearing thresholds and sentence recognition (df = 42).

     Gain & LF     Gain & HF     V-only &     A-only &       A-only &
     Threshold     Threshold     A-only       LF Threshold   HF Threshold
r    .32           .39           .35          −.38           −.71
p    < .05         < .01         .02          .01            < .0001

Th = threshold; LF = low frequency; HF = high frequency.


The average gain scores and low- and high-frequency thresholds are also displayed. The mean audiovisual recognition accuracy scores approximated ceiling (~95%). Ceiling effects in accuracy are expected for normal-hearing listeners, although this will not necessarily be the case for hearing-impaired listeners in clinical settings. A substantially greater level of variability was observed for the auditory- and visual-only recognition accuracy levels and, consequently, for the audiovisual gain scores. Variability was also observed in the average low- and high-frequency thresholds. Preliminary Pearson correlations, using an arcsine transformation on proportion-correct scores, were carried out to examine the relationships between sentence recognition, gain, and hearing thresholds. Negative correlations were observed between the recognition accuracy of auditory sentences and mean low-frequency pure-tone threshold (r(42) = −.38, p = .01), and also between sentence recognition accuracy and average high-frequency pure-tone threshold (r(42) = −.71, p < .0001). Gain showed a trend toward a negative relationship with visual-only recognition (r(42) = −.29, p = .08), and a significant negative relationship with auditory recognition (r(42) = −.93, p < .0001). However, a relationship was not observed between gain and audiovisual recognition, which was generally at ceiling (r(42) = .05, p = .76). Visual-only speech recognition showed a positive correlation with auditory-only accuracy (r(42) = .35, p = .02), as predicted by Watson et al (1996). Correlations between CUNY sentence visual-only accuracy and hearing thresholds were non-significant for both high-frequency (r(42) = .17, p = .29) and low-frequency (r(42) = .12, p = .43) thresholds. Next, as expected, a significant positive relationship was observed between audiovisual gain and pure-tone hearing thresholds. Correlations were observed between audiovisual gain and low-frequency (r(42) = .32, p < .05) as well as high-frequency hearing thresholds (r(42) = .39, p < .01).² These results indicate that poorer hearing ability, as measured by traditional audiometric tests, is predictive of greater audiovisual gain. Figure 1 displays scatter plots of audiovisual gain scores and average low- and high-frequency thresholds; the left panel shows the correlation involving low-frequency thresholds, and the right panel the high-frequency thresholds. Table 1, B, summarizes the main correlations discussed above.
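A sketch of this analysis step is shown below (Python with NumPy and SciPy; the arrays are hypothetical per-listener values, not data from this study): proportion-correct gain scores are arcsine-square-root transformed before the Pearson correlation with thresholds is computed.

```python
import numpy as np
from scipy import stats

# Hypothetical per-listener data (not values from this study).
gain = np.array([0.10, 0.22, 0.15, 0.30, 0.08])            # AV gain, proportion units
low_freq_thresh = np.array([5.0, 15.0, 10.0, 20.0, 0.0])   # dB HL, best ear

# Arcsine (angular) transformation commonly applied to proportions.
gain_t = np.arcsin(np.sqrt(gain))

r, p = stats.pearsonr(gain_t, low_freq_thresh)
print(f"r = {r:.2f}, p = {p:.3f}")
```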

SPEEDED-WORD RECOGNITION
Table 2, A, shows mean accuracy, mean RTs, and maximum capacity values from the empirical C_I(t) and C(t) functions in Experiment 2. The accuracy results show high audiovisual percent correct in closed-set word recognition across all participants at 98.30% (SD = 1.83%). Similarly, auditory-only accuracy levels were high for most participants, with mean accuracy at 97.00% (SD = 6.97%). Visual-only accuracy scores were substantially lower and more variable (mean = 71.60%, SD = 11.03%). Together, the differences in accuracy between the audiovisual and auditory-only conditions were consistently small across participants, meaning that the accuracy gain scores were close to 0 percentage points, which was not surprising since this sample primarily consisted of normal-hearing listeners. The mean RTs mirrored the accuracy scores: mean audiovisual and auditory-only RTs were similar to each other, and both were noticeably faster than visual-only RTs. Next, capacity levels, including the RT-only measure (C(t)) and the combined RT-accuracy measure (C_I(t)), showed maximum values slightly greater than 1. Maximum, or peak, capacity values were recorded since they provide a concise summary of integration potential, and also because maximum values reduce a continuous measure to one point that can be utilized in statistical analyses (Altieri & Townsend, 2011; Altieri & Wenger, 2013). A maximum capacity value of "1", for example, indicates that one is capable of achieving unlimited capacity under the given listening conditions. Unlimited capacity denotes the purely statistical benefit expected from having two signals present, as opposed to just one (Townsend & Nozawa, 1995). Figure 2 shows C(t) and C_I(t) averaged across the participants. The left panel shows the RT-only C(t) (solid line) along with the standard error estimates (dotted lines); likewise, the right panel shows the combined RT-accuracy capacity assessment measure. The mean maximum capacity values shown in Table 2, A, were 1.27 (SD = .54) and 1.39 (SD = .73) for C_I(t) and C(t) respectively. These values demonstrate that listeners were capable of integrating audiovisual speech information efficiently, at least at some time points, since capacity was greater than 1. However, the standard deviations suggest that some participants were inefficient integrators with capacity lower than 1.
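The capacity functions themselves are defined in the online Appendix A. As a rough illustration of the RT-only measure, the C(t) of Townsend and Nozawa (1995) can be estimated from correct-trial RTs as the ratio of the audiovisual cumulative hazard to the sum of the auditory-only and visual-only cumulative hazards, where H(t) = −log S(t) and S(t) is the survivor function. The Python sketch below uses a simple empirical survivor estimate and hypothetical RTs; it is not the authors' exact algorithm, and the accuracy-adjusted C_I(t) of Townsend and Altieri (2012) requires additional assessment functions not sketched here.

```python
import numpy as np

def cumulative_hazard(rts_ms, t):
    """Empirical cumulative hazard H(t) = -log S(t) at each time point in t,
    where S(t) is the proportion of RTs longer than t."""
    rts = np.asarray(rts_ms, dtype=float)
    survivor = np.array([(rts > ti).mean() for ti in t])
    survivor = np.clip(survivor, 1e-6, 1.0)   # avoid log(0) in the right tail
    return -np.log(survivor)

def capacity_rt_only(rt_av, rt_a, rt_v, t):
    """RT-only capacity coefficient C(t) = H_AV(t) / (H_A(t) + H_V(t))."""
    denom = cumulative_hazard(rt_a, t) + cumulative_hazard(rt_v, t)
    return cumulative_hazard(rt_av, t) / np.clip(denom, 1e-6, None)

# Hypothetical correct-trial RTs (ms) for one listener.
rng = np.random.default_rng(1)
rt_a  = rng.normal(1850, 250, 128)
rt_v  = rng.normal(2450, 400, 128)
rt_av = rng.normal(1700, 250, 128)

t = np.arange(1200, 2800, 100)
print(np.round(capacity_rt_only(rt_av, rt_a, rt_v, t), 2))
```

Values above 1 at a given t indicate faster-than-race-model (efficient) audiovisual processing; values below 1 indicate limited capacity.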

Figure 1. Scatter plots illustrating the correlation between audiovisual gain scores in Experiment 1, and auditory low and high-frequency pure-tone thresholds (dB HL) obtained from audiometric testing. The best fitting line is displayed.


Table 2. A: Speeded word recognition scores (percentage correct, averaged maximum C_I(t) and C(t) values, and mean RTs; SDs are also included); and B: Spearman correlations between hearing thresholds and capacity measures (df = 42).

A: Speeded word recognition scores.

        AV       A        V        C_I(t)   C(t)    AV_RT (ms)   A_RT (ms)   V_RT (ms)
Mean    98.30    97.00    71.60    1.27     1.39    1847         1834        2448
SD       1.83     6.97    11.03    0.54     0.73     309          285         525

B: Spearman correlations between hearing thresholds and capacity measures (df = 42).

       C_I(t) & LF     C_I(t) & HF     C(t) & LF     C(t) & HF
       threshold       threshold       threshold     threshold
rho    .42             .19             .47           .16
p      < .01           .24             < .01         .33

LF = low frequency; HF = high frequency.

Capacity provided valuable qualitative information regarding integration skills. The results show that for faster RTs, capacity scores were at 1 or slightly higher. This trend was observed across participants regardless of age. For slower responses, C_I(t) and C(t) were lower compared to independent race-model predictions. However, the average never dropped below the common heuristic bound of ½ for "low capacity" or poor integration, which may arise when the visual signal inhibits auditory processing (Altieri & Townsend, 2011; Townsend & Nozawa, 1995). Both the RT-only and combined RT-accuracy capacity measures provided evidence for unlimited capacity for faster RTs. While capacity was reduced for slower responses, the fact that it remained above the lower bound suggests that participants were typically capable of benefiting from complementary visual speech cues under these listening conditions. To address the hypotheses regarding the relationship between capacity values and hearing thresholds, we carried out Spearman rank correlations between the capacity measures and the low- and high-frequency thresholds. First, C_I(t) showed a significant positive relationship with low-frequency (rho(42) = .42, p < .01) but not high-frequency hearing (rho(42) = .19, p = .24). The RT-only measure, C(t), also showed a significant positive relationship with low-frequency thresholds (rho(42) = .47, p < .01), but a non-significant relationship with high-frequency thresholds (rho(42) = .16, p = .33). Interestingly, low-frequency hearing proved to be a stronger predictor of capacity than of the traditional accuracy scores. Figure 3 plots the correlations between the capacity measures and the low- and high-frequency thresholds. Table 2, B, summarizes the correlations between capacity and thresholds. Correlations between other measures revealed further relationships. A Pearson correlation indicated a negative relationship in Experiment 2 between visual-only word recognition accuracy and high-frequency thresholds (r(42) = −.41, p < .01), but not low-frequency thresholds (r(42) = .24, p = .13). Interestingly, these results may suggest that participants who are better at visual-only word recognition in a closed-set task may have better high-frequency hearing. The correlation at first appears counterintuitive since previous research reported that those with hearing loss may tend to be better at visual-only recognition (Bernstein et al, 2000; cf. Clouser, 1977). However, our results find support from research showing that older adults with presbycusis show poorer visual-only abilities compared to young normal-hearing listeners (Sommers et al, 2005; Spehar et al, 2004). To further examine the effects of age, we carried out a correlation between age and visual-only scores.

Figure 2. Mean capacity measures (thick solid line = capacity/integration efficiency). The left panel shows C(t), and the right panel shows C_I(t). The dotted lines show the standard errors (SEs) for the averaged measures.


Figure 3. Scatter plots showing a positive correlation between C_I(t) and C(t), and average low-frequency (left panels) and high-frequency (right panels) thresholds. The best fitting line is shown in each panel.

A negative correlation was observed, with older adults scoring lower in the visual-only condition (r(42) = −.54, p < .001). A multiple regression analysis suggested that age was the stronger predictor (estimated β = .0052). However, high-frequency threshold (estimated β = .0012) together with age explained approximately 30 percent of the variance (R² = .30, F(2, 42) = 8.45, p < .001). To help establish the viability of capacity as an assessment of speech integration in closed-set recognition, the new measure should be associated with traditional measures from audiovisual open-set sentence recognition. We hypothesized a positive correlation between capacity assessment values, C_I(t), and measures of audiovisual enhancement obtained from open-set sentence recognition (Bergeson & Pisoni, 2004; Sumby & Pollack, 1954; Sommers et al, 2005; Tye-Murray et al, 2007). To examine this, we carried out a bivariate Spearman correlation between max C_I(t) and AV gain (AV − max{A, V}), which revealed a statistically significant effect (r(42) = .34, p = .02). This demonstrates that nearly 12% of the variance in AV gain scores obtained from sentence recognition may be accounted for by C_I(t). The correlation between C(t) and AV gain showed a non-significant trend (r(42) = .27, p = .08), indicating the superiority of C_I(t) in summarizing a listener's speech integration potential. However, C(t) and C_I(t) should be examined within an individual to assist in determining whether an individual's integration performance is the result of speed, accuracy, or both. Suppose C(t) is lower than 1, but C_I(t) equals or exceeds 1. This pattern would indicate that the listener slows down on audiovisual trials relative to model predictions in order to integrate the cues effectively; here, successful integration occurs because of increased accuracy rather than speed. Tradeoffs can also occur when participants are "fast" on audiovisual trials, but accuracy becomes poor.

General Discussion and Conclusion
Together, the results from Experiments 1 and 2 add to a growing body of evidence demonstrating a correspondence between audiovisual gain scores, visual-only accuracy, and thresholds obtained from audiometric testing. A critical component of this study was the exploration of the new measure C_I(t), which was designed to assess overall integration capability using both processing speed and accuracy. We suggest using capacity measures as a supplement to accuracy-only and sentence-recognition assessments of audiovisual gain. One advantage of using open-set sentence recognition is that it is more ecologically valid, especially because sentences provide predictive contextual information as opposed to words in isolation. However, audiologists may find speeded recognition of isolated words useful for other reasons: context is eliminated, and capacity offers the benefits discussed above. Novel features of the capacity approach include the ability to: (1) utilize both accuracy and speed in a unified assessment; (2) provide a diagnosis of integration ability within a normal-hearing or hearing-impaired listener; and (3) determine whether the locus of a listener's integration potential, or lack thereof, results primarily from speed, accuracy, or both. The latter two variables can be measured relative to the non-parametric predictions of the race model inequality. In clinical settings, time may be an issue; we suggest using either sentence recognition or an abridged version of the speeded recognition task with fewer trials. This protocol, even when combined with traditional audiometry, should require less than an hour. The combined C_I(t) and C(t) used in this report should prove valuable for assessing integration in hearing-impaired listeners. Incidentally, our method of degrading the auditory signal using a cochlear implant simulator only reduces the amount of sensory information available. Of course, there are cognitive differences between normal-hearing individuals and those with hearing loss, particularly if hearing loss is associated with age. Research on aging and speech perception has found that cognitive factors predict auditory recognition (e.g. Zekveld et al, 2013), as well as visual-only and AV integration ability (Feld & Sommers, 2009; Sommers et al, 2005; Tye-Murray et al, 2007). Working memory appears to be a variable determining audiovisual language recognition ability, especially in difficult listening conditions (Altieri & Wenger, 2013; Baranyaiova-Frtusova et al, 2010; Buchan & Munhall, 2012; Zekveld et al, 2009).


Elderly or hearing-impaired listeners, for example, may increase their decision threshold on audiovisual trials in order to "slow down" and benefit from visual information. This may result in slower RTs, and hence lower C(t), relative to independent model predictions. Incidentally, pure audiometric measures may be weak predictors of language processing, in part because they fail to consider visual processing or cognitive abilities. While slowing down may afford higher accuracy as measured by C_I(t) in certain cases, slower responses on audiovisual trials could be symptomatic of stressed cognitive resources. In fact, Zekveld et al's (2009) results showed evidence for this hypothesis via the observation of poorer working memory skills in middle-aged normal-hearing and hearing-impaired listeners in comparison to young listeners. The authors also observed that working memory resources were negatively correlated with a listener's ability to use subtitles (a form of multisensory integration) to improve comprehension on an audiovisual sentence recognition task. In another study, Buchan and Munhall (2012) reported an audiovisual recognition experiment employing congruent (A /ba/ + V "ba") and incongruent (A /ba/ + V "ga") audiovisual stimuli. The experiment required participants to identify the content of utterances containing auditory and visual information; integration ability was assessed by determining the proportion of perceptual fusions. For example, if auditory /ba/ and visual "ga" were presented, a fused response could be "da". Not surprisingly, the results revealed a decrease in the proportion of fused "McGurk" responses when listeners performed a concurrent distractor task that manipulated cognitive load and ostensibly reduced the availability of visual processing resources. Certain hearing-impaired individuals possess superior visual-only speech recognition skills, working memory resources, processing speed, or ability to extract visemes and match them with auditory cues; such superior integration performance may influence hearing-aid fittings or perhaps delay hearing-aid use in some listeners. On the other hand, suppose an older listener exhibits signs of mild cognitive impairment. Such a listener may often yield performance on an audiogram similar to that of an age-matched listener with normal cognitive function. Nonetheless, data obtained using capacity in a cued memory study revealed lower accuracy scores, slower RTs, and consequently lower capacity scores in aging participants with mild cognitive impairment compared to normally aging participants (Wenger et al, 2010). This motivates the prediction that many listeners with reduced cognitive performance may exhibit lower C_I(t) than their counterparts, and thus may benefit from early intervention and a hearing aid to facilitate speech recognition and communication. In summary, the capacity measure should prove considerably useful for categorizing individual listeners according to their integration capabilities.

Acknowledgements The project described was supported by the INBRE Program, NIH Grant Nos. P20 RR016454 (National Center for Research Resources) and P20 GM103408 (National Institute of General Medical Sciences).

Notes
1. The α levels were set to the more conservative level of .025 to adjust for multiple comparisons.
2. We computed these same correlations between the hearing thresholds and the alternative gain measure comparing the difference between obtained AV accuracy and p(A) + p(V) − p(A)·p(V). The correlation between low-frequency threshold and gain was non-significant (r(42) = .20, p = .19). However, the correlation for high-frequency threshold was significant (r(42) = .52, p < .001).

Declaration of interest: The authors report no conflicts of interest.

References
Altieri N., Pisoni D.B. & Townsend J.T. 2011. Behavioral, clinical, and neurobiological constraints on theories of audiovisual speech integration: A review and suggestions for new directions. Seeing and Perceiving, 24, 513–539.
Altieri N., Pisoni D.B. & Townsend J.T. 2011. Some normative data on lip-reading skills. J Acoust Soc Am, 130, 1–4.
Altieri N. & Townsend J.T. 2011. An assessment of behavioral dynamic information processing measures in audiovisual speech perception. Frontiers in Psychology, 2, 1–15.
Altieri N., Townsend J.T. & Wenger M.J. 2013. A dynamic assessment function for measuring age-related sensory decline in audiovisual speech recognition. Behav Res Methods.
Altieri N. & Wenger M. 2013. Neural dynamics of audiovisual integration efficiency under variable listening conditions. Frontiers in Psychology, 4, 1–15.
Baranyaiova-Frtusova J., Winneke A. & Phillips N. 2010. The effect of audio-visual speech information on working memory in younger and older adults. Cognitive Aging Conference (CAC), Atlanta, USA.
Bent T., Buchwald & Pisoni D.B. 2009. Perceptual adaptation and intelligibility of multiple talkers for two types of degraded speech. J Acoust Soc Am, 126, 2660–2669.
Bergeson T.R. & Pisoni D.B. 2004. Audiovisual speech perception in deaf adults and children following cochlear implantation. In: G. Calvert, C. Spence, B.E. Stein (eds.), The Handbook of Multisensory Processes. Cambridge, USA: MIT Press, pp. 749–771.
Bergeson T.R., Pisoni D.B., Reese L. & Kirk K.I. 2003. Audiovisual speech perception in adult cochlear implant users: Effects of sudden vs. progressive hearing loss. Mid-Winter Meeting of the Association for Research in Otolaryngology, Daytona Beach, Florida.
Bernstein L.E., Demorest M.E. & Tucker P.E. 2000. Speech perception without hearing. Perception and Psychophysics, 62, 233–252.
Boothroyd A., Hanin L. & Hnath T. 1985. A Sentence Test of Speech Perception: Reliability, Set Equivalence, and Short-term Learning (internal report RCI 10). New York: City University of New York.
Braida L.D. 1991. Crossmodal integration in the identification of consonant segments. Quarterly Journal of Experimental Psychology, 43, 647–77.
Buchan J.N. & Munhall K.G. 2012. The effect of a concurrent cognitive load task and temporal offsets on the integration of auditory and visual speech information. Seeing and Perceiving, 25, 87–106.
Clouser R.A. 1977. Relative phoneme visibility and lipreading performance. Volta Review, Jan. 27–34.
Erber N.P. 1969. Interaction of audition and vision in the recognition of oral speech stimuli. J Sp Hear Res, 12, 423–425.
Erber N.P. 1975. Auditory-visual perception of speech. Journal of Speech and Hearing Disorders, 40, 481–492.
Erber N.P. 2002. Hearing, Vision, Communication, and Older People. Clifton Hill, Australia: Clavis Publishing.
Erber N.P. 2003. Use of hearing aids by older people: Influence of non-auditory factors (vision, manual dexterity). Int J Audiol, 42, 2S21–2S26.
Feld J. & Sommers M.S. 2009. Lipreading, processing speed, and working memory in younger and older adults. J Sp Lang Hear Res, 52, 1555–1565.
Grant K.W. 2002. Measures of auditory-visual integration for speech understanding: A theoretical perspective (L). J Acoust Soc Am, 112, 30–33.
Grant K.W. & Seitz P.F. 1998. Measures of auditory-visual integration in nonsense syllables and sentences. J Acoust Soc Am, 104, 2438–2450.


Grant K.W. & Walden B.E. 1996. Evaluating the articulation index for audio-visual consonant recognition. J Acoust Soc Am, 100, 2415–2424.
Grant K.W., Walden B.E. & Seitz P.F. 1998. Auditory-visual speech recognition by hearing impaired subjects: Consonant recognition, sentence recognition, and auditory-visual integration. J Acoust Soc Am, 103, 2677–2690.
Massaro D.W. 2004. From multisensory integration to talking heads and language learning. In: G.A. Calvert, C. Spence, B.E. Stein (eds.), The Handbook of Multisensory Processes. Cambridge, USA: The MIT Press, pp. 153–176.
McGurk H. & Macdonald J.W. 1976. Hearing lips and seeing voices. Nature, 264, 746–748.
Miller J. 1982. Divided attention: Evidence for coactivation with redundant signals. Cognit Psychol, 14, 247–279.
Owens E. & Blazek B. 1985. Visemes observed by hearing impaired and normal hearing adult viewers. J Sp Hear Res, 28, 381–393.
Ratcliff R., Thapar A. & McKoon G. 2004. A diffusion model analysis of the effects of aging on recognition memory. Journal of Memory and Language, 50, 408–424.
Sherffert S., Lachs L. & Hernandez L.R. 1997. The Hoosier audiovisual multi-talker database. In: Research on Spoken Language Processing Progress Report No. 21. Bloomington, USA: Speech Research Laboratory, Psychology, Indiana University.
Sommers M., Tye-Murray N. & Spehar B. 2005. Auditory-visual speech perception and auditory-visual enhancement in normal-hearing younger and older adults. Ear Hear, 26, 263–275.
Sumby W.H. & Pollack I. 1954. Visual contribution to speech intelligibility in noise. J Acoust Soc Am, 26, 12–15.
Summerfield Q. 1987. Some preliminaries to a comprehensive account of audio-visual speech perception. In: B. Dodd, R. Campbell (eds.), The Psychology of Lip-Reading. Hillsdale, USA: LEA, pp. 3–50.
Townsend J.T. & Altieri N. 2012. An accuracy-response time capacity assessment function that measures performance against standard parallel predictions. Psychol Rev, 119, 500–516.
Townsend J.T. & Nozawa G. 1995. Spatio-temporal properties of elementary perception: An investigation of parallel, serial, and coactive theories. Journal of Mathematical Psychology, 39, 321–360.
Townsend J.T. & Wenger M.J. 2004. A theory of interactive parallel processing: New capacity measures and predictions for a response time inequality series. Psychol Rev, 111, 1003–1035.
Tye-Murray N., Sommers M. & Spehar B. 2007. Audiovisual integration and lip-reading abilities of older adults with normal and impaired hearing. Ear Hear, 28, 656–668.
Watson C.S., Qiu W., Chamberlain M.M. & Li X. 1996. Auditory and visual speech perception: Sources of individual differences in speech recognition. J Acoust Soc Am, 100, 1153–1162.
Wenger M.J., Negash S., Petersen R.C. & Petersen L. 2010. Modeling and estimating recall processing capacity: Sensitivity and diagnostic utility in application to mild cognitive impairment. Journal of Mathematical Psychology, 54, 73–89.
Winneke A.H. & Phillips N.A. 2011. Does audiovisual speech offer a fountain of youth for old ears? An event-related brain potential study of age differences in audiovisual speech perception. Psychology and Aging, 26, 427–438.
Zekveld A.A., George E.L.J., Houtgast T. & Kramer S.E. 2013. Cognitive abilities relate to self-reported hearing disability. J Sp Lang Hear Res, 56, 1364–1372.
Zekveld A.A., Kramer S.E., Kessens J.M., Vlaming M.S. & Houtgast T. 2009. The influence of age, hearing, and working memory on the speech comprehension benefit derived from an automatic speech recognition system. Ear Hear, 30, 262–272.

Supplementary material available online: Supplementary Appendix A & B.
