Brain & Language 137 (2014) 86–90


Short Communication

How visual timing and form information affect speech and non-speech processing

Jeesun Kim, Chris Davis*
The MARCS Institute, University of Western Sydney, Australia
* Corresponding author. E-mail addresses: [email protected] (J. Kim), [email protected] (C. Davis).

Article history: Accepted 17 July 2014; Available online 3 September 2014
Keywords: Visual speech; Auditory and visual speech processing; Visual form and timing information

Abstract

Auditory speech processing is facilitated when the talker's face/head movements are seen. This effect is typically explained in terms of visual speech providing form and/or timing information. We determined the effect of both types of information on a speech/non-speech task (non-speech stimuli were spectrally rotated speech). All stimuli were presented paired with the talker's static or moving face. Two types of moving face stimuli were used: full-face versions (both spoken form and timing information available) and modified face versions (only timing information provided by peri-oral motion available). The results showed that the peri-oral timing information facilitated response time for speech and non-speech stimuli compared to a static face. An additional facilitatory effect was found for full-face versions compared to the timing condition; this effect only occurred for speech stimuli. We propose the timing effect was due to cross-modal phase resetting; the form effect to cross-modal priming.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

It is well established that seeing the talker's moving face (visual speech) influences the process of speech perception; e.g., speech is perceived more accurately in quiet (Davis & Kim, 2004) and in noise (Sumby & Pollack, 1954). Such visual influence has been attributed to the information available from the talker's oral regions, e.g., from mouth shapes, mouth and lip motion and some tongue positions (Summerfield, 1979), and from peri-oral regions such as the jaw, eyebrows and head (Davis & Kim, 2006; Munhall, Jones, Callan, Kuratate, & Vatikiotis-Bateson, 2004). The current study focused on the effect that perceiving speech-related movements has on speech processing, and was motivated by the observation that such motion provides two broad types of information: speech form (segment) information and timing information (Summerfield, 1987). That is, mouth and lip movements define shapes and spaces that can combine with tongue positions to provide form information about the identity of spoken segments. In addition, such motion provides timing information about segment onset, offset and duration (Summerfield, 1979) and information about syllabic rhythmic structure from the cycle of jaw opening and closure (Greenberg, Carvey, Hitchcock, & Chang, 2003; MacNeilage, 1998). We examined the extent to which these two sources of speech information influence speech processing.

Understanding the influence of visual form and timing information on auditory-visual (AV) speech processing is important not only for an appreciation of each component effect, but also because explanations of AV effects have tended to emphasize the importance of one type of information or the other. Some neurophysiological accounts see the form of visual speech as being of key importance. Take, for example, the explanation that Jääskeläinen, Kauramäki, Tujunen, and Sams (2008) advanced for why visual speech reduced the size of the auditory N1 evoked potential to vowels. Here it was argued that seeing lip shapes from particular articulations altered the sensitivity of auditory cortical neurons responsive to frequencies in the region of the second formant. Other accounts have stressed the role that timing plays. For example, Arnal, Morillion, Kell, and Giraud (2009) proposed that earlier auditory evoked responses (M100) to syllables preceded by predictable visual speech were due to rhythmic information resetting the phase of oscillation of auditory cortical neurons and so increasing their receptivity (see also Lakatos, Chen, O'Connell, Mills, & Schroeder, 2007).

Explanations of AV speech effects in behavioral studies also show a split between those that emphasize the importance of visual form and those that emphasize the importance of visual timing information. For example, explanations of the McGurk effect are typically couched in terms of the form of visual speech (McGurk & MacDonald, 1976). Likewise, it has been proposed that visual form information can reinforce or disambiguate phonemic content, especially in difficult listening environments (Hazan, Kim, & Chen, 2010), and can provide information about the spectral composition of speech (Grant & Seitz, 2000; Kim & Davis, 2004). On the other hand, it has been proposed that visual timing information can also influence auditory speech processing. For example, a number of authors have proposed that visual speech provides cues as to when to listen (Grant & Seitz, 2000; Kim & Davis, 2004; Schwartz, Berthommier, & Savariaux, 2004).

Studies in which both form and timing information have been manipulated appear to indicate that speech form is the more important cue. For example, Paris, Kim, and Davis (2013) independently manipulated the form and timing information available from visual speech and determined how this affected the time to process a subsequently presented speech sound (to decide whether a /ba/ or /da/ was presented). Visual speech form information (showing articulation of the full face up to the point of vocalization) was presented at a random interval between 250 and 400 ms before the auditory stimulus (i.e., no reliable timing information from the visual stimulus). Visual speech timing information (showing articulation of the talker's jaw up to the point of vocalization) contained no form information about the spoken syllable. Compared to an auditory-alone control, the form information significantly facilitated response times whereas the timing information did not. This result is consistent with the finding that the McGurk effect is relatively tolerant of large asynchronies between the AV speech signals, particularly when the auditory speech lags (e.g., Munhall, Gribble, Sacco, & Ward, 1996).

Of course, form effects (e.g., the McGurk effect) are ultimately constrained by when the visual and auditory speech signals occur. Studies have shown that when presented outside a temporal window, information from visual and auditory speech does not combine to influence perception (e.g., Munhall et al., 1996). Moreover, as mentioned above, there are studies that make it clear that the timing of visual speech information is crucial to its influence on auditory speech processing. For example, Kim and Davis (2004) reported that in a speech-detection-in-noise task the boost in accuracy due to presenting visual speech was eliminated by misaligning the AV signals by even a relatively small margin (40 ms).

One way of interpreting the divergent results regarding the relative importance of visual speech form and timing information is to assume that the nature of the task used to measure speech processing modulates the degree of precision required from these different information types. Identification and detection tasks differ in the properties of the stimuli they employ and in the level of processing used to drive responses. Identification tasks (identifying words or speech segments) use largely intact speech signals and responses are determined at a relatively late stage of processing. These tasks appear to be more sensitive to visual form information. Detection tasks typically present a severely degraded speech signal and require participants to indicate when (or whether) a stimulus has occurred. It has been suggested that this type of task taps early stages of stimulus processing (Grant & Seitz, 2000). Detection tasks tend to show that the synchrony (timing) of visual and auditory speech is important. Given the potential importance of task, the current experiment used a task that combined properties of detection (detecting some relatively basic stimulus properties) and identification (recognizing a class of object) to jointly examine the contribution of visual form and timing information to speech processing.
In relation to a visual timing effect, our interest was in whether seeing peri-oral speech-related motion would facilitate the processing of speech, non-speech, or both. Previous studies have found that visual speech can facilitate target speech detection (Kim & Davis, 2003, 2004) or that the simultaneous presentation of a visual cue (a light) can improve the detectability of a target sound (Lovelace, Stein, & Wallace, 2003) when the target is heavily masked or presented at threshold. However, it is not clear that such a timing effect would occur in the current setup, where the target signal was relatively clear (not heavily masked) and the task did not involve auditory detection (see below). Few experiments have examined whether visual timing cues can assist auditory identification, and the results have been mixed. For instance, Schwartz et al. (2004) showed that the presentation of a visual stimulus that provided speech timing but no speech form information (a rectangle that increased and decreased in height according to measures of mouth articulation) did not improve speech intelligibility. On the other hand, Best, Ozmeral, and Shinn-Cunningham (2007) showed that a cue (switching on LEDs) that indicated the time at which a target would occur in a complex acoustic mixture improved identification accuracy, and that this occurred for both speech and non-speech signals. Best et al. suggested their results may have been due to phasic alerting, where the visual stimulus facilitated auditory processing by directing attention to the appropriate point in time.

The aim with respect to visual speech form was to determine whether it primes the processing of the corresponding auditory speech. This was examined by contrasting stimuli in which the auditory and visual speech matched with those in which they did not. This manipulation was based on demonstrations that there is a functional correspondence between lip and mouth movements and particular speech spectral properties (Berthommier, 2004; Girin, Schwartz, & Feng, 2001) and that seeing visual speech significantly up-regulates the activity of auditory cortex compared to auditory speech alone (Okada, Venezia, Matchin, Saberi, & Hickok, 2013). Combining these two observations leads to the prediction that visual speech form will facilitate decisions based on the processing of its auditory counterpart.

In what follows we outline the factors that were considered in designing the task and the method for presenting timing and form information. First, the task required that responses be based on detecting speech (and so was potentially sensitive to the timing of visual speech at an early stage of speech processing). In this regard, a task was selected that required a simple binary speeded response based on whether the presented stimulus sounded like speech or not. Second, the speech and non-speech stimuli differed in their spectral distribution but not in their temporal structure; i.e., the non-speech stimuli were created from the speech ones by spectral rotation (see Method). Thus, discriminating the speech from the non-speech stimuli required that participants identify the spectral signature of speech (it is this signature that should correspond to the information presented in the visual speech form). Third, the speech stimuli consisted of nonwords in order to minimize the influence of lexical (i.e., higher-level) processing. In addition, all stimuli were presented in a moderate level of noise (5 dB SNR) so that there was uncertainty about when the signal started but the signal itself was relatively intact. The video began with the noise playing (before the target sound was presented) so that participants had time to prepare their response (rather than potentially reacting to the sudden onset of the sound).

In the experiment, visual speech timing information was presented by showing the talker's peri-oral movements with the mouth area obscured by an overlaid opaque circular patch (see Fig. 2 below).

Fig. 1. The long-term average spectrum (LTAS) of the nonword speech and non-speech stimuli. The curves represent the mean LTAS for the 45 stimuli in each condition; the shaded grey ribbons indicate the range.
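For readers who want to reproduce a plot like Fig. 1, the LTAS can be approximated by averaging Welch power spectral densities over the stimulus files. The following Python sketch rests on assumptions (16-bit mono WAV input, 1024-sample analysis segments, a hypothetical file list); the paper does not report how the LTAS was computed.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import welch

def ltas(paths, nperseg=1024):
    """Average the power spectral density over a set of stimulus files.
    Assumes all files share the same sample rate and are longer than nperseg."""
    psds = []
    for path in paths:
        fs, x = wavfile.read(path)
        x = x.astype(np.float64) / 32768.0        # 16-bit PCM to float in [-1, 1]
        freqs, psd = welch(x, fs=fs, nperseg=nperseg)
        psds.append(psd)
    return freqs, 10.0 * np.log10(np.mean(psds, axis=0))   # mean PSD in dB
```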

Fig. 2. A depiction of how the talker's face movements were presented in the experiment. The rows show the different stimulus conditions (Form & Timing information; Timing information; Baseline); the columns represent the Speech and Non-speech (spectrally inverted) versions. In the Baseline, half the stimuli in the Speech and Non-speech conditions were shown with the mouth obscured.

Although the mouth area itself was obscured, face movements around the mouth (including jaw movements) were visible, providing information about the temporal structure of the speech movements (detailed form information, as delivered by mouth, lip, tongue and teeth movements, was excluded). The effect of form information was examined using another visual speech condition that included both timing and form information, i.e., an unmasked full view of the talker's articulatory movements. To estimate the effect of form information alone, the full-face condition was compared to the mouth-masked one.

2. Method

2.1. Participants

Thirty-one undergraduate students of the University of Western Sydney participated in the experiment for course credit. All participants were native speakers of English. None reported any hearing loss and all had normal or corrected-to-normal vision.

2.2. Materials

Target items consisted of 45 speech (nonword) and 45 non-speech stimuli (the latter were constructed from the speech stimuli; see below). Nonwords were selected from the ARC Nonword Database (Rastle, Harrington, & Coltheart, 2002). Auditory and visual stimuli for the speech items were obtained by video recording a male native Australian English talker using a Sony TRV 900E digital camera at 25 fps with 48,000 Hz audio. The talker was positioned 1.5 m from the camera and recorded against a blank background. Illumination and the talker's distance from the camera were held constant across items. On average, the onset of any oral or peri-oral movement in the video began 7–8 frames (SD = 3–4) before the auditory onset, i.e., 175–200 ms (SD = 75–100 ms).

Auditory stimuli: The auditory portion of each video was processed separately so that across-token amplitudes could be normalized and a white noise masker added at a specified SNR. The average duration of the auditory stimuli was 651 ms (SD = 140 ms) and the mean amplitude of each was normalized to an average of 56 dB SPL (SD = 4.3 dB). These stimuli occurred after an initial silent period and were followed by another (see below). White noise was added to the entire auditory track to achieve a level at which the target signals were relatively intact (an SNR of 5 dB with respect to the stimulus). To avoid attention being drawn to the sudden onset of speech, and thus to maximize any effect of attention to an auditory event drawn by visual timing information, noise was added to the silent periods as well, i.e., before and after the speech signal.
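As an illustration of the level-normalization and noise-masking step just described, here is a minimal Python sketch. It is a sketch only, not the authors' processing chain: the file name, the RMS-based normalization target and the Gaussian noise generator are assumptions, since the paper specifies only the resulting levels (56 dB SPL; 5 dB SNR).

```python
import numpy as np
from scipy.io import wavfile

def normalize_rms(signal, target_rms=0.05):
    """Scale the signal to a common RMS level so tokens have equal average amplitude."""
    rms = np.sqrt(np.mean(signal ** 2))
    return signal * (target_rms / rms)

def add_noise_at_snr(signal, snr_db, rng=None):
    """Add white Gaussian noise so that signal power / noise power equals the target SNR (dB)."""
    rng = np.random.default_rng(0) if rng is None else rng
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Hypothetical usage: load a token, normalize it, then mask it at the SNR reported in the text.
# In the experiment the noise covered the whole track (including the silent lead-in and tail),
# so the masker would be applied after the token is placed in its full-length timeline.
fs, token = wavfile.read("nonword_token.wav")        # assumed mono 16-bit file
token = token.astype(np.float64) / 32768.0           # to float in [-1, 1]
masked = add_noise_at_snr(normalize_rms(token), snr_db=5.0)
```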

The non-speech stimuli were constructed from the nonword speech tokens (before the white noise was added) by rotating the spectral energy of each around the midpoint of the sample's frequency range (0–5 kHz; see Fig. 1). This transformation maintains many of the characteristics of speech (e.g., similar spectral and temporal complexity) and preserves some phonetic features; e.g., voiced and voiceless sounds can be distinguished and voiceless fricatives can be identified (see Blesser, 1972). (An illustrative sketch of this transformation is given at the end of this subsection.)

Visual stimuli: Only the lower region of the face (from the bottom of the eyes down) was presented, as in previous studies (Davis & Kim, 2001; Kim, Davis, & Krins, 2004). The video files subtended a height of 12.1° of visual arc and a width of 15.7°. The files were 74 frames in length and were played on a black background at a screen resolution of 640 × 480, in 32-bit grayscale, at 25 frames/s. The videos began with silence and the mean onset of the auditory signal was 1764 ms (SD = 59 ms). Three types of visual condition were constructed for the Speech and Non-speech stimuli (see Fig. 2): Form & Timing information (the full video, i.e., face and mouth visible); Timing information (the moving face with the oral region obscured by superimposing a gray circle, radius 2° of visual arc, to cover the mouth movements); and Baseline (a static picture that showed either the full face or the face with the mouth region obscured).

The Timing information stimuli were constructed with the mouth area obscured in order to present timing information with minimal visual speech form information. To provide a rough estimate of the information available from the obscured-mouth videos, a pilot study was conducted in which five people were shown video-only presentations of the speech stimuli and asked to identify what was said. To make this task as easy as possible (since earlier pilot testing made it clear that open-set identification of the visual stimuli was not possible), only the broad class of the initial viseme had to be identified, and a response set of five viseme classes was provided from which to choose. Performance was slightly better than chance (p = 0.04 in a one-sample t-test) but was highly variable (error rates ranged from 57% to 78%), and it is clear from these error rates that very little speech form information was available.

In addition, 30 items (15 Speech, 15 Non-speech) were prepared for which the video briefly displayed the written message "Do not respond on this trial". These items were used as catch trials to ensure that participants watched the visual stimuli throughout the experiment. Three blocks of items (each consisting of 100 items, i.e., 45 Speech and 45 Non-speech experimental items plus 10 catch trials) were constructed so that each target appeared with each visual (experimental) condition without being repeated within any block. Each participant completed all three blocks. The presentation order of the blocks, and of the items within each block, was randomized.
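The spectral rotation described at the start of this subsection can be sketched in the frequency domain as follows. This is an illustrative implementation under assumed parameters (FFT-based mirroring of the 0–5 kHz band); the paper does not report the exact method used, and modulation-based implementations in the tradition of Blesser (1972) differ in detail.

```python
import numpy as np

def spectrally_rotate(signal, fs, band_hz=5000.0):
    """Mirror the spectrum of `signal` about band_hz / 2: a component at f Hz is
    mapped to (band_hz - f) Hz; energy above band_hz is discarded."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    in_band = freqs <= band_hz
    rotated = np.zeros_like(spectrum)
    rotated[in_band] = spectrum[in_band][::-1]      # reverse the 0..5 kHz bins
    return np.fft.irfft(rotated, n=len(signal))     # back to a real waveform
```

Reversing the complex bins mirrors the magnitude spectrum about 2.5 kHz; the phase spectrum is mirrored as well, which is adequate for this illustration but is one of the details on which implementations can differ.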

2.3. Procedure

Participants were tested individually, and stimulus presentation and response collection were controlled by the DMDX software (Forster & Forster, 2003). Response times were measured from the onset of the auditory Speech or Non-speech stimulus. Participants were instructed that on each trial they would see a talker's static or moving face and hear speech (nonwords) or non-speech, intermixed throughout the experiment. They were given speed and accuracy instructions and told to press one button (with their right hand) if they detected a speech sound or another button (with their left hand) if they detected a non-speech sound. Participants were told to watch the talker's face, but it was made clear that the task was to respond to the heard stimulus. Participants were informed about the filler (catch) items, to which they should not respond. For practice, 36 extra auditory-visual pairs (Speech and Non-speech items in the moving and baseline conditions, plus 8 catch trials) were presented before the experimental items. Each session lasted approximately 30 min.

3. Results

The results of the catch trials suggested that participants paid attention to the visual presentation: the vast majority of participants correctly made no response to the catch trials; two participants incorrectly responded to one trial and one participant to two trials. The data reported here are from all 31 participants.

Table 1 presents the mean response times for each of the experimental conditions. The response time data were examined using two analyses of variance (ANOVAs), one over the participant data (collapsed over items) and one over the item data (collapsed over participants). Overall, the Non-speech stimuli were responded to 46 ms faster than the Speech ones, F1(1, 30) = 41.53, p < 0.05, ηp² = 0.58; F2(1, 42) = 95.83, p < 0.05, ηp² = 0.70. There was also a main effect of presentation condition, F1(2, 60) = 115.52, p < 0.05, ηp² = 0.79; F2(2, 84) = 213.5, p < 0.05, ηp² = 0.84. The interaction was not secure in the participant analysis, F1(2, 60) = 2.76, p = 0.07, but was in the item analysis, F2(2, 84) = 5.17, p < 0.05, ηp² = 0.10.

Since there was a significant difference between the response times to the Speech and Non-speech stimuli, these data were analyzed separately. For the Speech data, two planned pair-wise comparisons (one examining the effect of form, the other the effect of timing) were conducted, using an adjusted α-level of 0.025. There was a significant effect of form (the difference between the Form & Timing and the Timing conditions), F1(1, 30) = 5.46, p = 0.025, ηp² = 0.15; F2(1, 42) = 6.61, p < 0.025, ηp² = 0.14. The timing effect was also significant (the difference between the Timing and the Baseline conditions), F1(1, 30) = 113.45, p < 0.025, ηp² = 0.79; F2(1, 42) = 206.15, p < 0.025, ηp² = 0.83. For the Non-speech stimuli, the difference between the Form & Timing and the Timing conditions was not significant, both F1 and F2 < 1. The difference between the Timing and the Baseline conditions was significant, F1(1, 30) = 75.73, p < 0.025, ηp² = 0.72; F2(1, 42) = 235.78, p < 0.025, ηp² = 0.85.

The percentage error rates for each of the experimental conditions are shown in Table 2. For the error data, there were no significant differences between Speech and Non-speech, both Fs < 1, or between the presentation types, both Fs < 1, nor was there a significant interaction between the two variables, F1(2, 60) = 1.56, p > 0.05; F2(2, 84) = 1.65, p > 0.05.

Table 1
Mean latencies (ms) for Speech and Non-speech as a function of visual speech type (standard error in parentheses).

Visual speech type       Speech        Non-speech
Form & timing            572 (16.5)    539 (12.8)
Timing                   589 (15.9)    538 (13.6)
Static face: baseline    684 (14.8)    633 (13.5)
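For concreteness, a by-participants (F1) analysis of this 2 × 3 repeated-measures design can be sketched as below. The file and column names are hypothetical and this is not the authors' analysis script; the by-items (F2) analysis would use item rather than participant as the random factor.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: one mean RT per participant per cell of the
# 2 (stimulus: speech / non-speech) x 3 (visual: form+timing / timing / baseline) design.
data = pd.read_csv("mean_rts_by_participant.csv")   # columns: participant, stimulus, visual, rt

# Two-way repeated-measures ANOVA over participants (the F1 analysis).
result = AnovaRM(data, depvar="rt", subject="participant",
                 within=["stimulus", "visual"]).fit()
print(result.anova_table)
```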

Table 2
Mean percent error rates for Speech and Non-speech as a function of visual speech type (standard error in parentheses).

Visual speech type       Speech        Non-speech
Form & timing            2.4 (0.39)    2.8 (0.61)
Timing                   2.9 (0.51)    2.8 (0.48)
Static face: baseline    3.1 (0.38)    2.2 (0.36)

4. Discussion

The current study examined the effect that presenting visual speech form and timing information has on auditory processing. The timing cue consisted of showing the peri-oral region of the talker, as this provided a cue to the auditory signal onset.
This cue would provide timing information for both the speech and non-speech signals. The cue to speech form consisted of showing the oral region of the talker. This cue was specific to the speech stimuli, in that only speech would match the spectral properties broadly specified by mouth and lip shapes.

The results showed that providing visual timing information that was clearly related to speech articulation (peri-oral motion) facilitated discrimination times for both speech and non-speech, and did so to the same extent. This finding of facilitation for simultaneously presented AV stimuli is consistent with the proposal that the visual cue may produce phasic alerting (e.g., Best et al., 2007). However, a number of features of the current paradigm and results differ from the studies that have proposed phasic alerting as an explanation. First, alerting paradigms typically use so-called "accessory stimuli" as cues, stimuli that are irrelevant to the target one. In the current study, the visual speech stimuli were not irrelevant to the auditory events (even the non-speech ones); i.e., there is a well-established, enduring connection between face motion and auditory events. Second, phasic alerting is typically assumed to occur because intensity changes in an accessory stimulus produce automatic arousal. In the current paradigm the visual change involved the onset of peri-oral motion rather than the change in luminance that is often used (Posner, Nissen, & Klein, 1976). Third, visual accessory stimuli tend to produce relatively small facilitation effects, much smaller than the current effect of around 100 ms; such effects are also often accompanied by increased errors (Posner et al., 1976), which did not happen in the current experiment.

Given these differences, we suggest that the timing effect more likely involved a general readying of auditory processing by the visual speech. One possible mechanism for this general auditory-visual interaction is that visual speech controls the excitability of the auditory cortex through cross-modal phase resetting. On this view, ongoing activity in the auditory system is reorganized (reset) so as to be ready for expected auditory input. This preparation leads to an increase in the sensitivity of the auditory cortex. For example, in a paradigm in which the stimuli were constructed to be perceived as a single auditory-visual event, Thorne, De Vos, Viola, and Debener (2011) showed that responses in a pure-tone frequency discrimination task were facilitated by a visual stimulus (a white rectangle) presented 30, 50, 65, 75 or 100 ms before the auditory stimulus. In addition, they showed that the presentation of the visual stimulus reset the oscillatory activity of the auditory cortex (alpha and theta frequencies) and that this modulated the processing of the subsequently presented auditory stimulus.

The current results also showed that the presentation of the mouth region led to a further reduction in response time for the speech but not the non-speech stimuli.¹

¹ Visual timing information proved to be a much more powerful prime than visual form information. Part of the reason for this may be that the size of the form priming effect was curtailed by a limit on how fast a response could be made. This would predict that individuals whose response times were relatively fast should show less of a priming effect than slower participants. This was not the case: for both speech and non-speech, there was a non-significant correlation between response time in the full-face (form) condition and the form priming effect (i.e., RT in the AV timing condition minus RT in the form condition).

Our explanation for this speech-specific result involved a consideration of three factors: (1) the information used to perform the task; (2) why a match between visual speech form and the auditory signal led to a faster response; and (3) the mechanism responsible for this priming effect. With regard to what information was required by the task, participants likely discriminated the speech from the non-speech stimuli by spectral profile (Fig. 1 makes it clear how the speech and non-speech stimuli differed in this regard). We propose that visual speech primed the auditory speech signal because only this signal matched the spectral profile associated with the shaping of the mouth and lips. Evidence that mouth/lip shapes pick out spectral properties comes from several different areas. For example, models of speech production (e.g., from Stevens & House, 1961 onwards) support the idea that the shaping of the front part of the vocal tract (mouth and lips) is associated with the spectral properties of the produced utterance. Further, acoustic properties of speech can be used to reliably synthesise mouth shapes (Berthommier, 2004), and mouth shapes can be used to estimate speech properties (Girin et al., 2001).

So why did a match between visual speech form information and the auditory signal facilitate response time? We suggest that a type of Hebbian cross-modal learning gives rise to priming between visual speech and auditory processing and leads to a facilitation effect. On this account, the history of pairing between visual and acoustic features, and the neural activation associated with each, should result in ensembles of visual and auditory neurons that are responsive to the co-occurrence of these events. The operation of such a mechanism is consistent with the finding of Okada et al. (2013) that activity in the auditory cortex is up-regulated (primed) by the presentation of visual speech. Given this, the presentation of visual speech form would facilitate perceptual processing of the matched speech stimuli. That is, the information provided by mouth shape and motion would up-regulate only the region of the auditory cortex associated with the uttered auditory speech and thereby enable faster speech perception. The processing of the non-speech auditory stimuli would not be facilitated because the frequency region of the auditory cortex associated with these tokens was not pre-activated by the visual speech.

We have proposed that the locus of priming is at the perceptual stage (in the auditory cortex) and not at the decision stage. We did this because, if priming were at the decision stage, it would be difficult to explain why the effect only occurred for the speech signal. That is, if a decision to respond 'speech' was primed by a match between the visual and auditory signals, then it seems arbitrary to rule out the converse, that a 'non-speech' response would be primed by a mismatch. Of course, a perceptual locus for priming would mean that the visual speech information was processed early. Recent neurophysiological findings are consistent with this requirement, as it has been shown that the presentation of visual speech affects both the timing and morphology of an early auditory evoked response, the N100 (e.g., Jääskeläinen et al., 2008; Van Wassenhove, Grant, & Poeppel, 2005).

In sum, we found that both visual speech form and timing information assist in classifying whether an auditory stimulus is speech or not.
That both types of information are useful makes sense, since changes in the shape of the mouth during speech articulation provide information about the onset and rhythmic structure of the auditory target and about what some of the spectral properties of speech may be. Of course, whether this information is useful will depend on the information needed to perform the task itself.

Acknowledgments

The authors thank Leo Chong for assisting in data collection and acknowledge the support of ARC Discovery Grant DP130104447.

References

Arnal, L. H., Morillion, B., Kell, C. A., & Giraud, A. L. (2009). Dual neural routing of visual facilitation in speech processing. Journal of Neuroscience, 29, 13445–13453.
Berthommier, F. (2004). A phonetically neutral model of the low-level audio-visual interaction. Speech Communication, 44, 31–41.
Best, V., Ozmeral, E. J., & Shinn-Cunningham, B. G. (2007). Visually-guided attention enhances target identification in a complex auditory scene. Journal for the Association for Research in Otolaryngology, 8, 294–304.
Blesser, B. (1972). Speech perception under conditions of spectral transformation: I. Phonetic characteristics. Journal of Speech and Hearing Research, 15, 5–41.
Davis, C., & Kim, J. (2001). Repeating and remembering foreign language words: Implications for language teaching system. Artificial Intelligence Review, 16, 37–47.
Davis, C., & Kim, J. (2004). Audio-visual interactions with intact clearly audible speech. Quarterly Journal of Experimental Psychology, 57A, 1103–1121.
Davis, C., & Kim, J. (2006). Audio-visual speech perception off the top of the head. Cognition, 100, B21–B31.
Forster, K. I., & Forster, J. C. (2003). DMDX: A Windows display program with millisecond accuracy. Behavior Research Methods, Instruments and Computers, 35, 116–124.
Girin, L., Schwartz, J. L., & Feng, G. (2001). Audio-visual enhancement of speech in noise. Journal of the Acoustical Society of America, 109, 3007–3020.
Grant, K. W., & Seitz, P. F. (2000). The use of visible speech cues for improving auditory detection of spoken sentences. Journal of the Acoustical Society of America, 108, 1197–1208.
Greenberg, S., Carvey, H., Hitchcock, L., & Chang, S. (2003). Temporal properties of spontaneous speech: A syllable-centric perspective. Journal of Phonetics, 31, 465–485.
Hazan, V., Kim, J., & Chen, Y. (2010). Audiovisual perception in adverse conditions: Language, speaker and listener effects. Speech Communication, 52, 996–1009.
Jääskeläinen, I. P., Kauramäki, J., Tujunen, J., & Sams, M. (2008). Formant transition-specific adaptation by lipreading of left auditory cortex N1m. Neuroreport, 19, 93–97.
Kim, J., & Davis, C. (2003). Hearing foreign voices: Does knowing what is said affect visual-masked-speech detection? Perception, 32, 111–120.
Kim, J., & Davis, C. (2004). Investigating the audio-visual speech detection advantage. Speech Communication, 44, 19–30.
Kim, J., Davis, C., & Krins, P. (2004). Amodal processing of visual speech as revealed by priming. Cognition, 93, B39–B47.
Lakatos, P., Chen, C. M., O'Connell, M. N., Mills, A., & Schroeder, C. E. (2007). Neuronal oscillations and multisensory interaction in primary auditory cortex. Neuron, 53, 279–292.
Lovelace, C. T., Stein, B. E., & Wallace, M. T. (2003). An irrelevant light enhances auditory detection in humans: A psychophysical analysis of multisensory integration in stimulus detection. Cognitive Brain Research, 17, 447–453.
MacNeilage, P. F. (1998). The frame/content theory of evolution of speech production. Behavioral and Brain Sciences, 21, 499–511.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.
Munhall, K. G., Gribble, P., Sacco, L., & Ward, M. (1996). Temporal constraints on the McGurk effect. Perception & Psychophysics, 58, 351–362.
Munhall, K. G., Jones, J. A., Callan, D. E., Kuratate, T., & Vatikiotis-Bateson, E. (2004). Visual prosody and speech intelligibility: Head movement improves auditory speech perception. Psychological Science, 15, 133–137.
Okada, K., Venezia, J. H., Matchin, W., Saberi, K., & Hickok, G. (2013). An fMRI study of audiovisual speech perception reveals multisensory interactions in auditory cortex. PLoS ONE, 8(6), e68959.
Paris, T., Kim, J., & Davis, C. (2013). The role of visual speech in the speed of auditory speech processing. Brain and Language, 126, 350–356.
Posner, M. I., Nissen, M. J., & Klein, R. M. (1976). Visual dominance: An information-processing account of its origins and significance. Psychological Review, 83, 157.
Rastle, K., Harrington, J., & Coltheart, M. (2002). 358,534 Nonwords: The ARC nonword database. Quarterly Journal of Experimental Psychology, 55A, 1339–1362.
Schwartz, J. L., Berthommier, F., & Savariaux, C. (2004). Seeing to hear better: Evidence for early audio-visual interactions in speech identification. Cognition, 93, B69–B78.
Stevens, K. N., & House, A. S. (1961). An acoustical theory of vowel production and some of its implications. Journal of Speech and Hearing Research, 4, 303.
Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26, 212–215.
Summerfield, A. Q. (1979). The use of visual information in phonetic perception. Phonetica, 36, 314–331.
Summerfield, A. Q. (1987). Some preliminaries to a theory of audiovisual speech processing. In B. Dodd & R. Campbell (Eds.), Hearing by Eye II: The psychology of speech reading and auditory-visual speech (pp. 58–82). Hove, UK: Erlbaum Associates.
Thorne, J. D., De Vos, M., Viola, F. C., & Debener, S. (2011). Cross-modal phase reset predicts auditory task performance in humans. The Journal of Neuroscience, 31, 3853–3861.
Van Wassenhove, V., Grant, K. W., & Poeppel, D. (2005). Visual speech speeds up the neural processing of auditory speech. Proceedings of the National Academy of Sciences of the United States of America, 102, 1181–1186.
