JSLHR

Supplement

Speech Compensation for Time-Scale-Modified Auditory Feedback

Rintaro Ogane^a and Masaaki Honda^a

Purpose: The purpose of this study was to examine speech compensation in response to time-scale-modified auditory feedback during the transition of the semivowel for a target utterance of /ija/.

Method: Each utterance session consisted of 10 control trials in the normal feedback condition followed by 20 perturbed trials in the modified auditory feedback condition and 10 return trials in the normal feedback condition. The authors examined speech compensation and the aftereffect in terms of 3 acoustic features: the maximum velocities on the (a) F1 and (b) F2 trajectories (VF1 and VF2) and (c) the F1–F2 onset time difference (TD) during the transition. They also conducted a syllable perception test on the feedback speech.

Results: Speech compensation was observed in VF1, VF2, and TD. The magnitudes of speech compensation in VF1 and TD monotonically increased as the amount of the time-scale perturbation increased. The amount of speech compensation increased as the phonemic perception change increased.

Conclusions: Speech compensation for time-scale-modified auditory feedback is carried out primarily by changing VF1 and secondarily by adjusting VF2 and TD. Furthermore, it is activated primarily by detecting the speed change in altered feedback speech and secondarily by detecting the phonemic categorical change.

Speech production and sensory perception are linked in ongoing speech motor control as well as in speech acquisition. Speech motor control uses various kinds of sensory information, including auditory feedback, tactile feedback, and somatosensory feedback, to achieve the speaker's own utterance target. In particular, auditory feedback is significantly linked to speech production as well as speech acquisition. To investigate the link between speech perception and production, altered auditory feedback in which the fundamental frequency or the formant frequencies of the uttered speech are altered has been widely used. Pitch-shifted auditory feedback (Burnett, Freedland, Larson, & Hain, 1998; Burnett, Senner, & Larson, 1997; Kawahara, 1994; Larson, Burnett, Kiran, & Hain, 2000) has provided evidence of rapid compensatory response in the laryngeal control for maintaining the target pitch. Formant-shifted auditory feedback (Cai, Ghosh, Guenther, & Perkell, 2010; Houde & Jordan, 1998; MacDonald, Goldberg, & Munhall, 2010; MacDonald, Purcell, & Munhall, 2011; Purcell & Munhall, 2006; Villacorta, Perkell, & Guenther, 2007) also has provided evidence of adaptation of the articulation to cancel the formant shift with relatively slow adaptation speed. Most previous studies have focused on the compensatory response to the stationary auditory perturbation. Few studies have examined speech compensation when nonstationary auditory perturbation was applied during the utterance.

In speech motor control, articulatory dynamics is a significant feature for characterizing the consonants as well as semivowels. Dynamic characteristics of the formant trajectories also are important perceptual features in identifying these phonemes. Dynamic characteristics are defined in several ways in speech acoustics. Formant transition, or spectral temporal change, is one of the most important dynamic features to characterize stop consonants and semivowels. Voice onset time is another dynamic feature that distinguishes voiced and unvoiced stop consonants. An altered auditory feedback paradigm that manipulates these dynamic acoustic features in real time is an efficient approach to investigate the link between speech motor control and speech perception in terms of speech dynamics. Mitsuya, MacDonald, and Munhall (2011) investigated speech compensation to altered voice onset time auditory feedback and showed that talkers compensate for the altered auditory feedback, lengthening the voice onset time for unvoiced stop /t/ when they heard feedback sound with /d/.

^a Waseda University, Saitama, Japan

Correspondence to Masaaki Honda: [email protected]
Editor: Jody Kreiman
Associate Editor: Ewa Jacewicz
Received July 5, 2012
Revision received February 8, 2013
Accepted September 26, 2013
DOI: 10.1044/2014_JSLHR-S-12-0214


Key Words: modified auditory feedback, time-scale modification, speech compensation, aftereffect

Disclosure: The authors have declared that no competing interests existed at the time of publication.

Journal of Speech, Language, and Hearing Research • Vol. 57 • S616–S625 • April 2014 • © American Speech-Language-Hearing Association
Supplement: Select Papers From the 9th International Seminar on Speech Production, Part 2

There are several ways to modify spectral dynamics. One is to alter the formant frequencies of the uttered speech. Speech acoustics for vowels, semivowels, and stop consonants can be mostly represented as the formant frequencies and their time trajectories. Thus, if the formant frequencies for the uttered speech can be correctly manipulated in real time, dynamic characteristics in auditory feedback sound can be altered very flexibly. However, there are some technical problems in accurately detecting the first three formant trajectories in the transition of stop consonants in real time with a short time lag. Another way to alter the spectral dynamics is to change the time scale of the speech signal over time, that is, the utterance speed. Time-scale modification is limited in that it can alter only the spectral dynamics in the utterance speed, but it is easier to implement in real time with a short time lag. In this study, we used time-scale modification to alter the spectral dynamics. The auditory feedback speech was altered using the time domain pitch-synchronous overlap-add (TD-PSOLA) method (Huang, Acero, & Hon, 2001; Moulines & Charpentier, 1990) so that the time scales on the formant trajectories were modified locally during the transition. In this research, we focused on speech compensation in response to nonstationary altered auditory feedback that was implemented by the local time-scale modification. We used a target utterance of the vowel–semivowel–vowel syllable /ija/ to examine the speech compensation in response to altered auditory feedback. The time-scale modification was applied during the transition of the semivowel /j/ whereby the transition interval was lengthened according to the time-scale factor.

We examined speech compensation in terms of the three acoustic parameters obtained from the uttered speech signal: the maximum velocities of the first two formant trajectories (VF1 and VF2, respectively) during the transition of /j/ and the onset time of the transition of the F2 trajectory relative to that of the F1 trajectory. We examined the time course of the speech compensation during perturbed trials with altered auditory feedback and the aftereffect during the control (return) trials after cessation of the altered auditory feedback. We compared the magnitude of speech compensation and the adaptation speed in these acoustic features for various time-scale modifications. We also conducted a listening test to determine whether the compensated feedback sound was correctly perceived as the target utterance during perturbed trials and return trials. Later in this article, we discuss the speech compensation from the perspective of articulatory motions and whether the speech compensation is activated simply by detecting the speed change in the feedback sound or by detecting the phonemic change caused by modification of the time scale.

Method

Experimental Condition

Ten male subjects (M age = 26.10 years, SD = 12.51 years) participated in the experiment. The subjects were native Japanese speakers and had no reported impairment of hearing or speech. All subjects signed consent forms approved by the Ethics Committee on Human Research of Waseda University. The subjects were informed that they would hear their own voice mixed with guide tones. They were instructed to speak the target syllable, /ija/, in synchronization with guide tones mixed with the feedback speech. They were not told that the auditory feedback condition was altered during the utterance session.

The experimental procedure is depicted in Figure 1. The experimental session was conducted for time-scale-modified auditory feedback conditions with four time-scale factors of 1.2, 1.4, 1.6, and 1.8. Time-scale modification was applied during the transition of the semivowel /j/ of the target syllable /ija/ whereby the transition interval was lengthened according to the time-scale factor. In a series of experimental sessions, the time-scale factor was set in increasing order from 1.2 to 1.8. In each session, the subject repeated five utterance parts. Each utterance part comprised 40 repeated utterances of the target syllable, /ija/. These utterance trials were implemented under three conditions: (a) control (10 utterance trials in the normal auditory feedback condition), (b) perturbed (20 utterance trials in the modified auditory feedback condition), and (c) return (10 utterance trials in the normal auditory feedback condition). In perturbed trials, auditory feedback was changed in a stepwise manner, as shown in Figure 1. In the return trials, auditory feedback was changed in a stepwise manner to the normal auditory feedback condition. In each utterance part, the subject uttered the target syllable during a 2-s recording interval and had a 3-s pause between the successive utterances. The start time of each recording was displayed on a screen placed in front of the subject. Each subject was instructed to repeat the syllable utterance of /ija/ 40 times in each part.
To maintain a good utterance condition, we instructed each subject to take a few minutes of rest between successive parts and a longer rest between each successive session, as shown in Figure 1. Utterance data for a total of 800 trials were collected for each subject. Before the experiment, the subjects were trained to speak the utterance of the target syllable in synchrony with the guide tones mixed with the feedback speech. In another training session, the subjects uttered the target syllable several times in conditions of both normal auditory feedback and modified auditory feedback with a time-scale factor of 1.6. Using these preliminary utterance data, we adjusted several control parameters in the experimental setup.
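The trial structure described above can be laid out programmatically as a quick consistency check on the total of 800 trials per subject. The following is a minimal sketch, not the authors' software; the names `condition_for_trial` and `build_schedule` are hypothetical, and the sketch assumes that the time-scale factor is 1.0 in control and return trials and switches in a single step at the 11th and 31st trials.

```python
from collections import Counter

TIME_SCALE_FACTORS = [1.2, 1.4, 1.6, 1.8]  # one session per factor, in increasing order
PARTS_PER_SESSION = 5
TRIALS_PER_PART = 40


def condition_for_trial(trial_number):
    """Map a 1-based trial number within a part to its feedback condition."""
    if not 1 <= trial_number <= TRIALS_PER_PART:
        raise ValueError("trial_number must be between 1 and 40")
    if trial_number <= 10:
        return "control"
    if trial_number <= 30:
        return "perturbed"
    return "return"


def build_schedule():
    """Return (session factor, part, trial, condition, applied factor) tuples."""
    schedule = []
    for factor in TIME_SCALE_FACTORS:
        for part in range(1, PARTS_PER_SESSION + 1):
            for trial in range(1, TRIALS_PER_PART + 1):
                condition = condition_for_trial(trial)
                # Assumption: normal feedback corresponds to a factor of 1.0.
                applied = factor if condition == "perturbed" else 1.0
                schedule.append((factor, part, trial, condition, applied))
    return schedule


schedule = build_schedule()
counts = Counter(row[3] for row in schedule)
print(len(schedule), dict(counts))
```

Four sessions of five parts with 40 trials each give the 800 trials per subject stated in the text, of which half are perturbed trials.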

Auditory Feedback System

The altered auditory feedback system is illustrated in Figure 2. Uttered speech was recorded using a microphone (Model ECM-672, Sony), and the speech signal was digitized at a sampling rate of 10 kHz after being low-pass-filtered at 5 kHz (Model MS-521, NF Corporation). The signal was processed in a digital signal processor (Model TMS320C6713B, Texas Instruments) to alter the time scale of the uttered speech using the TD-PSOLA method in real time. The altered speech signal was then fed back into both ears of the subject through earphones (Model ER-4P, Etymotic Research) after low-pass filtering. An electroglottograph (EGG) signal was

Ogane & Honda: Compensation for Temporal Feature Perturbation


Figure 1. Experimental procedure. The top part of the figure shows the sequence of the modified auditory feedback experiment. One experiment consisted of an adjustment and practice phase followed by four sessions. The adjustment and practice phase contained the preparation for the modified auditory feedback experiment. Each session was divided into five parts (P). The shaded boxes indicate a few minutes of rest between sessions and parts. The bottom portion of the figure shows the sequence of one part, which consisted of three auditory feedback conditions: (a) control, (b) perturbed, and (c) return.

simultaneously acquired for detecting the pitch marks in the speech waveform, which was used in the TD-PSOLA processing. The pitch marks were determined by detecting the instant when the time-derivative EGG signal exceeded a chosen threshold. The threshold was manually adjusted by referring to the time-derivative EGG waveform in the training utterance data. The onset time of the transition of the semivowel, when altered auditory feedback was on, was automatically determined by detecting the instant when the squared sum of the delta linear predictive coding (LPC) cepstrum coefficients obtained from the uttered speech signal exceeded a threshold, as shown in Figure 3. Delta LPC cepstrum is calculated by the following equation (Furui, 1986):

$$\Delta c_n(t) = \frac{\sum_{k=-K}^{0} k\, h_k\, c_n(t+k)}{\sum_{k=-K}^{0} k^2\, h_k}.$$

Here, Δc_n(t) is the nth delta LPC cepstrum, h_k is an asymmetrical triangular window function, c_n(t) is the nth mel-frequency LPC cepstrum coefficient at the frame t, and K is the window length. The threshold value was set at the 5% level of the maximum of the squared sum of delta LPC cepstrum coefficients during the transition, and it was determined from the training utterance data. The perturbation interval was set at approximately 190 ms, which was adjusted to the utterance speed by the subject. The total time lag in the processing was 40 ms for an analysis window length of 10 ms, because the TD-PSOLA processing needs the pitch mark in the next window block. The time lag was commonly set for control and return trials as well as perturbed trials. The time lag is tolerated for the additional delayed auditory feedback effect (Stuart, Kalinowski, Rastatter, & Lynch, 2002).

The masking noise, which is most commonly used in altered auditory feedback experiments for masking bone-conducted speech, was not mixed with the feedback speech sound because the noise masks small changes in the altered feedback speech. The sound level of the feedback signal was adjusted for each subject to mask the bone-conducted sound during the utterance. The sound level at the earphones was measured during the procedure to adjust the sound level by using a sound meter (NL-20, RION Co. Ltd., Japan) that was connected to other earphones through an acoustic coupling box. The sound level at the earphones was dependent on the sound loudness of the subject utterance, and it was in the range of 85 to 98 dB SPL.

Figure 2. Block diagram of experimental setup for modified auditory feedback experiment. DSP = digital signal processor; EGG = electroglottography.
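The thresholding step of the transition-onset detection described in this section can be illustrated with a short sketch. It assumes that the delta LPC cepstrum coefficients have already been computed for each frame; the function names and the numeric values are hypothetical and serve only to show the logic of a threshold set at 5% of the training maximum.

```python
def delta_energy(delta_frames):
    """Squared sum of the delta cepstrum coefficients for each frame."""
    return [sum(d * d for d in frame) for frame in delta_frames]


def threshold_from_training(training_energy, fraction=0.05):
    """Threshold set at a fraction (5% in the text) of the training maximum."""
    return fraction * max(training_energy)


def detect_onset(energy, threshold):
    """Index of the first frame whose energy exceeds the threshold, or None."""
    for frame_index, value in enumerate(energy):
        if value > threshold:
            return frame_index
    return None


# Hypothetical delta cepstra (two coefficients per frame) from a training utterance.
training_deltas = [[0.0, 0.0], [0.1, 0.0], [0.5, 0.5], [1.0, 1.0], [0.2, 0.1]]
threshold = threshold_from_training(delta_energy(training_deltas))

# Hypothetical delta cepstra from a new utterance: steady vowel, then the transition.
new_deltas = [[0.0, 0.1], [0.1, 0.1], [0.3, 0.2], [0.9, 0.8]]
onset_frame_index = detect_onset(delta_energy(new_deltas), threshold)
print(threshold, onset_frame_index)
```

In a real-time setting the same comparison would be made frame by frame as the signal arrives, which is presumably why a one-sided window that uses only past frames is needed for the delta computation.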


Figure 3. Speech signal waveform (top) and the time sequence of the squared sum of delta cepstrum coefficients (bottom) for syllable utterance /ija/. Perturbation onset is indicated by a line on the delta cepstrum.

Figure 4. Time trajectories of F1 and F2 for the time-scale modification with the time-scale factor of 1.0 (solid lines) and 1.8 (dotted lines) during the transition of semivowel /j/ in the syllable utterance /ija/.

A guide tone signal, a series of four tones, was mixed with the feedback speech in order to synchronize the utterance timing with the guide tone. The first three tones cued the start of the utterance: the subject prepared the utterance while listening to the first two guide tones and started the utterance in synchrony with the onset of the third guide tone. The fourth guide tone was used for synchronizing the onset time of the semivowel utterance. The interval between the third and fourth guide tone was set at 0.3 s. The guide tones are effective in eliminating errors in detecting the onset time of the altered feedback perturbation and in preventing the preceding vowel, /i/, from being uttered with an extremely short duration in the altered auditory feedback condition.

Acoustical and Perceptual Change by Time-Scale Modification

We examined the acoustic characteristics of the time-scale-modified speech during the transition that was implemented in the real-time-altered auditory feedback system. The trajectories of the first two formant frequencies for the time-scale factors of 1.0 and 1.8, respectively, are shown in Figure 4. The interval of the transition is lengthened to 1.8 times that of the original interval, and the time trajectories are uniformly lengthened along the time axis.

We conducted a perception test to examine the perceptual phonemic change in the time-scale-modified speech as the time-scale factor varied. Stimuli were the time-scale-modified speech for the time-scale factors from 0.6 (shortened) to 2.5 (lengthened). The original speech was taken from an utterance of /ija/ spoken by a male subject. Ten listeners (nine males and one female, M age = 26.60 years, SD = 12.00 years) participated in the listening test. The listeners judged each stimulus as either /ija/ or /ia/. The perception scores at which the stimulus was perceived as /ija/ and /ia/ as a function of the time-scale factor are shown in Figure 5. As the time-scale factor increased to 2.0, the perception score for /ija/ monotonically decreased and the score for /ia/ increased. The phonemic boundary with a perception score of 50% was at the time-scale factor of 1.5. At time-scale factors larger than 1.8, the modified speech was almost always perceived as /ia/. For the modified speech with time-scale factors larger than 2, we found that it was difficult to compensate the modified speech by increasing the articulation speed during the transition of the semivowel because of the limitation of the possible articulation speed. Therefore, in the following analysis we selected the time-scale factors from 1.0 to 1.8.

Figure 5. Syllable perception score in percentage for time-scale-modified speech for syllable utterance /ija/ as a function of the change in time-scale factor from 0.6 to 2.5. Black and white circles indicate the scores at which the stimulus was perceived as /ija/ and /ia/, respectively.


Acoustic Analysis

We examined speech compensation in response to the altered auditory feedback in terms of three acoustic features that were obtained from the uttered speech: the maximum velocities on the (a) F1 and (b) F2 time trajectories during the transition of the semivowel (i.e., VF1 and VF2) and (c) the onset time of the transition of the F2 trajectory relative to that of the F1 trajectory (F1–F2 onset time difference [TD]). We obtained the formant frequencies for a 50-ms analysis window at every 1-ms frame period using Praat (Boersma & Weenink, 2012). Errors in the formant analysis were manually corrected, and the formant trajectory was smoothed by curve fitting with a digital filter. We then calculated the velocity on each formant trajectory, ΔF(t), with the following equation:

$$\Delta F(t) = \frac{\sum_{k=-K}^{K} k\, h_k\, F(t+k)}{\sum_{k=-K}^{K} k^2\, h_k}.$$

Here, ΔF(t) is the delta formant, h_k is a symmetrical triangular window function, F(t) is the formant frequency at the frame t, and K is the window length. The maximum velocity of each formant trajectory is defined as a peak of ΔF(t) during the transition of the semivowel, as shown in Figure 6. The F1–F2 onset TD is defined as the interval from the F2 onset time to the F1 onset time, that is, the F1 onset time minus the F2 onset time, in the transitions of their trajectories, as shown in Figure 6. The F1 or F2 onset time was determined at the instant at which each trajectory exceeds 20% of the extent of the formant change during the transition.

Figure 6. Definition of three acoustic features: F1–F2 onset time difference (TD; top panel) and maximum velocities of F1 (VF1) and F2 (VF2; bottom panel). TD was calculated as the time difference between onset time of the F1 and F2 trajectories (TF1 and TF2, respectively). VF1 and VF2 were calculated as the maximum of each delta formant.
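The three acoustic features defined in this section can be sketched on synthetic formant tracks. This is an illustration under stated assumptions rather than the authors' analysis code: the triangular window shape, the window length K = 2, the use of the absolute velocity (so that a rising F1 and a falling F2 are treated alike), the use of the first and last frames of the segment as the extent of the formant change, and the piecewise-linear tracks are all hypothetical choices.

```python
def triangular_window(K):
    """Symmetric triangular weights h_k for k = -K..K (shape is an assumption)."""
    return {k: 1.0 - abs(k) / (K + 1) for k in range(-K, K + 1)}


def delta_formant(track, K=2):
    """Regression velocity sum(k*h_k*F(t+k)) / sum(k^2*h_k) for interior frames.

    Returns {frame index: velocity}; the first and last K frames are skipped.
    With a 1-ms frame period the unit is Hz per ms.
    """
    h = triangular_window(K)
    denom = sum(k * k * w for k, w in h.items())
    return {
        t: sum(k * w * track[t + k] for k, w in h.items()) / denom
        for t in range(K, len(track) - K)
    }


def max_velocity(track, K=2):
    """Peak absolute velocity over the analysed segment."""
    return max(abs(v) for v in delta_formant(track, K).values())


def onset_frame(track, fraction=0.2):
    """First frame at which the track has covered more than `fraction` of its
    total change, measured from the first to the last frame of the segment."""
    start, extent = track[0], track[-1] - track[0]
    if extent == 0:
        return None
    for t, value in enumerate(track):
        if (value - start) / extent > fraction:
            return t
    return None


# Synthetic tracks at 1 ms per frame: F1 rises 300 -> 700 Hz, F2 falls 2200 -> 1200 Hz.
f1 = [300.0] * 10 + [300.0 + 20.0 * i for i in range(1, 21)] + [700.0] * 10
f2 = [2200.0] * 5 + [2200.0 - 50.0 * i for i in range(1, 21)] + [1200.0] * 15

vf1 = max_velocity(f1)
vf2 = max_velocity(f2)
td = onset_frame(f1) - onset_frame(f2)  # F1 onset minus F2 onset, in frames (ms)
print(vf1, vf2, td)
```

With these tracks the F2 transition starts earlier than the F1 transition, so TD is positive, which matches the sign convention stated for TD.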

Analysis of the Acoustic Data

We normalized each acoustic feature by subtracting the average over 10 control trials from the original value. Thus, each acoustic feature is a value relative to the average over control trials. We did this to reduce the variance in the utterance speed across repeated utterances and subjects. Then we averaged each acoustic feature over 10 subjects and five repeated utterance trials to obtain the group data. We obtained the magnitude of speech compensation in each acoustic feature for every utterance trial by averaging the acoustic feature over the last five perturbed trials. Similarly, we obtained the magnitude of the aftereffect by averaging the acoustic feature over the last five return trials. The magnitude in the control condition is always identical to 0 because of the normalization procedure.

We examined the difference in the magnitudes of each acoustic feature between the control condition and perturbed condition or return condition using a t test. The number of samples for each condition was 50 (five repetitions multiplied by 10 subjects). Because the magnitude data for the three acoustic features in the control condition were normalized to 0, with zero variance, we performed the t test on whether the average magnitude in the perturbed or return condition was equal to 0. We conducted a one-way analysis of variance (ANOVA) on the magnitude of speech compensation and aftereffect in each acoustic feature to examine the effect of the time-scale factor. Moreover, we performed Tukey’s honestly significant difference test on the magnitudes between every combination of time-scale factors.
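The normalization and magnitude calculations described above, together with a test of a mean against 0, can be sketched as follows. This is a minimal illustration with hypothetical data and function names, using only the Python standard library; it reads the t test as a one-sample test against 0, which is consistent with the 49 degrees of freedom reported for 50 samples, and it does not compute p values.

```python
import math
import statistics


def normalize_to_control(values, n_control=10):
    """Subtract the mean of the first n_control (control) trials from every trial."""
    baseline = statistics.mean(values[:n_control])
    return [v - baseline for v in values]


def compensation_magnitude(normalized, n_control=10, n_perturbed=20, n_last=5):
    """Mean of the last n_last perturbed trials."""
    end = n_control + n_perturbed
    return statistics.mean(normalized[end - n_last:end])


def aftereffect_magnitude(normalized, n_last=5):
    """Mean of the last n_last return trials."""
    return statistics.mean(normalized[-n_last:])


def one_sample_t(samples, mu=0.0):
    """t statistic and degrees of freedom for the hypothesis mean == mu."""
    n = len(samples)
    standard_error = statistics.stdev(samples) / math.sqrt(n)
    return (statistics.mean(samples) - mu) / standard_error, n - 1


# Hypothetical feature values for one 40-trial part: control, perturbed, return.
part = [9.0, 11.0] * 5 + [12.0] * 20 + [10.5] * 10
normalized = normalize_to_control(part)
compensation = compensation_magnitude(normalized)
aftereffect = aftereffect_magnitude(normalized)

# Hypothetical magnitudes pooled over parts and subjects (50 in the study).
t_value, dof = one_sample_t([1.0, 2.0, 3.0])
print(compensation, aftereffect, t_value, dof)
```

Because the control magnitude is 0 by construction, testing the perturbed or return magnitudes against 0 serves as the comparison with the control condition.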

Listening Test

To examine the speech error caused by the speech compensation, we conducted a listening test on the feedback sound. Twenty-four stimuli were selected from 40 trials in one utterance session for every time-scale factor and every subject. These stimuli consisted of five control trials (2nd, 4th, 6th, 8th, and 10th), 12 perturbed trials (11th, 12th, 13th, 14th, 16th, 18th, 20th, 22nd, 24th, 26th, 29th, and 30th), and seven return trials (31st, 32nd, 33rd, 34th, 36th, 38th, and 40th). These stimuli for every time-scale factor and subject were presented in the order of the utterance trial to 10 listeners (nine males and one female, M age = 21.80 years, SD = 2.02 years). Five of these listeners had participated in the previous perception test and five were new. The listener judged each stimulus as /ija/ or other. Each listener adjusted the listening sound to the desirable level.

Results

The time courses of the maximum F1 and F2 velocities (VF1 and VF2, respectively) and TD obtained from the uttered speech are shown in Figure 7. The time course consisted of 10 control trials from the 1st to 10th trials followed by 20 perturbed trials from the 11th to 30th trials and 10 return trials from the 31st to 40th trials. In Figure 7, speech compensation for four time-scale factors from 1.2 to 1.8 is compared.

The magnitudes of VF1, VF2, and TD for four time-scale factors from 1.2 to 1.8 in the perturbed conditions are shown in Figure 8. The means of the acoustic features for four time-scale factors from 1.2 to 1.8 were, respectively, 0.1699, 0.3866, 0.6436, and 0.7764 for VF1; 0.3388, 0.4944, 0.4097, and 0.4431 for VF2; and 0.9487, 2.7829, 3.9906, and 5.5862 for TD. The magnitudes of VF1, VF2, and TD in the return condition are shown in Figure 9. The means of the acoustic features for four time-scale factors from 1.2 to 1.8 were, respectively, 0.0909, 0.1569, 0.2027, and 0.2933 for VF1; 0.1873, 0.4126, 0.0681, and 0.0796 for VF2; and 0.3847, 2.7059, 3.7076, and 3.9092 for TD.

Figure 7. Time course of speech compensation in VF1 (top panel), VF2 (middle panel), and the F1–F2 onset TD (bottom panel). Control trials were the 1st to 10th trials, perturbed trials were the 11th to 30th trials, and return trials were the 31st to 40th trials. The time-scale factor was set at 1.2, 1.4, 1.6, and 1.8.

Figure 8. Magnitude of speech compensation in VF1 (top panel), VF2 (middle panel), and the F1–F2 onset TD (bottom panel) for the time-scale factors of 1.2, 1.4, 1.6, and 1.8. Error bars show standard deviations.

Time Course of Speech Compensation

As shown in Figure 7, speech compensation occurred in VF1, VF2, and TD. Every acoustic feature rapidly increased just after the beginning of perturbed trials and saturated in two to five trials. The variance of each acoustic feature in the time course was small for VF1 and TD and relatively large for VF2. Comparing the time courses of VF1 and VF2, the magnitude of speech compensation in VF1 monotonically increased in order of the time-scale factors from 1.2 to 1.8, but the difference in the magnitude of VF2 among the time-scale factors was small. This indicates that speech compensation by the velocity change was more dominant in VF1 than in VF2. In addition to speech compensation by VF1 and VF2, TD (i.e., F1–F2 TD) was also used in speech compensation. Here, the positive value of TD indicates that the onset timing of F2 during the transition of the semivowel occurred earlier than that of F1. The time course in TD was more variable for the small time-scale factor, in particular for 1.2. The magnitude of TD showed a tendency to increase as the time-scale factor changed from 1.2 to 1.8.

After the time-scale-modified auditory feedback went back to the normal auditory feedback condition, every acoustic feature showed a tendency to gradually decrease and approach the value in the control condition (almost within five trials). The time course of VF1 during return trials showed a clear tendency to approach the value in the control condition for every time-scale factor. The slope of the time course of VF1 in return trials was steeper for larger time-scale factors. At the final trial in the return condition, there remained a difference in VF1 between the control condition and the return condition. The amount of aftereffect in VF1 increased as the time-scale factor increased. Although the time course of VF2 and TD during return trials showed a tendency similar to VF1, it was more variable than that of VF1, and the decreasing slope was gentle. Furthermore, the aftereffects in VF2 and TD were larger than the values in the control condition.

Figure 9. Magnitude of aftereffect in VF1 (top panel), VF2 (middle panel), and the F1–F2 onset TD (bottom panel) for the time-scale factors of 1.2, 1.4, 1.6, and 1.8. Error bars show standard deviations.

Magnitude and Speed of Speech Compensation

The magnitude of VF1 in the perturbed condition monotonically increased as the time-scale factor increased from 1.2 to 1.8, as shown in Figure 8. This indicates that more effort was made for speech compensation by increasing the F1 velocity as the amount of time-scale perturbation increased. The t test showed that the difference in magnitudes between speech compensation and the control was statistically significant for the time-scale factors of 1.2, t(49) = 2.93, p = .005; 1.4, t(49) = 5.96, p = .001; 1.6, t(49) = 6.89, p = .001; and 1.8, t(49) = 6.88, p = .001. The difference in magnitude between every combination of time-scale factors was significant: 1.2 and 1.4, t(98) = –2.49, p = .014; 1.2 and 1.6, t(98) = –4.31, p = .001; 1.2 and 1.8, t(98) = –4.78, p = .001; 1.4 and 1.6, t(98) = –2.26, p = .026; and 1.4 and 1.8, t(98) = –3.00, p = .003; except for the combination of 1.6 and 1.8, t(98) = –0.91, p = .367.

For the magnitude in the F2 velocity, the difference in magnitudes between speech compensation and the control was statistically significant for the time-scale factors of 1.2, t(49) = 2.77, p = .008; 1.4, t(49) = 3.67, p = .001; 1.6, t(49) = 3.95, p = .001; and 1.8, t(49) = 2.71, p = .009. However, as shown in Figures 7 and 8, the magnitude in VF2 slightly increased as the time-scale factor changed from 1.2 to 1.4, and the magnitude saturated above 1.4. The difference in the magnitude between every combination of the four time-scale factors was not statistically significant: 1.2 and 1.4, t(98) = –0.85, p = .395; 1.2 and 1.6, t(98) = –0.44, p = .659; 1.2 and 1.8, t(98) = –0.51, p = .611; 1.4 and 1.6, t(98) = 0.50, p = .620; 1.4 and 1.8, t(98) = 0.24, p = .810; and 1.6 and 1.8, t(98) = –0.17, p = .863. A comparison of the magnitudes of VF1 and VF2 in Figure 8 reveals that the magnitude of VF1 was larger than that of VF2 for the time-scale factors of 1.6 and 1.8, and the magnitude of VF2 was larger than that of VF1 for the time-scale factors of 1.2 and 1.4.
For the magnitude of the F1–F2 onset TD there was an increasing tendency as the time-scale factor increased from 1.2 to 1.8. The difference in magnitude of TD between the control and perturbed conditions was statistically significant for time-scale factors of 1.6, t(49) = 4.56, p = .001; 1.8, t(49) = 5.17, p = .001; and 1.4, t(49) = 2.23, p = .031. The difference in magnitude between time-scale factors was significant between 1.2 and 1.6, t(98) = –2.56, p = .012; and 1.2 and 1.8, t(98) = –3.44, p = .001; but not significant for other combinations of time-scale factors.

We conducted a one-way ANOVA to assess the magnitudes of speech compensation in perturbed trials and aftereffect in return trials for the three acoustic features. The results are shown in Table 1. For the perturbed condition, the time-scale factor effect was significant in VF1 (p = .001) and TD (p = .013) but not significant in VF2 (p = .868). Tukey’s honestly significant difference test showed that the time-scale factor of 1.2 was significantly different compared with 1.6 (p = .001) and 1.8 (p = .001), and time-scale factor of 1.4 was significantly different compared with 1.8 (p = .008) for VF1. Moreover, the time-scale factor of 1.2 was significantly different compared with 1.8 (p = .008) for TD.

The adaptation speed is characterized by the number of trials in the perturbed condition in which the time course saturates. As shown in Figure 7, the time courses of VF1 and TD during the perturbed trials were less variable, and they saturated in approximately three trials for every time-scale factor. On the other hand, the time course of VF2 during perturbed trials was more variable; it is difficult to explicitly determine the number of trials in which it saturated. In approximately five trials, each time course reached the level of the magnitude averaged over the last five trials in the perturbed trials.

Table 1. Results of a one-way analysis of variance for magnitudes of three acoustic features in the perturbed and the return conditions.

             Perturbed          Return
Parameter    F        p         F        p
VF1          10.05    .001      1.37     .254
VF2           0.24    .868      1.56     .200
TD            3.69    .013      2.18     .092

Note. VF1 = maximum velocity of first formant; VF2 = maximum velocity of second formant; TD = time difference.

Figure 10. Syllable perception score along the time course of the utterance trials for auditory feedback sounds. The control trials were the 1st to 10th trials, perturbed trials were the 11th to 30th trials, and return trials were the 31st to 40th trials.

Aftereffects

As shown in the time course of the acoustic features in Figure 7, the aftereffect caused by the time-scale-altered feedback occurred in the first several trials in return trials and remained in VF1 and TD at the final 10th return trial for the large time-scale factors. Compared with the control condition, the magnitude of the aftereffect shown in Figure 9 was statistically significant in VF1 for the time-scale factors of 1.8, t(49) = 3.80, p = .001; in VF1 for 1.4, t(49) = 2.41, p = .020, and 1.6, t(49) = 2.44, p = .018; in VF2 for 1.4, t(49) = 2.88, p = .006; in TD for 1.4, t(49) = 2.30, p = .026; and in TD for 1.6, t(49) = 3.76, p = .001, and 1.8, t(49) = 3.91, p = .001. The aftereffect in the three acoustic features had a positive value and showed a carryover effect from the perturbed condition. The results of the one-way ANOVA for the return condition are shown in Table 1. There were no significant differences.

Speech Error in Speech Compensation

The results of the listening test on the time-scale-modified feedback sound are shown in Figure 10. The horizontal axis indicates the utterance trial number, and the vertical axis indicates the average perception score in percentage at which the stimulus is perceived as /ija/. The mean scores are shown for four time-scale factors. Of note, the perception score drops at the first perturbed trial (11th) for every time-scale factor because the subjects could not predict the altered feedback in the first perturbed trial. For the perturbed conditions with the time-scale factors of 1.2 to 1.6, the score gradually increased during perturbed trials and approached the score in the control trials in approximately five perturbed trials. This indicates that the speech compensation to the altered feedback was almost complete. However, for the perturbed condition with the time-scale factor of 1.8, the score increased in perturbed trials but was still lower than that in the control trials at the final perturbed trial. This indicates that the speech compensation was incomplete for the highest time-scale factor and that some speech errors remained until the final perturbed trial. In return trials, from the 31st to 40th trial, the perception score showed a small drop at the first return trial (31st) for every time-scale factor and then rapidly improved and approached a score comparable to that in the control trials. This indicates that the aftereffect did not continue perceptually.

Discussion

The results showed that speech compensation in response to the time-scale-modified auditory feedback occurred as changes in VF1, VF2, and the F1–F2 onset TD. The magnitudes of speech compensation in VF1 and TD monotonically increased as the time-scale factor (the perturbation magnitude) increased, whereas there was less of a relationship between the compensation and perturbation magnitudes for VF2. Speech compensation by increasing VF1 means that F1 changes at a speed proportional to the modified time scale in the auditory feedback signal. On the other hand, speech compensation by increasing VF2 and TD means that F2 changes earlier than F1 and with an increased speed in the transition of the semivowel in /ija/. This suggests that the speech compensation to time-scale-modified auditory feedback is mostly carried out by changing the speed of F1 and partially by adjusting the speed of F2 and the onset time of F2 in the transition. The acoustic change of F1 and F2 in the utterance of the syllable /ija/ is related to the jaw opening and tongue motion from the front to back positions during the speech
articulation, respectively. Increasing VF1 and VF2 corresponds to an increase in the motion speed of the jaw and the tongue, respectively. Furthermore, an increase in TD indicates that the tongue motion occurs earlier than the jaw opening in compensatory response to the altered feedback. Thus, the changes in VF1, VF2, and TD suggest that the speech compensation is implemented not by increasing both jaw and tongue motion speeds simultaneously in proportion to the perturbation magnitude but mostly by increasing the jaw-opening speed and partially by moving the tongue with an increased constant speed and moving the tongue earlier in the transition. The amount of speech compensation by VF2 was nearly the same among the perturbed conditions with four time-scale factors. The amount of speech compensation by TD slightly increased as the amount of the perturbation increased. This suggests that there is a limit in increasing the speed of the tongue motion, and articulatory compensation by the tongue is implemented by adjusting the onset timing relative to the jaw motion. To investigate underlying compensatory articulatory motions, one must use articulatory measurements.

The results showed a significant change in the magnitude of speech compensation in VF1, VF2, and TD compared with that in the control condition. Furthermore, the magnitude in VF1 monotonically increased as the time-scale factor increased, and there was a significant difference between almost every combination of time-scale factors. In TD, a significant difference was observed between time-scale factors of 1.2 and 1.6 and between 1.2 and 1.8. This suggests that the amount of speech compensation increases when the phonemic perception change between the target and altered auditory feedback speech increases.

There are two possible interpretations for speech compensation due to the time-scale-modified auditory feedback. One is that speech compensation is activated by detecting the speed change.
Another is that it is activated by detecting the phonemic change. The former interpretation predicts a change in the magnitude of speech compensation whenever the time-scale perturbation is applied. Moreover, if the motor control is adjusted according to the amount of the time-scale modification, the magnitude of the compensation monotonically increases according to the amount of modification. On the other hand, the latter interpretation predicts that the speech compensation occurs only when the time-scale factor exceeds the boundary where the phonemic change occurs. As shown in Figure 5, phonemic perception changes from /ija/ to /ia/ when the time-scale factor changes from 1.2 to 1.8, and the phonemic categorical boundary is located around 1.5. The experimental result for VF1 supports both interpretations. The magnitude of compensation in VF1 changed monotonically as the time-scale factor increased. This fact supports the former interpretation. However, the changes in magnitudes of VF1 between adjacent time-scale factors were 0.17, 0.22, 0.26, and 0.13 as the time-scale factor changed from 1.0 to 1.8. The change in magnitude was not constant and became large around the phonemic boundary. This fact supports the latter interpretation. In contrast, the experimental results for VF2 and TD mostly
support the first interpretation, although the motor adjustment in proportion to the amount of the time-scale perturbation was not clearly observed. These facts suggest that speech compensation is activated primarily by detecting the speed of change in the altered auditory feedback as well as by detecting the phonemic change.

We examined the adaptation speed in the speech compensation in terms of three acoustic features and in the syllable perception test on the feedback sound. The results showed that the time courses of VF1, VF2, and TD reached the level of the magnitude of speech compensation in approximately three to five perturbed trials. A listening test on the feedback sound showed that the speech error decreased within approximately five perturbed trials and approached the score in the control condition, except for the highest time-scale modification of 1.8. We used isolated syllable utterance, and there was a pause of approximately 3 s between successive syllable utterances. Therefore, it was difficult to estimate the motor adaptation time to time-scale-modified auditory perturbation. In formant-shifted auditory feedback experiments (Houde & Jordan, 1998; Rochet-Capellan, Richer, & Ostry, 2012), the amount of the formant shift has been adjusted in a ramp manner. In this experiment, the time scale was modified in a stepwise manner in perturbed trials. Therefore, it was difficult to compare the adaptation speed for both types of auditory feedback. At the least, the slower adaptation compared with the immediate compensations in the pitch-shifted auditory feedback suggests that speech compensation in time-scale-modified auditory feedback is implemented by reorganization in the speech motor planning rather than by a reflex response.

In the return condition, VF1 rapidly dropped at the beginning of the return condition, in particular for time-scale factors of 1.6 and 1.8.
However, such a rapid change was not observed for VF1 with small time-scale modification or with VF2 and TD. This suggests that VF1 is mostly used in speech compensation and could be easily returned to the control condition in the return condition. On the other hand, TD was also used in speech compensation, but there was no tendency to return to the control condition in the return condition, which is a result of the aftereffect in TD. This suggests that the acoustic feature used primarily in speech compensation also is mostly used when the auditory feedback returns to the control condition.
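The argument in the Discussion about the size of the VF1 increments can be checked with simple arithmetic. The sketch below assumes that the four reported changes (0.17, 0.22, 0.26, and 0.13) correspond to the successive 0.2 steps between the time-scale factors 1.0, 1.2, 1.4, 1.6, and 1.8; the function name `largest_step_interval` is hypothetical.

```python
FACTORS = [1.0, 1.2, 1.4, 1.6, 1.8]   # assumed step endpoints
INCREMENTS = [0.17, 0.22, 0.26, 0.13]  # VF1 changes reported in the text
BOUNDARY = 1.5                         # approximate categorical boundary (Figure 5)


def largest_step_interval(factors, increments):
    """Return the (lower, upper) time-scale factors that bracket the
    largest change between adjacent factors."""
    if len(factors) != len(increments) + 1:
        raise ValueError("need one more factor than increments")
    index = max(range(len(increments)), key=increments.__getitem__)
    return factors[index], factors[index + 1]


low, high = largest_step_interval(FACTORS, INCREMENTS)
print((low, high), low < BOUNDARY < high)  # prints (1.4, 1.6) True
```

Under that assumption, the largest increment falls between the time-scale factors of 1.4 and 1.6, which brackets the categorical boundary near 1.5, in line with the interpretation based on phonemic change.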

Conclusions

We examined speech compensation in response to time-scale-modified auditory feedback, in which the speech dynamics were altered in the auditory feedback. The time-scale modification was applied during the transition of the semivowel for the target utterance of /ija/. Speech compensation was examined in terms of VF1, VF2, and the F1–F2 onset TD. The results showed that speech compensation to time-scale-modified auditory feedback occurred in VF1, VF2, and TD. We found that the magnitudes of the speech compensation in VF1 and TD monotonically increased as the amount of perturbation increased, and the amount of speech compensation increased when the phonemic perception change increased. The magnitude of speech compensation in VF2 did not show a clear tendency according to the amount of perturbation. These facts suggest that the speech compensation is carried out mostly by changing the motion speed in the jaw opening and in part by changing the tongue motion speed and adjusting the onset time during the transition of the tongue motion to the jaw opening. Furthermore, they suggest that the speech compensation is activated mostly by detecting the temporal speed change in altered feedback speech and secondarily by detecting the phonemic categorical change.

S624 Journal of Speech, Language, and Hearing Research • Vol. 57 • S616–S625 • April 2014

References

Boersma, P., & Weenink, D. (2012). Praat: Doing phonetics by computer [Computer software]. Retrieved from www.fon.hum.uva.nl/praat/
Burnett, T. A., Freedland, M. B., Larson, C. R., & Hain, T. C. (1998). Voice F0 responses to manipulations in pitch feedback. The Journal of the Acoustical Society of America, 103, 3153–3161.
Burnett, T. A., Senner, J. E., & Larson, C. R. (1997). Voice F0 responses to pitch-shifted auditory feedback: A preliminary study. Journal of Voice, 11, 202–211.
Cai, S., Ghosh, S. S., Guenther, F. H., & Perkell, J. S. (2010). Adaptive auditory feedback control of the production of formant trajectories in the Mandarin triphthong /iau/ and its pattern of generalization. The Journal of the Acoustical Society of America, 128, 2033–2048. doi:10.1121/1.3479539
Furui, S. (1986). Speaker-independent isolated word recognition using dynamic features of speech spectrum. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34, 52–59.
Houde, J., & Jordan, M. (1998, February 20). Sensorimotor adaptation in speech production. Science, 279, 1213–1216. doi:10.1126/science.279.5354.1213
Huang, X., Acero, A., & Hon, H. W. (2001). Speech synthesis. In Jane Bonnell (Ed.), Spoken language processing: A guide to theory, algorithm, and system development (pp. 820–831). Upper Saddle River, NJ: Prentice Hall.
Kawahara, H. (1994). Interactions between speech production and perception under auditory feedback perturbations on fundamental frequencies. The Journal of the Acoustical Society of Japan (E), 15, 201–202.
Larson, C. R., Burnett, T. A., Kiran, S., & Hain, T. C. (2000). Effects of pitch-shift velocity on voice F0 responses. The Journal of the Acoustical Society of America, 107, 559–564.
MacDonald, E. N., Goldberg, R., & Munhall, K. G. (2010). Compensations in response to real-time formant perturbations of different magnitudes. The Journal of the Acoustical Society of America, 127, 1059–1068. doi:10.1121/1.3278606
MacDonald, E. N., Purcell, D. W., & Munhall, K. G. (2011). Probing the independence of formant control using altered auditory feedback. The Journal of the Acoustical Society of America, 129, 955–965. doi:10.1121/1.3531932
Mitsuya, T., MacDonald, E., & Munhall, K. (2011, June). Cross categorical temporal feedback in English voicing contrasts. Paper presented at the 9th International Seminar on Speech Production, Montreal, Québec, Canada.
Moulines, E., & Charpentier, F. (1990). Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication, 9, 453–467.
Purcell, D. W., & Munhall, K. G. (2006). Adaptive control of vowel formant frequency: Evidence from real-time formant manipulation. The Journal of the Acoustical Society of America, 120, 966–977. doi:10.1121/1.2217714
Rochet-Capellan, A., Richer, L., & Ostry, D. J. (2012). Nonhomogeneous transfer reveals specificity in speech motor learning. Journal of Neurophysiology, 107, 1711–1717. doi:10.1152/jn.00773.2011
Stuart, A., Kalinowski, J., Rastatter, M. P., & Lynch, K. (2002). Effect of delayed auditory feedback on normal speakers at two speech rates. The Journal of the Acoustical Society of America, 111, 2237–2241. doi:10.1121/1.1466868
Villacorta, V. M., Perkell, J. S., & Guenther, F. H. (2007). Sensorimotor adaptation to feedback perturbations of vowel acoustics and its relation to perception. The Journal of the Acoustical Society of America, 122, 2306–2319. doi:10.1121/1.2773966


Copyright of Journal of Speech, Language & Hearing Research is the property of American Speech-Language-Hearing Association and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use.
