Choi et al.: JASA Express Letters

[http://dx.doi.org/10.1121/1.4829059]

Published Online 8 November 2013

Perception and production of English stops by tonal and non-tonal Korean dialect speakers

Tae-Hwan Choi, Gyung-Ho Kim, and Jeong-Im Han a)
Department of English, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul, 143-701, Korea
[email protected], [email protected], [email protected]

Abstract: This study examines whether relative weightings of voice onset time and onset F0 in Korean tonal vs non-tonal dialects affect the production and perception of English voiced and voiceless stops. Following Shultz et al. [(2012). J. Acoust. Soc. Am. 132, EL95–EL101], discriminant function analysis and logistic regression were conducted to calculate each speaker’s relative weightings of these two cues in the production of target words and the labeling of the synthesized tokens according to these cues, respectively. The results demonstrated that the acquisition of second language (L2) contrasts is influenced by native language dialects, and production and perception are not developed in parallel in L2 acquisition.
© 2013 Acoustical Society of America

PACS numbers: 43.70.Mn, 43.71.Ft [AL]
Date Received: September 4, 2013
Date Accepted: October 23, 2013

1. Introduction

The present study examines whether differences in the weighting of specific acoustic properties in native language (L1) dialects affect the extent to which acoustic properties are used in a second language (L2). Specifically, this study focuses on the production and perception of English voiced and voiceless stops by two groups of Korean dialect speakers: those speaking Kyungsang and those speaking Seoul Korean (standard Korean). Korean is widely known to have a three-way laryngeal contrast among voiceless stops, i.e., lenis, fortis, and aspirated stops, and voice onset time (VOT) and onset F0 have been identified as key acoustic correlates (Cho et al., 2002).

However, the relative importance of these two acoustic cues in the phonetic manifestation of the Korean stop contrast has been shifting in recent years. In the mid-1900s, VOT was identified as a key acoustic correlate for each stop category: fortis stops manifested as short VOTs, aspirated stops manifested as long VOTs, and lenis stops manifested as intermediate VOTs. Although older speakers still maintain clear VOT distinctions between lenis and aspirated stops, younger speakers tend to neutralize those differences (Silva, 2006). This change in VOT patterns indicates that younger speakers likely use differences in the onset F0 as the primary acoustic cue to differentiate between Korean stops; the mean F0 of the vowel onset following a lenis stop is significantly lower than that after an aspirated or fortis stop.

However, these descriptions generally fit the Seoul dialect (standard Korean). The Kyungsang dialect of Korean, spoken in the southeastern part of the Korean peninsula, shows somewhat different patterns in the use of these cues because it preserves the pitch contrast. In this dialect, pitch is primarily used to cue lexical tone contrast, and the use of pitch for the laryngeal contrast may be more constrained than that in non-tonal dialects.
Consequently, Kyungsang speakers primarily use VOT, whereas Seoul dialect speakers use both VOT and F0 to distinguish between the three stops (Lee and Jongman, 2012).

a) Author to whom correspondence should be addressed.

J. Acoust. Soc. Am. 134 (6), December 2013


Given such dialectal differences, this study explores whether Kyungsang and Seoul Korean speakers make distinct use of these two cues to signal word-initial voiced vs voiceless stops in English. For English stops in initial position, VOT is typically considered the most salient cue, and onset F0 is a less-employed, secondary, redundant cue (Lisker and Abramson, 1964; Haggard et al., 1981).

2. Method

2.1 Participants

Forty native Korean speakers (20 Kyungsang, 20 Seoul Korean) participated in the study (mean age = 22.5 years, range = 18–28 years for Kyungsang speakers; mean age = 23.6 years, range = 19–27 years for Seoul speakers). Only male speakers were recruited to manifest the dialectal effects on L2 acquisition more clearly. The experiment was conducted in Seoul, the area for standard Korean, when the Kyungsang dialect speakers moved to Seoul to enter college. Previous sociolinguistic studies have shown that men are less influenced by the social stigma directed against nonstandard forms, whereas women are likely to respond to the overt prestige associated with the standard forms. All speakers in each dialect group had lived and been educated in the target dialect area, and most of their parents spoke the same target dialect (97% for the Kyungsang dialect speakers and 90% for the Seoul dialect speakers). All speakers in each dialect group began to study English after the critical period, and none had lived or studied in English-speaking countries for more than eight months. English proficiency was comparable between these two groups of speakers; the mean scores of the paper-based TOEFL practice test (listening section) were 166.2 (sd = 15.9) out of 226 for the Kyungsang speaker group and 171.8 (sd = 14.9) out of 226 for the Seoul speaker group, which were not significantly different [t(38) = 1.16, p > 0.05].
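As a rough consistency check on the proficiency comparison above, the t statistic can be recomputed from the reported summary statistics. The sketch below assumes a pooled-variance (Student's) two-sample t-test with 20 speakers per group; the text does not state which test was used, and the function name `pooled_t` is illustrative only.

```python
import math


def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Student's t statistic and degrees of freedom from summary statistics.

    Assumption: equal-variance (pooled) two-sample t-test.
    """
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_b - mean_a) / se, df


# Reported TOEFL listening scores: Kyungsang 166.2 (sd 15.9), Seoul 171.8 (sd 14.9).
t_stat, df = pooled_t(166.2, 15.9, 20, 171.8, 14.9, 20)
print(f"t({df}) = {t_stat:.2f}")  # prints: t(38) = 1.15
```

Under these assumptions the result is close to the reported t(38) = 1.16; the small gap is plausibly due to rounding in the reported means and standard deviations.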
2.2 Speech production

2.2.1 Speech materials

The stimuli consisted of a set of 64 English words (16 target words and 48 fillers). The target words were all monosyllabic real words with bilabial voiceless and voiced initial stops (pea/bee, peak/beak, Pete/beat, peach/beach, pig/big, pit/bit, pin/bin, pill/bill). All words had a CV(C) structure with the high front vowels /i/ or /ɪ/ so that the stimuli were similar across the production and perception tasks. Only real words were used to avoid the difficulty of eliciting nonsense syllables, and most of them had relatively high lexical frequency.

2.2.2 Recordings and acoustic measurements

Each participant was recorded in a sound-proof booth using a Tascam HD-P2 solid-state recorder and a Shure KSM 44 microphone. During the recording, individual PowerPoint files with frame sentences (“The word is ____”) were shown to the participants through a window in the sound-proof booth. Each sentence was displayed at a regular rate and the subject was instructed to read each sentence shown. Each participant read the randomized sentences twice, and the second recording was used for analysis. The recorded material was sampled at 44 100 Hz with 16-bit quantization. The VOT and F0 values of the target stops were measured using Praat (Boersma and Weenink, 2012). For subsequent statistical analyses, frequency values for each participant were converted from Hz to semitones relative to that participant’s mean onset F0 across all measured utterances (Shultz et al., 2012). This conversion was performed to express the relative distance above/below the speaker’s mean onset F0 on a logarithmic scale.

2.3 Speech perception

2.3.1 Stimuli

Stimuli were derived from a single CV syllable created by imposing naturally produced, but resynthesized, burst and aspiration noise onto a synthetic vowel. Many “pea” and “bee” examples were produced by a male native speaker of American English (Columbus, OH) in the frame sentence “I say ____.” The mean VOT value for the /p/ tokens was 62.5 ms (40–82 ms) and the onset F0 was 143.2 Hz (126.3–159.8 Hz), whereas the mean VOT for the /b/ tokens was 5.1 ms (4–6 ms) and the onset F0 was 120.4 Hz (109.6–130.7 Hz). A voiceless token (/pi/) was selected as the natural base token for synthesis. Then, using the PSOLA (Pitch Synchronous Overlap and Add) algorithm, the aspiration noise and F0 values from this token were manipulated along a two-dimensional continuum of VOT and onset F0 to give rise to approximately equal numbers of “pea” and “bee” percepts. To make the VOT-related cues of the base word ambiguous, VOT was compressed in eight steps (60 ms, 50 ms, 40 ms, 30 ms, 20 ms, 10 ms, 5 ms, 0 ms). Sixty-four different syllables were generated from eight base tokens with the specific VOT values by fully crossing eight levels of onset F0 (110 Hz, 120 Hz, 130 Hz, 140 Hz, 150 Hz, 160 Hz, 170 Hz, and 180 Hz). The minimum and maximum F0 values were based on the range of F0 in the production of /pi/ and /bi/ by the above speaker. Each token began at a specific onset F0 value, fell or rose over the course of the first 80 ms following the vowel onset, and ended at a point corresponding to 119 Hz (the mean F0 value of the original token at this point). All other properties remained the same across all tokens.

2.3.2 Task

The stimuli were presented binaurally in randomized order via Sennheiser HD-590 headphones. Stimulus presentation and response collection were controlled by SuperLab Pro (Cedrus). Each listener was asked to hear a single syllable and identify it as either “pea” or “bee” by clicking on one of two buttons designated with the appropriate letters. Participants completed a total of 256 trials (8 levels of VOT × 8 levels of F0 perturbation × 4 repetitions) in a randomized stimulus order.

3. Results

Two speakers (one Kyungsang, one Seoul) showed high error rates due to mispronunciations of the words (9% and 14% of the entire dataset, respectively), and thus their data were not included in the analysis.

3.1 Production

Following Shultz et al. (2012), the relative weightings of VOT and onset F0 in production were computed using discriminant function analysis for each of the participants, which provided standardized canonical coefficients, with a larger coefficient denoting a stronger weighting for a variable. Discriminant analysis was chosen over logistic regression because of the comparatively low number of tokens per subject in production and the resulting difficulty of accurately estimating logistic regression coefficients.

First, the mean canonical coefficients of the VOT and onset F0 were examined for Kyungsang and Seoul dialect speakers to determine which of the two acoustic cues was more important in production. For Kyungsang dialect speakers, the mean coefficients of the VOT and the onset F0 were 0.60 and 0.73, respectively, which were not significantly different [t(18) = 1.21, p > 0.05]. For Seoul dialect speakers, the mean coefficient of the VOT variable was 0.62 and that of the onset F0 was 0.88, which were significantly different [t(18) = 0.293, p < 0.05].

Second, to compare individuals’ weightings of VOT and onset F0 in production, the total-sample canonical coefficients of production for both VOT and onset F0 were calculated for each speaker, as plotted in Fig. 1. The two panels in Fig. 1 show a discrepancy between Kyungsang and Seoul dialect speakers. The Kyungsang speakers showed a significant negative Pearson’s product moment correlation between VOT and onset F0, r(19) = −0.592, p < 0.05, whereas the Seoul speakers did not show any significant correlation, r(19) = 0.357, p > 0.05.
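The production analysis described above (semitone conversion of onset F0, followed by a per-speaker two-group discriminant analysis) can be illustrated with a minimal sketch. It assumes Fisher's linear discriminant with a pooled within-group covariance matrix and standardizes the coefficients by the pooled within-group standard deviations, which resembles but may not exactly match the statistical package the authors used. The function names and the token values are hypothetical and are not the study's data.

```python
import math
from statistics import mean


def hz_to_semitones(f0_hz, ref_hz):
    """Distance of f0_hz from ref_hz in semitones (12 per octave, log scale)."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def cue_weights(tokens):
    """Standardized discriminant coefficients (absolute values) for VOT and onset F0.

    tokens: list of (label, vot_ms, f0_semitones) with label "p" or "b".
    """
    groups = {"p": [], "b": []}
    for label, vot, f0 in tokens:
        groups[label].append((vot, f0))
    means = {g: (mean(x for x, _ in v), mean(y for _, y in v))
             for g, v in groups.items()}
    # Pooled within-group covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sxy = syy = 0.0
    for g, v in groups.items():
        mx, my = means[g]
        for x, y in v:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
    df = len(tokens) - 2
    sxx, sxy, syy = sxx / df, sxy / df, syy / df
    # Fisher direction w = S^-1 (mean_p - mean_b), using the closed-form 2x2 inverse.
    dx = means["p"][0] - means["b"][0]
    dy = means["p"][1] - means["b"][1]
    det = sxx * syy - sxy ** 2
    w_vot = (syy * dx - sxy * dy) / det
    w_f0 = (sxx * dy - sxy * dx) / det
    # Scale so the discriminant scores have unit pooled within-group variance,
    # then multiply each raw coefficient by its cue's pooled within-group SD.
    scale = math.sqrt(w_vot ** 2 * sxx + 2 * w_vot * w_f0 * sxy + w_f0 ** 2 * syy)
    return (abs(w_vot) / scale * math.sqrt(sxx),
            abs(w_f0) / scale * math.sqrt(syy))


# Hypothetical tokens for one speaker: (label, VOT in ms, onset F0 in Hz).
raw = [("p", 62, 150), ("p", 70, 146), ("p", 55, 152), ("p", 80, 149),
       ("p", 66, 155), ("p", 59, 147), ("p", 74, 151), ("p", 68, 153),
       ("b", 12, 128), ("b", 8, 132), ("b", 15, 126), ("b", 10, 130),
       ("b", 20, 125), ("b", 9, 131), ("b", 14, 129), ("b", 11, 127)]
speaker_mean_f0 = mean(f0 for _, _, f0 in raw)
tokens = [(lab, vot, hz_to_semitones(f0, speaker_mean_f0)) for lab, vot, f0 in raw]
c_vot, c_f0 = cue_weights(tokens)
print(f"VOT coefficient: {c_vot:.2f}, onset F0 coefficient: {c_f0:.2f}")
```

Running the same function once per speaker would give one (VOT, onset F0) coefficient pair per participant, which is the kind of per-speaker value that the group means and the correlations in this section summarize.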


Fig. 1. Total-sample canonical coefficients across VOT and onset F0 for each participant: (a) for Kyungsang dialect speakers and (b) for Seoul dialect speakers. Dark lines are linear regression lines.
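The across-speaker relationship summarized in each panel (a Pearson correlation and a least-squares regression line over per-speaker coefficient pairs) can be sketched as follows. The coefficient values are invented for illustration and are not taken from the figure; `pearson_and_line` is a hypothetical helper name.

```python
import math


def pearson_and_line(xs, ys):
    """Return (r, slope, intercept) for paired per-speaker coefficients."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)   # Pearson product-moment correlation
    slope = sxy / sxx                # least-squares slope of y on x
    return r, slope, my - slope * mx


# Hypothetical coefficients for five speakers (not the study's data).
vot_coef = [0.9, 0.7, 0.6, 0.4, 0.3]
f0_coef = [0.3, 0.5, 0.7, 0.8, 0.9]
r, slope, intercept = pearson_and_line(vot_coef, f0_coef)
print(f"r = {r:.3f}, line: F0 = {slope:.2f} * VOT + {intercept:.2f}")
```

A negative r with a downward-sloping line, as in this toy example, corresponds to the trade-off pattern in which speakers with larger VOT coefficients have smaller onset F0 coefficients.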

3.2 Perception

Logistic regression analysis was used to calculate the β weights for the perception data. The mean β weights of the VOT and onset F0 were 0.73 and 0.31 for Kyungsang listeners and 0.71 and 0.31 for Seoul listeners; in both groups the two weights were significantly different [t(18) = 0.67, p < 0.001 for Kyungsang listeners; t(18) = 10.87, p < 0.001 for Seoul listeners]. Figure 2 plots the β weights of the VOT and onset F0 cues for each participant in both the Kyungsang and Seoul groups. Unlike the production results, there was no significant correlation between these two cues for either group: r(19) = 0.414, p > 0.05 for Kyungsang listeners; r(19) = 0.147, p > 0.05 for Seoul listeners.

Fig. 2. Total-sample β weights across VOT and onset F0 for each participant: (a) for Kyungsang dialect speakers and (b) for Seoul dialect speakers. Dark lines are linear regression lines.

4. Discussion

The aim of the present study was to test the hypothesis that relative weighting of specific acoustic properties in L1 dialects affects the extent to which these acoustic properties are used in L2. As expected, in production, Kyungsang and Seoul Korean speakers made distinct use of the two cues, VOT and onset F0, to signal stops in English. Members of both groups were primarily users of onset F0, presumably because of native language influence, but Kyungsang dialect speakers made less use of onset F0 than Seoul speakers. This finding may be attributed to the fact that, in the Kyungsang dialect, pitch is primarily used for cuing lexical tone contrast, and consequently, the use of pitch for laryngeal contrast appears to be more constrained than in the nontonal Seoul dialect (Francis et al., 2006). The observation of a significantly negative correlation between VOT and onset F0 for Kyungsang, but not for Seoul, speakers suggests that when the Kyungsang speakers used both acoustic cues in producing English voiceless-voiced stops, as indicated by the mean values of coefficients, those who emphasized the VOT cues deemphasized the onset F0 cues, and those who emphasized the onset F0 cues deemphasized the VOT. This result is in good agreement with the hypothesis that VOT and onset F0 are in a trade-off relationship such that speakers can increase the weighting of VOT and decrease the weighting of onset F0 or vice versa and still maintain the same relative “b” or “p” quality for a sound. However, Seoul speakers showed very high weightings for the onset F0, as suggested by the magnitude of the onset F0 coefficients, compared with those of the VOT and, thus, there were no correlations between these two cues.

In the perception of the English voicing contrast, however, the two groups of listeners showed a similar pattern in the use of the two acoustic cues. Listeners of both dialects were overwhelmingly VOT users in perception; both groups gave greater weight to VOT than to onset F0 in the perception of English voiced-voiceless stop distinctions. Namely, some Korean listeners were better listeners, sensitive to the acoustic cues provided, whereas other Korean listeners were less so, regardless of the L1 dialect.

The present study reveals nonparallel results between perception and production. Differences between Kyungsang and Seoul dialect speakers in their use of the two acoustic cues to English voiced and voiceless stops were observed in production, but not in perception. The present results are contrary to recent findings that show a positive and even strong correlation between production and perception in second language acquisition (e.g., Bradlow and Pisoni, 1997). There may be several reasons for the discrepancy between the results of previous studies and those of the present study.
First, it has already been noted that in contrast to the necessary use of onset F0 as well as VOT in making a three-way contrast among Korean stop consonants, VOT is sufficient in making a voiced-voiceless contrast in English stops. The onset F0 is shown to be a secondary, redundant acoustic cue to the two-way contrast between English stops. Notably, the F0 difference between English stops is relatively subtle compared with that between Korean stops. The F0 difference for English stops has been shown to be approximately 15–20 Hz, whereas Korean lenis and aspirated stops show a difference of more than 50 Hz, which extends to the steady state of vowels immediately following target consonants (Chang, 2012). This cross-linguistic difference in the amount of F0 used for stop contrast suggests that Korean learners of English were somewhat insensitive to the F0 differences that the English stops showed in the perception task.

Alternatively, many researchers in the field of L2 acquisition acknowledge that children’s abilities to produce and perceive L1 segmental contrasts develop in parallel through early childhood, but that the same kind of parallelism between perception and production does not exist in L2 acquisition (Flege, 1999). Generally, it is accepted that the ability to perceive L2 contrasts may develop more rapidly, or to a greater extent, than the ability to pronounce L2 contrasts. The present results are in good agreement with this view in that in perception both dialect listeners relied heavily on VOT as native English listeners do, whereas the native language dialects greatly influenced the manner in which Korean speakers produced the English voiced and voiceless stops. This discrepancy may be directly related to the notion that the correlation between perception and production is limited to proficient L2 learners of a language (Flege, 1999).

In conclusion, the present study reveals that (i) the acquisition of L2 contrasts is influenced by native language dialects and (ii) production and perception are not developed in parallel in L2 acquisition.

Acknowledgments

This paper was supported by Konkuk University in 2011. We are grateful to Anders Lofquist for comments and suggestions and Kun-Young Shin for his help with statistics.

Boersma, P., and Weenink, D. (2012). “Praat: Doing phonetics by computer [computer program], 5.3.23,” http://www.praat.org/ (August 8, 2012).
Bradlow, A., and Pisoni, D. (1997). “Training Japanese listeners to identify English /r/ and /l/: IV. Some effects of perceptual learning on speech production,” J. Acoust. Soc. Am. 101(4), 2299–2310.
Chang, C. B. (2012). “Rapid and multifaceted effects of second-language learning on first-language speech production,” J. Phonetics 40, 249–268.
Cho, T., Jun, S.-A., and Ladefoged, P. (2002). “Acoustic and aerodynamic correlates of Korean stops and fricatives,” J. Phonetics 30, 198–228.
Flege, J. E. (1999). “Second language acquisition and the critical period hypothesis,” in Second Language Acquisition and the Critical Period Hypothesis, edited by D. Birdsong (Lawrence Erlbaum, Hillsdale, NJ), pp. 101–132.
Francis, A. L., Ciocca, V., Wong, V. K. M., and Chan, J. K. L. (2006). “Is fundamental frequency a cue to aspiration in initial stops?,” J. Acoust. Soc. Am. 120(5), 2884–2895.
Haggard, M. P., Summerfield, Q., and Roberts, M. (1981). “Psychoacoustical and cultural determinants of phoneme boundaries: Evidence from trading F0 cues in the voiced-voiceless distinction,” J. Phonetics 9, 49–62.
Lee, H., and Jongman, A. (2012). “Effects of tone on the three-way laryngeal distinction in Korean: An acoustic and aerodynamic comparison of the Seoul and South Kyungsang dialects,” J. Int. Phonetic Assoc. 42(2), 145–169.
Lisker, L., and Abramson, A. S. (1964). “A cross-language study of voicing in initial stops: Acoustical measurements,” Word 20, 384–422.
Shultz, A. A., Francis, A. L., and Llanos, F. (2012). “Differential cue weighting in perception and production of consonant voicing,” J. Acoust. Soc. Am. 132, EL95–EL101.
Silva, D. J. (2006). “Acoustic evidence for the emergence of tonal contrast in contemporary Korean,” Phonology 23, 287–308.
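Supplementary sketch: the perception analysis in Sec. 3.2 (per-listener logistic regression over the 8 × 8 VOT by onset F0 grid) can be illustrated as follows. Several things here are assumptions rather than details from the paper: the predictors are z-scored before fitting, the model is fitted by plain batch gradient ascent, the absolute coefficients are rescaled to sum to 1 to give relative weights, and the listener's responses are simulated with a seeded random generator.

```python
import math
import random
from statistics import mean, pstdev

VOT_MS = [0, 5, 10, 20, 30, 40, 50, 60]           # VOT steps from Sec. 2.3.1
F0_HZ = [110, 120, 130, 140, 150, 160, 170, 180]  # onset F0 steps from Sec. 2.3.1


def zscore(values):
    """Standardize values to zero mean and unit (population) SD."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def fit_logistic(x1, x2, y, lr=0.5, steps=2000):
    """Fit P(y=1) = sigmoid(b0 + b1*x1 + b2*x2) by batch gradient ascent."""
    b0 = b1 = b2 = 0.0
    n = len(y)
    for _ in range(steps):
        g0 = g1 = g2 = 0.0
        for a, b, t in zip(x1, x2, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * a + b2 * b)))
            g0 += t - p
            g1 += (t - p) * a
            g2 += (t - p) * b
        b0 += lr * g0 / n
        b1 += lr * g1 / n
        b2 += lr * g2 / n
    return b0, b1, b2


# 8 VOT levels x 8 F0 levels x 4 repetitions = 256 trials.
trials = [(v, f) for v in VOT_MS for f in F0_HZ for _ in range(4)]
z_vot = zscore([v for v, _ in trials])
z_f0 = zscore([f for _, f in trials])

# Simulated listener who relies mostly on VOT (generating weights 3.0 vs 1.0);
# a response of 1 stands for "pea", 0 for "bee".
rng = random.Random(0)
resp = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(3.0 * a + 1.0 * b))) else 0
        for a, b in zip(z_vot, z_f0)]

_, b_vot, b_f0 = fit_logistic(z_vot, z_f0, resp)
total = abs(b_vot) + abs(b_f0)
w_vot, w_f0 = abs(b_vot) / total, abs(b_f0) / total
print(f"VOT weight: {w_vot:.2f}, onset F0 weight: {w_f0:.2f}")
```

For a simulated VOT-dominant listener such as this one, the fitted VOT weight comes out larger than the onset F0 weight, which mirrors the direction of the group means reported in Sec. 3.2.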


