JSLHR

Research Article

Cues for Lexical Tone Perception in Children: Acoustic Correlates and Phonetic Context Effects

Xiuli Tong,a Catherine McBride,b and Denis Burnhamc

Purpose: The authors investigated the effects of acoustic cues (i.e., pitch height, pitch contour, and pitch onset and offset) and phonetic context cues (i.e., syllable onsets and rimes) on lexical tone perception in Cantonese-speaking children.
Method: Eight minimal pairs of tonal contrasts were presented either in an identical phonetic context or in different phonetic contexts (different syllable onsets and rimes). Children were instructed to engage in tone identification and tone discrimination.
Results: Cantonese children attended to pitch onset in perceiving similarly contoured tones and attended to pitch contour in perceiving different-contoured tones. Tone discrimination accuracy decreased across phonetic contexts, with tone perception being easiest for same rime–different syllable onset, more difficult for different rime–same syllable onset, and most difficult for different rime–different syllable onset phonetic contexts. This pattern was observed in tonal contrasts in which the member tones had the same contour but not in ones in which the member tones had different contours.
Conclusion: These findings suggest that in addition to pitch contour, the pitch onset is another important acoustic cue for tone perception. The relative importance of acoustic cues for tone perception is phonetically context dependent. These findings are discussed with reference to a newly modified TRACE model for tone languages (TTRACE).

Key Words: TTRACE, tone perception, acoustic features, phonetic context, tonal constituency, tonal integrity, Cantonese-speaking children

Current models of speech perception have described how acoustic and phonetic information can be used to map onto phonemic categories (such as Merge: Norris, McQueen, & Cutler, 2000; Shortlist: Norris, 1994; Episodic: Goldinger, 1996), and how certain linguistic knowledge may (in the form of expectation) influence the weighting of specific acoustic cues in the perception of segmental sounds (such as TRACE: McClelland, Mirman, & Holt, 2006; C-Cure: McMurray & Jongman, 2011). However, these models lack emphasis on the suprasegmental features of speech, and, in particular, lexical tones, because they have been developed on the basis of phonological features of nontonal languages, which account for only 30% of the world’s languages (the remaining 70% of the world’s languages are tonal languages, such as Cantonese; Yip, 2002). Although there has been some work examining the acoustic correlates of lexical tone perception (e.g., Ciocca & Lui, 2003; Gandour, 1981), little has been done to examine the effect of context on weighting acoustic cues for perception of lexical tones, especially for children’s tone perception. Thus, in the present study, we aimed to examine the specific acoustic cues necessary for lexical tone perception and the phonetic context effects of tone perception in children.

a University of Hong Kong, Hong Kong SAR, China
b Chinese University of Hong Kong
c MARCS Institute, University of Western Sydney, New South Wales, Australia
Correspondence to Xiuli Tong: [email protected]

Editor: Jody Kreiman
Associate Editor: Alex Francis
Received June 5, 2013
Revision received December 3, 2013
Accepted March 24, 2014
DOI: 10.1044/2014_JSLHR-S-13-0145

Disclosure: The authors have declared that no competing interests existed at the time of publication.

Journal of Speech, Language, and Hearing Research • Vol. 57 • 1589–1605 • October 2014 • © American Speech-Language-Hearing Association

Lexical Tones: Where Do Tones Fit in Our Understanding of Speech Perception?

Suprasegmental features refer to acoustic properties of speech (i.e., fundamental frequency [F0], duration, and amplitude) that extend beyond one segment unit, such as stress patterns in English and lexical tones in Chinese (e.g., Cutler & Chen, 1997; L. Lee & Nusbaum, 1993). Lexical tones are pitch patterns that are used to convey differences in the meanings of words with identical phonetic segments.

There are six contrastive tones in Cantonese1: high level (55, T1), high rising (25, T2), midlevel (33, T3), low falling (21, T4), low rising (23, T5), and low level (22, T6). The tone inventory in Cantonese can be classified on the basis of phonological features that are primarily linked to F0 (e.g., Bauer & Benedict, 1997). The two diagrams in the Appendix show the F0 patterns of the six Cantonese tones on the syllables /ji/ and /fu/ produced by the same female Cantonese speaker. Acoustically, these six tones can be quantified in terms of different types of acoustic parameters or features such as pitch height (i.e., the average F0), pitch contour (i.e., the change in the overall slope of the F0 over time), and the onset and offset values of F0 (e.g., Gandour, 1981, 1983). Functionally, these six tones are lexically contrastive. In Cantonese, one monosyllable can have six separate meanings, depending on which of the six tones it carries, as with the syllables /ji/ and /fu/. The six tones of the syllable /ji/ yield /ji55/ (clothing), /ji25/ (chair), /ji33/ (the first character of spaghetti), /ji21/ (son), /ji23/ (ear), and /ji22/ (two), whereas the six tones of the syllable /fu/ yield /fu55/ (skin), /fu25/ (tiger), /fu33/ (trousers), /fu21/ (symbol), /fu23/ (woman), and /fu22/ (father). Of central interest here is whether some of these acoustic features (F0 height, F0 contour, the onset and offset of F0) must be extracted from speech for children’s perceptual distinction of Cantonese tones. In addition, given that consonants and vowels are tone-bearing units, where one tone can occur across different syllables, such as /ji/ and /fu/, another important question naturally arises as to whether the segmental context (the group of phonemes, or syllable, onto which tone is affixed) influences the weighting of these acoustic cues in tone perception. We focus on these two aspects in the present study.
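The tone labels above can be read directly off the two-digit tone numerals, where each digit is a relative pitch level from 1 (lowest) to 5 (highest). The following is a minimal sketch, not part of the study: the function names `contour` and `describe` are hypothetical, and the numerals are symbolic levels, not F0 values in Hz.

```python
# Tone numerals for the six Cantonese tones named in the text:
# (label, starting level, ending level), with 1 = lowest and 5 = highest.
CANTONESE_TONES = {
    "T1": ("high level", 5, 5),
    "T2": ("high rising", 2, 5),
    "T3": ("midlevel", 3, 3),
    "T4": ("low falling", 2, 1),
    "T5": ("low rising", 2, 3),
    "T6": ("low level", 2, 2),
}


def contour(start, end):
    """Classify the contour from the starting and ending levels."""
    if end > start:
        return "rising"
    if end < start:
        return "falling"
    return "level"


def describe(tone_id):
    """Return the label, numeral, and derived contour for one tone."""
    label, start, end = CANTONESE_TONES[tone_id]
    return {
        "label": label,
        "numeral": f"{start}{end}",
        "contour": contour(start, end),
    }


if __name__ == "__main__":
    for tone_id in CANTONESE_TONES:
        print(tone_id, describe(tone_id))
```

The derived contour agrees with each tone's name, which is all this sketch is meant to show; actual F0 onset and offset values have to be measured from recordings.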

Footnote 1: Cantonese is spoken in Hong Kong, Macau, and the province of Guangdong in south China. Traditionally, Cantonese has nine tones in six distinct pitch contours. The other three tones are the short tones having the same F0 values as high level, midlevel, and low level. Only six tones are labeled in the transcription scheme from the Linguistic Society of Hong Kong (2002). Chao (1930) first transcribed lexical tone in a numerical notational system by using five levels (from lowest, 1, to highest, 5) to describe relative height, shape, and duration of pitch contour. High, middle, and low are the descriptions of pitch ranges. Level, rising, and falling are the specifications of the slopes of the pitch contours.

The TRACE model of speech perception provides one source of inspiration and understanding of perceptual cues and context effects for tone perception. In the TRACE model (McClelland & Elman, 1986), there are three processing levels: features, phonemes, and words. Features are defined as encoding dimensions of sound (e.g., consonantal, vocalic, diffuseness, acuteness, voicing, power, burst). Phonemes and words are the linguistic units representing speech elements. The identity of a phoneme is determined by both acoustic features and lexical context. Moreover, this process is interactive in nature because it involves the bidirectional information flow, “both bottom-up (features to
phonemes to words) and top-down (words to phonemes to features)” (McClelland et al., 2006, p. 365). Although the interactive TRACE model provides a dynamic structure accounting for the interaction between prelexical (acoustic) and lexical (context) information in the perception of phonemes, the TRACE model, like other models of speech perception, was originally developed from those 30% of the world’s languages that are nontonal. Lexical tones are relatively ignored in the TRACE model. Therefore, there is a need to examine exactly where lexical tone fits in our understanding of the interactive processes of speech perception described in the TRACE model. Indeed, a modified version of the TRACE model has been proposed for accounting for the access of lexical tone information in processing spoken Mandarin (Ye & Connine, 1999). In this modification, the concept of the toneme is added as a separate level, parallel to the phoneme level in the TRACE model. Moreover, as with phoneme identification, there are interactive processes involved in tone processing because “the degree of toneme (and phoneme) activation is based on the goodness of the input signal and lexical feedback connections where the level of lexical feedback is a function of its activation” (Ye & Connine, 1999, p. 619). In addition, it is assumed that the amount of top-down feedback from words to tonemes may be larger, compared with that from words to phonemes. Despite the assumption that interactive processes occur in lexical tone processing, these interactive processes have been conceptually defined and operationally examined in the interaction between tonemes and lexical context only. We do not know whether any interaction occurred between acoustic features and tonemes, or between tonemes and phonemes. In addition, the acoustic features necessary for lexical tone perception have not been specified. 
Thus, in the present study, we move one step further by examining the acoustic features that are essential for tone perception and the context effect of tone perception (the interaction between toneme and phoneme) in Cantonese-speaking children.

Lexical Tone Perception: Which Acoustic Cues Are Necessary for Tone Perception?

To date, there is a mixed picture as to which acoustic cues are necessary or sufficient for Cantonese children to distinguish tones, as there is no study to date that evaluates the relative importance of such cues in tone perception. Some early studies of Cantonese children’s tone identification indicated that pitch height and pitch contour are two important acoustic cues used by Cantonese children in making tonal distinctions (e.g., Ching, 1984, 1988). Another study suggests that a “generic” description of the difference in pitch height and contour cannot adequately explain 3-year-olds’ tone perception performance (K. Y. S. Lee, Chiu, & van Hasselt, 2002). Specifically, K. Y. S. Lee et al. (2002) found that children’s performance on the high level (55)–high rising (25) contrast was comparable to performance on the high level (55)–low falling
(21) contrast. Among these three tonal contrasts, the worst performance was on the high rising (25)–low falling (21) contrast. The results further demonstrated that the two member tones for the contrasts of high level (55)–high rising (25) and high level (55)–low falling (21) differed greatly in F0 onset (ranging from 70 to 90 Hz), but that there was a smaller difference in F0 onset (within 4Hz) between high rising (25)–low falling (21). There was no such pattern of dissimilarity in F0 offset among these three tone contrast pairs. Thus, K. Y. S. Lee et al. postulated that the relative perceptual salience of tonal contrasts may be determined by the discrepancy among F0 onset values; that is, Cantonese children attend more to F0 onset than offset in tone identification tasks. Recent studies of the development of tone perception suggest that the relative perceptual salience of Cantonese tones is determined by both contour and frequency range (indexed by the mean frequency difference of two members for a tonal contrast; e.g., Ciocca & Lui, 2003; A. M.-Y. Wong, Ciocca, & Yung, 2009). For example, Ciocca and Lui (2003) found that Cantonese adults and 4-, 6-, and 10-year-old children all performed less well with the midlevel (33)–low level (22) and high rising (25)–low rising (23) contrasts relative to other tonal contrasts. Ciocca and Lui argued that the difficulty in perceiving these two tonal contrasts was due to the similarity in contour of the members of each contrast as well as the smaller frequency range between the pairs. A. M.-Y. Wong et al. (2009) obtained similar results in a study that investigated lexical tone perception in 5-year-old Cantonese children with and without specific language impairment. 
They found that children without specific language impairment showed lower scores for the midlevel (33)–low level (22) and high rising (25)–low rising (23) contrasts, whereas children with specific language impairment demonstrated specific weaknesses in identifying tones in the low rising (23)–low level (22) and low falling (21)–low rising (23) contrasts. Although these recent studies by Ciocca and colleagues (Ciocca & Lui, 2003; A. M.-Y. Wong et al., 2009) provide robust developmental evidence of the effect of acoustic features on tone perception, their acoustic analysis focused on the quantification of the degree of differentiation among lexical tones based on broader acoustic parameters, such as contour and frequency range. These may not be sensitive enough to capture the difference between tones because frequency range only provides “crude estimates of the amount of pitch movement across a syllable” (Cutler & Otake, 1998, p. 1880). Moreover, the frequency range is used to explain difficulties in perceiving similarly contoured tones in Ciocca and Lui’s (2003) study, and this may not be appropriate for tone pairs that differ in contour. K. Y. S. Lee et al. (2002) once hypothesized that an F0 onset difference may better account for the perceptual pattern observed in young Cantonese children’s tone identification task than other broader acoustic parameters such as pitch height and pitch contours. However, that particular study was limited in scope, with only three tone contrasts included: high level (55)–high rising (25), high level (55)–low falling (21), and high rising (25)–low falling (21). This made it impossible to compare the relative importance of different acoustic parameters, including pitch height, pitch contour, pitch onset, and pitch offset. For example, it did not include any level tone contrast such as midlevel (33)–low level (22), despite the fact that this particular contrast has been consistently considered difficult for Cantonese children to distinguish (e.g., Ciocca & Lui, 2003). Moreover, K. Y. S. Lee et al. used different segments to carry a given tone contrast; thus, the children did not complete the same items. As interitem variability might have affected the children’s sensitivity to different tonal contrasts, it is not possible to determine from K. Y. S. Lee and colleagues’ data whether the F0 onset difference is a better acoustic cue than pitch height, pitch contour, or offset when determining the perceptual saliency of a tone contrast. If F0 onset is relatively more important than the other acoustic parameters, including pitch height, pitch contour, and offset, then we expect children to attend more to F0 onset when perceiving different tonal contrasts. We tested this hypothesis in Experiment 1.

Tonemes and Phonemes: Are There Context Effects in Tone Perception?

The Cantonese lexical syllable (i.e., lexical morpheme) comprises three components: syllable onset (consonant), rime (vowel), and tone. Empirical studies of Cantonese children’s tone perception to date have largely focused on using identical phonetic segments (consonant and vowel phonemes) to carry tonal contrasts. For example, they have either used the same single syllable to carry different tones and form different tone contrasts (e.g., Ching, 1984; Ciocca & Lui, 2003), or they have used multiple syllables to carry different tone contrasts, each tone contrast having identical phonetic segments (K. Y. S. Lee et al., 2002). However, due to the small inventory of Cantonese tones, each tone generally occurs in different phonetic contexts, which may not be carried by the exact same phonemes, as in the case of /ji55/ and /fu55/. Questions naturally arise as to whether the phonetic context (i.e., the onsets and rimes of the tonal syllable) bears any influence on children’s perception of tone, and if so, in what ways. Cross-language studies with adult listeners suggested that consonant, vowel, and tone interact with one another to a certain degree in processing spoken Mandarin syllables (e.g., L. Lee & Nusbaum, 1993; Repp & Lin, 1990; Tong, Francis, & Gandour, 2008), and that there were significant differences in the amount of interaction effects among these three phonological units (Tong et al., 2008). For example, using a speeded classification paradigm (Garner, 1974), L. Lee and Nusbaum (1993) asked both Mandarin Chinese and English listeners to selectively attend to and classify /ba/ and /da/ syllables according to one target dimension (consonant, Mandarin tone, or non-Mandarin constant pitch) while ignoring irrelevant variation along the nontarget dimension.
They found that both Chinese and English listeners show mutual integration of consonant and Mandarin tones (i.e., Mandarin tone judgments are interfered with by an irrelevant change in consonant, and vice versa). In contrast, Repp and Lin (1990) found that there is an asymmetric interaction between vowel and Mandarin tones (i.e., the tone is more interfered with by the vowel than vice versa) for Chinese listeners, but not English listeners. These two studies provide evidence that segmentals (consonants and vowels) and suprasegmentals (lexical tones) are integrally processed by Chinese listeners. However, in comparison, the difference between the dimensional interactions involving lexical tones (consonant × tone vs. vowel × tone) in these two studies may lead to the question of whether vowel and consonant exert different effects on tone processing. In fact, Tong et al. (2008) expanded these previous studies by directly comparing the number of interaction effects in three dimensions: tone versus consonants, tone versus vowel, and consonant versus vowel of three phonological units in Mandarin adult listeners. They found that the degree of integration between two paired dimensions (tones vs. vowels and consonants vs. vowels) was higher than the one between tone and consonant. Moreover, the interaction effects were asymmetrical: tone perception was more affected by vowel and consonant changes than vice versa; vowel perception was less affected by consonant and tone changes than vice versa. According to Tong et al., the information values of the three phonological units (i.e., rimes, syllable onsets, and tones) are different: rimes are more informative than consonants, and consonants are more informative than tones, i.e., rime > syllable onset > tone. 
Given that listeners attend more to phonological units that are more informative as compared with those that are less informative, the relative attention weight assigned to rimes, syllable onsets, and tones likely decreases respectively. This may partly account for the asymmetrical integration. Also, acoustically, the steady-state of the vowel formant persists longer than the formant transition cues of consonant and F0 contours of tones. Tong et al. further postulated that asymmetrical dimension dependencies “likely depend on multiple factors including acoustic cues, information value, and context” (2008, p. 706). Findings from the representative research on the interactions between the three phonological units of Chinese syllables (i.e., syllable onset, rime, and tone) with Chinese adult listeners led us to hypothesize that variations in the degree of similarity of segmental phonetic contexts (syllable onset and rimes) may exert an impact on Cantonese children’s perception of lexical tone. We also expect that there would be an interaction between acoustic cues and phonetic context in children’s Cantonese tone perception. These two hypotheses were tested in Experiment 2.

The Present Study

To recapitulate, we examine two questions related to the issue of Cantonese lexical tone perception in children: (a) To what extent do different acoustic factors, including pitch height, contour, and pitch onset and offset, influence young Cantonese children’s tone perception? (b) Does the phonetic context (i.e., the onsets and rimes of tonal syllable) influence children’s perception of tone, and if so, in what ways? Experiment 1 focused on the first aspect by investigating Cantonese children’s tone perception and by conducting a fine-grained analysis of the F0s (height, contour, onset and offset of F0) of multiple tonal contrasts, carried across two different monosyllables /ji/ and /fu/. Experiment 2 addressed the second aspect by examining Cantonese children’s perception of tonal contrasts that vary in acoustic cues across different phonetic contexts.

Experiment 1

Method

Participants. Participants were 180 Cantonese children ages 5 to 6 years (mean age = 5;10 [years;months]). Empirical data demonstrated that children in this age range are able to identify six contrastive tones (So & Dodd, 1995; To, Cheung, & McLeod, 2013), but their tone perception skills are still developing and have not yet reached adult-like ceiling performance (Ching, 1984; Ciocca & Lui, 2003). Also, there is a relative paucity of data concerning 5-year-old children’s perception of Cantonese lexical tones. The present study provides reliable and valid data on the acoustic correlates of lexical tone among lexically developing Cantonese children. The participants were recruited from six local kindergartens in the New Territories, Kowloon, and Hong Kong Island, the three major parts of Hong Kong. All were typically developing children without any symptoms of hearing deficits or delay in language development, according to school and parental reports. Parents or caregivers of participating children all gave informed consent via a consent form with a protocol approved by the Survey and Behavioral Research Ethics Committee of The Chinese University of Hong Kong, and they were given coupons equivalent to US$6.00 for participation.

Materials. The core stimuli consisted of two sets of eight minimal pairs of tonal contrasts, made up of six Cantonese tones carried by the monosyllables /ji/ and /fu/. The syllables /ji/ and /fu/ were selected as target syllables as they appear across all six tones and represent concrete objects or concepts. The six tones of the syllable /ji/ represented /ji55/ (clothing), /ji25/ (chair), /ji33/ (the first character of spaghetti), /ji21/ (son), /ji23/ (ear), and /ji22/ (two), whereas the six tones of the syllable /fu/ included /fu55/ (skin), /fu25/ (tiger), /fu33/ (trousers), /fu21/ (symbol), /fu23/ (woman), and /fu22/ (father). These two sets of six tones were used to form eight minimal tonal contrasts. The eight minimal pairs of tonal contrasts are midlevel (33)–low level (22), high rising (25)–low rising (23), high level (55)–midlevel (33), high level (55)–low level (22), low rising (23)–low level (22), low falling (21)–low level (22), low falling (21)–low rising (23), and high level (55)–high rising (25).
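Six tones allow 15 pairwise contrasts, of which the eight listed above were used. The sketch below enumerates all pairs and sorts the eight study contrasts by whether the two member tones share a contour, judged from the tone numerals alone. It is an illustration only: the variable and function names are hypothetical, and the grouping is derived from the numerals rather than from any measurement.

```python
from itertools import combinations

# Tone numerals for the six Cantonese tones (first digit = start, second = end).
TONES = {"T1": "55", "T2": "25", "T3": "33", "T4": "21", "T5": "23", "T6": "22"}

# The eight contrasts used in the study, in the order listed in the text.
STUDY_CONTRASTS = [
    ("T3", "T6"),  # midlevel (33)-low level (22)
    ("T2", "T5"),  # high rising (25)-low rising (23)
    ("T1", "T3"),  # high level (55)-midlevel (33)
    ("T1", "T6"),  # high level (55)-low level (22)
    ("T5", "T6"),  # low rising (23)-low level (22)
    ("T4", "T6"),  # low falling (21)-low level (22)
    ("T4", "T5"),  # low falling (21)-low rising (23)
    ("T1", "T2"),  # high level (55)-high rising (25)
]


def contour(numeral):
    """Return 'rising', 'falling', or 'level' from a two-digit tone numeral."""
    start, end = int(numeral[0]), int(numeral[1])
    if end > start:
        return "rising"
    if end < start:
        return "falling"
    return "level"


# Every unordered pair of the six tones.
all_pairs = list(combinations(sorted(TONES), 2))

same_contour = [
    pair for pair in STUDY_CONTRASTS
    if contour(TONES[pair[0]]) == contour(TONES[pair[1]])
]
different_contour = [pair for pair in STUDY_CONTRASTS if pair not in same_contour]

if __name__ == "__main__":
    print("possible contrasts:", len(all_pairs))
    print("same contour:", same_contour)
    print("different contour:", different_contour)
```

Under this reading, the eight contrasts divide evenly into four same-contour and four different-contour pairs.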


We limited the number of tone contrasts to eight rather than 15 (the total possible comparisons of all six tones) because we wanted to obtain the most information from the smallest set, given that the contrasts would be used with children (Ciocca & Lui, 2003). Also, these eight particular contrasts were used because they had been successfully used with the same age range of children in a study by Ciocca and Lui (2003), which found that the comparison of the differences in the initial F0 (i.e., 55 vs. 25) or final F0 (23 vs. 25) or both (22 vs. 33) vary systematically. Moreover, findings from an identification task, with two-alternative forced choice tasks that tested children ages 5 to 6 years, showed that different tone contrasts, such as tone 55 versus tone 21, should not be included because they could be easily distinguished even by children at age 2 (K. Y. S. Lee et al., 2002). The stimuli were produced by a female native Cantonese speaker in a sound-treated room. The 12 target words (six each for the syllables /ji/ and /fu/) were presented in random order to the speaker 10 times within a carrier phrase: /ŋ23 wui23 tk22 ___ pɛi25 lɛi23 thɛŋ55/ (“I will read ____ for you to hear”). Target words occurred in the medial position to avoid a tone lowering effect near the end of the sentence (Ciocca & Lui, 2003; Vance, 1976), and they were spoken naturally in the recording. Cool Edit Pro (Version 2.0; Adobe Systems, San Jose, CA) acoustic analysis software was used to inspect the spectrogram and waveform of stimuli. Recordings with disfluencies or loudness instability, and recordings that were unclear or were either too long or too short (i.e., longer than 800 ms or shorter than 200 ms), were excluded. Four recordings of each target word were selected as experimental stimuli. The mean durations of six tokens were 621.19 ms (s = 84.16) and 521.53 ms (s = 110.19), and their mean intensities were 70.21 dB (s = 1.72) and 74.14 dB (s = 1.28), for /ji/ and /fu/, respectively. Two sets of /ji/ and /fu/ stimuli with approximately equal duration and intensity were selected as the final stimuli. These two sets of target words were used to form eight contrasts, and each contrast was repeated three times, resulting in a total of 48 stimuli (8 contrasts × 3 repetitions × 2 tonal syllables). These 48 stimuli were copied onto a minidisc for testing.

Acoustic analysis. The Appendix shows average F0 values at onset and offset for the six tones produced with syllables /ji/ and /fu/. To obtain the F0 onset and offset values, the audio recordings of the 48 stimuli containing either /fu/ or /ji/ in six tones were analyzed with Praat (Boersma & Weenink, 2004). Of these 48 stimuli, the same tonal syllable (e.g., /fu55/) was repeated four times, and the duration of the four utterances was different. In order to minimize the potentially confounding effects of duration, the duration of each syllable was normalized. The onset of the tone was chosen between the second and the fourth pulse cycle of the vowel for the syllable /fu/ and the beginning of the syllable /ji/. The first few cycles were ignored because the pulsation was too weak to be audible (Lisker & Abramson, 1964; Zhu, 2010). Thus, the pitch value
of /fu/ was measured from the start of the vowel of /fu/, whereas that of /ji/ was measured from the start of the syllable. The offset was chosen at the point where the regular waveform disappeared or became irregular. For rising tones, the peak of the pitch contour was chosen because the fall occurring after the peak usually involves a glottal sound, which is nonphonemic (Zhu, 2010). Whenever the pitch value was undefined, the value was obtained by manual measurement. The duration of the pulse cycle was measured and the frequency was calculated. The average F0 values of the onset and offset were also used to measure their differences in different pairs of tonal contrasts. Procedure. Prior to the experimental session, twelve pictures representing the 12 target words, produced with the six tones of the syllables /ji/ and /fu/, were introduced to the participants auditorially. To ensure that the participants knew the target words and their pictorial representations, they were randomly asked to name the pictures. The experiment did not start until the participants could correctly map the 12 target words onto the 12 corresponding pictures. In the testing, participants were presented with two pictures for a given contrastive tone pair and asked to identify the one that corresponded to the word they had heard. There were three practice trials. All instructions and stimuli were recorded in advance and presented over the headphones. All experimental items were presented to all participants in a single session, and the order of stimuli presentation was randomized across participants. All participants were individually tested in a quiet room in the children’s school by trained research assistants, and the entire testing session lasted 15 to 20 min.
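The manual measurement step in the acoustic analysis (measure the duration of a pulse cycle, then calculate the frequency) rests on the relation F0 = 1 / period. The sketch below illustrates that relation together with one possible way of normalizing duration and of taking onset and offset differences between two tones. Everything here is an assumption for illustration: the function names are hypothetical, the text does not say which normalization method was used (linear rescaling to the 0 to 1 range is assumed), and the numbers in the usage lines are made up, not taken from the Appendix.

```python
def f0_from_period(period_ms):
    """Fundamental frequency in Hz from one pulse cycle's duration in ms."""
    if period_ms <= 0:
        raise ValueError("period must be positive")
    return 1000.0 / period_ms


def normalize_time(times_ms):
    """Linearly rescale sample times so the syllable runs from 0.0 to 1.0.

    Assumed method: the source only states that duration was normalized.
    """
    start, end = times_ms[0], times_ms[-1]
    span = end - start
    if span <= 0:
        raise ValueError("times must span a positive duration")
    return [(t - start) / span for t in times_ms]


def onset_offset_difference(tone_a, tone_b):
    """Absolute F0 differences (Hz) at onset and offset between two tones.

    Each tone is given as a (mean onset F0, mean offset F0) pair.
    """
    onset_diff = abs(tone_a[0] - tone_b[0])
    offset_diff = abs(tone_a[1] - tone_b[1])
    return onset_diff, offset_diff


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(f0_from_period(5.0))                      # a 5 ms cycle
    print(normalize_time([100.0, 350.0, 600.0]))    # three sample times
    print(onset_offset_difference((250.0, 245.0), (200.0, 190.0)))
```

Differences computed this way per tone pair are the kind of quantity that can then be compared against identification accuracy for each contrast.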

Results

Table 1. Mean accuracy rates (%) and standard deviations (SD) for all eight tonal contrasts for tonal syllables /ji/ and /fu/.

Contour condition                       Contrast   /ji/ accuracy   /ji/ SD   /fu/ accuracy   /fu/ SD
Same contour, different pitch height    ML–LL      54.63           24.61     61.30           22.61
Same contour, different pitch height    HR–LR      62.78           27.13     76.30           17.44
Same contour, different pitch height    HL–ML      87.22           21.51     92.04           15.51
Same contour, different pitch height    HL–LL      92.96           15.75     79.63           24.26
Different contour, same pitch height    LR–LL      84.81           21.81     51.30           26.92
Different contour, same pitch height    LF–LL      68.33           27.81     74.07           25.54
Different contour, same pitch height    LF–LR      81.48           23.43     77.41           21.91
Different contour, same pitch height    HL–HR      94.81           13.10     94.07           13.26

Note. ML–LL = midlevel (33, T3)–low level (22, T6); HR–LR = high rising (25, T2)–low rising (23, T5); HL–ML = high level (55, T1)–midlevel (33, T3); HL–LL = high level (55, T1)–low level (22, T6); LR–LL = low rising (23, T5)–low level (22, T6); LF–LL = low falling (21, T4)–low level (22, T6); LF–LR = low falling (21, T4)–low rising (23, T5); HL–HR = high level (55, T1)–high rising (25, T2).

Effects of pitch height and contour on tone perception across syllables /ji/ and /fu/. Table 1 shows the mean accuracy rates and standard deviations for all children combined on eight tone contrasts for the syllables /ji/ and /fu/. To examine the effects of pitch height and contour on tone perception in the syllables /ji/ and /fu/, a two-way factorial repeated measures analysis of variance (ANOVA) was conducted, with Pitch Contour (same contour different pitch height vs. different contour same pitch height) and Tonal Syllable (/ji/ vs. /fu/) as within-participant factors. There was a significant main effect of Pitch Contour, F(1, 179) = 7.98, p < .01, ηp² = .04, with children making fewer correct responses in the same contour different pitch height condition than in the different contour same pitch height condition, and a main effect of Tonal Syllable, F(1, 179) = 9.28, p < .01, ηp² = .05, with children having a higher accuracy rate on syllable /ji/ than on syllable /fu/. The interaction between these two factors was also significant, F(1, 179) = 47.43, p < .001, ηp² = .21. To unpack the interaction, we conducted tests for the simple main effect of Pitch Contour, which further revealed that the children made fewer correct responses for contrastive tones having the same contour and different heights than for contrastive tones having different contours and the same height for the syllable /ji/,
F(1, 179) = 39.67, p < .001, h2p = .18. However, the opposite was obtained for the syllable /fu/, such that children made fewer correct responses for contrastive tones having different contours and the same height than for those having the same contour and different heights, F(1, 179) = 8.21, p < .01, h2p = .04. To further examine perceptual variation, we conducted a much more detailed comparison of the children’s tone perception performance on eight tone contrasts for syllable /ji/ and syllable /fu/. For the syllable /ji/, the children performed differently on eight tone contrasts, F(7, 1253) = 89.17, p < .001, h2p = .33. The pairwise comparisons with Bonferroni correction revealed that the mean accuracy for high level (55)–high rising (25) was substantially higher than for any of the other contrasts (ps < .001), except for high level (55)–midlevel (33), p = 1.00. In contrast, the children’s performance on midlevel (33)–low level (22) was lower than for any other contrasts (all ps < .001), except for high rising (25)–low rising (23), p = .06. All the other contrast comparisons were significant (all ps < .001), except for the following contrast comparisons: high rising (25)–low rising (23) versus low falling (21)–low level (22), p = 1.00; high level (55)–midlevel (33) versus high level (55)–low level (22), p = 1.00; high level (55)–midlevel (33) versus low rising (23)–low level (22), p = 1.00; high level (55)–midlevel (33) versus low falling (21)–low rising (23), p = .50; and low rising (23)–low level (22) versus low falling (21)–low rising (23), p = 1.00. For the syllable /fu/, the mean accuracy for the eight tone contrasts differed significantly, F(7, 1253) = 86.43, p < .001, h2p = .33. The pairwise comparisons with Bonferroni correction revealed that the mean accuracy for high level (55)–high rising (25) was substantially higher than any other contrasts (all ps < .001), except for high level (55)–midlevel (33), p = 1.00. 
In contrast, the children’s performance on low rising (23)–low level (22) was lower than on any of the other contrasts (all ps < .001). The mean accuracy of midlevel (33)–low level (22) was lower than that of any other contrast (all ps < .001), except for that of low rising (23)–low level (22), p < .01. No significant difference was found in the comparison of the following contrasts: high level (55)–low level (22) versus high rising (25)–low rising (23), p = 1.00; high rising (25)–low rising (23) versus low falling (21)–low level (22), p = 1.00; high rising (25)–low rising (23) versus low falling (21)–low rising (23), p = 1.00; high level (55)–low level (22) versus low falling (21)–low level (22), p = .59; high level (55)–low level (22) versus low falling (21)–low rising (23), p = 1.00; and low falling (21)–low level (22) versus low falling (21)–low rising (23), p = 1.00. All other comparisons were significant (all ps < .001).

Effects of pitch onset and offset on tone perception across the syllables /ji/ and /fu/. Figure 1a and 1b depict the differences in F0 values at onset and offset for the eight tone contrasts for syllables /ji/ and /fu/, respectively. As described above, these differences were calculated on the basis of the mean onset and offset of the F0 of each tone, as shown in the Appendix.

The left sides of Figure 1a and 1b display the onset and offset differences of four tonal contrasts, in which two member tones have the same contour. For the syllable /ji/, the onset differences of the four tone contrasts were T3–T6: 6.17 Hz, T2–T5: 12.97 Hz, T1–T3: 46.43 Hz, and T1–T6: 52.61 Hz, and their offset differences were T3–T6: 21.10 Hz, T2–T5: 57 Hz, T1–T3: 48.75 Hz, and T1–T6: 69.84 Hz. The corresponding response accuracies of the children for these four tone contrasts varied markedly (54.63%, 62.78%, 87.22%, and 92.96%, respectively). For the syllable /fu/, the onset differences of the four tone contrasts were T3–T6: 18.13 Hz, T2–T5: 12.60 Hz, T1–T3: 48.38 Hz, and T1–T6: 66.51 Hz, and their offset differences were T3–T6: 26.95 Hz, T2–T5: 69.06 Hz, T1–T3: 62.75 Hz, and T1–T6: 89.71 Hz. The children’s response accuracies for the four tone contrasts (61.3%, 76.3%, 92.04%, and 79.63%, respectively) indicate that children’s tone perception performance was much more consistent with the pattern of differences in F0 values at onset than at offset for these four tone contrasts across syllables. Correlation analyses between the onset and offset differences of the tone contrasts and children’s tone identification performance further revealed that pitch onset difference was significantly correlated with children’s tone identification performance (r = .80, p < .05), whereas no significant correlation was found between pitch offset difference and tone identification.

The right sides of Figure 1a and 1b display the onset and offset differences of four tonal contrasts, in which two member tones have different contours, for the syllables /ji/ and /fu/, respectively. For the syllable /ji/, the onset differences of the four tone contrasts were T4–T6: 21.24 Hz, T4–T5: 7.54 Hz, T5–T6: 13.70 Hz, and T1–T2: 79.28 Hz, and their offset differences were T4–T6: 70.16 Hz, T4–T5: 109.78 Hz, T5–T6: 39.62 Hz, and T1–T2: 26.78 Hz. As for the children’s response accuracy regarding these four tone contrasts (68.33%, 81.48%, 84.81%, and 94.31%), neither the onset difference nor the offset difference was related to the tone response accuracy. For the syllable /fu/, the onset differences of the four tone contrasts were T4–T6: 12.27 Hz, T4–T5: 11.60 Hz, T5–T6: 23.87 Hz, and T1–T2: 102.99 Hz, and their offset differences were T4–T6: 52.91 Hz, T4–T5: 90.37 Hz, T5–T6: 37.47 Hz, and T1–T2: 16.82 Hz. As for the children’s response accuracy regarding these four tone contrasts (74.07%, 77.41%, 51.30%, and 94.07%), neither the onset difference nor the offset difference matched the tone response accuracy. The correlational analyses between pitch onset and offset differences and children’s tone identification performance further confirmed that neither the onset nor the offset differences were significantly correlated with children’s tone identification performance (both ps > .05).
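The onset and offset correlations for the same-contour contrasts can be sketched from the values reported above. The text does not state how the data points were combined, so this sketch assumes that the four same-contour contrasts from /ji/ and the four from /fu/ were pooled into eight points. The helper `pearson_r` and the variable names are illustrative.

```python
from math import sqrt

# F0 differences (Hz) and accuracy (%) for the same-contour contrasts
# T3-T6, T2-T5, T1-T3, T1-T6, as reported in the text for each syllable.
DATA = {
    "ji": {
        "onset": [6.17, 12.97, 46.43, 52.61],
        "offset": [21.10, 57.00, 48.75, 69.84],
        "accuracy": [54.63, 62.78, 87.22, 92.96],
    },
    "fu": {
        "onset": [18.13, 12.60, 48.38, 66.51],
        "offset": [26.95, 69.06, 62.75, 89.71],
        "accuracy": [61.3, 76.3, 92.04, 79.63],
    },
}


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def pooled(key):
    """Concatenate one measure across both syllables (a pooling assumption)."""
    return DATA["ji"][key] + DATA["fu"][key]


r_onset = pearson_r(pooled("onset"), pooled("accuracy"))
r_offset = pearson_r(pooled("offset"), pooled("accuracy"))

# With n = 8 (df = 6), the two-tailed .05 critical value of r is about .707.
print(f"onset r = {r_onset:.2f}, offset r = {r_offset:.2f}")
```

Under the pooling assumption, the onset correlation comes out close to the reported r = .80, and the offset correlation falls below the .05 critical value, which is consistent with the pattern described in the text. Whether the original analysis pooled the data in this way is not confirmed by the source.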

Discussion

Cantonese children achieved an equally high or low level of accuracy in perceiving some contrastive tones with the same contour and different heights, as well as in perceiving some contrastive tones with different contours

Tong et al.: Acoustic Correlates and Phonetic Context Effects


Figure 1. Average F0 values at onset and offset for the eight contrasts of (a) the syllable /ji/ and (b) the syllable /fu/. The left panels contain contrasts with same contour, whereas the right panels have contrasts with different contours.

and the same height, for example, T1–T3 versus T1–T2 or T2–T5 versus T4–T6. Moreover, the degree of F0 onset difference correlated with children’s perceptual performance on similarly contoured tones. These results suggested that Cantonese children attend to the onset F0 for the perception of similarly contoured tones but use pitch contour to distinguish different-contoured tones.


One important finding is that Cantonese children were able to use F0 onset as a cue to distinguish similarly contoured tones, which supports an observation made in previous studies that Cantonese talkers were able to minimize the influence of consonant aspiration on the F0 of the following vowel within 10 ms (Francis, Ciocca, Wong, & Chan, 2006). Francis et al. (2006) explained that “talkers

Journal of Speech, Language, and Hearing Research • Vol. 57 • 1589–1605 • October 2014


are apparently able to restrict the production of consonant-related perturbations of onset F0 to the first few tens of milliseconds following the onset of voicing, presumably in order to maintain the integrity of the F0 contour as a cue to lexical tone identity” (p. 2894). However, the present result further clarifies that it may be done specifically in order

to preserve subtle differences in onset F0 itself for tonal distinction. Another finding from this experiment is that children’s perceptual performance for some tone contrasts varied significantly across the syllables /ji/ and /fu/, indicating the possible context effect of lexical tone perception. For


example, the mean accuracy rate of the low rising (23)–low level (22) contrast was 84.81% for the syllable /ji/, whereas the perceptual accuracy was only 54.3% for the syllable /fu/. There are at least two plausible explanations. One is that a difference between /ji/ and /fu/ in the frequency of occurrence of the morphemes representing the tone contrast may contribute to the perceptual difference. However, data from the Cantonese corpus by Weizman, Fletcher, and Ma (2000), which includes 70 children ages 2 to 5.5 years, show that no occurrence was found for either the syllable /ji/ or /fu/ with this tone contrast. This indicates that children are equally unfamiliar with the four words represented by /ji23/–/ji22/ and /fu23/–/fu22/. Therefore, it seems less likely that a difference in the occurrence frequency of these representative morphemes is the source of the perceptual difference in this tone contrast observed between /ji/ and /fu/. Another plausible explanation relates to the interaction among the three phonological units of the Cantonese syllable (syllable onset, rime, and tone), where the phonemes or tone-bearing units modify the acoustic features of lexical tones and consequently influence the perceptual salience of certain tone contrasts. The syllable /ji/ consists of the voiced palatal approximant /j/ and the front unrounded vowel /i/. By contrast, the syllable /fu/ is made up of the voiceless labiodental fricative /f/ and the back rounded vowel /u/. In terms of articulation features, producing /ji/ involves less mouth movement relative to /fu/. Additionally, as the syllable /ji/ is a combination of a semivowel and vowel, it is pronounced almost as if there is no onset, whereas /fu/ has a clear onset.
A recent large-scale study of 2- to 12-year-old Cantonese children’s acquisition of Cantonese sounds suggested that Cantonese children acquire /i/ and /u/ by the age of 2;6, /j/ by the age of 3, and /f/ by the age of 4, and that all Cantonese children acquire the six tones by the age of 2;6 (To et al., 2013). This excludes the possibility that the perceptual difference observed in tone perception may be due to differences in the development of phonetic segment perception. It is likely that the tone perceptual difference observed between syllables /ji/ and /fu/ arises from the interaction of phonetic segments and lexical tones. This hypothesis has not yet been tested in previous research. Thus, we tested it in Experiment 2. In doing so, we systematically varied the phonetic contexts of the eight tone contrasts used in Experiment 1 and compared children’s performance on the same tone contrasts across different phonetic contexts.

Experiment 2

Method

Participants. Eighty Cantonese children (37 girls, Mage = 11;0, SD = 4 months) participated in this experiment. All children were recruited from Hong Kong Cantonese-medium primary schools. They were native Cantonese speakers with no symptoms of hearing deficits or language


disorders, according to parents’ and teachers’ reports. Written consent was obtained from the parents of participating children, and parents were given coupons equivalent to US$6.00 for participation. This age group was chosen because the children were expected to have attained adult levels of performance in identifying the six tones (Ciocca & Lui, 2003). This allowed us to control for the potential confounding effect of differences in the age of acquisition of the six tones. Moreover, none of the previous studies tested 11-year-olds, and we have very little knowledge of the development of 11-year-old Cantonese children’s tone perception skills. In addition, the oddity task was used to avoid a ceiling effect of tone identification, as shown by 10-year-old Cantonese children in a previous study (Ciocca & Lui, 2003). This oddity task is challenging for young children because it not only requires children to have tacit knowledge of tones, but also demands that the aurally presented tonal syllables be stored and processed (e.g., Burnham et al., 2011). Thus, testing 11-year-olds who have developed a robust representation of the six tones enables us to obtain reliable and valid responses.

Materials. We selected 24 sets of four tonal syllables carrying the six Cantonese tones, for a total of 96 monosyllabic words (24 × 4 = 96 words). The four words for each set of stimuli represented common concrete objects or concepts, and they were approximately matched on rated familiarity, frequency of occurrence in spoken Cantonese, syllabic structure, and consonant and vowel combinations. Three of the four words in each set of stimuli had the same tone, whereas the other had a different tone.
These 24 sets of stimuli consisted of three phonetic contexts: eight sets of four words consisting of the same rimes, different syllable onsets (SRDO, hereafter), e.g., /gɐu25/ (dog), /tsɐu25/ (wine), /lɐu23/ (willow), /hɐu25/ (mouth); eight sets of four words sharing the same syllable onsets, different rimes (SODR, hereafter), e.g., /san55/ (mountain), /sœŋ55/ (box), /sy22/ (tree), /sɔ55/ (comb); and eight sets of four words that had different rimes, different syllable onsets (DRDO, hereafter), e.g., /sɵn33/ (letter), /hɔk22/ (crane), /tsin33/ (arrow), /kœk33/ (feet). The eight sets of four tonal syllables for each phonetic context represented the eight minimal tone contrasts that were used in Experiment 1: T3–T6, T2–T5, T1–T3, T1–T6, T5–T6, T4–T6, T4–T5, and T1–T2. All the words were illustrated with pictures.

Design. A 3 (phonetic contexts) × 8 (tone contrasts) design was employed with phonetic context (SRDO, SODR, DRDO) and tone contrasts (T1–T3, T1–T6, T3–T6, T2–T5, T1–T2, T5–T6, T4–T6, T4–T5) as factors. Each phonetic context consisted of eight tone contrasts that were characterized as having the same contour contrasts (T1–T3, T1–T6, T3–T6, and T2–T5) and different contour contrasts (T1–T2, T5–T6, T4–T6, and T4–T5). The sequences of the phonetic contexts were counterbalanced across participants. One third of the participants were administered the phonetic context condition of SRDO first, one third of the participants were tested under


the phonetic context condition of SODR first, and the other one third of participants were administered the phonetic context condition of DRDO first. Each of the eight tone contrasts was presented in random order for a given phonetic context. Both phonetic context and tone contrasts were within-subject factors, and all participants completed all 24 trials.

Procedure. There were two phases: training and testing. First, participants were presented auditorily with the six Cantonese tones illustrated with pictures. They were then asked to reproduce the words that were mapped onto the pictures, which were presented in random order. The participants were instructed to speak the words aloud, which allowed the experimenter to correct them if necessary. If a participant’s production of a given tonal syllable was incorrect, the experimenter introduced the six tones again with examples. The participants then received instructions on the testing procedure until they were able to produce all six Cantonese tones. The two sets of six Cantonese tones were illustrated by the monosyllables /ji/ and /fu/, and the training phase took approximately 2 min for each participant. Next, participants were given a practice trial for the testing procedure. In the practice trial, the four tonal syllables, i.e., /jɐn21/ (person), /jɐu21/ (oil), /jy21/ (fish), /tsy55/ (pig), were presented auditorily to the participants with four corresponding pictures. The participants were told that the three tonal syllables /jɐn21/ (person), /jɐu21/ (oil), /jy21/ (fish) had the same low-falling (21) tone, but that the tonal syllable /tsy55/ (pig) had a high-level (55) tone that was different from the other three. Participants had to choose the picture /tsy55/ (pig) that represented the word with a different tone. The testing started when participants were ready to proceed.
During the testing phase, participants were asked to choose the picture representing the monosyllable word that had a different tone compared with the other three. It took approximately 20 min to complete the whole test.
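The design and the oddity procedure described above can be sketched as follows. The text specifies only which phonetic context each third of the participants received first, so the cyclic rotation used here to order the remaining two contexts is an assumption. The function names are illustrative, and the practice items are represented by their English glosses and tone numerals.

```python
import random

CONTEXTS = ["SRDO", "SODR", "DRDO"]
CONTRASTS = ["T1-T3", "T1-T6", "T3-T6", "T2-T5",
             "T1-T2", "T5-T6", "T4-T6", "T4-T5"]


def block_order(participant_index):
    """Rotate the context list so each context comes first equally often.

    Assumption: the order of the second and third blocks follows the same
    rotation; the source only states which context is administered first.
    """
    shift = participant_index % len(CONTEXTS)
    return CONTEXTS[shift:] + CONTEXTS[:shift]


def session(participant_index, rng):
    """Build one participant's 24 trials: 3 context blocks x 8 contrasts."""
    trials = []
    for context in block_order(participant_index):
        contrasts = CONTRASTS[:]
        rng.shuffle(contrasts)  # random contrast order within a block
        trials.extend((context, contrast) for contrast in contrasts)
    return trials


def odd_one_out(items):
    """Return the gloss of the item whose tone differs from the other three."""
    tones = [tone for _, tone in items]
    for gloss, tone in items:
        if tones.count(tone) == 1:
            return gloss
    raise ValueError("no single odd tone in this set")


# Practice trial from the procedure: three low-falling (21) items, one high-level (55).
practice = [("person", 21), ("oil", 21), ("fish", 21), ("pig", 55)]

trials = session(0, random.Random(0))
print(len(trials))            # 24
print(odd_one_out(practice))  # pig
```

Because 80 participants cannot be split into three equal groups, a rotation like this one yields groups of 27, 27, and 26. How the original study handled the remainder is not stated.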

Results

Phonetic context effect. Mean accuracy rates for the eight tone contrasts in the three phonetic contexts are shown in Figure 2. To examine the effect of phonetic context and the interaction between phonetic context and acoustic features of tone contrasts, a 3 (Phonetic Context: SRDO, SODR, DRDO) × 8 (Tone Contrast: T1–T3, T1–T6, T3–T6, T2–T5, T1–T2, T5–T6, T4–T6, T4–T5) two-way repeated-measures ANOVA was performed on mean accuracy rates, with Phonetic Context and Tone Contrast as within-subject factors. There was a main effect of Phonetic Context, F(2, 158) = 27.35, p < .001, ηp² = .26, with a significant decreasing order of tone discrimination performance, that is, SRDO > SODR > DRDO (ps < .01). There was also a main effect of Tone Contrast, F(7, 553) = 5.23, p < .001, ηp² = .06. The interaction between Phonetic Context and Tone Contrast was also significant, F(14, 1106) = 3.89, p < .001, ηp² = .05.
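The degrees of freedom and effect sizes reported above can be checked with a short calculation. This sketch assumes uncorrected degrees of freedom for a fully within-subject design with 80 participants, and it uses the standard identity that recovers partial eta squared from an F ratio and its degrees of freedom. The function names are illustrative.

```python
def rm_anova_df(n_subjects, *levels):
    """Degrees of freedom for a within-subject effect (no sphericity correction).

    levels: number of levels of each factor involved in the effect,
    e.g. (3,) for a main effect or (3, 8) for a two-way interaction.
    """
    df_effect = 1
    for k in levels:
        df_effect *= k - 1
    df_error = df_effect * (n_subjects - 1)
    return df_effect, df_error


def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from F: F*df1 / (F*df1 + df2)."""
    return f_value * df_effect / (f_value * df_effect + df_error)


# F values as reported for Experiment 2 (N = 80).
effects = {
    "Phonetic Context": (27.35, (3,)),
    "Tone Contrast": (5.23, (8,)),
    "Context x Contrast": (3.89, (3, 8)),
}

for name, (f_value, levels) in effects.items():
    df1, df2 = rm_anova_df(80, *levels)
    eta = partial_eta_squared(f_value, df1, df2)
    print(f"{name}: F({df1}, {df2}) = {f_value}, partial eta squared = {eta:.2f}")
```

The computed degrees of freedom (2 and 158, 7 and 553, 14 and 1106) and effect sizes (.26, .06, .05) agree with the reported values.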

Figure 2. Mean accuracy rates and standard deviations for 11-year-old children for the eight tone contrasts under three phonetic contexts. SRDO = same rimes, different syllable onsets; SODR = same syllable onsets, different rimes; DRDO = different rimes, different syllable onsets.

To unpack the interaction effect, a simple main effect analysis was conducted on tone contrasts for each phonetic context. The main effect of Tone Contrast was not significant in the SODR context, F(7, 553) = 1.44, p = .19, or the DRDO context, F(7, 553) = 1.37, p = .21. However, the effect of Tone Contrast was significant in the SRDO context, F(7, 553) = 13.34, p < .001, ηp² = .14. The full set of pairwise comparisons (with Bonferroni correction) between the eight tone contrasts in the SRDO context is shown in Table 2.

Interaction between phonetic context and acoustic features. Based on Table 2, the significant differences in tone contrast appear to be mostly localized in the comparison between tonal contrasts for which two member tones shared the same contour. This led us to conjecture that the effect of phonetic contexts on tone processing may be constrained by the contour features of contrastive tones. To test this hypothesis, we first calculated the mean percentage correct scores for the same contour tone contrasts by averaging the correct responses to the items for T3–T6, T2–T5, T1–T3, and T1–T6 in each phonetic context. Similarly, correct responses to items of contrastive tones with different contours were averaged, including T5–T6, T4–T6, T4–T5, and T1–T2 in each phonetic context. Next, we submitted the mean accuracy rates for the Same Contour Contrasts and Different Contour Contrasts to a two-way repeated-measures ANOVA with the within-subject factors Contour (Same vs. Different) and Phonetic Context (SRDO, SODR, DRDO). The main effect of Contour was significant, F(1, 79) = 19.25, p < .001, ηp² = .20, with higher accuracy rates observed in contrastive tones with the same contour, compared with those with different contours. The main effect of Phonetic Context was significant, F(2, 158) = 27.35, p < .001, ηp² = .26, with a significant decreasing order of tone performance, that is, SRDO > SODR > DRDO (all ps < .01).
There was also a significant interaction between Contour and Phonetic Context, F(2, 158) = 15.21, p < .001, ηp² = .16. A simple main effect analysis of Phonetic Context on tone contrasts further revealed that the interaction was due to having no


Table 2. Comparison matrix of eight tone contrasts in the same rime, different onset context.

                     Same contour                    Different contour
Contrast             ML–LL   HR–LR   HL–ML   HL–LL   LR–LL   LF–LL   LF–LR   HL–HR
Same contour
  ML–LL
  HR–LR              ns
  HL–ML              ns      ns
  HL–LL              ns      ns      ns
Different contour
  LR–LL              ***     ***     ***     ***
  LF–LL              **      ns      **      **      ns
  LF–LR              **      ns      **      **      ns      ns
  HL–HR              ns      ns      ns      ns      *       ns      ns

*p < .05. **p < .01. ***p < .001. ns = nonsignificant.

significant effect of phonetic context in the perception of tone contrasts with different contours, F(2, 158) = 2.16, p = .12, whereas there was a significant effect of Phonetic Context on tone contrasts with the same contour, F(2, 158) = 38.59, p < .001, ηp² = .33, where the children made more correct responses to the same contour contrasts in the SRDO context than in the contexts of SODR (p < .001) and DRDO (p < .001). They also made more correct responses in the SODR context than in the DRDO context (p < .05; see Figure 3).

Figure 3. Mean accuracy rates for same contour tone contrasts and different contour tone contrasts under three phonetic contexts.

Discussion

This experiment demonstrated that changing the phonetic context by varying the segmental units of tones (e.g., syllable onsets and rimes) affects the perception of lexical tones, which results in a clear decreasing pattern of tone perception performance across the same rime–different syllable onset, same syllable onset–different rime, and different rime–different syllable onset contexts. Also, the phonetic context had a strong influence on contrastive tones with the same contour, but little influence on contrastive tones with different contours. These results suggest that the perceptual salience of lexical tones was a consequence of the interaction between phonetic segments and acoustic features of lexical tones. The graded effect of phonetic context on tone perception provides direct evidence of the dependencies between segmental and suprasegmental information in lexical tone perception among children, which have previously been demonstrated in a speeded classification paradigm with skilled adult listeners (L. Lee & Nusbaum, 1993; Repp & Lin, 1990). Moreover, the present pattern of results (SRDO > SODR > DRDO) corroborates previous findings that vowels and consonants exert different effects on tone processing (Tong et al., 2008).

The interaction effect suggests that having the same or different rimes and syllable onsets had little to do with the way children perceived and distinguished contrastive tones with different contours. However, it had a strong bearing on the perception of contrastive tones with the same contour. These findings are in line with studies of the categorical perception of Cantonese tones, which show that although the change of F0 height has a significant influence on the categorical perception of level tones (P. C. M. Wong & Diehl, 2003), it has little to do with contour tones (Huang & Holt, 2009). These findings are also partly consistent with a recent study showing that intertalker variations affect the acoustic-phonetic mappings of two contrastive level tones with the same contour but not contrastive tones with different contours (Peng, Zhang, Zheng, Minett, & Wang, 2012).
Although it is clear that variation in either syllable onsets or rimes affects the perceptual saliency of similarly contoured tone contrasts, the underlying mechanism that supports this interaction remains unclear. In particular, we used different consonants and vowels in our stimuli. Because different vowels and consonants are marked by various spectral and temporal features, including rising and falling formant transitions, it is likely that the coexistence


of the backdrop spectral patterns in the syllable onset and rime interferes with tone perception, depending on which consonant or vowel is involved. The acoustic energy level also varies tremendously, depending on the phonetic sound involved. Thus, the phonetic effect observed in the present study can also be explained by general auditory mechanisms such as differential masking effects. Future research will be necessary to systematically manipulate the types of vowels and consonants and investigate whether the phonetic context effect observed is constrained by the acoustic and phonetic characteristics of individual phonetic sounds.

General Discussion

Using a two-alternative forced-choice tone identification task, we found that pitch onset difference matters more than other acoustic correlates in discriminating contrastive tones with the same contour, but contour saliency is sufficient to distinguish contrastive tones with different

contours in identical phonetic contexts. Using a tone oddity task, we demonstrated a graded effect of phonetic context on tone perception, as children’s tone performance decreased with increasing phonetic dissimilarity in the order of SRDO > SODR > DRDO. Moreover, we also found an interaction between phonemes (syllable onsets and rimes) and tones in which the graded effect was evident for the contrastive tones with the same contours, but not for the contrastive tones with different contours. Taking these findings one step further, we tentatively propose a newly modified version of the TRACE model that accounts for the interactive processes of speech perception in tonal languages (referred to as the TTRACE model hereafter), depicted in Figure 4. Similar to the original TRACE model, there are three levels of dynamic structure: features, tonemes-phonemes, and words. However, unlike the previously modified version of the TRACE model, where tonemes were assumed to be a separate level of representation during speech perception (Ye & Connine, 1999), we assume that

Figure 4. The TRACE model for speech perception of tonal languages (TTRACE). Unlike the traditional TRACE model, the TTRACE integrates segmental and suprasegmental dimensions in a distributed network. There are three processing levels: features, tonemes-phonemes, and words. Each level consists of a set of processing units that can be used to distinguish speech sounds from one another. The phonetic properties of a spoken input, for example, the tonal syllable /fu1/ in both segmental and suprasegmental dimensions, are encoded at the feature levels. The degree of activation of each phonological feature can vary from very low (bottommost circle) to very high (topmost circle). Similarly, the contour features are quantified in terms of approximate values of change in slope, varying from very little change (i.e., the bottommost circle for T1 of /fu/) to very much change in slope (i.e., topmost circle). At the phonemes-tonemes level, all syllable onsets, rimes, and tonemes (level tones of T1, T3, T6 and contour tones of T2, T4, T5) are spatially distributed and locally connected with each other. For the syllable /fu/, the phonemes /f/ and /u/ and its toneme T1 are the most strongly activated among all phonemes and tonemes. At the word level, the target word (e.g., /fu1/ ) and several types of competitive nontarget words with different degrees of phonological similarities are activated in a graded manner. In the TTRACE model, the relative activation strength at the word level is symbolized by the thickness of curved dotted lines. The arrows between levels are bidirectional, indicating the interaction between top-down and bottom-up processing. The TTRACE is developed and modified from the TRACE model in “Are There Interactive Processes in Speech Perception?” by J. L. McClelland, D. Mirman, and L. L. Holt, 2006, Trends in Cognitive Sciences, 10, p. 364. Copyright (2006), reprinted with permission from Elsevier.


phonemes and tonemes are processed integrally during speech perception; thus, there is an integral representation of tonemes and phonemes in the TTRACE model (see Figure 4). Moreover, the TTRACE model incorporates phonological features that encode both segmental dimensions (power, vocalic, diffuse, acute, consonantal, voiced, burst) and suprasegmental dimensions (contour, height, onset, and offset) of speech sounds that can be used to distinguish meanings of words from one another. None of the previous models have included suprasegmental features relevant to speech perception. In addition, unlike previous TRACE models, the presumption of the interword interaction at the word level is fine-grained in the TTRACE model. That is, the magnitude of activation of words in a speech perception task reflects the degree of segmental and suprasegmental overlap among words. In other words, the strength of the activation of words is determined by the similarity between target and nontarget words in terms of different segmental and suprasegmental dimensions. For example, as depicted in Figure 4, the target word is /fu1/ (skin). There are several types of competitive nontarget words with different degrees of phonological similarity: (a) sharing the same vowel and tone: /wu1/ (black), (b) sharing the same consonant and vowel: /fu6/ (father), (c) sharing the same consonant and tone: /fa1/ (flower), (d) sharing the same vowel only: /ku2/ (ancient), (e) sharing the same consonant only: /fɔ2/ (fire), (f) sharing the same tone only: /tɔ1/ (many), (g) no similarity (in terms of phonemes and tonemes): /si3/ (try), /sɛ5/ (society), /la3/ (crack). We assume that the strength of activation among these competitive nontarget words is displayed on at least three distinct levels. Types (a), (b), and (c) are the most strongly activated because they share two of the three aspects (consonant, vowel, and tone) with the target word.
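The graded overlap among the competitor types (a) to (g) can be illustrated by counting how many of the three aspects (consonant, vowel, tone) each competitor shares with the target /fu1/. This is only a sketch of the overlap count, not of the activation dynamics of a TRACE-style network. Equal weighting of the three aspects and the helper name `shared_aspects` are assumptions, and the text itself notes that finer distinctions within a tier depend on the particular phonemes and tones involved.

```python
# Target word /fu1/ as (consonant, vowel, tone).
TARGET = ("f", "u", "1")

# Competitors from the text, keyed by English gloss.
# Assumption: the vowel of "fire" and "many" is written here as "ɔ";
# any vowel other than "u" gives the same overlap counts.
COMPETITORS = {
    "black": ("w", "u", "1"),    # (a) same vowel and tone
    "father": ("f", "u", "6"),   # (b) same consonant and vowel
    "flower": ("f", "a", "1"),   # (c) same consonant and tone
    "ancient": ("k", "u", "2"),  # (d) same vowel only
    "fire": ("f", "ɔ", "2"),     # (e) same consonant only
    "many": ("t", "ɔ", "1"),     # (f) same tone only
    "try": ("s", "i", "3"),      # (g) no overlap
    "society": ("s", "ɛ", "5"),  # (g) no overlap
    "crack": ("l", "a", "3"),    # (g) no overlap
}


def shared_aspects(word, target=TARGET):
    """Count the aspects (consonant, vowel, tone) a word shares with the target."""
    return sum(a == b for a, b in zip(word, target))


# Group competitors into tiers by the number of shared aspects.
tiers = {}
for gloss, word in COMPETITORS.items():
    tiers.setdefault(shared_aspects(word), []).append(gloss)

for count in sorted(tiers, reverse=True):
    print(count, tiers[count])
```

The count reproduces the three tiers described in the text: two shared aspects for types (a) to (c), one for types (d) to (f), and none for type (g).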
Types (d), (e), and (f) may form the second most strongly activated batch of words because they share one aspect with the target. Type (g) should display the least activation because it has no similarity with the target. The activation strength is symbolized by the thickness of curved dotted lines. The fine-grained distinction of the within-level activation strength, such as the differentiation of activation strength of (a), (b), and (c), is constrained by the nature or characteristic of the type of individual phonemes and tones involved.

One of the most important aspects of the TTRACE model is the constituency of lexical tones. As noted in the introduction, exactly where lexical tones fit in our understanding of speech perception is unclear. Although Ye and Connine (1999) attempted to incorporate tones into processes of speech perception in the TRACE model, their modification may be too simple to address some of the detailed nature of lexical tone perception. As previously discussed, there is no specification for the acoustic features necessary for lexical tone perception or for the interaction between tonemes and phonemes. Given these limitations of the modified version of the TRACE model (Ye & Connine, 1999), we examined the acoustic features that


are important to tonal distinction and sought to clarify the interaction between toneme and phoneme in the present study. On the basis of our findings, we argue that a complete speech perception model should specify the key phonological features that encode both segmental and suprasegmental dimensions, and tone should be considered as a key structural element of spoken language in the speech perception model. In doing so, the TTRACE postulates that tonemes are integral to the perception of phonemes. Moreover, TTRACE incorporates the acoustic correlates of lexical tones (contour, height, onset, and offset) into the feature information relative to segmental speech perception (power, vocalic, diffuse, acute, consonantal, voiced, burst). Such tonal constituency depicted in the TTRACE model is supported by the results of our two experiments. As shown in Experiment 1, acoustic features of tone contrasts, in particular, pitch contours and the differences of F0 at onset, are integral in determining the relative ease of tone perception when the phonetic segments of tone are constant. The constituency assumption of lexical tones also provides a potential explanation for the findings of diverse studies on the relative importance of different acoustic correlates in tonal distinctions, showing that acoustic information of tone (primarily F0) is a continuum, and all acoustic features (pitch height, contour, onset, and offset) are available and activated with distributed representations. Although it is possible to use different acoustic measures to characterize tone, there is a great variation in the magnitude of variance that different acoustic measures capture for different tonal contrasts. For example, the difference of F0 at onset captures much of the variance of contrastive tones with the same contour, but not for contrastive tones with different contours. Thus, the decision of using rough characterization (pitch height vs. 
pitch contour) or fine-grained characterization (F0 onset, offset, and contour) in previous studies ends up with different findings. In addition, the graded phonetic context effect of tonal distinction (SRDO > SODR > DRDO) displayed in Experiment 2 lends some support to the idea of the constituency of lexical tones in speech perception that the process of distinguishing tones is involved in the recognition and computation of phonetic segments. Another important aspect of the TTRACE model is the integrality of toneme and phoneme. As outlined in the introduction, findings from representative studies of adult listeners’ speech perception suggest the processing dependencies between segmentals and suprasegmentals (e.g., L. Lee & Nusbaum, 1993; Repp & Lin, 1990; Tong et al., 2008). The observation that Cantonese children show the graded phonetic context effect of tonal distinction (SRDO > SODR > DRDO) on similarly contoured tone contrasts but not on differently contoured tone contrasts supports the idea that tonemes and phonemes are integrated and the relative degree of integration is determined by multiple factors, such as acoustic cues, information value, and context (Tong et al., 2008). In other words, the three phonological units (syllable onset, rime, tone) are

Journal of Speech, Language, and Hearing Research • Vol. 57 • 1589–1605 • October 2014

mutually activated in the access to the lexical representation of tones. Thus, the different perceptual patterns of the eight minimal tonal contrasts observed within phonetic contexts (either segmentally identical or segmentally different) in our study reflect the perceptual integration of tonemes and phonemes.

Beyond tonal constituency and tonal integrality, the TTRACE model raises several theoretical questions that merit empirical attention in future research. First, how does top-down lexical knowledge affect tone perception? There is a long-standing debate about whether the process of spoken word recognition is a strictly bottom-up or top-down process (Cutler, Butterfield, & Williams, 1987; Marslen-Wilson, 1984; Norris et al., 2000; Samuel, 1996). Convergent findings from recent studies of spoken-word recognition with different experimental paradigms suggest that the perceptual process is neither strictly bottom-up nor strictly top-down; rather, it is a combination of these two processes (McClelland et al., 2006; McMurray & Jongman, 2011; Samuel, 2001). Given that Chinese tones are lexically contrastive and each tonal syllable is a morpheme, it seems quite plausible that lexical knowledge exerts some influence on tonal distinction. However, there has been little work on how lexical knowledge affects lexical tone distinction. This important question merits further investigation.

Second, to what extent does language experience shape the perceptual systems developed for the perception of lexical tones? Although there is growing evidence of the effect of language experience on tone perception (Gandour, 1983; Peng et al., 2010; Xu, Gandour, & Francis, 2006), there is no conceptual model against which the findings of within-language and cross-linguistic investigations of linguistic pitch perception can be compared and contrasted.
The TTRACE model represents an important preliminary step in resolving this issue by characterizing the interactive processes of speech perception in tonal languages. However, the TTRACE model was developed on the basis of findings on native Cantonese-speaking children’s tone perception. This research will need to be extended to the perception of Cantonese lexical tones by nonnative learners of Cantonese, such as Mandarin-speaking and English-speaking learners. Doing so will shed further light on how linguistic knowledge influences perceptual categorization. In addition, different types of consonants and vowels were involved in the stimuli of the same rime–different syllable onset, different rime–same syllable onset, and different rime–different syllable onset conditions. Given that different types of vowels and consonants have different acoustic properties, this may influence the interaction effect between phonemes and tone,2 which might in part

2 We wish to acknowledge that an anonymous reviewer pointed out this possibility, which opens a new direction for future extension of this line of research.

explain the relatively small effect size of the significant interaction between phonetic context and tone contrast observed in Experiment 2. Future research may consider systematically manipulating the types of shared vowels and consonants to further examine the acoustic factors that contribute to the processing interaction between tones and phonemes in tonal distinction.

Finally, we must keep in mind that the TTRACE model is a preliminary model that characterizes the interactive processes of speech perception in tonal languages by incorporating lexical tones as an integral part of phonemes in speech perception. Like other speech perception models (such as Merge: Norris et al., 2000; Episodic: Goldinger, 1996; TRACE: McClelland et al., 2006; C-Cure: McMurray & Jongman, 2011), the TTRACE model has some limitations. For example, it does not clearly specify the effect of age on speech perception. The current TTRACE model may not adequately address questions such as what developmental trajectory Cantonese-speaking children follow in weighting the various acoustic cues for tone perception. Another issue is how speakers of different language backgrounds, as compared with native speakers, might use the redundant cues differently for tone perception. Furthermore, the current TTRACE model provides only a general framework for modeling the architecture of the speech perception system for tone languages, and it makes no detailed predictions about how the representations of tonemes and phonemes change over time or how tonemes and phonemes compete for limited processing resources during the course of phonological encoding. There is also no detailed quantification of the interactions among phonological features of tonemes and phonemes. In addition, the TTRACE model is proposed on the basis of empirical findings rather than simulation. We are therefore cautious in interpreting the predictions of this model. Future research may consider setting parameters and implementing the model computationally in order to evaluate the predictions of the current TTRACE model.

In conclusion, we have shown that Cantonese children attend to onset F0 in perceiving similarly contoured tones and to F0 contour in perceiving differently contoured tones. There is also a context effect in tone perception: Phonemes and tonemes are interactively processed in the perception of lexical tones. Previous research has mostly focused on listeners’ attention to broader acoustic parameters such as pitch height or contour. In contrast, our study provides initial evidence that F0 onset is another important acoustic cue to consider in models of tone perception, especially with regard to cue weighting. Thus, our newly proposed TTRACE model is a preliminary but promising account of speech perception in tonal languages, and it opens up a new direction for further investigation of the detailed nature of speech perception in these languages.
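As one possible starting point for the simulation approach mentioned above as future work, the following is a minimal sketch of an interactive-activation-style competition among six tone units. It is not the TTRACE model: it omits the phoneme layer and any top-down lexical feedback. The function names and every parameter (`SCALE_HZ`, `w_onset`, `w_contour`, `rate`, `decay`, `inhibition`) are hypothetical values chosen only for illustration. The tone templates are the mean /ji/ onset and offset F0 values as read from the Appendix.

```python
import math

# Mean F0 (Hz) at onset and offset for /ji/, as read from the Appendix.
TEMPLATES = {
    "T1": (254.02, 255.60),
    "T2": (174.75, 282.38),
    "T3": (207.59, 206.86),
    "T4": (180.18, 115.60),
    "T5": (187.71, 225.38),
    "T6": (201.42, 185.76),
}

SCALE_HZ = 30.0  # hypothetical tuning width, not an estimated parameter


def bottom_up_support(onset, offset, w_onset=1.0, w_contour=1.0):
    """Match an input to each tone template on two cues: F0 onset and
    overall F0 change (offset minus onset, a crude stand-in for contour)."""
    support = {}
    for tone, (t_on, t_off) in TEMPLATES.items():
        d_onset = (onset - t_on) / SCALE_HZ
        d_contour = ((offset - onset) - (t_off - t_on)) / SCALE_HZ
        support[tone] = math.exp(-(w_onset * d_onset ** 2 + w_contour * d_contour ** 2))
    return support


def recognize(onset, offset, steps=50, rate=0.1, decay=1.0, inhibition=0.3, **weights):
    """Let tone units compete: each unit is excited by its bottom-up support,
    inhibited by the summed activation of the other units, and decays."""
    support = bottom_up_support(onset, offset, **weights)
    acts = {tone: 0.0 for tone in TEMPLATES}
    for _ in range(steps):
        total = sum(acts.values())
        # Synchronous update: all new activations come from the old state.
        acts = {
            tone: min(1.0, max(0.0, a + rate * (
                support[tone] - inhibition * (total - a) - decay * a)))
            for tone, a in acts.items()
        }
    return acts


if __name__ == "__main__":
    activations = recognize(174.75, 282.38)  # the /ji/ T2 template itself
    print(max(activations, key=activations.get))
```

Presenting a template's own onset and offset leaves that tone as the most active unit. Varying `w_onset` relative to `w_contour` is one way to explore cue weighting in such a toy model; a fuller simulation would need phoneme units, lexical feedback, and parameters fitted to behavioral data.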


Acknowledgments

The authors would like to thank Stephen Man Kit Lee for his drawing of the TTRACE model and his comments and discussion. We would also like to thank former research assistant Keith Leung for his help and Dr. Jack Gandour for his comments on an earlier version of this article. This research was supported, in part, by Seed Grant 201202159006 from the University of Hong Kong to Dr. Xiuli Tong.

References

Bauer, R. S., & Benedict, P. K. (1997). Modern Cantonese phonology. Berlin, Germany: Mouton de Gruyter.
Boersma, P., & Weenink, D. (2004). Praat: Doing phonetics by computer [Computer program]. Retrieved from http://www.fon.hum.uva.nl/praat/
Burnham, D., Kim, J., Davis, C., Ciocca, V., Schoknecht, C., Kasisopa, B., & Luksaneeyanawin, S. (2011). Are tones phones? Journal of Experimental Child Psychology, 108, 693–712.
Chao, Y. R. (1930). A system of tone-letters. Le Maître Phonétique, 45, 24–27.
Ching, T. Y. C. (1984). Lexical tone pattern learning in Cantonese children. Language Learning and Communication, 3, 243–414.
Ching, T. Y. C. (1988). Lexical tone perception by Cantonese deaf children. In I. M. Liu, M. J. Chen, & H. C. Chen (Eds.), Cognitive aspects of the Chinese language (pp. 93–102). Hong Kong: Asian Research Service.
Ciocca, V., & Lui, J. Y. K. (2003). The development of the perception of Cantonese lexical tones. Journal of Multilingual Communication Disorders, 1, 141–147.
Cutler, A., Butterfield, S., & Williams, J. N. (1987). The perceptual integrity of syllabic onsets. Journal of Memory and Language, 26, 406–418.
Cutler, A., & Chen, H. C. (1997). Lexical tone in Cantonese spoken-word processing. Perception & Psychophysics, 59, 165–179.
Cutler, A., & Otake, T. (1998). Pitch accent in spoken-word recognition in Japanese. The Journal of the Acoustical Society of America, 105, 1877–1888.
Francis, A. L., Ciocca, V., Wong, V. K. M., & Chan, J. K. L. (2006). Is fundamental frequency a cue to aspiration in initial stops? The Journal of the Acoustical Society of America, 120, 2884–2895.
Gandour, J. (1981). Perceptual dimensions of tones: Evidence from Cantonese. Journal of Chinese Linguistics, 9, 20–36.
Gandour, J. (1983). Tone perception in Far Eastern languages. Journal of Phonetics, 11, 149–175.
Garner, W. R. (1974). The processing of information and structure. Potomac, MD: Erlbaum.
Goldinger, S. D. (1996). Words and voices: Episodic traces in spoken word identification and recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22, 1166–1183.
Huang, J., & Holt, L. L. (2009). General perceptual contributions to lexical tone normalization. The Journal of the Acoustical Society of America, 125, 3983–3994.
Lee, K. Y. S., Chiu, S. N., & van Hasselt, C. A. (2002). Tone perception ability of Cantonese-speaking children. Language and Speech, 45, 387–406.
Lee, L., & Nusbaum, H. C. (1993). Processing interactions between segmental and suprasegmental information in native speakers of English and Mandarin Chinese. Perception & Psychophysics, 53, 157–165.
Lisker, L., & Abramson, A. S. (1964). A cross-language study of voicing in initial stops: Acoustical measurements. Word, 20, 384–422.
Marslen-Wilson, W. (1984). Function and processing in spoken word recognition: A tutorial review. In H. Bouma & D. G. Bouwhuis (Eds.), Attention and performance X: Control of language processes. Hillsdale, NJ: Erlbaum.
McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1–86.
McClelland, J. L., Mirman, D., & Holt, L. L. (2006). Are there interactive processes in speech perception? Trends in Cognitive Sciences, 10, 363–369.
McMurray, B., & Jongman, A. (2011). What information is necessary for speech categorization? Harnessing variability in the speech signal by integrating cues computed relative to expectations. Psychological Review, 118, 219–246.
Norris, D. (1994). Shortlist: A connectionist model of continuous speech recognition. Cognition, 52, 189–234.
Norris, D., McQueen, J. M., & Cutler, A. (2000). Merging information in speech recognition: Feedback is never necessary. Behavioral and Brain Sciences, 23, 299–325.
Peng, G., Zhang, C., Zheng, H.-Y., Minett, J. W., & Wang, W. S.-Y. (2012). The effect of intertalker variations on acoustic-perceptual mapping in Cantonese and Mandarin tone systems. Journal of Speech, Language, and Hearing Research, 55, 579–595.
Peng, G., Zheng, H.-Y., Gong, T., Yang, R.-X., Kong, J.-P., & Wang, W. S.-Y. (2010). The influence of language experience on categorical perception of pitch contours. Journal of Phonetics, 38, 616–624.
Repp, B. H., & Lin, H.-B. (1990). Integration of segmental and tonal information in speech perception: A cross-linguistic study. Journal of Phonetics, 18, 481–495.
Samuel, A. G. (1996). Does lexical information influence the perceptual restoration of phonemes? Journal of Experimental Psychology: General, 125, 28–51.
Samuel, A. G. (2001). Knowing a word affects the fundamental perception of the sounds within it. Psychological Science, 12, 348–351.
So, L. K. H., & Dodd, B. J. (1995). The acquisition of phonology by Cantonese-speaking children. Journal of Child Language, 22, 473–495.
To, C. K. S., Cheung, P. S. P., & McLeod, S. (2013). A population study of children’s acquisition of Hong Kong Cantonese consonants, vowels, and tones. Journal of Speech, Language, and Hearing Research, 56, 103–122.
Tong, Y., Francis, A. L., & Gandour, J. T. (2008). Processing dependencies between segmental and suprasegmental features in Mandarin Chinese. Language and Cognitive Processes, 23, 689–708.
Vance, T. J. (1976). An experimental investigation of tone and intonation. Phonetica, 33, 368–392.
Weizman, Z., Fletcher, P., & Ma, E. (2000). Cantonese corpus. Retrieved from http://childes.psy.cmu.edu/
Wong, A. M.-Y., Ciocca, V., & Yung, S. (2009). The perception of lexical tone contrasts in Cantonese children with and without specific language impairment (SLI). Journal of Speech, Language, and Hearing Research, 52, 1493–1509.
Wong, P. C. M., & Diehl, R. L. (2003). Perceptual normalization for inter- and intratalker variation in Cantonese level tones. Journal of Speech, Language, and Hearing Research, 46, 413–421.
Xu, Y., Gandour, J. T., & Francis, A. L. (2006). Effects of language experience and stimulus complexity on the categorical perception of pitch direction. The Journal of the Acoustical Society of America, 120, 1063–1074.
Ye, Y., & Connine, C. M. (1999). Processing spoken Chinese: The role of tone information. Language and Cognitive Processes, 14, 609–630.
Yip, M. (2002). Tone. Cambridge, England: Cambridge University Press.
Zhu, X. (2010). Phonetics. Beijing, China: Commercial Press.

Appendix

Average F0 Values (Hz) at Onset and Offset for the Six Tones Produced With the Syllables /ji/ and /fu/

             /ji/                  /fu/
Tone         Onset     Offset     Onset     Offset
HL (T1)      254.02    255.60     292.01    266.68
HR (T2)      174.75    282.38     189.03    283.50
ML (T3)      207.59    206.86     243.63    203.93
LF (T4)      180.18    115.60     213.23    124.07
LR (T5)      187.71    225.38     201.63    214.45
LL (T6)      201.42    185.76     225.50    176.98

[F0 trace columns for /ji/ and /fu/: images not reproduced.]

Note. HL = high level (55, T1), HR = high rising (25, T2), ML = midlevel (33, T3), LF = low falling (21, T4), LR = low rising (23, T5), LL = low level (22, T6). The six tones of the syllable /ji/ represented /ji55/ (clothing), /ji25/ (chair), /ji33/ (the first character of spaghetti), /ji21/ (son), /ji23/ (ear), and /ji22/ (two), whereas the six lexical tones of the syllable /fu/ included /fu55/ (skin), /fu25/ (tiger), /fu33/ (trousers), /fu21/ (symbol), /fu23/ (woman), and /fu22/ (father).
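To make the onset and offset cues concrete, the sketch below computes, for any pair of tones on one syllable, the absolute difference in F0 at onset, at offset, and in overall F0 change (offset minus onset). The values are those of the Appendix, assuming the extracted numbers run row by row (onset, then offset, /ji/ before /fu/). The function name `cue_differences` and the use of offset-minus-onset as a rough stand-in for contour are illustrative assumptions, not measures reported by the authors.

```python
# Mean F0 (Hz) as (onset, offset), read from the Appendix.
F0 = {
    "ji": {"T1": (254.02, 255.60), "T2": (174.75, 282.38), "T3": (207.59, 206.86),
           "T4": (180.18, 115.60), "T5": (187.71, 225.38), "T6": (201.42, 185.76)},
    "fu": {"T1": (292.01, 266.68), "T2": (189.03, 283.50), "T3": (243.63, 203.93),
           "T4": (213.23, 124.07), "T5": (201.63, 214.45), "T6": (225.50, 176.98)},
}


def cue_differences(syllable, tone_a, tone_b):
    """Absolute differences (Hz) between two tones on three simple cues."""
    on_a, off_a = F0[syllable][tone_a]
    on_b, off_b = F0[syllable][tone_b]
    return {
        "onset": round(abs(on_a - on_b), 2),
        "offset": round(abs(off_a - off_b), 2),
        # Overall F0 change across the syllable, a crude proxy for contour.
        "change": round(abs((off_a - on_a) - (off_b - on_b)), 2),
    }


if __name__ == "__main__":
    # Two rising tones (T2 vs. T5) and two level tones (T1 vs. T3) on /ji/.
    print(cue_differences("ji", "T2", "T5"))
    print(cue_differences("ji", "T1", "T3"))
```

For /ji/, the two rising tones T2 and T5 differ by 12.96 Hz at onset and 57.00 Hz at offset, whereas the two level tones T1 and T3 differ by 46.43 Hz at onset and 48.74 Hz at offset. These are descriptive differences in the recorded stimuli only and say nothing by themselves about perceptual weighting.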


