Perceptual & Motor Skills: Perception, 2014, 118, 1, 210-224. © Perceptual & Motor Skills 2014. ISSN 0031-5125. DOI 10.2466/24.22.PMS.118k14w6

EFFECTS OF SENTENCE CONTEXT ON PHONEMIC CATEGORIZATION OF NATURAL AND SYNTHETIC CONSONANT-VOWEL TOKENS 1, 2

MARK S. HEDRICK, JI YOUNG LEE, ASHLEY HARKRIDER, AND DEBORAH VON HAPSBURG

Department of Audiology & Speech Pathology, The University of Tennessee Health Science Center

Summary.—The purpose was to assess whether phonemic categorization in sentential context is best explained by autonomous feedforward processing or by top-down feedback processing that affects phonemic representation. Eleven listeners with normal hearing, ages 20-50 years, were asked to label consonants in /pi/ - /ti/ consonant-vowel (CV) stimuli from 9-step continua. One continuum was derived from natural tokens and the other was synthetically generated. The CV stimuli were presented in isolation and in three sentential contexts: a neutral context, a context favoring /p/, and a context favoring /t/. For both natural and synthetic stimuli, the isolated-CV and neutral-context conditions yielded significantly more /t/ responses than the sentence contexts primed for either /p/ or /t/. No other conditions differed significantly. The results did not show easily explainable semantic context effects; instead, the clustering of the data was more readily explained by top-down feedback processing affecting phonemic representation.

1 Address correspondence to Mark Hedrick, Department of Audiology & Speech Pathology, The University of Tennessee, 578 South Stadium Hall, Knoxville, TN 37996-0740, or e-mail ([email protected]).
2 Portions of this research were presented at the 2009 fall meeting of the Acoustical Society of America in San Antonio, Texas.

Speech is a complex, rapidly changing temporal signal. Different theories of speech perception posit different sequences of events that occur before a speech signal accesses semantic representation, thereby enabling understanding of spoken language. That sequence of events is resistant to degradation, even in extreme listening conditions of noise and reverberation. A major debate in the past few decades pertains to the flow of information in this sequence of events: whether information flows only feedforward from acoustics to word recognition, or whether there is feedback from the lexical/semantic representation to the phonemic level. To state it more precisely, debate has centered on whether there is feedback from lexical/semantic representation that would alter the structure of phonemic representation (Norris, McQueen, & Cutler, 2000). A seminal paper in this debate is that of Ganong (1980), who demonstrated that the category boundary of a phoneme will shift depending on the lexical context of the phoneme. As an example, the /g/ - /k/ boundary may shift to favor /g/ responses if the lexical context forms the word "gift"; likewise, the /g/ - /k/ boundary will shift to favor /k/ responses if the lexical context forms the word "kiss."

This "Ganong effect" has been frequently studied to determine whether lexical representation does in fact "feed back" and affect phonemic representation. Ganong assumed that a strictly top-down processing mechanism would show alteration of responses to the endpoints of the phonemic categories, that strictly bottom-up processing would show identical responses regardless of lexical context, but that an interaction of top-down and bottom-up mechanisms would show alteration by lexical context only for ambiguous stimuli along the /g/ - /k/ continuum. Ganong observed the latter result, which was thought to provide evidence for an interaction of top-down and bottom-up processing. Ganong stated that "…lexical status has an effect before acoustic information is replaced by a phonetic categorization" (p. 119).

There are two general models to explain the sequence of events from acoustic speech signal to semantic representation, depending on whether the interaction observed by Ganong is an interaction of "information" or an interaction of "processes" (Norris, et al., 2000). From Ganong's statement above, it appears he surmised an interaction of processes. Norris and colleagues, however, have suggested that the "Ganong effect" can be explained using modular phonemic perception (e.g., Norris, et al., 2000) without recourse to actual interaction of lexical and phonemic processing.

The two general models may be referred to as autonomous feedforward models and top-down feedback models. In the autonomous feedforward models, the incoming speech signal is initially processed independent of context, i.e., perceptual processing is autonomous, and lexical or semantic knowledge influences occur at later decision stages (Cutler, Mehler, Norris, & Segui, 1987; Massaro, 1989; Norris, 1994). For example, the Shortlist (Norris, 1994) and Merge (Norris, et al., 2000) models are feedforward, with information flowing only from phoneme to lexical representation. Influence of lexical knowledge occurs at higher-level decision stages; there is no feedback from lexical representation to alter phonemic representation, nor is such feedback necessary to explain lexical influences (Norris, et al., 2000; Desroches, Newman, & Joanisse, 2009). In these models, there is an interaction or combining of information, but not an interaction of processing. For instance, there is no interaction of lexical processing/mechanisms with phonemic processing/mechanisms, but lexical and phonemic knowledge may be combined in making phonemic or word recognition decisions.

In top-down feedback models, on the other hand, phonemic perceptions are directly affected by top-down feedback from higher levels of processing. Thus, what a person hears is strongly affected by what one expects to hear and by the knowledge amassed over the course of previous experiences (McClelland & Elman, 1986; Connine & Clifton, 1987; Samuel, 1996; Samuel, 2001).

These models maintain that there is bi-directional flow of information between phonemic and lexical representations. Thus, lexical information may re-shape phonemic representation (McClelland, Mirman, & Holt, 2006). Ganong's (1980) data would fall in line with these models; e.g., the TRACE model (McClelland & Elman, 1986), updated to include Hebbian learning (Mirman, McClelland, & Holt, 2006), is designed so that reciprocal connections exist between lexical and sub-lexical layers, permitting both bottom-up and top-down effects during word recognition. Thus, even higher-level language components such as syntax or semantics may actually re-shape phonemic representations. Such top-down influence may help listeners when trying to understand speech in degraded listening conditions. In these top-down models, there is an interaction of processes that is not present in the autonomous feedforward models.

Previous studies investigating evidence for these two types of models typically have involved studying lexical effects on phonemic representation while altering portions of words (e.g., Ganong, 1980; Elman & McClelland, 1988; Pitt & McQueen, 1998; Magnuson, McMurray, Tanenhaus, & Aslin, 2003). Fewer studies have examined sentential context effects on phonemic representation (e.g., Samuel, 1981; Miller, Green, & Schermer, 1984; Connine, 1987; Connine, Blasko, & Hall, 1991; Borsky, Tuller, & Shapiro, 1998; van Alphen & McQueen, 2001). These studies have suggested that sentential context effects do not influence phonemic representation, and thus provide evidence for autonomous feedforward models. Most of these studies have suggested that decision bias affects listener responses, and decision nodes are not pre-lexical (Norris, et al., 2000), meaning that sentence effects on phoneme perception involve an interaction of information, not an interaction of processes.

A number of different methods and measures have been used to assess how sentential context may show either autonomous feedforward or top-down feedback processing effects upon phonemic perception. These have included varying the difficulty of the task (e.g., Miller, et al., 1984), measuring reaction time (Connine, 1987; van Alphen & McQueen, 2001), and using signal detection theory to examine phonemic restoration (Samuel, 1981). There may be other ways, however, to show how sentential context effects influence phoneme perception. One such method would be to use targets that are ambiguous, i.e., items that could be perceived either as mere consonant-vowel (CV) syllables or as nouns with different meanings. Most of the above studies investigating sentential context effects (excepting van Alphen & McQueen, 2001) have involved target items that were relatively unambiguous in lexical or semantic meaning (e.g., "dent" - "tent" in Connine, 1987; "goat" - "coat" in Borsky, et al., 1998). Thus, many of the studies were using content words as targets and were varying semantic "fit"; such stimuli, owing to the semantic manipulation, may have unfairly favored decision bias and, hence, explanations consistent with autonomous feedforward models.

To counter this, van Alphen and McQueen (2001) used function words as targets to obtain a maximal sentential context effect, and still showed evidence of autonomous feedforward processing. However, simple CV syllables with a range of response options (a simple CV with no meaning, or different nouns with different meanings) provide even more variation for examining sentential context effects than the stimuli in any of the above studies. In addition, few studies have presented stimuli with a range of response options in isolation and then in neutral or primed sentential contexts. Further, few studies have presented the target word both as a naturally produced token and as a synthetic token. All these variations (range of response options, degree of sentential context, natural vs synthetic tokens) provide several ways in which sentential context effects may be studied.

The purpose of the current study was to assess whether sentential context effects under these variations of range of response options, extent of sentential context, and type of stimulus (natural versus synthetic) would show patterns more likely explained by autonomous feedforward models or by top-down feedback models. Two sets of consonant-vowel syllables (/pi/ - /ti/) were used as stimuli: a set derived from naturally produced /pi/ and /ti/, and a synthetic set. The syllables were presented in four contexts: in isolation, in a neutral sentence context, in a sentence context semantically biased for /pi/, and in a sentence context semantically biased for /ti/.

Hypothesis. Sentence context effects resulting from an autonomous feedforward model, based on decision bias, would result in different psychometric functions for all four of the context conditions. Such a result would likely occur if sentential effects only influence post-perceptual decision processes, but it could also be explained by a feedback model. It may be that sentence context effects cause listeners to notice but not fully attend to the semantic context of sentences. Such a modulation of attention is congruent with interactive feedback (Mirman, McClelland, Holt, & Magnuson, 2008), and could yield results based on the extent of attention shift that occurred; i.e., sentences containing more context would presumably cause a greater attention shift, and thus yield results different from a sentence having less context or a word spoken in isolation.

Alternative hypothesis. Sentence context effects resulting from top-down influence would result in clustering of the psychometric functions. (If listeners truly attended to the semantic context of the sentences, their results would not show such clustering.)

METHOD

Participants

Eleven native English speakers (2 men, 9 women) participated in this study. The participants ranged in age from 21-50 years (M = 29.3, SD = 11.3) and all were recruited from The University of Tennessee campus. They were right-handed and had no history of audiological, neurological, or psychological disorders, nor of speech, language, hearing, or learning disorders. Their audiometric thresholds were at or below 15 dB HL for the octave frequencies between 250 and 8000 Hz (re: ANSI, 2010), and tympanograms were normal for both ears. Participants were paid $10 upon completion of the project; their total time of participation was approximately one hour.

Stimuli

Stimuli were made for four listening conditions: an isolated CV condition and three sentential conditions. A male native speaker of English produced /ti/, /pi/, and three sentences while in a sound-attenuating chamber. A nine-step /ti/-/pi/ continuum was made by manipulating the naturally spoken /ti/ and /pi/ tokens. After the nine isolated CV syllables were completed, these CV syllables were inserted digitally into three sentences; thus, 27 sentences were constructed (9 CV syllables × 3 sentences). (See Fig. 1 for exemplars.) Recording was done in a quiet room using a high-quality microphone (Spher-O-Dyne) held approximately 1 cm from the speaker's mouth. The microphone output was fed to a preamplifier (Tucker-Davis, Model MA2), then routed to a 16-bit A/D converter (Tucker-Davis, Model DD1), and saved as a file sampled at 12.5 kHz via a commercially available software package (CSRE, Version 4.5).

For the isolated CV condition, nine CV syllables from a /ti/-/pi/ continuum were made from the naturally spoken /ti/ and /pi/. The naturally spoken /ti/ and /pi/ were used as the 1st and 9th endpoint stimuli, respectively, and intermediate steps in the continuum were created by altering the spectral tilt of the consonant. The spectral tilt of the /ti/ naturally produced by the speaker was flat, whereas the spectral tilt of the /pi/ was falling, i.e., decreasing in amplitude with increasing frequency. The spectral tilt of the consonantal portion of each syllable was altered systematically, using a commercially available software program (Adobe Audition 1.5), to produce a continuum varying from /pi/ to /ti/.

There were three sentential contexts: (1) "Pick the letter _CV_," (2) "The farmer picked the _CV_," and (3) "The golfer picked up the _CV_." These sentences varied in contextual likelihood when /ti/ or /pi/ was included at the end of the sentence. Thus, the sentences consisted of three sub-conditions with different contextual priming: (1) a neutral context (NC; e.g., "Pick the letter /t/" vs "Pick the letter /p/"), where both sentences are equally likely; (2) a /p/-primed context (PC; e.g., "The farmer picked the tea" vs "The farmer picked the pea"), where the latter is more likely for a North American listener than the former; and (3) a /t/-primed context (TC; e.g., "The golfer picked up the tee" vs "The golfer picked up the pea"), where the former is a much more likely sentence than the latter.
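The tilt manipulation itself was done by hand in Adobe Audition 1.5, and no numerical tilt values are reported here. For readers who wish to approximate the manipulation, the following sketch (Python) imposes a graded spectral tilt on the consonantal portion of a natural /ti/ token; the file name, the 65-msec. consonant duration, and the -6 dB/octave endpoint are illustrative assumptions, not values from the study.

# Hypothetical re-creation of the natural-token tilt continuum: tilt the
# consonantal portion of the natural /ti/ (flat spectral tilt) toward the
# falling tilt of /pi/ in nine steps.
import numpy as np
from scipy.io import wavfile

CONS_DUR_S = 0.065   # assumed duration of the consonantal portion (sec.)

def apply_tilt(x, fs, tilt_db_per_oct, ref_hz=1000.0):
    """Impose a spectral tilt (dB/octave re: ref_hz) on signal x via the FFT."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), d=1.0 / fs)
    octaves = np.log2(np.maximum(f, 1.0) / ref_hz)
    X *= 10.0 ** (tilt_db_per_oct * octaves / 20.0)
    return np.fft.irfft(X, n=len(x))

def make_continuum(ti, fs, n_steps=9, max_fall=-6.0):
    """Step 1 = flat tilt (/ti/-like); step 9 = falling tilt (/pi/-like).
    The -6 dB/octave endpoint is an assumption, not a value from the study."""
    n_cons = int(CONS_DUR_S * fs)
    cons, vowel = ti[:n_cons].astype(float), ti[n_cons:].astype(float)
    return [np.concatenate([apply_tilt(cons, fs, t), vowel])
            for t in np.linspace(0.0, max_fall, n_steps)]

fs, ti = wavfile.read("ti_natural.wav")   # hypothetical file, 12.5-kHz rate
tokens = make_continuum(ti, fs)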

FIG. 1. Spectrograms of stimuli using or derived from naturally produced CV syllables. All sentences were naturally produced. Please note the differences in the time scale between panels. Top left panel: naturally produced isolated /ti/. Bottom left panel: naturally produced isolated /pi/. Second panel from left, top: the sentence "Pick the letter /ti/" with a natural /ti/. Second panel from left, bottom: the sentence "Pick the letter /pi/" with a natural /pi/. Third panel from left, top: the sentence "The farmer picked the /ti/" with a natural /ti/. Third panel from left, bottom: the sentence "The farmer picked the /pi/" with a natural /pi/. Top right panel: the sentence "The golfer picked up the /ti/" with a natural /ti/. Bottom right panel: the sentence "The golfer picked up the /pi/" with a natural /pi/. (Axes: frequency 0-5 kHz; time in sec.)

Nine CV syllables from the /ti/-/pi/ continuum were concatenated digitally to the end of each sentence (Adobe Audition 1.5) and used as words in context (e.g., the letter "t" or "p," "pea," "tea," "tee"). Recall that there were three different sentences (NC, PC, TC). Each stimulus in the 9-step continuum was added to each of the three sentences, making a total of 27 sentences. (See Fig. 2 for exemplars.)

Synthetic stimuli were constructed in a similar way to the natural speech stimuli. Nine synthetic speech stimuli were constructed to form the synthetic continuum. The synthetic endpoints were made with burst and formant transition onset values appropriate for a labial or for an alveolar, and the burst and transition onset values were then interpolated from the labial to the alveolar to create the intermediate stimuli.

FIG. 2. Spectrograms of stimuli using synthetic CV syllables. Except for the final CV syllable, all sentences were naturally produced. Please note the differences in the time scale between panels. Top left panel: synthetic isolated /ti/. Bottom left panel: synthetic isolated /pi/. Second panel from left, top: the sentence "Pick the letter /ti/" with a synthetic /ti/. Second panel from left, bottom: the sentence "Pick the letter /pi/" with a synthetic /pi/. Third panel from left, top: the sentence "The farmer picked the /ti/" with a synthetic /ti/. Third panel from left, bottom: the sentence "The farmer picked the /pi/" with a synthetic /pi/. Top right panel: the sentence "The golfer picked up the /ti/" with a synthetic /ti/. Bottom right panel: the sentence "The golfer picked up the /pi/" with a synthetic /pi/. (Axes: frequency 0-5 kHz; time in sec.)

The burst was shaped by manipulating the amplitude of formants F3, F4, and F5 during the frication portion of the syllable. The burst was given either a falling shape (greater energy at the lower-frequency F3 than at F4 or F5, typical of the labial /p/) or a rising shape (greater energy at F4 and F5 than at F3, typical of the alveolar /t/). Frequency values for F3 during the burst varied from 2200 Hz for the labial to 2800 Hz for the alveolar; these values also served as the formant transition onsets when voicing was initiated for the vocalic portion of the syllable. Frequency values for F4 and F5 were constant at 3500 Hz and 3750 Hz, respectively. Formant transition onsets varied from labial to alveolar values for F2 and F3: F2 onset values varied from 1000 Hz (labial) to 1800 Hz (alveolar) across the continuum, and F3 onset values varied from 2200 Hz (labial) to 2800 Hz (alveolar), as listed above. The total duration of the burst was 25 msec., with formant transitions beginning at the termination of the burst and continuing for 40 msec. to voicing initiation. Fundamental frequency was initiated at 130 Hz at vocalic onset and gradually declined to 100 Hz at voicing offset, 250 msec. later. Steady-state formant frequency values were as follows: F1 = 310 Hz, F2 = 2000 Hz, and F3 = 2900 Hz. Synthetic stimuli were created using a formant synthesizer (Klatt, 1980) with a sampling rate of 10 kHz, followed by low-pass filtering at 4.9 kHz. Note that all the information needed in the speech waveform to distinguish /p/ from /t/ occurs at frequencies below 4500 Hz, which means that a 10 kHz sampling rate is sufficient for stimulus generation. These synthesis sampling rates are similar to those used in previous studies with the Klatt synthesizer (e.g., Hedrick & Younger, 2003). As the results will show, the similar results found for the naturally produced and the synthetic stimuli suggest that the synthetic stimuli were natural-sounding.
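Because the burst and transition onsets were interpolated between the stated labial and alveolar endpoints, the per-step parameter values can be reconstructed from the figures above. A minimal sketch, assuming the interpolation was linear over the nine steps (the interpolation rule is not stated):

# Reconstructing the nine-step parameter table for the synthetic continuum,
# assuming linear interpolation between the stated labial and alveolar values.
import numpy as np

N_STEPS = 9
f2_onset = np.linspace(1000.0, 1800.0, N_STEPS)  # Hz, labial -> alveolar
f3_onset = np.linspace(2200.0, 2800.0, N_STEPS)  # Hz, also the F3 burst value

for step, (f2, f3) in enumerate(zip(f2_onset, f3_onset), start=1):
    # F4/F5 stay at 3500/3750 Hz; burst = 25 msec.; transitions = 40 msec.;
    # F0 falls from 130 to 100 Hz over the 250-msec. vocalic portion.
    print(f"step {step}: F2 onset = {f2:6.1f} Hz, F3 onset = {f3:6.1f} Hz")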

Procedure

Prior to the experimental test, all participants received a case history interview and screening tests. They were then given a preliminary test using the continuum endpoints of the synthetic speech to familiarize them with the task and to verify that their consonant identification was normal. Participants entered the study once they identified the consonants in the synthetic endpoint stimuli with 90% accuracy.

Participants were administered the identification task individually in a sound-treated booth. They were seated facing a computer screen that displayed the letters "p" and "t," and were asked to select which consonant sound they heard by using a mouse to click the letter of that consonant. The stimuli were D/A converted (Tucker-Davis, Model DD1) and routed through a headphone buffer (Tucker-Davis HB) before being sent to headphones within a sound-attenuated booth (IAC). The stimuli were presented binaurally (diotically) at a comfortable listening level (approximately 74 dB SPL) over Sennheiser headphones. The stimuli were blocked according to natural or synthetic CVs. Six participants listened to the synthetic speech first and then the natural speech, whereas five participants listened to the natural speech first and then the synthetic speech. Each stimulus was presented five times in random order within each block; the isolated CVs and the sentence stimuli were all randomized together. Randomized presentation lists and online data collection were handled by a commercially available software program (CSRE 4.5). Thus, participants gave 180 responses (9 stimuli × 5 presentations for the isolated CV condition, plus 9 stimuli × 5 presentations × 3 contexts for the sentence conditions) for the natural and the synthetic speech, respectively. Again, the only blocking of stimuli was according to natural or synthetic CVs: for the naturally produced CV stimuli, all stimuli (all sentences and isolated CVs) were randomly presented together within each randomized order; likewise, for the synthetically produced CV stimuli, all stimuli (all sentences and isolated CVs) were randomly presented together within each randomized order.
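The resulting trial structure (45 isolated-CV trials plus 135 sentence trials per block, all randomized together) is straightforward to reproduce. A minimal sketch of one block's presentation list follows; the condition labels are ours, not the authors'.

# Building one randomized block (natural or synthetic): 9 continuum steps x
# 5 repetitions x 4 contexts (isolated, NC, PC, TC) = 180 trials, with the
# isolated CVs and sentence stimuli randomized together, as in the study.
import random

CONTEXTS = ["isolated", "NC", "PC", "TC"]   # hypothetical labels
N_STEPS, N_REPS = 9, 5

trials = [(step, ctx)
          for step in range(1, N_STEPS + 1)
          for ctx in CONTEXTS
          for _ in range(N_REPS)]
random.shuffle(trials)          # a fresh order for each participant/block
assert len(trials) == 180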

Analysis

Responses from the listeners were tabulated as psychometric functions showing the response label (/p/ or /t/) as a function of the acoustic cue manipulation (hence, stimulus number). The points on the psychometric function were then used as the dependent variable in analyses of variance, with cue manipulation (stimulus number) and sentence context as the factors. A similar analysis performed previously (Hedrick & Younger, 2007) reported an effect size of 0.8 for the formant transition manipulation for the /p/-/t/ contrast; assuming a slightly more conservative effect size (0.7), an alpha of .05, and power of .84, approximately 10 participants would be needed to show significant differences between the sentence contexts if such differences were present.

RESULTS

Natural Speech

Figure 3 presents the results for the natural speech stimuli, showing percent /t/ responses plotted as a function of the manipulation of spectral tilt from /t/ to /p/. The legend to the right identifies the different sentential context conditions. The data show differences between some of the contextual conditions, an observation borne out by the statistical analysis. Raw data responses along the psychometric function were used as the dependent variable in a two-way analysis of variance (ANOVA), with manipulation of tilt and sentential context as the within-subjects factors. This analysis procedure involves the entire psychometric function, similar to that used by Hedrick and Younger (2003). Results indicated a main effect of sentential context (F(1.85, 18.48) = 7.06, p = .006, ηp² = 0.41) and of spectral tilt (F(2.27, 22.67) = 162.34, p < .001, ηp² = 0.94), with no significant interaction. To explore the main effect of sentential context, post hoc tests were performed; the results can be found in Table 1. In general, responses for the isolated CV and the neutral sentence context grouped together, responses for the PC and TC sentences grouped together, and there were differences between these two groupings but not within them.

Synthetic Speech

Figure 4 presents the results for the synthetic CV stimuli. As with the natural stimuli, these results show the same pattern for sentential context and spectral tilt. The two-way ANOVA demonstrated a main effect of context (F(2.38, 23.76) = 6.97, p = .003, ηp² = 0.41) and of spectral tilt (F(2.68, 26.77) = 84.37, p < .001, ηp² = 0.89), with no interaction effect. Table 1 (third column) shows the post hoc tests on sentential context for the synthetic stimuli. A comparison within Table 1 (second and third columns) indicates that the same pattern of results was obtained from stimuli containing the natural tokens as from stimuli containing the synthetic tokens.
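The fractional degrees of freedom reported above indicate that a sphericity correction (e.g., Greenhouse-Geisser) was applied. A minimal sketch of the same style of two-way within-subjects analysis, assuming a long-format data table whose file and column names are ours:

# Two-way repeated-measures ANOVA on the psychometric-function points.
# Long format: one row per listener x context x continuum step, with the
# percent /t/ responses in 'pct_t'. File and column names are hypothetical.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

df = pd.read_csv("responses_long.csv")
res = AnovaRM(df, depvar="pct_t", subject="listener",
              within=["context", "step"]).fit()
print(res)
# Note: AnovaRM reports uncorrected degrees of freedom; the fractional dfs
# in the text imply a Greenhouse-Geisser-type correction applied on top.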

FIG. 3. Percent /t/ responses plotted as a function of spectral tilt (from /ti/ to /pi/, Stimuli 1-9) for the four sentential context conditions: isolated CV, NC (neutral context), PC (context primed for /p/ response), and TC (context primed for /t/ response). Stimuli were derived from a naturally produced token with slight changes made in spectral tilt.

Figures 3 and 4 portray a lack of change at the continuum endpoints; thus, by using data from the entire psychometric function in the statistical analyses, context effects at the phonemic category boundaries may be buried. Therefore, 50% boundary values of the psychometric functions were calculated using linear regression and subsequently used in one-way ANOVAs, one for each stimulus set.
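A minimal sketch of that boundary calculation, assuming a straight-line fit to a listener's percent-/t/ function solved for the 50% crossing; the text does not specify whether all nine points or only the sloping region entered the regression, and the data below are invented for illustration:

# Estimate the 50% category boundary of a psychometric function by linear
# regression: fit pct_t = a*step + b, then solve a*step + b = 50.
import numpy as np

def boundary_50(steps, pct_t):
    a, b = np.polyfit(steps, pct_t, deg=1)   # slope, intercept
    return (50.0 - b) / a

steps = np.arange(1, 10)                                # continuum steps 1-9
pct_t = np.array([98, 95, 90, 75, 55, 30, 15, 8, 4])    # made-up listener data
print(f"50% boundary at step {boundary_50(steps, pct_t):.2f}")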

FIG. 4. Percent /t/ responses plotted as a function of spectral tilt (from /ti/ to /pi/, Stimuli 1-9) for the four sentential context conditions: isolated CV, NC (neutral context), PC (context primed for /p/ response), and TC (context primed for /t/ response). Stimuli were synthetically generated.


TABLE 1
POST HOC TESTS OF CONTEXT FOR NATURAL AND SYNTHETIC SPEECH STIMULI (p VALUES)

Contrast Pair          Natural Speech Stimuli   Synthetic Speech Stimuli
Isolated CV vs NC               .24                      .49
Isolated CV vs PC               .04                      .003
Isolated CV vs TC               .02                      .02
NC vs PC                        .005                     .01
NC vs TC                        .008                     .006
PC vs TC                        .34                      .38
Note.—Boldface emphasizes p < .05.

There was a main effect of context for both the natural (F(1.92, 17.29) = 8.63, p = .003, ηp² = 0.49) and the synthetic (F(3, 27) = 7.31, p = .001, ηp² = 0.55) tokens. Tukey Least Significant Difference (LSD) post hoc tests for the contextual conditions are shown in Table 2 for the natural and synthetic stimuli. In essence, the same pattern of results is seen whether the dependent variable comprises the entire psychometric function or only the boundary values derived from it. The only difference was a p = .05 for the LSD test between the isolated word and the PC context for the natural tokens.

TABLE 2
POST HOC TESTS OF CONTEXT FOR NATURAL AND SYNTHETIC SPEECH STIMULI USING BOUNDARY VALUES AS THE DEPENDENT VARIABLE (p VALUES)

Contrast Pair          Natural Speech Stimuli   Synthetic Speech Stimuli
Isolated CV vs NC               .15                      .75
Isolated CV vs PC               .05                      .01
Isolated CV vs TC               .01                      .03
NC vs PC                        .002                     .003
NC vs TC                        .004                     .006
PC vs TC                        .87                      .50
Note.—Boldface emphasizes p < .05.

DISCUSSION

The purpose of the current study was to assess whether sentential context effects would show patterns of results more readily explained by autonomous feedforward models or by top-down feedback models when tested with stimuli having a range of response options, varying extents of sentential context, and either natural or synthetic targets. There were four different sentential contexts: targets in isolation, a neutral context, and /p/- or /t/-primed contexts. If decision bias were the main influence upon listener responses, then responses should have been different for all four contexts. Such a result would likely occur if sentential effects only influence post-perceptual decision-making. Because autonomous models predict that information is combined only at decision nodes, these models could readily predict such results. However, if top-down feedback were the main influence upon listeners' responses, then responses might instead show a clustering of results, in which the phonemic responses from the listeners differ for a rich context (e.g., the primed sentence contexts) versus a nonexistent or weak context.

The ambiguity of the CV stimuli (having a range of response options: as mere syllables, as names of letters, or as names of different objects, e.g., "pea" or "tea" or "tee"), the different contexts, and the types of stimuli (natural versus synthetic) helped create situations in which the same acoustic target might be assigned different meanings depending on the context, and made it more likely that phonemic representation might be altered by top-down feedback. Different responses for all four contexts could be simulated by either feedback or feedforward models, but would be more readily explained by feedforward models, which better account for decision bias, as argued previously (e.g., Samuel, 1981; Miller, et al., 1984; Connine, 1987; Connine, et al., 1991; Borsky, et al., 1998; van Alphen & McQueen, 2001). However, the results do not support a behavioral pattern of contextually biased labeling when the phoneme is physically ambiguous. Instead, the data of the current study show a clustering of the isolated CV and neutral sentential context conditions versus the primed sentential contexts. These patterns are consistent across stimuli having either natural or synthetic CV tokens. This clustering of results suggests that semantic sentential context may affect phonemic representation, and it is more readily explained by top-down feedback processing.

Few studies examining sentential context effects have used stimuli or conditions like those used in the current study (Samuel, 1981; Miller, et al., 1984; Connine, 1987; Connine, et al., 1991; Borsky, et al., 1998; van Alphen & McQueen, 2001). Most of these previous studies used specific nouns as target words (e.g., "dent" versus "tent"), which are rather specific in meaning. This would seem to push explanations for sentential context effects toward decision bias or lexical semantic strategies rather than toward possible changes in phonemic representation. Using the more ambiguous CV stimuli in the current study allowed a better chance of observing possible top-down feedback mechanisms affecting phonemic representation.

This brings up one of the striking results of the current study: responses to the PC- and TC-primed sentences were not different from one another for either the natural or the synthetic stimuli. It would seem that, if decision bias were the only factor, there should be clear differences between the PC and TC sentences. The lack of difference between these conditions might be because the TC sentence ("The golfer picked up the ___") was not sufficiently strong in context to separate it from the PC sentence ("The farmer picked the ___"). Yet this explanation could have been invoked even if a small but significant difference had been found between the sentences, and no such difference was found. Nor does it help explain why a golfer's tee would be confused with a pea. An alternative explanation is that the PC and TC sentences presented more semantic material than the imperative sentence ("Pick the letter ___"). This additional semantic material could have imperfectly yet consistently influenced encoding; this imperfect yet consistent influence upon encoding could have arisen because all the stimuli (isolated CVs and NC, TC, and PC sentences) were presented together within randomized lists. It may also be that the randomized lists led listeners to note but not fully attend to the semantic context of the TC and PC sentences (e.g., Mirman, et al., 2008). We reason that this would still show an interactive effect of feedback on phonemic processing, since there were still significant differences between the results from the primed sentences and those from the isolated and neutral sentence context conditions. An autonomous model that yields sentential effects based on post-perceptual decision bias cannot easily explain why the primed sentence contexts in the current study yielded such similar results.

Clearly, the current study cannot provide an absolute answer as to whether or not sentential semantic feedback influences sublexical phonemic representation. The design of the current study was such that sentential context effects upon phoneme perception were inferred by asking the listeners to identify what they "heard," not what they "thought." An indirect behavioral measurement may not be sensitive enough to establish whether listeners responded based on decision bias or on alteration of sublexical phonemic representation, because speech perception occurs so quickly and unconsciously. A more direct approach involving more elaborate behavioral measures, in conjunction with evoked potential and brain imaging techniques, will likely be necessary before there is an answer to whether or not there is lexical/semantic sentential feedback at the pre-lexical level. Electrophysiological results from intra-cranial electrodes in humans have shown the importance of attention for auditory cortical neural selectivity to a given speaker in a multi-speaker background (e.g., Mesgarani & Chang, 2012). These findings are congruous with the role listeners' attention likely played in the current study; they do not, however, completely determine the role or extent of cortical effects on subcortical processing.

The data of the current study, however, are explained most easily by sentential semantic feedback influencing sublexical phonemic representation rather than by a strictly feedforward system.

REFERENCES

AMERICAN NATIONAL STANDARDS INSTITUTE. (2010) Specifications for audiometers (ANSI S3.6). New York: Author.
BORSKY, S., TULLER, B., & SHAPIRO, L. P. (1998) "How to milk a coat": the effects of semantic and acoustic information on phoneme categorization. Journal of the Acoustical Society of America, 103, 2670-2676.
CONNINE, C. M. (1987) Constraints on interactive processes in auditory word recognition: the role of sentence context. Journal of Memory and Language, 26, 527-538.
CONNINE, C. M., BLASKO, D. G., & HALL, M. (1991) Effects of subsequent sentence context in auditory word recognition: temporal and linguistic constraints. Journal of Memory and Language, 30, 234-250.
CONNINE, C. M., & CLIFTON, C. (1987) Interactive use of lexical information in speech perception. Journal of Experimental Psychology: Human Perception and Performance, 13, 291-299.
CUTLER, A., MEHLER, J., NORRIS, D., & SEGUI, J. (1987) Phoneme identification and the lexicon. Cognitive Psychology, 19, 141-177.
DESROCHES, A. S., NEWMAN, R. L., & JOANISSE, M. F. (2009) Investigating the time course of spoken word recognition: electrophysiological evidence for the influences of phonological similarity. Journal of Cognitive Neuroscience, 21, 1893-1906.
ELMAN, J. L., & MCCLELLAND, J. L. (1988) Cognitive penetration of the mechanisms of perception: compensation for coarticulation of lexically restored phonemes. Journal of Memory and Language, 27, 143-165.
GANONG, W. F. (1980) Phonetic categorization in auditory word perception. Journal of Experimental Psychology: Human Perception and Performance, 6, 110-125.
HEDRICK, M., & YOUNGER, M. S. (2003) Labeling of /s/ and /S/ by listeners with normal and impaired hearing, revisited. Journal of Speech, Language, and Hearing Research, 46, 636-648.
HEDRICK, M., & YOUNGER, M. S. (2007) Perceptual weighting of stop consonant cues by normal and impaired listeners in reverberation versus noise. Journal of Speech, Language, and Hearing Research, 50, 254-269.
KLATT, D. (1980) Software for a cascade/parallel formant synthesizer. Journal of the Acoustical Society of America, 67, 971-995.
MAGNUSON, J. S., MCMURRAY, B., TANENHAUS, M. K., & ASLIN, R. N. (2003) Lexical effects on compensation for coarticulation: the ghost of Christmash past. Cognitive Science, 27, 285-298.
MASSARO, D. W. (1989) Testing between the TRACE model and the fuzzy logical model of speech perception. Cognitive Psychology, 21, 398-421.
MCCLELLAND, J. L., & ELMAN, J. L. (1986) The TRACE model of speech perception. Cognitive Psychology, 18, 1-86.
MCCLELLAND, J. L., MIRMAN, D., & HOLT, L. (2006) Are there interactive processes in speech perception? Trends in Cognitive Sciences, 10, 363-369.

MESGARANI, N., & CHANG, E. (2012) Selective cortical representation of attended speaker in multi-talker speech perception. Nature, 485, 233-237.
MILLER, J. L., GREEN, K., & SCHERMER, T. M. (1984) A distinction between the effects of sentential speaking rate and semantic congruity on word identification. Perception & Psychophysics, 36, 329-337.
MIRMAN, D., MCCLELLAND, J. L., & HOLT, L. L. (2006) An interactive Hebbian account of lexically guided tuning of speech perception. Psychonomic Bulletin & Review, 13, 958-965.
MIRMAN, D., MCCLELLAND, J. L., HOLT, L. L., & MAGNUSON, J. S. (2008) Effects of attention on the strength of lexical influences on speech perception: behavioral experiments and computational mechanisms. Cognitive Science, 32, 398-417.
NORRIS, D. G. (1994) A connectionist model of continuous speech recognition. Cognition, 52, 189-234.
NORRIS, D. G., MCQUEEN, J. M., & CUTLER, A. (2000) Merging information in speech recognition: feedback is never necessary. Behavioral and Brain Sciences, 23, 299-370.
PITT, M. A., & MCQUEEN, J. M. (1998) Is compensation for coarticulation mediated by the lexicon? Journal of Memory and Language, 39, 347-370.
SAMUEL, A. G. (1981) Phonemic restoration: insights from a new methodology. Journal of Experimental Psychology: General, 110, 474-494.
SAMUEL, A. G. (1996) Does lexical information influence the perceptual restoration of phonemes? Journal of Experimental Psychology: General, 125, 28-51.
SAMUEL, A. G. (2001) Knowing a word affects the fundamental perception of the sounds within it. Psychological Science, 12, 348-351.
VAN ALPHEN, P., & MCQUEEN, J. M. (2001) The time-limited influence of sentential context on function word identification. Journal of Experimental Psychology: Human Perception and Performance, 27, 1057-1071.

Accepted January 10, 2014.


ERRATUM

HEDRICK, M. S., LEE, J. Y., HARKRIDER, A., & VON HAPSBURG, D. (2014) Effects of sentence context on phonemic categorization of natural and synthetic consonant-vowel tokens. Perceptual & Motor Skills: Perception, 118, 1, 210-224. DOI: 10.2466/24.22.PMS.118k14w6

The second author wishes to add a third footnote to the bottom of page 210:

3 Ji Young Lee is currently at the Department of Audiology and Speech-Language Pathology, Catholic University of Daegu, 13-13 Hayang-ro, Hayang-eup, Gyeongsan-si, Gyeongsangbuk-do 712-702, REPUBLIC OF KOREA.

Erratum published in: Perceptual and Motor Skills, 2014, 119, 3, 985. © Perceptual & Motor Skills 2014 DOI: 10.2466/99.PMS.119c33z3
