Processing of phonemes in speech: A speed-accuracy study
Roger Remington
University of Oregon, Department of Psychology, Eugene, Oregon 97403
(Received 13 January 1977; revised 24 June 1977)

The present study investigated the serial or parallel processing of phonemes within a syllable. Subjects were required to press a key, on cue, to indicate whether two successive, binaural CVC syllables were the same or different. The cue was a tone which occurred at varying times after the onset of the second CVC and often interrupted its presentation. Subjects' accuracy as a function of time was compared for discriminations involving the first, second, and third phonemes. The results show that information about the phonemes within a CVC syllable is accruing simultaneously. The importance of these results for models of speech perception is discussed.

PACS numbers: 43.70.Dn

INTRODUCTION

A syllable can be represented formally as an ordered sequence of phonemes, and our intuitions as speakers and hearers tend to support this. A priori, then, it seems reasonable to hypothesize that the phonemes within a syllable are extracted from the acoustic wave in a correspondingly sequential fashion. Such considerations might lead to the additional assumption that this recovery is strictly serial. That is, recovery (extraction) of the initial consonant of a CVC syllable, for example, is completed before extraction of the vowel or final consonant is begun. Both assumptions, hence the strict serial model, are almost certainly incorrect.

Acoustically, the notion that a syllable can be divided into an ordered set of discrete phonemes has received no empirical support. Spectrographic analysis has consistently failed to reveal discrete events in the acoustic wave that correspond to a given phoneme in all contexts. (See Studdert-Kennedy, 1975 for a review; Liberman, Cooper, Shankweiler, and Studdert-Kennedy, 1967; Liberman, 1970a, 1970b; Massaro, 1975a.) In most cases it is not possible to segment the acoustic signal into discrete temporal intervals which contain acoustic information about only one phoneme sufficient to identify that phoneme (steady-state vowels being likely exceptions). A temporal interval with sufficient information to identify a given phoneme might encompass neighboring phonemes as well (Liberman et al., 1967; Liberman, 1970a; Fant, 1967; Haggard, 1974). Consequently, throughout much of the acoustic syllable, events that signal one phoneme are interacting with those from neighboring phonemes. At any point within this stream, information about several phonemes is simultaneously present. Liberman (1970a, 1970b) has argued that this parallel presentation of information is mirrored by an ability to recover neighboring phonemes in parallel.

That efficient use of information in the waveform may entail simultaneous processing of adjacent consonant and vowel phonemes is suggested by the importance of context on the perception of both consonants and vowels. It has been known for some time that consonants cannot be produced in the absence of a vowel and still maintain a speechlike quality. Liberman, Delattre, and Cooper (1952) showed that correct identification of a given stop consonant requires that at least initial portions of the following vowel be presented. This is understandable since the important acoustic cues for place of articulation, the second formant transitions, are unambiguous only after some resolution of the vowel has taken place (Liberman et al., 1967; Liberman, 1970a; Massaro, 1975a). Strange, Verbrugge, and Shankweiler (1974) and Healy and Cutting (1976) have shown that vowel identification and discrimination, respectively, change as a function of consonant context. Studdert-Kennedy (1975) cites a study by Fujimura and Ochiai (1963) showing that when portions of vowels are excised from running speech without the surrounding context of formant transitions, vowel identifications change.

Perceptually, then, adjacent consonants and vowels appear to be highly interdependent. In two separate studies, Wood and Day (1975) and Pisoni and Tash (1974) sought to demonstrate this. Both studies showed that when subjects were asked to make judgments on one element of a CV syllable this judgment was affected by changes in the irrelevant element. For example, in the Pisoni and Tash (1974) experiment, when subjects were instructed to respond "same" or "different" on the basis of one element of the CV syllable, "same" responses were fastest when both the consonant and vowels of the two syllables were the same, while "different" responses were fastest when both elements differed.

Such failures of selective attention have been found for other auditory dimensions (Wood, 1974a, 1974b) and the pattern of results (notably the redundancy gain) suggests that the consonant and vowel in CV syllables are processed as integral units (Garner, 1974). Wood and Day (1975) argue, in agreement with Wood (1975), that this integrality implies parallel processing. While integrality is certainly not support for a serial model, the Pisoni and Tash (1974) and Wood and Day (1975) results support the parallel model only in a weak sense. Subjects in these experiments may not have suffered from an inability to separate the elements of the syllables at all levels of processing but, instead, capitalized on a holistic matching strategy. Changing the irrelevant element of a CV syllable that is to be compared to another CV affects the overall similarity of the two syllable wholes, and subjects in the two studies may have used this similarity-dissimilarity dimension as a basis for their responses.¹ In keeping with random walk models of the comparison and decision processes (Link and Heath, 1975), extreme values of similarity or dissimilarity produce faster RT's than moderate values. Wood (1974a) and others (Garner, 1974; Lockhead, 1970, 1972) have noted this alternative explanation of redundancy gain in RT data. Apparently, subjects are able to treat syllables and, in some cases, sentences as the functional units. (See Foss and Swinney, 1973.)

J. Acoust. Soc. Am., Vol. 62, No. 5, November 1977
Copyright © 1977 by the Acoustical Society of America

The Wood and Day (1975) and Pisoni and Tash (1974) studies could, quite possibly, have tapped mechanisms that underlie the comparison process. This would still imply some parallelism, since information about the consonant and vowel (in addition to other linguistic and nonlinguistic information) must exist simultaneously at this stage in processing. However, the acoustic data cited earlier would lead one to expect that parallel processing extends to lower levels of analysis, in particular, to the mapping of the acoustic signal onto phoneme units in memory. It is not clear that this level of processing can be tapped by a redundancy gain analysis of whole syllable matching.

Another way to approach the question of serial-versus-parallel extraction of neighboring phonemes would be to examine discrimination accuracy as a function of time. Simultaneous extraction of consonant and vowel information would be more firmly established if it could be shown that at all times after the onset of the syllable discrimination accuracy improves simultaneously for both these elements. This would be a more stringent test of the parallel processing hypothesis; indeed, it is a stronger parallel processing hypothesis. Extensions of the previous findings to CVC syllables would be desirable, since it is not known to what degree the elements of VC groups are simultaneously processed, or whether parallel processing extends to more than two elements of a syllable.

The present study attempts to examine the question of parallel, simultaneous extraction of neighboring phonemes by examining how a listener's ability to distinguish between two CVC (three-element) syllables improves with the amount of the syllable the person is allowed to hear, as a function of the position and number of differing elements. If information about neighboring phonemes within a syllable is accumulating simultaneously, then at a time prior to the complete identification of one phoneme, there should be information about a neighboring later phoneme. For example, at some point before a /pIz-gIz/ discrimination reaches its maximum value, a /pIz-pUz/ discrimination should also be above chance. The parallel processing hypothesis makes the additional prediction that discrimination on the basis of one element would be improved by differences in subsequent phonemes; and the scope of this redundancy gain is of some interest.

In contrast, the strict serial model predicts that phonetic judgments of later phonemes must await a completed identification of preceding phonemes. Discrimination accuracy for the vowel in a CVC syllable, for example, would not improve until accuracy for the initial consonant had reached its asymptotic value. Thus, this model predicts no improvement in accuracy when additional subsequent phonemes differ. This serial model may seem unlikely in view of the highly encoded nature of the acoustic signal. It may be worthwhile to attempt a cursory reconciliation of this model with the acoustic data cited earlier.

Assume that phonemes are represented in memory by a unique set of abstract acoustic features. When a set of features extracted from the incoming speech signal matches a given set in memory, the corresponding phoneme is accessed and stored in a buffer which, in turn, is capable of being accessed by some conscious decision-making mechanism. Since many consonants depend on vowel information for their disambiguation, the set of acoustic features defining them would include the relevant vowel information. The strict serial model assumes that this vowel information is used only for the recognition of the consonant; hence, only one phoneme at a time is placed in the buffer. The parallel model, on the other hand, assumes that this vowel information is used in the simultaneous accessing of both the consonant and vowel, and that both elements can be simultaneously incorporated into the buffer.² The intent here is to examine the serial and parallel hypotheses by looking at how the phonetic contents of this buffer change with time.

Results from the present study will have a bearing on a related issue. Findings from backward recognition masking studies (Pisoni, 1975; Pisoni and McNabb, 1974; Pisoni, 1972; Massaro, 1975a) indicate that vowels require less processing time than do consonants, and that information about the identity of vowels is available sooner. This is consistent with the Pisoni and Tash (1974) study, where vowel differences were responded to more quickly than consonant differences, and the Wood and Day (1975) experiment, where the vowel targets were affected less by varying the irrelevant consonant element than were consonant targets by varying the irrelevant vowel element. In short, vowels were apparently more influential in affecting performance in these tasks than were consonants. Perhaps vowel recognition occurs before initial consonant recognition, even though, temporally, the consonant precedes the vowel. If this is the case, predictions from the parallel or serial extraction models will not be substantially altered, but results from the present experiment are expected to give a clearer picture of how phonetic processing relates to the temporal properties of the acoustic signal.

I. METHOD

A. General description

On each trial a subject heard a target CVC nonsense syllable (CVC 1), followed by a second CVC syllable (CVC 2). The task was to indicate by pressing one of two keys whether CVC 2 was the same as or different from CVC 1. At varying intervals after the onset of CVC 2 (on time, or lag) a tone was presented which terminated CVC 2 and cued the subject to respond. It is important to note that no acoustic information about the second CVC was available after the onset of the tone. The tone did not mask the remainder of the syllable, but replaced it. When the tone occurred prior to the completion of CVC 2, judgments of "same" are, properly speaking, judgments of no difference in the corresponding portions of the two syllables. The purpose of this was to force the subjects to use the information about the individual phonemes in their judgments by restricting the possibility of holistic matching. Any seriality in phoneme extraction should be detectable by this method.

FIG. 1. (a) This shows a possible prediction from a serial model. Since recognition of the first phoneme is completed before processing of later phonemes is begun, the curve for POS 1 will have reached asymptote before discrimination accuracy for POS 2 or POS 3 begins to rise above chance, and the curves will not overlap. The parallel model (b), however, would predict substantial overlap in the curves for all three positions since information accrual is proceeding simultaneously. Intercept differences would result from the sequential nature of the input. [Panels (a) and (b); abscissa: time; one curve per position (POS).]
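The contrast drawn in Fig. 1 can be made concrete with a small simulation. The sketch below is an illustration only, not a model taken from the study: it assumes all-or-none recognition stages with Gaussian durations, and the names `simulate_strict_serial` and `curves_overlap`, the 100-msec stage mean, and the thresholds 0.55 and 0.95 are hypothetical choices. It suggests that a strict serial process gives non-overlapping curves when stage durations are fixed, whereas averaging over trials with variable durations blurs the steps into overlapping curves.

```python
import random

CHANCE = 0.5  # two-alternative same/different task


def simulate_strict_serial(lags_ms, mean_stage_ms, sd_stage_ms,
                           n_trials=5000, seed=1):
    """Accuracy-vs-lag curves for three phonemes under a strict serial model.

    Each phoneme is an all-or-none stage that starts only when the previous
    stage has finished. A response cued at a given lag is correct if the
    relevant stage has finished by then; otherwise the subject guesses.
    All durations are hypothetical illustration values.
    """
    rng = random.Random(seed)
    finished = [[0] * len(lags_ms) for _ in range(3)]
    for _ in range(n_trials):
        t = 0.0
        for pos in range(3):
            # Stage duration, clipped at zero so time never runs backwards.
            t += max(0.0, rng.gauss(mean_stage_ms, sd_stage_ms))
            for i, lag in enumerate(lags_ms):
                if t <= lag:
                    finished[pos][i] += 1
    return [[CHANCE + (1 - CHANCE) * count / n_trials for count in row]
            for row in finished]


def curves_overlap(curves, low=0.55, high=0.95):
    """True if POS 2 is above `low` at some lag where POS 1 is still below `high`."""
    first, second = curves[0], curves[1]
    return any(a < high and b > low for a, b in zip(first, second))


lags = list(range(0, 601, 25))
fixed = simulate_strict_serial(lags, mean_stage_ms=100, sd_stage_ms=0)
noisy = simulate_strict_serial(lags, mean_stage_ms=100, sd_stage_ms=60)
print("fixed durations overlap:", curves_overlap(fixed))
print("variable durations overlap:", curves_overlap(noisy))
```

With fixed durations each curve is a step and the POS 2 step begins only after the POS 1 step is complete; with variable durations the averaged curves rise gradually and overlap, even though no single trial involves parallel processing.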

While interrupting an ongoing stimulus by a second stimulus resembles backward masking in many respects (see Massaro, 1975a), forcing the subject to respond on cue at predetermined intervals after the trial onset is the essential characteristic of the Speed-Accuracy Tradeoff paradigm, as described by Wickelgren (1977). This paradigm has been used successfully to study memory retrieval dynamics (Reed, 1973; Wickelgren and Corbett, 1977) and sentence comprehension (Dosher, 1976). Central to this method is the notion that forcing the subject to respond on cue yields a measure of the amount of information the person has at a given point in time after the onset of the trial (on time, or lag) regarding the required decision. So long as RT to the cue is constant at all lags, it is reasonable to assume that we are indeed obtaining a measure of this information. By placing cues at different points in time, a plot of discrimination accuracy (amount of available information) as a function of time (increasing lag) is obtained. To simplify, the result is a chart of the buildup of information with time. The best-fitting curves to these plots are typically exponentially accelerated functions with three parameters to be estimated: intercept, or the point after which accuracy begins its monotonic rise above chance; asymptote, or the point of maximum accuracy (approached but not actually reached); and slope, or rate of approach to the asymptote.
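The three parameters can be illustrated with a short sketch. The text gives no equation, so the functional form below (an exponential approach to a limit, rising from chance) is an assumption, and the function name `sat_accuracy` and all parameter values are hypothetical.

```python
import math


def sat_accuracy(t_ms, intercept_ms, asymptote, rate_per_ms, chance=0.5):
    """Assumed speed-accuracy function: exponential approach to a limit.

    intercept_ms : lag after which accuracy begins to rise above chance
    asymptote    : maximum accuracy, approached but never reached
    rate_per_ms  : rate of approach to the asymptote (the "slope" parameter)
    """
    if t_ms <= intercept_ms:
        return chance
    growth = 1.0 - math.exp(-rate_per_ms * (t_ms - intercept_ms))
    return chance + (asymptote - chance) * growth


# Hypothetical parameter values, chosen only to show the shape of the curve.
for lag in (100, 150, 250, 400, 600):
    print(lag, round(sat_accuracy(lag, intercept_ms=150, asymptote=0.95,
                                  rate_per_ms=0.01), 3))
```

Under this form, accuracy stays at chance up to the intercept and has covered about 63% of the distance to the asymptote one time constant (1/rate) later.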

If the simultaneous processing hypothesis is correct, plots of discrimination accuracy as a function of on time will show (1) that accuracy for the three elements of the CVC increases simultaneously, i.e., the curves will overlap; and (2) that CVC pairs differing in two elements have higher asymptotes and steeper slopes than pairs differing in only one element. There may be intercept differences between the three elements due to the partial successiveness of the phonemes, but accuracy should nonetheless increase simultaneously. Alternatively, the strict serial model would predict no simultaneous increase in discrimination since complete recognition of one phoneme precedes recognition of later phonemes. Hence, there should be no overlap in the curves for the three elements. Figure 1 is a graphic example of a possible set of curves from each hypothesis.

It should be emphasized that the serial model which gives rise to the predictions in Fig. 1, referred to here as the strict serial model, is a very specific serial model. It assumes that each phoneme is processed sequentially in the order it occurs in the waveform with no overlap in processing of different phonemes, or partial parallelism. Even with these constraints, a strict serial model with large trial-to-trial variations in the finishing times for each phoneme processor could produce highly overlapping curves, and by the method used here could be indistinguishable from parallel processing. In fact there may well be many variations of serial processing that would account for overlap in the speed-accuracy curves, but the data will at least place constraints on the types of serial models possible.

B. Stimulus materials

Eight CVC nonsense syllables were spoken in a sound-treated room and recorded on a Sony TC650 two-channel tape recorder. The syllables were /pIp, pUp, pIz, pUz, gIp, gUp, gIz, gUz/ (/I/ as in "pit", /U/ as in "good"). The stimuli were digitized at a sample rate of 10 kHz and stored in a PDP-12 computer. All were truncated as near 300 msec as possible without distorting perception, the final durations ranging from 280 to 312 msec. Eight different on times were created by adding a 200-msec, 1000-Hz sine wave tone to variable initial portions of each syllable. On time values for each position are given in Table I. Subjects FP and MB were tested with the second set of on times shown in Table I in order to get a better estimate of the intercept. The tones at the two shortest on times for these two subjects were 1430 Hz, as different equipment was used. Since natural speech was desired, no attempt was made to control other acoustic variables.
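The construction of an interrupted stimulus can be sketched as follows. This is a minimal illustration, not the procedure actually run on the PDP-12: the function name `interrupt_with_tone` is hypothetical, the syllable is represented as a plain list of samples, and filling the gap with silence when the on time exceeds the syllable duration is an assumption. The 10-kHz sample rate, 200-msec tone duration, and 1000-Hz tone frequency are taken from the text.

```python
import math

SAMPLE_RATE_HZ = 10_000  # digitization rate stated in the text


def interrupt_with_tone(syllable, on_time_ms, tone_hz=1000.0, tone_ms=200.0,
                        amplitude=1.0):
    """Keep the first `on_time_ms` of the syllable, then replace the rest with a tone.

    The tone replaces the remainder of the syllable rather than being mixed
    with it, so no syllable samples survive after tone onset.
    """
    n_on = round(on_time_ms * SAMPLE_RATE_HZ / 1000)
    head = list(syllable[:n_on])
    # Assumption: silence fills the gap when the tone starts after the syllable ends.
    head += [0.0] * (n_on - len(head))
    n_tone = round(tone_ms * SAMPLE_RATE_HZ / 1000)
    tone = [amplitude * math.sin(2 * math.pi * tone_hz * i / SAMPLE_RATE_HZ)
            for i in range(n_tone)]
    return head + tone


# Dummy 300-msec "syllable" (constant samples) interrupted at a 75-msec on time.
dummy_syllable = [0.1] * 3000
stimulus = interrupt_with_tone(dummy_syllable, on_time_ms=75)
print(len(stimulus))  # 750 syllable samples followed by 2000 tone samples
```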

Figure 2 contains spectrograms of the eight syllables used. The ordinate is frequency, the abscissa is time marked off in terms of the on-time values that interrupted the syllables. For each syllable the onset times for each phoneme were estimated relative to the plosive release. The estimates were made separately from three different sets of spectrograms: two broadband and one narrow band. The plosive release marking the start of the syllable also signaled the onset of the initial consonant. Since the formant transitions carry information about both the consonant and the vowel, the first appearance of such transitions was taken as the onset of the acoustic information signaling the vowel. The first appearance of a systematic change in the vowel formants after reaching steady-state values (or following a rise toward a "target" steady state) was chosen as the onset time for the final consonant. Table II summarizes the results for each phoneme and gives mean values for each phoneme across syllables. In general, there is some error associated with measurements of the vowel and final consonant onsets, but in most cases one or more of the different sets of spectrograms contained sufficient cues to make a reasonable guess as to the onset of a given phoneme. The stimuli thus created were output at the same sample rate to a Sony TC650 recorder to generate a master tape. This master tape consisted of all stimuli at all on times. The experimental tapes were created by using Uher 10 000 and Teac 3340S recorders to record the appropriate pair for each trial from the master tape. The Uher was used to present the stimuli during the experiment. A PDP-15 computer was used to control the starting and stopping of the tape recorder in the experimental sessions and to record responses and display RT feedback to the subjects.

TABLE I. On-time durations in msec for each position. On times are identical for POS 1 (initial consonant) and POS 2 (vowel) but differ for POS 3 (final consonant) due to its temporally later occurrence. On times for POS 1 and 2 for subjects 4 and 5 were adjusted to give a better estimate of the intercept. Tones that occurred at on times less than 300 msec (those above the dotted line) interrupted presentation of CVC 2.

Condition    POS 1 and 2     POS 1 and 2    POS 3
             (KN, LK, SS)    (FP, MB)       (All S's)
1            75              25             125
2            100             50             150
3            150             75             175
4            225             100            225
5            300             150            300
6            375             225            375
7            450             300            450
8            600             450            600

TABLE II. Estimated onset times relative to the plosive release for the vowel (POS 2) and final consonant (POS 3) for each syllable, and the mean across syllables.

Syllable    POS 2    POS 3
/pUz/       36       90
/gUp/       45       162
/pUp/       81       144
/gIz/       45       90
/pIz/       45       123
/gIp/       63       145
/pIp/       72       145
/gUz/       63       162
Mean        56       133

FIG. 2. Spectrograms of the eight syllables used. Frequency is on the ordinate, the abscissa is time measured from plosive burst. On times up through 450 msec are shown by the vertical bars.

C. Design

Stimuli on the experimental tapes were blocked by the position of the differing element. Table III gives an example of each trial type for each position. Each block had a "target" element (position) and all different trials within a block differed in the target element. Two-different (2-diff) trials differed in the target element plus one other element. Subjects were informed of the target element prior to each block, and the 2-diff trials served the additional purpose of directing attention away from the target.

TABLE III. Trial types for each target position using the syllable /pIp/ as an example.

Target POS    Same     1-diff.    2-diff.
I             /pIp/    /gIp/      /gIz/ and /gUp/
II            /pIp/    /pUp/      /pUz/ and /gUp/
III           /pIp/    /pIz/      /pUz/ and /gIz/

For each position there were 8 1-diff and 16 2-diff pairs. To achieve an equal number of trial types at all on times, identical and 1-diff pairs occurred twice, and 2-diff pairs once at each on time. Thus, (16 pairs per trial type) × (3 trial types) × (8 on time conditions) yielded 384 trials per position. Order of presentation of specific pairs was randomized across 384 trials separately for each position.

Four experimental tapes were constructed by recording one block of 96 trials from the randomized list for each of the three positions onto a given tape. To safeguard against possible order effects the ordering of the positions was different on each tape, and each subject received a different ordering of tapes. Each tape consisted then of three blocks of 96 trials, one block for each target position, for a total of 384 trials. The interval between the two CVC's varied randomly from 500-1000 msec. Subjects heard each tape at least four times.

D. Subjects

Five paid volunteers from the subject pool of the Cognitive Laboratory at the University of Oregon served as subjects. The three women and two men were all nonstudents and were paid $ 9..00 per hour. All had normal hearing and no training in phonetics. They were told nothing of the purpose of the experiment until they completed all sessions.

E. Procedure

Subjects were given initial training sessions to familiarize them with the stimuli and to train them to respond as quickly as possible to the tone onset. Subjects were encouraged to finish responding before the offset of the tone and told that they were expected to make mistakes, especially at the short on times. Though subjects were instructed to be as accurate as possible, the primary emphasis was on speed. If a given response exceeded 225 msec the word "faster" was displayed on a scope in front of the subject for two seconds prior to the beginning of the following trial. Responses exceeding 500 msec and responses completed before the onset of the tone were not recorded. All subjects required from two to four training sessions (1152-2304 trials) before they were responding consistently to all tones.

The procedure for the training sessions was identical to that for the experimental sessions. On each trial the word "ready" appeared on the scope in front of the subject for 500 msec, followed 1-1.5 sec later by the stimulus pair, to which the subject responded "same" or "different." Following the response, the word "confidence" appeared on the scope. The subject then rated her/his confidence in the last response on a scale of 1-4: 1 = certainly wrong, 2 = probably wrong, 3 = probably right, 4 = certainly right. If RT was greater than 225 msec, the word "faster" appeared prior to the next trial; if not, "ready" appeared and another trial commenced. Following each block of 96 trials there was a short break while the subject was informed of the target element for the next block. A 10-min break was given after three blocks, and breaks could be lengthened if desired. Each session lasted approximately 90 min, including breaks. Each subject was tested individually, six blocks (576 trials) per day for 8-10 days over a three-week period.

II. RESULTS AND DISCUSSION

Figure 3 shows mean RT to tone onset, as a function of on time, for each subject at each position. There was a tendency for RT to decrease with increasing on time. Ideally, these curves should be flat, representing only the time required to initiate a response. Since there is no stimulus information available after the onset of the tone, accuracy scores could only be affected by increased processing time, not by additional stimulus information. The decrease in RT at the longer on times is, in part, a result of anticipation responses, while some of the remaining effect is almost certainly due to an interruption of processing at the short on times. These differences in RT are well within the range normally found with this paradigm (Reed, 1973; Dosher, 1976; Wickelgren and Corbett, 1977). Within subjects, RT's at a given on time did not vary as a function of trial type.

Figure 4 is a plot of the proportion of "same" responses as a function of on time. A third of the trials were "same" trials, and inspection of Fig. 4 reveals a slight bias to say "different" at the earliest on times for positions one and two. The bias is confined to the vowel and initial consonant and then only at on times of 150-225 msec or less. The reason for the bias is not entirely clear, but it is likely that interrupting the syllable at these early on times produces sounds that are not precisely speech sounds. Note also that the bias disappears soon after the vowel is physically present, and, though many factors could have contributed to this time course, it is consistent with the observation that some vowel information is necessary before sounds become "speechlike." Too, some of the bias must be due to the subjects having to respond "same" to stimuli which at these early times were physically quite different.

Figure 5 p10tsthe proportion correct on "same"trials as a function of on time, for each subject at all three positions. The curves for the three positions are virtually identical, suggesting that the particular target element

did not influence performance on "same" trials types. Differences might have resulted if subjects were successful in selectively attending to only the target element for

J. Acoust. Soc. Am., Vol. 62, No. 5, November 1977

Redistribution subject to ASA license or copyright; see http://acousticalsociety.org/content/terms. Download to IP: 131.91.169.193 On: Sun, 23 Nov 2014 10:43:17

1284

R. Remington: Processingof phonemes

25O

l'os ]

1284

250

200 RT

150

i

I00 0

,500

200 -

i

i

500

400

-

200

-

i

I

i

100

200

300

i ioo • 600

0

I

I

400

I

500

600

On-L J=e

FIG.

POS

250

i 500

3.

Mean

reaction

time

to the tone

on-

set is plotted as a function of on time for each subject and each position. o .... o FP;

3

m

toMB; A....

ß

ß LK.

AKN;•--

'--•SS;

150

l,

I

I

100

200

300

ioo o

I 400

I

I

500

600

½hl-t

100

200

•00

400

500

600 0

100

200

$00

400

500

600

I

,

MB

o•/

,

I

I

'

i

FIG. 4. Proportion of "same" responses plotted as a function of on time for each subject. The arrows indicate the true proportion of "same"

trials.

Points

above

the ar-

rows show a bias toward responding "same" while points below show a bias to respond "different." ©---.---© POS 1; o o POS

.5

$$

o•'

.4

2; A----A

• ß.5

•k• '"O'" 'O•"O•

,3

POS 3.

....

o/

,3 c

.2

'o"J

.2

0 0

'

'

*



'

100

200

300

400

500

•0 6•

On-time

0

I

I

I

I

100

200

300

400

I •0

6•

On-time

J. Acoust. Soc. Am., Vol. 62, No. 5, November 1977

Redistribution subject to ASA license or copyright; see http://acousticalsociety.org/content/terms. Download to IP: 131.91.169.193 On: Sun, 23 Nov 2014 10:43:17

1285

R. Remington:Processing of phonemes

0

I00

200

300

400

I'

I

I

i

,.o.e-

ß ;

ß

O•

500

600

1285

0

I00

200

300

400

500

I





!

I

i

• "-'"--T

/ / .9 p /

/

/

/

•/

600

•'

.

',; o

ß

,

i



I



I .

/.•_•o• o

/



FIG. 5. The proportion correct on "same" trials is plotted as a function of on time for each subject at each position. ©------©POS 1; o o POS 2; A----A POS 3.

a given block. There is no evidence for this selective attention in the data, as the results from the 2-diff trial types will further substantiate. This is in agreement with the Wood and Day (1974) results, and it seems doubtful that people can selectively attend to one element of CV groups.

Figure 6 plots proportion correct (Pc) as a function of on time + mean RT to tone onset. To reduce the effects of response bias, Pc was calculated by averaging the proportion correct on the "different" trial type of interest with the "same" trials from the same block.3 On time + mean RT gives a theoretically more accurate indication of the shape of the curves than does on time alone, since differences in RT to the tone at a given on time could produce different degrees of processing (Wickelgren, 1977). Even though there is no additional acoustic information after tone onset, differences in processing time between two different on times could distort the relative accuracy values of each. The on-time values corresponding to each of the eight points on a given curve can be found by referring to Table I.

A. Serial/Parallel

As the parallel processing hypothesis predicts, Fig. 6 shows discrimination accuracy improving simultaneously for the three positions, though the effect is far more pronounced for the vowel and initial consonant than for the final consonant. Also, for all subjects, a difference in the vowel led to equal or better performance at all on times than did a difference in the initial consonant. A Wilcoxon Matched-Pairs Signed Ranks test (Hays, 1973, p. 781) based on the number of points on the vowel curve exceeding their corresponding points on the consonant curves shows this effect to be significant (N = 35, z = 3.9; the p value is illegible in the scan).
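The signed-ranks statistic can be illustrated with a minimal sketch of the usual large-sample normal approximation. The paper does not list the paired values, so the differences below are hypothetical, the function name is illustrative, and no tie correction is applied to the variance; this is not a reconstruction of the author's computation.

```python
import math

def wilcoxon_signed_rank_z(diffs):
    """Return (n, T_plus, z) for paired differences using the normal approximation.

    Zero differences are dropped; tied absolute differences share average ranks.
    No tie correction is applied to the variance (a simplifying assumption).
    """
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    if n == 0:
        raise ValueError("all differences are zero")
    # Rank the absolute differences from smallest to largest.
    order = sorted(range(n), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    # T_plus is the sum of ranks attached to positive differences.
    t_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    mean_t = n * (n + 1) / 4
    sd_t = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return n, t_plus, (t_plus - mean_t) / sd_t

# Hypothetical vowel-minus-consonant accuracy differences, in percentage points.
example_diffs = [5, 3, -1, 8, 2, 6, -2, 4]
n, t_plus, z = wilcoxon_signed_rank_z(example_diffs)
print(n, t_plus, round(z, 2))
```

For N = 35 pairs, as reported above, the null mean of the rank sum is 35 x 36 / 4 = 315 and its standard deviation is about 61.05, so a z of 3.9 corresponds to a rank sum far above chance expectation.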

FIG. (number illegible in the scan). Compares 2-diff trial types for POS 3 which contained a vowel difference (POS 2 and POS 3) to those which differed in initial and final consonant (POS 1 and POS 3). 1-diff trial types are plotted for reference. Curves: 1-DIFF; POS 2 and 3 DIFF; POS 1 and 3 DIFF. Abscissa: on time + RT.

FIG. 8. Compares 2-diff trials for POS 1 which contained a vowel difference (POS 1 and POS 2) to those which differed in initial and final consonants (POS 1 and POS 3). 1-diff trials for POS 1 are plotted for reference. Curves: 1-DIFF; POS 1 and 2 DIFF; POS 1 and 3 DIFF. Abscissa: on time + RT.

FIG. 10. Compares performance on 1-diff and 2-diff trial types for POS 2. Curves: 1-DIFF; 2-DIFF.
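The accuracy measure used for the speed-accuracy curves, Pc, is described in the text as the average of the proportion correct on the "different" trial type of interest and on the "same" trials from the same block. A minimal sketch of that averaging follows; the function name and the trial counts are hypothetical and serve only to show why the average reduces the effect of response bias.

```python
def proportion_correct(correct_different, n_different, correct_same, n_same):
    """Average the hit rates on "different" and "same" trials from one block."""
    p_different = correct_different / n_different
    p_same = correct_same / n_same
    return (p_different + p_same) / 2

# A subject who always answers "different" is perfect on "different" trials
# and wrong on every "same" trial, so the averaged measure stays at chance.
biased = proportion_correct(20, 20, 0, 20)

# A subject with real sensitivity scores above chance on both trial types.
sensitive = proportion_correct(15, 20, 20, 20)
print(biased, sensitive)
```

With this measure a pure bias toward one response cannot raise Pc above 0.5, which is why 0.5 serves as the chance level when the intercepts are estimated.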

TABLE IV. Estimated speed-accuracy intercepts for each subject and each position, as well as mean values for each position across subjects. The onset time for each position is subtracted from the mean intercept to yield the onset-time-adjusted (OTA) intercept. All values are in msec.

                    POS 1              POS 2              POS 3
S's             Intercept   RT     Intercept   RT     Intercept   RT
FP                 250      160       200      160       325      159
MB                 225      170       212      168       325      181
KN                 300      209       300      215       375      219
LK                 275      182       300      183       350      188
SS                 250      191       275      192       350      185
Mean               260      182       257      183       345      188
Onset time           0                 56                132
OTA intercept      260                201                213

crease in discrimination accuracy for all three positions but substantial redundancy gain for at least POS 1 and POS 3. Even the curves for the initial and final consonants overlap to an extent.

Estimates of the on-time + RT intercept for each subject at each position were calculated from the curves in Fig. 6, and these data are displayed in Table IV. The intercepts were estimated by drawing a smooth curve through the points in Fig. 6, extrapolating to Pc = 0.5 when necessary. Table IV also gives the mean on time + RT for each phoneme averaged across subjects, with all values rounded to the nearest millisecond. For all subjects the intercept for the final consonant corresponds to a Pc for the initial consonant of between 0.7 and 0.85, with a mean of 0.73 across subjects. This is about 80% of the total accuracy for the initial consonant, suggesting that some processing of the initial consonant is still taking place as accuracy for the final consonant begins its rise.

Further, it was mentioned earlier in connection with the Wood and Day (1974) and Pisoni and Tash (1974) studies that redundancy gain is good evidence that information exists simultaneously at the level where the response decision is made. Were this the only source of parallelism the speed-accuracy curves might have looked quite different; some small but constant amount of information about a later phoneme might have been present over the interval where a preceding element was steadily approaching asymptote. But appreciable increases in the later element would occur only after asymptote had been reached on the earlier element. The curves for later elements then would look increasingly S shaped. In fact, though, accuracy for the three positions rises not only simultaneously, but in a similar fashion after intercept, implying that the parallelism is not confined to the decision level but extends to earlier levels of analysis--phonetic, acoustic, or both.

In contrast, the data contain no direct evidence for the strict serial model. The sequentiality that exists between the initial and final consonants almost certainly results from the temporal availability of information in the waveform. Acoustically, final-consonant information begins to appear between 90 and 162 msec after plosive release, with a mean of 133 msec across syllables. (See Table II.) At this point in time, discrimination accuracy for the initial consonant has reached about 80% of its final value. Though the two curves overlap, most of the information about the initial consonant has been extracted by the time acoustic information from the final consonant is available. The redundancy gain for the two positions, coupled with the lack of selective attention and the existing overlap, suggests that once final-consonant information is available in the waveform, phonetic processing proceeds in parallel for these elements.

The real problem for serial models is the highly overlapping curves for the initial consonant and vowel. As Table IV shows, the mean (on time + RT) intercept for the initial consonant is 260 msec, and 257 msec for the vowel. Not only are the intercepts nearly identical, but for subjects MB and FP, where the estimates were the most accurate since chance performance was actually achieved, the vowel intercept is substantially earlier than the initial consonant intercept (see Table IV). The intercept of the speed-accuracy curve gives the point where discrimination accuracy first begins to rise above chance, or, put slightly differently, the point at which acoustic information first becomes usable. These results suggest then that vowel information first appears in usable form simultaneous with, or before, consonant information despite its later physical occurrence. No degree of trial-to-trial variation in finishing times for a serial, sequential phoneme extraction model is likely to produce this pattern of results--more sophisticated serial models could, of course. If, for example, there were some early acoustic analysis of an interval of the waveform containing both the consonant and vowel (parallel acoustic analysis) and this served as input to a serial phoneme analyzer in such a way that some of the time the vowel was processed first, some of the time the initial consonant, the intercepts for the two could come out identical. In short, serial models can be made to handle the results, but the simplest account favors parallel, simultaneous extraction.

B. Vowels and components

The data permit some interesting observations about the interaction of acoustic information and discrimination accuracy. The intercept values in Table IV are in terms of on time + RT relative to syllable onset. Not only do they contain an on-time component reflecting the minimum amount of stimulus information necessary to produce above-chance responding (stimulus information and processing time are confounded in this study; however, this does not bear critically on the present discussion), but also an RT component, and a large effect of the difference in temporal availability of the acoustic information for each phoneme. If these last two components could be subtracted out, the remainder of the intercept value would reflect the lag between when the acoustic cues for a particular phoneme first impinged on peripheral auditory mechanisms and when they first became usable--in short, processing time. This processing time would be a measure of how efficiently the acoustic cues for the different elements are utilized, and it would be of some interest to see how this efficiency varied as a function of both the acoustic structure of a phoneme and its position in the syllable.

The procedure for estimating the onset time of acoustic information for a particular phoneme has been given above and the findings summarized in Table III. There

is certainly some error associated with these estimates but on the whole, the estimates from the three sets corresponded very closely and the values given are probably accurate to within ±20 msec. Subtracting the acoustic onset times from their respective intercepts for each subject yields the onset-time-adjusted (OTA) intercepts shown in Table IV. For the RT component, however, no value could be found that could justifiably be said to reflect only response-specific factors. For this reason the response component was not subtracted. Fortunately, as Table IV illustrates, mean RT across on times did not differ within subjects as a function of the position of the differing element, nor did the mean RT across on times and subjects--POS 1 = 182, POS 2 = 183, POS 3 = 188 msec. This suggests that the OTA intercepts are in error by some constant amount, so that while the absolute values of each are not important, the OTA intercept differences between phonemes are still meaningful.4

The OTA intercepts in Table IV show a pronounced position effect, the initial consonant requiring on the average 59 msec more time than the vowel and 47 msec more than the final consonant. Since there is no reliable estimate of the error in estimating either the onset times or the intercepts, no statistical analysis is really meaningful, but these differences look large compared to the 12-msec difference between the vowel and final consonant. What factors could underlie this effect? Perhaps the first element always suffers, since the temporal uncertainty associated with syllable onset would affect initial more than later phonemes. It is also possible that information about later phonemes is present before the acoustic cues appear, owing to anticipatory coarticulation effects.

In the present study, position in the syllable is confounded with acoustic structure, and it is quite likely that acoustic structure plays an important role in processing efficiency. The acoustic cues for stop consonants are spread out in time whereas vowels are characterized by steady-state periodic structures where most of the relevant information is present simultaneously. For half of the syllables, those ending with /z/, the final consonants also had some steady-state, periodic structure which could have contributed to their being nearly as efficiently processed as vowels.

Aside from the somewhat restricted evidence from this study, there are other reasons for believing that differences in acoustic structure may be related to processing time. It has been suggested by Liberman et al. (1967) that a gradient of encodedness for speech sounds exists, with vowels requiring the least encoding, stop consonants the most, and fricatives, laterals, and the rest falling in between. Additional evidence for this comes from studies that show that there are important differences in how vowels and consonants are processed (Shankweiler and Studdert-Kennedy, 1970). In general, vowel processing is considered to be heavily dependent on the acoustic-auditory analysis whereas consonant recognition depends critically on subsequent phonetic analysis (Studdert-Kennedy, 1976; Pisoni, 1976). Though Wood (1974) has evidence that processing at auditory and phonetic levels can proceed in parallel, there is general agreement that auditory processing begins earlier. All this suggests that vowel information may be available sooner than consonant information, a conclusion reached by Pisoni in 1972 on the basis of data from a backward-masking study. The findings from the present study are certainly consistent with this position. That is, the acoustic structure of the vowel permits its analysis at an earlier auditory level and this gives it a substantial time advantage over the highly encoded stop consonant even though in the waveform that consonant may physically precede the vowel. The superior performance of the vowel at even the very early on times in the present study could have arisen from such a processing advantage.

It must be kept in mind though that these results have been obtained from a matching-discrimination task, as have many of the findings cited above. Not only might recognition involve different mechanisms, but the particular discriminations involved may well influence the pattern of results achieved, though how this would have affected performance at the early on times in the present study is not clear. However, if the analysis presented above is correct and the speedier processing of the vowel makes its information available before that of a preceding stop consonant, it becomes something of a puzzle to determine how the system keeps track of the temporal order of events within a syllable, since the processing characteristics of the system have obscured the sequential information in the waveform.

1 This is clear in the Pisoni and Tash (1974) experiment where explicit comparisons were called for. If it is assumed that monitoring for specified targets involves an implicit comparison, then results from the Wood and Day (1975) study can be similarly accounted for.

2 This buffer would also contain additional linguistic as well as nonlinguistic information. It is assumed that all these sources of information are processed simultaneously and are fed into this buffer in parallel. This buffer is more than a storage area, though, since it would probably be endowed with an algorithm for integrating the information from these sources into a recognizable percept.

3 Originally, d' was to be used as the accuracy measure but performance in some conditions was too high to make its consistent use meaningful. Other measures used were Ht (Swensson, 1972), -log10 η (Luce, 1963), and log2 P(c). Though the shapes differ with different measures, the relationships between the curves in different conditions remain unaffected.

4 If mean RT across on times for each subject at each position is subtracted from the OTA intercept for that subject, the resultant values are POS 1 = 76 msec, POS 2 = 18 msec, POS 3 = 23 msec. These values probably underestimate by some constant amount the actual required processing time, though under some conditions a 20-msec vowel can be discriminated quite accurately. See Massaro, D. (1975). Experimental Psychology and Information Processing (Rand McNally, New York), Chap. 230.

5 This research was supported by a National Institute of General Medical Sciences Training Grant 5 TO1 GM 02165 BHS and by research grant BNS75-03145 from the National Science Foundation.
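The onset-time adjustment described in the discussion and in footnote 4 is simple arithmetic on the mean values of Table IV. A minimal sketch follows; the dictionary and function names are illustrative, and only the mean intercepts, onset times, and mean RTs are taken from the table.

```python
# Mean on time + RT intercepts, acoustic onset times, and mean RTs (msec), Table IV.
mean_intercept = {"POS 1": 260, "POS 2": 257, "POS 3": 345}
onset_time = {"POS 1": 0, "POS 2": 56, "POS 3": 132}
mean_rt = {"POS 1": 182, "POS 2": 183, "POS 3": 188}

def subtract_by_position(values, offsets):
    """Subtract a per-position offset from a per-position value."""
    return {pos: values[pos] - offsets[pos] for pos in values}

# Onset-time-adjusted (OTA) intercepts: intercept minus acoustic onset time.
ota = subtract_by_position(mean_intercept, onset_time)

# Position effect: extra time needed by the initial consonant.
extra_vs_vowel = ota["POS 1"] - ota["POS 2"]
extra_vs_final = ota["POS 1"] - ota["POS 3"]

# Rough processing-time estimate: OTA intercept minus mean RT.
ota_minus_rt = subtract_by_position(ota, mean_rt)

print(ota)             # {'POS 1': 260, 'POS 2': 201, 'POS 3': 213}
print(extra_vs_vowel)  # 59
print(extra_vs_final)  # 47
print(ota_minus_rt)    # {'POS 1': 78, 'POS 2': 18, 'POS 3': 25}
```

The last line gives 78, 18, and 25 msec from the rounded means, within 2 msec of the 76, 18, and 23 msec reported in footnote 4, where the subtraction was carried out subject by subject.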

Dosher, B. A. (1976). "The Retrieval of Sentences from Memory," Cogn. Psychol. 8, 291-310.

Fant, G. (1967). "Auditory Patterns of Speech," in Models for the Perception of Speech and Visual Form, edited by W. Wathen-Dunn (MIT, Cambridge, MA).

Foss, D. J., and Swinney, D. A. (1973). "On the Psychological Reality of the Phoneme: Perception, Identification, and Consciousness," J. Verb. Learn. Verb. Behav. 12, 246-257.

Fujimura, O., and Ochiai, K. (1963). "Vowel Identification and Phonetic Contexts," J. Acoust. Soc. Am. 35, 1889(A).

Garner, W. R. (1974). The Processing of Information and Structure (Lawrence Erlbaum, Potomac, MD).

Goldstein, L. M., and Lackner, J. R. (1973). "Alterations of the Phonetic Coding of Speech Sounds During Repetition," Cognition 2, 279-297.

Haggard, M. P. (1967). "Models and Data in Speech Perception," in Models for the Perception of Speech and Visual Form, edited by W. Wathen-Dunn (MIT, Cambridge, MA).

Hays, W. L. (1973). Statistics for the Behavioral Sciences (Holt, Rinehart, and Winston, New York).

Healy, A. F., and Cutting, J. E. (1976). "Units in Speech Perception: Phoneme and Syllable," J. Verb. Learn. Verb. Behav. 15, 73-84.

Ladefoged, P. (1967). Three Areas of Experimental Phonetics (Oxford U.P., London).

Liberman, A. M. (1970). "The Grammars of Speech and Language," Cogn. Psychol. 1, 301-323.

Liberman, A. M. (1970). "Some Characteristics of Perception in the Speech Mode," in Perception and Its Disorders, edited by P. Hamburg, K. Pribram, and A. J. Standard (Williams and Wilkins, Baltimore, MD).

Liberman, A. M., Cooper, F. S., Shankweiler, D. P., and Studdert-Kennedy, M. (1967). "Perception of the Speech Code," Psychol. Rev. 74, 431-461.

Liberman, A. M., Delattre, P. C., and Cooper, F. S. (1952). "The Role of Selected Stimulus Variables in the Perception of the Unvoiced Stop Consonants," Am. J. Psychol. 65, 497-516.

Link, S. W., and Heath, R. A. (1975). "A Sequential Theory of Psychological Discrimination," Psychometrika 40, 77-105.

Lockhead, G. R. (1972). "Processing Dimensional Stimuli: A Note," Psychol. Rev. 79, 410-419.

Lockhead, G. R. (1970). "Identification and the Form of Multidimensional Discrimination Space," J. Exp. Psychol. 85, 1-10.

Luce, R. D. (1963). "Detection and Recognition," in Handbook of Mathematical Psychology, edited by R. D. Luce, R. R. Bush, and E. Galanter (Wiley, New York), Vol. 1.

Massaro, D. (1975a). "Acoustic Features in Speech Perception," in Understanding Language, edited by D. W. Massaro (Academic, New York).

Massaro, D. (1975b). Experimental Psychology and Information Processing (Rand McNally, New York).

Pisoni, D. B. (1975). "Dichotic Listening and Processing Phonetic Features," in Cognitive Theory, edited by F. Restle, R. M. Shiffrin, N. J. Castellan, H. R. Lindman, and D. B. Pisoni (Lawrence Erlbaum, Hillsdale, NJ).

Pisoni, D. B. (1972). "Perceptual Processing Time for Consonants and Vowels," J. Acoust. Soc. Am. SA.

Pisoni, D. B., and McNabb, S. D. (1974). "Dichotic Interactions of Speech Sounds and Phonetic Feature Processing," Brain Lang. 1, 351-362.

Pisoni, D. B., and Tash, J. (1974). "'Same-Different' Reaction Times to Consonants, Vowels and Syllables," J. Acoust. Soc. Am. 55, 436(A).

Reed, A. V. (1973). "Speed-Accuracy Tradeoff in Recognition Memory," Science 181, 574-576.

Strange, W., Verbrugge, R., and Shankweiler, D. P. (1974). "Consonantal Environment Specified Vowel Identity," J. Acoust. Soc. Am. 55, S54(A).

Studdert-Kennedy, M. (1975). "Speech Perception," in Contemporary Issues in Experimental Phonetics, edited by N. J. Lass (Academic, New York).

Swensson, R. G. (1972). "The Elusive Trade-Off: Speed vs. Accuracy in Visual Discrimination Tasks," Percept. Psychophys. 12, 16-32.

Wickelgren, W. A. (1977). "Speed-Accuracy Tradeoff and Information Processing Dynamics," Acta Psychol. 41, 67-85.

Wickelgren, W. A., and Corbett, A. (1977). "Associative Interference and Retrieval Dynamics in Yes-No Recall and Recognition," J. Exp. Psychol.: Human Learning and Memory 3, 189-202.

Wood, C. C. (1975). "A Normative Model for Redundancy Gain in Speech Discrimination," in Cognitive Theory, edited by F. Restle, R. M. Shiffrin, N. J. Castellan, H. R. Lindman, and D. B. Pisoni (Lawrence Erlbaum, Hillsdale, NJ).

Wood, C. C. (1974a). "Parallel Processing of Auditory and Phonetic Information in Speech Perception," Percept. Psychophys. 15, 501-508.

Wood, C. C., and Day, R. S. (1975). "Failure of Selective Attention to Phonetic Segments in Consonant-Vowel Syllables," Percept. Psychophys. 17, 346-350.
