Developmental study of vowel formant frequencies in an imitation task.


R. D. Kent and L. L. Forner
Department of Communicative Disorders, University of Wisconsin, Madison, Wisconsin 53706
(Received 14 March 1978; revised 28 June 1978)

Imitations of ten synthesized vowels were recorded from 33 speakers including men, women, and children. The first three formant frequencies of the imitations were estimated from spectrograms and considered with respect to developmental patterns in vowel formant structure, uniform scale factors for vowel normalization, and formant variability. Strong linear effects were observed in the group data for imitations of most of the English vowels studied, and straight lines passing through the origin provided a satisfactory fit to linear F1-F2 plots of the English vowel data. Logarithmic transformations of the formant frequencies helped substantially to equalize the dispersion of the group data for different vowels, but formant scale factors were observed to vary somewhat with both formant number and vowel identity. Variability of formant frequency was least for F1 (s.d. of 60 Hz or less for English vowels of adult males) and about equal for F2 and F3 (s.d. of 100 Hz or less for English vowels of adult males).

PACS numbers: 43.70.Gr, 43.70.Bk, 43.70.Jt, 43.70.Ve

INTRODUCTION

A problem of long standing in acoustic phonetics is that of reconciling formant-frequency measurements of vowels with their phonetic equivalence. This problem was given classic portrayal by Peterson and Barney (1952) when they plotted the first and second formant-frequency measurements of ten vowels produced by 76 speakers, including men, women, and children (see Figs. 8 and 9 in their article). The vowel regions for the 76 talkers were characterized both by considerable spread of formant frequencies within a vowel category and by frequent overlap of formant frequencies across vowel categories. Some 25 years later, the phonetic identification of vowels from the formant frequencies of a heterogeneous sample of speakers continues to be a research challenge (see, for example, the recent papers by Lennig and Hindle, 1977; Sroka, 1977; Broad and Wakita, 1977).

The normalization of formant frequencies for the purpose of demonstrating vowel equivalence has several possible complications. Fant (1966) noted that the vocal tracts of adult males differ from those of women and children in having a proportionately longer pharyngeal tube, and this anatomical disparity makes questionable the application of constant scale or correction factors. Another problem in the assignment of scale factors is that these factors may vary with vowel identity or with formant-frequency number (Fant, 1966). Even the data that have been used in the study of vowel normalization have been questioned because of possible dialectal differences among the speakers (Nordstrom and Lindblom, 1975; Lennig and Hindle, 1977). A number of solutions to the problem of vowel normalization have been proposed (Mol, 1963; Fujisaki and Kawashima, 1968; Gerstman, 1968; Nordstrom and Lindblom, 1975; Broad, 1976; Lennig and Hindle, 1977; Sroka, 1977), but for the most part these procedures have not been applied to a data base representing a specified wide range of speaker ages. For example, Peterson and Barney (1952) did not specify the ages of the children in their sample of speakers; nonetheless, their study remains one of the few sources of formant-frequency data for a large and heterogeneous set of talkers.

Furthermore, the available acoustic data on children's speech are not adequate to describe the details of developmental changes in the formant frequencies of different vowels. Although vowel formant frequencies for various ages of children have been reported (e.g., Eguchi and Hirsh, 1969), generalization and interpretation of the results are limited by the fact that the vowels were produced in varying phonetic contexts. Therefore, acoustic documentation of developmental changes in vowel production requires additional formant-frequency data for several age groups of children, preferably for isolated vowels or vowels embedded in a fixed phonetic context. It also is desirable to control for dialectal differences in vowel production, for these differences may confound the problem of vowel normalization.

This paper presents data on the imitation of ten synthesized formant patterns representing five English vowels and five other vowels chosen to sample the F2-F3 space. Synthesized vowels were used as stimuli because their acoustic characteristics could be specified exactly to satisfy the intended formant patterns and because control could be exercised over such factors as vowel duration, fundamental frequency contour, and diphthongization. The 33 speakers who served as subjects represented both sexes and ranged in age from 4 years to young adulthood. The chief purposes of this report are to (1) describe developmental variations in vowel formant frequencies, (2) evaluate possibilities for vowel normalization by uniform scaling factors, and (3) determine intersubject and intrasubject variability in the formant frequencies of vowel imitations. Another purpose of the experiment was to investigate vocal imitation as a developing sensorimotor skill, but this issue has been discussed in previous reports (Kent, 1978 and in press).

J. Acoust. Soc. Am. 65(1), Jan. 1979; 0001-4966/79/010208-10$00.80; © 1979 Acoustical Society of America

I. EXPERIMENTAL PROCEDURES

A. Subjects

The 33 speakers were grouped by age and sex as follows: five young adult men, four young adult women, five 12-year-old girls, five 6-year-old boys, five 6-year-old girls, three 4-year-old boys, and six 4-year-old girls. Actually, four 4-year-old boys participated in the study, but one boy was excluded from the analyses because his unusually high fundamental frequency resulted in strong formant-harmonic interactions in the spectrograms of his vowels. Because the primary purpose of this study was to obtain formant-frequency data for a variety of speakers, it was not considered essential to have an equal number of subjects in each group.

B. Stimuli

The ten synthesized vowels had formant-frequency ranges for F1, F2, and F3 bounded roughly by the mean values that Peterson and Barney (1952) recorded for adult males' productions of the vowels [i u a æ ɝ]. In fact, five of the synthesized vowels were modeled after Peterson and Barney's mean data for these English vowels. The other five vowels did not necessarily have a phonemic identity in English (most were variably labeled in a preliminary task of identification by phonetically trained listeners) and were selected because they gave a fairly uniform sampling of F1-F2-F3 space. Because the stimuli presented for imitation were not necessarily English vowels, the subjects had to attend to the actual character of each vowel to make a satisfactory imitation. None of the subjects, even in the youngest age group, appeared to have any difficulty with this task. The first three formant frequencies of the stimuli are listed in Table I and depicted in Fig. 1.

The stimuli had a duration of 250 ms and were shaped with an amplitude rise-fall time of 30 ms. The fundamental frequency pattern was a linear ramp falling from 130 to 105 Hz. The vowels were synthesized with five formants but only the first three were variable in frequency (F4 and F5 were fixed at 3500 and 4500 Hz, respectively). The bandwidths of the five formants were, in ascending order of formant number, 50, 70, 110, 170, and 250 Hz. Because a series model (Klatt, 1972) was used in synthesizing the vowels, the relative formant amplitudes were determined by the formant frequencies.

A total of 50 synthetic vowel stimuli were presented for imitation, including five tokens of each vowel stimulus. The stimuli were recorded in a quasi-random order, separated by a silent interval of approximately 12 s.

TABLE I. F1, F2, and F3 frequencies (in Hz) of the synthetic vowel targets used in the imitation experiment.

                      F1     F2     F3
English vowels
  [i]                270   2290   3010
  [æ]                660   1720   2410
  [a]                730   1090   2440
  [u]                300    870   2240
  [ɝ]                490   1350   1690
Arbitrary vowels
  1                  695   1400   2425
  2                  460   1915   2515
  3                  285   1580   2625
  4                  435    855   2325
  5                  400   1450   2300

FIG. 1. F1-F2 plot showing ten imitation stimuli (filled circles) and the mean imitation responses of five men (crosses). The center of each cross indicates the mean F1 and F2 values for the vowel and the length of each arm represents 1 standard deviation (i.e., the horizontal arms are ±1 s.d. for F1 and the vertical arms are ±1 s.d. for F2). The stimuli represented are the English vowels [i u a æ ɝ] and the arbitrary vowels 1-5.

C. Instructions to subjects

The subjects were asked to say the same sounds that they would hear from a loudspeaker but they were informed that they should use their own natural voice pitch. With some coaching, even the youngest children understood this instruction. A short sample of the tape was played before the imitations were recorded to acquaint the subjects with the nature of the stimuli. One tape recorder was used for playback of the stimuli and another was used to record the subjects' imitations. The children were told they would be given candy or a toy at the end of the experiment.

D. Acoustic analysis

Formant frequencies of the vowel imitations were estimated from spectrograms made with a Kay Elemetrics 7029A Sona-Graph equipped with 45-, 300-, and 500-Hz analyzing filters and a frequency counter. For any given speaker, the spectrograms were made with the filter that yielded the optimum formant resolution. Narrowband section displays also were obtained. The formant frequencies were identified by locating the approximate center of formant bars on wideband (300 or 500 Hz) spectrograms or by fitting smooth curves to the narrowband (45 Hz) amplitude sections. The two analyses were compared to arrive at a final estimate of the formant frequencies. All acoustic measurements were made by the first author. The reliability of measurement, determined for two groups of


speakers, adult males and 4-year-old children, was calculated from repeated formant-frequency measures for selected spectrograms. The standard deviation s_D of the measurement differences, the mean difference D̄, and the standard error of the mean difference s_D̄ were determined for the first three formant frequencies of 30 spectrograms for each group. The following formulas (Snedecor and Cochran, 1967) were used to compute s_D, s_D̄, and a Student's t, the latter to determine if the mean differences were different from 0:

$$ s_D = \left[ \frac{\sum (D_i - \bar{D})^2}{n-1} \right]^{1/2}, \qquad s_{\bar{D}} = \frac{s_D}{n^{1/2}}, \qquad t = \frac{\bar{D}}{s_{\bar{D}}}, \quad \text{with } (n-1) \text{ d.f.}, $$

where n is the number of remeasured spectrograms and D_i is the difference between any original and replicate measurement.

The sample mean difference, standard deviation of the differences, and standard error of the mean differences were as follows for the adult males: F1: 13, 54, 11 Hz; F2: 19, 63, 13 Hz; F3: -8, 62, 13 Hz. For the 4-year-old children, the same statistics were: F1: 15, 59, 11 Hz; F2: -13, 68, 13 Hz; F3: -6, 52, 10 Hz. None of the t values reached significance at the 0.10 level, indicating that the differences between replicate measurements were essentially randomly distributed around zero. Larger measurement errors are expected for the children than for the adults, given the age-dependent decline in fundamental frequency and the fact that the hypothetical error in the estimation of formant frequencies from spectrograms is equal to or greater than f0/4 (Lindblom, 1972).

The accuracy of formant-frequency measurement also was assessed by comparing the spectrographic measurements with measurements derived from a linear prediction analysis (Markel, 1971). The linear prediction analysis, which became available late in the study, was used to determine the formant frequencies for three of the 12-year-old girls. The statistics D̄, s_D, and s_D̄, computed with the formulas given above, were as follows for the comparison of spectrographic and linear prediction analyses for the first three formant frequencies: subject 12-a: F1 (-4, 70, 11 Hz), F2 (-2, 88, 13 Hz), F3 (3, 81, 14 Hz); subject 12-b: F1 (-17, 45, 6 Hz), F2 (17, 96, 14 Hz), F3 (-10, 81, 13 Hz); subject 12-c: F1 (-5, 73, 9 Hz), F2 (-34, 115, 16 Hz), F3 (-58, 128, 19 Hz). These values are based on at least 90% of the imitations recorded for each subject; the remainder were eliminated from consideration, usually because the linear prediction analysis missed a formant or identified one spuriously. Of the nine t values, three were significant at the 0.05 level, with the LPC measures being lower in frequency than the spectrographic measures. These error estimates probably are liberal in that they include several sources of variability, including differences in the sampling points for the analysis as well as differences in the techniques of analysis. That is, although measures for both analyses generally were made for the last half of a vowel segment, the measures did not apply to exactly the same instant within the vowel. Differences in sampling point can contribute to large differences in formant frequency whenever the productions are diphthongized, as sometimes happened with [æ], for example.

II. RESULTS AND DISCUSSION

A. Isovowel lines in linear F1-F2 and F2-F3 plots

The correspondence between the ten imitation targets and the group means for imitations of these targets by adult males can be judged from the F1-F2 plot in Fig. 1. The F1-F2 values of the stimuli are represented by the filled circles and the F1-F2 means of the imitations are given by the intersections of the crossed lines. The horizontal and vertical lines that form each cross are ± one standard deviation for F1 and F2, respectively. Generally speaking, the F1-F2 means for the men subjects closely approximate the F1-F2 values of the stimuli. The major discrepancy between the targets and the imitation means is an overall shift in the vowel quadrilateral of the imitation responses to the right, in the direction of increased F1 frequency.

The first two formant frequencies for imitations of the English vowels [i u a æ] by all 33 subjects are shown in the scattergram of Fig. 2. Each point represents the mean of five imitations by one of the 33 speakers. Data for [ɝ] are not included because they overlapped extensively with the data for the other vowels. With the exception of the results for [æ], the points are arranged in elongated clusters oriented diagonally in the F1-F2 plane. Because it appeared that a straight line would fit the clusterings for [i u a], straight-line regression was attempted using the method of least squares.

FIG. 2. Scattergram of the F1 and F2 frequencies for imitations of the English vowels [i u a æ] by 33 men, women, and children. Each data point represents the mean for five imitations of the vowel by a given speaker.
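As an illustration of the least-squares fitting described above, the following minimal sketch fits a straight line to F1-F2 pairs and reports the r² coefficient of determination. It is not the authors' procedure or software; the function name `least_squares_line` and the five (F1, F2) pairs are hypothetical and serve only to show the arithmetic.

```python
def least_squares_line(x, y):
    """Fit y = a + b*x by ordinary least squares.

    Returns (a, b, r2): intercept, slope, and coefficient of determination.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products of deviations from the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept
    r2 = (sxy * sxy) / (sxx * syy)
    return a, b, r2


# Hypothetical speaker means (Hz) for one vowel; NOT data from the study.
f1 = [280.0, 310.0, 350.0, 400.0, 430.0]
f2 = [2300.0, 2650.0, 2900.0, 3150.0, 3300.0]

a, b, r2 = least_squares_line(f1, f2)
print(f"intercept = {a:.1f} Hz, slope = {b:.2f}, r2 = {r2:.2f}")
```

Here X plays the role of an F1 value and Y of an F2 value, one pair per speaker, so a high r² corresponds to the elongated diagonal clusters seen in the scattergram.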


The results are depicted in Fig. 3, which shows the regression line determined for each vowel ([ɝ] included) as well as the r² coefficient of determination. With the exception of the result for [æ], the r² values are larger than 0.5, which means that the least-squares lines account for more than half of the variability for four of the five vowels. Thus, linear effects are substantial in the F1-F2 variability for four of the five vowels.

FIG. 3. Linear regression determined by the method of least squares for the F1 and F2 frequencies of imitations of the English vowels [i u a æ ɝ] by the 33 speakers. The r² coefficient of determination is shown for each regression line (legible labels: [i], .77; [a], .70; [u], .59).

It was concluded in earlier reports based on partial analyses of the current data that straight lines through the origin provide a satisfactory fit to group means for each of the English vowels [i u a æ ɝ] (Kent, 1978). To test this possibility statistically, a t test described by Snedecor and Cochran (1967) was used to evaluate the null hypothesis that a straight line determined for each English vowel passes through the origin. The formula for the test of significance is

$$ t = \frac{\bar{Y} - b\bar{X}}{s_{y \cdot x}\left(1/n + \bar{X}^2/\sum x^2\right)^{1/2}}, \quad \text{with } (n-2) \text{ d.f.}, $$

where b is the least-squares estimate of the slope β in the linear model

$$ Y = \alpha + \beta X + \varepsilon: \qquad b = \frac{\sum xy}{\sum x^2}, $$

and $s_{y \cdot x}^2$ is the residual mean square (lowercase x and y denoting deviations from the sample means, following Snedecor and Cochran's notation). In these equations X is taken to be an F1 value and Y is taken to be an F2 value.

If a straight line through the origin is a satisfactory fit, then the linear model given above reduces to Y = βX + ε, where β is the slope of the regression line and ε is a normally distributed random variable. None of the t values reached significance (p > 0.30), so the null hypothesis that the lines pass through the origin cannot be rejected. The estimate of the slope β for each vowel is: [i], 7.11; [u], 2.48; [a], 1.545; [æ], 2.65; and [ɝ], 2.79. These lines are shown together with the F1-F2 means for each age-sex group in Fig. 4.

FIG. 4. F1-F2 regression lines for the English vowels [i u a æ ɝ] according to the linear equation Y = βX + ε. The straight lines, all of which pass through the origin, are determined with b = ΣXY/ΣX² as the estimate of the slope β, where X is an F1 value and Y is an F2 value.

This result confirms Mol's (1963) suggestion that straight lines passing through the origin are a good approximation to the F1-F2 values for different vowels produced by a variety of speakers. Mol interpreted the lines passing through the origin as evidence for a principle of "axial growth," in which development of the vocal tract has a uniformity that is reflected by regular changes in the formant patterns of different vowels. Perhaps the most useful aspect of Mol's scheme is that it allows the vowel spaces of different speakers to be compared along fixed vectors in F1-F2 space, with an easy extrapolation for subjects who fall outside (longer or shorter vocal tracts) of an existing subject set. One area of application is the acoustic examination of vowels produced by persons with speech disorders, as the rays shown in Fig. 4 offer a quick, although approximate, test of the adequacy of a speaker's vowel F1-F2 patterns.

A scattergram of the F2 and F3 frequencies for imitations of the five English vowels is given in Fig. 5. As in Fig. 2, each point is the mean for five imitations of each vowel by one of the 33 speakers (except for vowel [ɝ], for which two of the 4-year-olds could not make a satisfactory phonemic production as judged by the investigators). Regression lines again were determined by the method of least squares, and the results are illustrated in Fig. 6. The r² coefficients of determination are larger than 0.70 for each vowel except [u] (for which Fig. 5 shows a rather unsystematic variation in a narrow F2 range). Thus, for four of the five vowels, the regression lines account for 70% or more of the variance in the group data. Calculation of the linear regression of F3 on combined F1 and F2 resulted in small increases in the r² value (increments of 0.02 or less) except for vowels [u] and [æ], which had increments of 0.23 and 0.11, respectively. Thus, for the F2-F3 values, as with the F1-F2 values, linear effects are strong in the group formant data.

FIG. 6. Linear regression determined by the method of least squares for the F2 and F3 frequencies of imitations of the English vowels [i u a æ ɝ] by the 33 speakers. The r² coefficient of determination is shown for each regression line.

B. Formant scaling and logarithmic transformations

Attempts at vowel normalization by uniform scaling typically have involved the derivation of constant speaker-dependent scale factors that are used to transform the F1 and F2 frequencies (Fant, 1966). The success of a multiplicative scale factor rests upon a formant-frequency transformation that yields approximately equal dispersions for different vowels (Lennig and Hindle, 1977). Uniform scaling by multiplication may be regarded as a process of adding a speaker-dependent constant to logarithmically transformed F1 and F2 frequencies (Nearey, 1977). Thus, the potential success of uniform scaling procedures can be gauged by determining the extent to which vowel dispersion is equalized in a log-transformed F1-F2 space.

Figure 7 shows the log-transformed mean values of the F1 and F2 frequencies for the seven age-sex groups producing the English vowels [i u a æ ɝ] and arbitrary vowels 1 and 2. Comparison of this log-transformed F1-F2 plot with the linear F1-F2 plot (Fig. 4) shows that the transformation does indeed make the vowel dispersions more nearly the same. For example, whereas the dispersion for [u] is markedly compressed relative to that for [i] in the linear F1-F2 plot, the dispersions for these two vowels are nearly equal in Fig. 7. The effect of the logarithmic transformation on the F2 and F3 frequencies is illustrated in Fig. 8. This log plot offers a somewhat equivalent dispersion for the five vowels, but the remaining differences probably are not negligible (for example, notice the compression for [ɝ] relative to [æ]).

FIG. 7. Logarithmic transformations of the group F1-F2 means for imitations of the five English vowels and the two arbitrary vowels numbered 1, 2. The subject groups are men, women, 12-year-old girls, 6-year-old boys, 6-year-old girls, 4-year-old boys, and 4-year-old girls. Lines have been drawn to connect the mean data points for each vowel. (Horizontal axis: 10 ln F1.)

These results with a logarithmic transformation of formant frequencies support Lennig and Hindle's (1977) conclusion that a logarithmic transformation can greatly reduce the differences in dispersion among vowels, although this transformation may not yield an exact equivalence of dispersion. The outcome perhaps is sufficiently satisfying to hold promise for uniform scaling factors, which will be considered again in the next section of this paper. Further work is needed to determine if the logarithmic transformation also serves to segregate the F1-F2 values by vowel category. This issue was considered by Lennig and Hindle (1977).
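The reasoning above, that a multiplicative scale factor becomes an additive constant after a logarithmic transformation, can be checked numerically. The sketch below is illustrative only. It takes the F2 targets for [i] and [u] from Table I, applies a set of hypothetical speaker scale factors (an assumption, not values estimated in this study), and compares the dispersion of the scaled values in hertz with the dispersion in 10 ln F units, the unit suggested by the axis label of Fig. 7.

```python
import math
import statistics

# F2 targets (Hz) for two English vowels, taken from Table I.
TARGET_F2_HZ = {"i": 2290.0, "u": 870.0}

# Hypothetical uniform scale factors, one per imaginary speaker (assumption).
SCALE_FACTORS = [1.00, 1.15, 1.30, 1.45]


def scaled(frequency_hz, factors):
    """Apply each speaker's multiplicative scale factor to one formant value."""
    return [k * frequency_hz for k in factors]


def log_units(values_hz):
    """Transform frequencies to 10 ln F."""
    return [10.0 * math.log(v) for v in values_hz]


linear_sd = {}
log_sd = {}
for vowel, f2 in TARGET_F2_HZ.items():
    values = scaled(f2, SCALE_FACTORS)
    linear_sd[vowel] = statistics.stdev(values)          # dispersion in Hz
    log_sd[vowel] = statistics.stdev(log_units(values))  # dispersion in 10 ln F

for vowel in TARGET_F2_HZ:
    print(f"[{vowel}] s.d. = {linear_sd[vowel]:.1f} Hz, "
          f"{log_sd[vowel]:.3f} in 10 ln F units")
```

Under strictly uniform scaling the linear dispersions differ in proportion to the formant frequency itself, whereas the log-transformed dispersions are identical for the two vowels. Residual differences in log-transformed data, like those noted for Fig. 8, therefore indicate departures from uniform scaling.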


[u] > [•] > [æ] (they did not report data for [a]).

tions yield the series [æ] > [u] > [a] > [i], whereas for the


D. Formant variability

The variability of the F1, F2, and F3 frequencies in vowel imitation is influenced by several factors, including the error of measurement (which varies with a speaker's fundamental frequency), the precision of a subject's vowel articulation, and the subject's familiarity with the imitation target. Because these factors are difficult to tease apart, the current report focuses on variability measures for the adult male subjects. The men are expected to have the smallest variability of formant frequencies for two major reasons: (1) they have the lowest fundamental frequencies of all the age-sex groups tested, so measurement error in the estimation of the formant frequencies should be minimal,
