Developmental study of vowel formant frequencies in an imitation task
R. D. Kent and L. L. Forner
Department of Communicative Disorders, University of Wisconsin, Madison, Wisconsin 53706
(Received 14 March 1978; revised 28 June 1978)
Imitations of ten synthesized vowels were recorded from 33 speakers including men, women, and children. The first three formant frequencies of the imitations were estimated from spectrograms and considered with respect to developmental patterns in vowel formant structure, uniform scale factors for vowel normalization, and formant variability. Strong linear effects were observed in the group data for imitations of most of the English vowels studied, and straight lines passing through the origin provided a satisfactory fit to linear F1-F2 plots of the English vowel data. Logarithmic transformations of the formant frequencies helped substantially to equalize the dispersion of the group data for different vowels, but formant scale factors were observed to vary somewhat with both formant number and vowel identity. Variability of formant frequency was least for F1 (s.d. of 60 Hz or less for English vowels of adult males) and about equal for F2 and F3 (s.d. of 100 Hz or less for English vowels of adult males).
PACS numbers: 43.70.Gr, 43.70.Bk, 43.70.Jt, 43.70.Ve
INTRODUCTION
A problem of long standing in acoustic phonetics is that of reconciling formant-frequency measurements of vowels with their phonetic equivalence. This problem was given classic portrayal by Peterson and Barney (1952) when they plotted the first and second formant-frequency measurements of ten vowels produced by 76 speakers, including men, women, and children (see Figs. 8 and 9 in their article). The vowel regions for the 76 talkers were characterized both by considerable spread of formant frequencies within a vowel category and frequent overlap of formant frequencies across vowel categories. Some 25 years later, the phonetic identification of vowels from the formant frequencies of a heterogeneous sample of speakers continues to be a research challenge (see, for example, the recent papers by Lennig and Hindle, 1977; Sroka, 1977; Broad and Wakita, 1977). The normalization of formant frequencies for the purpose of demonstrating vowel equivalence has several
possible complications. Fant (1966) noted that the vocal tracts of adult males differ from those of women and children in having a proportionately longer pharyngeal tube, and this anatomical disparity makes questionable the application of constant scale or correction factors. Another problem in the assignment of scale factors is that these factors may vary with vowel identity or with formant-frequency number (Fant, 1966). Even the data that have been used in the study of vowel normalization have been questioned because of possible dialectal differences among the speakers (Nordstrom and Lindblom, 1975; Lennig and Hindle, 1977). A number of solutions to the problem of vowel normalization have been proposed (Mol, 1963; Fujisaki and Kawashima, 1968; Gerstman, 1968; Nordstrom and Lindblom, 1975; Broad, 1976; Lennig and Hindle, 1977; Sroka, 1977), but for the most part these procedures have not been applied to a data base representing a specified wide range of speaker ages. For example, Peterson and Barney (1952) did not specify the ages of the children in their sample of speakers; nonetheless, their study remains one of the few sources of formant-frequency data for a large and heterogeneous set of talkers.

Furthermore, the available acoustic data on children's speech are not adequate to describe the details of developmental changes in the formant frequencies of different vowels. Although vowel formant frequencies for various ages of children have been reported (e.g., Eguchi and Hirsh, 1969), generalization and interpretation of the results are limited by the fact that the vowels were produced in varying phonetic contexts. Therefore, acoustic documentation of developmental changes in vowel production requires additional formant-frequency data for several age groups of children, preferably for isolated vowels or vowels embedded in a fixed phonetic context. It also is desirable to control for dialectal differences in vowel production, for these differences may confound the problem of vowel normalization.

This paper presents data on the imitation of ten synthesized formant patterns representing five English vowels and five other vowels chosen to sample the F1-F2-F3 space. Synthesized vowels were used as stimuli because their acoustic characteristics could be specified exactly to satisfy the intended formant patterns and because control could be exercised over such factors as vowel duration, fundamental frequency contour, and diphthongization. The 33 speakers who served as subjects represented both sexes and ranged in age from 4 years to young adulthood. The chief purposes of this report are to (1) describe developmental variations in vowel formant frequencies, (2) evaluate possibilities for vowel normalization by uniform scaling factors, and (3) determine intersubject and intrasubject variability in the formant frequencies of vowel imitations. Another purpose of the experiment was to investigate vocal imitation as a developing sensorimotor skill, but this issue has been discussed in previous reports (Kent, 1978 and in press).

I. EXPERIMENTAL PROCEDURES
A. Subjects
J. Acoust. Soc. Am. 65(1), Jan. 1979
The 33 speakers were grouped by age and sex as follows: five young adult men, four young adult women, five 12-year-old girls, five 6-year-old boys, five 6-
0001-4966/79/010208-10$00.80 © 1979 Acoustical Society of America
Redistribution subject to ASA license or copyright; see http://acousticalsociety.org/content/terms. Download to IP: 130.88.53.18 On: Sun, 07 Dec 2014 13:50:08
ulus. The stimuli were recorded in a quasi-random order, separated by a silent interval of approximately 12 s.
TABLE I. F1, F2, and F3 frequencies (in Hz) of the synthetic vowel targets used in the imitation experiment.

Vowel                 F1     F2     F3
English vowels
  [i]                270   2290   3010
  [æ]                660   1720   2410
  [a]                730   1090   2440
  [u]                300    870   2240
  [ɝ]                490   1350   1690
Arbitrary vowels
  1                  695   1400   2425
  2                  460   1915   2515
  3                  285   1580   2625
  4                  435    855   2325
  5                  400   1450   2300

C. Instructions to subjects
The subjects were asked to say the same sounds that they would hear from a loudspeaker, but they were informed that they should use their own natural voice pitch. With some coaching, even the youngest children understood this instruction. A short sample of the tape was played before the imitations were recorded to acquaint the subjects with the nature of the stimuli. One tape recorder was used for playback of the stimuli and another was used to record the subjects' imitations. The children were told they would be given candy or a toy at the end of the experiment.
year-old girls, three 4-year-old boys, and six 4-year-old girls. Actually, four 4-year-old boys participated in the study, but one boy was excluded from the analyses because his unusually high fundamental frequency resulted in strong formant-harmonic interactions in the spectrograms of his vowels. Because the primary purpose of this study was to obtain formant-frequency data for a variety of speakers, it was not considered essential to have an equal number of subjects in each group.
D. Acoustic analysis
Formant frequencies of the vowel imitations were estimated from spectrograms made with a Kay Elemetrics 7029A Sona-Graph equipped with 45-, 300-, and 500-Hz analyzing filters and a frequency counter. For any given speaker, the spectrograms were made with the filter that yielded the optimum formant resolution. Narrowband section displays also were obtained. The formant frequencies were identified by locating the approximate center of formant bars on wideband (300 or 500 Hz) spectrograms or by fitting smooth curves to the narrowband (45 Hz) amplitude sections. The two analyses were compared to arrive at a final estimate of the formant frequencies. All acoustic measurements were made by the first author. The reliability of measurement, determined for two groups of

B. Stimuli
The ten synthesized vowels had formant-frequency ranges for F1, F2, and F3 bounded roughly by the mean values that Peterson and Barney (1952) recorded for adult males' productions of the vowels [i u a æ ɝ]. In fact, five of the synthesized vowels were modeled after Peterson and Barney's mean data for these English vowels. The other five vowels did not necessarily have a phonemic identity in English (most were variably labeled in a preliminary task of identification by phonetically trained listeners) and were selected because they gave a fairly uniform sampling of F1-F2-F3 space. Because the stimuli presented for imitation were not necessarily English vowels, the subjects had to attend to the actual character of each vowel to make a satisfactory imitation. None of the subjects, even in the youngest age group, appeared to have any difficulty with this task. The first three formant frequencies of the stimuli are listed in Table I and depicted in Fig. 1. The stimuli had a duration of 250 ms and were shaped with an amplitude rise-fall time of 30 ms. The fundamental frequency pattern was a linear ramp falling from 130 to 105 Hz. The vowels were synthesized with five formants but only the first three were variable in frequency (F4 and F5 were fixed at 3500 and 4500 Hz, respectively). The bandwidths of the five formants were, in ascending order of formant number, 50, 70, 110, 170, and 250 Hz. Because a series model (Klatt, 1972) was used in synthesizing the vowels, the relative formant amplitudes were determined by the formant frequencies.
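As an illustration of the stimulus specification above, the following minimal Python sketch passes an impulse train, whose rate falls linearly from 130 to 105 Hz, through five cascaded two-pole resonators. The 10-kHz sampling rate, the impulse-train source, the linear amplitude ramps, and all function names are assumptions made for illustration; the original stimuli were produced with the series synthesizer cited in the text, whose implementation details are not reproduced here.

```python
import math

FS = 10000  # sampling rate in Hz (assumed; not stated in the text)

def resonator_coeffs(freq, bw, fs=FS):
    """Coefficients (a, b, c) of a two-pole resonator with unity gain at 0 Hz."""
    c = -math.exp(-2.0 * math.pi * bw / fs)
    b = 2.0 * math.exp(-math.pi * bw / fs) * math.cos(2.0 * math.pi * freq / fs)
    a = 1.0 - b - c
    return a, b, c

def resonate(x, freq, bw, fs=FS):
    """Filter x with y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    a, b, c = resonator_coeffs(freq, bw, fs)
    y1 = y2 = 0.0
    out = []
    for s in x:
        y = a * s + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize(f1, f2, f3, dur=0.250, rise_fall=0.030,
               f0_start=130.0, f0_end=105.0, fs=FS):
    """Cascade (series) synthesis of one vowel target; returns a list of samples."""
    n = int(round(dur * fs))
    # Impulse-train source: emit a pulse each time the accumulated phase wraps.
    source, phase = [], 0.0
    for i in range(n):
        f0 = f0_start + (f0_end - f0_start) * i / (n - 1)
        phase += f0 / fs
        if phase >= 1.0:
            phase -= 1.0
            source.append(1.0)
        else:
            source.append(0.0)
    # F4 and F5 are fixed; bandwidths follow the values given in the text.
    formants = [(f1, 50.0), (f2, 70.0), (f3, 110.0), (3500.0, 170.0), (4500.0, 250.0)]
    signal = source
    for freq, bw in formants:
        signal = resonate(signal, freq, bw, fs)
    # Linear rise and fall of 30 ms at the two ends (ramp shape assumed).
    ramp = int(round(rise_fall * fs))
    for i in range(ramp):
        gain = i / ramp
        signal[i] *= gain
        signal[n - 1 - i] *= gain
    return signal

# Table I target for [i]: F1 = 270, F2 = 2290, F3 = 3010 Hz
vowel_i = synthesize(270.0, 2290.0, 3010.0)
```

In a cascade arrangement of this kind no separate amplitude control is given to each formant, which is consistent with the remark that the relative formant amplitudes follow from the formant frequencies.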
A total of 50 synthetic vowel stimuli were presented for imitation, including five tokens of each vowel stim-
FIG. 1. F1-F2 plot showing the ten imitation stimuli (filled circles) and the mean imitation responses of five men (crosses). The center of each cross indicates the mean F1 and F2 values for the vowel and the length of each arm represents 1 standard deviation (i.e., the horizontal arms are ±1 s.d. for F1 and the vertical arms are ±1 s.d. for F2). The stimuli represented are the English vowels [i u a æ ɝ] and the arbitrary vowels 1-5.
R. D. Kent and L. L. Forner: Study of vowel formants
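The measurement-reliability statistics of Sec. I D (mean difference, standard deviation of the differences, standard error of the mean difference, and Student's t on n - 1 degrees of freedom) can be sketched in Python as follows. The function name and the sample values are hypothetical and are not data from the study.

```python
import math

def paired_difference_stats(original, replicate):
    """Return (mean_diff, sd_diff, se_mean_diff, t) for paired measurements."""
    diffs = [a - b for a, b in zip(original, replicate)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Corrected sum of squares: sum(D^2) - (sum D)^2 / n
    ss = sum(d * d for d in diffs) - sum(diffs) ** 2 / n
    sd = math.sqrt(ss / (n - 1))      # standard deviation of the differences
    se = sd / math.sqrt(n)            # standard error of the mean difference
    t = mean_d / se                   # compare with Student's t on n - 1 d.f.
    return mean_d, sd, se, t

# Hypothetical repeated F1 estimates in Hz (not values from the paper)
first_pass = [310.0, 640.0, 720.0, 455.0, 500.0]
second_pass = [300.0, 655.0, 700.0, 460.0, 480.0]
mean_d, sd, se, t = paired_difference_stats(first_pass, second_pass)
```

A t value that is small relative to the critical value for n - 1 degrees of freedom indicates that the replicate differences scatter around zero rather than showing a systematic bias.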
speakers, adult males and 4-year-old children, was calculated for repeated formant-frequency measures for selected spectrograms. The standard deviation $s_D$ of the measurement differences, the mean difference $\bar{D}$, and the standard error of the mean difference $s_{\bar{D}}$ were determined for the first three formant frequencies of 30 spectrograms for each group. The following formulas (Snedecor and Cochran, 1967) were used to compute $s_D$, $s_{\bar{D}}$, and a Student's t, the latter to determine if the mean differences were different from 0:

$$ s_D = \left[ \frac{\sum D_i^2 - \left(\sum D_i\right)^2 / n}{n - 1} \right]^{1/2}, \qquad s_{\bar{D}} = s_D / n^{1/2}, \qquad t = \bar{D} / s_{\bar{D}}, \quad \text{with } (n-1) \text{ d.f.}, $$

where n is the number of remeasured spectrograms and $D_i$ is the difference between any original and replicate measurement.

The sample mean difference, standard deviation of the differences, and standard error of the mean differences were as follows for the adult males: F1: 13, 54, 11 Hz; F2: 19, 63, 13 Hz; F3: -8, 62, 13 Hz. For the 4-year-old children, the same statistics were: F1: 15, 59, 11 Hz; F2: -13, 68, 13 Hz; F3: -6, 52, 10 Hz. None of the t values reached significance at the 0.10 level, indicating that the differences between replicate measurements were essentially randomly distributed around zero. Larger measurement errors are expected for the children than for the adults, given the age-dependent decline in fundamental frequency and the fact that the hypothetical error in the estimation of formant frequencies from spectrograms is equal to or greater than f0/4 (Lindblom, 1972).

The accuracy of formant-frequency measurement also was assessed by comparing the spectrographic measurements with measurements derived from a linear prediction analysis (Markel, 1971). The linear prediction analysis, which became available late in the study, was used to determine the formant frequencies for three of the 12-year-old girls. The statistics $\bar{D}$, $s_D$, and $s_{\bar{D}}$, computed with the formulas given above, were as follows for the comparison of spectrographic and linear prediction analyses for the first three formant frequencies: subject 12-a: F1 (-4, 70, 11 Hz), F2 (-2, 88, 13 Hz), F3 (3, 81, 14 Hz); subject 12-b: F1 (-17, 45, 6 Hz), F2 (17, 96, 14 Hz), F3 (-10, 81, 13 Hz); subject 12-c: F1 (-5, 73, 9 Hz), F2 (-34, 115, 16 Hz), F3 (-58, 128, 19 Hz). These values are based on at least 90% of the imitations recorded for each subject; the remainder were eliminated from consideration, usually because the linear prediction analysis missed a formant or identified one spuriously. Of the nine t values, three were significant at the 0.05 level, with the LPC measures being lower in frequency than the spectrographic measures. These error estimates probably are liberal in that they include several sources of variability, including differences in the sampling points for the analysis as well as differences in the techniques of analysis. That is, although measures for both analyses generally were made for the last half of a vowel segment, the measures did not apply to exactly the same instant within the vowel. Differences in sampling point can contribute to large differences in formant frequency whenever the productions are diphthongized, as sometimes happened with [æ], for example.

II. RESULTS AND DISCUSSION
A. Isovowel lines in linear F1-F2 and F2-F3 plots
The correspondence between ten imitation targets and the group means for imitations of these targets by adult males can be judged from the F1-F2 plot in Fig. 1. The F1-F2 values of the stimuli are represented by the filled circles and the F1-F2 means of the imitations are given by the intersections of the crossed lines. The horizontal and vertical lines that form each cross are ± one standard deviation for F1 and F2, respectively. Generally speaking, the F1-F2 means for the men subjects closely approximate the F1-F2 values of the stimuli. The major discrepancy between the targets and the imitation means is an overall shift in the vowel quadrilateral of the imitation responses to the right, in the direction of increased F1 frequency.

The first two formant frequencies for imitations of the English vowels [i u a æ] by all 33 subjects are shown in the scattergram of Fig. 2. Each point represents the mean of five imitations by one of the 33 speakers. Data for [ɝ] are not included because they overlapped extensively with the data for the other vowels. With the exception of the results for [æ], the points are arranged in elongated clusters oriented diagonally in the F1-F2 plane. Because it appeared that a straight line would fit the clusterings for [i u a], straight-line regression was attempted using the method of least squares. The
FIG. 2. Scattergram of the F1 and F2 frequencies for imitations of the English vowels [i u a æ] by 33 men, women, and children. Each data point represents the mean for five imitations of the vowel by a given speaker.
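A least-squares line for one vowel's F1-F2 cluster, its r² coefficient of determination, and a t statistic for the hypothesis that the line passes through the origin can be sketched in Python as follows. The intercept standard error used here is the standard textbook form, assumed to correspond to the Snedecor and Cochran (1967) test cited in the text; the function names and data points are hypothetical, not values from the study.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of Y on X with an intercept test against zero."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)                    # sum of x^2 (deviations)
    syy = sum((y - my) ** 2 for y in ys)                    # sum of y^2 (deviations)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # sum of xy (deviations)
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = sxy ** 2 / (sxx * syy)
    resid_ms = (syy - sxy ** 2 / sxx) / (n - 2)             # residual mean square
    se_intercept = math.sqrt(resid_ms * (1.0 / n + mx ** 2 / sxx))
    t_origin = intercept / se_intercept                     # Student's t, n - 2 d.f.
    return {"slope": slope, "intercept": intercept, "r2": r2,
            "t_origin": t_origin, "df": n - 2}

def slope_through_origin(xs, ys):
    """Slope of the line constrained to pass through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Hypothetical speaker means for one vowel, F1 and F2 in kHz
f1 = [0.27, 0.30, 0.33, 0.36, 0.40]
f2 = [2.30, 2.50, 2.85, 3.00, 3.40]
result = fit_line(f1, f2)
ray_slope = slope_through_origin(f1, f2)
```

If the t statistic is not significant, the constrained slope gives a single ray through the origin for the vowel.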
results are depicted in Fig. 3, which shows the regression line determined for each vowel ([ɝ] included) as well as the r² coefficient of determination. With the exception of the result for [æ], the r² values are larger than 0.5, which means that the least-squares lines account for more than half of the variability for four of the five vowels. Thus, linear effects are substantial in the F1-F2 variability for four of the five vowels.

It was concluded in earlier reports based on partial analyses of the current data that straight lines through the origin provide a satisfactory fit to group means for each of the English vowels [i u a æ ɝ] (Kent, 1978). To test this possibility statistically, a t test described by Snedecor and Cochran (1967) was used to evaluate the null hypothesis that a straight line determined for each English vowel passes through the origin. The formula for the test of significance is

$$ t = \frac{\bar{Y} - b\bar{X}}{s_{y \cdot x} \left( 1/n + \bar{X}^2 / \sum x^2 \right)^{1/2}}, \quad \text{with } (n-2) \text{ d.f.}, $$

where b is the least squares estimate of the slope β in the linear model

$$ Y = \alpha + \beta x + \epsilon: \qquad b = \sum xy / \sum x^2, $$

and $s_{y \cdot x}^2$ is the residual mean square. Lowercase x and y denote deviations from the sample means. In these equations X is taken to be an F1 value and Y is taken to be an F2 value.

If a straight line through the origin is a satisfactory fit, then the linear model given above reduces to Y = βX + ε, where β is the slope of the regression line and ε is a normally distributed random variable. None of the t values reached significance (p > 0.30), so the null hypothesis that the lines pass through the origin cannot be rejected. The estimate of the slope β for each vowel is: [i], 7.11; [u], 2.48; [a], 1.545; [æ], 2.65; and [ɝ], 2.79. These lines are shown together with the F1-F2 means for each age-sex group in Fig. 4. This result confirms Mol's (1963) suggestion that straight lines passing through the origin are a good approximation to the F1-F2 values for different vowels produced by a variety of speakers. Mol interpreted the lines passing through the origin as evidence for a principle of "axial growth," in which development of the vocal tract has a uniformity that is reflected by regular changes in the formant patterns of different vowels. Perhaps the most useful aspect of Mol's scheme is that it allows the vowel spaces of different speakers to be compared along fixed vectors in F1-F2 space, with an easy extrapolation for subjects who fall outside (longer or shorter vocal tracts) of an existing subject set. One area of application is the acoustic examination of vowels produced by persons with speech disorders, as the rays shown in Fig. 4 offer a quick, although approximate, test of the adequacy of a speaker's vowel F1-F2 patterns.

A scattergram of the F2 and F3 frequencies for imitations of the five English vowels is given in Fig. 5. As in Fig. 2, each point is the mean for five imitations of each vowel by one of the 33 speakers (except for vowel [ɝ], for which two of the 4-year-olds could not make a satisfactory phonemic production as judged by the investigators). Regression lines again were determined by the method of least squares, and the results are illustrated in Fig. 6. The r² coefficients of determination are larger than 0.70 for each vowel except [u] (for which Fig. 5 shows a rather unsystematic variation in a narrow F2 range). Thus, for four of the five vowels, the regression lines account for 70% or more of the variance in the group data. Calculation of the linear regression of F3 on combined F1 and F2 resulted in small increases in the r² value (increments of 0.02 or less) except for vowels [u] and [æ], which had increments of 0.23 and 0.11, respectively. Thus, for the F2-F3 values, as with the F1-F2 values, linear effects are strong in the group formant data.

FIG. 3. Linear regression determined by the method of least squares for the F1 and F2 frequencies of imitations of the English vowels [i u a æ ɝ] by the 33 speakers. The r² coefficient of determination is shown for each regression line.

FIG. 4. F1-F2 regression lines for the English vowels [i u a æ ɝ] according to the linear equation Y = βX + ε. The straight lines, all of which pass through the origin, are determined with b = ΣXY/ΣX² as the estimate of the slope β, where X is an F1 value and Y is an F2 value.

FIG. 6. Linear regression determined by the method of least squares for the F2 and F3 frequencies of imitations of the English vowels [i u a æ ɝ] by the 33 speakers. The r² coefficient of determination is shown for each regression line.

B. Formant scaling and logarithmic transformations
Attempts at vowel normalization by uniform scaling typically have involved the derivation of constant speaker-dependent scale factors that are used to transform the F1 and F2 frequencies (Fant, 1966). The success of a multiplicative scale factor rests upon a formant-frequency transformation that yields approximately equal dispersions for different vowels (Lennig and Hindle, 1977). Uniform scaling by multiplication may be regarded as a process of adding a speaker-dependent constant to logarithmically transformed F1 and F2 frequencies (Nearey, 1977). Thus, the potential success of uniform scaling procedures can be gauged by determining the extent to which vowel dispersion is equalized in a log-transformed F1-F2 space.

Figure 7 shows the log-transformed mean values of the F1 and F2 frequencies for the seven age-sex groups producing the English vowels [i u a æ ɝ] and arbitrary vowels 1 and 2. Comparison of this log-transformed F1-F2 plot with the linear F1-F2 plot (Fig. 4) shows that the transformation does indeed make the vowel dispersions more nearly the same. For example, whereas the dispersion for [u] is markedly compressed relative to that for [i] in the linear F1-F2 plot, the dispersions for these two vowels are nearly equal in Fig. 7. The effect of the logarithmic transformation on the F2 and F3 frequencies is illustrated in Fig. 8. This log plot offers a somewhat equivalent dispersion for the five vowels, but the remaining differences probably are not negligible (for example, notice the compression for [ɝ] relative to [æ]).

These results with a logarithmic transformation of formant frequencies support Lennig and Hindle's (1977) conclusion that a logarithmic transformation can greatly reduce the differences in dispersion among vowels, although this transformation may not yield an exact equivalence of dispersion. The outcome perhaps is sufficiently satisfying to hold promise for uniform scaling factors, which will be considered again in the next section of this paper. Further work is needed to determine if the logarithmic transformation also serves to segregate the F1-F2 values by vowel category. This issue was considered by Lennig and Hindle (1977).

FIG. 7. Logarithmic transformations of the group F1-F2 means for imitations of the five English vowels and the two arbitrary vowels numbered 1, 2. The subject groups are men, women, 12-year-old girls, 6-year-old boys, 6-year-old girls, 4-year-old boys, and 4-year-old girls. Lines have been drawn to connect the mean data points for each vowel.
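The equivalence noted in Sec. II B, that multiplying formant frequencies by a speaker-dependent factor amounts to adding a constant to their logarithms, can be checked with a short Python sketch. The scale factor of 1.2 and the helper names are hypothetical; the two reference patterns are the Table I targets for [u] and [i].

```python
import math

def linear_shift(reference, scaled):
    """Per-formant displacement in Hz."""
    return [s - r for r, s in zip(reference, scaled)]

def log_shift(reference, scaled):
    """Per-formant displacement in natural-log frequency."""
    return [math.log(s) - math.log(r) for r, s in zip(reference, scaled)]

K = 1.2                   # hypothetical speaker-dependent scale factor
U_REF = (300.0, 870.0)    # Table I target for [u]: F1, F2 in Hz
I_REF = (270.0, 2290.0)   # Table I target for [i]: F1, F2 in Hz

# Uniform scaling: every formant of every vowel is multiplied by K.
u_scaled = [K * f for f in U_REF]
i_scaled = [K * f for f in I_REF]

# In Hz the displacement grows with the formant frequency itself,
# so it differs between vowels and between F1 and F2.
lin_u = linear_shift(U_REF, u_scaled)
lin_i = linear_shift(I_REF, i_scaled)

# After a log transform every displacement equals log(K).
log_u = log_shift(U_REF, u_scaled)
log_i = log_shift(I_REF, i_scaled)
```

Under this idealized scaling the F2 displacement for [i] in Hz is much larger than that for [u], whereas the log displacements are identical, which is the sense in which a log plot can equalize dispersion across vowels.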
[u] > [•] > [æ] (they did not report data for [a]).
tions yield the series [æ] > [u] > [a] > [i], whereas for the
D. Formant variability
The variability of the F1, F2, and F3 frequencies in vowel imitation is influenced by several factors, including the error of measurement (which varies with a speaker's fundamental frequency), the precision of a subject's vowel articulation, and the subject's familiarity with the imitation target. Because these factors are difficult to tease apart, the current report focuses on variability measures for the adult male subjects. The men are expected to have the smallest variability of formant frequencies for two major reasons: (1) they have the lowest fundamental frequencies of all the age-sex groups tested, so measurement error in the estimation of the formant frequencies should be minimal,