DURATION CHARACTERISTICS OF ESOPHAGEAL SPEECH
JOHN M. CHRISTENSEN
Idaho State University, Pocatello
BERND WEINBERG
Purdue University, West Lafayette, Indiana
The durations of a large number of representative vowels produced by 10 esophageal and nine normal speakers were measured. Overall vowel durations of esophageal speakers were consistently longer than those of normal speakers, indicating that esophageal speakers do not compensate for their striking diminution in air supply for speech by decreasing vowel duration. The differences in the vowel duration characteristics between normal and esophageal speakers were observed to vary systematically as a function of the voicing features of their consonant environments. Specifically, the durations of vowels of esophageal speakers spoken within voiceless consonant environments were consistently longer than those spoken in similar contexts by normal speakers. There were no significant differences between the average durations of vowels spoken by normal and esophageal speakers within voiced consonant environments. The observation that the durations of vowels produced by esophageal speakers differed significantly as a function of the voicing features of their consonant context was interpreted to support the belief that inherent, rule-governed durational features of English are retained following laryngeal amputation.

Until recently, most research on esophageal speech has been concerned with the measurement of acoustic and perceptual correlates of esophageal phonation. This emphasis can be attributed to the widespread assumption that the principal speech features affected by total laryngectomy surgery are those related to the vibratory source (Damste, 1958). It is now generally agreed that the average voice fundamental frequency of male esophageal speakers is about 65 Hz, that variation in average fundamental frequency among esophageal speakers is sizable, and that average fundamental frequency levels differ with the sex of esophageal speakers (Weinberg and Bennett, 1972).
The literature presents a contradictory picture regarding the effects that total laryngectomy surgery has on the articulatory characteristics of esophageal speakers. For example, Damste (1958, p. 3) has suggested that "the rest of the vocal tract (the pharyngeal and oral cavities) behaves substantially the same in both normal and esophageal speech. For that reason, . . . phonetic events in this region undergo no change." By contrast, the data of Diedrich and Youngstrom (1966) indicate that there are marked differences in overall vocal tract length,
pharyngeal cavity size and shape, and duration of articulatory motions of the tongue, lips, and velum between esophageal and normal speakers. At the acoustic level, the observation is that average vowel formant frequencies of esophageal speakers are consistently higher than those for normal speakers. This suggests that primary consequences of total laryngectomy include a shortening of the vocal tract and, by inference, altered articulatory behavior (Sisty and Weinberg, 1972). These findings, coupled with additional observations of speech intelligibility problems among esophageal speakers, suggest that total laryngectomy produces substantial changes in articulatory maneuvers in esophageal speakers. A fundamental purpose of the present project was to explore additional dimensions of articulatory change occasioned by laryngeal amputation. Since laryngeal amputation necessitates the creation of a permanent respiratory stoma and produces a functional separation between the patient's respiratory airway and the vocal and digestive tracts, laryngectomized speakers employing esophageal speech are deprived of their normal pulmonary air supply for speech purposes. We hypothesized that esophageal speakers might compensate for their reduced respiratory supply for speech by decreasing phonetic duration. In other words, esophageal speakers were expected to decrease vowel duration in an attempt to conserve their limited esophageal air supply for speech and, thereby, maximize the efficiency of their newly modified speech production apparatus. A second fundamental area of inquiry was to determine whether selected, inherent, rule-governed durational features of English are retained following laryngeal amputation. In this regard, it is well known that vowel durations are conditioned by the voicing features of their consonant contexts. 
Moreover, vowel duration lengthening in voiced consonant environments and shortening in voiceless contexts is believed to represent an inherent, rule-governed phonological feature of English (House, 1961; Stevens, House, and Paul, 1966). Accordingly, comparisons were also made of the differences in vowel durations of esophageal speakers occurring as a function of the voicing features of their consonant environment. These comparisons were completed to assess the hypothesis that important, rule-governed durational features of English are not altered by total laryngectomy.

METHOD
Subjects
Vowel duration measures were obtained from 10 adult male laryngectomized speakers who had used esophageal speech as their sole method of oral communication for more than one year. The two investigators rated each speaker in terms of speech acceptability (Shipp, 1967) and vocal effectiveness (Curry and Snidecor, 1961). All subjects were judged to have above-average to excellent esophageal speech and can be compared to highly rated speakers studied by Weinberg and Bennett (1972).
Speech Materials and Recordings
High-quality tape recordings were made of both the esophageal and normal speakers uttering vowels spoken within various consonant environments. The stimulus materials were 32 symmetric CVC syllables (for example, /pip/). The syllables were formed by combining eight consonants (/p/, /t/, /k/, /b/, /d/, /g/, /s/, /z/) with four vowels (/i/, /I/, /a/, /u/). The four vowels were chosen because they sample a wide range of articulatory positions within vowel space and they provide a reasonable sampling of important secondary acoustic characteristics of vowels (Peterson and Barney, 1952; House and Fairbanks, 1953; Tiffany, 1953; Stevens et al., 1966). The eight consonants were selected to provide a representative sampling of voiced-voiceless cognate pairs, varying manners of production, and three characteristic places of articulation. The recordings were made in an anechoic chamber. The stimuli were recorded within the sentence frame "___ is a word." The stimulus materials embedded within the sentence frame were read to each subject by the investigator from randomized lists. The lists were organized to form seven randomized repetitions for each CVC syllable. Subjects were instructed to speak each sentence frame at a conversational rate, to produce sentences in a natural manner, and to stress the initial CVC monosyllable of each sentence. Analyses of vowel durations were made on the second, third, fourth, fifth, and sixth sentence recordings. Thus, 1600 CVC utterances were available for analysis for the esophageal talkers; 1440 CVC utterances for the normal talkers.
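The stimulus inventory and the utterance totals follow directly from the design described above, and a short sketch makes the arithmetic explicit. The Python below is illustrative only and is not part of the original procedure; the function names and the ASCII vowel symbols are assumptions introduced here.

```python
from itertools import product

# Consonants and vowels named in the text (ASCII stand-ins for the phonetic symbols).
CONSONANTS = ["p", "t", "k", "b", "d", "g", "s", "z"]
VOWELS = ["i", "I", "a", "u"]

# Only the second through sixth of the seven recorded repetitions were analyzed.
ANALYZED_REPETITIONS = 5


def symmetric_cvc_syllables():
    """Return every CVC syllable whose initial and final consonants match."""
    return [c + v + c for c, v in product(CONSONANTS, VOWELS)]


def utterances_available(n_speakers):
    """Speakers x syllables x analyzed repetitions."""
    return n_speakers * len(symmetric_cvc_syllables()) * ANALYZED_REPETITIONS


if __name__ == "__main__":
    print(len(symmetric_cvc_syllables()))  # 32 syllables
    print(utterances_available(10))        # 1600 for the esophageal talkers
    print(utterances_available(9))         # 1440 for the normal talkers
```

The totals of 1600 and 1440 reported in the text correspond to 10 esophageal and nine normal talkers, respectively.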
Listening Procedures
A group of listeners was asked to evaluate the recordings of the two groups of talkers to ensure that the vowels and their syllable consonant environments were phonetically representative. Ten listeners completed broad phonetic transcriptions of the recordings made by the 10 laryngectomized subjects, and five listeners transcribed the recordings made by the normal speakers. Fewer listeners were used in the latter evaluation since it was assumed that the transcription of utterances produced by normal speakers would not be as difficult. The five listeners who evaluated the recordings of the normal subjects were among the 10 listeners used to evaluate the recordings of the esophageal speakers. All listeners had extensive training in phonetics, were experienced in participating in psychoacoustic and speech perception research, and had extensive background in evaluating speech. The responses of these listeners were used to select representative vowels for durational analysis. The stimulus materials used in the listening experiment were the high-quality recordings of each esophageal and normal speaker producing five repetitions of the 32 CVC utterances spoken within a carrier sentence. A master listening tape of these utterances was prepared by randomizing the utterances of each speaker and separating each stimulus sentence by a five-second silent interval.
680 Journal of Speech and Hearing Research 19 678-689 1976
Stimuli were presented by means of a loudspeaker using a high-quality tape recorder-amplifier-speaker system (TEAC A-1200U Tape Loop Repeater, Dynaco amplifier and preamplifier, and an Electro Voice Speaker Model SP-12). Signal levels were adjusted for comfortable listening conditions. The task of the listener was to phonetically transcribe the CVC elements of each sentence-initial syllable spoken within each carrier sentence on a response form. Each listener evaluated the recordings individually and was free to listen to each syllable as long as necessary by allowing any given stimulus to recur on the loop repeater.
Selection of Vowels for Duration Analysis
The selection of vowels for duration analysis was based on two criteria: (1) that 80% or more of the listeners identified a given vowel sample within a CVC utterance as the vowel intended by the talker, and (2) that 80% or more of the listeners identified both the initial and final consonants surrounding a given vowel as the consonants intended by the talker.
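The two selection criteria can be expressed as a simple decision rule. The sketch below is one possible reading and is not the authors' procedure: it assumes that a listener satisfies criterion (2) only when both consonants are transcribed as intended, and the names `Transcription` and `is_representative` are hypothetical.

```python
from dataclasses import dataclass

THRESHOLD_PERCENT = 80  # agreement level stated in the two criteria


@dataclass(frozen=True)
class Transcription:
    """One broad transcription of a CVC syllable (hypothetical structure)."""
    initial: str
    vowel: str
    final: str


def is_representative(intended, responses, threshold_percent=THRESHOLD_PERCENT):
    """Apply both criteria to one utterance.

    Assumption: criterion (2) counts a listener only if the initial AND the
    final consonant both match the intended consonants.
    """
    n = len(responses)
    if n == 0:
        return False
    vowel_hits = sum(r.vowel == intended.vowel for r in responses)
    consonant_hits = sum(
        r.initial == intended.initial and r.final == intended.final
        for r in responses
    )
    # Integer comparison avoids floating-point edge cases at exactly 80%.
    vowel_ok = 100 * vowel_hits >= threshold_percent * n
    consonants_ok = 100 * consonant_hits >= threshold_percent * n
    return vowel_ok and consonants_ok
```

With 10 listeners, for example, an utterance would pass under this reading if at least eight listeners identified the vowel and at least eight identified both consonants.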
Vowel Duration Measurements
Spectral analysis techniques were used to obtain measurements of duration of vowels selected for analysis. Specifically, broad-band spectrograms and amplitude tracings were made of each of the CVC utterances meeting representativeness criteria (Voice Print-Model 700). The symmetrical structure of the test syllables facilitated the identification of vowels and created a situation which fostered improved reliability of the durational measurements. The specific measurement criteria used parallel those suggested by Peterson and Lehiste (1960). For example, the initiation of vowels following word-initial voiceless plosives was determined by identifying the time of voice onset, that is, the onset of phonation. In this case, the initial vertical striation present in the broad-band spectrogram was used to identify the onset of the vowel. In the case of CVC syllables containing word-initial voiced plosives, vowel measurements were made from the center of the burst spike and included the frication period as part of the vowel duration measurement. The initiation of word-final voiceless plosives was determined by observing an abrupt cessation of energy typically associated with all formants. The termination of vowels preceding word-final voiced plosives was determined by identifying the point at which this abrupt reduction in the intensity associated with the formants occurred (Peterson and Lehiste, 1960). It is well known that the terminal boundaries of word-initial fricatives are rather easily identified on broad-band spectrograms (Peterson and Lehiste, 1960). Accordingly, the onset of the vowel following word-initial voiceless fricatives was determined by identifying the onset of phonation as reflected by the onset of periodicity, that is, by labeling the initial vertical striation in the region of the first formant. In the case of word-initial voiced fricatives the
abrupt termination of the superimposed noise served to identify vowel onset. Word-final fricatives are identifiable on broad-band spectrograms by the onset of random noise. Accordingly, the termination of vowels in such environments was measured by detecting the point at which the noise pattern associated with the final consonant began (Peterson and Lehiste, 1960).

RESULTS
Listener Performance and Representativeness Judgments
As indicated previously, a group of listeners was used to phonetically transcribe the recordings of both esophageal and normal speakers. The reliability of the listeners' transcription performance was assessed by requiring each listener to reevaluate a randomly selected set of 40 CVC utterances produced by one of the esophageal speakers. The percentage of agreement in phonetic transcription of both the consonant and vowel elements of the stimuli was calculated. The average percentages of agreement in phonetic transcription for the 10 listeners were 90.2% for word-initial consonants, 94.7% for word-final consonants, and 97.5% for the vowels. The responses of these listeners were used to establish the acceptability of vowels for duration measurement. For the esophageal speakers, 931 vowels (approximately 58%) were representative according to the established criteria. For normal speakers, 1294 vowels (approximately 90%) met the two representativeness criteria.
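The percentages in this section are simple proportions, and a short sketch shows how they can be reproduced. The code is illustrative only; `percent_agreement` is a hypothetical helper that assumes agreement is scored element by element between a listener's first and second transcription of the same items.

```python
def percent(part, whole):
    """Express part/whole as a percentage."""
    if whole == 0:
        raise ValueError("whole must be nonzero")
    return 100.0 * part / whole


def percent_agreement(first_pass, second_pass):
    """Percentage of items transcribed identically on two passes.

    Assumption: the two sequences list the same stimuli in the same order.
    """
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must cover the same items")
    matches = sum(a == b for a, b in zip(first_pass, second_pass))
    return percent(matches, len(first_pass))


if __name__ == "__main__":
    # Representative vowels out of all utterances available for analysis.
    print(round(percent(931, 1600), 1))   # 58.2, esophageal speakers
    print(round(percent(1294, 1440), 1))  # 89.9, normal speakers
```

The computed values of 58.2% and 89.9% are consistent with the approximate figures of 58% and 90% given in the text.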
Investigator Measurement Error
To evaluate the error in measurement of vowel duration, one of the investigators (J.C.) independently remeasured the durations of 32 randomly selected vowels produced by a normal speaker and 32 vowels spoken by one of the esophageal speakers. An analysis of the measurements for this series of vowels indicated that there were no significant differences between the repeated average values and that the average error of measurement was small (SD = 5.57 msec for esophageal vowels; SD = 5.12 msec for normal vowels). The average error values for repeated measurement were consistently smaller than both intra- and intersubject vowel duration variation values across repeated syllable productions and were well within measurement error values reported by others (Klatt, 1971). Correlation analyses were also used to assess the reliability of these repeated measurements. The correlation coefficients between the two measurement sets were r = 0.99 (normal speakers) and r = 0.99 (esophageal speakers). These observations support the assumption of adequate investigator measurement reliability for both types of speech studied.
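A remeasurement check of this kind can be summarized with a few statistics. The sketch below is one plausible formulation, not the authors' computation: it assumes the reported SD describes the spread of the differences between first and second measurements, which the text does not state. The duration values in the example are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def remeasurement_summary(first, second):
    """Summarize agreement between two sets of duration measurements (msec).

    Assumption: measurement error is taken as the standard deviation of the
    paired differences (second minus first).
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length series with at least 2 values")
    diffs = [b - a for a, b in zip(first, second)]
    mean_first, mean_second = mean(first), mean(second)
    # Pearson product-moment correlation computed from its definition.
    sxy = sum((a - mean_first) * (b - mean_second) for a, b in zip(first, second))
    sxx = sum((a - mean_first) ** 2 for a in first)
    syy = sum((b - mean_second) ** 2 for b in second)
    return {
        "mean_difference": mean(diffs),
        "sd_difference": stdev(diffs),
        "pearson_r": sxy / sqrt(sxx * syy),
    }


if __name__ == "__main__":
    # Hypothetical durations in msec, not data from the study.
    first = [120.0, 185.0, 240.0, 310.0, 95.0]
    second = [124.0, 181.0, 246.0, 305.0, 98.0]
    print(remeasurement_summary(first, second))
```

A mean difference near zero, a small SD of differences, and a correlation near 1.0 would together correspond to the pattern of results described above.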
Vowel Duration Characteristics of Normal and Esophageal Speakers
A primary hypothesis under study was that the durations of vowels produced by esophageal speakers would be shorter than those spoken by normal speakers. The overall average vowel duration characteristics of representative vowels /i/, /I/, /a/, and /u/ produced by normal and esophageal speakers are illustrated in Figure 1. These values represent vowel durations averaged across all consonant environments and subjects within each speech-type group. Overall vowel durations of esophageal speakers were consistently longer than those of normal speakers, indicating that esophageal speakers do not compensate for their loss of respiratory air supply for speech by decreasing vowel duration. Rather, it would appear that the esophageal speakers studied here increased vowel duration in the face of a diminished respiratory supply for speech.

FIGURE 1. Comparison of overall mean representative vowel durations produced by esophageal (E) and normal (N) speakers.

It is important to reemphasize that the values portrayed in Figure 1 represent mean characteristics averaged across all consonant environments. Hence, any differential effect associated with such features as consonant voicing is masked. The average vowel duration properties of normal and esophageal speakers are plotted as a function of consonant voicing features in Figure 2. The differences in the vowel duration characteristics between the two speaker groups appear to vary systematically as a function of the voicing features of the consonant environment within which the vowel was spoken. Specifically, average vowel durations spoken in voiceless consonant environments were always longer for esophageal speakers. Durations were significantly (p < 0.05, see Table 1) longer for esophageal speakers in eight of 14 context comparisons completed. By contrast, average vowel durations spoken in voiced consonant environments appear comparable for the two speaker groups. Analysis of variance procedures were used to test the significance of these