Rehabilitation Psychology 2014, Vol. 59, No. 3, 289-297

© 2014 American Psychological Association 0090-5550/14/$12.00 http://dx.doi.org/10.1037/a0036663

Listeners’ Preference for Computer-Synthesized Speech Over Natural Speech of People With Disabilities

Steven E. Stern, Chelsea M. Chobany, and Disha V. Patel
University of Pittsburgh at Johnstown

Justin J. Tressler
Seton Hill University

Purpose/Objective: There are few controlled experimental studies that examine reactions to people with speech disabilities. We conducted 2 studies designed to examine participants’ reactions to persuasive appeals delivered by people with physical disabilities and mild to moderate dysarthria. Research Method/Design: Research participants watched video clips delivered by actors with bona fide disabilities and subsequently rated the argument, message, and the speaker. The first study (n = 165) employed a between-groups design that examined reactions to natural dysarthric speech, synthetic speech as entered into a keyboard by hand, and synthetic speech as entered into a keyboard with a headwand. The second study (n = 27) employed a within-groups design that examined how participants reacted to natural dysarthric speech versus synthetic speech as entered into a keyboard by hand. Results: Both of these studies provide evidence that people rated the argument, message, and speaker more favorably when people with disabilities used synthetic speech than when they spoke in their natural voice. Conclusions/Implications: The implications are that although people react negatively to computer-synthesized speech, they prefer it to and find it more persuasive than the speech of people with disabilities. This appears to be the case even if the speech is only moderately impaired and is as intelligible as the synthetic speech. Hence, the decision to use synthetic speech versus natural speech can be further complicated by an understanding that even the intelligible speech of people with disabilities leads to more negative reactions than synthetic speech.

Keywords: disability, computer-synthesized speech, assistive technology, prejudice

Impacts and Implications

• This study provides a direct comparison between perceptions of people with speech disabilities either using an assistive technology (computer-synthesized speech) or not using the technology. Because of the nature of disability, there has been little systematic experimentation on perceptions of people with disabilities. By creating stimuli employing people with disabilities as actors using either their own voices or computer-synthesized speech, we were able to conduct carefully controlled studies of perceptions of computer-synthesized versus natural speech.

• This research provides evidence that people react more positively to and are more persuaded by computer-synthesized speech than by the natural speech of people with disabilities, even when the synthesized and natural speech are similarly intelligible.

• These findings have implications for understanding prejudice toward people with disabilities. Users of computer-synthesized speech face a double-edged sword. The assistive technology that they use, computer-synthesized speech, is unpopular. But choosing not to use it can lead to less favorable reactions to the message as well as to the individual speaker. Policymakers and organizations need to consider how synthetic speech is considered in terms of providing reasonable accommodations to workers with disabilities. If synthetic speech is preferred over natural speech, this is evidence that employers should consider it to be a reasonable accommodation for certain types of employment that require verbal interaction.

Steven E. Stern and Chelsea M. Chobany, Department of Psychology, University of Pittsburgh at Johnstown; Disha V. Patel, Department of Biology, University of Pittsburgh at Johnstown; Justin J. Tressler, Marriage and Family Therapy Program, Seton Hill University. The University of Pittsburgh’s Council for Research Development Fund provided financial support for this work. The authors also acknowledge staff and consumers at the Hiram G. Andrews Center in Johnstown, Pennsylvania for their assistance on this project. We are grateful to John Mullennix for his comments on our article. Correspondence concerning this article should be addressed to Steven E. Stern, PhD, Department of Psychology, University of Pittsburgh at Johnstown, 450 Schoolhouse Road, Johnstown, PA 15905. E-mail: [email protected]

Introduction

Speech disability is a stigmatized condition (Weitzel, 2000) that isolates people, leaving them with fewer social contacts and friendships (Cruice, Worrall, & Hickson, 2006; Hilari & Northcott, 2006). People with dysarthria and aphasia report difficulty communicating and frequently withdraw from activities that involve speaking (Hartelius, Elmberg, Holm, Lovberg, & Nikolaidis, 2007). People are often uncomfortable around people with speech disabilities, avoiding communication with them, infantilizing them, and distancing themselves from them (Brady et al., 2011). Unsurprisingly, the inability to use one’s voice is frequently accompanied by diminished self-worth, feelings of being left out, depression, and social withdrawal (Brady et al., 2011; Weitzel, 2000).

Computer-synthesized speech is a commonly used and very practical assistive technology for many people who are unable to use their own voices (Beukelman, Fager, Ball, & Dietz, 2007; Fried-Oken et al., 2006). There are numerous ways in which text can be inputted into “text to speech” (TTS) systems, including keyboards, PC-tablet-like interfaces, and eye-tracking devices (Higgenbotham, 2010).

Although computer-synthesized speech should be a boon to people who are unable to use their own voices, there is consistent evidence that people react negatively to it (e.g., Stern, Mullennix, Dyson, & Wilson, 1999). This may be due, in part, to difficulty comprehending computer-synthesized speech (Gorenflo & Gorenflo, 1997; Venkatagiri, 2004). There are, of course, other problems, such as the limitations of pitch, emotional range, naturalness, and volume that are inherent in most versions of the technology (Stern, 2008). The technology is particularly problematic in more challenging situations, such as using the telephone, speaking to children, and conversing in a car (Ball, Beukelman, & Patee, 2004).

Our research program has established that computer-synthesized speech is evaluated less favorably and is less persuasive than typical natural speech (Stern et al., 1999). However, when informed that the user of synthetic speech has a disability, listeners are typically sympathetic and evaluate that speaker more favorably (Stern, Mullennix, & Wilson, 2002).

Unfortunately, this effect does not generalize to more complicated situations that require the listener to consider other issues. When listeners are informed, for instance, that a person with a disability is using synthetic speech for a negatively perceived purpose (e.g., a telephone appeal), they become more critical than if the person does not have a disability (Stern, Dumont, Mullennix, & Winters, 2007).
Research that has compared reactions to computer-synthesized speech with reactions to dysarthric speech is no more encouraging. Dysarthric speech can be difficult to comprehend (Drager, Hustad, & Gable, 2004), and participants are more comfortable listening to computer-synthesized speech than to dysarthric speech and the speech of people with aphasia (Drager, Hustad, & Gable, 2004; Lasker & Beukelman, 1999).

Hence, people with speech disabilities are in a double bind. Using their own voice to speak, if possible, is unwelcome. And using assistive technology that is widely available and highly practical may be unwelcome because it is unpleasant to listen to. In fact, computer-synthesized speech may serve as a cue that reinforces and reiterates the disability that it is intended to ameliorate (Stern et al., 2007).

To better examine how people respond to people with speech disabilities, both with and without the use of assistive technology, we conducted experiments in which participants watched videos of people with bona fide disabilities who were using either their own mildly to moderately dysarthric voices or computer-synthesized speech to make a persuasive appeal.

In Study 1, we used a between-groups design to examine how people would react to people with disabilities (cerebral palsy and spinal cord injury) using their own dysarthric voices or using computer-synthesized speech. Based upon previous research showing a level of tolerance toward people who use synthetic speech as an assistive technology, but intolerance for dysarthric speech, we hypothesized that the argument, message, and the speaker would be rated more favorably when the person with a disability used synthetic speech than when he used his own moderately dysarthric voice. Our design examined reactions to synthetic speech as inputted by hand and by headwand. The headwand condition was intended to be indicative of greater impairment. Nevertheless, we had no directional hypothesis regarding how these different input modes would affect our dependent measures.

In Study 1 we also examined how evaluations would be affected by whether or not speakers were choosing to use the synthetic speech as an assistive technology. Based upon prior research (Stern et al., 2002) showing that synthetic speech is better tolerated when used legitimately as an assistive technology, we expected participants to react more positively to speakers who chose the technology than to speakers with disabilities who chose not to use it.

In Study 2 we examined the same question using a within-groups design. All participants viewed two presentations of the same persuasive appeal, once presented by an actor using their own dysarthric voice and once presented by the other actor using synthesized speech. The sequences were counterbalanced for both actor and which speech condition participants were exposed to first. For Study 2 we hypothesized that the preference for synthetic speech over the natural speech of people with disabilities would persist even when participants had the opportunity to make a direct comparison between them and were fully aware that there were two conditions. A successful within-groups replication of this effect would have the added benefit of justifying follow-up studies utilizing fewer participants.
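The counterbalancing described for Study 2 can be made concrete with a short sketch. It enumerates every viewing sequence in which each actor and each speech condition appears exactly once, then cycles participants through those sequences. The actor labels, the function names, and the round-robin assignment rule are illustrative assumptions; the article does not report how participants were allocated to sequences.

```python
from itertools import permutations

# Hypothetical labels; the article identifies the speakers only as Actor 1 and Actor 2.
ACTORS = ("Actor 1", "Actor 2")
CONDITIONS = ("natural", "synthetic")


def counterbalanced_sequences():
    """Return every two-clip sequence in which each actor and each speech
    condition appears exactly once, crossing actor order with condition order."""
    sequences = []
    for actor_order in permutations(ACTORS):
        for condition_order in permutations(CONDITIONS):
            # Pair the first actor with the first condition, and so on.
            sequences.append(tuple(zip(actor_order, condition_order)))
    return sequences


def assign(participant_ids):
    """Cycle participants through the sequences so group sizes stay balanced
    (an assumed allocation rule, not one reported in the article)."""
    sequences = counterbalanced_sequences()
    return {pid: sequences[i % len(sequences)]
            for i, pid in enumerate(participant_ids)}


if __name__ == "__main__":
    for sequence in counterbalanced_sequences():
        print(sequence)
```

Crossing the two actor orders with the two condition orders yields four sequences, so a sample that is not a multiple of four (such as n = 27) cannot be split perfectly evenly under this rule.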

General Method

Overview

In the experiments reported here, participants watched videotaped segments of persons with physical disabilities delivering persuasive appeals. In both studies, we examined participants’ perceptions of the effectiveness of the persuasive argument, as well as their perceptions of the message and the speaker. The Institutional Review Board of the University of Pittsburgh approved all procedures.

Stimuli

Our stimuli consisted of video segments of two men with physical disabilities. Each of these actors delivered a frequently used, standardized strong persuasive appeal (Petty & Cacioppo, 1986) under one of three conditions: (a) vocally, (b) using synthetic speech inputted by hand on a keyboard, and (c) using synthetic speech inputted by headwand on a keyboard. This persuasive appeal is a scripted speech lasting between 7 min 21 s and 8 min 44 s depending upon condition. In the appeal, the speaker explains the arguments in favor of adopting comprehensive exit exams as a graduation requirement for seniors at undergraduate institutions. The argument focuses, to a large extent, on the extent to which employers and graduate schools are favorably impressed by students who graduate from universities that require comprehensive exams for graduation.

Actor 1 was a 24-year-old male with a spinal cord injury resulting in paraplegia. Actor 2 was a 19-year-old with cerebral palsy with spastic quadriparesis (photographs of the actors can be seen in Figure 1). Both actors used wheelchairs and both actors were able to speak intelligibly. They were recruited from a local state-run vocational rehabilitation facility and were paid for their participation.

The intelligibility of the actors was assessed with a 3 × 3 Latin square (Raters × Sentences) in which 21 inexperienced raters each watched a sentence-long excerpt from each of the two actors using their own voices and from an actor using synthetic speech (e.g., Hammen, Yorkston, & Minife, 1994). Each rater listened to each actor reading different sentences. They listened to each sentence three times and transcribed it to the best of their ability. For each actor and the synthetic speech program, a mean score was calculated by dividing the number of words understood by the number of words spoken. The mean scores for the two actors were as follows: Actor 1 (spinal cord injury) 86.8%; Actor 2 (cerebral palsy) 70.6%. This indicates that they had mild (Actor 1) to moderate (Actor 2) dysarthria (Yorkston, Strand, & Kennedy, 1996). The mean sentence intelligibility for the computer-synthesized speech sample was 75.5%, indicating that it was similarly intelligible to the natural samples.

In the videotapes that we used in this study, each of the actors was doing one of three things: (a) reading a persuasive passage developed by Petty and Cacioppo (1986), (b) mimicking the use of a small keyboard attached to the right arm of their wheelchair, and (c) mimicking the use of a headwand to input onto a small keyboard in a manner sometimes used by quadriplegic users of keyboards. In the second and third settings, to create the impression that the actors were using computer-synthesized speech, the persuasive appeal was simultaneously outputted from a TTS system (DecTalk Express, V2.4C), a commercially available technology. The synthetic speech was presented in a default male voice.

Figure 1. Actors used in study. Actor 1 is shown in the synthetic speech by hand condition. Actor 2 is shown in the synthetic speech by headwand condition. All photographs and videotapes are the property of the first author. The actors were contracted by the first author and provided written permission to use their likenesses in research reports. The color version of this figure appears in the online article only.
Actor 1. Gender: Male. Age: 24. Condition: Spinal cord injury with paraplegia caused by motor vehicle accident.
Actor 2. Gender: Male. Age: 19. Condition: Cerebral palsy with spastic quadriparesis.

Measures

Our dependent measures have been used previously in research on perceptions of computer-synthesized speech (e.g., Stern et al., 1999). These include a pretest and posttest measure of attitude change (only used in Study 1), and a series of questions assessing (a) overall effectiveness of the argument, (b) attitudes toward the message, and (c) attitudes toward the speaker.

Attitude change. Immediately prior to watching the video clip, and immediately afterward, participants were given a pretest and posttest measure of attitudes. These tests consisted of 12 questions that examined opinions on the target topic of comprehensive exams and three distractor topics (animal rights, environmental opinions, and tuition rate increases). These items were rated on a 7-point rating scale from 1 (disagree completely) to 7 (agree completely). Persuasion was measured by comparing attitude change on the target topic to that on the combination of the three distractor topics.

Effectiveness of the argument. Evaluations of the argument were measured through a series of six questions, each consisting of a 9-point scale anchored by the following opposites: bad-good, foolish-wise, negative-positive, beneficial-harmful, convincing-unconvincing, and effective-ineffective. These measures were adapted from Baker and Petty (1994).

Perceptions of the message and speaker. Perceptions of the message and perceptions of the speaker were measured with items developed by Lucia (1998), which were based on work by Leathers (1997). Perceptions of the message were assessed through six 7-point questions anchored by the following opposites: stimulating-boring, vague-specific, unsupported-supported, complex-simple, convincing-unconvincing, and uninteresting-interesting. Perceptions of the speaker were assessed through 12 7-point questions anchored by the following opposites: unintelligent-intelligent, straightforward-evasive, active-inactive, qualified-unqualified, sincere-insincere, meek-forceful, incompetent-competent, honest-dishonest, unassertive-assertive, uninformed-informed, untrustworthy-trustworthy, and timid-bold.
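The attitude-change measure compares the pretest-to-posttest shift on the target topic with the shift on the distractor topics. The article does not give the exact formula, so the sketch below shows one plausible reading: the mean change on target items minus the mean change on distractor items. The function names, the item identifiers, and the split of the 12 questions across topics are hypothetical.

```python
from statistics import mean


def mean_change(pre, post):
    """Mean posttest-minus-pretest shift across a set of rating-scale items."""
    if len(pre) != len(post):
        raise ValueError("pretest and posttest must cover the same items")
    return mean(after - before for before, after in zip(pre, post))


def persuasion_score(pre, post, target_items, distractor_items):
    """Target-topic change minus distractor-topic change.

    `pre` and `post` map item ids to ratings on the 1-7 scale.
    This difference-of-means rule is an assumption, not the reported method.
    """
    target = mean_change([pre[i] for i in target_items],
                         [post[i] for i in target_items])
    distractor = mean_change([pre[i] for i in distractor_items],
                             [post[i] for i in distractor_items])
    return target - distractor


if __name__ == "__main__":
    # Hypothetical item ids: three target items and nine distractor items.
    targets = ["t1", "t2", "t3"]
    distractors = [f"d{i}" for i in range(1, 10)]
    pretest = {item: 4 for item in targets + distractors}
    posttest = dict(pretest, t1=6, t2=5, t3=7)
    print(persuasion_score(pretest, posttest, targets, distractors))
```

A positive score under this reading means attitudes moved further toward the advocated position on comprehensive exams than on the unrelated topics, which separates persuasion from a general drift in responding between the two tests.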

Factor Analysis, Data Reduction, and Dependent Measures

Principal-components analyses using Varimax rotations were run on the six items that examined the effectiveness of the argument, the six items that examined the perceptions of the message, and the 12 items that examined the perceptions of the speaker. For the items that measured the effectiveness of the argument, a single-factor solution was obtained accounting for 67.39% of the variance, with all items loading strongly on a unique factor (> .79). For the items that measured the perceptions of the message, a two-factor solution was obtained accounting for 58.32% of the variance in which all but one item (simple/complex) loaded strongly on a unique factor (> .5). For the 12 items that measured the perceptions of the speaker, a three-factor solution was obtained accounting for 60.93% of the variance in which all the items loaded strongly on a unique factor (> .5). The structures of the resulting aggregated variables can be seen in Table 1.
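To illustrate how item ratings might be turned into the aggregated variables, the sketch below reverse scores the starred items and then averages a participant's ratings within one factor. The item names follow the Credibility factor in Table 1; averaging (rather than summing or weighting by loadings) is an assumption, because the article does not state the aggregation rule, and the function names are hypothetical.

```python
from statistics import mean

SCALE_MAX = 7  # speaker items were rated on 7-point scales

# Credibility items from Table 1; the starred items there are reverse scored.
CREDIBILITY_ITEMS = ("active-inactive", "sincere-insincere",
                     "honest-dishonest", "untrustworthy-trustworthy")
REVERSED_ITEMS = {"active-inactive", "sincere-insincere", "honest-dishonest"}


def reverse_score(rating, scale_max, scale_min=1):
    """Flip a rating so that the favorable pole receives the high number."""
    return scale_max + scale_min - rating


def composite(responses, items, reversed_items, scale_max):
    """Average one participant's ratings over a factor's items
    after reverse scoring (assumed aggregation rule)."""
    scored = [
        reverse_score(responses[item], scale_max)
        if item in reversed_items else responses[item]
        for item in items
    ]
    return mean(scored)


if __name__ == "__main__":
    # Made-up ratings from a single hypothetical participant.
    ratings = {"active-inactive": 2, "sincere-insincere": 1,
               "honest-dishonest": 3, "untrustworthy-trustworthy": 6}
    print(composite(ratings, CREDIBILITY_ITEMS, REVERSED_ITEMS, SCALE_MAX))
```

Reverse scoring first matters because half of the bipolar items place the favorable adjective on the low end of the scale; without it, favorable and unfavorable ratings would cancel in the composite.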

Table 1
Composite Variables Used in Study 1 and Study 2

Factor and items                        Factor loading    % Variance explained
Attitude toward argument
  Factor 1: Overall attitude                                    67.39
    Bad-Good                                 .82
    Foolish-Wise                             .79
    Negative-Positive                        .87
    Beneficial-Harmful*                      .82
    Effective-Ineffective*                   .79
    Convincing-Unconvincing*                 .83
Attitude toward message
  Factor 1: Attention based items                               29.66
    Stimulating-Boring*                      .89
    Uninteresting-Interesting                .87
  Factor 2: Information based items                             28.66
    Vague-Specific                           .79
    Unsupported-Supported                    .76
    Convincing-Unconvincing*                 .53
Attitude toward speaker
  Factor 1: Informedness                                        40.14
    Unintelligent-Intelligent                .75
    Straightforward-Evasive*                 .62
    Qualified-Unqualified*                   .61
    Incompetent-Competent                    .79
    Uninformed-Informed                      .81
  Factor 2: Strength                                            12.04
    Timid-Bold                               .77
    Unassertive-Assertive                    .74
    Meek-Forceful                            .85
  Factor 3: Credibility                                          8.74
    Active-Inactive*                         .55
    Sincere-Insincere*                       .56
    Honest-Dishonest*                        .78
    Untrustworthy-Trustworthy                .70

* Denotes items that were reverse scored prior to data reduction procedures.

Experiment 1

The first experiment was an extension of our research that has examined perceptions of natural versus computer-synthesized speech. This experiment builds upon that foundation by examining reactions to speakers with actual disabilities. The study also examines whether participants’ beliefs that the actor has a choice of using synthetic speech or natural speech has an effect on their evaluations.

We hypothesized that participants would rate the argument, message, and speaker more favorably and would be more persuaded when the person with a disability used synthetic speech than when he used his own voice. We included both the keyboard input and headwand input in order to manipulate the apparent severity of the disability. This manipulation, however, was to some extent exploratory and we did not have a directional hypothesis. We also hypothesized that participants would give particularly low ratings to speakers who had specifically chosen not to use synthetic speech as an assistive technology.

Method

Participants and design. A total of 165 undergraduates (60 men, 105 women, mean age 19.17 years) from the University of Pittsburgh at Johnstown participated in this study in exchange for course credit. Participants were randomly assigned to one of six experimental conditions and were presented with a video segment of one of two speakers delivering a persuasive appeal in favor of comprehensive exams at universities. The design was a 3 (Type of Speech: Natural Voice vs. Keyboard vs. Headwand) × 2 (Choice: Choice vs. No Choice) between-groups factorial. Participants were debriefed, allowed time for questions, and thanked for their participation.

Procedure. Participants were asked to watch a video segment on the computer. They were seated in a small cubicle with the computer, and listened through earphones. They were permitted to adjust the volume. The experimenter closed the door and asked participants to open the door when the video was over. Prior to the video, participants were given a pretest attitude measure. They were given the same test after the video as a posttest. They were also given an instruction set to read that explained that they were going to watch a video of a person with a disability delivering a persuasive appeal. The instruction sets were also designed to experimentally manipulate participants’ perceptions of whether or not the speaker had a choice in whether or not to use synthetic speech as an assistive technology.

Under the choice conditions, when the actor was using synthetic speech, it was specified that the speaker “is choosing to use a computer-synthesized speech system,” and that “they could have chosen to use their own voice.” When the actors were using their own voice (in the choice condition), it was specified that the speaker “is choosing to use their own voice,” and that “they could have chosen to use a computer-synthesized speech device to assist them.” Under the three no-choice conditions, it was not explicitly stated whether the speaker did or did not have a choice of using the assistive technology.

Results

Main effects for type of speech. All dependent variables were subjected to 3 (Type of Speech: Natural Voice vs. Synthetic by Hand vs. Synthetic by Headwand) × 2 (Choice: Choice vs. No Choice) factorial ANOVAs, with both factors treated as between-groups variables. There was a main effect for speech on the overall effectiveness of the argument, F(2, 152) = 7.54, p < .001, r = .29, and the extent to which the message was perceived as information based, F(2, 152) = 5.41, p < .005, r = .19. The analyses also revealed main effects for informedness of speaker, F(2, 152) = 16.36, p < .001, r = .42; speaker strength, F(2, 152) = 15.59, p < .001, r = .41; and speaker credibility, F(2, 152) = 3.56, p = .03, r = .21. See Table 2 for the means for the speech conditions across the dependent variables related to perceptions of the speaker and message.

The ANOVA, however, did not directly address our specific hypothesis, that when the speakers used their natural (moderately dysarthric) voice, the persuasive appeal would be received less favorably than when they used computer-synthesized speech by

