Behavior Research Methods, DOI 10.3758/s13428-014-0467-x

Recognizing emotional speech in Persian: A validated database of Persian emotional speech (Persian ESD)

Niloofar Keshtiari, Michael Kuhlmann, Moharram Eslami, and Gisela Klann-Delius

© Psychonomic Society, Inc. 2014

Abstract Research on emotional speech often requires valid stimuli for assessing perceived emotion through prosody and lexical content. To date, no comprehensive emotional speech database for Persian is officially available. The present article reports the process of designing, compiling, and evaluating a comprehensive emotional speech database for colloquial Persian. The database contains a set of 90 validated novel Persian sentences classified in five basic emotional categories (anger, disgust, fear, happiness, and sadness), as well as a neutral category. These sentences were validated in two experiments by a group of 1,126 native Persian speakers. The sentences were articulated by two native Persian speakers (one male, one female) in three conditions: (1) congruent (emotional lexical content articulated in a congruent emotional voice), (2) incongruent (neutral sentences articulated in an emotional voice), and (3) baseline (all emotional and neutral sentences articulated in a neutral voice). The speech materials comprise about 470 sentences. The validity of the database was evaluated by a group of 34 native speakers in a perception test. Utterances recognized better than five times chance performance (71.4 %) were regarded as valid portrayals of the target emotions. Acoustic analysis of the valid emotional utterances revealed differences in pitch, intensity, and duration, attributes that may help listeners to correctly classify the intended emotion. The database is designed to be used as a reliable material source (for both text and speech) in future cross-cultural or cross-linguistic studies of emotional speech, and it is available for academic research purposes free of charge. To access the database, please contact the first author.

Keywords Emotion recognition . Speech . Emotional speech database . Prosody . Persian

Copyright 2010–2012 Niloofar Keshtiari. All rights reserved. This database, despite being available to researchers, is subject to copyright law. Any unauthorized use, copying, or distribution of material contained in the database without written permission from the copyright holder will lead to copyright infringement, with possible ensuing litigation. Directive 96/9/EC of the European Parliament and the Council of March 11, 1996, describes the legal protection of databases. Published work that refers to the Persian Emotional Speech Database (Persian ESD) should cite this article.

Electronic supplementary material The online version of this article (doi:10.3758/s13428-014-0467-x) contains supplementary material, which is available to authorized users.

N. Keshtiari (*) and G. Klann-Delius, Cluster of Excellence Languages of Emotion, Freie Universität Berlin, Habelschwerdter Allee 45, 14195 Berlin, Germany; e-mail: [email protected]
M. Kuhlmann, Department of General Psychology and Cognitive Neuroscience, Freie Universität Berlin, Berlin, Germany
M. Eslami, Department of Persian Language and Literature, Zanjan University, Zanjan, Iran

Communicating and understanding emotions is crucial for human social interaction. Emotional prosody (i.e., modulations in the acoustic parameters of speech, such as intensity, rate, and pitch) encompasses the nonverbal aspects of human language and provides a rich source of information about a speaker's emotions and social intentions (Banse & Scherer, 1996; Wilson & Wharton, 2006). Aside from prosody, emotions are also conveyed verbally through the lexical content of the spoken utterance. These two channels of information (prosody and lexical cues) are inextricably linked and may reinforce or contradict each other (e.g., by conveying sarcasm or irony; Pell & Kotz, 2011). Therefore, to interpret the intended meaning or the emotions and attitudes of a speaker, listeners must effectively monitor both prosodic and lexical information (Tanenhaus & Brown-Schmidt, 2007). To date, only scarce empirical data are available on the extent to which listeners


harness prosody versus lexical cues, or combine both channels of information, to activate and retrieve emotional meanings during ongoing speech processing (Pell, Jaywant, Monetta, & Kotz, 2011). The existing literature on emotional speech has usually focused on prosody and disregarded the complex interaction of prosody and lexical content (Ben-David, van Lieshout, & Leszcz, 2011; Pell et al., 2011). Considering the interaction of lexicon and prosody, emotional speech can be conveyed in the following three conditions: (a) emotional lexical content articulated in congruent emotional prosody, (b) emotional and neutral lexical content articulated in neutral prosody, and (c) neutral lexical content conveyed in emotional prosody. The first two conditions are reported to be useful for functional neuroimaging studies (see, e.g., Mitchell, Elliott, Barry, Cruttenden, & Woodruff, 2004), whereas the third condition is useful for emotion perception and identification tasks (Russ, Gur, & Bilker, 2008). Therefore, conducting a study involving any of these conditions requires the development of a validated database of emotional speech (Russ et al., 2008).

This article reports the process involved in the design, creation, and validation of a Persian database of acted emotional speech. Persian is an Indo-European language (Anvari & Givi, 1996) spoken by almost 110 million people around the world (Sims-Williams & Bailey, 2002). The dialect examined in this database is Modern Conversational Persian, as spoken in Tehran. The database contains a validated set of 90 sentences articulated in five basic emotions (anger, disgust, fear, happiness, and sadness) and a non-emotional (neutral) category. The authors covered all three of the aforementioned conditions in this database in order to create a comprehensive language resource that enables researchers to separately identify the impact of prosody and lexical content, as well as their interaction, on the recognition of emotional speech.

Emotional speech in Persian has been studied only in limited ways. The existing research has mainly focused on the recognition of emotional prosody in speech-processing systems. For instance, Gharavian and Ahadi (2009) described the substantial changes in speech parameters, such as pitch, caused by emotional prosody. In a more recent study, Gharavian, Sheikhan, Nazerieh, and Garoucy (2012) examined the effect of using a rich set of features, such as pitch frequency and energy, to improve the performance of speech emotion recognition systems. Despite this research on speech processing in Persian, no well-controlled stimulus database of emotional speech is available to researchers. The studies referenced earlier recruited native speakers with no expertise in acting to articulate neutral or emotional lexical content in the intended emotional prosody, and neither the lexical nor the vocal stimuli were validated. Furthermore, these previous studies investigated only a limited number of emotions (anger, happiness, and sadness; e.g., Gharavian & Ahadi, 2009; Gharavian & Sheikhan, 2010).

Pell (2001) suggested that emotional prosody can be affected by the linguistic features of a language. Moreover, as Liu and Pell (2012) have argued, it is essential for researchers to generate valid emotional stimuli that are suitable for the linguistic background of the participants of a study. Accordingly, a comprehensive study of emotional speech requires recorded samples of all the basic emotions in the language under study. By generating a robust set of validated stimuli, the authors therefore aimed to fill this gap for the Persian language. Such stimuli can minimize the influence of individual bias and avoid subjectivity in stimulus selection in future studies of Persian emotional speech.

Within a discrete-emotion framework, anger, disgust, fear, happiness, pleasant surprise, and sadness are frequently regarded as basic emotions, each having a distinct biological basis and qualities that are universally shared across cultures and languages (Ekman, 1999). When designing the database, the authors considered this set of basic emotions, known as "the big six" (Cowie & Cornelius, 2003), and added a neutral mode. However, pleasant surprise was later omitted from the list of target emotions, due to its close resemblance to happiness (for more details, see the Lexical Content Validation section). To date, numerous databases of vocal expressions of the basic emotions have been established in several languages, including English (Cowie & Cornelius, 2003; Petrushin, 1999), German (Burkhardt, Paeschke, Rolfes, Sendlmeier, & Weiss, 2005), Chinese (Liu & Pell, 2012; Yu, Chang, Xu, & Shum, 2001), Japanese (Niimi, Kasamatsu, Nishinoto, & Araki, 2001), and Russian (Makarova & Petrushin, 2002), as well as many other languages (for reviews, see Douglas-Cowie, Campbell, Cowie, & Roach, 2003; Juslin & Laukka, 2003; Ververidis & Kotropoulos, 2003). However, this has not been achieved for Persian. The present database therefore provides a useful language resource for conducting basic research on a range of vocal emotions in Persian, as well as for conducting cross-cultural and cross-linguistic studies of vocal emotion communication. Moreover, the stimuli in this database can be used as a reliable language resource (lexical and vocal) for the assessment and rehabilitation of communication skills in patients with brain injuries.

One of the greatest challenges in emotion and speech research is obtaining authentic data. Researchers have developed a number of strategies to obtain recordings of emotional speech, each with its own merits and shortcomings (Campbell, 2000); for a review of the strategies used for the various databases, see Douglas-Cowie et al. (2003). Some researchers have employed "spontaneous emotional speech" to gain the greatest authenticity in conveying the emotions (see, e.g., Roach, 2000; Roach, Stibbard, Osborne, Arnfield, & Setter, 1998; Scherer, Ladd, & Silverman, 1984). Eliciting authentic emotions from speakers in the laboratory is another approach used by researchers (see Gerrards-Hesse, Spies, &


Hesse, 1994; Johnstone & Scherer, 1999). Although these two methods generate naturalistic emotional expressions, they are reported to be restricted to non-full-blown states and to show very low agreement with decoders' judgments of a specific emotion (Johnson, Emde, Scherer, & Klinnert, 1986; Pakosz, 1983). These methods also suffer from various technical and social limitations that hinder their usefulness in a systematic study of emotions (for a detailed explanation, see Ekman, Friesen, & O'Sullivan, 1988; Ververidis & Kotropoulos, 2003). One of the oldest, and still most frequently used, approaches for obtaining emotional speech data consists of "acted emotions" (Banse & Scherer, 1996). The advantages of this approach are that the verbal and prosodic content can be controlled (i.e., all of the intended emotional categories can be produced using the same lexical content), that a number of speakers can be employed to utter the same set of verbal content in all of the intended emotions, and that high-quality recordings can be produced in an anechoic chamber. This production strategy allows a direct comparison of the acoustic and prosodic realizations of the various intended emotions. Critics of the acted-emotions approach question the authenticity of the actors' portrayals; however, this drawback can be minimized by employing the "Stanislavski method"1 (Banse & Scherer, 1996). As a result, the authors employed the acted-emotions approach and the Stanislavski method in order to achieve vocal emotional portrayals that were as authentic as possible for the stimuli of the present database.

The present article consists of six parts. The first part describes the construction of the lexical content, and the second part concerns its validation. In the third part, we measured the emotional intensity of the lexical content. The fourth part describes the elicitation and recording procedure for the vocal portrayals. The fifth part concerns the validation of the vocal material, and the sixth part is a summary and general discussion of all previous parts. This study started in October 2010 and finished in October 2012. As described below, all procedures were conducted entirely in Persian at each stage of the investigation.

1 The Stanislavski method is a progression of techniques (e.g., imagination and other mental or muscular techniques) used to train actors to experience a state similar to the intended emotion. This method, which is based on the concept of emotional memory, helps actors to draw on believable emotions in their performances (O'Brien, 2011).

Part One: Lexical content construction

The existing literature on emotional speech has usually emphasized the role of prosody and neglected the role of lexical content (Ben-David et al., 2011). In studying emotional speech, various researchers have often prepared their own, study-specific lists of sentences without validating the emotional lexical content (see, e.g., Luo, Fu, & Galvin, 2007; Maurage, Joassin, Philippot, & Campanella, 2007). However, to conduct a study on emotional speech, a set of validated sentences is required in order to separate the impact of lexical content from that of prosody on the processing of emotional speech (Ben-David et al., 2011). Therefore, in this study three experiments were conducted to generate a set of validated sentences (lexical material) in colloquial Persian.

In all three experiments, participants were recruited worldwide through online advertisements and referrals from other participants, and they took part on a voluntary basis, without financial compensation.2 The recruitment criteria were the same in all experiments: All participants were native speakers of Persian, and Persian was their working and everyday language. If participants knew other languages besides Persian, further questions were asked to make sure that Persian was their dominant language. Participants did not suffer from any psychopathological condition or neurological illness, had no history of head trauma, and took no psychoactive medication.

As a first step, a set of validated sentences conveying a specific emotional content or a neutral state was produced. A total of 252 declarative sentences (36 for each of the target emotions, plus 36 sentences intended to be neutral) were created using the same simple Persian grammatical structure: subject + object + prepositional phrase + verb. Female and male Persian proper names were used as the subjects of the sentences; to avoid gender effects, the authors used an equal number of male and female proper names. Each sentence describes a scenario that is often associated with one of the target emotions (see Example 1 for a sample sentence). In developing an emotional speech database, it is desirable to match the lexical content for word frequency and phonetic neighborhood density (i.e., the number of words that can be obtained by replacing one letter of a word with another). However, this was not possible, because the existing Persian corpora (see Ghayoomi, Momtazi, & Bijankhan, 2010, for a review) did not fully cover the domain of colloquial speech when the sentences were compiled. Nevertheless, in order to produce a set of stimuli that was as authentic as possible, everyday Persian words were chosen, and all of the sentences were checked for naturalness and fluency by four native speakers of Persian (two psychologists, two linguists). To develop a reliable set of sentences conveying a particular emotion or a neutral state, a validation procedure was needed.

2 In order to prevent the same participant from taking the test twice, the IP address of each participant’s computer was checked.


Part Two: Lexical content validation

To make sure that each generated sentence conveyed only one specific emotion, or no emotion at all (neutral), we performed two perceptual studies to validate the 252 sentences. In doing so, we carefully considered the ecological validity of the sentences (Schmuckler, 2001). Furthermore, it was important to obtain recognition accuracy data for the sentences from a sample more representative of the general population than student volunteers alone. Therefore, we recruited a large group of participants to serve as "decoders." The method employed to analyze emotion recognition warrants special consideration. On one hand, forcing participants to choose an option from a short list of emotions may inflate agreement scores and produce artifacts (Russell, 1994). On the other hand, providing participants with more options, or allowing them to label the emotions freely, would result in very high variability (Banse & Scherer, 1996; Russell, 1994). However, if participants are provided with the response option "none of the above," together with a discrete number of emotion choices, some of these artifacts can be avoided (Frank & Stennett, 2001). Therefore, in the following experiments we used nominal scales comprising the intended emotions plus the option "none of the above."

Experiment 1

Method

Participants A total of 1,126 individuals with no training in this type of task were recruited as participants. The data for 132 participants were not included in the analysis due to their excessively high error rates (i.e., above 25 %). Thus, the data from 994 participants (486 female, 508 male) were analyzed. The mean age of the participants was 32.6 years (SD = 13.9), ranging from 18 to 65 years. Participants were roughly equivalent in years of formal education (14.6 ± 1.8).

Materials and procedure The 252 sentences were presented to participants in an online questionnaire. Participants were asked to complete the survey individually in a quiet environment. They were instructed to read each sentence, imagine the scenario described, and, as quickly as possible, select the emotion that best matched the scenario described in the sentence. Responses were given on an eight-point nominal scale corresponding to anger, disgust, fear, happiness, pleasant surprise, sadness, neutral, and "none of the above." With this eight-choice paradigm (six emotions, neutral, and none of the above), chance level was 12.5 %. To avoid effects of presentation sequence, the sentences were presented in three blocks in a fully randomized design. Following Ben-David et al. (2011), seven control sentences (one in each emotional category) were presented in each questionnaire (in a randomized order) to control for inconsistent responses. These seven control sentences were repetitions of seven previous items in exactly the same wording. Participants who did not mark the repeated trials of these control sentences consistently (i.e., a marking difference of more than three) were removed from the analysis. A similar method was used by Ben-David et al. to control for inconsistency of responses.

Results and discussion Previous work on emotion recognition (Scherer, Banse, Wallbott, & Goldbeck, 1991) suggests that emotional content is recognized at approximately four times chance performance. Accordingly, to develop the best possible exemplars, a minimum of five times chance performance in the eight-choice emotion recognition task (i.e., 62.5 %) was set as the cutoff level in this study. A set of 102 sentences from the emotional categories (anger: 18; disgust: 23; fear: 17; happiness: 21; pleasant surprise: 0; sadness: 23) and another 21 sentences from the neutral category fulfilled this quality criterion. Pleasant surprise, however, was recognized most poorly overall (ranging between 16.9 % and 50.5 %; mean = 39.4 %), and no token met the quality criterion. The data also revealed a high probability (two to four times beyond chance level, i.e., 25 % to 50 %) that pleasant surprise was considered to be analogous to happiness. A similar confusion between happiness and pleasant surprise has been reported in the literature (Paulmann, Pell, & Kotz, 2008; Pell, 2002), and Wallbott and Scherer (1986) found relative difficulties in distinguishing the two emotions in a study of emotional vocal portrayals. These two points suggest an overlap between happiness and pleasant surprise. Some researchers believe that misclassifications of these two categories occur because of their similar valence (Scherer, 1986), and this confusion may be more pronounced when the linguistic context is ambiguous as to the expected emotional interpretation. Additionally, analysis of the participants' comments showed that having proper names as the subjects of the sentences introduced a bias. Therefore, a second experiment was performed to remove this bias.

Experiment 2

We conducted another experiment to avoid the bias effects mentioned above. In doing so, proper names were replaced with profession titles such as "Ms. Teacher" and "Mr. Farmer." The English translations of these profession titles could be construed as common English surnames; in Persian, however, they imply only the person's gender and profession (for example, "Mr. Farmer" should be understood only as a male farmer), and, unlike in English, profession titles do not typically exist as surnames. Having profession titles instead of proper names is also beneficial in that it indirectly provides the reader or listener with extra contextual information that more easily evokes a mental image of the scenario. Consider the following two sentences: (a) Sara lost sight in both her eyes forever. (b) Ms. Tailor3 lost sight in both her eyes forever. Once participants have read that a lady tailor has lost sight in both her eyes forever, it is very likely that they will imagine that this lady is in great trouble, since she can no longer work as a tailor. Compared to the first sentence, in which the reader has no information about Sara, the additional information provided in sentence (b) can help elicit the intended emotion (sadness, in this example) more easily. To avoid gender effects, the authors used an equal number of male and female profession titles as the subjects of the sentences. A second experiment was then conducted with the aforementioned modifications (i.e., omission of pleasant surprise from the list of intended emotions, and replacement of proper names with profession titles).

3 On the basis of the earlier explanations, Ms. Tailor is a lady who works as a tailor, not someone whose family name is Tailor.

Method

Participants A total of 716 participants responded to this questionnaire. The data from 83 of the participants had to be discarded because of their excessively high error rates (i.e., above 25 %); these participants selected the same response option for most of the items (probably to finish the experiment as quickly as possible). This left 633 participants (329 male, 304 female) in the present data set. The mean age of the participants was 30.2 years (SD = 12.6), ranging from 18 to 62 years. The participants were roughly equivalent in years of formal education (15.9 ± 2.4).

Materials and procedure A set of 123 modified sentences was presented to participants in an online questionnaire. The procedure was the same as in the previous experiment; however, due to the removal of pleasant surprise, participants were provided with a seven-point nominal scale (i.e., chance level was 14.3 %). These seven points corresponded to anger, disgust, fear, sadness, happiness, neutral, and none of the above. The instructions for this experiment were the same as those for the previous one. Presentation sequence effects were avoided by displaying the sentences in a random order. A set of six control sentences (one for each emotional mode) was presented to control for inconsistent responses. Participants with more than two inconsistent responses were excluded.

Results and discussion Scherer et al. (1991) suggested that, in emotion recognition tasks, emotional content is recognized at approximately four times chance performance. Accordingly, to develop the best possible exemplars, a minimum of five times chance performance in the seven-choice emotion recognition task (i.e., 71.4 %) was set as the cutoff level. Applying this criterion led to a set of 90 validated Persian sentences that were reliably associated either with one particular emotion (anger, 17 sentences; disgust, 15; fear, 15; sadness, 14; and happiness, 15) or with no emotion at all (neutral, 14). See Appendix A for examples of the Persian sentences and their English translations. Many researchers studying emotional speech have prepared their own study-specific lists of sentences without validating the emotional lexical content (see Luo et al., 2007, for such a study). However, to conduct a reliable study on emotional speech, a set of validated sentences is required (Ben-David et al., 2011). Accordingly, on the basis of the results of this experiment, a set of 90 sentences was categorized into one of the five emotional categories or the neutral category. However, since the intensity of the emotion conveyed through the lexical content of the sentences could affect the participants' recognition of the intended emotions, it was necessary to determine the emotional intensity of each sentence.
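To make the selection rule concrete, the following minimal sketch (not part of the original study's materials) applies the five-times-chance cutoff to per-sentence response counts; the data structure and the example numbers are hypothetical.

```python
# Minimal sketch of the exemplar-selection rule: a sentence is retained only if
# the proportion of decoders choosing the intended category reaches five times
# chance in the seven-choice task. The response data below are hypothetical.
from collections import Counter

N_CHOICES = 7                 # anger, disgust, fear, sadness, happiness, neutral, none of the above
CHANCE = 1.0 / N_CHOICES      # about 0.143
CUTOFF = 5 * CHANCE           # about 0.714

def passes_cutoff(responses, intended):
    """responses: list of category labels chosen by the decoders for one sentence."""
    hit_rate = Counter(responses)[intended] / len(responses)
    return hit_rate >= CUTOFF

# Example: 600 of 633 decoders labeled a sentence intended as sad with "sadness".
example = ["sadness"] * 600 + ["fear"] * 20 + ["neutral"] * 13
print(passes_cutoff(example, "sadness"))   # True (600/633 is about 0.95, above 0.714)
```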

Part Three: Measuring the emotional intensity of the lexical content

The intensity of the emotion conveyed through the lexical content of the sentences could affect the participants' recognition of the intended emotions. Therefore, as the next step, the authors performed a third experiment with the validated 90-sentence set in order to determine the emotional intensity of each sentence.

Method

Participants A total of 250 Persian speakers (117 male, 133 female) took part in the experiment, none of whom had participated in the previous studies. Of these, 50 participants were excluded, either because Persian was no longer their dominant language (34 participants) or because they did not follow the instructions (16 participants). The participant recruitment procedure was the same as in the previous experiments. The mean age of the remaining participants (105 male, 95 female) was 29.6 years (SD = 12.3), ranging from 18 to 61 years. Participants were similar in years of formal education (15.7 ± 2.2).

Materials and procedure The 90 sentences were presented, in random order, to each participant in an online questionnaire. Participants were instructed to complete the questionnaire individually in a quiet environment. They were asked to read each sentence and to imagine the scenario depicted in it. Participants then rated the intensity of the intended emotion (given at the end of each sentence) on a five-point Likert scale (Likert, 1936) corresponding to very little, little, mild, high, and very high intensity. A five-point Likert scale is a reliable method for measuring the extent to which a stimulus is characterized by a specific property (Calder, 1998). In addition, in order to give participants the possibility of rejecting the intended emotion, we added the response option "not at all." For each item, this option was provided below the Likert scale, and participants were instructed to select "not at all" if they believed that another emotion was being described by the sentence.
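As a sketch of how per-sentence intensity scores such as those in Table 1 could be computed, the snippet below averages Likert ratings while setting aside "not at all" responses; the table layout, column names, and the handling of "not at all" are assumptions rather than a description of the authors' actual analysis.

```python
# Sketch: mean emotional intensity per sentence from Likert ratings.
# Assumes a long-format table with one row per (participant, sentence) rating,
# values 1-5 for "very little" .. "very high", and the string "not at all" for
# rejections of the intended emotion. Column names and coding are hypothetical.
import pandas as pd

ratings = pd.DataFrame({
    "sentence_id": [1, 1, 1, 2, 2, 2],
    "rating": [4, 5, "not at all", 3, 4, 4],
})

numeric = ratings[ratings["rating"] != "not at all"].copy()
numeric["rating"] = numeric["rating"].astype(float)

# Per-sentence descriptive statistics, analogous to the values reported in Table 1
intensity = numeric.groupby("sentence_id")["rating"].agg(["mean", "std", "max", "min"])
print(intensity)
```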

Results and discussion The mean emotional intensity of each of the 90 sentences was calculated, so that the intensity of the lexical content of each sentence was identified (see Table 1 for the details). This additional piece of information will allow researchers to use a matched set of sentences in future studies. At the end of Part Three of this study, we had generated and validated the first list of emotional and neutral sentences for Persian. This sentence set served as the finalized lexical content for recording the vocal emotional portrayals.

Table 1 Descriptive statistical values of the emotional intensity of the lexical content

Emotional Category   Mean Intensity   SD     Max   Min
Anger                4.9              0.33   5.4   4.2
Disgust              5.3              0.51   5.7   4.7
Fear                 4.8              0.46   5.4   4.3
Happiness            4.8              0.61   5.5   4.2
Sadness              5.2              0.43   5.6   4.6
Neutral              5.2              0.27   5.5   4.9

Part Four: Elicitation and recording procedure

The ultimate goal of this study was to establish and validate a database of emotional vocal portrayals. Therefore, as the next step, the validated sentences were articulated by two Persian native speakers (encoders) in the five emotional categories (anger, disgust, fear, happiness, and sadness) and the neutral mode. These vocal portrayals were recorded in a professional recording studio in Berlin, Germany.

Method

Encoders Actors learn to express emotions in an exaggerated way; therefore, professional actors were not used in this study. Instead, two middle-aged native Persian speakers (male, 50 years old; female, 49 years old) who had taken acting lessons and had practiced acting for some time were chosen to portray the verbal content of the database. Both speakers had learned Persian from birth and spoke Persian without an accent. The speakers received €25/h as financial compensation for their services.

Materials The 90 sentences selected in the lexical content validation phase (anger, 17 sentences; disgust, 15; fear, 15; sadness, 14; happiness, 15; and neutral, 14) were used as materials to elicit emotional speech from the two speakers in the following three conditions: (1) congruent, emotional lexical content articulated in a congruent emotional voice (76 sentences by two speakers); (2) incongruent, neutral sentences articulated in an emotional voice (70 sentences by two speakers); and (3) baseline, all emotional and neutral sentences articulated in a neutral voice (90 sentences by two speakers). This resulted in the generation of 472 vocal stimuli.

Procedure Prior to recording, each speaker had four practice sessions with the first author of this article. Each practice session started with a review and discussion of the literal and figurative meanings of a given emotion, its range, and the ways it could be portrayed in speech. After these discussions, the speakers were provided with standardized emotion-portrayal instructions based on a scenario approach (Scherer et al., 1991). Five scenarios (one corresponding to each emotion) were used in the portrayal instructions (see Appendix B for the list of scenarios). The same scenarios had been used in a similar study by Scherer et al. (1991), who selected them on the basis of intercultural studies of emotion experience in which representative emotion-eliciting situations were gathered from almost 3,000 participants living on five continents (Wallbott & Scherer, 1986). These scenarios are likely to elicit the target emotions; they were checked for cultural appropriateness and translated into Persian prior to the recording sessions, and they were used in both the practice and the recording sessions. The speakers were asked to read each scenario, imagine experiencing the situation described (Stanislavski method), and then articulate the given list of sentences in the way they would have uttered them in that situation.

The criterion for selecting the speech samples was to obtain audio tokens that could serve as representative examples of particular prosodic emotions; therefore, the speakers were encouraged to avoid exaggerated or dramatic use of prosody. Once the authors and the speakers were satisfied with the simulations in the practice sessions, the speakers made the final recordings. The speakers were recorded separately in a professional recording studio in Berlin under the supervision of an acoustic engineer and the first author. Each of the five emotions and the neutral portrayals were recorded in separate sessions. All utterances were recorded under identical conditions on digital tape (TASCAM DA-20 MK II), using a high-quality fixed microphone (Sennheiser MKH 20 P48), and were digitized at a 16-bit/44.1 kHz sampling rate. The sound files were then transferred to a computer and edited to mark the onset and offset of each sentence. Following Pell and Skorup (2008), each audio sentence was normalized to a peak intensity of 70 dB using Adobe Audition version 1.5, to control for unavoidable differences in the sound level of the source recordings across speakers. Accordingly, a total of 472 vocal utterances were generated, encompassing the three conditions of congruent (76 sentences by two speakers), incongruent (70 sentences by two speakers), and baseline (90 sentences by two speakers). It was anticipated that difficulties with the elicitation and simulation procedure would lead to some of the recorded stimuli not serving as typical portrayals of the target emotions (Pell, 2002; Scherer et al., 1991), or that other nuances of emotional categories would be identifiable in specific vocal portrayals (Scherer et al., 1991). Therefore, a perceptual study was essential to eliminate the poor portrayals.
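For readers who wish to reproduce the level-normalization step with open-source tools instead of Adobe Audition, the following sketch uses Praat through the parselmouth package; note that Praat's "Scale intensity" command adjusts average rather than peak intensity, so this only approximates the procedure described above, and the 70 dB target mapping and folder layout are placeholders.

```python
# Sketch: intensity normalization of the recorded sentences with Praat via
# parselmouth, as an open-source stand-in for the Adobe Audition step described
# above. Praat's "Scale intensity" sets the *average* intensity of a sound to
# the given dB value, which only approximates the peak-based normalization
# reported in the article; the folder layout is hypothetical.
import glob
import os

import parselmouth
from parselmouth.praat import call

TARGET_DB = 70.0
os.makedirs("normalized", exist_ok=True)

for path in glob.glob("recordings/*.wav"):              # hypothetical folder of raw takes
    snd = parselmouth.Sound(path)
    call(snd, "Scale intensity", TARGET_DB)             # rescale the waveform in place
    out_path = os.path.join("normalized", os.path.basename(path))
    call(snd, "Save as WAV file", out_path)
```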

Part Five: Validation of the vocal materials

To develop a well-controlled database of emotional speech for Persian, we conducted a perceptual study to eliminate the poor vocal portrayals. We then conducted an acoustic analysis to check whether the vocal portrayals revealed differences in acoustic parameters that might help listeners to distinguish the intended emotional categories correctly.

The perceptual study

Method

Participants To date, similar studies on emotional speech have recruited between ten and 24 participants as decoders to eliminate poor vocal portrayals (e.g., Pell, 2002; Pell, Paulmann, Dara, Alasseri, & Kotz, 2009). Nevertheless, in order to obtain robust results, we recruited a total of 34 participants as decoders (17 male; mean age 26.3 years, SD = 2.6). Four participants had to be excluded for not following the instructions of the experiment. All of the participants were Iranian undergraduate or graduate students studying in Berlin. They had all learned Persian from birth and had been away from Iran for less than three years. They all reported good hearing and had normal or corrected-to-normal vision, as verified by the examiner at the beginning of the study. Participants did not suffer from any psychopathological conditions, had no history of neurological problems, and took no psychoactive medication, as assessed by a detailed questionnaire. A detailed language questionnaire was also completed by each participant prior to testing, to ensure that Persian was their native and dominant language. Participants received €8/h as financial compensation for their cooperation.

Materials and procedure A total of 472 vocal utterances (all of the emotional and neutral portrayals), encompassing the three conditions of congruent (76 sentences by two speakers), incongruent (70 sentences by two speakers), and baseline (90 sentences by two speakers), were included in the perception study. Each participant was tested individually in a dimly lit, sound-attenuated room. Participants were presented with the previously recorded vocal utterances and were instructed to listen to each utterance and identify its emotional prosody, regardless of its lexical content. They marked their answers on a seven-button answer panel; the seven choices available were anger, disgust, fear, happiness, sadness, neutral, and none of the above. The stimulus set was presented in four blocks in a fully randomized design. The experiment took almost 90 min for each participant. To limit fatigue effects and possible inattention to the stimuli, participants were tested in four 20-min sessions, with a five-minute break after each session.

The experiment was run as follows: Acoustic exemplars were presented via a laptop computer using the E-Prime software (Schneider, Eschman, & Zuccolotto, 2002). Each participant heard the audio stimuli binaurally (through Sennheiser HD 600 headphones) at his or her comfortable loudness level (adjusted manually by each participant at the onset of the study). Each trial sequence consisted of (1) the presentation of a fixation cross for 200 ms, (2) a blank screen for 200 ms, (3) audio presentation of an exemplar with simultaneous display of an image of a loudspeaker, (4) display of a question mark, indicating that an emotion judgment should be made, and (5) a blank screen for 2,000 ms. See Fig. 1 for a schematic illustration of the procedure.
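The trial sequence can be illustrated with a short script; the sketch below uses PsychoPy as an open alternative to the E-Prime implementation actually used in the study, and the stimulus file name, response keys, and the text placeholder for the loudspeaker image are assumptions.

```python
# Sketch of one trial (fixation 200 ms, blank 200 ms, audio with a loudspeaker
# placeholder, response prompt until a key press, blank 2,000 ms), written with
# PsychoPy as an open alternative to the E-Prime script used in the study.
# File names, response keys, and the on-screen placeholders are hypothetical.
import wave

from psychopy import core, event, sound, visual

win = visual.Window(fullscr=False, color="black")
fixation = visual.TextStim(win, text="+")
prompt = visual.TextStim(win, text="?")
speaker_icon = visual.TextStim(win, text="((o))")        # stand-in for the loudspeaker image
RESPONSE_KEYS = ["1", "2", "3", "4", "5", "6", "7"]      # anger .. none of the above

def wav_duration(path):
    with wave.open(path) as w:                           # read clip length from the file header
        return w.getnframes() / w.getframerate()

def run_trial(wav_path):
    fixation.draw(); win.flip(); core.wait(0.2)          # fixation cross, 200 ms
    win.flip(); core.wait(0.2)                           # blank screen, 200 ms
    speaker_icon.draw(); win.flip()
    sound.Sound(wav_path).play()                         # exemplar presentation
    core.wait(wav_duration(wav_path))
    prompt.draw(); win.flip()
    keys = event.waitKeys(keyList=RESPONSE_KEYS)         # question mark until response
    win.flip(); core.wait(2.0)                           # blank screen, 2,000 ms
    return keys[0]

response = run_trial("stimuli/example_utterance.wav")
win.close()
core.quit()
```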

Fig. 1 Schematic illustration of a trial presentation (fixation cross, 200 ms; blank screen, 200 ms; exemplar presentation; question mark until response; blank screen, 2,000 ms)

Results and discussion

The percentage of native listeners who accurately categorized the target emotion expressed in each sentence was computed for each item and speaker. The percentages of accurate responses, as well as the error patterns averaged across the two speakers, are presented in Table 2. In order to create a set of exemplars that portray the intended emotions as accurately as possible, we adopted the criterion described below. Previous work suggests that vocal emotions (except for pleasant surprise and neutral) are recognized at almost four times above chance level (Scherer et al., 1991). Therefore, in order to select the best possible exemplars, a minimum of five times chance performance in the seven-choice emotion recognition task (i.e., 71.4 %) was set as the cutoff level in the present study. Application of this criterion led to the exclusion of only one token, from the incongruent condition, articulated by the female speaker and intended to communicate disgust. In addition, in order to exclude a systematic uncontrolled condition, all of the tokens whose response percentages fell between 71.4 % and 85.7 % (i.e., between five and six times chance level) were scrutinized carefully for their error patterns. Tokens that showed repetition of the same wrong answer above chance level (i.e., 14.3 %) were then omitted. This resulted in the omission of three exemplars, all of which belonged to the incongruent condition and were meant to portray fear (one token by the female speaker) and sadness (two tokens by the male speaker). As a result, a total of 468 vocal portrayals that fulfilled the quality criteria were kept. These vocal portrayals, conveying five emotional meanings and the neutral mode, serve as the vocal materials of the database of Persian emotional speech. See Fig. 2 for a tree chart of the validated database, as articulated by the two speakers.

The results obtained from the perceptual study (shown in Table 2) reveal that all emotional portrayals were recognized very accurately (ranging between 90.5 % and 98 %). The most difficult emotion to recognize was disgust (90.5 % in the incongruent and 95.65 % in the congruent condition). Interestingly, relative difficulty in recognizing vocal portrayals of disgust has been reported in the literature (Scherer et al., 1991). Participants also had more difficulty recognizing fear portrayals than the other emotional categories (94.5 % in the incongruent and 97.7 % in the congruent condition). Analyzing the error patterns can provide valuable cues as to the nature of the inference process, as well as to the cues used by the participants (Banse & Scherer, 1996). A cursory examination of the error patterns suggests that, in the congruent condition, sadness and fear were often confused with one another (fear was mistaken for sadness in 1.2 % of responses, and sadness was confused with fear in 1.05 %). Similar confusion patterns for fear and sadness have been reported in the literature on emotional vocal portrayals (Paulmann et al., 2008; Pell et al., 2009). Portrayals of disgust were also mistaken for sadness (1.3 % in the congruent and 2.35 % in the incongruent condition). It has been argued that emotions that are acoustically similar (e.g., disgust and sadness) are very likely to be misclassified (Banse & Scherer, 1996). Some researchers have also claimed that misclassifications often involve emotions of similar valence (e.g., fear and anger) or arousal (e.g., sadness and disgust; Scherer, 1986). These factors may explain some of the errors observed in this study. As expected, the vocal tokens in the congruent condition showed higher rates of emotion recognition than did those in the incongruent condition, which may be due to the absence of supporting lexical cues in the incongruent condition. The results also reveal that vocal portrayals of neutrality, anger, and happiness were associated with the least confusion.
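The two-step screening described above (the five-times-chance cutoff plus the error-pattern check for tokens between five and six times chance) can be expressed compactly; the sketch below is an illustration with hypothetical response data, not the authors' original analysis code.

```python
# Sketch of the two-step screening of the vocal portrayals: (1) keep only tokens
# recognized at >= 5 x chance (71.4 % in the seven-choice task), and (2) for
# tokens between 5 x and 6 x chance, drop those whose most frequent wrong answer
# itself exceeds chance (14.3 %). The response data are hypothetical.
from collections import Counter

CHANCE = 1 / 7

def keep_token(responses, intended):
    counts = Counter(responses)
    n = len(responses)
    hit_rate = counts[intended] / n
    if hit_rate < 5 * CHANCE:                 # below the cutoff: exclude
        return False
    if hit_rate < 6 * CHANCE:                 # borderline: inspect the error pattern
        wrong = {label: c for label, c in counts.items() if label != intended}
        if wrong and max(wrong.values()) / n > CHANCE:
            return False                      # one wrong answer was given above chance
    return True

# A fear token recognized by 25 of 30 decoders, with 5 decoders answering "sadness":
print(keep_token(["fear"] * 25 + ["sadness"] * 5, "fear"))   # False: 5/30 > 14.3 %
```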

Table 2 Distribution (as percentages) of the responses given to each of the intended expressions. For each condition (congruent, incongruent, and baseline), the rows list the lexical content of the sentences and the target vocal emotion, and the columns give the percentage of responses in each category (anger, disgust, fear, sadness, happiness, neutral, and none of the above). Recognition accuracy rates (sensitivity) appear on the diagonal and are indicated in bold; values are averaged across speakers.

Fig. 2 Tree chart of the validated database


Acoustic analysis

Acoustic analyses were performed to determine whether the vocal portrayals showed differences in acoustic parameters that might help listeners to distinguish the intended emotions correctly. The analyses were limited to three critical parameters (mean pitch, mean intensity, and duration) that have been reported to differentiate well among vocal emotion categories in perceptual terms (Juslin & Laukka, 2003). A total of 468 vocal utterances (all of the validated emotional and neutral portrayals), encompassing the congruent, incongruent, and baseline conditions, were included in this analysis. These vocal utterances were analyzed using the Praat speech analysis software (Boersma & Weenink, 2006). See Table 3 for the mean values of the normalized acoustic measures of the valid emotional portrayals, and see the supplementary materials for a detailed list of the file names, along with their values on the acoustic measures.

The sentences that served as the lexical content of the database were not matched for number of syllables. Therefore, in the first step, only the mean pitch and mean intensity of the vocal utterances were entered into a series of univariate analyses of variance (ANOVAs). The two acoustic measures (mean pitch and mean intensity) served as the dependent variables, and the six-level independent variable comprised the five emotion types and the neutral mode. The results revealed highly significant differences across the emotional categories. We found main effects of mean pitch [F(5, 84) = 142.307, p < .01, η² = .894] and mean intensity [F(5, 84) = 7.626, p < .01, η² = .312] in the congruent condition (emotional lexical content articulated in the emotional voice), as well as of mean pitch [F(5, 78) = 54.41, p < .01, η² = .777] and mean intensity [F(5, 78) = 9.424, p < .01, η² = .377] in the incongruent condition (neutral lexical content portrayed in emotional voices). More specifically, in the incongruent condition, anger (279.18 Hz) and happiness (280.40 Hz) had the highest pitch values, fear (250.18 Hz) and sadness (247.35 Hz) had similar but lower pitch values, and disgust (216.34 Hz) had the lowest pitch value. In the congruent condition, anger (274.76 Hz), disgust (266.67 Hz), and happiness (268.78 Hz) had the highest pitch values, fear (249.56 Hz) had a lower pitch value, and sadness (226.74 Hz) had the lowest pitch value.

Using the Tukey–Kramer HSD test, pairwise comparisons between the mean pitch values of the emotional categories and the neutral mode were conducted separately for each of the two conditions. The results revealed a highly significant difference (p < .01) between each of the five emotions and the neutral mode in both conditions. Figures 3 and 4 display the mean pitch values of the sentences portrayed in the congruent and incongruent conditions, respectively; highly significant effects are marked by two asterisks.

As for intensity, in both the congruent and incongruent conditions, all of the emotional categories had similar values (congruent condition: anger = 72.11, disgust = 71.75, fear = 73.40, happiness = 73.63, and sadness = 72.11 dB; incongruent condition: anger = 72.61, disgust = 71.53, fear = 71.22, happiness = 73.81, and sadness = 72.89 dB). Pairwise comparisons between the mean intensities of the five emotional portrayals and that of the neutral portrayals were performed using Tukey–Kramer HSD tests. In the congruent condition, the results revealed a highly significant difference (p < .01) between the mean intensities of the happiness and neutral portrayals, and the fear portrayals also showed a significant difference (p < .05); however, we found no significant differences between the mean intensities of the anger, disgust, and sadness portrayals and that of the neutral portrayals. In the incongruent condition, only the mean intensity of happiness differed highly significantly (p < .01) from that of the neutral portrayals; the differences between the mean intensities of the other emotional portrayals (anger, disgust, fear, and sadness) and that of the neutral portrayals were not significant. Figures 5 and 6 display the mean intensity values of the emotional and neutral portrayals in the congruent and incongruent conditions; significant effects are marked by one asterisk, and highly significant effects by two asterisks.
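For readers who want to reproduce comparable measurements, the sketch below extracts mean pitch, mean intensity, and duration with Praat via the parselmouth package and submits mean pitch to a one-way ANOVA with Tukey HSD comparisons using statsmodels; the folder layout, the emotion label encoded in the file name, and the default Praat analysis settings are assumptions.

```python
# Sketch: extract mean pitch, mean intensity, and duration with Praat via
# parselmouth, then run a one-way ANOVA with Tukey HSD comparisons on mean
# pitch using statsmodels. File layout and naming are hypothetical.
import glob

import pandas as pd
import parselmouth
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rows = []
for path in glob.glob("validated/incongruent/*.wav"):        # e.g., speaker1_s03_anger.wav
    snd = parselmouth.Sound(path)
    f0 = snd.to_pitch().selected_array["frequency"]
    rows.append({
        "emotion": path.rsplit("_", 1)[-1].removesuffix(".wav"),
        "pitch": f0[f0 > 0].mean(),                           # mean F0 over voiced frames
        "intensity": snd.to_intensity().values.mean(),        # frame-wise mean in dB (a simplification)
        "duration": snd.get_total_duration(),                 # in seconds
    })

df = pd.DataFrame(rows)

# Univariate ANOVA with the emotion factor, followed by Tukey HSD pairwise tests
model = smf.ols("pitch ~ C(emotion)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
print(pairwise_tukeyhsd(endog=df["pitch"], groups=df["emotion"], alpha=0.01))
```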

Table 3 Normalized acoustic measures of the valid emotional portrayals, per condition (averaged across both speakers)

Condition     Lexical Content of the Sentences   Target Vocal Emotion   Mean Pitch (Hz)   SD (Hz)   Mean Intensity (dB)   SD (dB)
Congruent     Anger                              Anger                  274.76            17.50     72.11                 0.90
Congruent     Disgust                            Disgust                266.67            16.41     71.75                 1.15
Congruent     Fear                               Fear                   249.56            11.24     73.40                 0.50
Congruent     Happiness                          Happiness              268.78            13.64     73.63                 1.53
Congruent     Sadness                            Sadness                226.74            18.74     72.11                 1.32
Incongruent   Neutral                            Anger                  279.18            13.91     72.61                 0.98
Incongruent   Neutral                            Disgust                216.34            17.24     71.53                 1.41
Incongruent   Neutral                            Fear                   250.18            31.80     71.22                 1.07
Incongruent   Neutral                            Happiness              280.40            22.34     73.81                 1.11
Incongruent   Neutral                            Sadness                247.35            38.32     72.89                 1.30


Fig. 3 Mean pitch values of the sentences, portrayed in the congruent condition. Error bars represent ±1 SD. **p < .01

Fig. 4 Mean pitch values of the sentences, portrayed in the incongruent condition and the neutral mode. Error bars represent ±1 SD. **p < .01


Fig. 5 Mean intensities of the emotional portrayals in the congruent condition and the neutral mode. Error bars represent ±1 SD. **p < .01, *p < .05

Fig. 6 Mean intensities of the neutral portrayals and of the emotional portrayals in the incongruent condition. Error bars represent ±1 SD. **p < .01

Table 4 shows the mean duration values per condition and speaker. In the next step, a repeated measures ANOVA was conducted with a 6 (prosody) × 2 (speaker) design, with the mean duration of the vocal utterances in the incongruent condition (neutral sentences portrayed in the five emotional categories plus the neutral mode) as the dependent variable. The results revealed highly significant main effects of speaker [F(1, 10) = 64.425, p < .01] and prosody [F(5, 50) = 377.275, p < .01], as well as a highly significant interaction [F(5, 50) = 31.697, p < .01].
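A 6 (prosody) × 2 (speaker) repeated measures ANOVA of this kind can be run, for example, with statsmodels' AnovaRM, as sketched below; the long-format data frame, its column names, and the file name are assumptions about how the duration measurements might be organized.

```python
# Sketch of the 6 (prosody) x 2 (speaker) repeated measures ANOVA on duration,
# with the neutral sentence as the repeated-measures unit. The long-format data
# frame, its column names, and the file name are hypothetical.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

durations = pd.read_csv("incongruent_durations.csv")
# expected columns: sentence_id, prosody (anger .. neutral), speaker (male/female), duration (s)

result = AnovaRM(
    data=durations,
    depvar="duration",
    subject="sentence_id",
    within=["prosody", "speaker"],
).fit()
print(result)
```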

Table 4 Comparison of mean durations of the emotional sentences, portrayed in the corresponding emotions (congruent condition) as well as in a neutral voice (baseline condition)

Condition   Lexical Content of the Sentences   Target Vocal Emotion   Mean Duration per Sentence (s), Male Speaker (SD)   Mean Duration per Sentence (s), Female Speaker (SD)
Congruent   Anger                              Anger                  4.73 (0.42)                                         5.00 (0.35)
Congruent   Disgust                            Disgust                7.15 (0.81)                                         6.78 (0.75)
Congruent   Fear                               Fear                   4.06 (0.34)                                         4.46 (0.46)
Congruent   Happiness                          Happiness              5.31 (0.54)                                         5.23 (0.54)
Congruent   Sadness                            Sadness                5.43 (0.50)                                         6.00 (0.42)
Baseline    Anger                              Neutral                4.85 (0.34)                                         4.25 (0.30)
Baseline    Disgust                            Neutral                4.33 (0.52)                                         3.81 (0.35)
Baseline    Fear                               Neutral                4.59 (0.24)                                         4.01 (0.28)
Baseline    Happiness                          Neutral                4.45 (0.39)                                         3.82 (0.26)
Baseline    Sadness                            Neutral                4.81 (0.35)                                         4.06 (0.32)

See Figs. 7 and 8 for comparisons of the durations of the vocal utterances across the various emotions. Next, each emotion was compared to the neutral portrayal using within-subjects contrasts. Apart from anger, which revealed no significant effect [F(1, 10) = 0.246, p = .63], all of the other emotions (disgust, fear, happiness, and sadness) displayed highly significant differences [all Fs(1, 10) > 27.915, ps < .01]. However, we did obtain a highly significant result for anger [F(1, 10) = 13.684, p < .01] when the other variable, speaker, was taken into account in this comparison. As can be seen in Fig. 7, when articulating the sentences in an angry voice, the female speaker slowed down to a meaningful extent. Overall, fear portrayals (3.84 s) were the fastest to be uttered, anger (4.21 s) and happiness (4.85 s) utterances were slower, sadness vocalizations (5.52 s) were slower still, and disgust portrayals (7.88 s) were the slowest of all.

Fig. 7 Comparison of mean durations of the neutral sentences, portrayed in all emotional categories for each speaker. Error bars represent ±1 SD


Fig. 8 Comparison of mean durations of the neutral sentences, portrayed in all emotional categories averaged for the two speakers. Error bars represent ±1 SD

The durations of the emotional portrayals were then compared with the neutral portrayals of the emotional sentences for each emotion. Figure 9 and Table 4 display the comparison of the mean durations of the emotional sentences, portrayed in the corresponding emotions as well as in a neutral voice. These pairwise comparisons, conducted with paired-sample t tests, showed significant differences in mean duration for anger [t(33) = 3.18, p < .01], disgust [t(29) = 30.237, p < .01], happiness [t(29) = 9.044, p < .01], and sadness [t(26) = 16.51, p < .01]. Only the mean duration of fear did not differ significantly [t(29) = –0.353, n.s.]. Because multiple comparisons were made, the significance level was reduced to .01, in accordance with a Bonferroni correction.

Fig. 9 Comparison of mean durations of the emotional sentences, portrayed in the corresponding emotions as well as in a neutral voice. Error bars represent ±1 SD

Discriminant analysis

A discriminant function analysis was performed for all three conditions (i.e., congruent, incongruent, and baseline) to determine how well the six categories (five emotions plus neutral) could be classified on the basis of the selected acoustic measures (mean pitch, mean intensity, and duration). A total of 468 vocal utterances (all of the validated emotional and neutral portrayals), encompassing the three conditions, were included in the discriminant analysis. Due to the different nature of each condition, not all three measures (mean pitch, mean intensity, and mean duration) were taken into account for every condition. In each analysis, the intended emotional category served as the dependent variable, and the acoustic measurements served as the independent variables.

In the congruent condition, the analysis was conducted only on the two measures of mean pitch and mean intensity (because of the different lengths of the sentences in this condition, speech rate was not taken into account). The vast majority (95.1 %) of the variance was accounted for by mean pitch. Pooled within-group correlations between the acoustic parameters and the first canonical discriminant function scores revealed that mean pitch demonstrated the highest correlation (r = .998). Mean intensity had the largest pooled within-group correlation with the canonical discriminant function score (r = .999) in a second function, which accounted for 4.9 % of the variance. The classifications resulting from the discriminant analysis revealed that the model identified 62.2 % of the sentences correctly (anger, 47.1 %; disgust, 33.3 %; fear, 66.7 %; happiness, 53.3 %; sadness, 78.6 %; and neutral, 100 %). Figure 10 illustrates how the canonical discriminant functions separated the emotional categories for each sentence. As can be seen, with the exceptions of anger and disgust, which were often mistaken for each other, the first two functions successfully separated the sentences by emotional category.

In the incongruent condition, the authors calculated the speech rate for each emotional category by subtracting the duration of the neutrally portrayed sentences from that of the emotionally portrayed sentences. To avoid the double use of values, the discriminant analysis was conducted only on the neutral sentences that were portrayed emotionally (i.e., the incongruent condition). Therefore, in this condition the analysis was conducted on the basis of the three measures of mean pitch, mean intensity, and mean duration.

The vast majority (95.0 %) of the variance was accounted for by the first function described by this discriminant analysis. Pooled within-group correlations between the acoustic parameters and the first canonical discriminant function scores revealed that mean duration demonstrated the highest correlation (r = .861). Mean intensity had the largest pooled within-group correlation with the canonical discriminant function score (r = .895) in a second function, which accounted for 4.7 % of the variance. In a third function, which accounted for 0.3 % of the variance, mean pitch had the highest pooled within-group correlation with the canonical discriminant function score (r = .789). Figure 11 illustrates how the canonical discriminant function scores for Functions 1 and 2 separated the emotional categories for each sentence; as can be seen, the first two functions clearly separated the sentences by emotional category. The classification results obtained from the discriminant analysis revealed that the model identified 81.4 % of the sentences correctly (anger, 71.4 %; disgust, 100 %; fear, 78.6 %; happiness, 71.4 %; sadness, 85.7 %). In the baseline condition, only mean pitch and mean intensity were taken into account (due to the different lengths of the sentences in this condition, speech rate was not considered). In this condition, the vast majority (95 %) of the variance was accounted for by the first function described by the discriminant analysis. Pooled within-group correlations between the acoustic parameters and the first canonical discriminant function scores revealed that mean pitch demonstrated the highest correlation (r = .999). Mean intensity had the largest pooled within-group correlation with the canonical discriminant function score (r = .994) in a second function, which accounted for 5 % of the variance. As expected, even the best model did not perform well in predicting category membership, averaging 43.4 % correct (anger, 35.3 %; disgust, 53.3 %; fear, 46.7 %; happiness, 33.3 %; sadness, 50.0 %). As can be seen in Fig. 12, the two functions did not separate the categories clearly; at best, Function 1 separated anger and disgust from the rest.
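A comparable discriminant function analysis can be sketched with scikit-learn's LinearDiscriminantAnalysis, as below; this is an illustrative re-implementation on a hypothetical per-utterance table, not the authors' original analysis, and the exact variance proportions and classification rates will depend on the data and solver.

```python
# Sketch of a discriminant function analysis on the acoustic measures, using
# scikit-learn on a hypothetical per-utterance table (e.g., the incongruent
# condition with mean pitch, mean intensity, and duration).
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

df = pd.read_csv("incongruent_measures.csv")          # hypothetical file
X = df[["pitch", "intensity", "duration"]].to_numpy()
y = df["emotion"].to_numpy()

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print("variance explained per discriminant function:", lda.explained_variance_ratio_)

pred = lda.predict(X)                                 # classify each utterance from its scores
labels = sorted(set(y))
print(confusion_matrix(y, pred, labels=labels))
print("overall classification accuracy:", (pred == y).mean())
```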

Part Six: Summary and general discussion The present study was designed to create a well-controlled database of emotional speech for Persian. A set of Persian sentences for the five basic emotions (anger, disgust, fear, happiness, and sadness) as well as neutral mode were generated and validated in a set of rating studies. Then, the emotional intensity of each sentence was calculated. Having information on the intensity of the emotional meanings will allow researchers to use a matched set of sentences for clinical and neurological studies. Two native speakers of Persian articulated the sentences in


Fig. 10 Results of the discriminant function analysis, illustrating how the canonical discriminant function scores for Functions 1 and 2 separate the emotional categories for each sentence (congruent condition)

Their vocal expressions were validated in a perceptual study by a group of 34 native listeners. The validated vocal portrayals were then subjected to acoustic analysis. In both the perceptual and the acoustic patterns, the expected variations were observed among the five emotion categories.

The present study has a number of limitations. First, having two speakers (one per gender) as encoders makes it difficult to gauge the extent of interspeaker variability (Douglas-Cowie et al., 2003), and this might have led to greater variability in the acoustic and perceptual measures. Moreover, having a small number of speakers might have introduced certain artifacts (Pell, 2002).


Fig. 11 Results of a discriminant analysis that demonstrates how the canonical discriminant function scores for Functions 1 and 2 separate the emotional categories for each sentence (in the incongruent condition)


Fig. 12 Results of the discriminant function analysis, revealing that, as expected, the canonical discriminant function scores for Functions 1 and 2 do not separate the emotional categories for each sentence (baseline condition)

However, to mitigate this drawback, the speakers were selected from a group of semiprofessional actors, and the Stanislavski method was employed to help them articulate the emotions naturally. In addition, their utterances were recorded in a professional recording studio under the supervision of an acoustic engineer.

Another potential problem is that only a small number of decoders were recruited for the perceptual validation study, which aimed to select the best possible exemplars; a larger number of decoders would likely improve the reliability of the results. To minimize this drawback, we recruited more participants as decoders than had been recruited in similar studies (ten decoders in Pell, 2002, and 20 decoders in Pell et al., 2009). In addition, the cutoff level used in previous studies (three times chance level; Pell et al., 2009) was raised to a minimum of five times chance level in the seven-choice emotion recognition task (chance = 1/7 ≈ 14.3 %, so the cutoff was 5/7 ≈ 71.4 %).

Finally, only a small number of acoustic parameters were taken into account in the acoustic analysis; a larger set of parameters would provide more information about the acoustic features of Persian emotional speech. To narrow this gap and to make sure that the vocal utterances contained detectable acoustic contrasts, a discriminant function analysis was performed on the selected acoustic measures (pitch, intensity, and duration). The resulting classifications showed that the model identified 62.2 % of the sentences correctly in the congruent condition.

This value was 81.4 % in the incongruent condition and 43.4 % in the baseline condition.

Despite these limitations, the emotional stimuli (both textual and vocal) of the present study were perceptually validated. The database (Persian ESD) encompasses a meaningful set of validated lexical (90 items) and vocal (468 utterances) stimuli conveying five emotional meanings. Since the database covers the three conditions of (a) congruent, (b) incongruent, and (c) baseline, it offers the unique possibility of separately identifying the effects of prosody and of lexical content on the identification of emotions in speech. The database could also be used in neuroimaging and clinical studies to assess a person's ability to identify emotions in spoken language. Additionally, this database opens up new opportunities for future investigations in speech synthesis research, as well as in gender studies. To access the database and the supplementary information, please contact [email protected].

Author note The authors express their appreciation to Silke Paulmann, Maria Macuch, Klaus Scherer, Luna Beck, Dar Meshi, Francesca Citron, Pooya Keshtiari, Arsalan Kahnemuyipour, Saeid Sheikh Rohani, Georg Hosoya, Jörg Dreyer, Masood Ghayoomi, Elif Alkan Härtwig, Lea Gutz, Reza Nilipour, Yahya Modarresi Tehrani, Fatemeh Izadi, Trudi Falamaki-Zehnder, Liila Taruffi, Laura Hahn, Karl Brian Northeast, Arash Aryani, Christa Bös, and Afsaneh Fazly for their help with sentence construction and validation, recordings, data collection and organization, and manuscript preparation. A special thank you goes to our two speakers, Mithra Zahedi and Vahid Etemad. The authors also thank all of the participants who took part in the various experiments in this study. This research was financially supported by a grant from the German Research Society (DFG) to N.K.


Appendix A: Sample of the Persian sentences included in the database, along with their transliteration, glosses, and English translation

Abbreviations used are as follows: Ez: ezafe particle; CL: clitic; CL.3SG: third person singular clitic; DOM: direct object marker; 3SG: third person singular.


Appendix B: List of scenarios

Anger: The director is late for the rehearsal again, and we have to work until late at night. Once again I have to cancel an important date.

Disgust: I have a summer job in a restaurant. Today I have to clean the toilets, which are incredibly filthy and smell very strongly.

Fear: While I am on a tour bus, the driver loses control of the bus while trying to avoid another car. The bus comes to a standstill at the edge of a precipice, threatening to fall over.

Happiness: I am acting in a new play. From the start, I get along extremely well with my colleagues, who even throw a party for me.

Sadness: I get a call telling me that my best friend died suddenly.

Note that Persian is written from right to left. The abbreviations are as follows: Ez: ezafe particle; CL.3SG: third person singular clitic; DOM: direct object marker; 3SG: third person singular.

References

Anvari, H., & Givi, H. (1996). Persian grammar (2 vols). Tehran, Iran: Fatemi.
Banse, R., & Scherer, K. R. (1996). Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology, 70, 614–636. doi:10.1037/0022-3514.70.3.614
Ben-David, B. M., van Lieshout, P. H., & Leszcz, T. (2011). A resource of validated affective and neutral sentences to assess identification of emotion in spoken language after a brain injury. Brain Injury, 25, 206–220.
Boersma, P., & Weenink, D. (2006). Praat: Doing phonetics by computer (Version 4.4.11) [Computer program]. Retrieved February 26, 2010, from www.praat.org
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F., & Weiss, B. (2005, September). A database of German emotional speech. Paper presented at the 9th European Conference on Speech Communication and Technology, Lisbon, Portugal.
Calder, J. (1998). Survey research methods. Medical Education, 32, 636–652.
Campbell, N. (2000, September). Databases of emotional speech. Paper presented at the ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, Newcastle, Northern Ireland, UK.
Cowie, R., & Cornelius, R. R. (2003). Describing the emotional states that are expressed in speech. Speech Communication, 40, 5–32.
Douglas-Cowie, E., Campbell, N., Cowie, R., & Roach, P. (2003). Emotional speech: Towards a new generation of databases. Speech Communication, 40, 33–60.
Ekman, P. (1999). Basic emotions. In T. Dalgleish & T. Power (Eds.), The handbook of cognition and emotion (pp. 45–60). Hove, UK: Wiley.
Ekman, P., Friesen, W. V., & O’Sullivan, M. (1988). Smiles when lying. Journal of Personality and Social Psychology, 54, 414–420. doi:10.1037/0022-3514.54.3.414

Frank, M. G., & Stennett, J. (2001). The forced-choice paradigm and the perception of facial expressions of emotion. Journal of Personality and Social Psychology, 80, 75–85. doi:10.1037/0022-3514.80.1.75
Gerrards-Hesse, A., Spies, K., & Hesse, F. W. (1994). Experimental inductions of emotional states and their effectiveness: A review. British Journal of Psychology, 85, 55–78.
Gharavian, D., & Ahadi, S. M. (2009). Emotional speech recognition and emotion identification in Farsi language. Modares Technical and Engineering, 34(13), 2.
Gharavian, D., & Sheikhan, M. (2010). Emotion recognition and emotion spotting improvement using formant-related features. Majlesi Journal of Electrical Engineering, 4(4).
Gharavian, D., Sheikhan, M., Nazerieh, A., & Garoucy, S. (2012). Speech emotion recognition using FCBF feature selection method and GA-optimized fuzzy ARTMAP neural network. Neural Computing and Applications, 21, 2115–2126.
Ghayoomi, M., Momtazi, S., & Bijankhan, M. (2010). A study of corpus development for Persian. International Journal on Asian Language Processing, 20, 17–33.
Johnson, W. F., Emde, R. N., Scherer, K. R., & Klinnert, M. D. (1986). Recognition of emotion from vocal cues. Archives of General Psychiatry, 43, 280–283. doi:10.1001/archpsyc.1986.01800030098011
Johnstone, T., & Scherer, K. R. (1999, August). The effects of emotions on voice quality. In Proceedings of the 14th International Congress of Phonetic Sciences (pp. 2029–2032). San Francisco, CA: University of California, Berkeley.
Juslin, P. N., & Laukka, P. (2003). Communication of emotions in vocal expression and music performance: Different channels, same code? Psychological Bulletin, 129, 770–814. doi:10.1037/0033-2909.129.5.770
Likert, R. (1936). A method for measuring the sales influence of a radio program. Journal of Applied Psychology, 20, 175–182.
Liu, P., & Pell, M. D. (2012). Recognizing vocal emotions in Mandarin Chinese: A validated database of Chinese vocal emotional stimuli. Behavior Research Methods, 44, 1042–1051. doi:10.3758/s13428-012-0203-3
Luo, X., Fu, Q. J., & Galvin, J. J. (2007). Vocal emotion recognition by normal-hearing listeners and cochlear implant users. Trends in Amplification, 11, 301–315.
Makarova, V., & Petrushin, V. A. (2002, September). RUSLANA: A database of Russian emotional utterances. Paper presented at the International Conference of Spoken Language Processing, Colorado, USA.
Maurage, P., Joassin, F., Philippot, P., & Campanella, S. (2007). A validated battery of vocal emotional expressions. Neuropsychological Trends, 2, 63–74.
Mitchell, R. L., Elliott, R., Barry, M., Cruttenden, A., & Woodruff, P. W. (2004). Neural response to emotional prosody in schizophrenia and in bipolar affective disorder. British Journal of Psychiatry, 184, 223–230.
Niimi, Y., Kasamatsu, M., Nishinoto, T., & Araki, M. (2001, August). Synthesis of emotional speech using prosodically balanced VCV segments. Paper presented at the 4th ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis, Perthshire, Scotland.
O’Brien, N. (2011). Stanislavski in practice: Exercises for students. New York, NY: Routledge.
Pakosz, M. (1983). Attitudinal judgments in intonation: Some evidence for a theory. Journal of Psycholinguistic Research, 12, 311–326.
Paulmann, S., Pell, M. D., & Kotz, S. A. (2008). How aging affects the recognition of emotional speech. Brain and Language, 104, 262–269. doi:10.1016/j.bandl.2007.03.002
Pell, M. D. (2001). Influence of emotion and focus location on prosody in matched statements and questions. Journal of the Acoustical Society of America, 109, 1668–1680. doi:10.1121/1.1352088
Pell, M. D. (2002). Evaluation of nonverbal emotion in face and voice: Some preliminary findings on a new battery of tests. Brain and Cognition, 48, 499–514.
Pell, M. D., Jaywant, A., Monetta, L., & Kotz, S. A. (2011). Emotional speech processing: Disentangling the effects of prosody and semantic cues. Cognition and Emotion, 25, 834–853. doi:10.1080/02699931.2010.516915
Pell, M. D., & Kotz, S. A. (2011). On the time course of vocal emotion recognition. PLoS ONE, 6, e27252. doi:10.1371/journal.pone.0016505
Pell, M. D., Paulmann, S., Dara, C., Alasseri, A., & Kotz, S. A. (2009). Factors in the recognition of vocally expressed emotions: A comparison of four languages. Journal of Phonetics, 37, 417–435.
Pell, M. D., & Skorup, V. (2008). Implicit processing of emotional prosody in a foreign versus native language. Speech Communication, 50, 519–530.

Petrushin, V. (1999, November). Emotion in speech: Recognition and application to call centers. Paper presented at the Conference on Artificial Neural Networks in Engineering, St. Louis, USA.
Roach, P. (2000, September). Techniques for the phonetic description of emotional speech. Paper presented at the ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, Newcastle, Northern Ireland, UK.
Roach, P., Stibbard, R., Osborne, J., Arnfield, S., & Setter, J. (1998). Transcription of prosodic and paralinguistic features of emotional speech. Journal of the International Phonetic Association, 28, 83–94.
Russ, J. B., Gur, R. C., & Bilker, W. B. (2008). Validation of affective and neutral sentence content for prosodic testing. Behavior Research Methods, 40, 935–939. doi:10.3758/BRM.40.4.935
Russell, J. A. (1994). Is there universal recognition of emotion from facial expressions? A review of the cross-cultural studies. Psychological Bulletin, 115, 102–141. doi:10.1037/0033-2909.115.1.102
Scherer, K. R. (1986). Vocal affect expression: A review and a model for future research. Psychological Bulletin, 99, 143–165. doi:10.1037/0033-2909.99.2.143
Scherer, K. R., Banse, R., Wallbott, H. G., & Goldbeck, T. (1991). Vocal cues in emotion encoding and decoding. Motivation and Emotion, 15, 123–148.
Scherer, K. R., Ladd, D. R., & Silverman, K. E. A. (1984). Vocal cues to speaker affect: Testing two models. Journal of the Acoustical Society of America, 76, 1346–1356. doi:10.1121/1.391450
Schmuckler, M. A. (2001). What is ecological validity? A dimensional analysis. Infancy, 2, 419–436.
Schneider, W., Eschman, A., & Zuccolotto, A. (2002). E-Prime 1.0 user’s guide. Pittsburgh, PA: Psychological Software Tools.
Sims-Williams, N., & Bailey, H. W. (Eds.). (2002). Indo-Iranian languages and peoples. Oxford, UK: Oxford University Press.
Tanenhaus, M. K., & Brown-Schmidt, S. (2007). Language processing in the natural world. Philosophical Transactions of the Royal Society B, 363, 1105–1122. doi:10.1098/rstb.2007.2162
Ververidis, D., & Kotropoulos, C. (2003, October). A state of the art review on emotional speech databases. Paper presented at the 1st Richmedia Conference, Lausanne, Switzerland.
Wallbott, H. G., & Scherer, K. R. (1986). How universal and specific is emotional experience? Evidence from 27 countries on five continents. Social Science Information, 25, 763–795.
Wilson, D., & Wharton, T. (2006). Relevance and prosody. Journal of Pragmatics, 38, 1559–1579.
Yu, F., Chang, E., Xu, Y., & Shum, H. Y. (2001, October). Emotion detection from speech to enrich multimedia content. Paper presented at the 2nd IEEE Pacific Rim Conference on Multimedia, London, United Kingdom.
