Expectations and speech intelligibility

Molly Babel a) and Jamie Russell
Department of Linguistics, University of British Columbia, Vancouver, British Columbia, Canada

(Received 18 July 2014; revised 1 April 2015; accepted 9 April 2015)

Socio-indexical cues and paralinguistic information are often beneficial to speech processing as this information assists listeners in parsing the speech stream. Associations that particular populations speak in a certain speech style can, however, make it such that socio-indexical cues have a cost. In this study, native speakers of Canadian English who identify as Chinese Canadian and White Canadian read sentences that were presented to listeners in noise. Half of the sentences were presented with a visual-prime in the form of a photo of the speaker and half were presented in control trials with fixation crosses. Sentences produced by Chinese Canadians showed an intelligibility cost in the face-prime condition, whereas sentences produced by White Canadians did not. In an accentedness rating task, listeners rated White Canadians as less accented in the face-prime trials, but Chinese Canadians showed no such change in perceived accentedness. These results suggest a misalignment between an expected and an observed speech signal for the face-prime trials, which indicates that social information about a speaker can trigger linguistic associations that come with processing benefits and costs. © 2015 Acoustical Society of America. [http://dx.doi.org/10.1121/1.4919317]

[CGC]

Pages: 2823–2833

I. INTRODUCTION

Spoken language has two major duties. One is to serve as the primary medium of linguistic communication. The second is to convey the signals often described as paralinguistic cues (Abercrombie, 1967). Paralinguistic information offers information about who is talking instead of what is being said. These indexical cues are rich, providing a listener with acoustic information signaling myriad traits about speakers’ identities. While such identity traits can be recognized at varying levels of accuracy, the fact that listeners can identify them at all indicates that non-linguistic associations are deciphered with the speech stream. Not only are these associations learned, but listeners also exploit their paralinguistic and sociolinguistic knowledge to parse the speech stream (Ladefoged and Broadbent, 1957). For example, listeners have been shown to use information related to perceived talker gender (Strand and Johnson, 1996; Johnson et al., 1999), perceived sexual orientation (Munson et al., 2006), perceived emotional state (Nygaard and Queen, 2008), perceived national identity (Niedzielski, 1999; Hay et al., 2006a), and perceived age (Drager, 2011) to recognize words and categorize phonemes. Even stuffed animals have been used as an effective association prompt to shift listeners’ perceptions of sounds (Hay and Drager, 2010). Such categorization involves applying either generalizations or observations made about the world (e.g., females tend to be smaller than males and, hence, the cutoff between hud and hood will be at a higher first formant frequency for female voices, or speakers from Australia produce fish like [fiʃ] and speakers of New Zealand English produce it as [fɘʃ]). Whether or not these generalizations are wholly accurate is another issue.

a) Electronic mail: [email protected]

J. Acoust. Soc. Am. 137 (5), May 2015

Spoken language processing extends, of course, beyond the identification of single speech sounds. In the context of sentence-length stimuli, American English listeners are more likely to perceive the sequence [mæs] as ⟨mast⟩ for an African American face and as ⟨mass⟩ for a White American face, indicating that listeners use the sociolinguistic information that African Americans show higher rates of t/d deletion to make predictions to parse the speech stream (Staum Casasanto, 2008). Listeners exploit indexical information available in the speech stream to make higher level linguistic and social judgments as well. Hearing a man’s voice say “I might be pregnant” generates an N400 response indicative of surprisal (Van Berkum et al., 2008) and hearing a natively accented Dutch speaker produce an ungrammatical utterance generates a larger P600 response, an indication of hearing a grammatical error, than hearing the identical utterance produced by a non-native Dutch speaker (Hanulíková et al., 2012). Listeners thus use available acoustic information in the signal to identify voices as, for example, belonging to males or females or native or non-native speakers and interpret their speech accordingly. Indeed, McGowan (2015) found that Mandarin-accented English was more intelligible when paired with an Asian face than with a Caucasian face, suggesting that listeners use stereotyped expectations to facilitate comprehension of spoken language. While indexical cues are often beneficial in speech processing, they may also have certain costs. One type of cost involves social evaluation. The speech stream inherently provides a multidimensional array of paralinguistic information that prompts social evaluations, and not all accents are deemed socially equivalent (Bourhis and Giles, 1977; Trudgill and Giles, 1983; Williams et al., 1999).
For instance, adult speakers of American English from Michigan consider southern varieties of American English to be less correct, but friendly, compared to the language varieties

0001-4966/2015/137(5)/2823/11/$30.00


Redistribution subject to ASA license or copyright; see http://acousticalsociety.org/content/terms. Download to IP: 128.83.63.180 On: Thu, 04 Jun 2015 15:08:18

used by Northern speakers in the United States (Preston, 1993, 1999). Such stereotypes are rampant in the media, and are already held by children aged 9–10 (Kinzler and DeJesus, 2013). These social categorizations made from the speech stream can result in cases of linguistic profiling (Baugh, 2000, 2003). A meta-analysis of speech evaluation research found that the effects of pejorative evaluations of non-standard dialects were especially strong in the context of American English (Fuertes et al., 2012). Another cost to indexical cues relates to intelligibility and perceived accentedness. In a seminal study, Rubin (1992) paired the photograph of a Chinese woman and a Caucasian woman with a voice of a native speaker of English from central Ohio. North American undergraduates rated the voice as more accented when it was paired with the Chinese woman’s face. This effect of perceived group membership affecting speech perception and evaluation has been referred to as reverse linguistic stereotyping (Kang and Rubin, 2009), and it is often attributed to listeners’ willful misunderstanding of the speech signal (Rubin, 1992; Lippi-Green, 1997; Kang and Rubin, 2009). In a replication of Rubin (1992), Kang and Rubin (2009) also explored what determines how a listener is affected by group membership manipulations. Like Rubin, they paired a natively accented American English voice with a photo, this time of a White or East Asian male. Listeners completed an accentedness rating task, a cloze task, and a teaching quality assessment for each guise. Kang and Rubin also used the speech evaluation instrument (Zahn and Hooper, 1985)—divided into superiority, social attractiveness, and dynamism dimensions—to quantify listeners’ attitudes toward the different guises. 
Kang and Rubin (2009) found that listeners who engaged in linguistic stereotyping on the social attractiveness dimension performed worse on the comprehension task, and those with linguistic stereotyping along the superiority dimension perceived the speech in the East Asian guise as more accented. Kang and Rubin found no effects for the Euro-American guise and, therefore, conclude that reverse linguistic stereotyping only occurs on voices perceived to be from non-native speakers, based on stereotypes about the visual guise presented.

Yi et al. (2013) recently suggested that a reduction in intelligibility for non-native speech stems from reduced efficiency in audio-visual (AV) integration for non-native accents. In an AV condition, listeners exhibited more AV enhancement for native-accented American English speakers than for Korean-accented American English speakers. The stronger listeners’ “Caucasian-American” and “Asian-foreign” biases, as measured by an implicit association task (IAT; Greenwald et al., 1998), which quantified associations of Caucasian faces with Americanness and Asian faces with foreignness, the better their processing of natively accented AV speech was compared to their processing of non-natively accented AV speech. An independent group of listeners rated the accentedness of the same speakers in both audio-only and AV conditions. The Korean-accented speakers were rated as more accented in the AV condition, while native-accented speakers were rated as less accented. These results are perhaps unsurprising when viewed in light of Devos and Banaji (2005). Their set of experiments shows that although Asian Americans are consciously viewed as being equally American as White Americans in the social context of the United States, implicit attitudes suggest “White = American” and “Asian = foreign” associations. Devos and Banaji (2005, p. 448) assert that such subconscious attitudes arise through exposure as “implicit associations reflect the knowledge an individual has acquired through repeated personal experiences within a particular cultural context.”

The studies cited above, which associate non-native accents with Asian group membership, were all conducted in contexts where Asians are a minority. Social associations between speech and ethnicity are examined in this study within the particular cultural context of the multicultural and multilingual urban area of Vancouver, British Columbia. Canada is internationally recognized as one of the most ethnoculturally and linguistically diverse countries in the world. In 2011, more than 20% of the country’s population was foreign born, which is the highest proportion out of all of the G8 countries (Statistics Canada, 2011). Vancouver, British Columbia is one of the top destinations for immigrants arriving in Canada: 40% of the Metro Vancouver population is comprised of immigrants (Statistics Canada, 2011). The majority of both recent and established immigrants in Vancouver come from China (Statistics Canada, 2014b), and Chinese, broadly including Cantonese, Mandarin, and other Chinese languages, is the most widely spoken non-official language in both Metro Vancouver (Statistics Canada, 2014b) and in Canada (Statistics Canada, 2011). In Richmond, a coastal city in the Metro Vancouver regional district, immigrants account for 60% of the population, the highest proportion in Canada (Statistics Canada, 2014a). Over half of the immigrants in Richmond were born in either mainland China or Hong Kong, and over half of all immigrants report speaking a variety of Chinese at home (Statistics Canada, 2014a).
Richmond also has the highest percentage of residents who report Chinese as their first language (41%) compared to 37% for English (Statistics Canada, 2014c). This can be compared with 15% of Vancouverites who report Chinese as their first language and 56% whose first language is reported as English (Policy Planning Division, City of Richmond, 2014). Thus, the Metro Vancouver area, and Richmond in particular, is a region with a large population of non-native speakers of English who live and work alongside native speakers of the local varieties of English from a range of ethnic heritages. The speakers in this study were recruited from the city of Richmond, which allows us to play into listeners’ expectations about Richmondites of Chinese descent, i.e., that their first language (L1) will be a language other than Canadian English and, further, that they will possess some degree of non-native accent. There is no research to suggest that Asian Canadians or Chinese Canadians, more specifically, have a unique ethnolect such that the expectation is for a distinct L1 variety (and this issue for the current set of speakers is addressed in an experiment below). All of the listeners came from the University of British Columbia (UBC) community in Vancouver, which, in terms of students, has a similar demographic makeup to Vancouver, Richmond, and Metro Vancouver as a whole. According to a survey of first-year UBC students, the predominant ethnic makeup of both domestic and international students is Chinese (39%), and Cantonese and Mandarin account for 30% of surveyed first-years’ native languages (UBC Planning and Institutional Research, 2012).

Listeners’ predictions about a potential speaker’s language background are complex and will often be wrong. Based on stereotyped assumptions about nativeness (Devos and Banaji, 2005), listeners often expect that individuals of Asian descent will be non-native speakers of English. We are interested in when listeners erroneously make this prediction. We hypothesize that Chinese Canadians will be less intelligible than White Canadians when listeners are made aware of the ethnic background of the speakers. To test this, we use a type of cross-modal priming task where auditory presentation of sentences is preceded by either an image of the speaker or a control trial with a series of fixation crosses. This design is similar to that of Yi et al. (2013), but there are several key differences. For one, this experiment is situated in a socio-cultural context where 39% of the population identified as Chinese, while 35% identified as White (UBC Planning and Institutional Research, 2012). This is a different demographic landscape compared to other regions where similar studies have been done (e.g., Rubin, 1992; Yi et al., 2013). We also use a larger set of speakers in our study, all of whom identify as native speakers of English, with half identifying as Chinese Canadian and half identifying as White Canadian. Additionally, we use still images as primes as opposed to synced AV clips of the speakers to test whether a change in perceived intelligibility and perceived accentedness can simply be primed by a visual image of each talker, as opposed to online issues in AV synching.
In some ways, this design is a replication of the work by Rubin (1992) and Kang and Rubin (2009); we adapt their paradigm by using a within-listener design that pairs individuals’ actual faces with their voices. We predict that the Chinese Canadian speakers will be less intelligible than the White Canadian speakers in the audio + face trials, but that no such difference will hold in the audio-only trials, given that these are all native speakers of the local variety of Canadian English. The mechanism behind the predicted audio + face cost in the Chinese Canadian trials will be examined with three different listener-based measures. If the audio + face cost for the Chinese Canadian speakers is due to a decrease in perceptual effort resulting from stereotype bias, we anticipate a relationship between the cost and listeners’ measures of implicit bias, as in Yi et al. (2013), or, alternatively, a relationship with their explicit bias. If the audio + face cost is due to misalignment of the expected non-native signal (as prompted by the face prime) with the observed natively accented signal, we then anticipate that this effect will be stronger for those who participate more in the multicultural language dynamics of the Metro Vancouver area. We quantify this participation through a self-assessed social network measure where each listener indicated whether they spend more time with Asian Canadians or White Canadians.

II. MATERIALS AND METHODS

A. Stimuli

1. Auditory stimuli

A randomized list of 120 sentences drawn from the Bamford, Kowal, and Bench (BKB) sentence lists (Bench and Bamford, 1979) was recorded by 12 self-identified native speakers of Canadian English born and raised in Richmond, British Columbia. Six of the speakers self-identified as ethnically White, and six as Chinese. There was an even number of males (3) and females (3) in each group, and they ranged in age from 17 to 25 years old (M = 21.5 yr). Three of the Chinese Canadians spoke both Cantonese and English at home, and two of the White Canadians spoke a language in addition to English (Dutch and Greek) at home as well. Each target sentence was presented and recorded through E-Prime 2.0 software (Psychology Software Tools, Pittsburgh, PA) at a sampling rate of 22 kHz using a Sound Devices USB PreAMP (Reedsburg, WI) and an AKG C520 head-mounted microphone (Vienna, Austria). All recordings were made in a sound-attenuated cubicle at UBC. Elicited sentences were saved as individual sound files. Silence was trimmed from the ends of the files, and they were peak-amplitude normalized. A subset of sentences in this format was used in the accentedness rating task, which is described below. For the sentence transcription task, however, all sound files were embedded in pink noise at a 3 dB signal-to-noise ratio with a 500 ms buffer of noise at the beginning and end of each utterance.

2. Visual stimuli (photographs)

All 12 speakers were photographed against a white background with neutral faces using an Olympus FE-220 (Tokyo, Japan) digital camera. These images were manipulated to black and white using Picasa picture-editing software (Mountain View, CA) and cropped to 580 × 850 pixels.

3. IAT stimuli

An IAT (Greenwald et al., 1998) was constructed to quantify Asian–White bias using surnames and positive (e.g., vacation) or negative (e.g., death) lexical items. Forty of the 100 most common surnames in Vancouver (Skelton, 2007) were initially selected. A group of raters (N = 21) rated the 40 names for their perceived ethnicity; 15 unanimously “White” and 15 unanimously “Asian” surnames were ultimately selected as stimuli for the IAT. The IAT also included 15 semantically pleasant words and 15 unpleasant words, which were taken from a previous experiment (Babel, 2012).

4. Explicit measure of ethnic bias

The stereotypes questionnaire (Table I) was constructed to gauge consciously held attitudes about White Canadians and Asian Canadians. Participants rated their agreement with each stereotype on a seven-point Likert scale. The wording of the questions was counterbalanced with respect to response endpoint.


TABLE I. Questions asked to gauge the extent to which participants believe Asian Canadian stereotypes.

Expected responses if one holds stereotyped views about Asian Canadians

Strongly Disagree (1)
Asian Canadians speak English as well as White Canadians.
White Canadians are involved in more car accidents than Asian Canadians.
White Canadian males are less masculine than Asian Canadian males.
White Canadians are less athletic than Asian Canadians.
White Canadian parents are more strict with their children than Asian Canadian parents.

B. Participants

Forty self-identified native speakers of English between the ages of 18 and 41 (M = 23, SD = 4.7) were recruited from the UBC community. All received monetary compensation ($10 Canadian) for their time, and none reported any speech, hearing, or language disorders. Participants self-reported their ethnic identities and came from a range of backgrounds (Asian = 14, Asian and Pacific Islander = 2, Asian and White = 4, Black = 1, Pacific Islander = 1, South Asian = 3, White = 15).

C. Procedure

Participants were seated at individual personal computer workstations in a sound-attenuated booth for the duration of the experiment. All listeners wore AKG K240 headphones, and logged responses using either the keyboard or a serial response box, depending on the task. They were casually informed that all the speakers they would hear were from Richmond, BC and were given instructions for each task verbally prior to the start of the first component. The experiment took 45 min to complete. The experiment was run using E-Prime 2.0 software (Psychology Software Tools, Pittsburgh, PA).

1. Speech perception in noise

The first part of the experiment involved transcribing sentences embedded in noise. Each participant heard 120 sentences total, with 10 sentences for each of the 12 speakers with no repetitions. Each set of ten sentences came from a single BKB list (Bench and Bamford, 1979). Half of each speaker’s sentences (N = 5) were presented with their black and white photograph, while the other half were presented with a set of three fixation crosses (+++). Four separate counterbalanced lists were designed to vary the exact sentences in the visual prime and audio-only trials, thereby creating four versions of the experiment. The visual primes and fixation crosses were presented 2000 ms before the audio began. The audio-only and audio + face trials were presented in a mixed block of both trial types. This was done to further increase the difficulty of the task. As soon as the audio began, a box appeared for listeners to type their responses. Each audio file began and ended with 500 ms of pink noise, and the presentation order of trials was fully randomized.
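The trial structure described above (12 speakers, 10 sentences each, half face-primed and half with fixation crosses, four list versions, random order) can be illustrated with a short sketch. The experiment itself was run in E-Prime, and the paper does not state how the four lists divided the sentences, so the splitting rule below is a hypothetical one chosen only to give four complementary versions; the function name `build_trials` and the speaker labels are likewise illustrative.

```python
import random


def build_trials(speakers, sentences_per_speaker=10, version=0, seed=0):
    """Return a shuffled list of (speaker, sentence_index, prime) trials.

    Hypothetical counterbalancing (an assumption, not the authors' scheme):
    versions 0/1 split each speaker's sentences into first vs. second half,
    versions 2/3 split them into even vs. odd positions; within each pair
    the assignment of face prime and fixation crosses is swapped.
    """
    if version not in range(4):
        raise ValueError("version must be 0, 1, 2, or 3")
    rng = random.Random(seed)
    trials = []
    for speaker in speakers:
        indices = list(range(sentences_per_speaker))
        if version < 2:
            group_a = set(indices[: sentences_per_speaker // 2])
        else:
            group_a = set(indices[::2])
        face_when_in_a = version % 2 == 0
        for i in indices:
            face = (i in group_a) == face_when_in_a
            trials.append((speaker, i, "face" if face else "fixation"))
    # Mixed block: both trial types are interleaved in random order.
    rng.shuffle(trials)
    return trials


if __name__ == "__main__":
    speakers = [f"S{k:02d}" for k in range(1, 13)]  # placeholder labels
    trials = build_trials(speakers, version=0)
    print(len(trials), trials[:3])
```

Under this scheme every speaker contributes five face-primed and five fixation trials in each version, and every sentence appears with both prime types across versions.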


Strongly Agree (7)
It is true that Asian Canadians are better at math than White Canadians.
Asian Canadian females are less confrontational than White Canadian females.
White Canadian students experience less family pressure regarding academic performance than Asian Canadians.
Asian Canadians are more likely to excel in business versus trade professions than White Canadians.
Asian Canadians are better at playing classical instruments than White Canadians.

Pressing the enter key moved listeners on to the next sentence. Breaks were programmed to appear every 40 trials.

2. Accentedness rating

Following sentence transcription, participants were asked to rate the accentedness of each of the speakers. Two sentences were randomly selected from each speaker’s stimuli list: one with the visual face prime and the other with three fixation crosses. These were presented in a mixed and fully randomized block of both trial types. These sentences were presented in the clear. Following Yi et al. (2013) and Smiljanić and Bradlow (2011), participants were asked to rate the accentedness of the speaker using a Likert scale from 1 (“no foreign accent”) to 9 (“very strong foreign accent”). This scale was provided at the bottom of each screen during the presentation of the sound files. Responses were entered using the number pad on the keyboard.

3. Implicit measure of ethnic bias: IAT

The IAT (Greenwald et al., 1998) involves five blocks of speeded categorization. Participants responded using a serial response box; the left-most button (1) was used to categorize items belonging to the left category on the screen and the right-most button (5) was used to categorize items belonging to the right category. Immediately after categorizing an item, participants were given feedback as to their response time and accuracy. Correct responses were indicated in blue and incorrect responses in red. If no response was detected after 3 s, the screen flashed white and indicated that no response was detected. It then continued on to the next trial. The first block is target-concept discrimination. The targets, Asian and White, are presented in opposite top corners of the monitor. The surnames to be categorized (e.g., Wong, Smith) were randomly presented in the center of the screen, and participants were asked to categorize them as Asian or White as quickly as possible. The second block is an associated attribute discrimination. The attributes, pleasant and unpleasant, replace Asian and White in the top corners of the monitor. Words with either pleasant or unpleasant connotations (e.g., vacation, disease) are randomly presented in the middle of the screen for categorization. The third block is a combined test; both Asian and White, as well as pleasant and unpleasant, labels are presented in the upper left and right corners of the screen. The trials to be categorized thus include both the randomized lists of surnames from Block 1 and attribute words from Block 2. The fourth block was the same as the first block, except the categories Asian and White change sides of the screen (e.g., if participants were presented with Asian on the left side of the screen in Block 1, it was on the right in Block 4). The fifth block was another combined test block. As in Block 4, the target concepts (Asian and White) were reversed, but the attribute labels pleasant and unpleasant remained on the sides they were on in Block 3. Within each of the four versions of this experiment, participants were counterbalanced so that half (N = 5) were first presented with Asian and pleasant, and the other half was first presented with White and pleasant.

4. Explicit measure of ethnic bias

Participants were next asked to indicate how strongly they agreed with a series of ten randomized statements specifically investigating common stereotypes about or issues affecting Asian Canadians and White Canadians. Subjects were instructed to bring to mind individuals born in Canada when deciding how much they agreed with each statement (as in Devos and Banaji, 2005) in an attempt to differentiate between recent immigrants and Canadian-born citizens. Participants were asked to respond as honestly as possible, and assured that all responses were anonymous. Each statement was presented in the top center of the computer monitor. The rating scale was presented at the bottom. This Likert scale ranged from 1 (strongly disagree) to 7 (strongly agree) and intermediate points were labeled as disagree (2), somewhat disagree (3), neither agree nor disagree (4), somewhat agree (5), and agree (6). A small box appeared in the center of the screen for participants to respond in. Answers were logged using the number pad on the keyboard, and the next question was automatically presented. Half of the statements were designed so if a participant held a particular stereotype they would be predicted to respond with strongly agree; the other half would be expected to elicit a strongly disagree.
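Because half of the statements are keyed toward "strongly agree" and half toward "strongly disagree," responses have to be put on a common direction before they can be combined. The sketch below shows one conventional way to do this (reverse-keying on the seven-point scale and averaging). This passage does not say how the questionnaire was scored, so the averaging rule, the function name `stereotype_score`, and the item labels are assumptions made purely for illustration.

```python
def stereotype_score(responses, reverse_keyed, scale_max=7):
    """Mean stereotype endorsement on a 1..scale_max Likert scale.

    responses: dict mapping an item label to the rating given (1..scale_max).
    reverse_keyed: labels of items where "strongly disagree" is the
        stereotype-consistent answer; these are flipped (8 - rating on a
        seven-point scale) so that higher always means more endorsement.
    Assumed scoring rule: the unweighted mean of the aligned ratings.
    """
    if not responses:
        raise ValueError("no responses given")
    total = 0
    for item, rating in responses.items():
        if not 1 <= rating <= scale_max:
            raise ValueError(f"rating out of range for {item!r}: {rating}")
        aligned = (scale_max + 1 - rating) if item in reverse_keyed else rating
        total += aligned
    return total / len(responses)


if __name__ == "__main__":
    # Hypothetical item labels; "q2" is keyed toward "strongly disagree".
    example = {"q1": 6, "q2": 2, "q3": 4}
    print(stereotype_score(example, reverse_keyed={"q2"}))
```

With this alignment, a listener who answers 4 (neither agree nor disagree) throughout receives the scale midpoint regardless of how the items are keyed.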

5. Social network self-assessment

Participants were asked to describe the predominant ethnic composition of their social group, including friends, family, and co-workers, with one of two options: White Canadian or Asian Canadian.

III. RESULTS

A. Sentence transcription

Listeners’ responses were normalized for spelling errors and punctuation was removed. Spelling errors were identified by running a spell-checker and hand checking each response. Errors were not corrected generously. Examples of corrected words include ambulence being changed to ambulance and litle being changed to little. The proportion of words correctly transcribed was scored for each utterance using a word-matching script. Function and content words were both assessed for transcription accuracy.

We conducted two analyses of the effect of alignment of expected phonetic forms on the speech perception in noise (SPIN) task. For the first analysis, the proportions correct were normalized to rationalized arcsine units (RAUs) following the formulas in Studebaker (1985). The RAUs were used as the dependent measure in a hierarchical linear regression model with RAU scores calculated per sentence. Contrast coding was used on the independent variables Condition (audio + face and audio-only) and Talker Ethnicity (Chinese Canadian, White Canadian). There were random intercepts for listeners and random slopes for Condition and Talker Ethnicity for individual listeners. This was the maximal random effects structure that would converge, likely due to a number of responses with no correct words identified. The model returned a significant intercept (B = 21.62, SE = 1.53, t = 14.13). There were main effects of Condition (B = 1.23, SE = 0.6, t = 2.05) and Talker Ethnicity (B = 2.9, SE = 0.6, t = 4.8), indicating that performance decreased in the audio + face condition and that intelligibility was worse for the Chinese Canadian speakers. There was also an interaction between Condition and Ethnicity (B = 1.5, SE = 0.6, t = 2.6). These results are shown in Fig. 1, but plotted with percent correct. Chinese Canadian voices were overall less intelligible than White Canadian voices, but the extent of this difference was further affected by condition. Chinese Canadian and White Canadian voices were more similar in intelligibility in the audio-only conditions, but intelligibility for the Chinese Canadian voices dropped when listeners were presented with a static visual prime of the speaker’s face 2000 ms before the presentation of the audio. There was no such drop in intelligibility for the White Canadian speakers.

The second analysis used the formula traditionally used to calculate AV enhancement, (AV − A0)/(1 − A0) (Yi et al., 2013), on the proportion correct transcribed words, adapted for our audio + face and audio-only conditions for the White Canadian and Chinese Canadian conditions. When exposed to Chinese Canadian faces, listeners showed a negative cost (M = 0.06, SD = 0.09) that was stronger than that for the White Canadians [M = 0.0009, SD = 0.10; t(39) = 31.14, p < 0.01]. The cost in the presence of White Canadian faces was not significantly different from 0 [t(39) = 0.06, p = ns], showing no benefit or cost to the static image.

FIG. 1. Interaction between Condition and Talker Ethnicity for intelligibility. Chinese Canadian speakers were less intelligible when listeners were presented with a photo of the speaker. Error bars show standard error.

B. Accentedness ratings

Accentedness ratings were positively skewed, which is expected given that all of the speakers were locally accented native speakers of English. A histogram showing the accentedness ratings is shown in Fig. 2. As can be seen from the distributions, Asian Canadian voices (M = 2.85, SD = 2.16) were rated as more accented than White Canadian voices (M = 2.2, SD = 1.75). To standardize the ratings, accentedness ratings were log-transformed and used as the dependent measure in a hierarchical linear regression model.1 Contrast coding was used on the independent variables Condition (audio + face, audio) and Talker Ethnicity (Chinese Canadian, White Canadian). There was a significant intercept (B = 0.66, SE = 0.06, t = 10.4). There was a main effect of Talker Ethnicity (B = 0.11, SE = 0.02, t = 5.2), indicating that Chinese Canadian voices were rated as more accented than White Canadian voices. There was also a Condition by Talker Ethnicity interaction (B = 0.05, SE = 0.02, t = 2.58); this interaction is shown in Fig. 3. White Canadian speakers were rated as having less of a foreign accent in the audio + face condition compared to the ratings for these voices in the audio-only condition. Ratings for Chinese Canadian speakers increased slightly in the audio + face condition, compared to the audio-only condition. We conducted an additional analysis on the accentedness ratings that parallels the AV enhancement calculation: (Accent[audio + face] − Accent[audio])/(1 − Accent[audio]) for both the Chinese Canadian and White Canadian trials. For the participants who responded 1 in the audio-only trials or in both audio + face and audio-only trials, 0.01 was added to their rating so as to not have 0 in the numerator or the denominator. There were greater negative changes for the Chinese Canadian speakers (M = 7.25, SD = 41.03) than for the White Canadian speakers (M = 0.06, SD = 1.13). This measure was highly variable and the difference was not significant [t(39) = 1.1, p = ns].
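Two derived scores recur in these analyses: the rationalized arcsine transform applied to transcription accuracy and the enhancement-style ratio applied to both transcription accuracy and accentedness ratings. A minimal sketch of each follows. The RAU function uses the standard Studebaker (1985) formula on counts of correct words, which is an assumption about how the per-sentence proportions were supplied; the names `rau` and `relative_change` are illustrative. The text does not specify exactly how the 0.01 adjustment was applied to ratings of 1, so the sketch does not attempt it and simply rejects a baseline of exactly 1.

```python
import math


def rau(correct, total):
    """Rationalized arcsine units for `correct` items out of `total`.

    theta = asin(sqrt(X / (N + 1))) + asin(sqrt((X + 1) / (N + 1)))
    RAU   = (146 / pi) * theta - 23
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    theta = (math.asin(math.sqrt(correct / (total + 1)))
             + math.asin(math.sqrt((correct + 1) / (total + 1))))
    return (146.0 / math.pi) * theta - 23.0


def relative_change(primed, baseline):
    """Enhancement-style score: (primed - baseline) / (1 - baseline).

    For transcription, `primed` is the audio + face proportion correct and
    `baseline` the audio-only proportion; negative values indicate a cost.
    """
    if baseline == 1:
        raise ZeroDivisionError("baseline of 1 leaves a zero denominator")
    return (primed - baseline) / (1 - baseline)


if __name__ == "__main__":
    print(rau(3, 5))                   # RAU for 3 of 5 words correct
    print(relative_change(0.5, 0.6))   # cost when accuracy drops 0.6 -> 0.5
```

The ratio expresses the change relative to the distance between the baseline and 1, which helps explain why scores become unstable when the baseline sits close to 1, as with the accentedness ratings above.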
We also tested for a correlation between listeners’ SPIN scores and accentedness ratings using the audio + face cost (i.e., the AV enhancement calculation) scores from the SPIN task and the accentedness rating task. There was no relationship between these variables for the White Canadian voices [r(38) = 0.005, p = ns] or the Chinese Canadian voices [r(38) = 0.051, p = ns].

FIG. 2. Histogram of accentedness ratings.

FIG. 3. The interaction of Condition and Talker Ethnicity for accentedness ratings. Error bars show standard error.

J. Acoust. Soc. Am., Vol. 137, No. 5, May 2015

M. Babel and J. Russell: Expectations and intelligibility

Redistribution subject to ASA license or copyright; see http://acousticalsociety.org/content/terms. Download to IP: 128.83.63.180 On: Thu, 04 Jun 2015 15:08:18

C. Listener measures

The three measures of individual listeners assess implicit attitudes (IAT), explicit attitudes (stereotype questionnaire), and self-assessed experiences (social network self-assessment). We analyze these in relation to the measures’ abilities to predict the audio + face cost for the SPIN task and the audio + face effects in accentedness ratings.

1. IAT

The IAT task was scored following Greenwald et al. (2003). Figure 4 presents a histogram of participants’ scores. A negative score indicates a positive bias toward Asian Canadians, where response times were faster when Asian surnames were paired with pleasant words, while a positive score indicates a positive bias toward White Canadians, where response times were faster when White surnames were paired with pleasant words. Participants’ IAT scores spanned a range of values and, overall, there was a positive bias toward White Canadians (M = 0.39, SD = 0.74), which was significantly different from 0 [t(39) = 3.3, p < 0.01]. Participants’ IAT scores, however, were not correlated with the audio + face cost for Asian Canadian speakers [r(38) = 0.19, p = ns] or with an equivalent change in accent for the audio + face condition for Asian Canadian speakers [r(38) = 0.19, p = ns].

FIG. 4. A histogram of participants’ IAT scores. A negative score indicates a positive bias toward Asian Canadians and a positive score indicates a positive bias toward White Canadians. A score of 0 would indicate unbiased performance on the task.

2. Explicit stereotype questionnaire

Responses to the stereotype questionnaire were scored such that complete agreement with the Asian Canadian stereotype would result in an average of 7, whereas complete disagreement with the stereotype would be 1. The distribution of responses is shown in Fig. 5. Participants’ stereotype scores (M = 4.59, SD = 1.6) were not correlated with the audio + face cost for Asian Canadians [r(38) = 0.22, p = ns]. They were also not correlated with the accent change for Asian Canadian voices [r(38) = 0.02, p = ns].

3. Social network self-assessment

Nineteen participants reported they spent more time with Asian Canadians and 21 participants reported spending more time with White Canadians. Participants’ responses as to whether they spent more time interacting with Asian Canadians or White Canadians affected their audio + face cost for Asian Canadians in the SPIN task [t(38) = 1.9, p < 0.05, Cohen’s d = 0.58]. Participants who identified more Asian Canadians in their social networks had a larger audio + face cost (M = 0.09, SD = 0.08) than those who identified White Canadians as a larger part of their social networks (M = 0.03, SD = 0.1). There was no effect of social network on the change in accentedness ratings.

D. Interim discussion

Groups of Chinese Canadians and White Canadians who were all native speakers of the local variety of Canadian English were found to be similarly intelligible in a SPIN task in an audio-only condition. Speech intelligibility dropped for the Chinese Canadian voices in an audio + face condition where static digital photos of the speakers preceded the audio signal. The same voices were presented to the same group of listeners in the clear for an accentedness rating task. The voices of the White Canadians were rated as less accented in audio + face trials. There was no relationship between the listeners’ performance in the SPIN task and the accentedness ratings. Neither implicit nor explicit measures of racial attitudes predicted listener performance, but listeners’ self-assessed social networks did. Listeners who reported spending more time with Asian Canadians performed worse on the SPIN task with the Chinese Canadian speakers in the audio + face condition.

Asian Canadian voices were less intelligible overall and were rated as more accented compared to White Canadian voices. While there is little to no evidence of a new variety of Asian English in North America (Bucholtz, 2004; Wong and Hall-Lew, 2014), one of the main arguments against such a new variety relates to the heterogeneous nature of Asian American communities. Given that the majority of the Asian population in the Vancouver Metro area is Chinese Canadian, such an argument is unsatisfactory. We therefore conducted a simple study to assess whether listeners could accurately identify the ethnicity of the voices used in the experiments above.

IV. IDENTIFICATION OF TALKER ETHNICITY

A. Methods

1. Stimuli

The same sentence stimuli used in the SPIN and accentedness rating tasks were used in this task.

2. Participants

Fifteen listeners participated in this task and were compensated with either $5 or partial course credit. All listeners identified as native speakers of English and had no speech, language, or hearing disorders.

3. Procedure

Listeners were presented with 10 sentences from the 12 talkers and asked to identify the ethnicity of each voice using the labels “Asian Canadian,” “White Canadian,” and “Other.” The sentence list was fully randomized for each listener.

B. Results

FIG. 5. Histogram showing average scores on the Stereotype Questionnaire. The scale ranged from 1 to 7 with 7 indicating agreeing with the stereotype on all questions.

Accuracy in this task is computed in reference to a talker’s self-identified ethnicity. Overall accuracy was low at 54%, although listeners were generally more accurate for White Canadian voices (64%) than Asian Canadian voices (44%). A generalized linear effects model was run to predict listener accuracy. Talker Ethnicity was used as a fixed effect and there was a random by-listener slope for Talker Ethnicity and random intercepts for talker and sentence. There were no significant effects. Table II shows the counts of the three response options for Asian Canadian and White Canadian voices. Listeners responded White Canadian most often for all voices, indicating a bias to assume all voices were more likely to be produced by White talkers. To confirm these results, we also calculated listeners’ accuracy for Chinese Canadian and White Canadian voices. Listeners performed no better than chance with the Chinese Canadian voices [t(14) = 1.6, p = 0.13] and better than chance with the White Canadian voices [t(14) = 4.01, p < 0.01], further supporting the interpretation of a White Canadian response bias.

TABLE II. Response counts for the Talker Ethnicity Identification task. The cells total 1786 as there were 14 null responses.

                         Asian Canadian   White Canadian   Other
                         responses        responses        responses
Asian Canadian voices    395              413              89
White Canadian voices    171              572              146

C. Discussion
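As a quick check before interpreting these results, the accuracy figures reported above can be recomputed from the response counts in Table II. The Python sketch below is only an illustration: the variable and function names are hypothetical, and it assumes that an "Asian Canadian" response counts as correct for the Chinese Canadian talkers, in line with accuracy being defined relative to self-identified ethnicity.

```python
# Response counts copied from Table II (outer keys: talker group, inner keys: response label).
counts = {
    "Asian Canadian": {"Asian Canadian": 395, "White Canadian": 413, "Other": 89},
    "White Canadian": {"Asian Canadian": 171, "White Canadian": 572, "Other": 146},
}


def accuracy(voice_group: str) -> float:
    """Proportion of responses that match the talkers' self-identified group."""
    row = counts[voice_group]
    return row[voice_group] / sum(row.values())


total = sum(sum(row.values()) for row in counts.values())
correct = sum(counts[group][group] for group in counts)
white_responses = sum(row["White Canadian"] for row in counts.values())

print(f"Total responses: {total}")
print(f"Asian Canadian voices: {accuracy('Asian Canadian'):.0%}")
print(f"White Canadian voices: {accuracy('White Canadian'):.0%}")
print(f"Overall accuracy: {correct / total:.0%}")
print(f"Share of 'White Canadian' responses: {white_responses / total:.0%}")
```

With these counts the sketch gives 44% and 64% accuracy for the two talker groups and 54% overall, matching the reported values, and "White Canadian" accounts for just over half of all responses.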

Listeners’ performance on an ethnicity identification task using the voices from the SPIN and accentedness rating tasks does not suggest that listeners can robustly or reliably identify talker ethnicity from these speakers’ voices alone. This suggests that listeners’ biggest cue to the ethnic identity of each talker was delivered by the photographs.

V. GENERAL DISCUSSION

Listeners show a drop in intelligibility for Chinese Canadian voices when voices are paired with static images of the speakers, and listeners show a decrease in perceived accentedness of White Canadian voices when their photos are presented. A separate group of listeners was not reliably able to identify the self-reported ethnicity of the set of voices, indicating that the most influential cue to talker ethnicity came from the photo prompts. Two theories provide a mechanism for these effects.

One perspective is that the loss of intelligibility associated with a perceived-Asian background stems from lack of effort on the part of the listener (Rubin, 1992; Lippi-Green, 1997; Kang and Rubin, 2009). Yi et al. (2013) muster some support for this perspective. They found that individuals with stronger Asian = foreign associations exhibited less of an AV boost for the Korean-accented voices. This effect could be due to either a familiarity effect (i.e., individuals with stronger White = American associations might have less experience processing the speech of non-White English speakers) or due to a decrease in perceptual effort given to processing the Korean-accented speech. Sumner and Kataoka (2013) also provide evidence for an attention-based explanation, suggesting that less attention to non-standard New York City–accented pronunciations contributed to higher false memory rates for New York City speakers compared to general American and southern standard British English speakers. There is evidence from the face-processing literature that might support such an argument in the context of speech perception: own-group biases in face recognition suggest that face memory is affected both by experience and, crucially, effort in the encoding process (Meissner and Brigham, 2001; Bernstein et al., 2007; Van Bavel and Cunningham, 2012; Young et al., 2010). Listeners in this task are, perhaps, not putting as much effort into unpacking the speech signal from the noise for the Chinese Canadian voices. The results here do not rule out the possibility that a socially selective attention mechanism where listeners work less hard to decode the speech from noise in the Chinese Canadian audio + face trials is at work, but these data and others also support an alternative interpretation.

McGowan (2015) provides evidence for another mechanism, grounded in listeners’ expectations of a signal and how it matches with the observed speech patterns. McGowan presented listeners with Mandarin-accented speech in babble, pairing the speech with a photograph of an Asian female, a White female, or a silhouette. Listeners recognized more words when the speech was presented with the photo of the Asian female, strongly suggesting that this primed listeners to expect a certain type of accent, which improved their performance on the task. Similarly, the current results suggest that the cost of audio + face processing for Chinese Canadians is due to a misalignment of an anticipated non-native accent and an observed local accent. The presentation of a Chinese Canadian face provides the expectation of an upcoming accented signal, but this does not match the observed locally accented signal, resulting in a representational mismatch with a loss of intelligibility as a cost. This is contrasted with the roughly equivalent performance on Chinese Canadian and White Canadian voices in the audio-only trials.
While our measures of listener attitudes may not have tapped the appropriate bias, the finding that neither implicit nor explicit racial attitudes predicted listener performance indicates that a result solely rooted in a lack of listener effort in the Chinese Canadian audio + face trials is unlikely. In this study, listeners with more Asian Canadians in their social networks showed a larger effect. While this social network measure is grossly oversimplified, lumping together different social groups (friends, parents, co-workers) and not distinguishing between native and non-native English-speaking individuals, it does offer potential insight into these findings. Those who reported a stronger Asian Canadian social network (M = 0.0005, SD = 0.62) had a nearly unbiased score on the IAT task, while those with a stronger White Canadian social network (M = 0.74, SD = 0.67) had a clear positive bias toward White Canadians, which was significantly different from the Asian Canadian network IAT score [t(38) = 3.62, p < 0.001]. Given these implicit biases, it seems unlikely that those with stronger positive White Canadian biases would show less of a loss of intelligibility if a decrease in perceptual effort or attention were the underlying cause.


Rather, we posit that this is due to stereotyped associations learned through interaction with Asian Canadians in the Metro Vancouver area. This is not to say that all Asian Canadians in the area are second language (L2) speakers of English (this is hardly the case), but that this is an example of a small difference in the population catapulting into a stereotype that listeners then exploit in speech perception (see also Babel and McGuire, 2015). This falls in line with another argument provided by Sumner and Kataoka (2013): that different perceptual experiences may be encoded with different social weightings. To be clear, these are simply examples of speech perception and phonetic memory being warped interpretations of reality.

Drager (2011) made a similar argument regarding how New Zealand English-speaking listeners use perceived speaker age to parse speech sounds. Drager found that listeners perceived more tokens as members of the TRAP lexical set (Wells, 1982) on a bed-bad continuum when the apparent age of the speaker was younger, as determined by a photo prompt. This effect was found for older listeners, but not younger listeners. Drager reasons that this is because older listeners, having observed a change in progress over the years, associate the socio-phonetic variants with particular age groups more strongly than younger listeners. Drager models this result within the context of an exemplar model. Based on their amassed perceptual experiences, older listeners have a stronger weighting on the association between speaker age and realization of the DRESS and TRAP vowels. Listeners’ expectations of what speakers sound like are based on expectations according to the social categories of the speakers. Similarly, in an investigation on the perceived NEAR/SQUARE merger in New Zealand English, Hay et al.
(2006b) found that those who maintained a contrast themselves were better at accurately distinguishing the merger-in-progress when the perceived age of the speakers was older. The visual prompt of a static photo triggers expectations of what a speaker should sound like based on the information provided from a picture. In this case, there is the assumption that Chinese Canadians will have a non-local or non-native accent.

White Canadians were rated as less accented in the audio + face trials, similar to what was reported in Yi et al. (2013). Chinese Canadian speakers suffered a cost in the audio + face trials with respect to intelligibility and the White Canadians benefited from the audio + face trials in the accentedness rating task. The presentation of the White Canadian speakers’ faces may have allowed listeners to better parse and associate the variability inherently present in the speech stream. Listeners’ perceptions of accentedness are not exclusively related to the content of the speech stream. For example, Levi et al. (2007) found a decrease in accentedness ratings for native speakers and an increase in ratings for non-native speakers when a word’s orthographic form was presented concurrently with the auditory signal. Thus, listeners’ ability to parse phonetic variability is partially determined by non-acoustic factors. These results run counter to the claims made by Kang and Rubin (2009) that linguistic expectations based on apparent group membership of the speaker only affect the speech of apparent non-native speakers (understood in their study simply to be non-White individuals). Reported perceptual experiences change for both White Canadians and Asian Canadians in this study, which is to be expected if changes in perception are moderated by changes in expectations and predictions (Clark, 2013). The results of Rubin and colleagues are also compatible with this framework: perceptual judgments change as a result of expected accentedness as prompted by photos.

The lack of a relationship between intelligibility and perceived accentedness for our speakers suggests a disconnect between linguistic measures like intelligibility and indexical judgments like global accentedness ratings. This in and of itself is not a novel finding, and the non-equivalence of accentedness and intelligibility raises the question of what these terms mean, both for researchers and naive listeners in the world. Munro and Derwing (1995b, p. 291) identify three relevant terms: intelligibility, comprehensibility, and accentedness. They define intelligibility with regard to whether a listener understands the utterance, comprehensibility as a listener’s perception of how easy an utterance is to understand, and accentedness as “how strong the talker’s foreign accent is perceived to be.” In a series of studies with second language speakers of English, Munro and Derwing (1995a,b) and Derwing and Munro (1997) find that while intelligibility, comprehensibility, and accentedness are correlated measures, they are also largely independent assessments of speech; speakers can be rated as highly accented and still be highly intelligible. For our purposes, we conceptualize accentedness as the perceived deviation from an expected or anticipated phonological distribution that does not necessarily disrupt identification or recognition. The novelty of our findings regarding the interplay of accentedness and intelligibility is in the native-speaker status of our speakers.
White Canadian native speakers of English are rated as less accented than Chinese Canadian native speakers of English; the perception of accentedness is distorted based on the ethnic identity of the speaker. To reiterate a practical point and echo the work of Munro and Derwing, given the disconnect between accentedness and intelligibility, using offline measures of accentedness as an actual proxy for intelligibility could have disastrous implications for diagnostic practices regarding the need for language intervention.

To model the interaction of spoken word recognition and expectations derived from, in this case, visually presented social cues, it is necessary to integrate a neo-generative model of speech perception (Pierrehumbert, 2002; German et al., 2013) with an attention-weighting mechanism (Johnson, 1997) and connect this to a model of person construal (Freeman and Ambady, 2011). Under such a model, abstract phonological generalizations or categories are connected with multidimensional phonetic distributions (Pierrehumbert, 2002). An attention-weighting mechanism, such as that described in Johnson (1997), allows for context-sensitive perception. These phonetic and phonological representations need to be connected to social categories. Freeman and Ambady’s (2011) interactive model of person construal is a theoretical model which provides a dynamical systems framework for understanding how higher-order socio-cognitive categories and task demands affect lower level perceptual processing. Categorizing a speaker’s social identity based on voice or face cues creates a train of activation to social category representations (i.e., a category level), which interface with expectations about what a voice with such social category features should sound like given prior experience (i.e., a stereotype level). Extending Freeman and Ambady’s (2011) model, the stereotype level would interface with a linguistic level where language representations offer a prediction about what a given talker would sound like, akin to the category level in the person construal model. Such a model is an implicit assumption in much of the sociophonetic research, but this theory of person construal allows the formulation of more explicit predictions. The dual-route speech processing model by Sumner et al. (2014) offers the infrastructure for connecting a model of person construal with more traditional models of phonetic representations. In Sumner and colleagues’ model, the speech signal is independently parsed into linguistic units and social representations. Social representations and linguistic representations are associated and the linkage allows for weighting of particular linguistic representations over others.

We can account for the current results within this framework. Speakers’ intelligibility without their photos is equivalent as all speakers exhibit the local accent, which is predictable in the context of the experiment. When listeners are presented with speakers’ digital photos, this activates a stereotype level for Chinese Canadians and White Canadians. This stereotype level activates a category level, which offers a prediction of what a speaker associated with such visual cues will sound like. The stereotype-category connection can be based on real, imagined, or culturally derived experiences.
For the White Canadian speakers, this connection sets up a linguistic prediction that matches what was anticipated in the photoless condition (the local Canadian English accent), and thus intelligibility across the two conditions is fairly uniform for the White Canadians. The linguistic prediction for the Chinese Canadians is that of a non-native signal, and the mismatch between that prediction and the observed speech signal causes a drop in intelligibility. Listeners who spend more time with Asian Canadians have a stronger connection between the Chinese Canadian level and the expected linguistic signal; hence the greater cost for these listeners in the audio + face condition for the Chinese Canadian speakers.

VI. CONCLUSION

Most research on intelligibility and accentedness has used the speech of non-native speakers of English (e.g., Munro and Derwing, 1995a,b; Bradlow and Bent, 2003, 2008; Yi et al., 2013). This study shows that listener experiences in a multicultural and multilingual urban environment can have negative effects on speech assessments for native speakers of non-White backgrounds. Chinese Canadians, even when native speakers of the local variety of English, are less intelligible and perceived as less-natively accented when listeners know they are hearing the speech of Chinese Canadians. With respect to our understanding of speech perception and the models we use to conceptualize the process, these results and the models described herein underscore the need to understand speech processing in the context of listener experiences. Listeners develop internal models of linguistic awareness and knowledge based on their experiences, real and imagined, with individuals in the world, and they track the social characteristics of those individuals. The associations of linguistic knowledge and social traits are used, for better and for worse, to parse and assess the incoming speech stream.

ACKNOWLEDGMENTS

This work benefited from input from Kathleen Currie Hall, David Kurbis, Martin Oberg, and Eric Vatikiotis-Bateson. It was supported by a UBC Alma Mater Society Impact Grant. Thanks to all of the speakers and participants for their time.

¹ We also conducted an analysis on z-scored accentedness ratings. This analysis involved removing two listeners who rated all voices with 1 = no foreign accent. The uniformity in these two listeners’ ratings made their standard deviations 0, which made it impossible to z-score their ratings. The model using the remaining 38 listeners’ z-scored accentedness ratings revealed the same main effects and interaction reported in the text on the log-transformed ratings.
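The exclusion described in this footnote follows directly from how z-scores are computed: each rating is centred on the listener's mean and divided by the listener's standard deviation, so a listener with identical ratings would require division by 0. The Python sketch below illustrates this with hypothetical data. The function name, the listener labels, and the use of the sample standard deviation are assumptions made for illustration, since the source does not specify these details.

```python
from statistics import mean, stdev


def zscore_by_listener(ratings):
    """Z-score each listener's ratings; set aside listeners with no variance."""
    zscored, excluded = {}, []
    for listener, values in ratings.items():
        sd = stdev(values)  # assumption: sample standard deviation (n - 1)
        if sd == 0:
            # A standard deviation of 0 would require dividing by zero.
            excluded.append(listener)
            continue
        centre = mean(values)
        zscored[listener] = [(v - centre) / sd for v in values]
    return zscored, excluded


# Hypothetical ratings, where 1 = no foreign accent.
ratings = {"listener_a": [1, 2, 3], "listener_b": [1, 1, 1]}
zscored, excluded = zscore_by_listener(ratings)
print(zscored)
print(excluded)
```

In this toy example, `listener_b` gives every voice a rating of 1 and is therefore set aside, which parallels the two listeners removed from the z-scored analysis.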

Abercrombie, D. (1967). Elements of General Phonetics (Edinburgh University Press, Edinburgh), pp. 1–216.
Babel, M. (2012). “Evidence for phonetic and social selectivity in spontaneous phonetic imitation,” J. Phonetics 40, 177–189.
Babel, M., and McGuire, G. (2015). “Perceptual fluency and judgments of vocal aesthetics and stereotypicality,” Cogn. Sci. 39, 766–787.
Baugh, J. (2000). “Racial identification by speech,” Am. Speech 75, 362–364.
Baugh, J. (2003). “Linguistic profiling,” in Black Linguistics: Language, Society, and Politics in Africa and the Americas, edited by S. Makoni, G. Smitherman, A. F. Ball, and A. K. Spears (Routledge, London), pp. 155–168.
Bench, J., and Bamford, J. M. (1979). Speech-Hearing Tests and the Spoken Language of Hearing-Impaired Children (Academic, London).
Bent, T., and Bradlow, A. R. (2003). “The interlanguage speech intelligibility benefit,” J. Acoust. Soc. Am. 114(3), 1600–1610.
Bernstein, M. J., Young, S. G., and Hugenberg, K. (2007). “The cross-category effect: Mere social categorization is sufficient to elicit an own-group bias in face recognition,” Psychol. Sci. 18, 706–712.
Bourhis, R. Y., and Giles, H. (1977). “The language of intergroup distinctiveness,” in Language, Ethnicity, and Intergroup Relations, edited by H. Giles (Academic, London), pp. 119–136.
Bradlow, A. R., and Bent, T. (2008). “Perceptual adaptation to non-native speech,” Cognition 106(2), 707–729.
Bucholtz, M. (2004). “Styles and stereotypes: The linguistic negotiation of identity among Laotian American youth,” Pragmatics 14, 2–3.
Clark, A. (2013). “Whatever next? Predictive brains, situated agents, and the future of cognitive science,” Behav. Brain Sci. 36(03), 181–204.
Derwing, T. M., and Munro, M. J. (1997). “Accent, intelligibility, and comprehensibility,” Studies Second Lang. Acquisition 19(1), 1–16.
Devos, T., and Banaji, M. R. (2005). “American = White?,” J. Pers. Soc. Psychol. 88, 447–466.
Drager, K. (2011). “Speaker age and vowel perception,” Lang. Speech 54, 99–121.
Freeman, J. B., and Ambady, N. (2011). “A dynamic interactive theory of person construal,” Psychol. Rev. 118(2), 247–279.
Fuertes, J. N., Gottdiener, W. H., Martin, H., Gilbert, T. C., and Giles, H. (2012). “A meta-analysis of the effects of speakers’ accent on interpersonal evaluations,” Eur. J. Soc. Psychol. 42, 12–133.
German, J. S., Carlson, K., and Pierrehumbert, J. B. (2013). “Reassignment of consonant allophones in rapid dialect acquisition,” J. Phonetics 41(3), 228–248.
Greenwald, A. G., McGhee, D. E., and Schwartz, J. L. (1998). “Measuring individual differences in implicit cognition: The implicit association test,” J. Pers. Soc. Psychol. 74, 1464–1480.


Greenwald, A. G., Nosek, B. A., and Banaji, M. R. (2003). “Understanding and using the implicit association test: I. An improved scoring algorithm,” J. Pers. Soc. Psychol. 85, 197–216.
Hanulíková, A., Van Alphen, P. M., Van Goch, M., and Weber, A. (2012). “When one person’s mistake is another’s standard usage: The effect of foreign accent on syntactic processing,” J. Cognit. Neurosci. 24, 878–887.
Hay, J., and Drager, K. (2010). “Stuffed toys and speech perception,” Linguistics 48, 865–892.
Hay, J., Nolan, A., and Drager, K. (2006a). “From fush to feesh: Exemplar priming in speech perception,” Ling. Rev. 23, 351–379.
Hay, J., Warren, P., and Drager, K. (2006b). “Factors influencing speech perception in the context of a merger-in-progress,” J. Phonetics 34, 458–484.
Johnson, K. (1997). “Speech perception without speaker normalization: An exemplar model,” in Talker Variability in Speech Processing, edited by K. Johnson and J. W. Mullennix (Academic Press, San Diego), pp. 145–165.
Johnson, K., Strand, E. A., and D’Imperio, M. (1999). “Auditory-visual integration of talker gender in vowel perception,” J. Phonetics 27, 359–384.
Kang, O., and Rubin, D. (2009). “Reverse linguistic stereotyping: Measuring the effect of listener expectations on speech evaluation,” J. Lang. Soc. Psychol. 28, 441–456.
Kinzler, K. D., and DeJesus, J. M. (2013). “Northern = smart and Southern = nice: The development of accent attitudes in the United States,” Quart. J. Exp. Psychol. 66, 1146–1158.
Ladefoged, P., and Broadbent, D. E. (1957). “Information conveyed by vowels,” J. Acoust. Soc. Am. 29, 98–104.
Levi, S. V., Winters, S. J., and Pisoni, D. B. (2007). “Speaker-independent factors affecting the perception of foreign accent in a second language,” J. Acoust. Soc. Am. 121(4), 2327–2338.
Lippi-Green, R. (1997). English with an Accent: Language, Ideology, and Discrimination in the United States (Routledge, London).
McGowan, K. (2015). “Social expectation improves speech perception in noise,” Lang. Speech (to be published). DOI:10.1177/0023830914565191.
Meissner, C. A., and Brigham, J. C. (2001). “Thirty years of investigating the own-race bias in memory for faces: A meta-analytic review,” Psychol., Public Policy, Law 7(1), 3–35.
Munro, M. J., and Derwing, T. M. (1995a). “Foreign accent, comprehensibility, and intelligibility in the speech of second language learners,” Lang. Learn. 45(1), 73–97.
Munro, M. J., and Derwing, T. M. (1995b). “Processing time, accent, and comprehensibility in the perception of native and foreign-accented speech,” Lang. Speech 38(3), 289–306.
Munson, B., Jefferson, S. V., and McDonald, E. C. (2006). “The influence of perceived sexual orientation on fricative identification,” J. Acoust. Soc. Am. 119, 2427–2437.
Niedzielski, N. (1999). “The effect of social information on the perception of sociolinguistic variables,” J. Lang. Soc. Psychol. 18, 62–85.
Nygaard, L. C., and Queen, J. S. (2008). “Communicating emotion: Linking affective prosody and word meaning,” J. Exp. Psychol.: Hum. Percept. Perform. 34, 1017–1030.
Pierrehumbert, J. (2002). “Word-specific phonetics,” in Laboratory Phonology VII, edited by C. Gussenhoven and N. Warner (Mouton de Gruyter, Berlin), pp. 101–139.
Policy Planning Division, City of Richmond. (2014). “Languages hot facts (Summary of 2006 and 2011 census information),” Revised October 2014. Retrieved from http://richmond.ca/__shared/assets/Languages6251.pdf (Last viewed January 6, 2015).
Preston, D. (ed.) (1993). “Folk dialectology,” in American Dialect Research (Benjamins, Philadelphia), pp. 333–377.
Preston, D. (ed.) (1999). “A language attitude approach to the perception of regional variety,” in Handbook of Perceptual Dialectology (Benjamins, Philadelphia, PA), Vol. 1, pp. 359–373.
Rubin, D. (1992). “Nonlanguage factors affecting undergraduates’ judgments of nonnative English-speaking teaching assistants,” Res. Higher Educ. 33, 511–531.
Skelton, C. (2007). “Top 100 surnames in the Lower Mainland,” The Vancouver Sun., Nov. 3. Available at http://www.canada.com/vancouversun/news/

J. Acoust. Soc. Am., Vol. 137, No. 5, May 2015

weekendreview/story.html?id=32537115-d2c9-4c4d-8442-4406b5577300/ (Last viewed January 31, 2014).
Smiljanić, R., and Bradlow, A. R. (2011). “Bidirectional clear speech perception benefit for native and high-proficiency non-native talkers and listeners: Intelligibility and accentedness,” J. Acoust. Soc. Am. 130, 4020–4031.
Statistics Canada. (2011). “National Household Survey, 2011,” in Immigration and Ethnocultural Diversity in Canada (Catalogue no. 99-010-X2011001). Available at http://epe.lac-bac.gc.ca/100/201/301/weekly_checklist/2013/internet/w13-19-U-E.html/collections/collection_2013/statcan/CS99-010-20111-eng.pdf (Last viewed January 5, 2015).
Statistics Canada. (2014a). “NHS Focus on Geography Series—Richmond,” in Immigration and Ethnocultural Diversity. Available at http://www12.statcan.gc.ca/nhs-enm/2011/as-sa/fogs-spg/Pages/FOG.cfm?lang=E&level=4&GeoCode=5915015 (Last viewed January 5, 2015).
Statistics Canada. (2014b). “NHS Focus on Geography Series—Vancouver,” in Immigration and Ethnocultural Diversity. Available at http://www12.statcan.gc.ca/nhs-enm/2011/as-sa/fogs-spg/Pages/FOG.cfm?lang=E&level=4&GeoCode=5915022 (Last viewed January 5, 2015).
Statistics Canada. (2014c). “Census subdivision of Richmond, CY—British Columbia,” in Focus on Geography Series, 2011 Census. Available at http://www12.statcan.gc.ca/census-recensement/2011/as-sa/fogs-spg/FactsCSD-eng.cfm?LANG=Eng&GK=CSD&GC=5915015 (Last viewed January 5, 2015).
Staum Casasanto, L. (2008). “Does social information influence sentence processing?,” in Proceedings of the 30th Annual Conference of the Cognitive Science Society, edited by B. C. Love, K. McRae, and V. M. Sloutsky (Cognitive Science Society, Austin, TX), pp. 799–804.
Strand, E., and Johnson, K. (1996). “Gradient and visual speaker normalization in the perception of fricatives,” in Natural Language Processing and Speech Technology: Results of the 3rd KONVENS Conference, Bielefeld, October, 1996, edited by D. Gibbon (Mouton de Gruyter, Berlin), pp. 14–26.
Studebaker, G. A. (1985). “A rationalized arcsine transform,” J. Speech Hear. Res. 28, 455–462.
Sumner, M., and Kataoka, R. (2013). “Effects of indexical variation on spoken word recognition,” J. Acoust. Soc. Am. 134, 485–491.
Sumner, M., Kim, S. K., King, E., and McGowan, K. B. (2014). “The socially weighted encoding of spoken words: A dual-route approach to speech perception,” Front. Psychol. 4, 1015, 1–13.
Trudgill, P., and Giles, H. (1983). “Sociolinguistics and linguistic value judgements,” in On Dialect: Social and Geographical Perspectives, edited by P. Trudgill (New York University Press, New York), pp. 201–225.
UBC Planning and Institutional Research. (2012). “2009–2012 new to UBC (NUBC) student survey,” Available at http://www.pair.ubc.ca/surveys/nubc/UBCV%20NUBC%202012.pdf (Last viewed January 6, 2015).
Van Bavel, J. J., and Cunningham, W. A. (2012). “A social identity approach to person memory: Group membership, collective identification, and social role shape attention and memory,” Pers. Soc. Psychol. Bull. 38, 1566–1578.
Van Berkum, J. J., Van den Brink, D., Tesink, C. M., Kos, M., and Hagoort, P. (2008). “The neural integration of speaker and message,” J. Cognit. Neurosci. 20(4), 580–591.
Wells, J. C. (1982). Accents of English (Cambridge University Press, Cambridge, UK), Vol. 1.
Williams, A., Garrett, P., and Coupland, N. (1999). “Dialect recognition,” in Handbook of Perceptual Dialectology, edited by D. R. Preston (Benjamins, Philadelphia), pp. 345–358.
Wong, A. W. M., and Hall-Lew, L. (2014). “Regional variability and ethnic identity: Chinese Americans in New York City and San Francisco,” Lang. Commun. 35, 27–42.
Yi, H.-G., Phelps, J., Smiljanić, R., and Chandrasekaran, B. (2013). “Reduced efficiency of audiovisual integration for nonnative speech,” J. Acoust. Soc. Am. 134, EL387–EL393.
Young, S. G., Bernstein, M. J., and Hugenberg, K. (2010). “When do own-group biases in face recognition occur? Encoding versus post-encoding,” Soc. Cognit. 28, 240–250.
Zahn, C. J., and Hopper, R. (1985). “Measuring language attitudes: The speech evaluation instrument,” J. Lang. Soc. Psychol. 4(2), 113–123.


