Journal of Counseling Psychology, 2014, Vol. 61, No. 1, 133-145

© 2014 American Psychological Association 0022-0167/14/$12.00 DOI: 10.1037/a0035214

Developing a Comprehensive Scale to Assess College Multicultural Programming

Brent Mallinckrodt, Joseph R. Miles, Tripti Bhaskar, Nicole Chery, Gahee Choi, and Mi-Ra Sung
University of Tennessee

A barrier to assessing effectiveness of multicultural programming is the lack of a relatively brief instrument to measure the wide range of intended outcomes. A frequent goal of programming is to increase cultural empathy, but this is rarely the only intended outcome. We conducted focus groups of campus administrators, student affairs staff, and undergraduate instructors who identified a full range of racial/ethnic multicultural competencies that undergraduates should possess. An 84-item pool generated from these focus groups was combined with the 31-item Scale of Ethnocultural Empathy (SEE; Wang et al., 2003). These 115 items, together with instruments used to gauge concurrent validity, were administered to White undergraduate students in introductory psychology courses at the midpoint (n = 602) and end (n = 676) of fall semester. Exploratory factor analysis suggested 6 subscales for the Everyday Multicultural Competencies/Revised SEE (EMC/RSEE): (a) Cultural Openness and Desire to Learn; (b) Resentment and Cultural Dominance; (c) Anxiety and Lack of Multicultural Self-Efficacy; (d) Empathic Perspective-Taking; (e) Awareness of Contemporary Racism and Privilege; and (f) Empathic Feeling and Acting as an Ally. Item response theory principles guided final selection of subscale items. Analyses suggested good factor stability, reliability, and discriminant validity of the 48-item EMC/RSEE in these undergraduate samples. EMC/RSEE subscales were not strongly correlated with a measure of impression management and were significantly associated with measures of Openness to Diversity Challenge and Universal-Diverse Orientation.

Keywords: diversity programming, multicultural competencies, item response theory

Supplemental materials: http://dx.doi.org/10.1037/a0035214.supp

In response to increasing diversity in the United States, many colleges and universities have invested considerable resources in multicultural education and programming (Howard-Hamilton, Cuyjet, & Cooper, 2011). Multicultural education takes many forms, including formal courses specifically designed to deliver multicultural content, and efforts to infuse multicultural content into existing courses (Adams, Bell, & Griffin, 2007). Multicultural programming includes workshops targeted at a general campus audience; for example, a workshop may be designed to increase awareness of inequalities, reduce prejudicial attitudes, or develop skills to combat discrimination and reduce bias. In contrast to serving a broad general audience, another type of multicultural programming is training targeted at specific groups, for example, residence hall paraprofessional peer counselors (Elleven, Allen, & Wircenski, 2001).

A growing body of research suggests that college multicultural programming is associated with a variety of positive outcomes, including increased empathy, intergroup understanding, communication skills, awareness of structural power relations, critical thinking skills, student engagement and motivation, and political involvement (Dessel, 2010; Gurin, Dey, Hurtado, & Gurin, 2002; Gurin, Dey, Gurin, & Hurtado, 2004; Gurin, Nagda, & Lopez, 2004; Nagda, Gurin, Sorensen, & Zúñiga, 2009). Although these studies have provided valuable information about multicultural interventions, researchers have been compelled to rely on batteries of single-purpose measures or a variety of unstandardized assessment tools and methods. The lack of a single, comprehensive, reliable, and valid measure with which to assess the wide range of intended outcomes of undergraduate multicultural programming has prevented educators and researchers from directly comparing outcomes of multicultural education and programming across interventions and studies. Answering the question of "what works, for whom, and when" with respect to multicultural programming has been hampered by the lack of a common measure.

Across the diverse range of multicultural programming, a frequent goal is to increase participants' empathy for others who are culturally different. Ethnocultural empathy is a specific type of empathy for others whose racial/ethnic background differs from one's own.

Brent Mallinckrodt, Joseph R. Miles, Tripti Bhaskar, Nicole Chery, Gahee Choi, and Mi-Ra Sung, Department of Psychology, University of Tennessee. Tripti Bhaskar is now in independent practice, Tallahassee, Florida.

We gratefully acknowledge the assistance of Richard Saudargas in this project. We also thank Patrick R. Grzanka and Joseph G. Ponterotto for their helpful feedback on earlier versions of this article.

Correspondence concerning this article should be addressed to Brent Mallinckrodt, Department of Psychology, University of Tennessee, 1404 Circle Drive, Room 305, Knoxville, TN 37996. E-mail: [email protected]


Wang et al. (2003) drew from a model developed by Ridley and Lingle (1996) to conceptualize ethnocultural empathy as including components of (a) intellectual empathy, the cognitive ability to comprehend another's cultural perspective; (b) emotional empathy, a capacity to feel affect in the present similar to another's emotional experience; and (c) a capacity to communicate empathic understanding to another. Wang et al. developed the Scale of Ethnocultural Empathy (SEE) to assess this construct. Because this type of empathy is frequently a goal of campus multicultural programming, the SEE has been used to evaluate the effectiveness of these interventions. The SEE has demonstrated good validity and reliability in undergraduate samples, and a growing number of multicultural programming studies have used it as the principal measure of outcomes (cf. Phillips, 2012; Rasoal, Jungert, Hau, Stiwne, & Andersson, 2009). However, ethnocultural empathy is not the only goal of many multicultural interventions. A more comprehensive multidimensional measure is needed to capture a more complete range of outcomes.

Thus, the general purpose of the current project was to expand the SEE to develop a multidimensional self-report instrument that could be used to assess the effectiveness of campus ethnic/racial diversity and multicultural programming efforts aimed at a broad undergraduate audience. Because increased empathy is so frequently an intended outcome of multicultural programs, we conceptualized this project fundamentally as an effort to revise and expand the SEE. Therefore, we chose as a working title for the new measure: Everyday Multicultural Competencies/Revised Scale of Ethnocultural Empathy (EMC/RSEE). We recognized that the goals of "typical" programs can be quite varied, so our intention was to develop an instrument with a subscale structure likely to capture the most important generic goals for this broad range of interventions. Howard-Hamilton et al. (2011) recently identified three clusters of undergraduate multicultural programming goals: (a) culturally relevant knowledge (e.g., knowledge of one's own cultural identity, knowledge of the cultures of others), (b) multicultural skills (e.g., self-reflection, perspective-taking, intergroup communication), and (c) diversity-related attitudes (e.g., pride in one's own culture, belief that discrimination is unjust, belief that intergroup interactions enhance quality of life). This tripartite organizing scheme closely parallels the three domains of multicultural competence identified in training mental health professionals (Sue, Arredondo, & McDavis, 1992; Sue et al., 1982), although the target participants in these programs are general populations of college students. Thus, the theoretical framework of (a) knowledge, (b) skills, and (c) attitudes/awareness served as a useful starting point for developing an instrument to assess a broad range of multicultural programming.

We adopted a ground-up inductive approach that relied on focus groups of college administrators, student affairs personnel, and instructors to identify specific knowledge, skills, and attitudes/awareness that undergraduate students should develop and that are typical outcome goals of multicultural programming. Second, we asked participants in these focus groups, together with other experts, to help generate an item pool to expand the 31-item SEE to capture this broader range of goals. Finally, we devised a data collection plan to solicit two large samples of undergraduates, separated by an interval of about six weeks, for the purpose of factor analysis, instrument development, and estimation of reliability and validity for the EMC/RSEE.
We could not know in advance what new dimensions might emerge, so as a starting point we chose two of the same measures for exploration of convergent and divergent validity that Wang et al. (2003) chose in developing the SEE.

Because our measure was conceived as an extension of the SEE, we expected that any new factors that closely paralleled previous SEE subscales would show a similar pattern of significant correlations. The construct of universal-diverse orientation is defined as "an attitude of awareness and acceptance of both the similarities and differences that exist among people" (Miville et al., 1999, p. 291) and is believed to be essential for effective multicultural interactions. Individuals who hold this cluster of attitudes recognize commonalities of human experience that form a basis for empathy and mutual connection while, at the same time, recognizing and valuing profound differences in experience rooted in ethnicity, race, gender, and other social identities. The original 45-item instrument used to assess this construct is the Miville-Guzman Universality-Diversity Scale (MGUDS; Miville et al., 1999). The shortened 15-item version used in this study (MGUDS-S; Fuertes, Miville, Mohr, Sedlacek, & Gretchen, 2000) consists of three subscales: (a) Diversity of Contact, which reflects predominantly behavioral intentions and commitment to participate in diverse social and cultural activities; (b) Relativistic Appreciation, which captures the core construct of simultaneously valuing both differences and similarities and which the authors believe is primarily a cognitive component; and (c) Comfort with Differences, which is primarily an affective component involving feeling comfortable in close contact with others who are culturally different. Consequently, we anticipated that any EMC/RSEE factor tapping strong themes of overt behavioral intentions would be significantly correlated with Diversity of Contact; factor(s) tapping attitudes of acceptance and valuing of diversity would be significantly correlated with Relativistic Appreciation; and factor(s) tapping aspects of affect and empathy would be significantly correlated with Comfort with Differences.

The second measure used by Wang et al. (2003) that we used in this study to assess divergent validity was the Impression Management (IM) subscale of the Balanced Inventory of Desired Responding (BIDR; Paulhus, 1984). There is controversy in the personality assessment literature about the appropriateness of using scales like the IM subscale of the BIDR in instrument development. For example, a recent literature review argued that there is very little empirical evidence supporting an interpretation of IM as a form of defensive self-presentation (Uziel, 2010b). Instead, theory and research (Holden & Fekken, 1989; Uziel, 2010a) suggest that instruments such as the IM subscale of the BIDR are best considered measures of an adaptive form of interpersonal sensitivity in which self-control in public settings is a central feature. From this perspective, individuals who score high on a measure like the IM subscale of the BIDR tend to value social harmony and agreeableness more than those who score low in "impression management," but this tendency does not prompt high-scoring persons to intentionally misrepresent themselves (Uziel, 2010b). A significant correlation between counseling trainees' self-assessments of their multicultural counseling skills and social desirability (as found by Teel, 2003) might therefore be interpreted as evidence of interpersonal sensitivity rather than as deceptive or defensive self-presentation.
Thus, it is possible that the interpersonal sensitivity component of IM may be an essential component of multicultural skills. Acknowledging this controversy, we nevertheless included the IM subscale because information about its correlations with the subscales that emerge for the EMC/RSEE is likely to be useful for interpreting the meaning of the new measures.

Finally, we included a third measure, one not used by Wang et al.: the Openness to Diversity/Challenge Scale (ODCS; Pascarella, Edison, Nora, Hagedorn, & Terenzini, 1996). This brief, eight-item scale is often used as an outcome measure in campus multicultural programming to assess general openness to cultural and values diversity in one's college environment. We chose the ODCS not only because of its widespread use but also because, as our item pool was developed through the focus group process, it became increasingly clear from the large number of such items being generated that negative dimensions were likely to emerge. Focus group informants, especially those working in residence halls, described how a portion of their efforts were directed at decreasing racist beliefs and counteracting closed-minded, antidiversity attitudes. Consequently, we expected that any EMC/RSEE factor that emerged to capture the closed and negative attitudes toward diversity that programming sometimes seeks to reduce would be significantly, negatively correlated with ODCS scores.

Worthington and Dillon (2011) pointed out that the particular multicultural counseling competencies that are effective in a given context are quite likely to vary, depending on the race and ethnicity of both the counselor and the client. A single measure developed with only one type of dyad in mind (e.g., White counselors and clients of color) is unlikely to be valid when used with other types of dyads. Thus, in an effort to appropriately narrow the focus of our new scale, development of items for the EMC/RSEE was limited to the context of White students on a predominantly White campus. Although psychologists must always be mindful not to exclude the experiences of people of color, particularly in multicultural education and programming, it is also problematic to assume that the experiences of those from privileged groups (e.g., White people) in multicultural programs will be the same as those from marginalized groups (e.g., people of color). For example, White people have unique social and emotional experiences related to race and racism (Spanierman & Heppner, 2004), and those from marginalized social identity groups and those from privileged social identity groups have been found to have different experiences in multicultural interventions (Miles & Kivlighan, 2012). Therefore, although we collected data from a broad sample, we retained for analysis only data from students who reported their ethnic/racial identification as "White, European American" and who checked no additional race/ethnicity. Although the decision to focus on White students restricts the generalizability of our results and the range of potential uses for the new scale, students from the dominant White U.S. culture are frequently a focus of campus multicultural programming. Future studies will examine the validity and reliability of the measure with students of color.

A final goal of this study arose from the belief that instrument development methods based on item response theory (IRT) are underutilized in counseling psychology research. We hoped to demonstrate the value of these methods in this project.
Specifically, IRT criteria were used to (a) screen out items with high differential item functioning for women versus men, (b) evaluate and fine-tune performance of the Likert-type response scale, and (c) identify a set of items to retain for subscales with the highest capability for detecting differences between individuals across the fullest possible range of the underlying construct.


Method

Generating an Item Pool

We began by conducting three focus groups with (a) key campus administrators responsible for diversity programming (Provost, Associate Provost, and Assistant Director of Residence Life), (b) 17 residence hall peer counselors, and (c) eight counseling psychology doctoral students involved in undergraduate teaching. We used the same core stimulus questions for each focus group, derived from Sue et al.'s (1982) multicultural competencies model:

Imagine a bright, ambitious undergraduate senior graduating from a large public university. Considering the range of multicultural environments, both in the U.S. and abroad, that this student is likely to experience in a productive career, please discuss the following three questions: (a) What are the attitudes, personal awareness, and ways of thinking that this student should possess to function effectively? (b) What are the skills that this student must acquire to function effectively? (c) What bases of knowledge must this student have to function effectively?

Follow-up questions asked for anecdotes and exemplars of multicultural sensitivity and openness to diversity, as well as instances of insensitivity and intolerance in daily interactions between undergraduate students. Participants were asked initially to identify domains and overall themes, and then invited to generate items to tap these concepts. Each of the focus group interviews was audio recorded. A coding team of one faculty member and eight graduate students independently read the transcripts and generated lists of domains and items. (These eight students were the same team that comprised the third focus group of undergraduate instructors.) Collating these lists with the items that focus group members themselves suggested resulted in a set of 15 domains. The domains were grouped into a Knowledge Cluster (three domains; e.g., "Our daily life in the U.S. is affected by other cultures," "Recognizing White privilege"), a Skills Cluster (five domains; e.g., "Skills for communicating and connecting," "Monitoring one's own behavior so as not to be offensive"), and an Attitudes Cluster (six domains; e.g., "American culture is the best culture, the 'normative' culture, and others are inferior," "Openness to new experiences"). All domain labels, together with sample items, are shown in Table S1 in the online supplemental materials.

The team generated items for each domain, but many items were deemed appropriate for tapping more than one domain. For example, several items negatively worded for the "Recognizing White privilege" domain (e.g., "In America everyone has an equal opportunity for success") seemed to serve well as positive indicators for the "Colorblind attitudes" domain. An initial pool of 219 items was generated. Many of these items also tapped aspects of cultural empathy. After removing items redundant with the 31 SEE items and other redundant items, and editing to increase clarity, an initial pool of 84 items¹ was generated for the Everyday Multicultural Competencies domain of the new instrument.

¹ The original pool contained six items to tap knowledge of specific cultures, for example, "Deaf people have their own culture that is different from cultures of the hearing world." Initial factor analysis showed these items tended to form one- and two-item factors. They were dropped from subsequent analyses.


Judgments about item clarity were made collectively by the team without additional pilot testing.

Participants

Undergraduate students in five sections of an introduction to psychology course at a large, public, predominantly White, Southeastern U.S. university completed surveys at Time 1, approximately the midpoint of fall semester. The same five sections plus three others were solicited to provide data at Time 2, 6 weeks later, 1 to 2 weeks before final exams. A total of 819 students provided data at Time 1, and 928 did so at Time 2. Of these students, 13 (1.6%) at Time 1 and 31 (3.3%) at Time 2 had missing data for more than 10% of the items included in factor analyses. These cases were excluded. Three validity check items (e.g., "Please code a seven for this item") suggested that 107 (13%) students at Time 1 and 119 (13%) at Time 2 exhibited a random or inattentive pattern of responding. Data from these students were deleted. Regarding ethnic identification, 97 (14%) students at Time 1 and 102 (13%) at Time 2 reported an identification other than "White, European American" or did not answer this question. These data were also excluded from further analyses.

Of the 602 remaining students who provided useable data at Time 1, 61% (n = 365) were women and 39% (n = 237) were men. Their mean age was 18.56 years (SD = 1.51, range = 18-37). Of the 676 students who provided useable data at Time 2, 60% (n = 404) were women, 40% (n = 270) were men, and two did not indicate their sex. Their mean age was 18.78 years (SD = 2.18, range = 18-38).

Measures

Scale of Ethnocultural Empathy (SEE; Wang et al., 2003). The SEE was designed to assess empathy for persons whose racial and ethnic background differs from one's own. The 31 items form four subscales: (a) Empathic Feeling and Expression (15 items; e.g., "When I hear people make racist jokes, I tell them I am offended even though they are not referring to my racial or ethnic group"), (b) Empathic Perspective Taking (seven items; e.g., "It is easy for me to understand what it would feel like to be a person of another racial or ethnic background other than my own"), (c) Acceptance of Cultural Differences (five items; e.g., "I feel irritated when people of different racial or ethnic backgrounds speak their language around me," reverse keyed), and (d) Empathic Awareness (four items; e.g., "I am aware of how society differentially treats racial or ethnic groups other than my own"). Participants used a 6-point Likert-type scale, which was slightly modified for this study with the following anchors: 1 (strongly disagree), 2 (moderately disagree), 3 (slightly disagree), 4 (slightly agree), 5 (moderately agree), and 6 (strongly agree). Higher scores indicate greater ethnocultural empathy. In a sample of undergraduates, Wang et al. (2003) reported internal reliability (Cronbach's alpha) for the four subscales of .90, .79, .71, and .74, respectively. Retest reliabilities (2-week interval) were .76, .75, .86, and .64, respectively. For the total SEE scale, Wang et al. reported internal reliability of .91 and retest reliability of .76 (2-week interval). They reported evidence of validity in the form of significant correlations in samples of undergraduates with measures of general empathic perspective-taking and of universal-diverse orientation (Miville et al., 1999). Higher SEE scores were associated with having more racially/ethnically diverse family members and friends, and having attended a more diverse high school. In this study, the internal reliabilities (coefficient alpha) in Time 1 data for the four subscales were .89, .69, .77, and .72, respectively; in Time 2 data they were .88, .71, .82, and .74, respectively.

Miville-Guzman Universality Diversity Scale—Short Form (MGUDS-S; Fuertes et al., 2000). The MGUDS-S is a 15-item version of the 45-item MGUDS (Miville et al., 1999), which was developed to assess an attitude of awareness and acceptance of the commonality and diversity that exist in ethnic and racial groups other than one's own. Like the parent scale, the MGUDS-S has a three-factor structure: (a) Diversity of Contact (e.g., "I would like to join an organization that emphasizes getting to know people from different countries"), (b) Relativistic Appreciation (e.g., "I can best understand someone after I get to know how he/she is both similar to and different from me"), and (c) Comfort with Differences (e.g., "It's really hard for me to feel close to a person of another race"). Higher scores indicate a greater appreciation of similarities and differences in individuals who are ethnically and culturally different. Participants use a 6-point, fully anchored Likert-type scale, ranging from 1 (strongly disagree) to 6 (strongly agree). The original MGUDS was positively correlated with healthy narcissism, empathy, positive attitudes toward feminism, androgyny, and positive aspects of African American and European American racial identity; it was negatively correlated with dogmatism and homophobia (Miville et al., 1999). Fuertes et al. (2000) reported that the MGUDS-S was significantly correlated (r = .77) with the parent scale in a sample of undergraduates. Confirmatory factor analyses supported the three-factor structure of the MGUDS-S, with Cronbach's alphas for the three subscales of .82, .59, and .92, respectively. For the 15-item total scale score, alpha = .77. Wang et al. (2003) reported that all three MGUDS-S subscales correlated significantly with all four SEE subscales. In the present study, internal reliabilities (coefficient alpha) for subscales of the MGUDS-S at Time 1 were .79, .82, and .81, respectively, and at Time 2 were .81, .86, and .81, respectively.

Openness to Diversity/Challenge Scale (ODCS; Pascarella et al., 1996). The ODCS is an eight-item measure developed to assess individuals' acceptance of being challenged by different ideas, values, and cultural perspectives, as well as their appreciation of racial and cultural diversity (e.g., "I enjoy having discussions with people whose ideas and values are different from my own"). Respondents use a fully anchored, 5-point Likert-type scale, ranging from 1 (strongly disagree) to 5 (strongly agree). Higher scores indicate greater openness to cultural diversity and challenge. Pascarella et al. (1996) reported that internal consistency reliability of the scale with college students ranged from .83 to .84. Support for validity is evident in significant positive correlations of ODCS scores with student participation in cultural workshops, studying abroad, and greater interaction with diverse peers (Pascarella et al., 1996). In the current study, the internal reliabilities (coefficient alpha) for this scale at Time 1 and Time 2 were .89 and .90, respectively.

Balanced Inventory of Desired Responding (BIDR; Paulhus, 1984). Only the 20-item Impression Management (IM) subscale was used in the current study. The IM is used to assess social desirability bias and the degree to which a respondent consciously presents inflated descriptions of her- or himself to please others (e.g., "I never take things that don't belong to me"). The items use a fully anchored, 7-point response scale ranging from 1 (not at all true) to 7 (very true). The scoring scheme recommended by Paulhus (1984) was used in this study: only responses of 6 or 7 were counted and assigned one point each. Thus, total scores range from 0 to 20, with higher scores indicating a greater tendency to respond in a socially desirable way. The coefficient alpha for the IM in the present study was .78.
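To make the scoring rule concrete, here is a minimal sketch of the dichotomous IM scoring described above. It is an illustration only, not the authors' code; the function name and the example response vector are hypothetical, and responses are assumed to be already reverse-keyed where required.

```python
# Sketch of Paulhus's (1984) dichotomous scoring for the 20-item IM subscale:
# only extreme responses (6 or 7 on the 7-point scale) earn a point.

def score_im(responses):
    """Return a 0-20 IM score: one point per response of 6 or 7."""
    if len(responses) != 20:
        raise ValueError("expected 20 IM item responses")
    return sum(1 for r in responses if r >= 6)

example = [7, 6, 3, 5, 6, 2, 7, 4, 1, 6, 5, 5, 6, 2, 3, 7, 4, 6, 5, 1]
print(score_im(example))  # -> 8; higher totals suggest more socially desirable responding
```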

Procedure

At both Time 1 and Time 2, survey packets were distributed during class. Students received extra credit toward their course grade for participating on either occasion. Given total enrollment in these sections of introductory psychology, the estimated research compliance rate for useable data was approximately 74% at Time 1 and 62% at Time 2. Students in the five repeated-measures class sections were encouraged, but not required, to provide data on both occasions. They self-generated a code name to permit collating data while protecting their anonymity. Of the 602 students who provided useable Time 1 data, 326 (54%) also participated at Time 2, providing a repeated measures sample. For the purposes of exploratory factor analysis (EFA) and confirmatory factor analysis (CFA), random selection of Time 1 versus Time 2 data was used so that the 326 retest subjects provided only one set of data and the EFA and CFA subsamples remained completely independent. There were thus 952 independent cases for factor analysis: 276 Time 1-only students, 350 Time 2-only students, and 326 retest students randomly assigned to contribute either their Time 1 or their Time 2 data (163 each). These cases were randomly allocated to an EFA sample (n = 585), which permitted at least five cases per item, with the remainder assigned to the CFA holdout sample (n = 367).
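The allocation logic is easy to misread in prose, so the following sketch restates it. Everything here is a schematic reconstruction under stated assumptions: records are stand-in dictionaries, the seed is arbitrary, and only the two constraints taken from the text (each retest student contributes exactly one occasion; the 585-case EFA sample gives at least five cases per item) drive the code.

```python
import random

def allocate(time1_only, time2_only, retest_pairs, n_efa=585, seed=20):
    """Split independent cases into an EFA sample and a CFA holdout."""
    rng = random.Random(seed)
    pooled = list(time1_only) + list(time2_only)
    # Each retest student contributes Time 1 OR Time 2 data, never both,
    # so the EFA and CFA subsamples remain completely independent.
    pooled += [rng.choice(pair) for pair in retest_pairs]
    rng.shuffle(pooled)
    return pooled[:n_efa], pooled[n_efa:]  # EFA sample, CFA holdout

# 276 Time 1-only + 350 Time 2-only + 326 retest students = 952 cases
t1 = [{"id": i, "time": 1} for i in range(276)]
t2 = [{"id": 300 + i, "time": 2} for i in range(350)]
pairs = [({"id": 700 + i, "time": 1}, {"id": 700 + i, "time": 2}) for i in range(326)]
efa, cfa = allocate(t1, t2, pairs)
print(len(efa), len(cfa))  # -> 585 367, i.e., 5.09 EFA cases per item
```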

Results

Exploratory Factor Analysis (EFA)

The 84 Everyday Multicultural Competencies items developed for this study were grouped together with the 31 items of the SEE for factor analysis. The EFA sample permitted 5.09 cases per item. We used the SPSS Version 20.0 missing values regression procedure to estimate missing values for cases with less than 10% missing data. (Recall that cases with >10% missing data were deleted.) The Kaiser-Meyer-Olkin measure of sampling adequacy was .94, suggesting that the 115-item matrix exhibited more than sufficient covariation. A parallel analysis was conducted using syntax developed for SPSS by O'Connor (2000) to suggest the appropriate number of factors for extraction (Fabrigar, Wegener, MacCallum, & Strahan, 1999; Worthington & Whittaker, 2006). Results from principal axis estimation of 1,000 simulated samples suggested that 19 factors should be extracted. We used squared multiple correlations on the diagonal (i.e., principal axis extraction) in these estimates because it is the extraction method Worthington and Whittaker (2006) suggested for EFA. However, Buja and Eyuboglu (1992) warned that principal axis parallel analysis tends to overestimate the number of factors to extract. Therefore, a parallel analysis using principal components extraction was conducted; it indicated that eight factors should be extracted.
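For readers unfamiliar with parallel analysis, the sketch below shows the principal-components variant of the procedure: factors are retained only while the observed eigenvalues exceed those obtained from random data of the same dimensions. This is a generic illustration, not O'Connor's (2000) SPSS syntax; the 95th-percentile criterion and the random placeholder data matrix are assumptions of the sketch.

```python
import numpy as np

def parallel_analysis(data, n_sims=1000, percentile=95, seed=0):
    """Return the number of components whose eigenvalues beat random data."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    sim = np.empty((n_sims, p))
    for s in range(n_sims):
        noise = rng.standard_normal((n, p))
        sim[s] = np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]
    threshold = np.percentile(sim, percentile, axis=0)  # eigenvalues expected by chance
    return int(np.sum(observed > threshold))

# Placeholder standing in for the 585 x 115 EFA data matrix
fake_data = np.random.default_rng(1).standard_normal((585, 115))
print(parallel_analysis(fake_data, n_sims=200))
```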


Following the recommendations of Worthington and Whittaker (2006), an EFA was performed using principal axis extraction with oblique (direct oblimin) rotation to allow for correlated factors. The eight-factor solution resulted in extraction of two small factors of two and three items. Therefore, seven-factor and six-factor solutions were also inspected. The six-factor solution yielded more readily interpretable factors. Table 1 shows that the first six factors accounted for a total of 38.68% of the cumulative variance after extraction. Following the recommendations of Worthington and Whittaker, an item was assigned to a factor if its loading was ≥.32 on that factor and it did not load within an absolute value of .15 on any other factor. This selection procedure resulted in retention of 82 items: Factor 1 = 25 items, Factor 2 = 22 items, Factor 3 = nine items, Factor 4 = six items, Factor 5 = 10 items, and Factor 6 = 10 items. The complete 115-item × six-factor pattern matrix is shown in Table S2, with item wordings shown in Table S3.

The EFA was run again, this time imposing a six-factor solution on the 82 selected items. Only one item "changed allegiance" by loading more strongly on a different factor. It was dropped from Factor 3, leaving 81 items (22 from the SEE and 59 from the new item pool). Clearly, an 81-item instrument would be too cumbersome for the intended purposes of the EMC/RSEE. Our goal was to reduce factors to six to 10 items each and to examine all items for poor conceptual fit and undesirable psychometric characteristics. Item 64, "Members from minority groups use many excuses to explain their laziness," was eliminated from Factor 2 because we were concerned that many respondents would find this item too offensive. For the same reason we deleted Item 17, "Seeing people from different cultures makes me feel somewhat disgusted," from Factor 3.
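One common reading of the assignment rule above is sketched here: an item is kept when its strongest absolute pattern loading is at least .32 and every other loading falls at least .15 below it. The pattern matrix, the cutoffs as defaults, and the toy loadings are illustrative assumptions, not the study's data.

```python
import numpy as np

def assign_items(pattern, primary_cut=0.32, gap=0.15):
    """pattern: items x factors oblimin pattern loadings; returns {item: factor}."""
    assignments = {}
    for i, row in enumerate(np.abs(np.asarray(pattern))):
        order = np.argsort(row)[::-1]
        best, runner_up = row[order[0]], row[order[1]]
        if best >= primary_cut and best - runner_up >= gap:
            assignments[i] = int(order[0])  # item i retained on its primary factor
    return assignments  # weak or cross-loading items are simply dropped

loadings = [[0.61, 0.10, 0.05],   # clean loading -> assigned to factor 0
            [0.40, 0.33, 0.02],   # cross-loads within .15 -> dropped
            [0.28, 0.20, 0.11]]   # below .32 -> dropped
print(assign_items(loadings))     # -> {0: 0}
```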

Table 1
Exploratory Factor Analysis: Eigenvalues and Total Variance Explained in the Six-Factor Solution

Factor   Eigenvalue   % of variance   Cumulative %   Cumulative % after extraction
1        27.43        23.86           23.86          23.73
2         5.87         5.11           28.96          27.97
3         5.17         4.50           33.46          31.93
4         4.11         3.57           37.03          34.95
5         3.18         2.77           39.80          37.21
6         2.31         2.01           41.81          38.68

Note. N = 585.


The traditional method to reduce the length of a scale suggested by instrument development experts (cf. Worthington & Whittaker, 2006) has been to retain items with the highest factor loadings. Other experts (cf. Doucette & Wolf, 2009) assert that IRT should be used to generate additional empirical selection criteria. There are many approaches to IRT, which can be roughly grouped according to the number of estimated parameters. The Rasch approach is a one-parameter IRT model (1-pl), which has the advantage of yielding a single score that reflects positioning along the hypothesized latent measurement construct (Doucette & Wolf, 2009). Note that three- and four-parameter approaches are also available to allow assessment of respondent guessing and careless responding; these models are used primarily by educational psychologists in performance test construction. The two-parameter model (2-pl) adds an item-discrimination dimension that can be useful in psychological assessment (de Ayala, 2009) but was not used in the present study. In contrast to 2-pl models, the 1-pl Rasch model used in this study has the desirable feature of specific objectivity, which allows the assumption of an invariant ranking of item difficulty independent of the abilities of individual test-takers (as well as an invariant ranking of person ability independent of the difficulty of items). The 2-pl estimates of item discrimination are always sample dependent, whereas the 1-pl Rasch model estimates of item difficulty carry an assumption of sample independence (Fox & Jones, 1998). Finally, the simplicity and parsimony of the Rasch model permit analyses of whether an instrument's response scale functions as intended.

Rasch model principles were used to assist in the final selection of items to retain for subscales, using WINSTEPS 3.80.0 software (Linacre, 2009). Each subscale was examined as an independent instrument. First, items were screened for disordered thresholds, that is, a failure of pairs of neighboring points on the Likert-type response scale to discriminate respondents appropriately. Of the remaining items, three were excluded for this reason. Next, items were screened for differential item functioning (DIF) with respect to sex. Significant DIF in this context means that, for the item in question, a significantly higher amount of the underlying construct is required for one sex than for the other to cross the threshold of .50 probability of endorsing the item. In large samples, levels of DIF that are statistically significant for an item might nevertheless make only a trivial contribution to possible sex bias for the subscale as a whole (Osterlind & Everson, 2010). Thus, we set the threshold for DIF significance at p < .0005 using the Mantel-Haenszel χ² test. Six items were excluded for this reason. A third screen checked the level of "underfit," which can be considered an indication of poor fit between an item and the assumed unidimensional structure of a subscale. Nine items were excluded on this basis. After these screens, 61 of the 79 items remained.

The final IRT screen selected items to provide the best spread of difficulty, in an effort to maximize sensitivity of the subscales in discriminating among individuals across the full range of scores on the latent construct of interest (Bond & Fox, 2007; de Ayala, 2009). Difficulty refers to the quantity of the latent construct that an individual must possess to have a .50 probability of answering a specific item in the scored direction. The term is a legacy of the initial development of IRT methods to calibrate items on tests of ability. For a psychological construct such as depression, an item like "Once in a while I feel sad" would probably have low difficulty, whereas "Nearly every day I consider a plan for killing myself" would have high difficulty. A subscale with many low-difficulty items can make relatively fine-grained discriminations among individuals who rank low on the construct but can make only gross distinctions among individuals at the high-score end of the continuum. Unless instrument developers can anticipate that users of a scale will be relatively more interested in a particular restricted range of scores, the wisest course is to retain items with a wide range of difficulty to maximize sensitivity across the widest possible range of scores, or "bandwidth," for each subscale (Embretson & Reise, 2000).
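Stepping back to the DIF screen described above, the following sketch illustrates its logic in a deliberately simplified form: responses are dichotomized, respondents are stratified into total-score bands, and a continuity-corrected Mantel-Haenszel χ² tests whether the sexes endorse the item at different rates within strata. WINSTEPS performed the actual screening; the dichotomization, the banding, and the synthetic data here are assumptions of the illustration.

```python
import numpy as np
from scipy.stats import chi2

def mh_dif(endorsed, group, strata):
    """Mantel-Haenszel chi-square for one dichotomized item (df = 1)."""
    observed = expected = variance = 0.0
    for s in np.unique(strata):
        m = strata == s
        g, e = group[m], endorsed[m]
        r1, r0 = int(np.sum(g == 1)), int(np.sum(g == 0))
        c1, n = int(np.sum(e)), int(m.sum())
        if n < 2 or min(r1, r0) == 0 or c1 in (0, n):
            continue  # this stratum carries no information about DIF
        observed += int(np.sum(e[g == 1]))  # focal-group endorsements
        expected += r1 * c1 / n
        variance += r1 * r0 * c1 * (n - c1) / (n ** 2 * (n - 1))
    stat = (abs(observed - expected) - 0.5) ** 2 / variance
    return stat, chi2.sf(stat, df=1)

rng = np.random.default_rng(7)
group = rng.integers(0, 2, 600)                    # 1 = women, 0 = men
trait = rng.normal(size=600)
endorsed = (trait + rng.normal(size=600) > 0).astype(int)
strata = np.digitize(trait, [-1.0, 0.0, 1.0])      # crude total-score bands
stat, p = mh_dif(endorsed, group, strata)
print(f"MH chi-square = {stat:.2f}, p = {p:.4f}")  # the screen flags p < .0005
```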
In our screen for difficulty, pairs of items with approximately equivalent difficulty within a subscale were examined, with one member of each pair discarded to shorten the subscale. After the first three screens, the 28 items remaining on Factors 3-6 provided a good spread of difficulty, so all of these items were retained. However, 14 items remained for Factor 1 and 19 for Factor 2.

For each factor, the most and least difficult items were retained to preserve the maximum possible bandwidth. Within these extremes, one member of each pair of items closely spaced in difficulty was eliminated until both subscales were reduced to 10 items each. Mallinckrodt, Miles, and Recabarren (2013) have provided a detailed practical guide for using IRT criteria in the final stages of item selection. Note that only six of the 10 items selected for Factor 1 and six of the 10 selected for Factor 2 were among the 10 highest loading items in the EFA, that is, would have been selected using traditional methods for shortening a subscale. The traditional "highest loading" method would have resulted in retention of two items for Factor 1 (m22 and m27) with highly significant sex DIF, and in a more constricted bandwidth of item difficulty than the IRT-assisted method used in this study.

Finally, Rasch model screening was applied to evaluate performance of the Likert response scale at the level of total subscale scores. Results suggested crossed Andrich category thresholds, that is, a failure of a pair of neighboring responses for the subscale as a whole to differentiate respondents. This, in turn, suggests that the response scale contains too many categories, some of which do not contribute useful variance to the total scale score (Bond & Fox, 2007). Collapsing categories with crossed Andrich thresholds reduces error variance. Thus, the responses strongly agree and moderately agree were combined for Factor 2 (coding 6 as 5). The 48-item EMC/RSEE is shown in Table 2. A fully formatted version of the scale is reproduced in the online supplemental materials. Scoring instructions for the EMC/RSEE are provided in the table notes.

To name the factors, we developed the preliminary labels shown in Table S4. To refine these labels, we engaged consultants who were student affairs professionals on our campus (n = 3) or national experts in diversity programming, all of whom are doctoral-level researchers (n = 6). They were randomly assigned to one of two teams, each presented with the clusters of items and given a different factor-labeling task. Team A was asked to select from among 12 choices the label that best fit each cluster of items. In contrast, each member of Team B was asked to generate his or her own label to fill in a blank following each set of items. Results of both tasks are shown in supplemental Table S4. Responses to Task A suggested that the preliminary label for Factor 1 was most in need of revision, because only three of the five consultants selected our label. Responses to the open-ended Task B suggested a new label for Factor 1 and more minor modifications to the other labels. Based on our consultants' responses, we selected the following labels: (a) Cultural Openness and Desire to Learn, (b) Resentment and Cultural Dominance, (c) Anxiety and Lack of Multicultural Self-Efficacy, (d) Empathic Perspective-Taking, (e) Awareness of Contemporary Racism and Privilege, and (f) Empathic Feeling and Acting as an Ally.
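As a concrete illustration of the response-scale adjustment described above, the short sketch below collapses the top two categories for Factor 2 before computing subscale totals. The array shape and example responses are hypothetical; only the recode (6 scored as 5) comes from the text.

```python
import numpy as np

def collapse_top_categories(factor2_responses):
    """Recode Factor 2 items so that 6 (strongly agree) is scored as 5."""
    recoded = np.asarray(factor2_responses).copy()
    recoded[recoded == 6] = 5
    return recoded

responses = np.array([[6, 5, 4, 6, 3, 2, 6, 5, 1, 4]])  # one respondent, 10 items
print(collapse_top_categories(responses).sum(axis=1))   # -> [39]
```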

Confirmatory Factor Analysis

Wirth and Edwards (2007) warned that ordered-categorical data (e.g., Likert-type items), when used as single-item indicators, violate many of the necessary assumptions of confirmatory factor analysis (CFA), including a linear relationship between observed and latent variables and a continuous normal distribution of outcomes. Floyd and Widaman (1995) recommended using item parcels in CFA instead of individual items, especially for subscales with more than five to eight items, due to problems introduced through correlated errors between pairs of items with similar wording.


Table 2
Everyday Multicultural Competencies/Revised Scale of Ethnocultural Empathy: Final Items Selected

Factor 1: Cultural Openness and Desire to Learn (10 items, α = .92)
m65. I think it is important to be educated about cultures and countries other than my own. (.77)
m50. I welcome the possibility that getting to know another culture might have a deep positive influence on me. (.75)
m61. I admire the beauty in other cultures. (.73)
m31. I would like to work in an organization where I get to work with individuals from diverse backgrounds. (.72)
m12. I would like to have dinner at someone's house who is from a different culture. (.70)
m07. I am interested in participating in various cultural activities on campus. (.62)
m06. Most Americans would be better off if they knew more about the cultures of other countries. (.69)
m73. A truly good education requires knowing how to communicate with someone from another culture. (.66)
m38. I welcome being strongly influenced by my contact with people from other cultures. (.63)
m40. I believe the United States is enhanced by other cultures. (.62)

Factor 2: Resentment and Cultural Dominance (10 items, α = .85)
m10. Members of minorities tend to overreact all the time. (.62)
m35. When in America, minorities should make an effort to merge into American culture. (.58)
m66. I do not understand why minority people need their own TV channels. (.58)
m54. I fail to understand why members from minority groups complain about being alienated. (.57)
s10. I feel irritated when people of different racial or ethnic backgrounds speak their language around me. (.57)
m71. Minorities get in to school easier and some get away with minimal effort. (.55)
m85. I am really worried about White people in the U.S. soon becoming a minority due to so many immigrants. (.55)
m01. I think American culture is the best culture. (.51)
m60. I think members of the minority blame White people too much for their misfortunes. (.51)
m18. People who talk with an accent should work harder to speak proper English. (.50)

Factor 3: Anxiety and Lack of Multicultural Self-Efficacy (7 items, α = .77)
m19. I feel uncomfortable when interacting with people from different cultures. (.61)
m14. I often find myself fearful of people of other races. (.57)
m28. I doubt that I can have a deep or strong friendship with people who are culturally different. (.52)
m04. I really don't know how to go about making friends with someone from a different culture. (.51)
m95. I am afraid that new cultural experiences might risk losing my own identity. (.50)
m75. I do not know how to find out what is going on in other countries. (.41)
m09.ᵇ I am not reluctant to work with others from different cultures in class activities or team projects. (.34)

Factor 4: Empathic Perspective-Taking (5 items, α = .69)
s19. It is easy for me to understand what it would feel like to be a person of another racial or ethnic background other than my own. (.55)
s28.ᵇ It is difficult for me to put myself in the shoes of someone who is racially and/or ethnically different from me. (.53)
s31.ᵇ It is difficult for me to relate to stories in which people talk about racial or ethnic discrimination they experience in their day to day lives. (.44)
s06. I can relate to the frustration that some people feel about having fewer opportunities due to their racial or ethnic backgrounds. (.37)
s02.ᵇ I don't know a lot of information about important social and political events of racial and ethnic groups other than my own. (.33)

Factor 5: Awareness of Contemporary Racism and Privilege (8 items, α = .79)
m39. The U.S. has a long way to go before everyone is truly treated equally. (.58)
m34. For two babies born with the same potential, in the U.S. today, in general it is still more difficult for a child of color to succeed than a White child. (.56)
s20. I can see how other racial or ethnic groups are systematically oppressed in our society. (.53)
m20. Today in the U.S. White people still have many important advantages compared to other ethnic groups. (.51)
s25. I am aware of how society differentially treats racial or ethnic groups other than my own. (.50)
s07. I am aware of institutional barriers (e.g., restricted opportunities for job promotion) that discriminate against racial or ethnic groups other than my own. (.46)
m1î.ᵇ Racism is mostly a thing of the past. (.45)
m25.ᵇ In America everyone has an equal opportunity for success. (.38)

Factor 6: Empathic Feeling and Acting as an Ally (8 items, α = .81)
s21.ᵇ I don't care if people make racist statements against other racial or ethnic groups. (.58)
s15. I get disturbed when other people experience misfortunes due to their racial or ethnic background. (.59)
s03. I am touched by movies or books about discrimination issues faced by racial or ethnic groups other than my own. (.57)
s26. I share the anger of people who are victims of hate crimes (e.g., intentional violence because of race or ethnicity). (.53)
s16.ᵇ I rarely think about the impact of a racist or ethnic joke on the feelings of people who are targeted. (.49)
s30. When I hear people make racist jokes, I tell them I am offended even though they are not referring to my racial or ethnic group. (.48)
s22. When I see people who come from a different racial or ethnic background succeed in the public arena, I share their pride. (.48)
s11. When I know my friends are treated unfairly because of their racial or ethnic backgrounds, I speak up for them. (.48)

Note. m = item generated specifically for this study; s = item taken from the Scale of Ethnocultural Empathy. N = 952. Values in parentheses are corrected item-total correlations (the correlation between an item and its subscale total score with that item omitted). ᵇ Reverse keyed.

Alhija and Wisenbaker (2006) conducted a simulation study comparing the two approaches and reported that item parcels performed better than individual indicators. The main argument in favor of single-item indicators in CFA is that parcels might mask multidimensionality (Worthington & Whittaker, 2006). Because IRT-based methods were used to screen items for multidimensionality, and because of the disadvantages of single-item indicators, in this study three multiple-item parcels were constructed as indicators of each factor. The CFA was conducted using Mplus Version 6.12 (Muthén & Muthén, 2010) on the holdout sample (n = 367). In addition to the χ² likelihood ratio test of exact model fit, we examined three approximate fit indices suggested by Kline (2011): the comparative fit index (CFI), the root-mean-square error of approximation (RMSEA), and the standardized root-mean-square residual (SRMR). Hu and Bentler (1999) suggested that CFI values close to .95 or higher, SRMR values close to .08 or lower, and RMSEA values close to .06 or lower indicate good model fit.
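The article does not spell out how items were assigned to the three parcels per factor, so the sketch below uses one common convention (rotating items across parcels in descending order of loading) purely as an assumption; the parcel means it produces would then serve as the CFA indicators fit in Mplus.

```python
import numpy as np

def make_parcels(items, loadings, n_parcels=3):
    """items: persons x k matrix for one factor; returns persons x n_parcels means."""
    order = np.argsort(np.asarray(loadings))[::-1]  # strongest item first
    members = [[] for _ in range(n_parcels)]
    for rank, item in enumerate(order):
        members[rank % n_parcels].append(item)      # rotate items across parcels
    return np.column_stack([items[:, m].mean(axis=1) for m in members])

rng = np.random.default_rng(3)
factor_items = rng.integers(1, 7, size=(367, 10))   # holdout sample, 10-item factor
loadings = rng.uniform(0.4, 0.8, size=10)           # placeholder loadings
print(make_parcels(factor_items, loadings).shape)   # -> (367, 3)
```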
