J Primary Prevent DOI 10.1007/s10935-014-0351-6

ORIGINAL PAPER

Differences in Observers’ and Teachers’ Fidelity Assessments

William B. Hansen · Melinda M. Pankratz · Dana C. Bishop



© Springer Science+Business Media New York 2014

Abstract  As evidence-based programs become disseminated, understanding the degree to which they are implemented with fidelity is crucial. This study tested the validity of fidelity ratings made by observers versus those made by teachers. We hypothesized that teachers’ reports about fidelity would have a positivity bias when compared to observers’ reports. Further, we hypothesized that there would generally be low correspondence between teachers’ and observers’ ratings of fidelity. We examined teachers’ and observers’ ratings as they related to mediating variables targeted for change by the intervention. Finally, we examined the role that years of teaching experience played in achieving fidelity. Eighteen teachers and four research assistants participated in this project as raters. Teachers made video recordings of their implementation of All Stars and completed fidelity assessment forms. Trained observers independently completed parallel forms for 215 sampled classroom sessions. Both teachers and observers rated adherence, quality of delivery, attendance, and participant engagement. Teachers made more positive fidelity ratings than did observers. With the exception of ratings for attendance, teachers and observers failed to agree on fidelity ratings. Observers’ ratings were significantly related to students’ pretest assessments of targeted program mediators, suggesting that it is easier to teach well when students are predisposed to program success. Teachers’ ratings were infrequently related to mediators, and when they were, the relationship was counterintuitive. Experienced teachers taught with greater fidelity than novice teachers. Although teachers’ self-assessments may be inflated and inaccurate, requiring teachers to complete fidelity forms may sensitize them to issues of fidelity. Assessing fidelity through observers’ ratings of video recordings has significant merit. As a long-term investment in improving prevention outcomes, policy makers should consider requiring both teacher and observer fidelity assessments as essential components of evaluation.

Keywords  Fidelity · Observer versus teacher · Assessment · Dissemination · Bias · Novice versus experienced teachers

W. B. Hansen (corresponding author) · D. C. Bishop
Tanglewood Research, 420-A Gallimore Dairy Road, Greensboro, NC 27409, USA
e-mail: [email protected]

M. M. Pankratz
Pacific Institute for Research and Evaluation, 1516 E. Franklin Street, Suite 200, Chapel Hill, NC 27514, USA

Introduction

With the increasing dissemination of evidence-based prevention programs, focus has shifted from assessing effectiveness under conditions of controlled delivery to ensuring that teachers deliver evidence-based interventions as effectively as possible. Many



researchers view poor fidelity as a potential threat to the effectiveness of the intervention. High quality implementation has been associated with improved outcomes (Abbott et al., 1998; Burke, Oats, Ringle, Fichtner, & DelGaudio, 2011; Durlak & DuPre, 2008; Dusenbury, Brannigan, Falco, & Hansen, 2003), while poor fidelity has been associated with diminished program effects (Pentz et al., 1990; Spoth, Guyll, Trudeau, & Goldberg-Lillehoj, 2002). Unfortunately, fidelity is rarely reported in the intervention research literature (Swanson, Wanzek, Haring, Ciullo, & McCulley, 2011). In a review of fidelity studies, prevention program outcomes were more positive when programs were fully implemented (Durlak & DuPre, 2008).

Dane and Schneider (1998) provided useful definitions of fidelity. These include: (1) adherence—the degree to which facilitators follow program methods and complete delivery as outlined in the curriculum, including the percentage of activity objectives and session goals covered; (2) quality of delivery—the degree to which the intervention is delivered in a manner likely to achieve its goals and objectives; (3) dosage—providing sufficient exposure to the program; and (4) participant engagement—the degree to which participants are appropriately involved in intervention tasks (see also Berkel, Mauricio, Schoenfelder, & Sandler, 2011; Dusenbury, Brannigan, Hansen, Walsh, & Falco, 2005; Giles et al., 2008; Hill & Owens, 2013).

While the goals of assessing fidelity are clear, assessments of fidelity generally rely on one of three methodologies: (1) self-reports from the facilitator, teacher, clinician, or intervention service provider; (2) observing and coding video or audio recordings; or (3) direct observation (e.g., trained raters visiting the site of intervention). In this paper we focus on the first two methods—teacher self-reports and observer ratings of video recordings.

Self-reports are relatively easy to collect because they require little time and effort (Ransford, Greenberg, Domitrovich, Small, & Jacobson, 2009). Teachers may be in a good position to provide information about the overall quality of implementation and to assess the quality of intervention materials, such as the clarity of the teacher’s manual and the usability of student materials. They also likely have a unique understanding of the challenges faced by program participants, which is helpful for understanding how well the intervention matches the needs of the target population and the setting in which it is delivered.
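To make Dane and Schneider's four dimensions concrete, a session-level fidelity assessment can be represented as a small record. The following is a minimal illustrative sketch in Python; the field names and scales are our own invention, not those of any published instrument:

```python
from dataclasses import dataclass

@dataclass
class SessionFidelity:
    """One rater's fidelity assessment of one delivered session.

    Fields mirror the four dimensions described by Dane and Schneider
    (1998): adherence, quality of delivery, dosage, and engagement.
    Names and scales here are hypothetical.
    """
    adherence: float   # proportion of prescribed activities attempted (0-1)
    quality: float     # rated quality of delivery (e.g., on a 1-5 scale)
    dosage: int        # exposure, e.g., number of participants present
    engagement: float  # rated participant engagement (e.g., on a 1-5 scale)

# Example: 90% of activities attempted, quality and engagement rated
# 4 of 5, and 22 students in attendance.
session = SessionFidelity(adherence=0.9, quality=4.0, dosage=22, engagement=4.0)
print(session)
```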


These strengths are balanced by known challenges. There is significant potential for a positivity bias among teachers (Adams, Soumerai, Lomas, & Ross-Degnan, 1999; Lillehoj, Griffin, & Spoth, 2004). Research on nutrition education by Resnicow and colleagues (McGraw et al., 2000; Resnicow et al., 1998) demonstrated that teachers’ self-reports do not provide valid measures of fidelity, and concluded that the optimal strategy for assessing fidelity is multiple observations by trained observers. Similar findings have emerged in a variety of other educational fields (Kunter & Baumert, 2006; Lawrenz, Huffman, & Robey, 2003; Mayer, 1999; Newfield, 1980; Wickstrom, Jones, LaFleur, & Witt, 1998; Wubbels, Brekelmans, & Hooymayers, 1992). The ability to assess oneself objectively is rare (Jones & Nisbett, 1971). There may also be motivating factors, such as the possibility that reports about fidelity will be used to evaluate performance (Donaldson & Grant-Vallone, 2002). Teachers also may not be able to judge their own performance according to an evaluator’s standards and may not always see implementation in the same way that observers do (Hansen & McNeal, 1999). When used to supplement other methods of data collection and when accompanied by guarantees of confidentiality, self-report likely has significant potential to provide useful and meaningful data, although a multi-method approach may yield the highest benefit (Ruiz-Primo, 2006).

In contrast, fidelity ratings of video recordings made by trained observers have numerous benefits. These include standardized procedures for training coders (Griffin, Mahadeo, Weinstein, & Botvin, 2006) and the ready ability to assess inter-rater reliability (Johnson et al., 2010; Pankratz et al., 2006). A unique feature of video recordings is that they can be paused and replayed if needed. Further, recordings offer the opportunity to make corrections to prior work should coding standards change. Current technology for making recordings with high-quality sound is relatively inexpensive, especially when compared to in-person observations that often require scheduling and travel. Digital data files containing images and sound are easy to transmit and to store securely.

There can be challenges with using recordings as the source of data. Camcorders, particularly when placed in a stationary location, may miss part of the classroom. If there are equipment or procedural failures, data are lost. The presence and operation of


recording equipment may distract program participants (Spoth, Guyll, Lillehoj, Redmond, & Greenberg, 2007). On balance, however, these challenges are small and typically do not pose a significant problem. As a result of these and other considerations, when conducting research there is a preference for data collected by observers (Domitrovich & Greenberg, 2000). Observer ratings are commonly considered to be the "gold standard" for fidelity measurement. Observers have been found to have high inter-rater reliability, superior to the agreement between observers and teachers (Hansen & McNeal, 1999).

The purpose of the current study is three-fold. First, building on prior research, we test the teacher positivity bias hypothesis; that is, we expect teachers to report uniformly more positive fidelity assessments than observers. Second, we examine the degree to which teachers’ and observers’ ratings are correlated. We hypothesize that associations between teachers’ and observers’ ratings of fidelity will be poor. Third, we examine the degree to which observers’ and teachers’ assessments of adherence, dosage (in this case attendance), quality of delivery, and student engagement are related to targeted program mediators. In addition, we examine the role that experience plays in achieving fidelity and in assessing fidelity.

Methods

Participants and Setting

Teachers and Classrooms

We invited 30 All Stars teachers to participate in this research. One teacher failed to respond to the invitation, six declined to participate outright, and five teachers who expressed initial interest failed in some way to begin their participation. This left 18 teachers who actively participated in the project. Sixteen teachers were female; sixteen were white and two were African-American. Teachers were from schools or prevention service agencies that purchased All Stars for implementation and taught the program as part of their regular duties; they were thus generally representative of those who use the program in a dissemination environment. Teachers’ experience teaching All Stars ranged from none to 13 years, with an average of 2.8 years.

In total, these 18 teachers delivered All Stars to 37 groups of students, teaching the curriculum to between one and five classrooms each. Students in each of these classes were included as participants. Each session of program delivery was video recorded, and all recordings were provided to the research team. Teachers, or in some cases their agencies, were reimbursed $250 for completing all aspects of the study.

Observers

Four observers rated fidelity of implementation from video recordings. Three observers held a master’s degree, and one held a bachelor’s degree. Prior to this study, observers had collectively coded over 300 similar video recordings, all of them completed as paired or multi-participant observations (Bishop et al., 2013; Hansen et al., 2013).

Intervention

All Stars is an evidence-based prevention program designed for delivery in sixth and seventh grades. The program includes 13 required sessions, each designed to be delivered within a 45–60 min class period. All Stars is designed to prevent or reduce adolescent substance use, early sexual behavior, and violence by changing five mediating variables (Hansen, 1996; Harrington, Giles, Hoyle, Feeney, & Youngbluth, 2001): (1) normative beliefs—correcting erroneous perceptions about the acceptability and prevalence of substance use among peers; (2) lifestyle incongruence—helping youth realize that substance use does not fit with their desired future; (3) commitment—building intentions to avoid substance use; (4) parental attentiveness—engaging parents in communicating expectations about risky behavior and promoting parental monitoring and supervision; and (5) bonding—increasing attachment to prosocial institutions. For the program to be effective, teachers must affect the mediators targeted by the curriculum (Gottfredson, Cross, Wilson, Rorie, & Connell, 2010; McNeal, Hansen, Harrington, & Giles, 2004). Recent research (Hansen et al., 2013) demonstrates that teachers have superior results teaching All Stars when they make few adaptations that are judged on



average to be positive. Making numerous or negative adaptations has a deleterious effect on program outcomes.

Measures

Students were pretested just prior to the program and posttested just after the program using a standardized All Stars survey. The student survey included an assessment of demographics (age, gender, and ethnicity), behaviors (dichotomous measures of past 30-day alcohol consumption, drunkenness, cigarette, smokeless tobacco, marijuana, and inhalant use), and five targeted mediators: commitment to avoid drug use (8 items; α = .84), lifestyle incongruence (6 items; α = .66), normative beliefs (8 items; α = .82), bonding to the school or organization (7 items; α = .78), and positive parental attentiveness (7 items; α = .78).

Teachers and observers completed forms with parallel sets of ratings about each session. The language of the two forms was nearly identical, adapted only to refer to either a self-reflection on the part of the teacher or a perception on the part of the observer. Topics addressed adherence, quality of delivery, attendance, and student engagement.

Adherence

Two sets of items assessed adherence. Each session included multiple activities. The first set of ratings, which assessed whether or not each activity had been attempted, yielded ratings of the proportion of activities attempted. A second, single rating item assessed how much of a session had been skipped or omitted. Possible responses to this item were 0–5 %, 5–15 %, 15–25 %, and 25 % or more.

Quality of Delivery

Five items measured quality of delivery. The first item set focused on the achievement of objectives. Each session included multiple objectives, each rated on a 1–4 scale based on how well it appeared to have been achieved (poorly, adequately, very well, or exceptionally, respectively). The combined average for each session represented the degree to which objectives had been achieved. There were also four items that assessed general quality of teaching: (1) a rating of classroom management, (2) a rating of the general quality of teaching, (3) a rating of the overall potential for the session to be effective at changing targeted mediators and behaviors, and (4) an estimate of the teacher’s understanding of program concepts. These four measures used a 1–5 rating scale, with 1 reflecting poor quality and 5 reflecting high quality. Alpha coefficients for observers’ and teachers’ ratings of these combined items were .879 and .849, respectively.

Attendance

The measure of attendance asked for an estimate of the number of students present during each session. The number of students in attendance was rated using broad groupings (1–5, 6–10, 11–15, 16–20, 21–25, and 25–30).

Student Engagement

A single general rating of student engagement was made for each session using a 5-point scale, with 1 reflecting poor engagement and 5 reflecting high engagement.

Procedures

Teachers were provided with a camcorder and tripod in order to record their teaching. They were instructed to set up the camcorder in the rear of their classroom with settings on a wide pan, and to record each of the 13 sessions of All Stars. Camcorders used SDHC media for storing recordings. For each class they taught, teachers were asked to complete a fidelity assessment immediately after teaching. Teachers were provided with a 20-min webinar on how to retrieve and complete fidelity assessment forms. The webinar reviewed the basics of completing the form, including what each rating prompt referred to and was intended to measure. Examples were provided.

Observers were provided with a detailed coding manual. The manual included common sense definitions and negative and positive examples as points of reference for making coding judgments (see Table 1 for an example). Details of coding criteria and the inter-rater reliability of the instrument are outlined elsewhere (Bishop et al., 2013).

Teachers returned 463 of the 481 videos that were expected (96.3 %). Of these, 215 were selected at random to be rated by observers.
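The internal-consistency coefficients reported in the Measures section above (e.g., α = .879 and .849 for the combined quality-of-teaching items) follow the standard Cronbach's alpha formula, which can be computed from an items-by-respondents score matrix. A minimal sketch follows; the example ratings are fabricated for illustration:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (n_respondents, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Four hypothetical 1-5 ratings of general teaching quality for six sessions.
ratings = np.array([
    [4, 4, 5, 4],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 3],
    [4, 5, 4, 4],
    [3, 4, 3, 3],
])
print(round(cronbach_alpha(ratings), 3))
```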

Table 1  Example of instructions provided to observers about quality of teaching

What to rate: Judge the quality of teaching. This should reflect your overall impression of how well the teacher delivered the curriculum. Keep in mind the teacher’s organization, delivery, and the methods used to teach. Try to determine if the style of teaching translated correctly to the students and if the teacher communicated intended messages.

Negative examples:
- The teacher seemed bored or put out
- The teacher seemed disinterested, teaching the material only because required to do so
- The teacher was lost, constantly reading from the manual, and made a lot of mistakes in delivering the curriculum
- The teacher spent time on irrelevant issues to the extent of excluding what was important to cover

Positive examples:
- The teacher was enthusiastic about the topic and took ownership of the material
- The teacher was prepared and demonstrated familiarity with the curriculum
- The teacher managed time well and ensured all main points were covered and understood by students
- The teacher had insightful ways to motivate and engage students

Six observations for each of the 13 sessions in All Stars (N = 78) were completed by paired observers, with each observer making independent assessments and then meeting together to develop a single assessment that was used for analyses. The remaining 137 selected recordings were coded by a single observer. Analysis of paired observations showed high inter-rater reliability: intraclass correlation coefficients (ICC) were above .70 for all fidelity items with the exception of observers’ ratings of student engagement (ICC = .554).
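The paper does not state which ICC variant was used. One common choice for paired raters scoring the same sessions is ICC(2,1) (two-way random effects, absolute agreement, single rater), sketched below from its ANOVA definition; the data are hypothetical:

```python
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: (n_targets, n_raters) matrix, e.g., sessions x observers.
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-session means
    col_means = scores.mean(axis=0)   # per-rater means
    ss_total = ((scores - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Two observers' hypothetical 1-5 engagement ratings for eight sessions.
pairs = np.array([[4, 4], [3, 4], [5, 5], [2, 3], [4, 5], [3, 3], [5, 4], [2, 2]])
print(round(icc_2_1(pairs), 3))
```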

Human Subjects Protection

All procedures were reviewed and approved by relevant institutional review boards. Teachers provided informed consent by way of a signed contract. Teachers were provided with informed consent materials to be sent to parents. Students’ identities were never associated with survey data, and the research team had no access to this information.

Results

Test of Bias in Fidelity Assessment

We examined each fidelity measure using paired t tests to test for a teacher positivity bias. Teachers reported significantly higher scores than observers for eight of the nine fidelity measures (see Table 2): proportion of steps skipped, percent of activities taught, quality of objectives achieved, attendance, ratings of teacher understanding, ratings of potential impact, degree of participant engagement, and overall quality of teaching. Teachers consistently rated their performance more favorably than observers did. The difference between observers’ and teachers’ reports was non-significant only for classroom management.

Details about the differences are worth noting. Teachers reported skipping 25 % or more of a session’s steps only 13 times, whereas observers noted 73 occasions when this occurred. At the other extreme, teachers felt they had made minimal skips (0–5 %) 146 times; observers saw teachers doing so only 31 times. There was relatively modest agreement between observers and teachers in the number of objectives that received "poor" ratings (19 and 25 of 776 ratings, respectively). On the other hand, teachers gave many more "exceptional" ratings (164) than observers (102), and many more "very well" ratings (457 versus 284). Teachers made 108 ratings of the highest possible value when rating their understanding of the program; observers gave only 62 such ratings.

It is also worth noting that, with the exception of documentation of activities taught and estimates of attendance, all correlations between teacher and observer ratings were exceptionally small, ranging from -0.057 to 0.284 with a mean of 0.117. This suggests that for such characteristics as quality of achieving objectives, teaching quality, engagement, teachers’ understanding, and the potential of the lesson to affect behavior, there is practically no relationship between teachers’ and observers’ ratings.
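The comparisons summarized in Table 2 are paired t tests over the same sessions rated by both parties. A minimal sketch of this style of analysis with fabricated ratings follows; the effect-size formula shown is one common paired-data convention, and the authors' exact computation of Cohen's d is not specified:

```python
import numpy as np
from scipy import stats

# Hypothetical paired ratings of the same sessions (teacher vs. observer).
teacher = np.array([4.2, 3.9, 4.5, 4.0, 3.8, 4.4, 4.1, 3.7])
observer = np.array([3.6, 3.4, 4.1, 3.3, 3.5, 3.9, 3.6, 3.0])

t_stat, p_value = stats.ttest_rel(teacher, observer)

# Cohen's d for paired data: mean difference over SD of the differences.
diff = teacher - observer
d = diff.mean() / diff.std(ddof=1)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}, d = {d:.3f}")
```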


Table 2  Comparisons between observers’ (O) and teachers’ (T) ratings of fidelity items

Fidelity item            Rater   Mean     SD     Cohen's d   Paired t   df    p       ICC
Activities taught        O       86.5 %   0.12   1.504       4.514      36    0.000   0.544
                         T       93.8 %   0.08
Steps skipped            O       2.28     1.10   2.165       15.496     205   0.000   0.284
                         T       3.54     0.84
Quality of objectives    O       2.43     0.95   0.995       13.849     775   0.000   0.026
                         T       2.98     0.71
Attendance               O       3.66     1.06   0.365       2.637      209   0.009   0.692
                         T       3.82     1.14
Engagement               O       3.73     0.88   0.768       5.602      213   0.000   0.125
                         T       4.16     0.82
Classroom management     O       3.99     1.05   0.203       1.48       213   0.140   0.242
                         T       4.10     0.82
Teaching quality         O       3.59     1.08   0.555       4.051      213   0.000   0.026
                         T       3.97     0.84
Teacher understanding    O       3.76     1.14   0.938       6.828      212   0.000   -0.057
                         T       4.40     0.72
Potential impact         O       3.31     1.13   0.916       6.683      213   0.000   0.171
                         T       3.90     0.85

Scales: steps skipped, 1 = 25 % or more, 2 = 15–25 %, 3 = 5–15 %, 4 = 0–5 %; quality of objectives, 1 = poor, 2 = adequate, 3 = very well, 4 = exceptional; attendance, 3 = 16–20 students, 4 = 21–25 students; all remaining items, 1 = low, 5 = high. Ns vary based on whether classrooms (37), sessions (215), or objectives (776) were used as the basis for calculating results.

Attendance is a measure that allows for somewhat objective counting; differences between teachers and observers are likely attributable to camera placement. Whether or not activities were taught is also a less subjective issue, although we would have hoped for an even greater correspondence between teachers’ and observers’ ratings than we observed.

Agreement Between Teacher and Observer Fidelity Assessments

To examine the possibility that teachers and observers made ratings based on different criteria, we calculated ICC for each measure (see Table 2). ICC takes into account how closely teachers and observers agree with each other; coefficients of at least .70 indicate sound inter-rater reliability. For ratings of the number of students in attendance, there was reasonably high concordance between observers and teachers, with an ICC of .69. In contrast, there was little correspondence between observers’ and teachers’ ratings for the remaining fidelity assessments. For the quality with which student-centered objectives were achieved, the overall quality of teaching, and the overall quality of teacher understanding, there was essentially no relationship between teachers’ and observers’ judgments. For the percent of steps skipped and classroom management, there were only modest correlations between observers’ and teachers’ ratings.

Relationship Between Fidelity Assessments and Student Survey Measures

We examined the relationship between observers’ and teachers’ ratings of fidelity and targeted mediators. Our original intent was to examine the relationship between fidelity ratings and classroom-level averages of students’ pretest–posttest change scores on targeted mediators. However, as can be seen in Table 3, 35.6 % of pretest correlations between students’ mediators and observers’ fidelity ratings were significant or nearly significant. (Because of the small number of classrooms, we set the Type I error rate to .10.) This situation made interpreting change scores untenable. Therefore, analyses were conducted only with pretest student data, which changed the goals of our analysis.
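The analyses that follow are simple class-level Pearson correlations between a fidelity rating and a pretest mediator mean, flagged at the study's .10 Type I error rate. A minimal sketch with simulated data for 37 classrooms (the variable names and generating model are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical class-level data for 37 classrooms: one observer fidelity
# rating and one pretest mediator mean per class.
fidelity = rng.normal(3.5, 0.8, size=37)
pretest_mediator = 0.3 * fidelity + rng.normal(0, 1, size=37)

r, p = stats.pearsonr(fidelity, pretest_mediator)

# Flag results at alpha = .10, as the authors did given only 37 classes.
flag = "significant" if p < 0.10 else "ns"
print(f"r = {r:.3f}, p = {p:.3f} ({flag})")
```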

Table 3  Correlations between fidelity indicators and pretest targeted mediators for observers (O) and teachers (T)

Cell entries are r, with p in parentheses when significant at the .10 level; ns = not significant; – = not available. Columns: CM = commitment, LI = lifestyle incongruence, NB = normative beliefs, BD = bonding, PA = parental attentiveness.

Fidelity indicator        Rater   CM             LI             NB              BD             PA
Activities taught         O       -0.075 ns      -0.090 ns      0.194 ns        0.115 ns       -0.152 ns
                          T       -0.052 ns      0.027 ns       -0.082 ns       -0.204 ns      -0.056 ns
Percent skipped           O       0.104 ns       0.015 ns       0.239 ns        0.188 ns       0.073 ns
                          T       -0.223 ns      -0.237 ns      -0.293 (<.10)   -0.212 ns      -0.300 (.083)
Quality of objectives     O       0.137 ns       0.124 ns       0.401 (.015)    0.279 (<.10)   0.143 ns
                          T       -0.139 ns      -0.105 ns      -0.329 (.050)   -0.050 ns      -0.123 ns
Attendance                O       0.228 ns       0.222 ns       0.355 (.034)    -0.078 ns      0.191 ns
                          T       0.268 ns       0.279 (.099)   0.288 (.089)    -0.158 ns      0.193 ns
Engagement                O       0.316 (.060)   0.399 (.016)   0.444 (.007)    0.423 (.010)   0.359 (.032)
                          T       -0.029 ns      0.068 ns       –               –              –
Classroom management      O       0.283 (.094)   0.348 (.037)   –               –              –
                          T       0.085 ns       0.185 ns       –               –              –
Quality of teaching       O       0.130 ns       0.171 ns       0.379 (.023)    0.338 (.058)   –
                          T       -0.100 ns      -0.047 ns      -0.216 ns       -0.175 ns      –
Teacher understanding     O       0.014 ns       0.097 ns       0.247 ns        0.236 ns       0.026 ns
                          T       -0.213 ns      -0.105 ns      -0.258 ns       -0.075 ns      -0.207 ns
Potential effectiveness   O       0.172 ns       0.202 ns       0.371 (.026)    0.319 (<.10)   0.295 (.081)
                          T       -0.070 ns      0.056 ns       -0.092 ns       –              -0.053 ns

Correlations are based on class-level data, N = 37.

We sought to understand the implications of having a significant correlation between pretest mediating variable scores and subsequently observed fidelity ratings. The primary post hoc hypothesis we considered was that it might be easier for teachers to implement the program when students are predisposed to program success.

Pretest normative beliefs were most frequently related to measures of fidelity. Of the 18 analyses involving this variable, eight were significant. Of these, five were related to observers’ ratings (the quality with which student-centered objectives were achieved, attendance, student engagement, overall quality of teaching, and judged potential for effectiveness). Better teacher performance as judged by observers was associated with more ideal student beliefs about how common and acceptable substance use was to the peer group. Three teacher-rated performance measures (percent of activities skipped, quality with which student-centered objectives were achieved, and attendance) were also related to pretest normative beliefs. With the exception of attendance, higher self-rated teacher performance was associated with less ideal student beliefs about how common and acceptable substance use is in one’s peer group.

Students’ pretest reports about bonding were related to fidelity only for observers. Classes of students who started the program with better bonding scores were observed to have teachers who did a better job at achieving student-centered objectives. These classes also had better classroom engagement, better quality of teaching, and greater potential to change targeted mediators and behaviors. The same relationship can be seen for observers’ ratings of student engagement and classroom management with commitment to not use substances, lifestyle incongruence, and parental attentiveness. Students in classes that scored better at pretest had teachers whose teaching was rated as better.

In contrast, many of the relationships between teachers’ ratings and students’ pretest scores ran in the opposite direction. Teachers’ ratings of percent of steps skipped and of the quality with which student-centered objectives were achieved were inversely correlated with students’ normative beliefs. Based on teachers’ ratings, when they skipped more steps, their students had better pretest normative beliefs and more positive parental attention.


Only pretest associations between teachers’ ratings of attendance and students’ reports of normative beliefs and lifestyle incongruence were in the expected direction.

Table 4  Comparison of observers’ (O) and teachers’ (T) ratings for teachers with low and high experience teaching All Stars

Fidelity item             Rater   Mean, low exp.   Mean, high exp.   Independent t   p
Steps skipped             O       2.10             2.76              -3.07           0.004
                          T       3.39             3.95              -2.97           0.005
Quality of objectives     O       2.30             2.80              -2.79           0.008
                          T       2.85             3.27              -2.70           0.011
Attendance                O       3.58             4.05              -1.54           0.132
                          T       3.70             4.28              -1.59           0.120
Engagement                O       3.62             4.03              -2.19           0.036
                          T       4.08             4.34              -1.34           0.190
Classroom management      O       3.90             4.27              -1.37           0.181
                          T       4.05             4.26              -1.34           0.188
Quality of teaching       O       3.48             4.02              -1.99           0.054
                          T       3.76             4.41              -2.97           0.005
Teacher understanding     O       3.60             4.19              -2.16           0.038
                          T       4.27             4.62              -1.73           0.092
Potential effectiveness   O       3.13             3.82              -2.64           0.012
                          T       3.75             4.25              -2.33           0.026

Scales: steps skipped, 1 = 25 % or more, 2 = 15–25 %, 3 = 5–15 %, 4 = 0–5 %; quality of objectives, 1 = poor, 2 = adequate, 3 = very well, 4 = exceptional; attendance, 3 = 16–20 students, 4 = 21–25 students; all remaining items, 1 = low, 5 = high. Analyses were completed at the classroom level.

The Role of Experience

Teachers involved in this project had varying numbers of years of experience with All Stars. We dichotomized teachers into two groups: those with more than 2 years of experience teaching All Stars (12 classes) and those with up to 2 years (25 classes). We compared the eight fidelity ratings for each group from observers and teachers (see Table 4). For both sets of raters, teachers with less experience received lower ratings. Relative to more experienced teachers, those with less experience reported skipping more steps, achieved lower quality in meeting program objectives, had lower teaching quality, and saw their efforts as having less potential for effectiveness. These findings were corroborated by observers who, in addition to the fidelity outcomes listed for teachers’ ratings, noted that less experienced teachers were less engaging and had a poorer understanding of the program. The difference in attendance was not significant, although newer teachers had slightly smaller class sizes.
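The group comparisons in Table 4 are independent-samples t tests on class-level mean ratings, split at 2 years of All Stars experience. A minimal sketch with fabricated class means follows; the study's exact variance assumptions are not stated, so a standard pooled-variance test is shown:

```python
import numpy as np
from scipy import stats

# Hypothetical class-level quality-of-teaching ratings, split by whether
# the teacher had more than 2 years of experience with All Stars.
low_experience = np.array([3.2, 3.6, 3.4, 3.9, 3.1, 3.5, 3.8, 3.3])
high_experience = np.array([4.0, 4.3, 3.9, 4.4, 4.1, 4.2])

# Pooled-variance (equal_var=True) independent-samples t test.
t_stat, p_value = stats.ttest_ind(low_experience, high_experience)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```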

Discussion

The Quality of Teacher and Observer Ratings

As evidence-based programs become disseminated, there will be increased emphasis on understanding the fidelity with which these programs are implemented. Particularly important will be understanding the quality of program delivery, including participant engagement, the degree to which participant-centered objectives have been met, and the quality of teaching and of teacher understanding.

Our results suggest that teachers’ self-reports will nearly always differ from observers’ reports. It would appear from our results that teachers do, in fact, have a positivity bias in reporting about their own fidelity performance. While research reports that teachers perceive their implementation of substance use prevention programs as having high quality of delivery and participant responsiveness (Ennett et al., 2011), such findings may not be a reliable indicator of their actual performance in the classroom. Among the most obvious factors that may influence this is the desire to avoid negative performance evaluations (Donaldson & Grant-Vallone, 2002). However, the


lack of correspondence between teachers’ and observers’ ratings for practically all measures of fidelity suggests that teachers and observers may also refer to different standards when making fidelity ratings. This is entirely reasonable to expect. Our observers had previously completed hundreds of hours of coding using identical methods and had achieved high levels of inter-rater reliability (Bishop et al., 2013). They also had access to a detailed coding manual that provided definitions and examples to guide their ratings. Further, observers had the benefit of video recordings, which could be replayed should doubts about any rating arise. Thus, the differences observed can be attributed to many factors, including experience, training, and resources.

Even though the positivity bias was present in both newer and more experienced teachers, it was nonetheless gratifying that the fidelity self-ratings for these two groups mirrored the overall pattern that resulted when observers made ratings. This suggests that, at least at some level, newer teachers may understand that their performance is not as good as it should be. Further, experience does matter: with repeated cycles of teaching, teachers do learn to perform prevention tasks in ways that more clearly reflect the intentions of a curriculum. This latter finding is important in that many, if not all, randomized controlled trials testing the effectiveness of programs assign teachers with limited experience to conditions. In such a situation, fidelity (and thus, in all likelihood, program effects) can be expected a priori to suffer.

Teachers were both actors and novice raters who had limited time to complete fidelity rating tasks. The tasks most competently completed by teachers were documenting which activities were taught and class attendance. Thus, if limited to these two aspects of fidelity, teachers’ reports may be trusted to be fairly accurate. Both of these indicators are relatively concrete in nature and differ markedly from the more subjective judgment measures where bias is more likely to be seen. As such, it is appropriate and valuable to ask for self-reports as long as it is understood that reports may include some positivity bias.

It should be noted that the training and technical support available to teachers was designed with two factors in mind. First, we desired to place a relatively low burden on teachers. Second, we were cognizant of

the prior research literature, nearly all of which has provided teachers with only minimal training in how to complete fidelity forms. Thus, while a just criticism of the differences observed between teachers and observers might relate to their respective experience and access to detailed instructions, teachers received no less instruction than is typical. Without examples of more extraordinary cases, this approach was thought to be justified.

Researchers and policy makers should also consider that collecting fidelity data from teachers may yield two benefits unrelated to the quality of the data. The first is that requiring an assessment of fidelity communicates a not-so-subtle message about the value of adherence and quality of delivery. Reporting about fidelity is a form of general invitation to teachers to take responsibility for what they do. Fagan and Mihalic (2003) reached a similar conclusion in the evaluation of Life Skills Training. Indeed, under such circumstances, a positivity bias may be not only acceptable, but desired. Teachers should feel that they are succeeding; it is clearly better if teachers feel that they are succeeding when they actually perform well. Considered in the long term, and in conjunction with feedback from an observer, the act of collecting fidelity forms may be used to encourage teachers to improve their performance.

The second clear benefit of collecting fidelity forms from teachers, even when a positivity bias is present, is that the specifics of fidelity assessment forms focus attention on what is important (Taylor, 1994). Our forms included an inherent review of activities to be covered, objectives to be achieved, student engagement to be encouraged, understanding to be mastered, and outcomes to be achieved. In other words, the act of completing fidelity forms may itself be a useful teaching device.

Despite teachers’ performance in this study, it is possible that they could be trained to make judgments about their own fidelity and that such training may result in improved performance in completing routine assessments. Teachers might be provided with high quality training to understand the concepts that underlie the rating schema. If a cohort of teachers works together, they may be asked, whenever possible, to complete ratings of each other in conjunction with ratings completed by observers. It is also



possible that a different format for completing ratings, such as revised prompts and response categories and clear examples, may yield improved results. Future research should consider these alternatives.

The Relationship Between Fidelity and Student Mediators

Student pretest measures showed a definable relationship with several subsequent observer assessments of fidelity. The quality with which student-centered objectives were achieved, student engagement, overall quality of teaching, and the potential of the session to have an impact on targeted mediators and behaviors were all predicted by multiple mediators measured at pretest. That these effects were observed using only pretest data was unanticipated. There were two consequences of these findings. First, any attempt to assess the role of fidelity in predicting changes in mediators could not be completed because pretest mediating variable scores were confounded with fidelity. Second, because only post hoc explanations can be provided, our explanations are suggestive rather than conclusive. However, we believe such results are worthy of discussion because, if similar findings are subsequently observed elsewhere, there may yet be value in considering why such relationships exist.

There are at least two possible explanations for these relationships. First, some assignment mechanism may be at work whereby better teachers are routinely assigned to teach better classes. A second, more promising explanation is that, having been assigned to a group of students, the quality of subsequent teaching may depend in large part on the responsiveness of the audience. That is, teachers who find themselves in the company of a receptive audience of students may find it easier to implement a program as intended. Such a situation may be most conducive to promoting high quality of delivery.

These explanations are conjectural on our part; if they are validated by subsequent research, they may have important implications for the delivery of evidence-based programs. At present, the dictum guiding implementation is simply to maintain full fidelity. However, systematic treatment matching, or tailoring how interventions are delivered based on students’ pretest status, may ultimately be required if programs are to succeed in real world


settings. There are potentially profound changes that could be made to interventions to meet the needs of higher risk students. When initial scores are worse than average, alternative methods may be needed both to match program messages and content and to improve quality of delivery. These may include: (1) postponing intervention until rapport and mutual respect are established between teachers and their students, (2) ensuring that initial activities are highly engaging, (3) progressing at a different pace in the delivery of the intervention, (4) developing alternative lessons and activities that match the status of at-risk students, and (5) spending extra effort to build parental and administrative support. Such strategies would both increase the readiness of students to receive messages and enhance the capacity of teachers to meet the needs of specific student groups. Whether or not future studies confirm that fidelity is affected by pre-existing student characteristics, matching intervention to pretest status will be important for program developers and providers to consider. Alternative intervention strategies will need to be developed and tested. Altering implementation strategies to match student characteristics will also require rapid access to pretest survey results. These future directions will require new research strategies.

Limitations

The findings presented here were focused solely on All Stars. It would be wise to replicate these analyses with other prevention programs. Fidelity research is persistently limited by small numbers of classrooms and teachers; this project is no exception. It will be valuable to increase sample size and, if possible, create a cumulative set of teacher and observer fidelity assessments from which meta-analyses can be completed. It may also be valuable to consider the use of consumer assessments, in this case from students, which may provide a more direct measure of key program delivery characteristics, such as the dose of the curriculum received and self-reported engagement. Such information may be used to triangulate results when observer and teacher ratings are discordant. There are also a number of definable limitations to generalizability based on the sample. These include a variety of unmeasured variables, such as the socioeconomic status of schools and school populations, culture, urbanicity, and other risk factors that may have direct


or interactive effects on outcomes. While we believe that these analyses were interesting, in the end, answering questions about the validity of assessments will require pretest–posttest change scores.

Conclusion

The purpose of this study was (1) to test the teacher positivity bias hypothesis, (2) to test the degree to which teachers’ and observers’ ratings are correlated, and (3) to examine the degree to which observers’ and teachers’ assessments are related to program-specific outcomes. Results confirm that teachers’ evaluations of their performance were more positive than those of observers for all measures assessed. Teachers were most similar to observers on concrete measures, such as which activities they taught and how many students attended their instruction. With the exception of these two qualities, teachers’ and observers’ ratings of fidelity were divergent. Each group made assessments from a different perspective (actor versus observer) and had different training, experience, and resources to draw on, all of which contributed to this finding. When the option exists, using trained observers to rate video recordings will provide superior fidelity ratings. Finally, observers’ ratings of fidelity revealed that, even before the program was delivered, teachers who would ultimately be judged to have better quality of delivery already had students who scored better on targeted mediating variables.

References

Abbott, R. D., O’Donnell, J., Hawkins, J. D., Hill, K. G., Kosterman, R., & Catalano, R. F. (1998). Changing teaching practices to promote achievement and bonding to school. American Journal of Orthopsychiatry, 68, 542–552.

Adams, A. S., Soumerai, S. B., Lomas, J., & Ross-Degnan, D. (1999). Evidence of self-report bias in assessing adherence to guidelines. International Journal for Quality in Health Care, 11, 187–192.

Berkel, C., Mauricio, A. M., Schoenfelder, E., & Sandler, I. N. (2011). Putting the pieces together: An integrated model of program implementation. Prevention Science, 12(1), 23–33.

Bishop, D. C., Pankratz, M. M., Hansen, W. B., Albritton, J., Albritton, L., & Strack, J. (2013). Measuring fidelity and adaptation: Reliability of an instrument for school-based prevention programs. Evaluation and the Health Professions. doi:10.1177/0163278713476882

Burke, R. V., Oats, R. G., Ringle, J. L., Fichtner, L. O. N., & DelGaudio, M. B. (2011). Implementation of a classroom management program with urban elementary schools in low-income neighborhoods: Does program fidelity affect student behavior and academic outcomes? Journal of Education for Students Placed at Risk, 16, 201–218.

Dane, A. V., & Schneider, B. H. (1998). Program integrity in primary and early secondary prevention: Are implementation effects out of control? Clinical Psychology Review, 18, 23–45.

Domitrovich, C. E., & Greenberg, M. T. (2000). The study of implementation: Current findings from effective programs that prevent mental disorders in school-aged children. Journal of Educational and Psychological Consultation, 11(2), 193–221.

Donaldson, S. I., & Grant-Vallone, E. J. (2002). Understanding self-report bias in organizational behavior research. Journal of Business and Psychology, 17(2), 245–260.

Durlak, J., & DuPre, E. (2008). Implementation matters: A review of research on the influence of implementation on program outcomes and the factors affecting implementation. American Journal of Community Psychology, 41, 327–350.

Dusenbury, L., Brannigan, R., Falco, M., & Hansen, W. B. (2003). A review of research on fidelity of implementation: Implications for drug abuse prevention in school settings. Health Education Research, 18, 237–256.

Dusenbury, L., Brannigan, R., Hansen, W. B., Walsh, J., & Falco, M. (2005). Quality of implementation: Developing measures crucial to understanding the diffusion of preventive interventions. Health Education Research, 20, 308–313.

Ennett, S. T., Haws, S., Ringwalt, C. L., Vincus, A. A., Hanley, S., Bowling, J. M., et al. (2011). Evidence-based practice in school substance use prevention: Fidelity of implementation under real-world conditions. Health Education Research, 26(2), 361–371.

Fagan, A. A., & Mihalic, S. (2003). Strategies for enhancing the adoption of school-based prevention programs: Lessons learned from the Blueprints for Violence Prevention replications of the Life Skills Training program. Journal of Community Psychology, 31(3), 235–253.

Giles, S. M., Jackson-Newsom, J., Pankratz, M. M., Hansen, W. B., Ringwalt, C. L., & Dusenbury, L. (2008). Measuring quality of delivery in a substance use prevention program. Journal of Primary Prevention, 29, 489–501.

Gottfredson, D. C., Cross, A., Wilson, D., Rorie, M., & Connell, N. (2010). An experimental evaluation of the All Stars prevention curriculum in a community after school setting. Prevention Science, 11(2), 142–154.

Griffin, K. W., Mahadeo, M., Weinstein, J., & Botvin, G. J. (2006). Program implementation fidelity and substance use outcomes among middle school students in a drug abuse prevention program. Salud y Drogas, 6, 7–26.

Hansen, W. B. (1996). Pilot test results comparing the All Stars program with seventh grade D.A.R.E.: Program integrity and mediating variable analysis. Substance Use and Misuse, 31(10), 1359–1377.

Hansen, W. B., & McNeal, R. B. (1999). Drug education practice: Results of an observational study. Health Education Research, 14, 85–97.

Hansen, W. B., Pankratz, M. M., Dusenbury, L., Giles, S. M., Bishop, D. C., Albritton, J., et al. (2013). Styles of adaptation: The impact of frequency and valence of adaptation on preventing substance use. Health Education, 113(4), 345–363. doi:10.1108/09654281311329268

Harrington, N. G., Giles, S. M., Hoyle, R. H., Feeney, G. J., & Youngbluth, S. C. (2001). Evaluation of the All Stars character education and problem behavior prevention program: Effects on mediator and outcome variables for middle school students. Health Education & Behavior, 28, 533–546.

Hill, L. G., & Owens, R. W. (2013). Component analysis of adherence in a family intervention. Health Education, 113(4). doi:10.1108/09654281311329222

Johnson, K., Ogilvie, K., Collins, D., Shamblen, S., Dirks, L., Ringwalt, C., et al. (2010). Studying implementation quality of a school-based prevention curriculum in frontier Alaska: Application of video-recorded observations and expert panel judgment. Prevention Science, 11, 275–286.

Jones, E. E., & Nisbett, R. E. (1971). The actor and the observer: Divergent perceptions of the causes of behavior. New York: General Learning Press.

Kunter, M., & Baumert, J. (2006). Who is the expert? Construct and criteria validity of student and teacher ratings of instruction. Learning Environments Research, 9, 231–251.

Lawrenz, F., Huffman, D., & Robey, J. (2003). Relationships among student, teacher and observer perceptions of science classrooms and student achievement. International Journal of Science Education, 25, 409–420.

Lillehoj, C. J., Griffin, K. W., & Spoth, R. (2004). Program provider and observer ratings of school-based preventive intervention implementation: Agreement and relation to youth outcomes. Health Education & Behavior, 31, 242–257.

Mayer, D. P. (1999). Measuring instructional practice: Can policy makers trust survey data? Educational Evaluation and Policy Analysis, 21, 29–45.

McGraw, S. A., Sellers, D., Stone, E., Resnicow, K. A., Kuester, S., Fridinger, F., et al. (2000). Measuring implementation of school programs and policies to promote healthy eating and physical activity among youth. Preventive Medicine, 31, S86–S97.

McNeal, R. B., Hansen, W. B., Harrington, N. G., & Giles, S. M. (2004). How All Stars works: An examination of program effects on mediating variables. Health Education & Behavior, 31(2), 165–178.

Newfield, J. (1980). Accuracy of teacher reports: Reports and observations of specific classroom behaviors. The Journal of Educational Research, 74(2), 78–82.

Pankratz, M. M., Jackson-Newsom, J., Giles, S. M., Ringwalt, C. L., Bliss, K., & Bell, M. L. (2006). Implementation fidelity in a teacher-led alcohol use prevention curriculum. Journal of Drug Education, 36, 317–333.

Pentz, M. A., Trebow, E. A., Hansen, W. B., MacKinnon, D. P., Dwyer, J. H., Johnson, C. A., et al. (1990). Effects of program implementation on adolescent drug use behavior: The Midwestern Prevention Project. Evaluation Review, 14, 264–289.

Ransford, C., Greenberg, M. T., Domitrovich, C. E., Small, M., & Jacobson, L. (2009). The role of teachers’ psychological experiences and perceptions of curriculum supports on implementation of a social emotional curriculum. School Psychology Review, 38, 510–532.

Resnicow, K., Davis, M., Smith, M., Lazarus-Yaroch, A., Baranowski, T., Baranowski, J., et al. (1998). How best to measure implementation of school health curricula: A comparison of three measures. Health Education Research, 13(2), 239–250.

Ruiz-Primo, M. A. (2006). A multi-method and multi-source approach for studying fidelity of implementation (CSE Report 677). Los Angeles, CA: National Center for Research on Evaluation, Standards, and Student Testing, UCLA.

Spoth, R., Guyll, M., Lillehoj, C. J., Redmond, C., & Greenberg, M. (2007). PROSPER study of evidence-based intervention implementation quality by community-university partnerships. Journal of Community Psychology, 35, 981–999.

Spoth, R., Guyll, M., Trudeau, L., & Goldberg-Lillehoj, C. (2002). Two studies of proximal outcomes and implementation quality of universal preventive interventions in a community-university collaboration context. Journal of Community Psychology, 30, 499–518.

Swanson, E., Wanzek, J., Haring, C., Ciullo, S., & McCulley, L. (2011). Intervention fidelity in special and general education research journals. Journal of Special Education, 20, 1–11.

Taylor, L. (1994). Reflecting on teaching: The benefits of self-evaluation. Assessment & Evaluation in Higher Education, 19, 109–122.

Wickstrom, K. F., Jones, K. M., LaFleur, L. H., & Witt, J. C. (1998). An analysis of treatment integrity in school-based behavioral consultation. School Psychology Quarterly, 13, 141.

Wubbels, T., Brekelmans, M., & Hooymayers, H. P. (1992). Do teacher ideals distort the self-reports of their interpersonal behavior? Teaching and Teacher Education, 8, 47–58.
