Journal of Abnormal Child Psychology, Vol. 5, No. 2, 1977

Improving the Validity of Global Ratings¹

Mark R. Weinrott²

Oregon Research Institute

To examine the effects of behavior sampling on global ratings, four groups of 10 teachers each received varying amounts of observation training, practice, and feedback. Teachers viewed a series of seven videotapes depicting two boys whose percentage of distractible behavior was systematically manipulated. Ratings of distractibility were obtained for each taped vignette. Results showed that teachers who received observation training and who routinely collected data in their own classroom submitted ratings which corresponded to actual levels of distractible behavior. Teachers who received no training, or who were trained but did not practice, submitted ratings that were significantly less accurate.

Adults are often asked to provide evaluation reports of children's behavior. These may be in checklist form, responses to a questionnaire, ratings, or a simple verbal description. Such evaluations have been criticized as biased, retrospective, and global (Webb, Campbell, Schwartz, & Sechrest, 1966). To counter these criticisms of summary reports, behavioral assessments have become increasingly popular. Behavioral assessments have typically included identification and tracking of discrete target behaviors (e.g., noncompliance, out-of-seat) rather than global ratings (e.g., disruptiveness). Proponents of behavioral assessment contend that the extensive sampling of precisely defined behaviors, perhaps from some general class of behavior, provides greater measurement reliability than is obtained with conventional forms of trait attribution like global ratings (Walter & Gilmore, 1973; Wiggins, 1973).

¹Manuscript received in final form September 27, 1976. This research was supported by the Molson Foundation, Montreal, Quebec, Canada.
²Address all correspondence to Mark R. Weinrott, Oregon Research Institute, P. O. Box 3196, Eugene, Oregon 97403.

© 1977 Plenum Publishing Corp., 227 West 17th Street, New York, N.Y. 10011. To promote freer access to published material in the spirit of the 1976 Copyright Law, Plenum sells reprint articles from all its journals. This availability underlines the fact that no part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording, or otherwise, without written permission of the publisher. Shipment is prompt; rate per article is $7.50.


While behavioral assessments may well outperform traditional global ratings psychometrically, it is not uncommon for educators, mental health professionals, and clients to rely heavily upon trait descriptors (e.g., immaturity, aggressiveness), even when more specific data are available. Rather than trying to discourage the use of global constructs in assessment, as many behavior analysts are wont to do, it might be more productive to improve global ratings by training evaluators to be better observers of the child behaviors they are asked to rate.

To train individuals to become more accurate observers in their own setting, Wahler and Leske (1973) conducted an experiment in which 40 elementary school teachers viewed a series of 15 videotapes depicting six children engaged in independent seat work. The children were actually following prepared scripts which systematically determined the percentage of time they were working appropriately or behaving in a distractible fashion. One of the children's distractible behaviors was faded over the 15 segments, such that she produced off-task behavior on 75% of the first tape, 70% of the second, and so on, with each subsequent tape portraying a decrease in off-task responses of 5%. One group of teachers was given instruction in how to sequentially sample behavioral events and record frequencies of specified responses, while another group received no such input. At the conclusion of each tape, all teachers were required to rate each child on a 7-point "distractibility" scale. Results showed that the untrained teachers were quite inaccurate in their ratings of the target child, most failing to report even slight "improvement" until the 13th tape (when the child was only 25% distractible, or after a 50% shift). The trained teachers were considerably more accurate in their appraisals, as ratings of most were sensitive to a shift of only 15%.

A successful replication of this study conducted by Goldschmidt (1975) examined the direction of the behavior change as well as the effects of systematic observation. Subjects viewed a series of five videotapes in either gradually "improving" or "deteriorating" order. Compared to subjects who passively observed, those who applied the prescribed tracking techniques perceived a greater amount of change and estimated significantly less overall deviant behavior.

The present study was an attempt to replicate and extend the findings of Wahler and Leske (1973) and Goldschmidt (1975). Similar procedures were adopted and modified so that the pattern of behavior was more complex (nonlinear). It was also deemed desirable to separate the effects of the initial observation training itself from those of practicing the skills on a routine basis. It was hypothesized that teachers' ratings of child behavior would show greater convergence with actual levels of distractibility as a function of training, practice, and feedback in systematic observation.

METHOD

Subjects

Subjects were 40 elementary school teachers (grades 1-3) enrolled in a continuing education course in behavior modification. Teachers were consecutively assigned to one of four groups (N = 10) as follows:

G1. Subjects received no training in observation skills or data-gathering techniques and were not encouraged to attempt systematic assessment of any kind. To control for differences in subject-instructor contact hours, G1 heard a 2-hour placebo lecture on pharmacological intervention with hyperactive children in lieu of the observation training session.

G2. This group received one 2-hour session of observation training consisting of instruction and practice in systematic viewing of classroom interaction and collecting of data on specified behaviors (e.g., noncompliance, out-of-seat). Using videotapes, teachers learned three common recording procedures: event recording, which provides measures of frequency of occurrence of target behaviors; duration recording, which provides measures of the duration of occurrence; and occurrence-nonoccurrence (interval) recording, which can provide estimates of both the frequency and rate of the target response. Where appropriate, the obtained data were converted into rate per minute, or proportion of intervals in which the behavior had occurred. The session involved practice in deciding which sampling procedure was most efficient yet would still yield a valid index of the behavior in question. This was followed by application of the selected strategy. A minimum of six target behaviors and accompanying tapes served as practice material. Teachers were encouraged to use the techniques in their own setting, but were not required to do so.

G3. Members of this group received one 2-hour session of observation training identical to that described above, plus the assignment of collecting data daily ("tracking") on selected behaviors (two to three) emitted by a target child in their own classroom. In addition, teachers were required to submit a weekly written record of the data to the instructor in order to gain admission to the session. (This group is analogous to Wahler and Leske's experimental group; the daily data collection is comparable to test trials conducted on consecutive days.)

G4. Subjects received one session of observation training and were assigned the task of daily data collection as described for group G3. It was further stipulated that members of G4 would recruit a third-party monitor (e.g., student teacher, parent aide, free-flow teacher, assistant principal) from within the school who would observe the target child and record data along with the teacher.


One purpose of introducing the monitor was to combat the observer drift phenomenon (O'Leary & Kent, 1973). There seems to be a strong possibility that a single trained observer who receives no feedback may be coding according to gradually shifting criteria. This form of "reliability" assessment was to be carried out for a minimum of three 15-minute periods per week. Each teacher was responsible for training her own calibrators using predetermined definitions and observation strategy. The identity of the external monitor changed periodically, but not systematically, as a partial control against observer drift in a fixed teacher-monitor pair. Admittance to the weekly course meeting was contingent upon the teacher submitting both sets of data to the instructor.
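The observation training described for G2 mentions converting recorded data into rate per minute or proportion of intervals. The article gives no formulas, so the following is only a minimal sketch of those two conversions; the function names and the sample numbers are hypothetical and are not taken from the study.

```python
# Hypothetical helper names; the article describes the conversions
# but does not give formulas or data for them.

def rate_per_minute(event_count: int, session_seconds: float) -> float:
    """Event recording: frequency converted to responses per minute."""
    if session_seconds <= 0:
        raise ValueError("session_seconds must be positive")
    return event_count / (session_seconds / 60.0)


def proportion_of_intervals(interval_marks: list[bool]) -> float:
    """Interval recording: share of intervals in which the behavior occurred."""
    if not interval_marks:
        raise ValueError("need at least one interval")
    return sum(interval_marks) / len(interval_marks)


# Made-up observation: 12 out-of-seat events in a 10-minute period.
print(rate_per_minute(12, 600))        # 1.2

# Made-up interval record: behavior seen in 3 of 8 intervals.
marks = [True, False, False, True, False, True, False, False]
print(proportion_of_intervals(marks))  # 0.375
```

Both measures put observations of different lengths on a common footing, which is presumably why the training asked teachers to choose between them according to the behavior in question.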

Procedure

One week prior to observation training (or placebo input for G1), all subjects were presented with a strategy for pinpointing and defining target behaviors. Each teacher had isolated three target behaviors for a preselected child in her class and had been required to formulate definitions for these. Two weeks following the observation training (or placebo for G1), teachers viewed the first of a series of seven video protocols, each of which depicted two boys seated at adjacent desks whose rates of off-task (distractible) behavior were systematically manipulated. The behavior of one child was varied such that he emitted distractible behavior during 70% of the first tape, 60% of the second, and so on, until he was off-task on only 30% of the fifth protocol. On the basis of episodes 1 through 5, this child's behavior could be construed as "improving." A matched set of 30% and 40% off-task tapes was also produced and constituted trials 6 and 7, respectively. The behavior of the other boy was held relatively constant at a level of 35-45% off-task.

Children were assigned the task of independently solving arithmetic problems from a workbook. The protocols were produced by directing the children to follow prepared scripts which prescribed the topography of the response each would emit. The 10-minute segment was divided into 40 15-second intervals. At the beginning of an interval, individual instructions were given to the children by a director who wrote them on a blackboard or sheet of paper. The boys were asked to produce one or two of the following responses: out-of-seat, talking with peer, manipulate object, look around, or work. In cases where a child was told to produce two responses within an interval, these were to occur sequentially, not concurrently, and always involved shifting from one distractible behavior to another. When a task-relevant response was evoked, it was carried on for the entire 15 seconds. The responses were randomly assigned to intervals within the script, although the ratio of appropriate to inappropriate behavior was prearranged for each child.
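The scripting procedure above (40 intervals of 15 seconds, a prearranged off-task ratio, random placement of responses) can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the authors' production method: it assigns exactly one response per interval, whereas the text allows one or two, and the function name, seeds, and random draws are hypothetical.

```python
import random

# The four distractible categories and the on-task response named in the text.
OFF_TASK = ["out-of-seat", "talking with peer", "manipulate object", "look around"]
ON_TASK = "work"
N_INTERVALS = 40  # a 10-minute tape divided into 15-second intervals


def build_script(percent_off_task: int, seed: int = 0) -> list[str]:
    """Return one response label per interval with a fixed off-task share.

    Simplification: one response per interval (the text allows one or two).
    """
    n_off = round(N_INTERVALS * percent_off_task / 100)
    rng = random.Random(seed)
    script = [rng.choice(OFF_TASK) for _ in range(n_off)]
    script += [ON_TASK] * (N_INTERVALS - n_off)
    rng.shuffle(script)  # random placement of responses within the tape
    return script


# Off-task percentages for the manipulated child across the seven protocols.
SCHEDULE = [70, 60, 50, 40, 30, 30, 40]

for tape, pct in enumerate(SCHEDULE, start=1):
    script = build_script(pct, seed=tape)
    off_count = sum(label != ON_TASK for label in script)
    print(f"tape {tape}: {pct}% off-task -> {off_count} of {N_INTERVALS} intervals")
```

Fixing the count of off-task intervals before shuffling keeps the ratio exact on every tape while leaving the position of each response to chance.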


Teachers observed one tape per week for 7 weeks. They were told that the protocols were to be used for training independent observers on a classroom coding system, and that the purpose of screening them was to assess the complexity of the tape by determining the level of distractible behavior. Tapes used for training were to be catalogued in this manner so that the instructor could better evaluate the performance of the independent observers. Subjects were not given an instructional set with respect to an expected pattern of child behavior, nor were they advised of any diagnostic labels. Just before showing each tape, the instructor wrote the four distractible response categories on the blackboard and asked the teachers to refer to this list in any way which would help them make a more accurate overall appraisal. Members of all groups except G1 were asked to record frequencies for each. Teachers in all groups were instructed to watch each tape carefully, to remain silent, and at the conclusion, to place a mark on a 7-point "distractibility" scale at the point which best described how the target child compared to students with whom the teachers ordinarily dealt (i.e., peer norms). Teachers retained no record of their ratings from week to week.

RESULTS

The data were analyzed by assigning to each of the seven protocols an ideal or standard rating to which obtained scores could be compared. Because the amount of distractible behavior decreased in a linear fashion (trials 1 through 5), it was assumed that totally accurate ratings should depict a similar pattern. Ratings on tapes 6 and 7 (30% and 40% distractible) should coincide directly with those for protocols 5 and 4, respectively. The standard rating for the first protocol was identified as "7." Each tape (through number 5) was assigned a rating one point lower than the previous tape. Virtually any standard could have been used as long as the order and magnitude of the differences between protocols were preserved. The assignment of number "7" to the first tape offered the added advantage of representing the modal and median ratings for each of the four groups of subjects. Table I presents the standard ratings for each protocol, while Table II shows the group means and standard deviations for ratings on each protocol.

Table I. Standard Ratings for Each Protocol

Protocol number                     1    2    3    4    5    6    7
Percent of distractible behavior   70   60   50   40   30   30   40
Standard rating                     7    6    5    4    3    3    4

Table II. Group Means and Standard Deviations for Ratings

                 E1            E2            E3            E4
Protocol     Mean   SD     Mean   SD     Mean   SD     Mean   SD
    1         6.4   .8      6.6   .7      6.3   .5      6.8   .4
    2         6.2   .9      6.4   .7      5.5   .7      6.1   .7
    3         6.4   .7      6.5   .8      5.0  1.2      5.5   .7
    4         5.3  1.5      5.5  1.3      4.0   .7      4.5   .8
    5         3.6  1.1      4.1  1.5      3.2  1.5      3.5  1.1
    6         4.2  1.9      2.7   .9      2.9   .9      3.0   .9
    7         3.8  1.3      5.5  1.2      3.4  1.5      3.3   .7

Table III shows deviation scores for each protocol, obtained by taking the absolute value of the difference between the standard ratings (Table I) and the group mean (Table II). A deviation score takes into account both the accuracy (agreement with an external criterion) and reliability (intraclass correlation) of ratings. Use of the actual ratings or of simple change scores would provide a means of assessing reliability but would offer no guarantee that raters detected a change in behavior from trial to trial. Figure 1 shows the mean group ratings for each trial and the corresponding standard ratings.

A repeated measures ANOVA was performed on the seven deviation scores for the four groups. Results are presented in Table IV. Main effects for groups (F = 7.014, df = 3, 36; p < .001) and protocols (F = 4.625, df = 6, 216; p < .001) were found, as well as a significant interaction between the two factors (F = 1.804, df = 18, 216; p < .03). Orthogonal comparisons (Winer, 1971) between the four group means were performed, revealing differences between the two pairs of groups. G1 and G2 deviated from standard ratings significantly more than did G3 and G4, the members of which were collecting daily data (F = 17.643, df = 1, 180; p < .01). Group G2, which received one session of observation training, did not differ from G1. While those groups

Table III. Means and Standard Deviations for Deviation Scores

                 E1            E2            E3            E4
Protocol     Mean   SD     Mean   SD     Mean   SD     Mean   SD
    1          .6   .8       .4   .7       .7   .5       .2   .4
    2          .8   .4       .6   .5       .7   .5       .5   .5
    3         1.4   .7      1.5   .8      1.0   .7       .7   .5
    4         1.7   .9      1.7   .9       .4   .5       .7   .7
    5          .8   .9      1.3  1.3      1.2   .8       .7   .9
    6         1.6  1.5       .7   .7       .7   .5       .6   .7
    7         1.0   .8      1.7   .8      1.2  1.0       .7   .7
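The rule for standard ratings and the idea of a deviation score can be illustrated with a short sketch. The percentages and the rating rule come from the text; the two sets of teacher ratings are invented for illustration, and scoring each rater separately before averaging is an assumption about the procedure, since the text describes the score only in terms of the standard rating and the group mean.

```python
from statistics import mean

# Percent of distractible behavior on each of the seven protocols (from the text).
PERCENT_OFF_TASK = [70, 60, 50, 40, 30, 30, 40]


def standard_rating(percent: int) -> int:
    """Map 70% to a rating of 7 and drop one point per 10-point decrease."""
    return 7 - (70 - percent) // 10


def deviation_scores(ratings: list[int]) -> list[int]:
    """Absolute difference between one rater's ratings and the standards."""
    standards = [standard_rating(p) for p in PERCENT_OFF_TASK]
    return [abs(r - s) for r, s in zip(ratings, standards)]


standards = [standard_rating(p) for p in PERCENT_OFF_TASK]
print(standards)  # [7, 6, 5, 4, 3, 3, 4]

# Hypothetical ratings from two teachers (NOT data from the study).
teacher_a = [7, 6, 6, 5, 3, 4, 4]
teacher_b = [6, 6, 5, 4, 4, 3, 5]

dev_a = deviation_scores(teacher_a)
dev_b = deviation_scores(teacher_b)
print(dev_a)  # [0, 0, 1, 1, 0, 1, 0]
print(dev_b)  # [1, 0, 0, 0, 1, 0, 1]

# Mean deviation per protocol across the two hypothetical raters.
print([mean(pair) for pair in zip(dev_a, dev_b)])
```

Because the score is an absolute value, over- and under-estimates by different raters do not cancel, so a group can only score near zero if its members individually track the standard.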

[Figure: group rating and standard rating plotted against trial (1 to 7), in separate panels by group.]

Fig. 1. Obtained vs. standard ratings for each protocol.

Table IV. Analysis of Variance of Deviation Scores

Source             SS      df     MS      F
A (group)         14.3      3    4.8    7.0 a
Error             24.5     36     .7
B (protocol)      17.0      6    2.8
A × B             19.9     18    1.1
Error            132.5    216     .6

a p <   b p <   c p
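As a rough arithmetic check, the mean squares and F ratios can be recomputed from the sums of squares and degrees of freedom printed in Table IV. The sketch below assumes the usual layout for a groups-by-repeated-measures design, in which the group effect is tested against the first error term and the protocol and interaction effects against the second; the labels "between" and "within" are added here for clarity and do not appear in the table.

```python
# Sums of squares and degrees of freedom as printed in Table IV.
# The "(between)" and "(within)" labels are assumptions added for clarity.
TABLE_IV = {
    "A (group)":       (14.3, 3),
    "Error (between)": (24.5, 36),
    "B (protocol)":    (17.0, 6),
    "A x B":           (19.9, 18),
    "Error (within)":  (132.5, 216),
}


def mean_square(ss: float, df: int) -> float:
    """Mean square is the sum of squares divided by its degrees of freedom."""
    return ss / df


ms = {name: mean_square(ss, df) for name, (ss, df) in TABLE_IV.items()}

# Groups are tested against the between-subjects error term; protocols and
# the interaction are tested against the within-subjects error term.
f_group = ms["A (group)"] / ms["Error (between)"]
f_protocol = ms["B (protocol)"] / ms["Error (within)"]
f_interaction = ms["A x B"] / ms["Error (within)"]

for name, value in [("group", f_group),
                    ("protocol", f_protocol),
                    ("interaction", f_interaction)]:
    print(f"F({name}) = {value:.2f}")
```

The recomputed ratios fall close to the F values reported in the text (7.014, 4.625, and 1.804); small differences are expected because the table prints sums of squares to one decimal place.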
