Evaluating Evaluation: Assessment of the American Board of Internal Medicine Resident Evaluation Form

WARREN G. THOMPSON, MD, MACK LIPKIN, JR., MD, DAVID A. GILBERT, MA, RICHARD A. GUZZO, PhD, LORIANN ROBERSON, PhD

The American Board of Internal Medicine suggests use of a standard form to rate residents on nine dimensions (such as clinical judgment and overall clinical competence) on a scale of 1 to 9. The authors examined the psychometric evidence for reliability and validity of 1,039 ratings of 85 residents by 135 attendings, in a single internal medicine residency program. Of these ratings, 95.6% were from 6 to 9. Factor analysis revealed that high correlations among the nine dimensions (r ranged from 0.72 to 0.92) resulted from a single global factor accounting for 86% of the variance. The study also examined whether the form reliably distinguishes among residents scoring between 6 and 9. Agreement among attendings rating the same individual was weak (average reliability = 0.64, by the method of James). The rating method fails to discriminate dimensions of clinical care and has low reliability for distinguishing among competent residents. Key words: education, medical, graduate; educational measurement; internship and residency. J GEN INTERN MED 1990;
5:214-217.

THE AMERICAN BOARD of Internal Medicine (ABIM) desires to improve medical care by promoting standards of clinical excellence and evaluating internists against those standards for purposes of certification.1 To this end the Board has developed a standard rating form that program directors use to evaluate residents. In many programs, attending physicians devote considerable time to completing these forms when evaluating residents after each clinical rotation. The Board depends on the forms to evaluate skills, attitudes, and behaviors that the written examination does not test.1 This rating process, however, has been subjected to little critical evaluation. This paper reports an evaluation of the reliability and validity of the forms as used at one internal medicine training program over a four-year period.
METHODS

Using the ABIM standard evaluation form, 135 attending physicians completed 1,039 evaluation forms to assess the clinical competences of 85 traditional and primary care internal medicine residents at various points in their training. Each form required a rating on nine dimensions: clinical judgment, medical knowledge, history-taking skills, physical examination skills, procedural skills, interpersonal skills, medical care, attitudes and professionalism, and overall clinical competence. Ratings were made on a nine-point scale with the following category blocks: 1-3, unsatisfactory; 4-6, satisfactory; and 7-9, superior. Attending physicians could write on the rating form summary comments as well as comments for each rated dimension.

Means and standard deviations were calculated for individual dimensions. Correlations among ratings were computed and a factor analysis performed. The factor analysis involved the extraction of orthogonal principal components using least-squares communality estimates in the diagonals of the 9 × 9 matrix of correlations (among the ratings on the nine dimensions of clinical competence).

A measure of interrater reliability (i.e., consistency among ratings given to the same resident by different attendings) also was calculated. The index of interrater reliability was based on a randomly selected subsample of 15 residents, each of whom was rated while doing general medicine ward or chest service ward rotations in the same year of residency by six or more attending physicians (the months being evaluated were not necessarily the same). The measure of interrater reliability was that suggested by James et al.2 The index has an effective range from 0 to 1.0, with 1.0 indicating complete interrater reliability.

Mean differences in overall competence ratings among cohorts of first-year residents from 1984-85 to 1987-88 were analyzed by ANOVA and Scheffé tests.

Received from the Division of Primary Care, Department of Medicine (WT, ML), New York University School of Medicine; and the Department of Psychology (DG, RG, LR), New York University, New York, New York. (Dr. Guzzo is now at the Department of Psychology, University of Maryland, College Park, Maryland.) Presented at the annual meeting of the Society of General Internal Medicine, Washington, D.C., April 28, 1989. Address correspondence and reprint requests to Dr. Thompson at his present address: University of Tennessee College of Medicine, Knoxville Unit; 1924 Alcoa Highway; Knoxville, TN 37920.
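The within-group agreement index of James et al. used in this study can be sketched as follows. This is a minimal illustration of the single-item form of the index; the function name `rwg` and the example ratings are hypothetical, not the study's data. For one rated dimension, the index compares the observed variance of the ratings to the variance expected under a uniform ("no agreement") distribution on the rating scale.

```python
# Sketch of the within-group interrater agreement index of James,
# Demaree & Wolf (1984), the reliability measure used in this study.
# The function name and the example ratings below are illustrative
# only; they are not the study's data.

def rwg(ratings, scale_points=9):
    """Single-item r_wg: 1 - (observed variance / uniform-null variance)."""
    n = len(ratings)
    mean = sum(ratings) / n
    observed_var = sum((x - mean) ** 2 for x in ratings) / (n - 1)
    # Variance of a discrete uniform distribution over an A-point scale,
    # the "no agreement" null: (A^2 - 1) / 12.
    null_var = (scale_points ** 2 - 1) / 12
    return 1 - observed_var / null_var

# Six hypothetical attendings rating the same resident on the 9-point scale:
print(round(rwg([7, 8, 7, 8, 9, 7]), 2))  # tight clustering -> prints 0.9
print(rwg([4, 7, 9, 6, 8, 5]))            # wider scatter -> moderate agreement
```

Values near 1.0 indicate that raters' scores cluster far more tightly than chance; when observed variance exceeds the null variance, the raw value goes negative and is conventionally truncated to 0, giving the 0-to-1.0 effective range described above.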
RESULTS

Mean ratings for the nine dimensions are shown in Table 1. The average ratings on each dimension were at the high end of the scale (7.34 to 7.61) and had small standard deviations. Although a nine-point scale is used on the ABIM form, raters used, in effect, a four-point scale; 95.6% of the scores were between 6 and 9. Written comments under each dimension occurred in approximately one-tenth of the ratings. About half the forms had comments at the end. Most of these comments were general and lacking in specificity, such as "excellent resident."

TABLE 1. ABIM Evaluation Scores

Dimension                        Mean (± SD)
Clinical judgment                7.35 (± 1.10)
Medical knowledge                7.38 (± 1.09)
History-taking skills            7.40 (± 1.13)
Physical examination skills      7.34 (± 1.09)
Procedural skills                7.49 (± 1.06)
Interpersonal skills             7.58 (± 1.12)
Medical care                     7.41 (± 1.09)
Attitudes and professionalism    7.61 (± 1.14)
Overall clinical competence      7.47 (± 1.06)

Ratings on the nine dimensions were highly correlated: correlations ranged from 0.72 to 0.92. Factor analysis found a single-factor solution to be optimal. The loadings of the nine dimensions on the factor ranged from 0.88 to 0.96, and the factor accounted for 86% of the variance among ratings (Table 2). This strong, single factor is evidence for a "halo" effect in the ratings. That is, raters fail to differentiate among the nine dimensions of clinical competence in their ratings of resident performance.

TABLE 2. Factor Analysis of ABIM Evaluation Scores

                        Factor 1    Factor 2
Eigenvalue                 7.8         0.4
Variance explained (%)    86.4         4.7

Factor matrix, Factor 1 loadings (r):
Clinical judgment                0.94
Medical knowledge                0.92
History-taking skills            0.94
Physical examination skills      0.94
Procedural skills                0.94
Interpersonal skills             0.88
Medical care                     0.94
Attitudes and professionalism    0.89
Overall clinical competence      0.96

Within the de facto four-point scale used by the raters, it is important to know whether attending physicians agree when rating the same resident. Using the method of James et al.,2 interrater reliabilities ranged from 0.16 to 0.88 for 15 residents rated in the same year by six or more attending physicians (Table 3). The average reliability score was 0.64, and for only six residents was the score 0.80 or better. As discussed below, these data show rather weak interrater reliability.

TABLE 3. Interrater Reliability Scores for Response Range 6 to 9

Reliability Score*    No. of Residents
0.80-0.88                    6
0.50-0.79                    5
0.16-0.49                    4

*Average reliability score = 0.64.

For exploratory purposes, an examination was made of the differences in ratings obtained in the first year of residency for the period 1984-85 to 1987-88. Because all dimensions loaded on a single factor, an overall evaluation score was calculated for each first-year resident by averaging the nine ratings. These overall evaluation scores were then compared via ANOVA and Scheffé tests. The results are shown in Table 4. The ANOVA results indicate a decline in clinical competence ratings for first-year residents (F = 9.57, p < 0.0001). The overall evaluation scores for the 1987-88 first-year residents were significantly lower than were the scores for the 1985-86 and 1986-87 first-year residents (Scheffé test, p < 0.05).

TABLE 4. Overall Scores for First-year Residents by Year of Entry*

Year of Entry    Number of Ratings    Mean Score
1984-85                  5               8.24
1985-86                218               7.67
1986-87                205               7.44
1987-88                128               7.12

*Analysis of variance, F = 9.57, p < 0.0001. The 1987-88 mean score was significantly lower than the 1985-86 and 1986-87 scores, Scheffé p < 0.05.
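The single-factor pattern reported in Table 2 can be reproduced in miniature. The sketch below uses a hypothetical 9 × 9 correlation matrix (every off-diagonal entry set to 0.85, mid-range of the reported 0.72-0.92) rather than the study's actual matrix; for such a uniform matrix the leading eigenvalue is 1 + (k − 1)ρ, so one principal component dominates just as in the paper.

```python
# Sketch of the principal-components logic behind the single-factor
# result: when all nine dimensions correlate uniformly and highly, one
# component absorbs most of the variance. The correlation matrix here
# is hypothetical, not the study's data.

def dominant_eigenvalue(matrix, iterations=100):
    """Power iteration for the largest eigenvalue of a symmetric matrix."""
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        scale = max(abs(x) for x in w)
        v = [x / scale for x in w]
    # Rayleigh quotient v'Mv / v'v estimates the eigenvalue.
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(m * x for m, x in zip(mv, v)) / sum(x * x for x in v)

k, rho = 9, 0.85
R = [[1.0 if i == j else rho for j in range(k)] for i in range(k)]

lam = dominant_eigenvalue(R)
print(f"first eigenvalue: {lam:.2f}")               # 1 + (k - 1)*rho = 7.80
print(f"variance explained: {lam / k * 100:.1f}%")  # 7.8 / 9 -> 86.7%
```

Because the trace of a correlation matrix equals the number of variables k, the fraction of variance carried by a component is simply its eigenvalue divided by k; a first eigenvalue of 7.8 out of 9 corresponds to the roughly 86% figure in Table 2.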
DISCUSSION

Murphy and Balzer describe four facets of accuracy in ratings that measure several dimensions of performance.3 They are: 1) "elevation," the overall level of the rating; 2) "differential elevation," discriminating among ratees; 3) "stereotype accuracy," discriminating among performance dimensions; and 4) "differential accuracy," discriminating among ratees within performance dimensions. Our data do not indicate anything about differential accuracy.

Whether the ratings were too lenient or "elevated" for this group of residents cannot be answered definitively with these data. The lower end of the ABIM scale (scores 1 to 3) was not used at all, and 95% of the scores were greater than the midpoint of the scale (scores 6 to 9). Although these data suggest leniency, our residents all had excellent academic records prior to starting residency and may warrant the very high ratings they received. It is also possible that some faculty raters rated "leniently" because they realized the difficult job that residents have and wished to provide a measure of support. Nevertheless, there is little doubt that the tendency to be lenient in rating is very real.4 This is the most important facet of accuracy to the Board because an unsatisfactory resident could be labeled satisfactory because of the tendency to leniency.

Our data demonstrate definite problems with the second facet of rating accuracy, "differential elevation." Although those being rated may all have been performing at a competent level, raters should be able to discriminate among ratees reliably. How much interrater reliability is required depends on the uses to which a measure is put. Nunnally suggests that when used to make policy decisions about individuals, reliability must be very high (at least 0.90) for a measure to be used confidently.5 Departmental recommendations for fellowships and job opportunities may be based on these evaluations. Our data show that the interrater reliability is too low to distinguish among individual residents for such purposes. Poor interrater reliability has been seen in other studies as well.6 Some raters were systematically too lenient and other raters too harsh relative to their peers.6 This facet of inaccuracy is less important to the Board if all residents being evaluated are competent. However, poor interrater reliability becomes a major problem when both competent and incompetent residents are being evaluated.

Discriminating among performance dimensions ("stereotype accuracy") is the third facet of accuracy in ratings. The ABIM believes it is important to measure different dimensions of clinical care provided by resident physicians.1 These dimensions are reflected in the Board's statement on clinical competence.7
However, the data presented here suggest that the form currently used in many residency programs and recommended by the Board is not yielding ratings that accurately differentiate among these nine dimensions. Failure to distinguish among different dimensions of an evaluation scale has been seen in internal medicine and in other areas of medicine.6,8,9 Global rating scales often yield a "halo" effect, where a score on one dimension predicts the scores on all the dimensions with great reliability. This has been found to occur in obstetrics and gynecology training programs.8 An assessment of evaluation of internal medicine residents at another institution also found this effect.9 Further assessment at other programs would enhance the generalizability of these results. A study of orthopedic residents taking the certification examination yielded results very similar to ours.6 The authors used a variety of measuring instruments, including multiple-choice questionnaires, written and oral simulation exercises, and faculty ratings. They also found a "halo" effect, and they found that very few candidates were rated poor or marginal.6 Statistical analysis showed that "information-gathering ability" was really "diagnostic ability," which the authors suggest may have been the result of raters' not actually observing residents gathering data and thus basing their evaluations on some other dimension that they had observed.6 The authors commented on the "tendency to credit the person with a pleasing personality with greater cognitive skills than he has actually achieved."

A further problem with the halo effect is that it limits the usefulness of feedback provided to residents concerning their performances. If attending physicians fail to distinguish among dimensions of competence, their ratings cannot inform residents of areas that are strengths and those that need improvement. Thus, the halo effect limits both the evaluative and the educational utility of these ratings. The lack of useful written comments also limits the feedback potential of these forms.

There are at least two possible explanations for these results. One, which we consider less likely as the sole explanation, is that the specific rating format hampers rating accuracy. Alternatively, attending physicians may not be sufficiently trained in the use of the form to make distinctions among the dimensions of clinical competence. Ratings are the result of the interaction between the format, the rater, and the ratee.10 Thus, either or both explanations could be responsible for the problems observed.

How can these problems be reduced or eliminated? If problems are attributed to the rating format, then one avenue for action is to modify the form. Concerning format changes, an important question to be asked is whether the nine items represent the most important characteristics of clinical competence. Using a Delphi process with 78 faculty and residents, Wigton found substantial disagreement among full-time faculty, voluntary faculty, and residents regarding the ten most important characteristics to be evaluated.11 This problem occurs in the medical clerkship as well.12 Wigton's group rated reliability, relationship with patients, and clinical judgment highest and did not rate history-taking skills, record-keeping skills, or procedural skills among the ten most important characteristics. Whether changing the dimensions rated would improve the evaluation process is unknown but could be explored; a lack of consensus among attending physicians regarding the important dimensions of competence may be responsible for low agreement.

The alternative approach to improving ratings is to focus on the attending physicians. We believe it unlikely that just changing the form used to evaluate residents will improve the problems identified, because reviewers of the literature on rating problems have found that format changes alone often fail to improve the quality of the ratings.13 Improvement in the process by which raters make judgments is necessary. Successful rater training programs in industry have emphasized accurate observation of behavior, decreasing idiosyncratic and erroneous beliefs about competence, and improving rater motivation to rate accurately. Such programs might be profitably applied to medical education.
In conclusion, this study suggests that in this setting, the ABIM form failed to yield valid assessments of the different dimensions of clinical competence. The form is meant to provide measurement of nine different dimensions, yet only one dimension was assessed. The data do not answer whether or not this dimension was validly assessed, but the low reliability of the ratings makes a high degree of validity impossible to achieve, as validity is limited by the reliability of a measure.14 More specifically, the validity of distinctions among residents' clinical performances is called into doubt because of the low interrater reliability. One of the problems with rating scales of this type is the absence of a "gold standard" with which they can be compared.15 Standardized patients or patient instructors might be able to provide a "gold standard" against which to directly assess the validity of the ABIM form.15 Future research should explore methods to improve the quality and usefulness of such ratings. Meanwhile, caution in drawing conclusions about individual residents from these evaluation forms is in order. Other methods of evaluation should be sought by the Board and by departments of medicine to evaluate medical residents.
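The classical test-theory bound invoked here (that validity is limited by reliability) can be made explicit; the notation below is standard rather than the paper's. A measure's correlation with any criterion cannot exceed the square root of its reliability:

```latex
r_{xy} \le \sqrt{r_{xx}}
```

As a rough illustration only, if the average interrater value of 0.64 were taken as a stand-in for the reliability coefficient, correlations between these ratings and any external criterion could not exceed about \sqrt{0.64} = 0.80.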
REFERENCES

1. The American Board of Internal Medicine. The role of the attending physician in evaluating residents. Philadelphia: American Board of Internal Medicine, 1987.
2. James LR, Demaree RG, Wolf G. Estimating within-group interrater reliability with and without response bias. J Appl Psychol. 1984;69:85-98.
3. Murphy KR, Balzer WK. Rater errors and rating accuracy. J Appl Psychol. 1989;74:619-24.
4. Bernardin HJ, Beatty RW. Performance appraisal: assessing human behavior at work. Boston: Kent Publishing, 1984.
5. Nunnally JC. Psychometric theory. New York: McGraw-Hill, 1967.
6. Levine HG, McGuire CH. Rating habitual performance in graduate medical education. J Med Educ. 1971;46:306-11.
7. American Board of Internal Medicine. Clinical competence in internal medicine. Ann Intern Med. 1979;90:402-11.
8. Cranton PA, Dauphinee WD, McQueen MM, Smith LP. The reliability and validity of in-training evaluation reports in obstetrics and gynecology. In: Research in medical education: proceedings of the twenty-third annual conference. Washington, DC: Association of American Medical Colleges, 59-64.
9. Williams SV. Assessment of procedures used to substantiate clinical competence for ABIM certifying examination. Abstract presented to Robert Wood Johnson Clinical Scholars Program, 1977.
10. Banks CG, Roberson L. Performance appraisers as test developers. Acad Management Rev. 1985;10:128-42.
11. Wigton RS. Factors important in the evaluation of clinical performance of internal medicine residents. J Med Educ. 1980;55:206-8.
12. Quarrick EA, Sloop EW. A method for identifying the criteria of good performance in a medical clerkship program. J Med Educ. 1972;47:188-97.
13. Landy FJ, Farr J. Performance rating. Psychol Bull. 1980;87:72-107.
14. Allen MJ, Yen WM. Introduction to measurement theory. Belmont, CA: Wadsworth, 1979.
15. Stillman PL, Gillers MA. Clinical performance evaluation in medicine and law. In: Berk RA, ed. Performance assessment: methods and applications. Baltimore: Johns Hopkins University Press, 1986.