Medical Education 1991, 25, 110-118

Pitfalls in the pursuit of objectivity: issues of reliability

C. P. M. VAN DER VLEUTEN, G. R. NORMAN† & E. DE GRAAFF‡
University of Limburg, Maastricht, †McMaster University, Hamilton, Ontario and ‡Technical University, Delft

Summary. Objectivity has been one of the hallmarks in the assessment of clinical competence in recent decades. A consistent shift can be noticed in which subjective measures are being replaced by objective measurement methods. In the transition from subjective to objective methods trade-offs are involved, both in the effort expended and in the range of behaviours assessed. The issue of the presumed superiority of objective measures is addressed in two successive papers. In this paper a distinction is made between objectivity as a goal of measurement, marked by freedom from subjective influences in general, and objectivity as a set of strategies designed to reduce measurement error. The latter has been termed objectification. The central claim of this paper is that these two approaches to assessment do not necessarily coincide. By reviewing a number of studies comparing subjective and objectified measurement methods, the claim of the supremacy of the latter with respect to reliability is discussed. The results of these studies indicate that objectified methods do not inherently provide more reliable scores. Objectified methods may even provide unwanted outcomes, such as negative effects on study behaviour and triviality of the content being measured. The latter issues, related to validity, efficiency and acceptability, are discussed in a second paper.

Introduction

The recent history of evaluation in medical education has been marked by a consistent movement towards objectively scored examination methods - multiple choice questions, patient management problems or the objective structured clinical examination. There are several reasons for this trend. On the one hand, it is easy to demonstrate that the reliability of subjective judgements is often miserably low. Essay tests commonly have interrater reliabilities which rarely exceed 0.70 (Neufeld 1985); traditional viva-like orals are in the range of 0.50 to 0.80 (Muzzin 1985), and ward ratings are easily the worst of all, with median reliabilities of several large studies in the range from 0.25 to 0.37 (Streiner 1985). Moreover, many of these forms of assessment involve intensive manpower. As a result, the promise that the assessment can be accomplished by an alternative method which involves no human effort in scoring, and is objective (with an aura of scientific precision which accompanies the term), has proven irresistible. Nevertheless, these transitions have rarely proceeded without opposition. Many clinicians remain uncomfortable with multiple choice testing, claiming, for example, that recognition of the right answer from a list is different from being able to recall the right answer, and that a multiple choice version of a test therefore overestimates competence (Newble et al. 1979). This discomfort has led to alternative testing methods, such as the modified essay question (Feletti 1980; Knox 1980), which uses free response answers. The concern that multiple choice questions lead to a single-minded pursuit of isolated facts also contributed, in large part, to the evolution of the patient management problem (McGuire & Babbott 1967), which was both

Key words: educational measurement/*methods; reproducibility of results; clinical competence; review, tutorial

Correspondence: C. P. M. Van der Vleuten PhD, Department of Educational Development and Research, PO Box 616, 6200 MD Maastricht, The Netherlands.

objective and closer to clinical problem-solving. The goal of developing methods which are based on actual clinical performance, yet retain the characteristic of objectivity, so attractive in the written tests, led to the objective structured clinical examination (OSCE) (Harden & Gleeson 1979). Since there are trade-offs involved in the transition from subjective to objective methods, both in the effort expended and in the range of behaviours which may reasonably be objectified, it is reasonable to address the issue of whether the objective methods can deliver on the claim; that is, does the use of objective scoring methods yield improvements, and if so, at what cost? Before we begin, it is necessary to define some terms.

Definitions of objectivity

Because the notion of objectivity has been central to many of these developments, we believe that it is important to clarify the meaning of the term. One view is that objectivity is a goal of measurement, marked by freedom from subjective influences. According to De Groot (1969, p. 163) the general scientific requirement of objectivity 'implies that the investigator must act as "objectively" as possible, that is, in such a way as to preclude interference or even potential interference by his personal opinions, preferences, modes of observation, views, interest, or sentiments'. This general directive states 'as objectively as possible', recognizing that subjective influence cannot be completely eliminated. Subjective influences may be encountered at all stages of research. The consequences of the requirement of objectivity, however, may vary according to the phase of scientific activity (De Groot 1969, p. 167). Preferences of the investigator, for instance, usually play a part in the selection of the object of study. Hence, some subjective influence in the early phases of research cannot be excluded, and in fact may be justified. Bias can also enter into a measurement situation at a later stage. If there is bias at the stage of item selection, for example an excess of particular disciplines, then the measurement is not objective. Similarly, if it can be shown that there is a sex or cultural bias in the test, then it is not objective. From this perspective, then,


objectivity is not simply obtained from any manipulations of the format or style of the instrument, but is a property to be assessed at all stages in the design and execution of an evaluation. An alternative view is that objectivity is a set of strategies designed to reduce measurement error. For example, Guilford (1954) states that 'if the instructions of the judges are clear enough to be carried out by clerks, the measurement may be considered objective'. This version of objectivity then reduces to a judgement of the clarity of instructions, the degree to which specific behaviours have been identified and described, etc., and ultimately has become associated with long, detailed check-lists and yes/no criteria as opposed to broad categories and 7-point scales. When Harden & Gleeson (1979) state, with reference to the OSCE, that 'The examination is not only more valid but more reliable. The variables of the examiner and the patient are to a large extent removed. The use of a check-list by the examiner and the use of multiple choice questions result in a more objective examination' (italics added), they may well be hoping to achieve the goal of objectivity, but they are describing the strategies. In order to distinguish between these two views of objectivity, we will continue to refer to the comprehensive goal of value-free measurement as objectivity, and the strategies to achieve an apparently judgement-free instrument (detailed check-lists, yes/no criteria, machine scoring, etc.) as objectification, and an instrument which has been constructed according to these strategies as one which has been objectified. Objectification has traditionally been defended in the literature on several grounds (Hofstee 1985): (1) validity - objectified test scores are a better, i.e. more valid, representation of the competencies assessed; (2) fairness - by removing (rater) biases, objectified tests are more fair to candidates; (3) efficiency - objectified tests may be marked by machines, and are therefore less costly and more efficient; and (4) transparency - objectified tests explicitly state the criteria of performance, thus the goals are clear to all. We will use this framework to pose a set of questions and then review a number of studies to determine:

(1) the extent to which the manipulations associated with objectification (yes/no responses, detailed behavioural check-lists) do in fact achieve the intended goal of more reliable measurement;
(2) the extent to which objectified and subjective tests assess different competencies;
(3) whether objectified tests, written or performance-based, are more or less efficient to administer, as judged by the time and cost of development and administration;
(4) whether objectified tests are perceived by teachers and students as more fair, transparent, useful, etc.; and
(5) whether objectification does, or does not, result in trivialization and loss of content validity.

First, in this paper we will examine the claim that objectified measures produce better reliability. The remaining questions will be dealt with in a second paper. We do not intend the literature review to be complete or comprehensive. Nevertheless, in light of the unquestioning acceptance of the value of objectification to date, we feel that any evidence that contradicts the prevailing assumptions, providing of course that it is of acceptable scientific merit, is worthy of serious consideration.

Does objectification improve reliability?

A predominant argument used in favour of objectification of measurement procedures is the improvement of reliability. Reducing bias by removing human inferences from the judgemental process should improve the precision with which scores are derived from objectified tests. However, it may not be true that subjective measures are inherently unreliable and therefore to be rejected. Some of the evidence regarding the comparative reliability of objective and subjective assessment methods will be reviewed here. First, however, some explanation about reliability is in order. The purpose of any assessment is to draw inferences about the ability of examinees beyond the particular sample of items, raters, testing site, time of day, (role-playing) patients, etc. Reliability can thus be conceptualized as the extent to which examinee scores are stable or reproducible

across different but similar (randomly parallel) samples of items, raters, testing sites, time of day, patients, etc. Since all of these factors influence the reliability, the comparison of assessment procedures should focus on the overall reproducibility of scores. The question now becomes how the subjective methods compare to the objectified ones in terms of their overall reproducibility of scores. Some studies that have addressed this issue will be described, exemplifying a subjective/objectified comparison in three different testing situations:

(1) multiple choice questions versus open-ended questions;
(2) check-lists versus rating scales in performance-based tests; and
(3) structure of scoring keys for tests using open-ended questions.
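
The notion of overall reproducibility sketched above can be written down explicitly. In generalizability theory, for a fully crossed persons × items × raters design, a common form of the coefficient is (the notation here is the standard one, not notation taken from the studies reviewed below):

    E\rho^{2} = \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \sigma^{2}_{pi}/n_{i} + \sigma^{2}_{pr}/n_{r} + \sigma^{2}_{pir,e}/(n_{i} n_{r})}

where \sigma^{2}_{p} is the variance of examinees' universe scores, \sigma^{2}_{pi} and \sigma^{2}_{pr} are the person-by-item and person-by-rater interaction variances, \sigma^{2}_{pir,e} is the residual, and n_{i} and n_{r} are the numbers of items and raters. Every error term shrinks as more items and raters are sampled; which term dominates in practice is exactly what the comparisons below address.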

Multiple choice questions versus open-ended questions

Stalenhoef-Halling et al. (1990) administered a test consisting of two parts to 122 second-year medical students (of a 6-year curriculum) at the end of a problem-based learning unit on 'Elderly People'. The first part consisted of seven short-answer open-ended questions (OEQs), whereas the second part comprised 114 true/false questions (TFQs). The content of both parts was matched as far as possible. The OEQs were rated by 14 academic staff members, with pairs of raters scoring one question for all examinees. None of the raters were experts in the tested field; they were provided with brief written scoring keys. Using generalizability analysis the investigators reported generalizability coefficients separately for TFQs and OEQs (the latter using either one or two raters). Generalizability coefficients were also reported for all TFQ tests from year 2, by pooling variance components from all similar TFQ tests administered in that year. For comparability reasons the coefficients were expressed as a function of testing time. Table 1 reports their results. The generalizability coefficients (analogous to alpha) in the table are best thought of as the expected correlation between scores derived from similar, but not identical, examinations using a different sample of items (and raters for OEQs). They represent the overall reliability of scores found with these two formats, including the subjectivity introduced by disagreement between raters.


Table 1. Generalizability coefficients of true/false questions (TFQ) and open-ended questions (OEQ) (from Stalenhoef-Halling et al. 1990)

                    True/false questions             Open-ended questions
                    (90 questions per hour)          (7 questions per hour)
Testing time        All year 2       Experimental    One           Two
(hours)             tests            test            rater         raters
1                   0.64             0.43            0.58          0.61
2                   0.78             0.60            0.74          0.76
3                   0.84             0.70            0.81          0.82
4                   0.88             0.75            0.85          0.86
5                   0.90             0.79            0.88          0.89
6                   0.92             0.82            0.89          0.90

Table 1 clearly shows that the reliability of open-ended questions may be comparable to the reliability of an 'objective' question format. In fact, the reliability of the experimental TFQ test is even lower than the OEQ reliability, although this is probably an artifact of the specific test, since the values for other TFQs from that year were higher. The generalizability coefficients for using one versus two raters show only marginal differences. Apparently, the subjectivity involved in rater disagreement has a limited influence on the overall precision of the test. (From Table 1 one might conclude that OEQs have a higher precision, because fewer questions are needed. However, answering OEQs demands more time, and in terms of testing time needed both formats are approximately equal. Apparently OEQs contain more information per item, but at the cost of the testing time needed. The overall testing time, therefore, seems to be a more important determinant than the number of items in a test; a finding which has been demonstrated before (Van der Vleuten & Swanson 1990).)

Norman et al. (1989), in a study to investigate the effects of medical context and format, used multiple choice questions (MCQs) and two kinds of modified essay questions (MEQs): directed essay questions (DEQs), in which specific questions were used requiring point-form answers, and undirected essay questions (UEQs), in which questions were phrased globally and answers were allocated as much as half to a full page. The content across formats was completely matched. Content measures were gathered from all three formats, and a separate process score was gathered for the UEQ. Six problems were given to 12 subjects; two problems were in each format, one problem containing questions in the problem context, the second containing questions in a randomized context. Each problem contained questions from each format. On a subset of the data, scores were gathered from two raters to estimate the interrater reliability of the DEQ and UEQ. These were quite acceptable and ranged from 0.80 to 0.97. The investigators reported correlations between scores within and between formats. The correlation within a format signifies the correlation between two problems. Using this information, the overall reproducibility of scores for the (two-problem) test may be inferred. For the MCQ this results in a generalizability coefficient of 0.68, for the DEQ in 0.45, for the UEQ content score 0.62, and for the UEQ process score 0.66. Except for the DEQ, the values for MCQs and tests requiring free responses were comparable. These results again indicate that more subjective measures may yield comparable precision of scores to an objectified format, and that the subjectivity introduced by rater inferences is of less importance.
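
The inference from within-format correlations to the reproducibility of the two-problem composite uses the Spearman-Brown relation. A minimal sketch follows; the single-problem correlations in it are back-calculated from the composite coefficients quoted above, so they are illustrative assumptions rather than values reported by Norman et al.:

    # Illustrative sketch: the single-problem correlations below are
    # back-calculated from the composite coefficients quoted in the text,
    # so they are assumptions for illustration, not reported values.

    def composite_reliability(r_single, n_parts=2):
        """Spearman-Brown: reliability of the sum of n parallel parts."""
        return n_parts * r_single / (1 + (n_parts - 1) * r_single)

    for fmt, r_single in [("MCQ", 0.52), ("DEQ", 0.29),
                          ("UEQ content", 0.45), ("UEQ process", 0.49)]:
        print(fmt, round(composite_reliability(r_single), 2))
    # prints approximately 0.68, 0.45, 0.62 and 0.66 - the coefficients quoted above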



Check-lists versus rating scales in essay examinations

Norcini et al. (1990) compared two scoring methods on an essay examination developed for the American Board of Internal Medicine. The test consisted of a series of 12 focused questions on issues of diagnosis or clinical management, requiring a brief essay-type response. Subjects were 98 residents in internal medicine. Two scoring methods were devised: an 'analytic' check-list score based on an answer key of desirable actions, weighted to create a total score, and a global score consisting of a single rating on a 9-point scale. The analytic key was administered by (1) three non-medical individuals who were trained for 14 hours and (2) two fellows in internal medicine. Global scoring was done by 12 internists, who were not trained. The reliability of the 2.5-hour test, based on the analytic non-doctor scorers, was 0.36, requiring 18 hours of testing to achieve a reliability of 0.8. For the global scores, with untrained doctor raters, the test reliability was 0.63, requiring 5.5 hours of testing time to achieve a reliability of 0.8. Unfortunately, the paper does not report an analysis of the fellow-generated analytic scores, thus it might be argued that the deficiency of the analytic scores reflects the use of non-experts, rather than scoring differences. However, the authors do report a correlation between non-expert and fellow-generated analytic scores of 0.87, thus it appears unlikely that the difference is due to lack of expertise. In any case, one of the apparent advantages of check-list strategies is that the process of objectification reduces the dependence on judgement. If this is accompanied by a cost in reliability of this order, even after extensive training, the purported advantage is illusory.
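
The testing-time figures quoted from Norcini et al. can be checked with a rough calculation; the 2.5-hour base length is our reading of the text and therefore an assumption, as is the use of the Spearman-Brown relation for scaling reliability with testing time, but the result lands close to the 18 and 5.5 hours reported:

    # A rough arithmetic check of the Norcini et al. figures; the 2.5-hour base
    # length and the Spearman-Brown scaling are assumptions for illustration.

    def hours_needed(r_observed, base_hours, target=0.80):
        """Testing time required to reach `target`, scaling from `r_observed`."""
        factor = (target / (1 - target)) / (r_observed / (1 - r_observed))
        return base_hours * factor

    print(round(hours_needed(0.36, 2.5), 1))  # analytic check-list: about 17.8 hours
    print(round(hours_needed(0.63, 2.5), 1))  # global rating: about 5.9 hours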

Check-lists versus rating scales in performance-based tests

In the last decade, performance-based tests, e.g. OSCEs (Harden & Gleeson 1979) and patient-based tests (Stillman et al. 1976), have become very popular for the assessment of clinical skills (Hart et al. 1986; Hart & Harden 1987; Bender et al. 1990). In these tests, examinees are confronted with a practical situation and are required to perform a variety of clinical tasks. These tasks may include history-taking, physical examination procedures, interpreting laboratory tests, etc., and performance is directly observed and scored by raters.

An issue of measurement is whether check-lists or rating scales should be used for scoring the performance of examinees. Check-lists generally contain a long list of detailed items of adequate performance, and the rater's task is simply to check off these items, using 'Yes/No' responses. Rating scales, on the other hand, generally contain fewer items, and focus more on broader issues of performance. Ratings typically use Likert-type scales, in which raters are required to score performance in a number of boxes, ranging from very poor to very good performance. The number of boxes may vary from three to nine, sometimes even more. Very few studies have investigated the reliability of check-lists and rating scales in performance-based tests by directly comparing the reproducibility of both formats. Van Luijk & Van der Vleuten (1991) reported a study in which the (teacher) raters completed both a check-list and a global rating scale in a performance-based testing situation (Maastricht Skills Tests). These tests assess a broad variety of technical and clinical skills at various levels in the educational programme, including procedural skills, communication skills and laboratory skills. The check-lists are based on 'standards' used in the educational programme (Bouhuijs et al. 1987), and are generally very explicit and detailed. Depending on the content of the stations, the number of items on the check-lists ranges from 30 to 120. Even communication skills are rated by these detailed check-lists (Kraan et al. 1989). By contrast, the global rating scale in this study consisted of one item asking for the overall impression of examinee performance in a particular station (general impression rating), and three more specific items (specific ratings) referring to the technique of the skill, the fluency with which the skill was carried out, and the quality of the patient approach. Ratings were given on a 10-point scale, varying from very unsatisfactory performance (1) to excellent performance (10). The reported interrater reliability was calculated on a subset of stations where multiple scores were gathered using co-observers (n = 649). Interrater reliability was 0.83 for the check-lists, 0.72 for the general impression rating, and 0.72 for the specific ratings. The findings on the overall reproducibility of scores, totalled across a number of tests (by pooling variance components; 346 examinees), are reported in Table 2.

Table 2. Generalizability coefficients of check-lists and rating scales in a performance-based testing situation (from Van Luijk & Van der Vleuten 1991; on average, four stations per hour)

Testing time        Check-lists      General          Specific
(hours)                              impression       ratings
1                   0.44             0.45             0.47
2                   0.61             0.62             0.64
3                   0.71             0.71             0.73
4                   0.76             0.76             0.78
5                   0.80             0.80             0.82
6                   0.83             0.83             0.84

These results quite strikingly indicate only modest differences in reliability between global ratings and check-lists when the overall reproducibility of scores is considered, despite lower interrater reliabilities for the global ratings. Moreover, even when the rating is limited to only one item reflecting the general impression of examinee performance, the same reliability is found. Cohen et al. (submitted for publication) administered a 30-station OSCE to 72 candidates at the internship level, who were foreign-trained doctors applying for admission to a programme preparing them for licensure in Ontario. After each examiner had completed the check-list, he was asked to rate the 'logic of approach' and the attitude to the patient on a rating scale. The generalizability of the test score based on content check-lists for the 30 stations was 0.84, and for the two ratings was 0.83 and 0.75 respectively. The logic of approach scale correlated 0.82 with the check-list total score, and the attitude to patient scale correlated 0.58. A second OSCE was administered to graduates of the programme (n = 33), and used a total of 20 stations, of which 17 were patient-based. This time, examiners assigned a single global score: 'Rate the extent to which the candidate demonstrated an organized approach to the examination of the patient'. The results were essentially the same; the generalizability of the check-list total score was 0.69 versus 0.60 for the total global score. On individual stations, the two scores were correlated between 0.11 and 0.75 (median = 0.58).
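
Why a clear interrater advantage for check-lists can leave the overall coefficients virtually untouched is easy to illustrate with assumed variance components. The numbers in the sketch below are chosen only to mimic the pattern of Table 2 and the interrater values quoted above; they are not estimates taken from either study:

    # Illustrative only: assumed variance components, not study estimates.
    # p = person variance, ps = person-by-station (content specificity),
    # r = rater-related error for a single observation.

    def interrater(p, ps, r):
        """Expected correlation between two raters scoring the same station."""
        return (p + ps) / (p + ps + r)

    def overall(p, ps, r, n_stations):
        """Coefficient for a test of n_stations with a different rater per station."""
        return p / (p + (ps + r) / n_stations)

    for label, ps, r in [("check-list", 4.0, 1.0), ("global rating", 3.3, 1.7)]:
        print(label,
              round(interrater(1.0, ps, r), 2),   # 0.83 versus 0.72
              round(overall(1.0, ps, r, 16), 2))  # both about 0.76 at 16 stations (about 4 hours)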


These studies clearly suggest that performance assessment by means of rating scales is often as reliable as, or even more reliable than, objectified check-lists, despite the fact that the former are considered more subjective and the latter more objective. By comparing psychometric studies in patient-based examinations using check-lists and rating scales, Van der Vleuten & Swanson (1990) came to the same conclusion: although interrater reliabilities tended to be generally better for check-lists, the overall reproducibility of scores indicated no systematic differences. The next section examines similar comparisons on written examinations.

Structure of scoring keys

An analogue to the performance-based testing situation is the scoring of open-ended written questions, essays or short-answer questions, using differently structured scoring keys, allowing for more or less subjective influence in the rating process. Frijns et al. (1990) addressed this issue by comparing four different scoring keys, varying in the amount of structure given to the rater. The four scoring keys used were:
(1) global judgement: no key was given to the rater; a maximum of 100 points could be given to each answer;
(2) short-answer key: the most important topics of the ideal answer were given; each topic was to be assigned a score;
(3) global check-list: a list of items together framing the ideal answer; each item was to be scored;
(4) elaborated check-list: an elaborated list of items with detailed information per item, including synonyms.
The test consisted of 11 open-ended questions referring to eight different patient cases, administered to 40 medical students. Eight teacher raters scored the answers using all four scoring methods in a balanced rotating scheme. Variance components were estimated and generalizability coefficients were reported for each of the scoring keys for a variety of test lengths, using single and multiple ratings. These are summarized in Table 3.



Table 3. Generalizability coefficients for open-ended questions using four differently structured scoring keys (from Frijns et al. 1990; on average, four cases per hour)

                 Global judgement     Short-answer key     Global check-list    Elaborated check-list
Testing time     One       Two        One       Two        One       Two        One       Two
(hours)          rater     raters     rater     raters     rater     raters     rater     raters
1                0.38      0.46       0.37      0.42       0.38      0.45       0.44      0.51
2                0.55      0.63       0.54      0.63       0.55      0.62       0.61      0.67
3                0.64      0.72       0.64      0.69       0.65      0.71       0.70      0.76
4                0.71      0.77       0.70      0.75       0.71      0.77       0.76      0.81
5                0.75      0.81       0.75      0.79       0.75      0.80       0.80      0.84
6                0.78      0.84       0.78      0.82       0.79      0.83       0.82      0.86

The generalizability coefficients for the most structured scoring key, the elaborated check-list, were slightly higher. There were no differences between scores derived with the global judgement, the short-answer key and the global check-list. The improvement in reliability from using multiple ratings per question was approximately invariant across the different scoring keys, indicating an equal rater influence on the overall reliability irrespective of the structure given to the rating process. Based on the variance components, the investigators demonstrated that the less structured scoring keys introduced some inconsistency of rater scoring across cases and examinees; however, the net effect on the overall reproducibility of scores was negligibly small.

Discussion

The above studies consistently seem to indicate that objectification does not lead to a dramatic improvement in reliability. Scores from subjective methods are often as reliable as, and in the case of the essay examination considerably more reliable than, scores derived from objectified methods. The error caused by subjectivity, introduced by allowing human inferences in the assessment process, seems to have a limited influence on the overall reproducibility of scores. Compared to this 'subjectivity error', the other error sources in the assessment process influence the reliability to a far greater extent. Overall reproducibility of scores seems far more dominated by the variability of examinee performance across content (items, stations or cases). The performance of an examinee on one item, station or case is a poor predictor of performance on the next one. In the problem-solving literature this has been termed the 'content specificity problem' (Elstein et al. 1978), and it has been found consistently across formats (De Graaff et al. 1987; Page et al. 1990; Van der Vleuten & Swanson 1990). As a consequence, many items are needed to achieve minimal test reliability, thus requiring relatively long testing times. With longer tests, the error arising from subjectivity is further diminished, because effects at the item level are 'averaged out' across items. (This assumes nesting of the error source within items, which is an important consideration in test development and test administration. For instance, a good strategy in a performance-based test is to have different raters in different stations, as opposed to one rater following an examinee across stations. The same holds for a written test: it is better to have raters rating one question for all examinees than to have the same rater rate all questions for a limited number of examinees. Rater stringency will thus be balanced within examinees (and averaged out), instead of between examinees (favouring some and disadvantaging others).) From the above it follows that objectified methods should not automatically be preferred over subjective methods for assessment purposes because they are considered more reliable.
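
The nesting argument in the parenthesis above can also be made concrete with a small simulation. The design and variance values below are illustrative assumptions, not data from any of the studies reviewed:

    import numpy as np

    # Illustrative simulation of the nesting argument; variance values are assumed.
    rng = np.random.default_rng(0)
    n_examinees, n_stations = 200, 16
    ability = rng.normal(0, 1.0, n_examinees)                # true differences between examinees
    content = rng.normal(0, 2.0, (n_examinees, n_stations))  # case/content specificity

    # Design A: one rater follows each examinee across all stations, so that
    # rater's stringency is added to every score of that examinee.
    stringency_per_examinee = rng.normal(0, 1.0, n_examinees)
    scores_a = ability[:, None] + content + stringency_per_examinee[:, None]

    # Design B: each station has its own rater for all examinees, so stringency
    # is shared by everyone within a station and cancels out between examinees.
    stringency_per_station = rng.normal(0, 1.0, n_stations)
    scores_b = ability[:, None] + content + stringency_per_station[None, :]

    for label, scores in [("rater nested in examinee", scores_a),
                          ("rater nested in station", scores_b)]:
        total = scores.mean(axis=1)
        print(label, round(np.corrcoef(ability, total)[0, 1], 2))
    # The second design typically correlates noticeably higher with true ability
    # (roughly 0.9 versus 0.7 with these assumed variances).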


Objectified methods do not necessarily improve reliability, and subjective methods, depending on the testing circumstances (e.g. the structure of the testing situation, the sampling of items and raters), may yield equally reliable test information. The foregoing should not be interpreted as a plea for subjectivity, nor as a petition to regress to Victorian ages of assessment. On the contrary, given the distinction made earlier between objectivity and objectification, the trend in recent decades to strive for objectivity in assessment is laudable, and indeed much progress has been achieved. Tests and examinations have followed general scientific research by abandoning subjective influences; they have become more objective and more standardized, scores have become more comparable, decisions have become more fair, etc. By no means is this article intended as a criticism of these developments. However, as far as objectification is concerned, some serious doubts have been expressed. Again, this should not be interpreted as a general disapproval of strategies for objectification, but rather as a rejection of the opposite argument: the tacit assumption that objectified measures are inherently better and consequently to be preferred over more subjective measures. The assumption of automatic superiority of objectified measures finds no support in the studies reported in this paper. Since objectivity is a matter of degree, the basic question becomes the reliability or reproducibility of the measurement results. Open-ended questions, rating scales, global scoring keys, etc. are not intrinsically less reliable than their objectified counterparts, and may, at some levels of education, measure competencies similar to those assessed by more standardized forms such as multiple choice questions or check-lists. The preference for and use of measurement methods should consequently not be based exclusively on the assumption of the inherent superiority of objective measures. Instead, the choice of a measurement method should in the first place be determined by the educational context, or the purpose of the testing situation. Given a particular testing situation, a more subjective method may be preferred over an objective one, and vice versa. Pitfalls associated with the selection of appropriate measurement techniques are discussed in the next article (Norman et al. 1991).



References

Bender W., Hiemstra R.J., Scherpbier A.J.J.A. & Zwierstra R.P. (eds) (1990) Teaching and Assessing Clinical Competence. Boekwerk Publ., Groningen.
Bouhuijs P.A.J., Van der Vleuten C.P.M. & Van Luijk S.J. (1987) The OSCE as part of a systematic skills training approach. Medical Teacher 9, 183-91.
Cohen R., Rothman A.I., Poldre P. & Ross J. (submitted for publication) Validity and generalizability of global ratings.
De Graaff E., Post G.J. & Drop M.J. (1987) Validation of a new measure of clinical problem-solving. Medical Education 21, 213-18.
De Graaff E. (1989) A test of medical problem-solving scored by nurses and physicians: the handicap of expertise. Medical Education 23, 381-6.
De Groot A.D. (1969) Methodology: Foundations of Inference and Research in the Behavioural Sciences. Mouton, The Hague.
Elstein A.S., Shulman L.S. & Sprafka S.A. (1978) Medical Problem Solving: An Analysis of Clinical Reasoning. Harvard University Press, Cambridge, Massachusetts.
Feletti G.I. (1980) Reliability and validity studies of the modified essay question. Journal of Medical Education 55, 933-41.
Frijns P.H.A.M., Van der Vleuten C.P.M., Verwijnen G.M. & Van Leeuwen Y.D. (1990) Essay questions and their scoring problems. In: Teaching and Assessing Clinical Competence (ed. by W. Bender, R.J. Hiemstra, A.J.J.A. Scherpbier & R.P. Zwierstra), pp. 466-71. Boekwerk Publ., Groningen.
Guilford J.P. (1954) Psychometric Methods. McGraw-Hill, New York.
Harden R.M. & Gleeson F.A. (1979) Assessment of clinical competence using an objective structured clinical examination (OSCE). Medical Education 13, 41-54.
Hart I.R., Harden R.M. & Walton H.J. (eds) (1986) Newer Developments in Assessing Clinical Competence. Heal Publications, Montreal.
Hart I.R. & Harden R.M. (eds) (1987) Further Developments in Assessing Clinical Competence. Can Heal, Montreal.
Hofstee W.K.B. (1985) Liever klinisch: grenzen aan het objectiviteitsbeginsel bij beoordeling en selectie (Rather clinical: limits of the objectivity principle in assessment and selection). Nederlands Tijdschrift voor de Psychologie 40, 459-73.
Knox J.D.E. (1980) How to use the modified essay question. Medical Teacher 2, 20-24.
Kraan H.F., Crijnen A.A.M., Zuidweg J., Van der Vleuten C.P.M. & Imbos Tj. (1989) Evaluation of medical interviewing skills in the undergraduate curriculum: a checklist for medical interviewing skills. In: Communication with Medical Patients (ed. by M. Stewart & D. Roter), pp. 166-77. Sage Publications, New York.



McGuire C.H. & Babbott D. (1967) Simulation technique in the measurement of problem-solving skills. Journal of Educational Measurement 4, 1-10.
Muzzin L.J. (1985) Oral examinations. In: Assessing Clinical Competence (ed. by V.R. Neufeld & G.R. Norman). Springer, New York.
Newble D.I., Baxter A. & Elmslie R.G. (1979) A comparison of multiple choice and free response tests in examinations of clinical competence. Medical Education 13, 263-8.
Neufeld V.R. (1985) Written tests. In: Assessing Clinical Competence (ed. by V.R. Neufeld & G.R. Norman), pp. 94-118. Springer, New York.
Norcini J.J., Diserens D., Day S.C., Cebul R.C., Schwartz J.S., Beck L.H., Webster G.D., Schnabel T.G. & Elstein A.S. (1990) The scoring and reproducibility of an essay test of clinical judgement. Academic Medicine (suppl.) 65, S41-S42.
Norman G.R., Davis D.A., Painvin A., Lindsay E., Rath D. & Ragbeer M. (1989) Comprehensive assessment of clinical competence of family/general physicians using multiple measures. Proceedings of the Twenty-eighth Annual Conference of Research in Medical Education, Washington, DC.
Norman G.R., Van der Vleuten C.P.M. & De Graaff E. (1991) Pitfalls in the pursuit of objectivity: issues of validity, efficiency and acceptability. Medical Education 25, 119-26.
Page G., Bordage G., Harasym P., Bowmer I. & Swanson D.B. (1990) A new approach to assessing clinical problem-solving skills by written examination: conceptual basis and initial pilot test results. In: Teaching and Assessing Clinical Competence (ed. by W. Bender, R.J. Hiemstra, A.J.J.A. Scherpbier & R.P. Zwierstra), pp. 403-7. Boekwerk Publ., Groningen.

Stalenhoef-Halling B.F., Jaspers T.A.M., Fiolet J.F.B.M. & Van der Vleuten C.P.M. (1990) The feasibility, acceptability and reliability of open-ended questions in a problem-based learning curriculum. In: Teaching and Assessing Clinical Competence (ed. by W. Bender, R.J. Hiemstra, A.J.J.A. Scherpbier & R.P. Zwierstra). Boekwerk Publ., Groningen.
Stillman P., Sabers D. & Redfield D. (1976) The use of paraprofessionals to teach and evaluate interviewing skills in medical students. Pediatrics 57, 769-74.
Streiner D.L. (1985) Global rating scales. In: Assessing Clinical Competence (ed. by V.R. Neufeld & G.R. Norman), pp. 119-41. Springer, New York.
Van Luijk S.J. & Van der Vleuten C.P.M. (1991) A comparison of checklists and rating scales in performance-based testing. In: More Developments in Assessing Clinical Competence (ed. by I.R. Hart & R.M. Harden). Can Heal, Montreal.
Van der Vleuten C.P.M. & Swanson D.B. (1990) Assessment of clinical skills with standardized patients: state of the art. Teaching and Learning in Medicine 2, 58-76.

Received 20 November 1989; editorial comments to authors 11 January 1990; accepted for publication 11 September 1990
