CONTROVERSY

The MCQ Controversy - A Review

JOHN ANDERSON


Dr John Anderson, MB, FRCP, is Academic Sub-Dean, The Medical School, University of Newcastle upon Tyne, The Royal Victoria Infirmary, Newcastle upon Tyne, NE1 4LP, UK.

It is now two years since my argument in favour of multiple choice questions and the late Sir George Pickering’s views against the technique were presented in Medical Teacher (Anderson 1979; Pickering 1979). Since then there has been a lively and stimulating continuing correspondence in the journal’s columns on this topic. In particular, the letter from Mr Barker and Dr Maatsch (see Appendix 1) raised some important issues which deserve to be answered but at the same time included some statements that I find difficult to accept. Mr Barker and Dr Maatsch also interpreted some of my own views in a way which I do not think is justified and drew some conclusions from my paper which I feel are unwarranted. I delayed replying to this letter at the time partly because of the pressure of other commitments and partly because I wished to wait to see what further correspondence appeared. In this regard I was not disappointed, since the letter from Dr Marshall (see Appendix 2) also makes some fundamental points worthy of further discussion. Like Dr Marshall I am delighted to note the obvious involvement and interest that has been generated in this topic at an international level. However, what began as a simple pair of arguments ‘for’ and ‘against’ MCQs has developed, as I felt it would, into a useful exchange of views and opinions on the more detailed aspects of MCQ technique and usage, the bulk of the correspondence seeming largely to accept the MCQ principle. There is, nevertheless, clearly (and inevitably) a lack of agreement on the finer points. Perhaps this is a good thing; too much conformity is not necessarily an advantage and in a contentious area like this it would be unrealistic to expect complete agreement on all aspects of such a wide-ranging topic. Nevertheless, perhaps the time has come, if not to sum up and to close the discussion, then at least to review the present situation and to emphasize the salient points which the correspondents in Medical Teacher have made.

The MCQ Method

There is no doubt that the views of all of these contributors and correspondents on the subject of MCQs could be regarded as ‘controversial’ in the literal sense of the word, but I suspect that there is a much larger measure of agreement between all of us than may seem to be the case. Certainly, when I had the pleasure of meeting Dr Marshall on his visit to this country early last year, we found during a lengthy discussion very little indeed on which we actually disagreed, although we had different (but not mutually exclusive) views on certain aspects of the MCQ method and its use in examinations. Of course, a great deal depends on the aims and objectives of the examination in which MCQs are used and, in particular, the level at which the ‘pass-mark’ is set if, indeed, the examination is used exclusively to make a pass/fail decision, but more of this later. I would, however, like to remind Mr Barker and Dr Maatsch that the aim of my paper “For MCQs” was to argue in favour of the technique in general and not to provide a detailed critique of the types of questions used and their advantages and disadvantages in different contexts. This aspect is really quite specialized and my primary object was to look at the broad concept and to answer as best I could the criticisms that have been made of it. It is, perhaps, important to mention at this point that the late Sir George Pickering and I did not see each other’s contributions and both of us argued from first principles as we saw them. It was because of this that I largely confined my discussion to a single question type - the type that I refer to as ‘multiple true/false’. This term has also been used for this type of question by Smart (1976) and Harden (1979). I would define this question as one that presents an initial statement (or stem) followed by a series of five completions (or items), any number of which may be true and any number false. Each item is independent of every other item, each carries equal weighting when marked and the ‘don’t know’ option is an additional, although independent, refinement. I am glad to read in Marshall’s most recent letter that he agrees that the inclusion of this option is an advantage. The marking is simple: if one mark is awarded to each question, then each item correctly identified as either true or false scores 0.2, each item incorrectly identified as either true or false scores -0.2, and every item identified as ‘don’t know’ scores 0.

The need for the deduction of marks for incorrect answers, or ‘counter-marking’, has been discussed both by Fleming et al. (1976) and myself (Anderson 1976). This procedure is essential and ensures that if a candidate marks every item as ‘true’, marks every item as ‘false’ or marks at random, he will obtain a score that does not differ significantly from zero. Fleming and his colleagues (1976) use the synonym ‘independent true/false’ for this format, and in North America this type of question is termed a ‘true/false, cluster variety item’ (Mehrens and Lehmann 1975). Other forms of question may, as Barker and Maatsch claim, be more widely used in the USA, but there is no doubt at all that this is the form of question, whatever we choose to call it, that is most widely used in the UK, both in undergraduate and postgraduate examinations. I assume that it is this type of question that Marshall refers to in his recent letter, although he does not make this entirely clear, referring simply to the ‘true/false’ or ‘true/false/don’t know’ item.
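The arithmetic of this scheme is simple enough to set out explicitly. The sketch below is purely illustrative - the function name and data layout are my own invention, not drawn from any published scoring program - but it applies the marking rule to a single five-item question and also demonstrates the point made above, that counter-marking leaves the candidate who guesses at random with an average score close to zero:

```python
# A purely illustrative sketch of the marking scheme described above:
# each of the five items scores +0.2 if correctly marked true or false,
# -0.2 if incorrectly marked, and 0 if marked 'don't know'. The names
# and data layout are invented for this example.
import random

def score_question(key, responses):
    """key: five booleans (the answer key); responses: True, False or
    None, where None stands for 'don't know'."""
    mark = 0.0
    for correct, given in zip(key, responses):
        if given is None:            # 'don't know' neither gains nor loses
            continue
        mark += 0.2 if given == correct else -0.2
    return mark

key = [True, False, True, True, False]

# Three items right, one wrong, one 'don't know': 3(0.2) - 0.2 = 0.4
print(round(score_question(key, [True, False, False, True, None]), 1))

# Counter-marking makes blind guessing unprofitable: over many such
# questions a candidate answering at random averages close to zero.
random.seed(1)
average = sum(score_question(key, [random.choice([True, False])
                                   for _ in key])
              for _ in range(10_000)) / 10_000
print(round(average, 3))             # close to 0.0
```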


Nomenclature (Appendix 3)

Unfortunately, the foregoing discussion reveals one of the basic problems. Many of us use different terms for the same type of question and we are not always exactly clear what our colleagues are talking about - the old problem of communication. The term ‘multiple true/false’ as originally used by Hubbard and Clemans (1961) and by Hubbard (1971) does not refer exclusively to the type of question that I have described. These authors used the term to describe one form of what they regarded as the three basic types of MCQ. Their multiple true/false category includes two different types - item type K (multiple completion - see Appendix 3) and item type X; only the latter conforms to the type of question referred to by other authors in the UK and by myself as multiple (or independent) true/false. When item type X questions are used, the candidate is instructed to respond separately to each of four or five choices, so that any combination of ‘rights’ and ‘wrongs’, from all right to all wrong, may be permitted. As Hubbard (1971) points out, this type of question gives the examinee credit for each correct choice within the set, and he therefore gets partial credit for partial knowledge. However, the fact that item type K is also included under the general heading of ‘multiple true/false’, and the subsequent limitation of this designation in UK practice to questions that correspond to item type X, has led to a certain degree of confusion, misunderstanding and lack of uniformity with regard to MCQ nomenclature. In general, only three types of question are presently used at all widely in medical examinations in the UK: the multiple true/false form as I have described it (equivalent to Hubbard and Clemans type X), the one-from-five (Lennox 1974) and the relationship analysis type (although other variations are occasionally seen in other situations - particularly in the O- and A-level examinations of the GCE). The second type of question is referred to by Hubbard and Clemans (1961) and Hubbard (1971) as the ‘one/best response type’ (or ‘five-choice completion’) - item type A, and these authors refer to the relationship analysis type as item type E (see Appendix 3).

Other complex American formats described by Hubbard and Clemans and by Hubbard in their classical books include the ‘five-choice association’ (item type B); the ‘four-choice association’ (item type C; types B and C are described by Hubbard and Clemans as ‘matching types’); the ‘excluded term’ (item type D); ‘multiple completion’ (item type K - see above); and the one/best response type applied to the specific problems of clinical diagnosis and management - type G. These types are not widely used in the UK but are described in detail by Fleming et al. (1976) and myself (Anderson 1976), and the former authors give useful examples of each type. They are classified, together with brief descriptions and examples, in Appendix 3.
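Since much of the difficulty is purely terminological, it may help to set the classification out as a simple lookup table. The one-line descriptions below are my own condensed paraphrases of the formats discussed here and in Appendix 3, not the original authors’ definitions:

```python
# A condensed summary of the nomenclature discussed above, keyed by the
# Hubbard and Clemans item-type letters. The descriptions are my own
# paraphrases, not the original authors' wording.
ITEM_TYPES = {
    "A": "one/best response from five choices (the UK 'one-from-five')",
    "B": "five-choice association (a 'matching' type)",
    "C": "four-choice association (a 'matching' type)",
    "D": "excluded term",
    "E": "relationship analysis",
    "G": "one/best response applied to clinical diagnosis and management",
    "K": "multiple completion (all-or-none combinations of items)",
    "X": "multiple (independent) true/false - each item marked separately",
}

for letter, description in sorted(ITEM_TYPES.items()):
    print(f"Item type {letter}: {description}")
```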

The Use of MCQs in Examinations

Fleming and his colleagues (1976) state firmly that, whatever the scoring system used, an MCQ paper can do no more than place the candidates in rank order in terms of their ability to answer that particular paper. Bearing this in mind, the arguments I used in favour of MCQs apply to all question types, although I would freely admit that I feel the simpler forms of question are far preferable to, and more appropriate than, the complex variants. I would also stick to my original statement that we are fortunate in the UK that the computer programs used to score MCQs have been adapted to the forms of questions used and to the aims and requirements of the examiners. By contrast, in North America examiners have sometimes had to devise new and complicated question formats to fit into a rigid pre-existing computer program (which may only have the facility to select one out of four or five items, regardless of the form of the question) in order to provide a degree of flexibility and variety within the limitations of the program. It seems unlikely that most of these complex question forms originated because of any inherent merit that they might possess - they do not necessarily add a new dimension of sophistication and reliability to the basic technique but, rather, tend to confuse and baffle the candidate, who is required to follow detailed and complicated instructions (which he must constantly bear in mind) when he answers the questions. It has been argued that such complex questions test reasoning ability and, to some extent, this is so; nevertheless, they are more in the nature of intelligence tests than tests of medical knowledge. Such complex questions are rigid and artificial in their construction and present an ‘all-or-none’ situation - complete knowledge gains full marks whereas partial knowledge is not rewarded. The ‘test-wise’ candidate can sometimes accumulate marks by a process of exclusion based on his knowledge of the way such questions are constructed (particularly type K) rather than on his knowledge of medicine. Perhaps this is a good thing; perhaps not. The multiple (independent) true/false and one-from-five types are easy for candidates to understand, and instructions for answering such questions are simple and unambiguous. The relationship analysis type of question, although it is often used and has certain merits, is more difficult for the candidate to deal with.
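The contrast between the all-or-none marking of the complex formats and the partial credit given by the independent true/false form can be made concrete with a small, purely illustrative sketch; the figures and names are invented for the example:

```python
# Invented example: a candidate who knows four of the five items in a
# question. Under the independent true/false scheme he earns partial
# credit; under an all-or-none format (such as type K, where only one
# fixed combination of items scores) he earns nothing.
key = [True, False, True, True, False]
answers = [True, False, True, True, True]   # four right, one wrong

tf_mark = sum(0.2 if given == correct else -0.2
              for given, correct in zip(answers, key))
all_or_none_mark = 1.0 if answers == key else 0.0

print(round(tf_mark, 1), all_or_none_mark)   # 0.6 versus 0.0
```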


If our aim is to test a knowledge of medicine rather than knowledge of the English language and its intricacies and the ability to remember accurately complicated instructions, there is little doubt that the multiple true/false and one-from-five types of question are to be preferred in that they are ‘purer’ and contain much less ‘background noise’. Perhaps the one-from-five type may place more emphasis on the ability of the candidate to reason and deduce than the multiple true/false variety, but really good one-from-five questions are very difficult to set and, whereas they fundamentally test only one item of knowledge, the multiple true/false question can be used to test five separate and independent items. This question type is the most reliable, reproducible and internally consistent method we have of testing recall of factual knowledge objectively (Smart 1976), as well as being the most versatile and flexible.

What Do MCQs Test?

In my paper “For MCQs” I stated that such questions can be used to test higher taxonomic levels than simple factual recall; I referred to questions that had been designed to test the understanding of basic facts, principles and concepts, the ability to understand and to interpret data, to solve relevant problems and to evaluate a total situation. Such tests have been used in these areas with considerable success and have been included in our final MB BS examination in Newcastle for ten years (Anderson 1980). Later in my paper I suggested that MCQs should be used primarily to test factual recall - by which I meant (in part) that if factual recall is to be tested, then MCQs are the best method we have available to do this. I do not feel that this latter statement is inconsistent with the former, nor that it indicates any waning of enthusiasm, as Barker and Maatsch seem to think. However, to test higher taxonomic levels, such as those referred to above, requires great care and skill in the setting of questions and, although the discriminatory power of such questions may remain high, the internal reliability of the examination as a whole may be slightly reduced (Anderson 1981). As to how high up the taxonomic ladder one can climb, I indicated this in my paper. In other words, I feel that MCQs can test up to and including Stage (e) in the taxonomy* of Charvat, McGuire and Parsons (1968), although it is difficult to see how MCQs can ever test the highest level in their taxonomy - the ability to create a new synthesis. For this, other assessment methods must be used. Furthermore, this deals only with the cognitive domain - to test in the psychomotor and affective areas other methods, again, may be needed.

*The CMP classification: (a) knowledge of fundamental facts, concepts, principles, laws, methods and procedures; (b) understanding of these facts, concepts, etc.; (c) ability to understand and interpret data; (d) ability to solve relevant problems; (e) judgement in evaluating a total situation; (f) ability to create a new synthesis.


I disagree with Barker and Maatsch’s contention that, although all five items in the multiple true/false question are operationally independent, the items within the question are not statistically independent. Up to a point there is some truth in the statement that “if the stem presents a familiar situation there is apt to be high correlation of performance on all items in the cluster; if unfamiliar, then low correlations”, but it is self-evident that this is by no means always the case - it depends on the construction and content of the question. A stem popular in the UK is “The following statements are correct:”. This stem does not describe any sort of situation at all.

Types of Tests in which MCQs are Used

Barker and Maatsch state that “the distinction made between ‘achievement’ and ‘discriminatory’ tests is poorly explained” in my paper. I take the point they make that such a distinction is important and I do refer to this distinction. However, to define these terms and to discuss them in detail was not within the terms of reference of my paper, and a full discussion of this topic would require a full-length paper in its own right. It might be better to use the terms criterion-referenced instead of ‘achievement’ and peer- or norm-referenced instead of ‘discriminatory’. This alternative nomenclature might help to clarify the situation. Whatever we care to call them, I agree completely that both types of test should discriminate accurately. I cannot understand how my statement that well-written questions “will discriminate accurately between candidates on the basis of their knowledge of the topic being tested” could possibly be interpreted as meaning that well-written tests “are discriminating and less well-written tests are achievement tests”. This is a completely illogical and unwarranted inference and I can only assume that there is again a problem in nomenclature. A peer-referenced test signifies that a candidate’s performance is assessed in relation to that of the other candidates - his peers. A criterion-referenced system is one in which candidates are assessed in relation to an external standard of performance set by the examiners. In either case, an accurate, ranked order of candidates is vital. The concept of criterion reference implies that there is a body of knowledge which a candidate must possess in order to pass. A paper set to test this basic knowledge will, of course, be easy for most candidates; a negatively skewed distribution is therefore to be expected and the pass rate will be high. This has the advantage of placing the ‘pass-mark’ on the tail of the curve, where changes in its numerical value will move the smallest number of candidates from the pass to the fail category or vice versa. A very difficult examination, on the other hand, will tend to produce a positively skewed distribution and the same argument will apply, although this time affecting those candidates with the highest scores. Ranked order must therefore be accurate at the extremes of the skewed distribution curve in each case. An accurately ranked order is equally important in a peer-referenced test, and all examiners familiar with such tests will realize how difficult it is to decide on an appropriate pass-mark objectively.


However, although this may often seem to be an arbitrary decision, the use of ‘marker’ questions and, if appropriate, a comparison with candidates’ performance in other and independent assessments, in association with a study of the performance of previous cohorts of examinees, will give a measure of objectivity to the procedure of setting the pass-mark. In Newcastle, we try to avoid this problem by not defining a strict pass/fail level for our MCQ examinations in absolute terms; the percentage scores are scaled down (I have described the procedure previously; Anderson 1976) and the final score obtained following this scaling-down procedure is only one of several scores derived from different methods of assessment which are taken into account before the pass/fail decision is made on the basis of overall performance in the whole examination. In two of our terminal examinations the MCQ is the only component; in neither case is a pass-mark defined, although the examiners decide, on the basis of the performance of ‘marker’ questions, the mean score for the examination and the performance of previous cohorts, what constitutes a satisfactory score and what is regarded as an unsatisfactory level of performance. Perhaps the difference between this concept and the straight pass/fail decision is a subtle one, but it is important, nonetheless. So far as a criterion-referenced examination is concerned, the difficulty is in defining the criteria that enable the examiners to make their pass/fail decision - in other words, what constitutes an acceptable level of performance. Marshall clearly recognizes this problem and also the fact that marks obtained in a peer-referenced test will tend to fall into a distribution which is close to normal or which is slightly negatively skewed. Nevertheless, I am not certain that his statement that the “true/false item is useful only to identify the top and bottom 20 per cent” is entirely valid. If this is the case, then I would certainly not challenge his statement that this would be acceptable in assessments where the failure rate is either greater than 80 per cent or less than 20 per cent, but I would disagree completely when he says that “the true/false/don’t know item does not allow rank ordering”. Possibly the problem is that I am still not entirely certain what he means by a true/false item. If he is indeed referring to the multiple (independent) true/false question, then I would dispute his statement that “as an instrument of assessment it has limited application”. I think, however, that what he is trying to draw attention to is the problem associated with the distribution of marks - if the pass-mark is set somewhere on the ‘hump’ of the curve, it will occur at the point at which discrimination and ranking is least accurate, where candidates will be ‘bunched’ together. If this is indeed his problem, then he would be well advised to use a criterion- rather than a peer-referenced type of test, since the former will tend to produce a strongly skewed distribution; however, this may not help him greatly if the pass-rate in the examination he describes is indeed in the range of 65 to 80 per cent, and I cannot see any easy solution to his problem whatever type of question he uses.
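The argument about where the pass-mark falls on the distribution can be illustrated numerically. The sketch below is hypothetical - the distribution and all its parameters are invented solely for the demonstration - but it shows why a pass-mark on the tail of a skewed curve moves far fewer candidates than one set near the hump:

```python
# Hypothetical illustration: 1,000 candidates' percentage scores drawn
# from a negatively skewed distribution, as expected for an easy
# criterion-referenced paper. All parameters are invented.
import numpy as np

rng = np.random.default_rng(0)
scores = 100 * rng.beta(8, 2, size=1000)   # bulk of scores near 80-90

def candidates_moved(pass_mark, shift=2.0):
    """Number of candidates whose pass/fail status would change if the
    pass-mark moved by up to `shift` percentage points either way."""
    return int(np.sum((scores >= pass_mark - shift) &
                      (scores < pass_mark + shift)))

print(candidates_moved(50))   # pass-mark on the tail: very few move
print(candidates_moved(85))   # pass-mark near the hump: many move
```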

A Summary of the Controversy to Date

The MCQ controversy seems no longer to be a question of whether MCQs have or have not a place in assessment - this is now generally accepted. Such controversy as persists is largely related to three points:

1. The nomenclature used to describe the various MCQ types.
2. The aims of the test - whether criterion- (achievement) or peer- (discriminatory) referenced. This will certainly determine the way the pass-mark is set and may well determine the type of marking system used.
3. The form of question that is most appropriate for these two types of test.

Answering these questions in reverse order, I am by no means convinced either by Barker and Maatsch or by Marshall that the multiple true/false question as I have defined it is unsuitable for either type of test, but I agree that other question forms might certainly have a useful part to play. The basic aim of a criterion-referenced test is to define the criteria and consequently the ‘pass-mark’ accurately and objectively according to these criteria; the need for precision in ranking and high internal reliability is self-evident. In the peer-referenced test accurate ranking is paramount; this again depends on the discriminatory power and internal reliability of the test as a whole. It is my conviction that these aims can readily be achieved by the use of carefully set multiple true/false questions of an appropriate degree of difficulty. As regards the second question, the aims of the examination can only be defined by the examiners, although I would emphasize again that it is often difficult to obtain general agreement on the criteria used to make the pass/fail decision when a criterion-referenced test is used, and that when a peer-referenced test is set, precision is much reduced if the pass-mark is set near the peak of the distribution curve of candidates’ scores. Concerning the first question, I would suggest that the classification in Appendix 3 might be found useful. Despite the invaluable pioneer work carried out by Hubbard and Clemans, their classification by item type seems less useful than it was, as some of their types are now only seldom used, and descriptive terms to define question formats might well be more helpful - if we are able to agree on these terms! Finally, I would agree whole-heartedly with the last paragraph of Barker and Maatsch’s letter: “it may be that the ‘controversy’ deals less with the MCQ as an item type, but more with the failure of some to develop well-written and clinically relevant items in sufficient numbers to provide reliable tests for medical students”. Even if we agree on nomenclature, aims and appropriate usage, the MCQs we set must fulfil these criteria. Part of the last paragraph of my paper “For MCQs” read: “There are two tasks before us now. The first is to help all examiners to prepare consistently good MCQs and to use them in the right context. The second aim, as quoted by Fleming et al. (1976), is to ‘develop methods of assessment in the psychomotor and affective domains which approach the objectivity of multiple choice questions’”. I am sure this is one point on which all of the correspondents and myself are in complete agreement.


References

Anderson, J., For multiple choice questions, Medical Teacher, 1979, 1, 37.
Anderson, J., The Multiple Choice Question in Medicine, Pitman Medical, London, 1976.
Anderson, J., Data interpretation and patient-management problems using the MCQ format, Medical Education, 1980, 14, 82.
Anderson, J., The reliability and discriminatory power of MCQ papers - a relationship, Medical Education, 1981, 15, 62.
Charvat, J., McGuire, C. and Parsons, V., A Review of the Nature and Uses of Examinations in Medical Education, World Health Organization, Geneva, 1968.
Fleming, P. R., Sanderson, P. H., Stokes, J. F. and Walton, H., Examinations in Medicine, Churchill Livingstone, Edinburgh and London, 1976.
Harden, R. M., Constructing Multiple Choice Questions of the Multiple True/False Type, Association for the Study of Medical Education, Dundee, 1979.
Hubbard, J. P., Measuring Medical Education, Lea and Febiger, Philadelphia, 1971.
Hubbard, J. P. and Clemans, W. V., Multiple-Choice Examinations in Medicine, Lea and Febiger, Philadelphia, 1961.
Lennox, B., Hints on the Setting and Evaluation of Multiple Choice Questions of the One-from-Five Type, Association for the Study of Medical Education, Dundee, 1974.
Mehrens, W. A. and Lehmann, I. J., Measurement and Evaluation in Education and Psychology, Holt, Rinehart and Winston, New York, 1975.
Pickering, G., Against multiple choice questions, Medical Teacher, 1979, 1, 84.
Smart, G. A., The multiple choice examination paper, British Journal of Hospital Medicine, 1976, 15, 131.
