Continued Research on Computer-Based Testing

Stephen G. Clyman, Ellen R. Julian, Nancy A. Orr, Gerard F. Dillon, Kenneth E. Cotton

National Board of Medical Examiners, Philadelphia, PA 19104

Abstract

The National Board of Medical Examiners has developed computer-based examination formats for use in evaluating physicians in training. This paper describes continued research on these formats, including attitudes about computers and effects of factors not related to the trait being measured; differences between paper-administered and computer-administered multiple-choice questions; and the characteristics of simulation formats. The implications for computer-based testing and further research are discussed.

Background

The National Board of Medical Examiners (NBME) has developed computer-based simulations (CBX) and computer-administered multiple-choice questions (cMCQs) for evaluating physicians in training. Ultimately, these examination formats will be used in the NBME certification examinations [1]; however, the computerization raises questions about computer experience and examination performance, effects of computerization of paper examinations, the validity of new formats, and logistics.

CBX is a computer-administered simulation format in which the examinee must care for a patient through simulated time by requesting any of thousands of patient-care options through free-text entry. The patient responds based on the management decisions or based on the evolution of the disease state through simulated time. While the examinee manages the patient, the computer records the actions in a file that can be scored by a computerized algorithm.
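As a rough illustration of this record-and-score pipeline, the sketch below scores a recorded list of actions against a weighted criterion key. The action codes, weights, and key structure are hypothetical assumptions for illustration; this is not the NBME scoring algorithm.

```python
# Sketch of scoring a CBX-style transaction list against a criterion key.
# Action codes and weights are hypothetical, not the NBME algorithm.

CRITERION_KEY = {
    "order_cbc": 2,               # indicated work-up (hypothetical)
    "order_chest_xray": 1,
    "start_iv_fluids": 2,
    "order_exploratory_lap": -3,  # risky, non-indicated action (hypothetical)
}

def score_transaction_list(actions):
    """Sum the keyed weights of each requested action; unkeyed actions score 0."""
    return sum(CRITERION_KEY.get(action, 0) for action in actions)

# Example: one examinee's recorded actions for a case.
print(score_transaction_list(["order_cbc", "start_iv_fluids", "order_exploratory_lap"]))  # 1
```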

Previous NBME research indicated that in certain applications the computerization of multiple-choice questions (MCQs) had no demonstrable effect upon item difficulty; that use of CBX cases required practice; that examinees generally favored CBX over the cMCQ format; and that CBX measured unique information that reasonably could be linked to readiness to practice medicine [2],[3].

The concern persists, however, that administration of examinations by computer may result in irrelevant or extraneous factors adversely affecting performance. For MCQs, the continued comparison of paper and computer administrations is helpful in addressing this concern. Unlike MCQs, however, CBX cannot be administered without a computer; therefore, comparison of a computer and paper version is not possible.

A previous review [4] has summarized the problems with, and research needed for, simulations used for evaluation. The review pointed to the difficulty in defining scoring criteria, the relative unreliability of the simulation formats, and the paucity of unique information obtained at greater time and expense. The purpose of this research is to investigate further the characteristics of the cMCQ and CBX formats.

Method


Examinations in each of surgery, internal medicine, pediatrics, and obstetrics-gynecology were administered to approximately 1000 students in nine schools from August 1989 to July 1991. Eight simulations constituted the CBX portion of each discipline-specific examination; 40 minutes were allowed for the completion of each simulation. The paper-administered MCQs (pMCQs) and cMCQs were derived from the NBME Part II examination and were identical with those questions used elsewhere for clerkship evaluation. Each examination consisted of approximately 140 MCQs, typically completed within two hours. The computerized format provided many examinee-controllable features of the paper format (e.g., the ability to change answers and to tag items for later review).

The examinations were administered within one week of the completion of clerkships that ranged in length from six to 12 weeks. Because schools did not have enough computers to test everyone in one day, testing took place over as many as four days. For schools administering both cMCQs and pMCQs, a counterbalanced design, in terms of sequence of administration, was not executed because of administrative and logistic constraints. However, it was thought that the student samples that emerged, although ones of convenience, would still yield valuable data in the continued assessment of these new methodologies and formats.

Protocols varied across schools: in some instances, cMCQs and pMCQs were administered; in others, cMCQs and CBX, or CBX only. In some instances, the pMCQs or cMCQs were used for clerkship grading; in others they were required but were not a part of the clerkship grade; in one school, students were paid to participate. In some schools, some students took the examinations in two different disciplines. Students were required to practice with the NBME computer-based testing (CBT) materials before the examination.

Surveys were administered at each test administration to gather demographic, computer-experience, and format-comparison (e.g., cMCQ versus CBX) data. These questions were presented by the computer both before and at the completion of the examination.

To investigate the validity of the CBX scores, clinicians who defined the scoring criteria reviewed the actions of examinees using a holistic scoring approach. The clinicians, in groups of six, reviewed approximately ten examinees' responses and produced a holistic score, both individually and as a group, for each examinee. These holistic measures were then compared to those derived from the computerized algorithm. Most of these results are based on the administration of examinations in obstetrics-gynecology and surgery to approximately 600 examinees; analyses with the remaining data set are ongoing.
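One simple way such a comparison can be made is to correlate the group holistic ratings with the algorithm-derived scores. The sketch below does this for ten examinees; all values are invented for illustration, and the paper reports only that the two measures were compared.

```python
# Hypothetical comparison of group holistic ratings with algorithm-derived
# scores for ten examinees; all numbers are invented for illustration.
from statistics import mean, stdev

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

holistic  = [3, 5, 4, 2, 5, 3, 4, 1, 2, 4]            # clinicians' group ratings
algorithm = [48, 71, 60, 35, 74, 50, 63, 22, 30, 58]  # computerized algorithm scores
print(round(pearson(holistic, algorithm), 2))
```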

Results

Students' attitudes varied from school to school, perhaps reflecting different levels of faculty commitment to this research. Generally, though, students believed that CBX simulations (compared with cMCQs) were more representative of materials in the clerkship and more effective in allowing demonstration of what was learned in the clerkship. Computer experience did not correlate with performance on CBX or cMCQs.

Raw (number-correct) scores for the cMCQs were converted into standard scores using the mean and standard deviation of a group of students who took the same questions as part of national certification examinations. Initial results, provided below, indicate that the cMCQ examination was more difficult than the pMCQ examination, on average, by approximately 25 standard-score points (this difference is significant at the .01 level); however, interpretation is complicated by such factors as the sequence of administration (i.e., cMCQ then pMCQ or vice versa) and by inferred examinee motivation. Although analyses continue, preliminary indications are that the cMCQ/pMCQ difference persists to some degree if the sample is restricted to those who appear to be equally motivated on both formats. Additionally, there appears to be some interaction between the mode of presentation and the sequence of administration. The results of subsequent analyses will clarify these findings.

CBT PHASE II - MCQ ANALYSIS

                  cMCQ Standard Scores   pMCQ Standard Scores
            N       Mean        SD         Mean        SD      Diff (C-P)1   Correlation2
Ob-gyn      283     521.6       97.8       548.4       81.0      -26.8*       .72 (1.0)
Surgery     83      441.8       78.8       466.8       76.3      -25.0*       .73 (1.0)

* Significantly different at the .01 level.
1 (C-P): difference between cMCQ and pMCQ means.
2 Total-group correlation (value corrected for attenuation in parentheses).
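The following sketch illustrates the raw-to-standard-score conversion described above, assuming a linear transformation to a scale with mean 500 and SD 100; the scale constants and the reference-group statistics are assumptions for illustration, since the paper states only that the reference group's mean and standard deviation were used.

```python
# Raw-to-standard-score conversion, assuming a linear transformation to a
# 500/100 scale; scale constants and reference statistics are hypothetical.

def to_standard_scores(raw_scores, ref_mean, ref_sd, scale_mean=500.0, scale_sd=100.0):
    return [scale_mean + scale_sd * (raw - ref_mean) / ref_sd for raw in raw_scores]

# Reference statistics would come from the national certification sample
# that answered the same questions (hypothetical values here).
print(to_standard_scores([101, 95, 110], ref_mean=98.0, ref_sd=12.0))
```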


The differences in average performance and the potential interaction with sequence of administration are not totally surprising, since other researchers have reported similar findings under a number of different circumstances [6]. These findings pose interesting questions: Are item differences created by the display on a computer screen, or by the presentation of one item at a time rather than six or eight on a page? Does the vertical orientation of the computer screen, as opposed to the horizontal presentation of paper tests, make a difference? Is the issue of "control" over the format (e.g., how the "pages" are turned and the questions answered) an important factor? Or did examinees' motivation make the difference? These questions guide further research.

To examine the relationship between cMCQ and pMCQ scores, Pearson correlations were calculated. The overall correlations between the cMCQ and pMCQ scores for obstetrics-gynecology and surgery were .72 and .73, respectively. After correcting for the unreliability of the measures, the pMCQ and cMCQ scores correlate 1.0; this implies that the same, or highly related, constructs are being measured.

The pediatrics, surgery, and obstetrics-gynecology CBX examination scores have reliabilities ranging from .73 to .81, consistent with the 1987 administration of an interdisciplinary examination to first-year residents. The examinations seemed to be appropriately difficult, as judged prospectively by content experts and as confirmed retrospectively by the proportion of actions correctly requested by examinees. Most examinees seemed to have practiced sufficiently with the cases so that the format did not affect their scores. CBX and cMCQ correlations are low to moderate (.43 to .66 corrected for attenuation), indicating that each format contributes unique information. Gender of the examinee was not related to performance on cases, even though men generally were more computer-experienced.

Discussion

The initial results are encouraging for the ultimate implementation of computer-based testing. Some data are difficult to interpret because protocols differed at each testing site and examinees' motivation varied. Nonetheless, these findings tend to corroborate many previous findings about computer-based testing.

As recommended in guidelines provided by the American Psychological Association [5], pMCQ and cMCQ forms of the same test need to be compared to evaluate whether the two formats measure the same trait and to investigate whether the scores on the two formats are comparable. The first and more critical issue has been supported by the high correlation of the two formats. The second issue, score equivalence, requires further investigation.

The low to moderate correlations of CBX with cMCQs suggest that CBX adds unique information to that obtained with cMCQs. Other validity studies are ongoing to determine whether the additional CBX information is relevant and useful in achievement or licensure examinations in medicine. So far, comparisons of physicians at different points in their training, the opinions of the content experts who construct the examinations, and other validity studies indicate that the information is useful. The unprompted nature of the simulation captures information that could not be obtained in MCQ formats even if correlations between cMCQ and CBX were extremely high. These findings support the assertion that the two formats provide complementary information.

Attitudes about the formats and the effects of extraneous factors corroborate earlier NBME research [2]. Examinees generally found simulation formats preferable to cMCQs. Computer experience did not affect CBX scores. This finding is encouraging, but not necessarily intuitive. A distinction should be made between knowledge about computers and knowledge about how to use the program running on the computer (how to perform a task through a specific interface). There is no reason to conclude that computer knowledge (e.g., knowing a programming language or computer architecture) would help someone achieve a higher grade in a clinical content area unless that content included such topics (the CBX cases and cMCQs in these examinations did not). On the other hand, familiarization with the computer program, particularly when it is relatively complex, may well help the examinee. For this reason, students were required to complete practice cases.

As alluded to above, researchers have expressed concern about devising consistent scoring algorithms for open-ended formats like CBX. Preliminary studies with small data sets by Solomon, Osuch, et al. [7] suggest that raters can be consistent in rating examinees on CBX cases and that methods for devising criterion sets for CBX cases can, in theory, decrease this source of error variance. These studies have been reproduced, and analyses are underway.


The corroboration of some of these findings (e.g., the reliability of scores) across different examinee populations and with different content (discipline-specific examinations as opposed to an interdisciplinary examination) suggests a stable feature of the examination that could reasonably be extended to an interdisciplinary examination and to a comparable group of senior medical students (between the training levels of the groups in the previous studies). On the other hand, further research (e.g., standard setting on CBX) must be done within the specific examinee population, using examination materials that actually would be administered in the "live" setting.

Future Research

The conclusions in this study were based mostly on data from surgery and obstetrics-gynecology examinations administered to junior medical students; these analyses corroborate findings from previous studies [2]. Analyses of the pediatrics and internal medicine examinations will be added and interpreted.

In the 1991-1992 academic year, research will be expanded to include an interdisciplinary CBX examination for senior medical students. This will allow continuation of research on CBX in a format that more closely resembles an NBME certification examination. This research will include the following:

Case Disguise. By modifying the patient's age and other minor points in the presentation of the problem, CBX cases can be reused. Whether the psychometric characteristics of the case differ with these minor changes must be tested.

Equivalent Forms. Examinees who take comparable but different forms of the examination must be scored on a common scale.

Security Protocols. The computer-administered examinations' security must be equal to that of paper examinations.

Case Exposure. CBX cases might be easier to remember because they are larger logical testing units. The implications of examinees' sharing information about CBX cases with each other must be studied.

Adaptive and Sequential Testing. These are ways of shortening testing time or increasing testing accuracy by making decisions about how much more testing is required while portions of the examination are still being administered; a hypothetical stopping rule of this kind is sketched below.
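As one concrete illustration of sequential testing, the sketch below stops administering items once a confidence interval around the running proportion-correct clears a pass/fail cutoff. The cutoff, confidence level, minimum test length, and item stream are all illustrative assumptions, not an NBME design.

```python
# Hypothetical sequential stopping rule: stop once a confidence interval
# around the running proportion-correct clears a pass/fail cutoff.
from math import sqrt

def decide(responses, cutoff=0.6, z=2.576, min_items=20):
    """Return 'pass', 'fail', or 'continue' given the item scores so far."""
    n = len(responses)
    if n < min_items:
        return "continue"
    p = sum(responses) / n
    half_width = z * sqrt(p * (1 - p) / n)  # normal-approximation interval
    if p - half_width > cutoff:
        return "pass"
    if p + half_width < cutoff:
        return "fail"
    return "continue"

# Example: a stream of item scores (1 = correct), decided item by item.
stream = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1] * 6
for i in range(1, len(stream) + 1):
    status = decide(stream[:i])
    if status != "continue":
        print(f"decision '{status}' after {i} items")
        break
```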

Other studies, examining the consistency of raters (i.e., scoring criteria), the adequacy of content sampling, and the apparent increased difficulty of cMCQs, will also be pursued.

References

1. Clyman SG, Orr NA: Status report on the NBME's computer-based testing. Academic Medicine 1990; 65(4):235-241.
2. Conference on Computer-Based Testing in Medical Education and Evaluation - March 24, 1988 - Conference syllabus. National Board of Medical Examiners, Philadelphia, 1988.
3. Melnick DE: Clinical Simulations - Pygmalion Revisited? In Stead WW (ed): Proceedings of the Eleventh Annual Symposium on Computer Applications in Medical Care. Los Angeles, IEEE, 1987, pp 7-9.
4. Swanson DB, Norcini JJ, Grosso J: Assessment of clinical competence. Assessment and Evaluation in Higher Education 1987; 12(3):220-246.
5. American Psychological Association: Guidelines for Computer-Based Tests and Interpretation. APA, Washington, DC, 1986.
6. Mazzeo J, Harvey AL: The Equivalence of Scores from Automated and Conventional Educational and Psychological Tests: A Review of the Literature. College Board Report No. 88-8. New York: College Entrance Examination Board, 1988.
7. Solomon DJ, Osuch JR, et al: Unpublished data, December 1990.


