Research Report

Script Concordance Testing: Assessing Residents' Clinical Decision-Making Skills for Infant Lumbar Punctures

Todd P. Chang, MD, David Kessler, MD, MSCI, Brett McAninch, MD, Daniel M. Fein, MD, D.J. Scherzer, MD, Elizabeth Seelbach, MD, Pavan Zaveri, MD, Jennifer M. Jackson, MD, Marc Auerbach, MD, MSCI, Renuka Mehta, MBBS, Wendy Van Ittersum, MD, and Martin V. Pusic, MD, PhD, on behalf of the International Simulation in Pediatric Innovation, Research, and Education (INSPIRE) Network

Please see the end of this article for information about the authors. Correspondence should be addressed to Dr. Chang, Children's Hospital Los Angeles Division of Emergency Medicine and Transport, 4650 Sunset Blvd. #113, Los Angeles, CA 90027; telephone: (323) 361-2109; e-mail: [email protected].

Acad Med. 2014;89:128–135. First published online November 25, 2013. doi: 10.1097/ACM.0000000000000059

Abstract

Purpose
Residents must learn which infants require a lumbar puncture (LP), a clinical decision-making skill (CDMS) that is difficult to evaluate because of considerable practice variation. The authors created an assessment model of the CDMS to determine when an LP is indicated, taking practice variation into account. The objective was to detect whether script concordance testing (SCT) could measure CDMS competency among residents for performing infant LPs.

Method
In 2011, using a modified Delphi technique, an expert panel of 14 attending physicians constructed 15 case vignettes (each with 2 to 4 SCT questions) that represented various infant LP scenarios. The authors distributed the vignettes to residents at 10 academic pediatric centers within the International Simulation in Pediatric Innovation, Research, and Education Network. They compared SCT scores among residents of different postgraduate years (PGYs), specialties, training in adult medicine, LP experience, and practice within an endemic Lyme disease area.

Results
Of 730 eligible residents, 102 completed 47 SCT questions. They could earn a maximum score of 47. Median SCT scores were significantly higher in PGY-3s compared with PGY-1s (difference: 3.0; 95% confidence interval [CI] 1.0–4.9; effect size d = 0.87). Scores also increased with increasing LP experience (difference: 3.3; 95% CI 1.1–5.5) and with adult medicine training (difference: 2.9; 95% CI 0.6–5.0). Residents in Lyme-endemic areas tended to perform more LPs than those in nonendemic areas.

Conclusions
SCT questions may be useful as an assessment tool to determine CDMS competency among residents for performing infant LPs.

The development of clinical decision-making skills (CDMS) is an important part of physician training1,2; however, existing methods to assess CDMS are limited. The infant lumbar puncture (LP) is a common procedure that physicians frequently consider in their evaluations of febrile or ill-appearing infants, and the decision to perform the LP is one example of the CDMS required of physicians who treat infants.3,4 Assessing this specific CDMS in residents is difficult because practice varies notably among general pediatricians, emergency medicine physicians, subspecialists, and even across different institutions.5–10 Variations in LP tendency are often based on patient characteristics—such as a positive respiratory syncytial virus titer, stability, or subjectivity of fever.8,11 Researchers have even documented geographical variation.12 The procedural technique itself has clear standards for teaching and practice,13,14 but the decision to perform the LP can vary among even experienced physicians. Therefore, in the context of such practice variation, typical multiple-choice testing methods such as those found in licensing exams may not be suitable to assess residents' decisions to perform an LP on an infant.15

Available assessments to test complex CDMS include extended matching questions, comprehensive integrative puzzles, and script concordance testing (SCT).15 Medical educators created this last method, SCT, specifically to measure CDMS in areas with practice variation.1,16–19 SCT answers are scored on a scale from −2 to +2, indicating, respectively, the learner's tendency away from or toward a specific diagnosis, diagnostic test, or treatment plan. Unlike multiple-choice questions with only one correct answer, the SCT "answer key" allows the learner to earn partial credit. SCT questions are developed by an expert panel—usually attending-level physicians—whose aggregate responses become the answer key.18 Practice variation among the expert panel allows for multiple possible responses that will earn varying amounts of credit for the learner. In this way, SCT provides for variability in appropriate responses to a clinical question.

Previous research has shown SCT to validly measure CDMS among resident physicians in a variety of specialties and subspecialties, including surgical subspecialties,20–22 dermatology,23 and emergency medicine.24,25 Clinical educators have even used SCT to assess task-based CDMS, such as managing geriatric urinary incontinence or subspecialty referral decision making.26,27 SCT is well suited for topics, such as the decision to perform an infant LP, that involve variations in practice or that have no true "wrong" answer, even among experienced and expert physicians.



Although considerable literature on assessing competency for the LP procedure is available,13,28–30 we know of no standard method to assess whether a trainee has the CDMS to decide whether or not to perform an LP on a given infant. Because the Accreditation Council for Graduate Medical Education (ACGME) has mandated documentation of competencies and milestones,31 an effective assessment of CDMS would determine competency—even for clinical decisions with as much practice variation as the infant LP. SCT could be an optimal tool for assessing the infant LP decision. Thus, in this multi-institutional study, we sought to show that clinical educators could use SCT to assess whether residents demonstrate improvements in deciding when to perform an LP on an infant. We hypothesized that residents' answers to SCT vignettes can measure their CDMS for performing infant LPs, and that their SCT scores would increase as their level of postgraduate training increased and their CDMS approached those of attending physicians.

Method

First, we developed a series of LP case vignettes, which an expert panel of practicing physicians then calibrated per the guidelines for SCT question construction.17 After we developed the LP cases and SCT questions, we tested them through a multicenter cohort study of residents.

Vignette development

Cases were developed in May and June 2011 by participating board-certified or board-eligible pediatric generalists (n = 6) and/or pediatric specialists from critical care (i.e., the pediatric intensive care unit [PICU]; n = 2), from neonatology (i.e., neonatal intensive care; n = 5), from pediatric emergency medicine (PEM; n = 10), and from hospitalist medicine (n = 7) who served as site directors for the International Simulation in Pediatric Innovation, Research, and Education (INSPIRE) Network. This network is a multinational research network running another concurrent study on LP procedural competency.13 Through a modified Delphi process, this group of site directors iteratively developed and modified more than 30 proposed cases based on the following initial criteria: that they represent actual practice, that a resident trainee would realistically be involved, and that there was potential for practice variation. The group members, who were all directly recruited from the larger pool of investigators in the multinational research network,13 volunteered their time. They provided anonymous feedback through one face-to-face meeting and one online meeting to solidify 30 candidate cases.

Next, from July through August 2011, we convened a group of 14 physicians, some of whom were part of the larger initial group. This smaller group (10 PEM physicians, 3 pediatric hospitalists, and 1 PICU physician) served as experts on a panel to determine important themes from the 30 cases and to refine the final vignettes. One of these experts was a postgraduate year (PGY)-5 PEM fellow (D.M.F.), and the others (T.P.C., D.O.K., B.M., D.J.S., E.S., P.Z., J.M.J., M.A., R.M., W.V., M.V.P.) had up to 12 years of practice (see Table 1 for further expert panel physician demographics). The clinical themes the experts discussed encapsulated the reasons for controversy or practice variation (e.g., interpreting subjective fevers). On the basis of these themes, the expert panelists refined the initial 30 cases into 15 vignettes to best exemplify the themes and eliminate distracting elements. The expert panelists discussed the vignettes online and via teleconference in four successive meetings during a six-week period. They progressively revised or deleted vignettes that posed little decision-making difficulty, represented extraordinarily rare clinical scenarios, or were difficult to convey in succinct text form. The final set comprised vignettes involving salient themes such as concomitant viral and bacterial infections, clinical instability, seizures, altered mental status, and contaminant or difficult-to-interpret laboratory results.

Script concordance testing

For each vignette, the panel developed questions using Fournier and colleagues'17 guidelines for SCT construction. All vignettes began with a brief (two- to three-sentence) clinical scenario followed immediately by a question regarding the participant's (i.e., the resident's) initial inclination to perform an LP. For example, one vignette began as follows (see also Figure 1):

A 12-day-old male with known dacryocystocele of the left eye on his newborn exam is here to see you. He has now developed a periorbital cellulitis, and purulent drainage is present. He appears well otherwise. You then find out: Current vital signs show a rectal temperature of 38.6°C (101.5°F). Does this additional information change your likelihood to perform an LP?

The residents each gave an initial response, ranging from −2 to +2, to indicate their inclination to perform an LP: a −2 indicated the resident would definitely not perform an LP, whereas a +2 indicated he or she definitely would. Each vignette included added clinical findings, always followed by a question asking whether the new finding altered the resident's decision to perform an LP. The resident's decision was captured in SCT format using an anchored five-point scale ranging from −2 to +2 where, as mentioned, −2 indicated a significant shift away from the decision to perform an LP, 0 indicated no change in decision, and +2 indicated a significant shift toward deciding to perform an LP. We developed two to four SCT questions per vignette, as recommended by Gagnon and colleagues.32

Table 1 Demographics of 14 Experts Serving on the Panel to Provide Answers for the Script Concordance Questions, 2011

Characteristic                                      No. of experts
Subspecialty
  Pediatric emergency medicine                      10
  Pediatric hospitalist medicine                     3
  Pediatric critical care                            1
Experience
  Postgraduate year 5 fellow                         1
  Faculty 0–6 years                                 10
  Faculty 7–12 years                                 3
Experience with adult medicine
  Yes                                                0
  No                                                14
Practicing in a Lyme disease–endemic area
  Yes                                                5
  No                                                 7
  Don't know                                         1
  Not reported                                       1
Frequency performing lumbar punctures on infants
  1–11/year                                          5
  1–3/month                                          7
  1/week or more often                               1
  Not reported                                       1


Figure 1 Sample case vignette illustrating the use of script concordance testing for assessing residents' clinical decision-making skills regarding whether to perform a lumbar puncture on an infant, 2011 to 2012.

Vignette: A 12-day-old male with known dacryocystocele of the left eye on his newborn exam is here to see you. He has now developed a periorbital cellulitis, and purulent drainage is present. He appears well otherwise. The rectal temperature is 37.5°C (99.5°F). Based on the above scenario, would you perform an LP on this patient?

Expert panel votes (total = 12) and resulting credit:
  (−2) Definitely not: 1 vote → 1/5 = 0.2 points
  (−1) Probably not: 5 votes (modal) → 1 point
  (0): 0 votes → 0 points
  (+1) Probably: 5 votes (modal) → 1 point
  (+2) Definitely: 1 vote → 1/5 = 0.2 points

You then find out: Current vital signs show a rectal temperature of 38.6°C (101.5°F). Does this additional information change your likelihood to perform an LP?

Expert panel votes (total = 12) and resulting credit:
  (−2) Much less likely: 0 votes → 0 points
  (−1) Less likely: 0 votes → 0 points
  (0) No change: 6 votes (modal) → 1 point
  (+1) More likely: 5 votes → 5/6 = 0.83 points
  (+2) Much more likely: 1 vote → 1/6 = 0.17 points

Scoring steps: Step 1: the modal answer receives 1 point. Step 2: all other responses receive a fraction, the number of expert responses for that choice divided by the number of modal responses. Step 3: a response chosen by no expert receives no points.

The expert panel provided responses to the finalized SCT vignettes in accordance with guidelines by Fournier and colleagues17 and by Lubarsky and colleagues.18 The panel's responses constituted the SCT "answer key." To illustrate this process, we return to the case provided in Figure 1. For the first question, the expert panel was asked whether a rectal temperature of 38.6°C changed their inclination to perform an LP for the infant with periorbital cellulitis. Their responses are as follows. Of 14 experts:

• 2 did not answer;
• 0 chose −2, "much less likely";
• 0 chose −1, "less likely";
• 6 chose 0, "no change";
• 5 chose +1, "more likely"; and
• 1 chose +2, "much more likely."


The residents who selected the modal answer—0 or "no change"—received a full score of 1 point. Because six panel physicians selected the modal answer and five selected +1, "more likely," the residents who selected +1 received a fractional score of 5/6, or 0.83 points. Those who answered +2, "much more likely," received a score of 1/6, or 0.17 points, because one expert panelist had also selected this choice. If no expert panel members chose a response, that response was worth 0 points. Figure 1 illustrates the scoring system for the vignette. In this way, the SCT score measures the concordance of each learner's responses with the "script" that the expert panel provided.
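To make the scoring rule concrete, the following is a minimal Python sketch of the aggregate scoring method described above (the study's analyses were run in SPSS; the function names and data layout here are our own illustration):

```python
from collections import Counter

RESPONSE_SCALE = (-2, -1, 0, 1, 2)  # anchored five-point SCT scale

def build_answer_key(expert_votes):
    """Map each response on the scale to its credit.

    The modal response earns the full 1 point; every other response
    earns (number of experts choosing it) / (number choosing the mode);
    a response chosen by no expert earns 0 points.
    """
    counts = Counter(expert_votes)
    modal = max(counts.values())
    return {r: counts.get(r, 0) / modal for r in RESPONSE_SCALE}

# Second question in Figure 1: of 12 responding experts,
# 6 chose 0 ("no change"), 5 chose +1, and 1 chose +2.
key = build_answer_key([0] * 6 + [1] * 5 + [2])
# key == {-2: 0.0, -1: 0.0, 0: 1.0, 1: 0.833..., 2: 0.166...}

def total_scores(answer_keys, responses):
    """Total SCT score (sum of per-question credits) and total raw score
    (sum of the -2..+2 responses, i.e., the participant's overall
    tendency to perform an LP)."""
    sct = sum(k[r] for k, r in zip(answer_keys, responses))
    raw = sum(responses)
    return sct, raw
```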

Setting and population

All resident participants were from 10 large, urban hospitals with ACGME-accredited pediatrics and other residency programs through which they had regular clinical access to infant LPs. These 10 tertiary hospitals represented both private and public institutions and all regions of the United States. Pediatric, emergency medicine, and family medicine residents rotating through the hospitals received an electronic invitation and a link to the SCT LP vignettes. Any resident in any subspecialty within the institutions was eligible to participate. Resident participation in the study was voluntary, and neither participation nor SCT results had any bearing on residents' evaluations. Recruitment occurred in each institution using up to two e-mail reminders and site director solicitation. We explained the general purpose of SCT and the study hypothesis.

We offered no incentives, and participants enrolled in a rolling fashion from August to May of the 2011–2012 academic year. The institutional review boards of all 10 participating INSPIRE Network institutions approved the study as exempt.

We based our sample size calculations on a predicted mean difference of 5 SCT points (out of 47) between each of the resident groups, with an effect size of 1.0, an alpha of 0.05, and a power of 0.9, yielding 24 per group. We performed a planned interim analysis in February 2012 (month 7) and continued data collection until May 2012 (month 10).
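For readers who want to reproduce the arithmetic, a standard two-sample power calculation with these inputs can be sketched as follows. The article does not state which software produced the figure of 24 per group; the statsmodels call below solves the two-sample t-test case and yields roughly 22, so 24 may reflect rounding up or an allowance for the planned nonparametric tests.

```python
# Sample size per group for effect size d = 1.0, alpha = .05, power = .90,
# using the two-sample t-test approximation.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=1.0, alpha=0.05, power=0.9)
print(n_per_group)  # ~22; the authors report 24 per group
```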

Test administration

Participants voluntarily completed all 47 questions, through a designated SurveyMonkey link, using any device that permitted online access. The residents were allowed to complete the vignettes only once, and they had no time limits. To avoid any sequence effects, we randomized the order of the vignettes that each participant received, as well as the order of the SCT questions within each vignette. The question set was open to the residents of each participating institution for approximately four months following that institution's institutional review board approval.

Data collected

Independent variables for residents included their PGY level (PGY-1, -2, or -3), their specialty, whether they had any experience in adult medicine, whether they practiced within a Lyme disease–endemic area (as some research indicates that Lyme disease has been associated with an increased tendency to perform LPs12), and their level of LP experience. We defined LP experience as the number of LPs residents either directly performed or supervised in their lifetime. SurveyMonkey collected each resident's time to complete the test, which we defined as the time from log-in to the time when the resident completed the SCT.

The SCT score per question could range from 0 (complete discordance) to 1 (full concordance), and the total possible SCT score ranged from 0 to 47 for the entirety of the question set. The SCT score measures each resident's CDMS for performing an infant LP and the degree to which his or her answers are in concordance with the responses of the expert panel. The raw score per question ranged from −2 to +2, and the total raw score ranged from −94 to +94 for all 47 questions. Given that all raw scores measured the same construct (i.e., tendency to perform an LP), a total raw score of −94 reflected an individual who would never perform an LP, and a total raw score of +94 reflected an individual who would perform an LP on every patient in the set. In other words, the raw score measured a participant's general tendency to perform an LP but did not necessarily reflect a "correct" or "incorrect" response. The primary outcome variable was the total SCT score, and the secondary outcome variable was the total raw score.

Data analysis

We eliminated any participant who failed to complete at least 90% of the vignettes (i.e., who did not provide 43 or more responses). We excluded cases with missing independent variables from the relevant analyses. We analyzed differences in times to completion, SCT scores, and raw scores among groups using nonparametric tests in anticipation of unequal group sizes and nonnormal distributions, which we verified using the Shapiro–Wilk test. The statistical analyses on the outcome variables included Mann–Whitney U and Kruskal–Wallis tests and Spearman rank correlation coefficients. We estimated confidence intervals (CIs) using bootstrapping and Hodges–Lehmann–Sen estimates. We calculated the internal consistency of the question set through Cronbach alpha using the SCT scores and the raw scores. We determined Cohen d effect sizes based on the noncentral t distribution.33 We performed all statistical analyses using SPSS version 20 (Armonk, New York).
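The analyses above were run in SPSS; purely for illustration, an equivalent open-source sketch of the main comparisons might look like the following, using simulated placeholder scores rather than the study data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder SCT score vectors for two resident groups (not study data).
pgy1 = rng.normal(32.8, 3.5, size=34)
pgy3 = rng.normal(36.2, 3.5, size=34)

# Mann-Whitney U test for the two-group comparison.
u_stat, p_value = stats.mannwhitneyu(pgy1, pgy3, alternative="two-sided")

# Bootstrapped 95% CI for the difference in group medians.
diffs = [np.median(rng.choice(pgy3, pgy3.size, replace=True))
         - np.median(rng.choice(pgy1, pgy1.size, replace=True))
         for _ in range(5000)]
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])

# Cronbach alpha for internal consistency: rows = examinees, columns = items.
def cronbach_alpha(scores):
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)
```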

Results

The SCT vignettes and the expert panel

The final SCT comprised 15 vignettes and 47 questions. Twelve of the fourteen panel experts (86%) provided answers for the entire set of 15 vignettes. These experts worked at the following institutions: Children's Hospital Los Angeles, Children's Hospital at Montefiore, Children's Hospital of Pittsburgh, Children's National Medical Center, Cleveland Clinic, Columbia University, Georgia Regents University, Nationwide Children's Hospital, New York University, Stony Brook University Medical Center, Wake Forest Baptist Medical Center, Medical University of South Carolina, and Yale University. The median SCT score for the 12 experts who provided answers was 37.96 (interquartile range [IQR]: 4.31), and the median raw score was 19 (IQR: 21).

Participating residents

Of the 730 eligible residents from all 10 participating institutions, 130 completed the vignettes, and 102 of those completed over 90% of the SCT questions; thus, our final data analysis included 14% of the total possible participants.

Differences in times to completion, SCT scores, and raw scores

Median time to completion was 13 minutes 38 seconds (95% CI 11:41 to 15:29), and we detected no significant differences in time to completion between PGY-1, PGY-2, or PGY-3 respondents (P = .51). As we hypothesized, the analysis revealed an association between training year and LP experience (r = 0.51, P < .001), such that directly performing an LP correlated with higher PGY year; however, we noted no other significant differences in the other independent variables by training year.

SCT scores differed significantly among residents in different training years, rising with experience from PGY-1 (median SCT score = 32.79; 95% CI 31.3–34.5) to PGY-3 (median SCT score = 36.24; 95% CI 35.0–37.0), for a difference of +3.0 (95% CI 1.0–4.9) and a Cohen d effect size of 0.87 (95% CI 0.35–1.39). We also noted an increase in SCT scores as residents' self-reported LP experience increased; those who performed only 1 LP and those who performed 6 to 10 LPs had the greatest difference (+3.3; 95% CI 1.1–5.5). These results, as well as others—including those for experience with adult medicine—are shown in Table 2.



Table 2 The Scores, With Interquartile Range (IQR), of 102 Residents Participating in a Script Concordance Test (SCT) Assessment of Their Clinical Decision-Making Skills Regarding Performing a Lumbar Puncture on an Infant, 2011–2012*

Resident characteristic            SCT score,† median (IQR)    P value    Raw score,‡ median (IQR)
Postgraduate year (PGY) status                                 .008
  PGY-1 (interns)                  32.79 (6.04)
…
