PSYCHOMETRIKA
DOI: 10.1007/s11336-014-9409-x

BAYESIAN CHECKS ON CHEATING ON TESTS

Wim J. van der Linden
CTB/MCGRAW-HILL

Charles Lewis
FORDHAM UNIVERSITY

Posterior odds of cheating on achievement tests are presented as an alternative to p values reported for statistical hypothesis testing for several of the probabilistic models in the literature on the detection of cheating. It is shown how to calculate their combinatorial expressions with the help of a reformulation of the simple recursive algorithm for the calculation of number-correct score distributions used throughout the testing industry. Using the odds avoids the arbitrary choice between statistical tests of answer copying that do and do not condition on the responses the test taker is suspected to have copied, and allows the testing agency to account for existing circumstantial evidence of cheating through the specification of prior odds.

Key words: answer copying, Bayesian checks, cheating on tests, erasure analysis, generalized binomial distribution, statistical hypothesis testing.

1. Introduction

Although cheating on achievement tests is not common, it is definitely on the rise. For instance, in a recent survey based on a randomized response technique among a population of university students, Fox & Meijer (2008) found that 20–31% of the students reported such types of cheating as conferring during the test, looking at the test papers of someone else, or allowing others to copy their own work.

One of the main weapons of the testing industry against cheating is statistical detection of its occurrence. The current statistical methods address four different kinds of cheating: (i) test takers copying answers from others; (ii) fraudulent erasures on answer sheets after completion of the test; (iii) attempts to memorize items during testing to share them with later students; and (iv) preknowledge of unreleased test items. The main methods used in these categories are briefly reviewed here; more specific information on some of these methods is provided later in this paper.

Early methods for the detection of answer copying on multiple-choice tests were proposed by Angoff (1974), Frary et al. (1977), and Saupe (1960). These methods have the statistical form of a test of the null hypothesis of no answer copying based on the number of matching responses between two test takers. Their null distributions, however, were derived from rather ad hoc assumptions about the response process. Versions of the same type of test derived from more sophisticated assumptions are the K-index (Holland, 1996; Lewis & Thayer, 1998; Sotaridona & Meijer, 2002), as well as the tests proposed by van der Linden & Sotaridona (2004) and Wesolowsky (2000). Tests directly based on standard item response models widely used in the testing industry were provided by Wollack (1997) and van der Linden & Sotaridona (2006). Although these tests rest on response-model assumptions, it is standard operating procedure in the testing industry to check the validity of such assumptions while field testing the items and to discard items with a less satisfactory fit.

Correspondence should be sent to Wim J. van der Linden, CTB/McGraw-Hill, 20 Ryan Ranch Road, Monterey, CA 93940. Email: [email protected].

© 2014 The Psychometric Society


An entirely different approach was recently offered by Belov & Armstrong (2010). Their method uses the Kullback–Leibler divergence to detect inconsistencies in test takers' performances between common sections administered to all of them and rotating sections with pretest items, and then uses the K-index for secondary analysis to find evidence of answer copying on the common sections.

Fraudulent erasures on answer sheets arise when teachers or school administrators "help students" by improving wrong answers on their answer sheets after the test, or do so to hide underachievement by their class or school. One of the first to look into the possibility of detecting this type of fraud using optical scanning of the answer sheets for erasures was Qualls (2001). Her main goal was to obtain estimates of the typical distributions of the numbers of regular answer changes by test takers on low-stakes tests, where fraud is generally absent, for use with high-stakes tests to detect outliers. Jacob & Levitt (2003a, b, 2004) used multinomial logits of numbers of erasures regressed on performances by the same students in other years, as well as several kinds of background information, to detect unusual erasure patterns. Their case study for a Chicago school district, which created considerable interest thanks to its inclusion in Levitt & Dubner (2005), led to the identification of several teachers who had been erasing answers. A statistical test and a method of residual analysis for the detection of fraudulent erasures, based on item response modeling of the probability of wrong-to-right (WR) answer changes, were given by van der Linden & Jeon (2012).

Most detection of memorization and preknowledge of test items involves methods of person-fit analysis from item response theory. For computerized testing, these methods become more powerful when combined with an analysis of the response times recorded for the items. Attempts to memorize test items typically lead to careless responses given only to move from one item to the next, as well as response times not representative of the actual amount of labor required to solve the items. The latter also holds for items that test takers have already seen before the test is administered to them, in combination with unlikely correct responses given the ability demonstrated on the other items. An earlier review of person-fit analyses used to detect aberrant responses was given by Meijer & Sijtsma (2001). Later methods include the Bayesian options by Glas & Meijer (2003) and McLeod et al. (2003), as well as methods based on a Bayesian analysis of residual response times by van der Linden & Guo (2008).

Several of the methods just reviewed are based on, or were at least motivated by, notions derived from statistical hypothesis testing. They typically focus on an obvious test statistic (for instance, the number of matching incorrect responses on multiple-choice items or the number of WR erasures on an answer sheet) and then choose a statistical rationale for distinguishing between test takers with high but still likely values for the statistic and those with unlikely high values. Clear advantages of this approach relative to the heuristic approaches in the early literature on the detection of cheating are its use of explicit assumptions about the probability distribution of the test statistic for regular examinees and its distinction between regular and irregular behavior.
However, the focus of this approach is exclusively on the null hypothesis of regular behavior. Newer developments, as in van der Linden & Jeon (2012) and van der Linden & Sotaridona (2006), also set up an explicit alternative hypothesis with a probabilistic structure expected to hold for those who cheat, which is then tested against the null hypothesis of no cheating.

A still unresolved issue, already emerging in the very first reports on the detection of answer copying, is the question of how to deal with the responses by the source. Three different answers have been given, sometimes with different motivations. One answer has been to focus on the incorrect responses by the source only, a choice usually motivated by the observation that matching correct responses constitute weak evidence of copying since both the source and the copier may just know the answer (e.g., Holland, 1996), or by demonstrating confounding of the hypotheses to be tested if correct responses by the source are admitted (van der Linden & Sotaridona, 2004). This approach typically views knowing an answer to an item as a deterministic event.


Two alternative approaches have been proposed: (i) treating all responses by the source as fixed and (ii) treating all of them as random. The first in the literature to struggle with the choice between the two options were Frary et al. (1977). More recently, conditional and unconditional versions of a test based on the same multinomial model for the response alternatives were proposed by van der Linden & Sotaridona (2006). Suppose an item has alternatives $a = 1, \ldots, k$, and let $\pi_{cia}$ and $\pi_{sia}$ denote the probabilities of $c$ (the copier) and $s$ (the source) choosing alternative $a$ on item $i$ under the response model. One version is based on the null hypothesis of $c$ and $s$ working independently (i.e., without any copying) on the items, with probabilities of a matching choice from the alternatives equal to

$$\Pr\{\text{match on } i\} = \sum_{a=1}^{k} \pi_{cia}\,\pi_{sia}. \tag{1}$$

The alternative is to take the responses of the source as given. Suppose $s$ has chosen alternative $a'$ on an item. The probability of a regular matching response by $c$ given the choice by $s$ is then equal to

$$\Pr\{\text{match on } i\} = \pi_{cia'}. \tag{2}$$

Obviously, for the same level of significance, the two tests may flag different test takers. Classical statistics is unable to offer any criterion for the choice between the two types of test. The underlying statistical issue is the choice between unconditional and conditional inference (Lewis, 2006). Both types of inference generally have the same expected power, and the choice between them can be motivated only by such extra-statistical considerations as the nature of the application or the intended use of the results (Lehmann & Romano, 2005, Sect. 10.1). For an application of the proof of equal expected power to the problem of answer copying, see Appendix 1.

Fortunately, there exists another option in the form of the calculation of the posterior odds of cheating. This option entirely avoids the arbitrary choice between unconditional and conditional inference in that, following the Bayesian logic of treating all data as given, it automatically conditions on all responses, both by the source and by the copier. An additional advantage of the use of posterior odds is the ability to account for existing empirical evidence of cheating through the specification of prior odds. For instance, when the hypothesis of cheating is tested for test takers from a well-defined population, the types of tests just reviewed are unable to account for our knowledge of the typical incidence of cheating in the population. The only control they offer is through the choice of the significance level of the test. But that choice only impacts the proportion of cases flagged among those not involved in cheating; it does not account in any way for the proportion that actually did cheat.

An important question is how to use statistical evidence of cheating in the actual practice of testing. The general rule followed by the testing industry is not to accuse any test taker of cheating but to inform those with highly suspicious answer sheets that, due to statistical irregularities, their response patterns do not warrant any scoring, and to offer them a retake. The rule has been upheld in court decisions acknowledging the right of testing organizations not to release test scores that are not in agreement with their professional standards. The posterior odds derived below are intended to help testing organizations reach such conclusions, combining whatever circumstantial evidence they may have collected with the statistical information in the test taker's responses.
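To make the contrast between (1) and (2) concrete, the following sketch (not part of the original paper; all item parameter values are hypothetical) computes both match probabilities for a single five-choice item under a nominal response model of the kind introduced in Section 2.1:

```python
import numpy as np

def nominal_probs(zeta, lam, theta):
    """Category probabilities for one item under a nominal response model
    (see Eq. (5) below); zeta and lam hold the category intercepts/slopes."""
    z = np.exp(zeta + lam * theta)
    return z / z.sum()

# Hypothetical parameters for a single five-choice item.
zeta = np.array([0.0, -0.3, 0.2, -0.5, -0.1])
lam  = np.array([0.0,  0.4, 1.2, -0.6,  0.3])

pc = nominal_probs(zeta, lam, theta=-1.0)    # copier c
ps = nominal_probs(zeta, lam, theta=1.0)     # source s

p_uncond = pc @ ps            # Eq. (1): unconditional probability of a match
a_prime = int(np.argmax(ps))  # suppose s happened to choose this alternative
p_cond = pc[a_prime]          # Eq. (2): conditional probability given s's choice
print(p_uncond, p_cond)
```

As the sketch makes plain, the conditional probability depends on which alternative the source actually chose, while the unconditional probability averages over all of the source's possible choices.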

2. Posterior Odds of Cheating

The basic procedure we propose is first illustrated as an alternative to the current statistical tests of answer copying on multiple-choice tests reviewed above. We then give an example of the calculation of the posterior odds of fraudulent erasures on answer sheets.


2.1. Answer Copying

Let $N$ be the set of items in the test for which we want to check on the copying of any answers by a hypothetical copier $c$ from a source $s$. We use $M$ to denote its subset of items with matching alternatives between $c$ and $s$. The subset of items on which $c$ actually copied is denoted as $\Omega$. Although the sets $M$ and $\Omega$ vary across pairs of test takers, for convenience, we do not index them by $c$ and $s$. Thus, it holds that

$$\Omega \subseteq M \subseteq N. \tag{3}$$

Also, note that $M$ is observed but $\Omega$ is unknown. In fact, it may very well hold that $\Omega = \emptyset$ (i.e., $c$ has not copied at all). The prior probability of $\Omega$ being a specific subset of items in the test is given by probability function

$$p(\Omega), \quad \Omega \in \mathcal{P}(N), \tag{4}$$

where $\mathcal{P}(N)$ is the power set of $N$ (i.e., the set of all its possible subsets).

It may be tempting to check blindly (i.e., without any prior evidence) on answer copying between test takers on the entire set of items in a test for a given administration, but we are not in favor of this option. First, as the use of miniature electronic communication devices in cheating is on the rise, it would no longer suffice to check just the test takers sitting adjacent to each other, whereas checking all possible pairs would be infeasible given their sheer number. Second, for the full set of items in a real-world test, $\mathcal{P}(N)$ would become prohibitively large (e.g., $2^{40}$ subsets for a test of 40 items), making it impossible to specify a meaningful probability distribution over it. Rather, we recommend making checks only when there exists prior empirical evidence for individual test takers suggesting scrutiny of a specific portion of the test, for instance, a page of the answer sheet or a section in the test for which a proctor has reported suspicious behavior. Such specific analyses are more convincing, provided the evidence was collected and the prior distribution in (4) was specified before observing any of the actual responses by $c$ and $s$. Finally, note that the option of specifying different prior distributions for different test takers allows us to tailor the analysis to different cases of suspicion.

As for the response model, two choices are considered. The first is the nominal response model, which (with appropriate identifiability restrictions) specifies the probability of a regular test taker with ability level $\theta$ choosing response category $u_i = 1, \ldots, k_i$ for item $i$ as

$$p(u_i \mid \theta) \equiv \frac{\exp(\zeta_{u_i} + \lambda_{u_i}\theta)}{\sum_{u=1}^{k_i} \exp(\zeta_u + \lambda_u \theta)}, \tag{5}$$

where $\zeta_{u_i}$ and $\lambda_{u_i}$ are the intercept and slope parameters for category $u_i$ (Bock, 1972, 1997). Although the methods that follow are more informative for this polytomous type of model than for one of the currently popular dichotomous models, the latter tend to fit achievement test data much better and are more frequently applied. In general, dichotomous responses ($u_i = 0, 1$) have probability function

$$p(u_i \mid \theta) = p(1 \mid \theta)^{u_i}\,[1 - p(1 \mid \theta)]^{1 - u_i}. \tag{6}$$

For this case, we use the well-known three-parameter logistic (3PL) response function as an example, which gives the probability of a correct response as

$$p_i(1 \mid \theta) \equiv c_i + (1 - c_i)\,\frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]}, \tag{7}$$


where $b_i \in (-\infty, \infty)$, $a_i \in (0, \infty)$, and $c_i \in [0, 1]$ are parameters that may be interpreted as the difficulty, the discriminating power, and the height of the lower asymptote required to deal with the effects of guessing on item $i$, respectively. The generalization of the results in this paper to other types of response models (dichotomous, polytomous, multidimensional, etc.) is straightforward. Also, throughout our treatment, we assume that the item parameters have been estimated with enough precision to consider their values as known. The estimation of the ability parameter will be discussed separately below.

The conditional probability function of response $U_{ci} = u_{ci}$ for copier $c$ on item $i$ given the response $U_{si} = u_{si}$ by $s$ is equal to

$$p(u_{ci} \mid \theta_c, \Omega, u_{si}) = \begin{cases} p(u_{ci} \mid \theta_c), & \text{if } i \notin \Omega, \\ 1, & \text{if } i \in \Omega \text{ and } u_{ci} = u_{si}, \\ 0, & \text{if } i \in \Omega \text{ and } u_{ci} \neq u_{si}. \end{cases} \tag{8}$$
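In code, the three branches of (8) reduce to a simple conditional; a minimal sketch (ours, with hypothetical argument names), before (8) is unpacked in words below:

```python
def p_copier_given_source(u_ci, u_si, i, omega, p_regular):
    """The three branches of Eq. (8): omega is the hypothetical set of copied
    items and p_regular is the regular model probability p(u_ci | theta_c)."""
    if i not in omega:
        return p_regular            # c did not copy item i
    return 1.0 if u_ci == u_si else 0.0   # copied: responses must match
```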

The probability function can be explained as follows: If $c$ did not copy ($i \notin \Omega$), the probability of his/her response given the one by $s$ is just the regular marginal response probability in (5) or (6). On the other hand, if $c$ did copy ($i \in \Omega$), the responses match and the probability of $u_{ci}$ given $u_{si}$ is equal to one. The event of $c$ copying with responses $u_{ci}$ and $u_{si}$ that do not match is treated as impossible (that is, we exclude the unlikely case of $c$ making a writing error when copying); its probability is therefore equal to zero.

Applying the standard IRT assumption of conditional independence, the joint probability of the entire response vectors $\mathbf{u}_c = (u_{ci})$ and $\mathbf{u}_s = (u_{si})$ may be written as

$$p(\mathbf{u}_c, \mathbf{u}_s \mid \theta_c, \theta_s, \Omega) = p(\mathbf{u}_c \mid \theta_c, \Omega, \mathbf{u}_s)\,p(\mathbf{u}_s \mid \theta_s) = \prod_{i=1}^{n} p(u_{ci} \mid \theta_c, \Omega, u_{si}) \prod_{i=1}^{n} p(u_{si} \mid \theta_s). \tag{9}$$

Consider the posterior probability of $c$ not copying any of the items, that is, of $\Omega = \emptyset$. Combining (4) and (9), this probability is equal to

$$p(\emptyset \mid \theta_c, \theta_s, \mathbf{u}_c, \mathbf{u}_s) = \frac{p(\emptyset) \prod_{i=1}^{n} p(u_{ci} \mid \theta_c, \emptyset, u_{si}) \prod_{i=1}^{n} p(u_{si} \mid \theta_s)}{\sum_{\Omega} p(\Omega) \prod_{i=1}^{n} p(u_{ci} \mid \theta_c, \Omega, u_{si}) \prod_{i=1}^{n} p(u_{si} \mid \theta_s)} = \frac{p(\emptyset) \prod_{i=1}^{n} p(u_{ci} \mid \theta_c)}{\sum_{\Omega} p(\Omega) \prod_{i=1}^{n} p(u_{ci} \mid \theta_c, \Omega, u_{si})} = \frac{p(\emptyset) \prod_{i=1}^{n} p(u_{ci} \mid \theta_c)}{\sum_{\Omega \subseteq M} p(\Omega) \prod_{i=1}^{n} p(u_{ci} \mid \theta_c, \Omega, u_{si})}. \tag{10}$$

Remember that $\Omega$ is the hypothetical set of items that were actually copied. Consequently, for any $\Omega$ that is not a subset of $M$, it holds that $u_{ci} \neq u_{si}$ for some item, and therefore that $p(u_{ci} \mid \theta_c, \Omega, u_{si}) = 0$; see (8). Observe that the response probabilities for the source in the numerator and denominator have canceled; the only probabilities on which the expression depends are the response probabilities for the copier. The expression can be further simplified by distributing the product operator over the items in set $M$ and its complement, $\overline{M}$, the result being

$$p(\emptyset \mid \theta_c, \mathbf{u}_c, \mathbf{u}_s) = \frac{p(\emptyset) \prod_{i \in M} p(u_{ci} \mid \theta_c) \prod_{i \in \overline{M}} p(u_{ci} \mid \theta_c)}{\sum_{\Omega \subseteq M} p(\Omega) \prod_{i \in M} p(u_{ci} \mid \theta_c, \Omega, u_{si}) \prod_{i \in \overline{M}} p(u_{ci} \mid \theta_c)} = \frac{p(\emptyset) \prod_{i \in M} p(u_{ci} \mid \theta_c)}{\sum_{\Omega \subseteq M} p(\Omega) \prod_{i \in M} p(u_{ci} \mid \theta_c, \Omega, u_{si})}. \tag{11}$$

Thus, the posterior probability of $c$ not copying does not depend on the responses to any of the items outside the set with matching responses, $M$. Further, although the prior distribution $p(\Omega)$ still needs to be specified across all possible sets $\Omega$ to be a proper distribution, none of the probabilities beyond $\Omega \subseteq M$ is actually used.

The posterior odds of $c$ cheating on at least one of the items in $N$ are equal to

$$\frac{1 - p(\emptyset \mid \theta_c, \mathbf{u}_c, \mathbf{u}_s)}{p(\emptyset \mid \theta_c, \mathbf{u}_c, \mathbf{u}_s)} = \sum_{\substack{\Omega \subseteq M \\ \Omega \neq \emptyset}} \frac{p(\Omega)}{p(\emptyset)} \cdot \frac{\prod_{i \in M} p(u_{ci} \mid \theta_c, \Omega, u_{si})}{\prod_{i \in M} p(u_{ci} \mid \theta_c)}. \tag{12}$$

Distributing the product operator in the numerator of (12) over the items in the set with copied answers, $\Omega$, and its complement relative to $M$, we obtain

$$\frac{1 - p(\emptyset \mid \theta_c, \mathbf{u}_c, \mathbf{u}_s)}{p(\emptyset \mid \theta_c, \mathbf{u}_c, \mathbf{u}_s)} = \frac{\sum_{\substack{\Omega \subseteq M \\ \Omega \neq \emptyset}} p(\Omega) \prod_{i \in \Omega} p(u_{ci} \mid \theta_c, \Omega, u_{si}) \prod_{i \in M \setminus \Omega} p(u_{ci} \mid \theta_c, \Omega, u_{si})}{p(\emptyset) \prod_{i \in M} p(u_{ci} \mid \theta_c)} = \frac{\sum_{\substack{\Omega \subseteq M \\ \Omega \neq \emptyset}} p(\Omega) \prod_{i \in \Omega} p(u_{ci} \mid \theta_c, \Omega, u_{si}) \prod_{i \in M \setminus \Omega} p(u_{ci} \mid \theta_c)}{p(\emptyset) \prod_{i \in M} p(u_{ci} \mid \theta_c)} = \frac{\sum_{\substack{\Omega \subseteq M \\ \Omega \neq \emptyset}} p(\Omega) \prod_{i \in M \setminus \Omega} p(u_{ci} \mid \theta_c)}{p(\emptyset) \prod_{i \in M} p(u_{ci} \mid \theta_c)}, \tag{13}$$

where the last step follows from the fact that the conditional response probabilities for the items with matching responses in $\Omega$ are equal to one [see (8)]. Observe that the odds of $c$ cheating are not directly dependent on the ability of $s$. The only relationship between $c$ and $s$ exists through their observed set of matching responses, $M$.

Now assume that the prior probabilities of copying are independent probabilities $\gamma_i$, $i = 1, \ldots, n$, across the items in $N$. Of course, their values should be chosen to be consistent with the desired prior probability of no cheating on the entire set of items; that is, it should hold that $p(\emptyset) = \prod_{i \in N}(1 - \gamma_i)$. Although this seems a reasonable choice, the alternative of some of the prior probabilities being dependent deserves study. For instance, it may be argued that items on the same page or varying substantially in difficulty require a different specification of prior probabilities. Abbreviating the notation of the regular response probabilities $p(u_{ci} \mid \theta_c)$ as $p_{ci}$, the posterior odds of cheating can be written as

$$\frac{1 - p(\emptyset \mid \theta_c, \mathbf{u}_c, \mathbf{u}_s)}{p(\emptyset \mid \theta_c, \mathbf{u}_c, \mathbf{u}_s)} = \frac{\sum_{\substack{\Omega \subseteq M \\ \Omega \neq \emptyset}} \prod_{i \in \Omega} \gamma_i \prod_{i \in N \setminus \Omega} (1 - \gamma_i) \prod_{i \in M \setminus \Omega} p_{ci}}{\prod_{i \in N} (1 - \gamma_i) \prod_{i \in M} p_{ci}} = \frac{\sum_{\substack{\Omega \subseteq M \\ \Omega \neq \emptyset}} \prod_{i \in \Omega} \gamma_i \prod_{i \in N \setminus M} (1 - \gamma_i) \prod_{i \in M \setminus \Omega} (1 - \gamma_i)\,p_{ci}}{\prod_{i \in N} (1 - \gamma_i) \prod_{i \in M} p_{ci}} = \frac{\sum_{\substack{\Omega \subseteq M \\ \Omega \neq \emptyset}} \prod_{i \in \Omega} \gamma_i \prod_{i \in M \setminus \Omega} (1 - \gamma_i)\,p_{ci}}{\prod_{i \in M} (1 - \gamma_i)\,p_{ci}}. \tag{14}$$
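Expression (14) can be evaluated directly, if inefficiently, by enumerating the power set of $M$. The following brute-force sketch (our own illustration with hypothetical values; practical computation should use the recursion of Appendix 2) makes the combinatorial structure explicit:

```python
from itertools import combinations
from math import prod

def posterior_odds_bruteforce(gamma, p_c):
    """Posterior odds of Eq. (14) by explicit enumeration of all nonempty
    subsets Omega of the matching set M; gamma holds the prior probabilities
    gamma_i and p_c the regular response probabilities p_ci for those items."""
    m = len(gamma)
    xi = [(1 - g) * p for g, p in zip(gamma, p_c)]   # Eq. (15) below
    denom = prod(xi)
    numer = 0.0
    for size in range(1, m + 1):
        for omega in combinations(range(m), size):
            numer += prod(gamma[i] if i in omega else xi[i] for i in range(m))
    return numer / denom

print(posterior_odds_bruteforce([0.05, 0.05, 0.05], [0.6, 0.4, 0.7]))
```

The enumeration grows as $2^m$, which is harmless for the small matching sets typical of a section check but motivates the recursive algorithm developed below.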


It is illuminating to analyze the combinatorial nature of this expression. First, observe that $(1 - \gamma_i)\,p_{ci}$ is the product of the prior probability of no copying on item $i$ and the regular response probability of $c$ for this item. The product is proportional to the posterior probability of $c$ not having copied on item $i$. Likewise, $\gamma_i \cdot 1$ is the product of the prior probability of $c$ copying on item $i$ and the probability of $c$ having the same response as $s$ when copying [see (8)]. Thus, $\gamma_i$ is proportional to the posterior probability of $c$ having copied on $i$. As (14) is the ratio of two posterior probabilities, the absence of norming does not matter. It follows that (14) equals the ratio of (i) the sum of the posterior probabilities associated with all possible combinations of copying on at least one item in $M$ and not copying on any of its other items (numerator) and (ii) the posterior probability of no cheating on any of the items in $M$ at all (denominator).

In order to prepare for efficient computation of (14), let

$$\xi_{ci} \equiv (1 - \gamma_i)\,p_{ci}. \tag{15}$$

Renumbering the items so that items $1, 2, \ldots, m$ are in $M$ and items $m + 1, \ldots, n$ are in $N \setminus M$, the expression can be written as

$$\frac{1 - p(\emptyset \mid \theta_c, \mathbf{u}_c, \mathbf{u}_s)}{p(\emptyset \mid \theta_c, \mathbf{u}_c, \mathbf{u}_s)} = \frac{\boldsymbol{\gamma}'\boldsymbol{\xi}_c - \prod_{i \in M} \xi_{ci}}{\prod_{i \in M} \xi_{ci}}, \tag{16}$$

where γ = (1, γ (1) , . . . , γ (m) ) and ξ c = (ξ c(m) , ξ c(m−1) , . . . , ξ (1) c , 1) are column vectors of length 2m that have components γ (1) = (γ1 , γ2 , . . . , γm ); γ (2) = (γ1 γ2 , γ1 γ3 , . . . , γm−1 γm ); .. . γ (m) = (γ1 γ2 . . . γm )

(17)

and ξ c(m) = (ξc1 ξc2 . . . ξcm ) ξ c(m−1) = (ξc2 ξc3 . . . ξcm , ξc1 ξc3 . . . ξcm , . . . , ξc1 ξc2 . . . ξc(m−1) ); .. . = (ξ ξ (1) c1 , ξc2 , . . . , ξcm ), c

(18)

respectively. An example of all possible products of $\gamma_i$ and $\xi_{ci}$ involved in the calculation of (16) for a copier and source with $m = 3$ matching responses is given in Table 1. The first column lists all possible combinations of items, while the second and third columns contain the components of the $\boldsymbol{\gamma}$ and $\boldsymbol{\xi}_c$ vectors, respectively, in the order in which they are defined in (16)–(18). The last column of the table lists the component-wise products of $\boldsymbol{\gamma}$ and $\boldsymbol{\xi}_c$. This column enables us to calculate the posterior odds: its first entry is the denominator $\prod_{i \in M} \xi_{ci}$ of (16), while the sum of all other entries is its numerator.

The structure of Table 1 reminds us of the generalized (or compound) binomial probability distribution of the number-correct score on a test of $m$ items by a test taker with ability $\theta_c$.

Table 1. Example of all products of $\gamma_i$ and $\xi_{ci}$ in the posterior odds in Eq. (16) ($m = 3$).

Items Ω     γ component     ξ_c component        Product γ'ξ_c
∅           1               ξ_c1 ξ_c2 ξ_c3       ξ_c1 ξ_c2 ξ_c3
{1}         γ_1             ξ_c2 ξ_c3            γ_1 ξ_c2 ξ_c3
{2}         γ_2             ξ_c1 ξ_c3            ξ_c1 γ_2 ξ_c3
{3}         γ_3             ξ_c1 ξ_c2            ξ_c1 ξ_c2 γ_3
{1,2}       γ_1 γ_2         ξ_c3                 γ_1 γ_2 ξ_c3
{1,3}       γ_1 γ_3         ξ_c2                 γ_1 ξ_c2 γ_3
{2,3}       γ_2 γ_3         ξ_c1                 ξ_c1 γ_2 γ_3
{1,2,3}     γ_1 γ_2 γ_3     1                    γ_1 γ_2 γ_3

The only thing we have to do to obtain this distribution is to replace $\xi_{ci}$ and $\gamma_i$ by the response probabilities $p_{ci}$ and $q_{ci} \equiv 1 - p_{ci}$, respectively. (Of course, $\xi_{ci}$ and $\gamma_i$ do not sum to one, but this is not a problem here.) The first entry in the last column then becomes the probability of $x = 3$ items correct, the sum of the entries for $\{1\}$, $\{2\}$, and $\{3\}$ becomes the probability of $x = 2$ items correct, and so on, all the way down to the last entry, which now is the probability of $x = 0$ items correct. Continuing the analogy, it follows that the denominator of (16) corresponds to the probability of a number-correct score equal to $x = m$, whereas its numerator corresponds to the sum of the probabilities of the number-correct scores from $x = 0$ through $m - 1$. An efficient recursive algorithm for the calculation of the posterior odds of cheating suggested by this analogy is presented in Appendix 2.

2.1.1. Alternative Models

The K-index procedure for detecting answer copying is designed for a test with multiple-choice items that are scored dichotomously. It assumes the sampling of $c$ from a population of independently operating test takers with the same number of wrong answers as $c$. The sampling distribution considered is that of the number of matching incorrect alternatives conditional on the items $s$ had wrong. This distribution is taken to be binomial, with a success parameter $p_{cs}$ obtained from a conservative piecewise linear model for the regression of the proportion of matches with the incorrect answers by $s$ on the proportion of incorrect answers for the entire population of test takers. Let $w_c$ denote the proportion of incorrect answers by $c$. Formally, the success parameter is defined as

$$p_{cs} = \begin{cases} 0.085 + \beta w_c, & \text{if } 0.0 < w_c < 0.3, \\ 0.085 + 0.3\beta + 0.4\beta(w_c - 0.3), & \text{if } 0.3 \leq w_c < 1.0, \end{cases} \tag{19}$$

where $\beta$ is estimated empirically for each population and test. For a motivation and further details of the binomial model with this success parameter, see Holland (1996) and Lewis & Thayer (1998). The K-index is defined as the upper tail probability for this distribution, evaluated at the observed number of matching incorrect alternatives for $c$ and $s$.

As a result of the conditioning on the items $s$ has wrong, we need to redefine our notation. Again, $N$ denotes the set of items that is considered, but its choice now is restricted to the items $s$ had wrong. In addition, $M$ is the subset of $N$ with matching incorrect alternatives by $c$ and $s$, and $\Omega$ is the subset of $M$ for which $c$ copied the answers. Likewise, $u_{si}$ now represents the incorrect alternative for item $i$ chosen by $s$, and the prior probabilities $\gamma_i$ are for the event of $c$ copying the incorrect response to item $i$ by $s$. The conditional probability of response $U_{ci} = u_{ci}$ given $U_{si} = u_{si}$ is equal to

$$p(u_{ci} \mid \Omega, u_{si}) = \begin{cases} p_{cs}, & \text{if } i \notin \Omega, \\ 1, & \text{if } i \in \Omega \text{ and } u_{ci} = u_{si}, \\ 0, & \text{if } i \in \Omega \text{ and } u_{ci} \neq u_{si}. \end{cases} \tag{20}$$

Analogously to (16), the posterior odds of $c$ cheating under the model for the K-index are equal to

$$\frac{1 - p(\emptyset \mid \mathbf{u}_c, \mathbf{u}_s)}{p(\emptyset \mid \mathbf{u}_c, \mathbf{u}_s)} = \frac{\boldsymbol{\gamma}'\boldsymbol{\xi}_c - \prod_{i \in M} \xi_{csi}}{\prod_{i \in M} \xi_{csi}}, \tag{21}$$

where $\boldsymbol{\gamma}$ and $\boldsymbol{\xi}_c$ are vectors with the same format as in (17)–(18), but the latter now has entries

$$\xi_{csi} \equiv (1 - \gamma_i)\,p_{cs}. \tag{22}$$

The odds can be calculated using the algorithm in Appendix 2 with $\xi_{ci}$ in (15) replaced by $\xi_{csi}$ in (22).

The test of answer copying based on a binomial test (van der Linden & Sotaridona, 2004) is also for dichotomously scored multiple-choice items. Although, like the K-index, its null distribution is derived from the binomial, it does not assume any sampling of $c$ from a population of independently operating test takers. Instead, the result follows from the assumption of three alternative response processes for an individual test taker: First, if test takers know the correct answer, it is assumed that they will give it. Second, if test takers do not know the answer, they may copy it from a neighbor. Third, if test takers do not want to copy or have no access to the answers by any of their neighbors, they will guess randomly. In fact, the response process is that of the familiar knowledge-or-random-guessing model extended with the option of copying. The binomial test also conditions on the items $s$ has wrong. Continuing our new notation, the relevant response probabilities are

$$p(u_{ci} \mid \Omega, u_{si}) = \begin{cases} (k - 1)^{-1}, & \text{if } i \notin \Omega, \\ 1, & \text{if } i \in \Omega \text{ and } u_{ci} = u_{si}, \\ 0, & \text{if } i \in \Omega \text{ and } u_{ci} \neq u_{si}, \end{cases} \tag{23}$$

where $k$ is the number of alternatives. It follows immediately that the posterior odds of cheating are equal to

$$\frac{1 - p(\emptyset \mid \mathbf{u}_c, \mathbf{u}_s)}{p(\emptyset \mid \mathbf{u}_c, \mathbf{u}_s)} = \frac{\boldsymbol{\gamma}'\boldsymbol{\xi} - \prod_{i \in M} \xi_i}{\prod_{i \in M} \xi_i}, \tag{24}$$

with $\boldsymbol{\xi}_c$ replaced by $\boldsymbol{\xi}$, which has entries

$$\xi_i \equiv (1 - \gamma_i)(k - 1)^{-1}. \tag{25}$$

The odds can be calculated using the algorithm in Appendix 2 with $p_{ci}$ in (15) replaced by $(k - 1)^{-1}$.

In the empirical examples below, we will address the case of a proctor who is positive about communication between $c$ and $s$ while they were working on a section of the test but unable to be more specific. The case can be represented by equal prior probabilities of copying on the individual items, that is, by the choice of $\gamma_i = \gamma$, $i = 1, \ldots, n$, with $\gamma$ a positive number following from $p(\emptyset) = (1 - \gamma)^n$. For checks based on the K-index and the binomial test above, the posterior odds in (21) and (24) simplify to

$$\frac{1 - p(\emptyset \mid \mathbf{u}_c, \mathbf{u}_s)}{p(\emptyset \mid \mathbf{u}_c, \mathbf{u}_s)} = \sum_{g=1}^{m} \binom{m}{g} \left[\frac{\gamma}{(1 - \gamma)\,p_{cs}}\right]^g \tag{26}$$

and

$$\frac{1 - p(\emptyset \mid \mathbf{u}_c, \mathbf{u}_s)}{p(\emptyset \mid \mathbf{u}_c, \mathbf{u}_s)} = \sum_{g=1}^{m} \binom{m}{g} \left[\frac{\gamma(k - 1)}{1 - \gamma}\right]^g, \tag{27}$$

respectively, where $g$ denotes the size of set $\Omega$.
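Because (26) and (27) involve only binomial coefficients and a constant per-item ratio, they can be evaluated in a few lines; a sketch (ours, with hypothetical inputs):

```python
from math import comb

def odds_k_index(m, gamma, p_cs):
    """Posterior odds in Eq. (26): m matching incorrect alternatives,
    constant prior gamma per item, K-index success parameter p_cs."""
    x = gamma / ((1 - gamma) * p_cs)
    return sum(comb(m, g) * x**g for g in range(1, m + 1))

def odds_binomial_model(m, gamma, k):
    """Posterior odds in Eq. (27) for k-alternative items under the
    knowledge-or-random-guessing model extended with copying."""
    x = gamma * (k - 1) / (1 - gamma)
    return sum(comb(m, g) * x**g for g in range(1, m + 1))
```

Both sums telescope to $(1 + x)^m - 1$ for their respective ratios $x$, by the binomial theorem, which provides a quick check on any implementation.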

The simplification does not hold for response models with item parameters, such as the ones in (5) and (6)–(7). Although it seems tempting to use (26) as a rough approximation to (14), with the average of $p_{ci}$ across the items substituted for $p_{cs}$, the result will be a considerable loss of power. We will illustrate this point later.

2.2. Fraudulent Erasures

The detection of fraudulent erasures of wrong answers on answer sheets requires a model for the probability of a regular test taker changing an answer on a multiple-choice item found to be incorrect during review. The proposed model is based on a two-stage response process: In the first stage, the test taker produces initial answers to the items in the test. Once the answers have been given, the second stage begins, in which (s)he reviews all answers, confirming the ones still thought to be correct but changing those for which a better alternative seems available. Of course, a test taker could change an answer more than once, but we focus on the final response. Observe that, as a result of optical scanning of the answer sheet, we know both the erased initial response and the final response for each item. The assumption of this two-stage process can only hold when the test is not speeded and the test taker thus has enough testing time to complete the review. Erasures may also arise from other, non-fraudulent processes, for instance, when a test taker discovers systematic misreading of the item numbers on the answer sheet. Such cases have to be excluded by additional analysis and/or direct inspection of the answer sheets.

Suppose the test consists of items that have been pretested and calibrated under the 3PL model in (7). It then makes sense to assume probabilities for the first-stage responses that follow the same model, which we now denote as

$$p(1 \mid \theta^{(1)}) \equiv c_i + (1 - c_i)\,\frac{\exp[a_i(\theta^{(1)} - b_i)]}{1 + \exp[a_i(\theta^{(1)} - b_i)]}, \tag{28}$$

where $a_i$, $b_i$, and $c_i$ are the item parameters estimated during item calibration. The test taker's ability $\theta^{(1)}$ is estimated from the initial responses. Because of the lack of identifiability of the model, in order to capture the change in response probability due to the effects of the initial responses, we need a constrained version of it for the second-stage responses. We therefore fix $\theta$ at its initial level but leave the $a_i$ and $b_i$ parameters free to capture the change. In addition, we assume that if the test taker still does not know the answer and already guessed during the initial stage, (s)he will not guess again. The result is

$$p(u_i^{(2)} = 1 \mid \theta^{(1)}, u_i^{(1)} = 0) = \frac{\exp[a_{0i}(\theta^{(1)} - b_{0i})]}{1 + \exp[a_{0i}(\theta^{(1)} - b_{0i})]} \tag{29}$$

and

$$p(u_i^{(2)} = 1 \mid \theta^{(1)}, u_i^{(1)} = 1) = \frac{\exp[a_{1i}(\theta^{(1)} - b_{1i})]}{1 + \exp[a_{1i}(\theta^{(1)} - b_{1i})]}, \tag{30}$$

where $u_i^{(1)}$ and $u_i^{(2)}$ are the initial and final responses to item $i$ and $(a_{0i}, b_{0i})$ and $(a_{1i}, b_{1i})$ are the parameters for item $i$ given $u_i^{(1)} = 0$ and $1$, respectively. The latter are free parameters that can be estimated from the final responses given the initial responses using logistic regression; for more details of the model and its estimation, as well as its generalization to polytomous response models as in (5), see van der Linden & Jeon (2012).
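A minimal sketch of this estimation step for a single item (ours, not the authors' code; it assumes the statsmodels package, restricts the data to test takers whose initial response to the item was wrong, and uses illustrative variable names):

```python
import numpy as np
import statsmodels.api as sm

def fit_wr_parameters(theta1, final_correct):
    """Estimate the free parameters (a_0i, b_0i) of Eq. (29) for one item by
    logistic regression of the final response on the initial ability estimate,
    using only test takers whose initial response to the item was wrong.
    Both inputs are 1-D arrays; both response classes must be present."""
    X = sm.add_constant(theta1)              # columns: intercept, theta^(1)
    fit = sm.Logit(final_correct, X).fit(disp=0)
    intercept, slope = fit.params            # logit = intercept + slope * theta
    a0 = slope                               # a_0i
    b0 = -intercept / slope                  # b_0i, from a_0i * (theta - b_0i)
    return a0, b0
```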


Observe that (29) is the probability of a WR change. These changes are the ones of interest when there is suspicion of fraudulent behavior by school teachers or administrators; we therefore ignore (30). Let $E_i$ be a binary variable indicating whether ($E_i = 1$) or not ($E_i = 0$) an observed change on item $i$ was a WR change. As before, $N$ denotes the set of items considered, $M \subseteq N$ the subset of items with an observed WR erasure, and $\Omega \subseteq M$ its subset of fraudulent erasures. Prior probability $\gamma_i$ is for the event of a fraudulent erasure on item $i$. The probability of $E_i = e_i$ is

$$p(e_i \mid \Omega) = \begin{cases} p(u_i^{(2)} = 1 \mid \theta^{(1)}, u_i^{(1)} = 0), & \text{if } i \notin \Omega \text{ and } e_i = 1, \\ 1 - p(u_i^{(2)} = 1 \mid \theta^{(1)}, u_i^{(1)} = 0), & \text{if } i \notin \Omega \text{ and } e_i = 0, \\ 1, & \text{if } i \in \Omega \text{ and } e_i = 1, \\ 0, & \text{if } i \in \Omega \text{ and } e_i = 0. \end{cases} \tag{31}$$

The second probability is given only for completeness; our current focus is exclusively on WR changes, which have the first probability. The last two probabilities reflect the assumptions of the teacher/administrator changing incorrect answers only and always replacing them by the correct answer.

Following the same argument as before, the posterior odds of cheating are

$$\frac{1 - p(\emptyset \mid \mathbf{e})}{p(\emptyset \mid \mathbf{e})} = \frac{\boldsymbol{\gamma}'\boldsymbol{\xi} - \prod_{i \in M} \xi_i}{\prod_{i \in M} \xi_i}, \tag{32}$$

where vector $\boldsymbol{\xi}$ has entries

$$\xi_i \equiv (1 - \gamma_i)\,p(u_i^{(2)} = 1 \mid \theta^{(1)}, u_i^{(1)} = 0). \tag{33}$$

Again, the odds can easily be calculated using the algorithm in Appendix 2, with $p_{ci}$ in (15) replaced by $p(u_i^{(2)} = 1 \mid \theta^{(1)}, u_i^{(1)} = 0)$ in (29).

3. A Few Numerical Examples

As each of the proposed checks implies an expression for its posterior odds differing only in the probabilities of the specific type of cheating behavior on the items it addresses, we restrict our examples to the fully developed case of answer copying under a known response model in (5)–(16). All examples are for the item parameters for the nominal response model in (5) for the same set of 40 items as used in Wollack (1997). Each of the items had five answer choices.

Figure 1 highlights the probabilistic structure of random matches between pairs of test takers on the entire test derived from the item parameters only. Each of its plots shows the distribution of the number of random matches for a typical case of a lower-ability test taker suspected of copying from a more able one that we may have to check on in practice: $(\theta_c, \theta_s) = (-2.0, 1.0)$, $(-1.5, 1.0)$, $(-1.0, 1.0)$, and $(-0.5, 1.0)$. The distributions are known to be generalized or compound binomial and can be generated from the probabilities of a random match on the individual items using another version of the recursive algorithm discussed in Appendix 2 (for details, see van der Linden & Sotaridona, 2006). For each item, the probability of a random match was calculated according to (1) (i.e., without any response data).
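A sketch of the data-generating step behind such an analysis (ours; the Wollack (1997) item parameters are not reproduced here, so item_params stands for any calibrated list of intercept/slope vectors):

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_response(zeta, lam, theta):
    """Draw one response category from the nominal response model in (5)."""
    p = np.exp(zeta + lam * theta)
    return rng.choice(len(zeta), p=p / p.sum())

def simulate_matching_set(item_params, theta_c, theta_s):
    """Simulate independently working c and s on a section and return the
    set M of items with matching alternatives; item_params is a list of
    (zeta, lam) pairs from the calibrated items."""
    M = []
    for i, (zeta, lam) in enumerate(item_params):
        if draw_response(zeta, lam, theta_c) == draw_response(zeta, lam, theta_s):
            M.append(i)
    return M
```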

[Figure 1. Distributions of random matches between the responses of a hypothetical copier and source on a test of 40 five-choice items for four different combinations of ability levels. Note: the lower-right corner of each plot shows the average probability of a match on an item.]

The average probabilities for the four pairs of test takers across all items were equal to $\bar{\pi}_{cs} = .147$, $.225$, $.325$, and $.436$, respectively. As the distributions in Figure 1 confirm, it is not unlikely for two test takers to work independently and nonetheless produce substantial numbers of matching responses. The problem of separating these cases from fraudulent matches is our statistical challenge.

Suppose the test consists of different sections and we have prior evidence of $c$ copying some of the answers from $s$ in the first section of ten items, for instance, in the form of a proctor who has observed suspicious communications between the pair of test takers. The proctor is positive that the communications occurred while $c$ and $s$ were working on this section but is unable to be more specific. The strength of the evidence can be measured in the form of a prior probability of $c$ having copied at least one of the answers from $s$ in the section, that is, through the specification of the prior probability $1 - p(\emptyset)$. Since the proctor is ignorant as to the specific items on which copying might have occurred, it seems natural to adopt a constant prior probability of $c$ having copied on any of them; that is, $\gamma_i = \gamma$, $i = 1, \ldots, 10$. For the general case of $n$ items, probability calculus allows us to derive $\gamma$ from $p(\emptyset) = (1 - \gamma)^n$ as

$$\gamma = 1 - p(\emptyset)^{1/n}. \tag{34}$$

A set of response data was generated from the item parameters in the first section for the same four ability pairs $(\theta_c, \theta_s)$ as in Figure 1. The number of observed matches between $c$ and $s$ happened to be equal to two for all four pairs of $\theta$ values. In addition, for each of these pairs we simulated different levels of answer copying by systematically replacing responses by $c$ with those by $s$ for the items that did not have a random match, beginning with the first non-matching item in our files.
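As a quick numeric check on (34), using the relation $p(\emptyset) = (1 - \gamma)^n$ stated above (the loop and values are our illustration for the ten-item section):

```python
# Per-item prior probability gamma implied by the prior probability of at
# least one copied item in an n-item section, via Eq. (34).
n = 10
for prior_at_least_one in (0.25, 0.50, 0.75):
    gamma = 1 - (1 - prior_at_least_one) ** (1 / n)
    print(prior_at_least_one, gamma)
# prior 0.25 -> gamma ~ 0.028; prior 0.50 -> ~ 0.067; prior 0.75 -> ~ 0.129
```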

Table 2. Posterior odds given numbers of fraudulent matches between pairs of test takers on a section of ten items in the test, as a function of their ability levels and the prior probability of at least one fraudulent match.

(θc, θs)           (−2.0, 1.0)                  (−1.5, 1.0)
Prior odds         1/3      1        3          1/3      1        3
No. of matches
 2                 0.62     1.71     4.28       0.31     0.82     1.98
 3                 2.04     7.56     27.58      0.41     1.18     3.18
 4                 2.42     10.51    45.50      0.76     2.52     8.50
 5                 3.15     16.00    444.05     0.90     3.20     12.31
 6                 6.06     45.44    681.63     1.62     7.11     38.02
 7                 6.82     57.40    *          4.48     28.91    255.20
 8                 13.96    188.98   *          5.12     37.47    407.13
 9                 15.59    239.49   *          12.89    157.66   *
10                 19.85    391.38   *          15.11    219.86   *

(θc, θs)           (−1.0, 1.0)                  (−0.5, 1.0)
Prior odds         1/3      1        3          1/3      1        3
No. of matches
 2                 0.19     0.50     1.15       0.12     0.31     0.69
 3                 0.67     1.99     5.56       0.20     0.53     1.27
 4                 0.98     3.36     11.78      0.34     0.98     2.65
 5                 1.14     4.18     16.82      0.47     1.43     4.38
 6                 1.38     5.65     27.56      0.63     2.11     7.50
 7                 1.63     7.36     42.35      0.75     2.65     10.55
 8                 8.97     64.74    658.47     0.85     3.18     14.02
 9                 9.56     74.34    858.08     0.94     3.68     17.70
10                 11.10    101.38   *          1.12     4.72     26.35

Note: The entries in the first row (two matches) are for random matches. Asterisks indicate posterior odds larger than 1,000.

Three levels of the prior probability were analyzed: $1 - p(\emptyset) = .25$, $.50$, and $.75$. Observe that these choices amount to prior odds of copying on at least one item equal to 1:3, 1:1, and 3:1, respectively; that is, a case of weak belief in cheating, indifference between cheating and no cheating, and strong belief in cheating. For each of the numbers of matches $m = 2, \ldots, 10$, the posterior odds of answer copying were calculated using the algorithm in Appendix 2. Table 2 shows the results for all simulated conditions.

Several things can be observed from these examples. First, as expected, the posterior odds increased with the number of fraudulent answer changes. They happen to do so at a different rate for each of the simulated conditions, though. One of the reasons for these different rates is the dependence of the odds on the actual responses simulated for the source, which, obviously, were different for each simulated condition. Second, not surprisingly, the posterior odds increased with the prior odds of cheating. In fact, moving from prior odds of 3:1 against cheating to 3:1 in favor of it (a change by a factor of nine) had a strong effect on the posterior odds for each given number of matches. Finally, the posterior odds showed a tendency to decrease as the difference between $\theta_c$ and $\theta_s$ decreases. Again, this was as expected: the closer the abilities of the copier and the source, the more likely a random match on any of the items, and the Bayesian response to this is lower posterior odds.

4. Discussion

Although the tendencies in Table 2 are clear, they are for twelve specific conditions only. We therefore should not generalize without further study.

[Figure 2. Posterior odds of answer copying as a function of the number of matches according to Eq. (26), with the probability of a random match on an item calculated as the average of Eq. (2) across all items in the first section of the test. Note: the upper-left corner of each plot shows the average probability of Eq. (2) across the items.]

In fact, one of the dominant impressions derived from our current set of examples is their data dependency. For instance, for the case of prior indifference, if we had accepted posterior odds greater than five as evidence of answer copying, the critical numbers of matches given the responses for each of the four ability pairs would have been two, five, five, and ten matches (Table 2, middle columns), respectively. Such differences should not come as a surprise, though; in a Bayesian approach, all inferences are conditional on the data.

Our earlier derivation of the posterior odds allows us to be more specific about the factors that have an impact on them. As demonstrated by our derivation of (10), in order to detect answer copying, the responses by $c$ and $s$ outside the subset of items with a match, $M$, appear to be redundant. On the other hand, the odds of having copied depend critically on the specific alternatives chosen by the copier. Consequently, in order to account for the likelihood of a random match with the source on them, we need to know the parameters that control the copier's response probabilities for these alternatives, that is, both the pertinent item parameters and the ability parameter $\theta_c$. Our derivations also reveal that the ability of the source, $\theta_s$, does not matter (Eq. 13), a fact that makes perfect sense: the only thing that counts when $c$ copies from $s$ are the actual responses by the latter; it is unnecessary to know the ability that generated these responses.

If any of the relevant quantities is left out of the equation, the posterior odds are miscalculated. Two cases in point are presented in Figures 2 and 3. Both figures are for the same indifference prior odds of 1:1 used for each of the four ability pairs in Table 2 (middle columns).

[Figure 3. Posterior odds of answer copying as a function of the number of matches according to Eq. (26), with the probability of a random match on an item calculated as the average of Eq. (1) across all items in the first section of the test. Note: the upper-left corner of each plot shows the average probability of Eq. (1) across the items.]

Figure 2 presents the posterior odds of copying as a function of the number of matches, with the probability of a match calculated as the average of the probabilities $\pi_{cia'}$ in (2) across the same ten items and for the same responses by the source as in the table. The curves follow directly upon substitution of this average probability for $p_{cs}$ in (26). This case thus amounts to intentional ignorance of the differences between the items and their alternatives while still conditioning on the responses by the source. Observe that, with a slight exception for $(\theta_c, \theta_s) = (-0.5, 1.0)$, the odds of copying in Figure 2 are much smaller than in Table 2. As predicted earlier, the effect of ignoring relevant statistical information on the differences between the items and their alternatives thus tends to be a loss of power, which in the current Bayesian context takes the form of much less discrimination between the odds associated with the lower and higher numbers of matches. The curves in Figure 3 are based on the average of the probabilities in (1), that is, on a calculation that omits the conditioning on the source's responses as well. The effect of additionally ignoring these responses is an even larger loss of power.

Again, we recommend not using the Bayesian checks in this paper routinely for all test takers but only when prior empirical evidence suggests a specific type of cheating by a test taker on a specific set of items in the test. This restriction has two advantages. First, it automatically entails the formulation of prior probabilities for the items involved in the check that are much more informative than the weak prior probabilities following from the routine use of (34) for full-length tests.


Second, use of the checks for a smaller subset of items enables us to estimate $\theta_c$, the only unknown quantity in (14), from the responses to the other items in the test. Although the use of a simple plug-in estimate of $\theta_c$ would be computationally convenient, its estimation error would propagate in the calculation of the posterior odds. The Bayesian way to proceed is to introduce a prior distribution for $\theta_c$ as well and to account for any remaining uncertainty about this parameter by integrating it out of the posterior quantities of interest. The questions of how to do so and of how serious the impact on the posterior odds would be will be the subject of subsequent research.

Our final comment is on the combinatorial nature of the posterior odds of cheating in (5)–(16). Naively, one might have expected just an application of the rather straightforward "posterior odds = likelihood ratio × prior odds" factorization explained in most introductory texts on Bayesian statistics. However, this type of factorization only holds for the case of a statistical model for independent, identically distributed variables with common parameters. In the current context, it would apply only if all items could be treated as exchangeable. However, as just demonstrated by the differences between the results in Table 2 and Figures 2–3, test items are not exchangeable. They differ substantially in their properties, and therefore considerably in their probabilities of a random match between independently working test takers. Consequently, in order to derive the posterior odds for an observed number of matches, we have to evaluate all possible combinations of cheating on at least one item in $M$ and no cheating on the rest. Fortunately, the algorithm in Appendix 2 enables us to deal with this job efficiently.

Appendix 1: Conditional and Unconditional Tests of Answer Copying

The experiment of $c$ and $s$ responding to the same test can be conceived of as a two-stage process, in which we first observe response vector $\mathbf{U}_s = \mathbf{u}_s$ by $s$ and then $\mathbf{U}_c = \mathbf{u}_c$ by $c$ given the former. For known item parameters, the two vectors have distributions $\mathbf{u}_s \sim F_{\theta_s}$ and $\mathbf{u}_c \sim F_{\theta_c}$, depending on the ability parameters $\theta_s$ and $\theta_c$, respectively. Given $\mathbf{U}_s = \mathbf{u}_s$, the observation of $\mathbf{U}_c = \mathbf{u}_c$ is equivalent to that of the number of matches $M_{cs} = m_{cs}$ between $c$ and $s$, with $M_{cs} \sim F_{\gamma_{cs}}$, where $\gamma_{cs}$ is the unknown number of items copied by $c$ (see the main text for examples of the distribution $F_{\gamma_{cs}}$). Conditional inference about $\gamma_{cs}$ given $\mathbf{U}_s = \mathbf{u}_s$ has expected power equal to that of inference from the observed joint distribution of $\mathbf{U}_c$ and $\mathbf{U}_s$ when $\mathbf{U}_s$ is partially ancillary with respect to $\gamma_{cs}$, that is, when (i) its distribution, $F_{\theta_s}$, does not depend on $\gamma_{cs}$ and (ii) $\gamma_{cs}$ and $\theta_s$ are distinct (Lehmann & Romano, 2005, Sects. 10.1–10.2). In the current application, both conditions are fulfilled.

Appendix 2: Calculation of Posterior Odds

The distribution of the number-correct score on a test, which is known to be generalized or compound binomial, lacks a closed-form probability function, but its probabilities are easily calculated using a well-known recursive algorithm introduced in the test-theory literature by Lord & Wingersky (1984). The earlier analogy between these probabilities and the entries in the last column of Table 1 is the rationale for the following modification of the algorithm, which can be used to calculate the posterior odds in (16).

Let $z$ denote the size of the relative complement $M \setminus \Omega$, that is, of the subset of items in $M$ for which the answers were not copied. In addition, we use $\pi_m(z)$ to represent the sum of all possible joint products of the $z$ quantities $\xi_{ci}$ for the items in $M \setminus \Omega$ (but remember that these are not probabilities) and the $m - z$ prior probabilities $\gamma_i$ for the items in $\Omega$ [for example, for $z = 2$, the sum of the products in the last column of Table 1 for the rows $\{1\}$, $\{2\}$, and $\{3\}$]. Finally, we define $\pi_m(z) = 0$ for $z < 0$ and $z > m$.


Beginning with the first item ($t = 1$), the algorithm runs through steps $t = 2, \ldots, m$, each time adding one extra item to the set that is considered:

1. For $t = 1$, set $\pi_1(0) = \gamma_1$ and $\pi_1(1) = \xi_{c1}$.
2. For $t = 2, \ldots, m$ and $z = 0, \ldots, t$, calculate the values of $\pi_t(z)$ from the recursive relation

$$\pi_t(z) = \begin{cases} \gamma_t\,\pi_{t-1}(0), & \text{for } z = 0, \\ \gamma_t\,\pi_{t-1}(z) + \xi_{ct}\,\pi_{t-1}(z - 1), & \text{for } z = 1, \ldots, t - 1, \\ \xi_{ct}\,\pi_{t-1}(t - 1), & \text{for } z = t. \end{cases}$$

3. Finally, calculate the posterior odds in (16) as

$$\frac{1 - p(\emptyset \mid \theta_c, \mathbf{u}_c, \mathbf{u}_s)}{p(\emptyset \mid \theta_c, \mathbf{u}_c, \mathbf{u}_s)} = \frac{\sum_{z=0}^{m-1} \pi_m(z)}{\pi_m(m)}. \tag{35}$$
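A direct transcription of the three steps above into code (our sketch; gamma and xi hold the per-item quantities defined in the main text):

```python
def posterior_odds_recursive(gamma, xi):
    """Modified Lord-Wingersky recursion for the posterior odds in Eq. (16);
    gamma and xi hold gamma_i and xi_ci = (1 - gamma_i) * p_ci for the m
    items in M (renumbered 1, ..., m)."""
    m = len(gamma)
    pi = [gamma[0], xi[0]]                     # step 1: pi_1(0), pi_1(1)
    for t in range(1, m):                      # step 2: add items 2, ..., m
        new = [gamma[t] * pi[0]]                               # z = 0
        new += [gamma[t] * pi[z] + xi[t] * pi[z - 1]
                for z in range(1, t + 1)]                      # z = 1, ..., t
        new.append(xi[t] * pi[t])                              # z = t + 1
        pi = new
    return sum(pi[:-1]) / pi[-1]               # step 3: Eq. (35)

# Agrees with the brute-force enumeration of Eq. (14) given earlier:
gamma = [0.05, 0.05, 0.05]
xi = [(1 - g) * p for g, p in zip(gamma, [0.6, 0.4, 0.7])]
print(posterior_odds_recursive(gamma, xi))
```

Because each step reuses the previous vector of partial sums, the cost grows quadratically in m rather than exponentially, as with the power-set enumeration.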

Observe that, unlike the original version of the algorithm for use with number-correct score distributions, the current modification can be used for dichotomous and polytomous response models alike: the response probabilities $p_{ci}$ that figure in $\xi_{ci} \equiv (1 - \gamma_i)\,p_{ci}$ are just the probabilities of the copier's actual responses to the items in $M$, no matter whether the items are scored dichotomously under models as in (7), polytomously under models as in (5), or as a mixture of both.

References

Angoff, W.H. (1974). The development of statistical indices for detecting cheaters. Journal of the American Statistical Association, 69, 44–49.
Belov, D.I., & Armstrong, R.D. (2010). Automatic detection of answer copying via Kullback–Leibler divergence and K-index. Applied Psychological Measurement, 34, 379–392.
Bock, R.D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29–51.
Bock, R.D. (1997). The nominal categories model. In W.J. van der Linden & R.K. Hambleton (Eds.), Handbook of modern item response theory (pp. 33–49). New York: Springer.
Fox, J.-P., & Meijer, R.R. (2008). Using item response theory to obtain individual information from randomized response data: An application using cheating data. Applied Psychological Measurement, 32, 595–610.
Frary, R.B., Tideman, T.N., & Watts, T.M. (1977). Indices of cheating on multiple-choice tests. Journal of Educational Statistics, 2, 235–256.
Glas, C.A.W., & Meijer, R.R. (2003). A Bayesian approach to person-fit analysis in item response theory. Applied Psychological Measurement, 27, 217–233.
Holland, P.W. (1996). Assessing unusual agreement between the incorrect answers of two examinees using the K-index: Statistical theory and empirical support (Research Report RR-96-7). Princeton, NJ: Educational Testing Service.
Jacob, B.A., & Levitt, S. (2003a). Rotten apples: An investigation of the prevalence and predictors of teacher cheating. Quarterly Journal of Economics, 118, 843–877.
Jacob, B.A., & Levitt, S. (2003b). Catching cheating teachers: The results of an unusual experiment in implementing theory. Brookings-Wharton Papers on Urban Affairs, 185–209.
Jacob, B.A., & Levitt, S. (2004, Winter). To catch a cheat. Education Next, 68–75.
Lehmann, E.L., & Romano, J.P. (2005). Testing statistical hypotheses (3rd ed.). New York: Springer.
Levitt, S., & Dubner, S.J. (2005). Freakonomics: A rogue economist explores the hidden side of everything. New York: Harper Collins.
Lewis, C. (2006). A note on conditional and unconditional hypothesis testing: A discussion of an issue raised by van der Linden and Sotaridona. Journal of Educational and Behavioral Statistics, 31, 305–309.
Lewis, C., & Thayer, D.T. (1998). The power of the K-index (or PMIR) to detect copying (Research Report RR-98-49). Princeton, NJ: Educational Testing Service.
Lord, F.M., & Wingersky, M.S. (1984). Comparison of IRT true-score and equipercentile observed-score "equatings". Applied Psychological Measurement, 8, 452–461.
McLeod, L.D., Lewis, C., & Thissen, D. (2003). A Bayesian method for the detection of item preknowledge in computerized adaptive testing. Applied Psychological Measurement, 27, 121–137.
Meijer, R.R., & Sijtsma, K. (2001). Methodology review: Evaluation of person fit. Applied Psychological Measurement, 25, 107–135.

Qualls, A.L. (2001). Can knowledge of erasure behavior be used as an indicator of possible cheating? Educational Measurement: Issues and Practice, 20(1), 9–16.
Saupe, J.L. (1960). An empirical model for the corroboration of suspected cheating on multiple-choice tests. Educational and Psychological Measurement, 20, 475–489.
Sotaridona, L.S., & Meijer, R.R. (2002). Statistical properties of the K-index for detecting answer copying. Journal of Educational Measurement, 39, 115–132.
van der Linden, W.J., & Guo, F. (2008). Bayesian procedures for identifying aberrant response-time patterns in adaptive testing. Psychometrika, 73, 365–384.
van der Linden, W.J., & Jeon, M. (2012). Modeling answer changes on test items. Journal of Educational and Behavioral Statistics, 37, 180–199.
van der Linden, W.J., & Sotaridona, L.S. (2004). A statistical test for detecting answer copying on multiple-choice tests. Journal of Educational Measurement, 41, 361–377.
van der Linden, W.J., & Sotaridona, L. (2006). Detecting answer copying when the regular response process follows a known response model. Journal of Educational and Behavioral Statistics, 31, 283–304.
Wesolowsky, G.O. (2000). Detecting excessive similarity in answers on multiple-choice exams. Journal of Applied Statistics, 27, 909–921.
Wollack, J.A. (1997). A nominal response model approach to detect answer copying. Applied Psychological Measurement, 21, 307–320.

Manuscript Received: 22 OCTOBER 2013
