PSYCHOMETRIKA

2014, DOI: 10.1007/s11336-014-9405-1

SADDLEPOINT APPROXIMATIONS OF THE DISTRIBUTION OF THE PERSON PARAMETER IN THE TWO PARAMETER LOGISTIC MODEL

MARTIN BIEHLER, HEINZ HOLLING, AND PHILIPP DOEBLER

WESTFÄLISCHE WILHELMS-UNIVERSITÄT MÜNSTER

Large sample theory states the asymptotic normality of the maximum likelihood estimator of the person parameter in the two parameter logistic (2PL) model. In short tests, however, the assumption of normality can be grossly wrong. As a consequence, intended coverage rates may be exceeded and confidence intervals are revealed to be overly conservative. Methods belonging to higher-order theory, more specifically saddlepoint approximations, are a convenient way to deal with small-sample problems. Confidence bounds obtained by these means hold the approximate confidence level for a broad range of the person parameter. Moreover, an approximation to the exact distribution permits the computation of median unbiased estimates (MUE) that are as likely to overestimate as to underestimate the true person parameter. Additionally, in small samples, these MUE are less mean-biased than the often-used maximum likelihood estimator.

Key words: small samples, saddlepoint approximation, Lugannani–Rice, modified signed likelihood ratio statistic, higher-order theory, parameter distribution, 2PL, exponential family, person parameter, confidence intervals, median unbiased estimator, mean-bias.

1. Introduction

Modern item response theory (IRT) has clear mathematical and conceptual advantages over classical test theory (CTT) (e.g., Reeve & Mâsse, 2004; Hambleton & Zhao, 2005). Most importantly, IRT provides a standard error of measurement unique to each possible response pattern, while tests using CTT often confine themselves to a single such value for all sum scores. Ideally, the maximum likelihood estimator (MLE) is normally distributed with asymptotic variance given by the inverse of the expected or observed information (e.g., Lehmann, 1999). So by default the asymptotic variance is used to form symmetrical confidence limits around the estimate. These confidence intervals are also referred to as Wald confidence intervals because they rely on the inversion of the Wald test. The asymptotic properties, however, belong to large sample theory, a point also addressed by Borsboom (2006, p. 443), who noted that "[t]he estimation and testing of psychometric models is not always feasible due to the fact that one often needs large data sets for this purpose". For person parameter estimation by maximum likelihood, the available data are given by the response patterns. Since the number of items in most tests is rather small, there is limited information about this parameter in the data. For the person parameter in the Rasch model, Fischer (2007, p. 548) questions the importance of the asymptotic properties altogether because of the insufficient number of items in most tests. This inconvenience is made more acute by the preference for shorter tests over longer ones. In particular, computerized adaptive testing (CAT) (e.g., van der Linden & Glas, 2000) was developed with the intention of economizing the collection of data. For a medium range of person parameters, the failure to meet the exact significance level is less pronounced than it is in the more extreme parts.
Electronic Supplementary Material The online version of this article (doi:10.1007/s11336-014-9405-1) contains supplementary material, which is available to authorized users. Requests for reprints should be sent to Martin Biehler, Westfälische Wilhelms-Universität Münster, Münster, Germany. E-mail: [email protected]

© 2014 The Psychometric Society


For example, the MLE of the person parameter in the Rasch model is not normally distributed, not even with 25 items in a test (Hoijtink & Boomsma, 1995); the same has been observed for the 2PL model (Doebler, Doebler, & Holling, 2013). Liou and Yu (1991) compared confidence limits for the person parameter in the three parameter logistic (3PL) model obtained from the score statistic, from the Wald statistic, and from the bootstrap. Bootstrapping yielded the most accurate results, closely followed by the score statistic, whereas confidence intervals obtained by the Wald statistic revealed less accurate coverage rates. The authors recommend relying on first-order confidence intervals only for test lengths of at least 40 items. Ogasawara (2012) determined higher-order distributions of several versions of the studentized person parameter (Wald pivot) for the 3PL model. These distributional forms were derived for correct model specification and model misspecification. From these asymptotic distributions he obtained one-sided confidence intervals for the MLE of the person parameter. Formulae provided by Ogasawara (2012) can be directly applied to the 2PL model by setting the guessing parameter to zero.

1.1. Approaches Based on the Exact Distribution

Another promising approach to the problem would be to base inference on finite sample results, that is, on the exact rather than the asymptotic distribution of the estimator. Klauer (1991a) derived uniformly most accurate unbiased (UMAU) confidence intervals for the person parameter in the Rasch model by inverting uniformly most powerful unbiased (UMPU) tests based on the exact distribution of the person raw score, the sufficient statistic of the parameter. Note, however, that randomization of the person raw score is necessary in order to obtain a continuous distribution that allows for UMAU confidence bounds (Lehmann & Romano, 2005; Klauer, 1991a).
The derivation of the exact distribution of the sufficient statistic in the two parameter logistic (2PL) model is forbiddingly tedious, and randomization is impaired by the unequal distances between successive sufficient statistics. However, a similar approach is pursued by Doebler et al. (2013), who define acceptance regions in terms of included response patterns instead of person raw scores or weighted sum scores. These acceptance regions are inverted to yield person parameter confidence sets. The large number of possible response patterns allows composing nearly exact acceptance regions by including only some patterns belonging to a person raw score and excluding the remaining ones. Two different kinds of confidence regions are constructed in this manner. One follows from the usual normal approach, and the other is based on a population prior. The latter region minimizes the average expected interval length. Note that these confidence sets are not necessarily connected; even though the exact significance level is held, confidence intervals can include discontinuities. Moreover, the proposed algorithm is computationally expensive and not practical for tests comprising more than 20 items.

1.2. Saddlepoint Approximations

By contrast, there exist some highly accurate, easy-to-use methods to approximate the exact parameter distributions in finite samples, at least for models in the exponential family. It must be noted that the person parameter of the 3PL model is not in the exponential family, and hence the methods proposed in this paper do not apply to it. These methods are, namely, the Lugannani–Rice approximation (Lugannani & Rice, 1980), a saddlepoint approximation, and the modified signed likelihood root r* due to Barndorff-Nielsen (1986). The benefit from these approximations is twofold. They yield confidence limits that are closer to the nominal level, and they may allow for the derivation of alternative estimators.
These so-called higher-order-asymptotics, however, do not yield two-sided confidence intervals in the spirit of Klauer’s (1991a) UMAU confidence bounds, which have minimal length while exactly holding the nominal confidence level. Instead,


one has to be content with boundaries obtained from inverting two one-sided tests, each with significance level α/2. Nevertheless, these confidence intervals have the advantage of not necessarily being symmetrical about the estimate, as opposed to boundaries based on the asymptotic variance. This can be crucial in testing single person parameters. Alternatively, it is possible to define confidence intervals similar to highest posterior density (HPD) intervals frequently used in Bayesian parameter estimation (e.g., Fox, 2010) by capitalizing on the fact that a highly accurate approximation to the density beyond the normality assumption is available. However, if the assumption of normality holds, the two aforementioned confidence intervals will coincide with the standard normal approach. In contrast to the algorithm proposed in Doebler et al. (2013), the approach pursued here is also feasible for very long tests and ensures that the confidence sets are connected. Nevertheless, this approach will only be useful if the data fit the respective model and if the item parameters are known prior to testing. A good overview of the basic concepts and subsequent developments concerning saddlepoint approximations until 1988 is given by Reid (1988). Modern textbooks giving accessible accounts are Pace and Salvan (1997) and Butler (2007). There are several routes leading to the saddlepoint approximation. The mathematically most ambitious approach results from an approximation to the integral of an inverse Fourier transform of a density (Small, 2010). A statistically more elegant development of the saddlepoint approximation, preferred in most statistical textbooks (e.g., Severini, 2000; Pace & Salvan, 1997), relies on the Edgeworth expansion and exponential tilting. This is the approach also outlined here.
Saddlepoint approximations have been successfully used in many different fields, among others in econometrics (e.g., Rogers & Zane, 1999; Aït-Sahalia & Yu, 2006), in signal processing (Kay, Nuttall, & Baggenstoss, 2001), and engineering (e.g., Butler, 2000). Srivastava and Yau (1989) give a saddlepoint approximation to the distribution of Wilks' λ, the test statistic of the likelihood ratio in multivariate analysis of variance. More interestingly, there are several examples where saddlepoint approximations are used in logistic regression analysis. The initial paper introducing saddlepoint approximations to generalized linear models is due to Davison (1988). Levin (1990) obtains correction terms to logistic regression likelihood functions from double-saddlepoint approximations. Bedrick and Hill (1992) compute more accurate significance levels for three conditional tests of logistic regression parameters. For cases where some logistic regression parameters are infinite, Kolassa (1997) shows how to obtain estimates for the remaining parameters by using double saddlepoint approximations to the conditional distribution functions. Wang and Carroll (1999) derive approximations to parameter distributions in retrospective sampling situations, and Chieffo, Stankovic, Bonizzoni, Tsagalou, Iakovou, Montorfano, et al. (2005) use saddlepoint approximations for logistic regression models in a clinical context. Bedrick (1997) applied Edgeworth expansions in order to obtain the distributional form of a conditional person fit index (Molenaar & Hoijtink, 1990) testing the fit of the Rasch model. In the same paper, he compares the approximate distribution of a subtest score conditioned on the total score obtained from a saddlepoint approximation to the exact method proposed by Klauer (1991b). More concerned with implementational details and practical issues is the work of Brazzale (1999), Brazzale, Davison, and Reid (2007), and Brazzale and Davison (2008).

1.3. Median Unbiased Estimation

Once the exact distribution is approximated beyond the first-order normal approach, a point estimator for the person parameter can also be obtained as the zero-length confidence region given by the inversion of two one-sided tests with significance level α = 1/2. Such an estimator overestimates and underestimates the true parameter with equal probability and is, therefore, a median unbiased estimator (MUE). This MUE is optimal among all MUE in the sense that it has minimal expected loss for any monotone loss function that has its minimum at the true value of the parameter (Lehmann & Romano, 2005, p. 76). This optimality statement


is valid independently of the sample size, and such MUE exist in all continuous exponential families. Pfanzagl (1970a) was the first to formally show that there is a randomized analogue to the MUE in discrete distributions. Birnbaum (1964) questions why mean-unbiasedness should be preferred over median-unbiasedness and highlights some important topics in this discussion. For example, the criterion of median-unbiasedness is invariant to reparametrization, i.e., to any monotone transformation of the scale, while the criterion of mean-unbiasedness is not. In all cases where the classical mean-unbiased estimator has a normal distribution, it is equivalent to the MUE, for instance, when the distribution of the MLE actually converges to a normal. Intuitively, this argument already suggests that the asymptotic efficiency of the MUE is the same as that of the MLE. Indeed, the maximal asymptotic concentration of the MUE meets the asymptotic efficiency of the MLE. Pfanzagl (1970b, 1972) showed that the MUE exists in all exponential families. Likewise, Salvan and Hirji (1991) proved that in multiparameter exponential families the conditional MUE (CMUE) has the same asymptotic normal distribution as the MLE. Other references to the MUE are given in Hirji, Tsiatis, and Mehta (1989), Pace and Salvan (1999), and Read (2006). Hirji (2006) extensively treats the use of the MUE in discrete data problems. Special attention is given to the MUE in a forecasting context because negative and positive forecasting errors occur with equal frequency. There is no formal justification establishing the supremacy of the mean criterion over the median criterion. Historically, the preference for the mean-unbiasedness concept is linked to the squared error criterion, which is minimized in the classical linear regression model (Lehmann, 1951). This criterion is mathematically more tractable than is median-unbiasedness, which necessitates an exact distribution function.
So the choice of mean-unbiasedness is due to the somewhat arbitrary adoption of the squared error loss function. Birnbaum (1964) suspects that econometric models tipped the balance in favor of mean-unbiasedness, ensuring that prices are neither under- nor overestimated in the long run. In the context of person parameter estimation, it is questionable whether the mean of the squared estimation errors should be reduced or whether it is preferable to overestimate and underestimate the true parameter with equal probability. The optimality properties of the MUE are valid irrespective of the sample size, while those of the MLE apply only in large samples. In large samples, however, most optimality definitions come to be equivalent anyway (Brown, 1947). In the following section, some important properties of the person parameter estimator in the 2PL model and in the Rasch model will be recapitulated. Next, the theory concerned with saddlepoint approximations to the density and distribution of person parameter estimates for the 2PL is presented. For the Rasch model the approximations are somewhat simpler than those obtained for the 2PL. Higher-order results are then compared to results obtained from exact probability calculation in the Rasch model and then applied to the person parameter in the 2PL model. The median and mean bias as well as the standard error are subsequently illustrated for the MUE in the 2PL model.

1.4. The Person Parameter in the Rasch and 2PL Model

In IRT, the relationship between an examinee's response and his or her level on a latent trait is modeled by the item characteristic curve (ICC). The probability of a correct response to item i with difficulty parameter βi and discrimination parameter αi for a subject with ability θv is

$$P_i(x_{vi} = 1 \mid \theta_v, \beta_i, \alpha_i) = \frac{\exp\{\alpha_i(\theta_v - \beta_i)\}}{1 + \exp\{\alpha_i(\theta_v - \beta_i)\}}, \tag{1}$$

where for the Rasch model all discrimination parameters αi are equal to one. Item responses are supposed to be locally independent (e.g., DeMars, 2010), i.e., all dependencies among the items are exclusively due to the underlying latent trait.
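The item characteristic curve in Equation (1) is straightforward to evaluate; the following Python sketch illustrates it (the function name is ours), with α = 1 recovering the Rasch model.

```python
import math

def icc_2pl(theta, beta, alpha=1.0):
    """P(correct response) under Equation (1); alpha = 1 gives the Rasch model."""
    z = alpha * (theta - beta)
    return math.exp(z) / (1.0 + math.exp(z))

# At theta = beta the success probability is 1/2 for any discrimination,
# and a larger alpha makes the curve steeper around beta.
print(icc_2pl(0.0, 0.0, alpha=1.7))                                 # -> 0.5
print(icc_2pl(0.5, 0.0, alpha=2.0) > icc_2pl(0.5, 0.0, alpha=1.0))  # -> True
```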


Once item parameters are estimated, either by conditional maximum likelihood (CML) for the Rasch model or by the marginal maximum likelihood (MML) approach for the 2PL model (e.g., Baker & Kim, 2004), ability parameters for the examinees are obtained by maximizing their likelihoods of having generated the raw score (rv) or the specific response pattern (xv), respectively. If a Bayesian approach is preferred, the person parameters are given as the modal or maximum a posteriori (MAP) or expected a posteriori (EAP) estimates. In the Rasch model, the person's raw score is a sufficient statistic for θv and its distribution belongs to the one-parameter exponential family (e.g., Davison, 2003; Cox, 2006), given by

$$P(r_v \mid \theta_v, \boldsymbol{\beta}) = \sum_{x_v \mid r_v} P(x_v \mid \theta_v, \boldsymbol{\beta}) = \frac{\exp(r_v \theta_v) \sum_{x_v \mid r_v} \exp\bigl(-\sum_{i=1}^{n} x_{vi}\beta_i\bigr)}{\prod_{i=1}^{n} \bigl[1 + \exp(\theta_v - \beta_i)\bigr]}$$

$$= \exp\Bigl\{\theta_v r_v - \sum_{i=1}^{n} \log\bigl[1 + \exp(\theta_v - \beta_i)\bigr]\Bigr\}\, \gamma_{r_v}. \tag{2}$$
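Equation (2) is cheap to evaluate exactly: the elementary symmetric function γ appearing in it (defined below) obeys a one-item-at-a-time recursion. A Python sketch (function names are ours):

```python
import math

def esf(betas):
    """Elementary symmetric functions gamma_0, ..., gamma_n of exp(-beta_i),
    built up by adding one item at a time."""
    gamma = [1.0]
    for b in betas:
        eps = math.exp(-b)
        nxt = gamma + [0.0]
        for r in range(len(gamma), 0, -1):
            nxt[r] += eps * gamma[r - 1]
        gamma = nxt
    return gamma

def rasch_score_dist(theta, betas):
    """P(r_v | theta_v, beta) of Equation (2) for r_v = 0, ..., n."""
    gamma = esf(betas)
    denom = math.prod(1.0 + math.exp(theta - b) for b in betas)
    return [g * math.exp(r * theta) / denom for r, g in enumerate(gamma)]

probs = rasch_score_dist(0.0, [-1.0, 0.0, 1.0])
print(sum(probs))   # the probabilities sum to 1 up to rounding
```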

The summation in the numerator takes place over all patterns xv with sum score rv, and γrv is the elementary symmetric function. In exponential family theory, rv is sufficient and is also called the canonical sufficient statistic (Davison, 2003, p. 169). Similarly, for the 2PL model the sum of the correct responses, weighted by the respective item discrimination parameters, $\sum_{i=1}^{n} x_{vi}\alpha_i$, is the sufficient statistic for the ability parameter θv, and its distribution, likewise, is in the one-parameter exponential family (see Equation (A.2) in Appendix A). The distribution of the MLE θ̂v is asymptotically normal with expectation θv and minimal asymptotic variance given by the reciprocal of the amount of information I(θv) in the data (e.g., Lehmann, 1999). In the context of IRT, I(θv) is the test information at θv (Baker & Kim, 2004, p. 69 ff.); however, these distributional results are only true under the assumption of large samples. But for the person parameter θv, the sample size is given by the number of items in the test, and asymptotic convergence must be questioned for small to moderate test lengths.

1.5. Finite Sample Results for the Rasch Model

A possible solution to the difficulties arising from finite test length is the computation of exact probabilities. For the Rasch model this is an especially convenient solution since the raw score takes on values on the lattice of non-negative integers (e.g., Johnson, Kemp, & Kotz, 2005) and its distribution is non-degenerate (e.g., Agresti, 2002). There are only n + 1 different sufficient raw scores possible, and the exact response pattern is of no interest for the estimation of the ability parameter via maximum likelihood. The fact that an exact probability calculation is possible for the person raw score in the Rasch model allows the computation of uniformly most accurate confidence bounds for the ability parameter (Klauer, 1991a), which are derived from uniformly most powerful unbiased tests.
This, however, requires a continuous variable (Lehmann & Romano, 2005), and Klauer (1991a) randomizes the score by adding a random term to each raw score. By a similar development, one obtains median unbiased estimates (MUE) as the zero-level confidence intervals (Lehmann & Romano, 2005) for the ability parameter. The MUE for the parameter θ is given by the lower bound $\underline{\theta}$ or, likewise, the upper bound $\overline{\theta}$ for which

$$P(T \le t \mid \theta = \underline{\theta}) = P(T \ge t \mid \theta = \overline{\theta}) = \frac{1}{2}. \tag{3}$$

Here, T can be any statistic, possibly randomized, and t is its observed value. In the case of the Rasch model, T is the sum score. Whereas unbiasedness for the MLE actually signifies "mean-unbiasedness" (Lehmann & Casella, 1998, p. 157), this point estimate is "median-unbiased",


i.e., it is as likely to overestimate as to underestimate the true parameter, which may alternatively be stated as

$$P\bigl(\underline{\theta}(t) \le \theta\bigr) = P\bigl(\overline{\theta}(t) \ge \theta\bigr) = \frac{1}{2}. \tag{4}$$

Even though the use of the randomized MUE and Klauer's (1991a) optimal confidence bounds is enticing, in practice one will not be inclined to accept different estimates and confidence bounds for persons exhibiting identical sum scores. As a consequence, inference could be based on the exact P-values for the different raw scores. These, however, are overly conservative in discrete distributions due to the discrete sample space (e.g., Agresti, 2001; Agresti & Gottard, 2007). Take, for example, a given θ and a nominal significance level of α = 0.05. Hypothetical P-values for the possible scores rv ≤ 2 and rv ≤ 3 are, say, 0.03 and 0.07; hence, the actual significance level is 0.03 instead of 0.05. Furthermore, the two tails P(X ≥ x|θ) and P(X ≤ x|θ) sum to 1 + P(X = x|θ), which exceeds one. A popular way to reduce the conservativeness of ordinary P-values is to smooth the discrete distribution by assigning half of the probability mass at X = x to the left tail and the other half to the right tail. In this way, one obtains the so-called mid-P-values

$$P_{\mathrm{mid}}(X \ge x \mid \theta) = P(X > x \mid \theta) + \frac{1}{2}P(X = x \mid \theta). \tag{5}$$

Now the probability mass in the left tail and the right tail sums to 1, and the conservativeness of the exact P-values might be alleviated. Moreover, mid-P-values are essentially exact as discreteness diminishes with larger sample sizes (e.g., Agresti, 2001; Agresti & Min, 2001). From the perspective of randomized test scores, the mid-P is just the P-value for the score with expected random term U = 1/2. In contrast to the Rasch model, the distribution of the sufficient statistic of the person parameter in the 2PL model is degenerate. If all discrimination parameters are different and measured to the second decimal place, the odds are good that there are 2^n different sufficient statistics, each corresponding to exactly one unique response vector. Violations of this rule of thumb are of little importance and should not occur too frequently in tests of moderate length. Thus, the exact probability distribution demands exhaustive enumeration of all possible response patterns. Moreover, the possible values of the weighted score are not successive integers or multiples of any other step width, i.e., the weighted score is not distributed on an equidistant lattice. This seriously impairs the application of randomization because the distribution of the random term U has to be adapted to each pair of neighboring values, and the same is true for the mid-P-values. So, for the person parameter in the 2PL model the use of exact methods is computationally expensive but not infeasible (Doebler et al., 2013). However, it is possible to approximate the exact distributional form of an estimate by rectifying the asymptotic approximation through higher-order terms. The usefulness of higher-order approximations in discrete data problems was demonstrated, for example, by Pierce and Peters (1992) or Routledge (1994), who both used saddlepoint approximations.
Additionally, recent results presented in Butler (2007) suggest that saddlepoint approximations to discrete distributions without continuity correction mimic the behavior of the mid-P -values.
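As a small illustration of the mid-P construction in Equation (5), with hypothetical two-item numbers (the function name is ours):

```python
def mid_p(probs, x):
    """Upper and lower mid-P-values for an observed score x."""
    upper = sum(probs[x + 1:]) + 0.5 * probs[x]   # Pmid(X >= x)
    lower = sum(probs[:x]) + 0.5 * probs[x]       # Pmid(X <= x)
    return lower, upper

# Score distribution of a hypothetical two-item Rasch test at theta = 0:
probs = [0.25, 0.5, 0.25]
lower, upper = mid_p(probs, 1)
print(lower, upper)   # the two tails now sum to exactly 1
```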

2. Higher-Order Approximations

An important building block of many higher-order approximations is the saddlepoint approximation. The derivation of the saddlepoint approximation outlined here makes use of the Edgeworth expansion and a tilted version of the target density belonging to the exponential family. The trick used consists of choosing the parameter of the tilted distribution such that it is


expanded at its mean, where the Edgeworth expansion achieves its greatest accuracy. This mean is a one-to-one function of the MLE θ̂. Keeping only the first term of the Edgeworth expansion and solving for the initial density gives the saddlepoint approximation for an exponential family (Butler, 2007). For the density of the weighted sum $w_v = \sum_{i=1}^{n} x_{vi}\alpha_i$, the sufficient statistic in the 2PL model, the saddlepoint approximation is given by (for a sketch of the derivation see Appendix A)

$$P(w_v \mid \theta_v, \boldsymbol{\beta}, \boldsymbol{\alpha}) = \frac{\exp\{w_v\theta_v - w_v\hat{\theta}_v + K(\hat{\theta}_v) - K(\theta_v)\}}{\sqrt{2\pi K''(\hat{\theta}_v)}} = \frac{1}{\sqrt{2\pi K''(\hat{\theta}_v)}}\,\frac{L(\theta_v)}{L(\hat{\theta}_v)} + \text{Rest}. \tag{6}$$

In this equation, the MLE θ̂v is associated with the empirical data, and K is the cumulant generating function of the regular exponential family. Even though the above formula was derived in the context of the 2PL model, it also applies to the Rasch model, where the sufficient raw score replaces the weighted sum score wv. The same is true for all subsequent formulae. For the distribution of the weighted sum score in the 2PL model and for that of the sufficient raw score for θ̂v in the Rasch model, the cumulant generating functions are the logs of the terms in the denominators of Equation (2) and Equation (A.2), respectively. In exponential family theory, the expression $K''(\hat{\theta})$, the variance of the sufficient statistic (wv for the 2PL model), is identical to the observed information (e.g., Davison, 2003)

$$-\frac{\partial^2 \log L(\theta)}{\partial\theta\,\partial\theta^{T}} = j(\theta) = \frac{\partial^2 K(\theta)}{\partial\theta\,\partial\theta^{T}} = K''(\theta). \tag{7}$$

The variance of wv should not be confused with the asymptotic variance of the estimator θ̂v, which is given by the reciprocal of the observed information at θ̂v. Since the Edgeworth expansion is derived for a standardized sum, the term $K''(\hat{\theta}_v)^{-1/2}$ is necessary for the transformation to the density of the sufficient statistic wv (e.g., Casella & Berger, 2002). The term $K''(\theta_v)$ evaluated at the MLE θ̂v is the test information, given by

$$j(\hat{\theta}_v) = \frac{\partial^2}{\partial\hat{\theta}_v^2} \sum_{i=1}^{n} \log\bigl[1 + \exp\{\alpha_i(\hat{\theta}_v - \beta_i)\}\bigr] = \sum_{i=1}^{n} \alpha_i^2 P_i(1 - P_i) \tag{8}$$

for the 2PL model. By comparing Equation (6) to Equation (2) (keep in mind that the sufficient statistics have to be exchanged), it follows that the elementary symmetric function for the Rasch model can be approximated as

$$\gamma_{r_v} \approx \frac{\exp\{-r_v\hat{\theta}_v + K(\hat{\theta}_v)\}}{\sqrt{2\pi K''(\hat{\theta}_v)}} = \frac{L(\hat{\theta}_v)^{-1}}{\sqrt{2\pi K''(\hat{\theta}_v)}}. \tag{9}$$
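Equations (6)–(8) translate directly into code. The sketch below (all names are ours) evaluates the cumulant generating function and its derivatives for the 2PL, solves the monotone ML equation K′(θ̂) = wv by bisection, and computes the leading term of the saddlepoint density; it assumes an interior wv.

```python
import math

def K(t, betas, alphas):
    """Cumulant generating function of the weighted score w_v."""
    return sum(math.log1p(math.exp(a * (t - b))) for a, b in zip(alphas, betas))

def K1(t, betas, alphas):
    """K'(t) = sum alpha_i * P_i, the mean of w_v."""
    return sum(a / (1.0 + math.exp(-a * (t - b))) for a, b in zip(alphas, betas))

def K2(t, betas, alphas):
    """K''(t) = sum alpha_i^2 * P_i * (1 - P_i), Equation (8)."""
    return sum(a * a * p * (1.0 - p)
               for a, b in zip(alphas, betas)
               for p in [1.0 / (1.0 + math.exp(-a * (t - b)))])

def mle(w, betas, alphas, lo=-15.0, hi=15.0):
    """Solve the maximum likelihood equation K'(theta) = w by bisection."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if K1(mid, betas, alphas) < w else (lo, mid)
    return 0.5 * (lo + hi)

def saddlepoint_density(w, theta, betas, alphas):
    """Leading term of Equation (6)."""
    th = mle(w, betas, alphas)
    num = math.exp(w * theta - w * th
                   + K(th, betas, alphas) - K(theta, betas, alphas))
    return num / math.sqrt(2.0 * math.pi * K2(th, betas, alphas))
```

For ten Rasch items with βi = 0, the exact probability of the score 5 at θ = 0 is C(10, 5)/2¹⁰ ≈ 0.2461, while the saddlepoint value equals 1/√(5π) ≈ 0.2523, already within three percent.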

The two likelihoods in Equation (6) are given by Equation (A.2) (or Equation (2) in the Rasch case) evaluated at θv and at the MLE θ̂v. Note that the function h(x) in the case of the 2PL model and the elementary symmetric function for the Rasch model cancel in Equation (6), for they are independent of θv. This fraction of likelihoods is related to the signed likelihood ratio (e.g., Cox, 2006, p. 104), which is equivalent to the square root of the likelihood ratio statistic

$$r = \operatorname{sgn}(\hat{\theta} - \theta)\sqrt{-2\bigl[\ell(\theta) - \ell(\hat{\theta})\bigr]}. \tag{10}$$

Hence, the ratio of the two likelihoods can be expressed in terms of the log-likelihoods

$$\exp\Bigl\{-\frac{1}{2}r^2\Bigr\} = \exp\bigl\{\ell(\theta) - \ell(\hat{\theta})\bigr\} \tag{11}$$


and, therefore, the saddlepoint density P(wv | θv, β, α) of Equation (6) can be expressed as

$$P(w_v \mid \theta_v, \boldsymbol{\beta}, \boldsymbol{\alpha}) = \frac{1}{\sqrt{K''(\hat{\theta}_v)}}\,\phi(r) + \text{Rest}. \tag{12}$$

Here φ denotes the standard normal density (Butler, 2007).

2.1. Lugannani–Rice Approximation

For a one-parameter full exponential family, integration of the saddlepoint density leads to the so-called Lugannani–Rice formula, first derived by Lugannani and Rice (1980), which can be stated in terms of the likelihood as

$$F(w_v; \theta_v) \approx \Phi^*(r) = \Phi(r) + \phi(r)\bigl(r^{-1} - u^{-1}\bigr). \tag{13}$$

Here Φ is the standard normal CDF, and u is given as

$$u = t(\hat{\theta}_v) = (\hat{\theta}_v - \theta_v)\sqrt{j(\hat{\theta}_v)}, \tag{14}$$

the Wald statistic. Since in the one-parameter exponential family the transformation from the sufficient statistic to the canonical parameter is one-to-one and onto, given by the maximum likelihood equation $K'(\hat{\theta}_v) = w_v$ (see, e.g., Davison, 2003, and Equation (A.11)), the distribution of θ̂(wv) is given by that of wv (Butler, 2007)

$$P(W \le w_v; \theta_v) = P\bigl(\hat{\theta}(W) \le \hat{\theta}(w_v)\bigr). \tag{15}$$

Note that, in the case of the ability parameters, there exist no finite estimates for the zero and perfect scores.

2.2. Barndorff-Nielsen Approximation

Another route to the higher-order approximation of a parameter distribution has as its starting point the square root of the likelihood ratio test. From first-order theory, we have that the pivot in Equation (10) is asymptotically normal to order O(n^{-1/2}), that is, terms of order n^{-1/2} and smaller are dropped. Barndorff-Nielsen (1986) refined this approximation by adjustments made to the mean and the variance of the signed log-likelihood ratio, yielding the modified signed likelihood root r*, whose distribution more closely follows a normal distribution and which is given by

$$r^* = r + r^{-1}\log(u/r). \tag{16}$$

However, the modified signed likelihood ratio is not a proper saddlepoint approximation (Butler, 2007). For continuous distributions, the formulae given in Equations (13) and (16) both leave a relative error of order O(n^{-3/2}) and are equivalent in all terms of order smaller than or equal to O(n^{-1}) (e.g., Brazzale et al., 2007). The formula in Equation (13), however, can produce values beyond 0 and 1, so that for computational simplicity Equation (16) could be used as well. In this article, however, Equation (13), obtained by integrating the saddlepoint density, is preferred. For discrete data, there exist continuity corrections to Equations (13) and (16), but the most accurate method, and also the simplest method, of approximating mid-P-values consists in omitting the continuity correction altogether (e.g., Butler, 2007; Brazzale & Davison, 2008). Note that formulae (16) and (13) are only correct to second order in discrete data problems, i.e., they leave an error of order O(n^{-1}) (Davison, Fraser, & Reid, 2006). The associate editor points


out that "[t]he higher-order approximation beyond the normal approximation is for a continuous part of the distribution of an estimator. In the case of the Rasch model with only n + 1 distinct estimated values, the added higher-order approximation is of the same order (1/√n) as that of the residual of the normal approximation. Consequently, the validity of the total asymptotic expansion is lost in the case of the Rasch model (see, e.g., Esseen, 1945). However, the expansion can be useful since the continuous part can be well approximated (e.g., Hall, 1982)." The effects of the Lugannani–Rice formula (Equation (13)) and of the modified likelihood root (Equation (16)) are comparable. While the Lugannani–Rice formula adjusts the distribution of the signed likelihood root by adding a correction term, the r*-formula consists of the signed likelihood root and a correction term that improves on its normal distribution. Hence, P-values provided either by Φ*(r), the modification to the distribution of r (Lugannani–Rice), or by Φ(r*) are more exact than those obtained from Φ(r). By inverting the previously mentioned functions, confidence bounds that more closely hold the nominal significance level can be obtained. In the exact same manner, an approximate MUE is obtained by inverting the distribution at Φ*(r) = 1/2. Using only the leading term r in Equations (13) or (16) trivially returns the MLE; hence, the additional terms in these equations correct the MLE to be median unbiased (Pace & Salvan, 1999). The above derivations are equally valid for the 2PL as well as for the Rasch model. However, for the Rasch model, many formulae simplify considerably, allowing for inexpensive and exact computations, and hence make the Rasch model the choice of reference for the demonstration of the derived saddlepoint methods.

2.3. Computational Issues

By combining Equations (10) and (14), an approximation to the distribution function of θ̂ is given either by Φ*(r) or by Φ(r*). The quantities r and u are calculated for a grid of θ values, and intermediate points are interpolated by a cubic spline, as implemented in the function smooth.spline of the stats R-package (R Development Core Team, 2009). The computation of the MUE with formula (16) is problematic because numerical instabilities arise for values very near to θ̂, where r* has a singularity at the point θ = θ̂. This is of paramount importance for the computation of the MUE, which typically is not too different from the MLE. Numerically, this can be overcome by a polynomial interpolation at points near θ̂, as detailed in Brazzale (2000, B.3), and likewise implemented in the hoa R-package bundle (Brazzale, 2005). The exact method is given in Appendix B. Note, however, that this problem can most easily be avoided by simply using the Lugannani–Rice approximation. R-code examples for the relevant computational steps can be found in Appendix C (see supplementary material).
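The steps above can be sketched in a few lines of Python (all names are ours; a grid-plus-spline scheme as described in the text would be faster when many examinees are scored). The first function returns both Φ*(r) from Equation (13) and Φ(r*) from Equation (16); the second finds the MUE by a plain root search on Φ*(r) = 1/2 over θ, which sidesteps the polynomial interpolation needed for r* near the singularity.

```python
import math

def hoa_cdf(w, theta, betas, alphas):
    """Approximate P(W <= w; theta): Equation (13) first, Phi(r*) with r*
    from Equation (16) second. Very near the singularity theta = theta_hat
    the limiting value 1/2 is returned (exact up to a small skewness term,
    which is adequate for root-finding)."""
    K  = lambda t: sum(math.log1p(math.exp(a * (t - b))) for a, b in zip(alphas, betas))
    K1 = lambda t: sum(a / (1.0 + math.exp(-a * (t - b))) for a, b in zip(alphas, betas))
    K2 = lambda t: sum(a * a * p * (1.0 - p) for a, b in zip(alphas, betas)
                       for p in [1.0 / (1.0 + math.exp(-a * (t - b)))])
    lo, hi = -15.0, 15.0                  # MLE from K'(theta_hat) = w, by bisection
    for _ in range(200):
        m = 0.5 * (lo + hi)
        lo, hi = (m, hi) if K1(m) < w else (lo, m)
    th = 0.5 * (lo + hi)
    ell = lambda t: w * t - K(t)          # log-likelihood up to a constant
    r = math.copysign(math.sqrt(max(2.0 * (ell(th) - ell(theta)), 0.0)), th - theta)
    if abs(r) < 1e-6:
        return 0.5, 0.5
    u = (th - theta) * math.sqrt(K2(th))  # Wald statistic, Equation (14)
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    phi = math.exp(-0.5 * r * r) / math.sqrt(2.0 * math.pi)
    lr = Phi(r) + phi * (1.0 / r - 1.0 / u)   # Equation (13)
    rs = r + math.log(u / r) / r              # Equation (16)
    return lr, Phi(rs)

def mue(w, betas, alphas, lo=-8.0, hi=8.0):
    """Median unbiased estimate as the root of F(w; theta) = 1/2, with F
    the Lugannani-Rice approximation (decreasing in theta)."""
    for _ in range(200):
        m = 0.5 * (lo + hi)
        lo, hi = (m, hi) if hoa_cdf(w, m, betas, alphas)[0] > 0.5 else (lo, m)
    return 0.5 * (lo + hi)

# Ten Rasch items with beta_i = 0: for the score 3 at theta = 0 the exact
# mid-P-value P(W < 3) + P(W = 3)/2 is 116/1024 ~ 0.113; both higher-order
# values land close to it, and for the score 5 the MUE is 0 by symmetry.
print(hoa_cdf(3.0, 0.0, [0.0] * 10, [1.0] * 10))
print(mue(5.0, [0.0] * 10, [1.0] * 10))
```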

3. Numerical Demonstrations

3.1. Examples for the Rasch Model

The good correspondence between the mid-P values and the higher-order approximation can best be demonstrated for the Rasch model, where the calculation of mid-P values is straightforward. First, we demonstrate the approximation to the elementary symmetric function given in Equation (9). For this purpose, a test conforming to the Rasch model and consisting of n = 15 items was simulated, with approximately normal difficulty parameters obtained from βi = Φ−1((i − 0.5)/n), a formula due to Warm (1989). Maximum likelihood estimates exist for person raw scores in the range from 1 to n − 1 = 14. For these raw scores, the exact values of the elementary symmetric function, as well as the approximations (Equation (9)), are listed in Table 1.

TABLE 1.
Exact and approximated values of the elementary symmetric function for the raw scores 1 to 14 of a 15-item Rasch test.

Raw score      Exact      Approximated
1 or 14         23.3          25.2
2 or 13        234.9         243.7
3 or 12       1369.8        1402.8
4 or 11       5188.8        5281.3
5 or 10      13565.6       13763.1
6 or 9       25339.3       25664.4
7 or 8       34479.3       34895.5
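The "Exact" column of Table 1 can be reproduced with the standard recursion for elementary symmetric functions of the Rasch parameters εi = exp(−βi); the approximation of Equation (9) is not repeated here. A Python sketch (function name ours):

```python
import numpy as np
from scipy.stats import norm

def elementary_symmetric(eps):
    """Elementary symmetric functions gamma_0, ..., gamma_n of eps,
    built up one item at a time via the usual recursion."""
    gamma = np.zeros(len(eps) + 1)
    gamma[0] = 1.0
    for e in eps:
        # the right-hand side is evaluated before assignment,
        # so the previous item's values are used throughout
        gamma[1:] = gamma[1:] + e * gamma[:-1]
    return gamma

n = 15
beta = norm.ppf((np.arange(1, n + 1) - 0.5) / n)   # Warm (1989) difficulties
gamma = elementary_symmetric(np.exp(-beta))
# gamma[1] is approximately 23.3, matching the first row of Table 1;
# by the symmetry of these difficulties, gamma[r] equals gamma[15 - r]
```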

FIGURE 1.
Saddlepoint approximations (solid lines) to the distribution of θ̂ and asymptotic normal distributions (dashed lines) for a range of sum scores.

As can be seen, the absolute error of the approximation is high, but the relative error is minor. The good performance of the saddlepoint approximation is due to the fact that the approximations of the elementary symmetric function are multiplied by a likelihood that typically has a very small value. Accordingly, the saddlepoint approximation to the probability of a certain raw score is correct to at least the second decimal place. An approximation to the density of the maximum likelihood estimator for a fixed raw score can be obtained by computing the saddlepoint approximation for a grid of values of θ. The asymptotic first-order density is a normal density, and direct comparison to the higher-order approximation is possible by rescaling the saddlepoint probabilities to a density encompassing the same area as the normal density. These distributions, along with their asymptotic counterparts, are shown in Figure 1 for a range of raw scores. Note that the exact probabilities are very close to those of the saddlepoint approximations; hence, the rescaled version cannot be discerned from the saddlepoint density. As can be seen, the distribution of the MLE is clearly not normal for sum scores containing little information about θ, that is, the very low and very high sum scores. For moderate sum scores, the asymptotic distribution and the saddlepoint approximation agree almost perfectly.


FIGURE 2.
Cumulative distribution functions (cdf) of the raw score at a fixed ability level. The step functions correspond to the exact discrete distribution, the continuous lines to the Lugannani–Rice approximation. Dashed lines indicate results due to the normal approximation. At the chosen resolution the differences are small, so the curves can hardly be discerned.

The good correspondence between the mid-P values and the Lugannani–Rice approximation of P(Rv ≤ rv |θv) (Equation (13)) can be inferred from Figure 2. For a fixed ability value, the exact probabilities for the possible raw scores are cumulated and compared to the Lugannani–Rice approximation. Mid-P values, P(Rv < rv ; θv) + (1/2)P(Rv = rv ; θv), correspond to one-half the step size at each raw score. The Lugannani–Rice approximation hence smooths this discrete distribution. Cumulative distribution functions resulting from the normal approximation are given by the dashed lines. Computing the Lugannani–Rice approximation Φ∗(r) for a grid of θ-values while holding the raw score fixed corresponds to integrating one of the distributions in Figure 1. From this one obtains confidence limits and a MUE for a given raw score. Since an exact probability calculation is possible for the ability parameter in the Rasch model, confidence bounds derived by inverting two-sided tests (Klauer, 1991a) are to be preferred to those obtained from inverting two one-sided tests (Agresti & Min, 2001). These one-sided limits, however, regain interest when estimated parameters are to be tested against other values. For the purpose of demonstration, the expected coverage rates of the two one-sided intervals obtained from inversion of the test based on Φ∗(r) are compared to those computed with the asymptotic variance. The latter first-order confidence intervals (CIs) are in the sequel referred to as Wald intervals. Comparison is done on a bound-by-bound level, since overall coverage rates could disguise errors in one direction that are balanced by errors in the other direction. Moreover, the region where such comparisons are meaningful is restricted by the maximal lower bound and the minimal upper bound for the ability estimates θ̂; these correspond to the raw scores of n − 1 and 1, respectively. Confidence bounds for more extreme ability parameters cannot exceed these limits.
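The exact raw-score distribution and the mid-P values underlying Figure 2 are cheap to compute in the Rasch case. A Python sketch under the same setup as above (function names ours):

```python
import numpy as np
from scipy.stats import norm

def raw_score_pmf(theta, beta):
    """Exact Rasch raw-score distribution at ability theta:
    P(R = r | theta) = gamma_r * exp(r * theta) / prod_i (1 + exp(theta - beta_i))."""
    eps = np.exp(-beta)
    gamma = np.zeros(len(beta) + 1)
    gamma[0] = 1.0
    for e in eps:                       # elementary symmetric functions
        gamma[1:] = gamma[1:] + e * gamma[:-1]
    r = np.arange(len(beta) + 1)
    log_p = np.log(gamma) + r * theta - np.sum(np.log1p(np.exp(theta - beta)))
    return np.exp(log_p)

def mid_p(theta, r, beta):
    """Mid-P value P(R < r; theta) + 0.5 * P(R = r; theta)."""
    pmf = raw_score_pmf(theta, beta)
    return pmf[:r].sum() + 0.5 * pmf[r]
```

Plotting mid_p over the raw scores for a fixed θ and overlaying the Lugannani–Rice curve reproduces the kind of comparison shown in Figure 2.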
For the one-sided bounds, the respective coverage rate consequently rises to one beyond those limits. This is shown in Figure 3 for a 15-item Rasch test and a two-sided confidence coefficient of 1 − α = 0.95. Item parameters are again chosen according to the Warm (1989) formula. Response patterns with zero or perfect scores were removed. The left-hand panels show the lower and upper coverage rates in the regions where θ still exceeds the lower and upper bounds corresponding to the sum score


FIGURE 3.
Expected coverage rates for the person parameter estimate in the Rasch model (15 items) and a two-sided confidence coefficient of 1 − α = 0.95. The respective coverage rates for the Wald CI (dashed line) and the CI obtained from inverting a test based on Φ∗(r) (solid line) are shown in the upper left panel for the lower tail and in the lower left panel for the upper tail. The lower and upper bounds for the maximal and minimal parameter estimates, belonging to the raw scores of n − 1 and 1, are marked by vertical lines for the two different CIs. The same coverage rates, but smoothed by a symmetrical moving average without lag, are shown in the right-hand plots. The width of the averaging window is exactly one ability unit. Effects due to the offset of the window were removed from the illustration.

of rv = n − 1 and rv = 1, respectively. The zigzag in the curve of the expected coverage rate is due to the discrete sample space. Any ability parameter θ may only generate n − 1 different raw scores if the zero and the perfect score are omitted; thus, it can fall outside the confidence bounds of only the n − 1 estimates corresponding to these n − 1 possible raw scores. Now, observe the lower coverage rate. As the value of θ is shifted to the right, it passes the lower bound of the next lower estimate, and the coverage rate rises suddenly. As one keeps shifting θ to the right, the probability rises for those raw scores belonging to the estimates whose CIs do not yet encompass the value of θ; hence, the expected coverage rate declines steadily until θ reaches the next lower bound. When the last bound is exceeded, the coverage rate rises to one. As can be seen, the progression of the expected coverage rate of the lower Wald confidence bounds (dashed line) is largely congruent to that obtained by inversion of Φ∗(r) (solid line). Where this is not true, however, its values are first markedly smaller and then, from an ability level of about −1 on, larger than the expected coverage rate due to the higher-order approximation. This is more clearly observable in the upper right panel, where the course of the expected coverage rates is smoothed with the help of a symmetrical moving average with lag zero and a total width of one ability unit. In order to avoid artificial effects due to the averaging filter, values are truncated by half a unit to the left and by half a unit before the respective coverage rate rises to one. On average, the expected coverage rate from Φ∗−1(r) holds the nominal 97.5 % level for most of the examined region, whereas the coverage obtained from the first-order asymptotic normal distribution first falls short of the nominal level and then exceeds it. Results for the upper bounds are similar but reversed left to right.
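For any rule mapping raw scores to lower confidence bounds, the expected coverage curve just described is a finite sum over the raw-score distribution. A Python sketch using Wald lower bounds (our construction; it conditions on the scores 1 to n − 1, since zero and perfect patterns are removed in the text):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

beta = norm.ppf((np.arange(1, 16) - 0.5) / 15)      # 15 Warm (1989) difficulties

def raw_score_pmf(theta, beta):
    """Exact Rasch raw-score distribution at ability theta."""
    eps = np.exp(-beta)
    gamma = np.zeros(len(beta) + 1)
    gamma[0] = 1.0
    for e in eps:
        gamma[1:] = gamma[1:] + e * gamma[:-1]
    r = np.arange(len(beta) + 1)
    return np.exp(np.log(gamma) + r * theta
                  - np.sum(np.log1p(np.exp(theta - beta))))

def wald_lower_bound(s, beta, z=1.96):
    """MLE for raw score s and its first-order (Wald) lower bound."""
    mle = brentq(lambda t: s - np.sum(1 / (1 + np.exp(-(t - beta)))), -10, 10)
    p = 1 / (1 + np.exp(-(mle - beta)))
    return mle - z / np.sqrt(np.sum(p * (1 - p)))

def expected_lower_coverage(theta, beta):
    """P(lower bound <= theta), conditional on raw scores 1..n-1."""
    n = len(beta)
    pmf = raw_score_pmf(theta, beta)[1:n]
    pmf = pmf / pmf.sum()                           # zero/perfect scores removed
    hit = np.array([wald_lower_bound(s, beta) <= theta for s in range(1, n)])
    return float(np.sum(pmf * hit))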
This symmetry is due to the symmetric distribution of the item difficulties used.

3.2. Numerical Demonstration for the 2PL Model

3.2.1. Coverage

The analytic results obtained for the Rasch model are cumbersome to generalize to the 2PL case; simulation studies take the place of exact computation. The first simulation was run for a total of 15 items with difficulty levels given by the familiar formula βi = Φ−1((i − 0.5)/n). Each difficulty parameter was paired with a discrimination parameter taken at random from a log-normal distribution with μ = 0 and


FIGURE 4.
Lower and upper coverage rates for the ability parameter estimate in the 2PL model with 15 items and a two-sided confidence coefficient of 1 − α = 0.95 are shown in the left-hand panels. The coverage rate due to the Wald CI is given by the dashed line; that obtained from inversion of Φ∗(r) is represented by the solid line. On the right-hand side, coverage rates for a 30-item test are shown.

σ = 0.15. A total of N = 10,000 response patterns was generated at each of the ability levels ranging from −3 to 3 with a step width of 0.1. Zero and perfect patterns were excluded because no finite estimate exists for them. For this simulation, the fractions of response patterns dropped due to zero scores and perfect scores are given for some selected ability levels as (θ, fraction zero, fraction perfect): (−3, 0.38, 0), (−2.5, 0.21, 0), (−2, 0.084, 0), (−1.5, 0.019, 0), (−1, 0.003, 0), (0, 0, 0), (1, 0, 0.0019), (1.5, 0, 0.019), (2, 0, 0.084), (2.5, 0, 0.20), and (3, 0, 0.38). Outcomes are depicted on the left-hand side of Figure 4. Results are very similar to those obtained for the Rasch model. Again, irregularities in the course of the simulated coverage rates are due to the discreteness of the data, even though it is less severe than in the Rasch model. The latter fact is a consequence of the distribution of the sufficient statistic allowing for 2^15 different ability estimates, not all of which, however, need to be realized in the data. As before, the coverage rate of the lower Wald interval falls below the nominal coverage rate of 97.5 % in the region of θ-values from −3 to about −1.5. The nominal coverage criterion is met only at values of θ around −1. For higher ability values, the coverage rate is higher than the exact level until it rises to one when the lower bound of the highest estimate is passed. Contrary to the Wald interval, the lower interval due to Φ∗−1(r) holds the nominal coverage rate for values of θ from about −2 until 1 and is closer to this level than the Wald interval nearly everywhere outside this region. Again, due to the symmetric arrangement of the item difficulties, what was said for the lower coverage rate also applies to the upper coverage rate with directions reversed left to right. The second simulation study comprised a total of 30 items with difficulties generated by the same formula as before.
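The pattern-generation step of these simulations is straightforward; the following minimal Python sketch mirrors the described design (seed, function name and exact draws are ours, so the dropped-pattern fractions will only approximately match those reported):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def simulate_2pl(theta, alpha, beta, n_persons, rng):
    """Draw 0/1 response patterns from the 2PL at a fixed ability theta."""
    p = 1.0 / (1.0 + np.exp(-alpha * (theta - beta)))  # item success probabilities
    return (rng.random((n_persons, len(beta))) < p).astype(int)

n = 15
beta = norm.ppf((np.arange(1, n + 1) - 0.5) / n)       # difficulties as above
alpha = rng.lognormal(mean=0.0, sigma=0.15, size=n)    # discriminations
X = simulate_2pl(0.5, alpha, beta, 10_000, rng)
scores = X.sum(axis=1)
X = X[(scores > 0) & (scores < n)]                     # drop zero/perfect patterns
```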
Each of these difficulty levels was paired with a discrimination parameter obtained in the same way as explained for the previous simulation. Simulated ability levels ranged from −4 to 4 with step widths of 0.1, and again a total of N = 10,000 examinees was simulated per ability level. For this simulation, the fractions of response patterns dropped due to zero scores or perfect scores are given for some selected ability levels as (θ, fraction zero, fraction perfect): (−4, 0.36, 0), (−3.5, 0.21, 0), (−3, 0.089, 0), (−2, 0.004, 0), (2, 0, 0.0031), (3, 0, 0.075), (3.5, 0, 0.19), and (4, 0, 0.32). Results from this study are shown on the right-hand side of Figure 4. Findings are nearly identical to those of the previous study, with the exception that the region of the ability scale where both intervals hold the nominal level is extended. Now the Wald interval performs nearly exactly in the ability range from −1 to 0 and from 0 to 1 for the lower and upper tail, respectively. Outside this zone, however, the nominal level cannot be held. The higher-order-


FIGURE 5.
Two-sided coverage rates for the Wald interval (dashed line) and for the interval obtained from Φ∗−1(r) (solid line) as two one-sided intervals. The dotted–dashed lines show the interval covering the range of highest approximated density. The upper panel shows these rates for the smaller simulated tests (n = 15), the lower panel for the larger simulated tests (n = 30).

interval assures the correct coverage rate in the range from about −3 to 2 for the lower bound and from −2 to 3 for the upper bound. Beyond these limits, the boundaries due to inversion of Φ∗(r) perform better than the Wald boundaries and are even exact to a high degree. Most importantly, these findings suggest that the distribution of the MLE is not normal, not even for 30 items, in those regions of the ability scale where the nominal coverage rate is not held. Details of the overall coverage are, of course, only meaningful in the range of the parameter θ where neither the Wald interval nor the higher-order interval exceeds the maximal lower bound or the minimal upper bound. Overall coverage for the two data simulations with n = 15 (upper plot) and n = 30 (lower plot) is shown in Figure 5. The coverage rate obtained from Φ∗−1(r), using two one-sided intervals, is given by the solid line, the coverage rate of the Wald interval by the dashed line. Additionally, the highest density (HD) interval is indicated by the dotted–dashed line. This interval is given by a horizontal line cutting out an area of 1 − α under the saddlepoint density; the boundary points are given by the values of θ where the cutting line intersects the density curve. As can be seen, the Wald interval is markedly more conservative over nearly the whole presented range, while the HD interval tends to be more liberal. Conservativeness for all three kinds of intervals is most pronounced beyond the values where either the lower or the upper bounds exceed the respective maximal lower or minimal upper bounds. Note that all three intervals should coincide when the distribution of the estimates is genuinely normal. This seems not to be the case, not even for the mid-range parts of θ. Note that the approximation is still meaningful for tests comprising fewer than 15 items, but with fewer items the zigzag pattern in the coverage grows ever stronger due to the increased discreteness of the data.
For this reason the presented results are restricted to n = 15 and n = 30.

3.2.2. Comparison to Ogasawara's Method

By setting the guessing parameter to zero, the second-order accurate method presented by Ogasawara (2012) can be directly compared to the Lugannani–Rice approximation. Comparisons were done with and without model misspecification by using the FORTRAN code of Ogasawara (2012). Discrimination parameters αi and item difficulties βi were sampled as explained in Ogasawara (2012). Note that the normal ogive model is used, such that each αi has to be multiplied by the scaling constant D = 1.702 for direct comparison with the parameterization used in this article. Since the focus of this paper is on small sample sizes, the number of items was set to n = 20. Levels of θ ranged from −3 to 3 by steps of 1.

TABLE 2.
Simulated coverage rates obtained by Hall's formula and those obtained by the Lugannani–Rice approach, with and without model misspecification. Coverage rates are depicted for lower (l. b.) and upper bounds (u. b.) at seven different levels of ability.

                Correct model                       Misspecified model
          Hall            Lugannani–Rice       Hall            Lugannani–Rice
θ      l. b.   u. b.     l. b.   u. b.      l. b.   u. b.     l. b.   u. b.
−3     0.0000  0.0467    0.0000  0.0504     0.0000  0.0434    0.0000  0.0464
−2     0.0000  0.0289    0.0000  0.0278     0.0000  0.0315    0.0000  0.0306
−1     0.0166  0.0226    0.0190  0.0228     0.0181  0.0232    0.0203  0.0238
 0     0.0226  0.0291    0.0240  0.0278     0.0219  0.0282    0.0243  0.0266
 1     0.0257  0.0266    0.0243  0.0264     0.0284  0.0258    0.0264  0.0257
 2     0.0272  0.0000    0.0260  0.0000     0.0275  0.0000    0.0268  0.0000
 3     0.0339  0.0000    0.0355  0.0000     0.0324  0.0000    0.0342  0.0000

A total of 10,000 tests were simulated, and data were generated accordingly. Zero and perfect scores were discarded, and a new data sample was drawn instead. Item probabilities for the misspecified model were generated using formula (6.1) of Ogasawara (2012). The variance of the perturbation parameter ei was set to 1.0, hence assuming gross model misspecification. Hall's formula of variable transformation (Hall, 1992) (formula (6.2) in Ogasawara, 2012) was used for the computation of the confidence intervals. Results are shown in Table 2. As both methods are second-order, the good accordance of the coverage rates comes as no surprise. Interestingly, results are still good for the data generated by the misspecified model.

3.3. Median Unbiased Estimator for the Person Parameter

Given the demonstrated ability of the Lugannani–Rice formula to approximate the exact distributional form of the person parameter estimates, it should also be possible to obtain an approximate MUE by inverting the distribution Φ∗(r) at the level α = 1/2. For discrete data, however, randomization is usually needed in order to obtain an exactly median unbiased estimator. Here the much finer-grained distribution of the discrete sufficient statistic for the person parameter in the 2PL alleviates the extreme discreteness met in the Rasch model. Consider the case of the Rasch model: for a test comprising n = 10 items there are exactly n + 1 sufficient raw scores and hence n + 1 possible ability estimates. Thus, for a generating θ the best estimate is the one out of the n + 1 that is least distant from θ. In the 2PL model with all discrimination parameters distinct, there are 2^10 possible values of the sufficient statistic and hence the same number of estimates. Now, on average, the best estimate is less distant from the generating θ.
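Inverting Φ∗(r) at 1/2 is a one-dimensional root search. A Python sketch for the Rasch case (function name ours; the coarse tolerance deliberately sidesteps the numerical instability near θ̂ noted in Section 2.3, where the careful fix is Brazzale's polynomial interpolation):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def approx_mue(s, beta, lo=-6.0, hi=6.0):
    """Approximate MUE for raw score s in a Rasch test with difficulties beta:
    the theta solving Phi*(r) = 1/2, with Phi*(r) the Lugannani-Rice
    approximation (illustrative sketch, Rasch case only)."""
    def loglik(t):
        return t * s - np.sum(np.log1p(np.exp(t - beta)))
    theta_hat = brentq(
        lambda t: s - np.sum(1 / (1 + np.exp(-(t - beta)))), lo, hi)
    p = 1 / (1 + np.exp(-(theta_hat - beta)))
    j_hat = np.sum(p * (1 - p))              # observed information at the MLE

    def lr_cdf(theta):
        if abs(theta - theta_hat) < 1e-6:    # removable singularity at theta_hat
            return 0.5
        r = np.sign(theta_hat - theta) * np.sqrt(
            2 * (loglik(theta_hat) - loglik(theta)))
        u = (theta_hat - theta) * np.sqrt(j_hat)
        return norm.cdf(r) + norm.pdf(r) * (1 / r - 1 / u)

    # Phi*(r) decreases in theta; a coarse xtol avoids instabilities
    # from cancellation very near theta_hat
    return brentq(lambda t: lr_cdf(t) - 0.5, lo, hi, xtol=1e-4)
```

With symmetric item difficulties, the MUE for score s is the negative of the MUE for score n − s, which makes a convenient sanity check.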
Effects due to the discreteness of the sufficient statistic still exist but become minor as the number of items increases and the distribution of the sufficient statistic approaches continuity. Thus, the degree of median-unbiasedness that the approximate MUE can reach depends on the level of continuity of the sufficient statistic, as well as on the exactness of the approximated distribution. In order to assess the bias of the approximate MUE, its mean and median error will be compared to those of the Bayes Modal Estimator (BME) (Baker & Kim, 2004), of the Weighted Likelihood Estimator (WLE) (Warm, 1989), known to be least mean-biased, and of the MLE. Sometimes the BME is also called the MAP estimator. In this study, the prior distribution was assumed to have μ = 0 and σ = 1.

3.3.1. Median Bias

Two tests were simulated with 15 and 30 items, respectively, and item parameters were chosen as described above. At each of the ability levels,


FIGURE 6.
Median error (θ̂ − θ) for a test consisting of 15 items is shown to the left, and for a test consisting of 30 items to the right. Median error was assessed at each of the markings on the abscissa for the WLE (light solid line), the MLE (heavy solid line), the BME (dotted–dashed line) and the approximate MUE (dashed line).

ranging from −1.5 to 1.5 (short test) or from −2 to 2 (long test) by step widths of 0.1, a total of 10,000 response vectors was simulated. Perfect and all-zero responses were removed, so that the number of subjects at the ends of the ability scale was less than 10,000. The fractions of patterns dropped were very similar to those reported when assessing the coverage. Median bias was measured by the median of the empirical error distribution (θ̂ − θ). Results for both test lengths are shown in Figure 6. In both cases, the BME (dotted–dashed line) reveals a very high frequency of overestimating the true parameter θ for negative abilities and of underestimating θ for positive abilities. The same applies to the WLE, but to a smaller degree. The behavior of the MLE (heavy solid line) is contrary to that of the WLE and BME, and less pronounced. The MUE (dashed line), by contrast, does not show any tendency and seems to vary randomly around a median error of zero. Any zigzag in the plot of the median error of the MUE corresponds to similar zigzags in the median errors of the other estimators and is hence due to simulation irregularities and the discreteness of the sample space.

3.3.2. Mean Squared Error

In a next step, the simulations done so far are complemented by an assessment of the mean squared error of the four different estimators. Hirji et al. (1989) reported the MUE to be even less mean-biased than the MLE in a logistic regression application with two binary covariates. The estimation of the ability parameter in the 2PL model can be viewed as a logistic regression problem involving a continuous covariate. In order to compare the mean squared error of the approximate MUE to that of the WLE, the BME and the MLE, two tests were again simulated with test lengths n = 15, 30. Item difficulties were chosen to be approximately normal by using the aforementioned formula.
The discrimination parameters were again taken at random from a log-normal distribution with μ = 0 and σ = 0.15. For the shorter test, the estimated mean squared error was assessed on the ability grid from −3.0 to 3.0 with step size 0.01. At each of these values a total of 10,000 response patterns was simulated. The same applies to the longer test, except that the grid on the ability scale ranged from −3.5 to 3.5. Zero and perfect patterns were removed because no finite estimate exists for them. Again, the fraction of patterns removed was comparable to that reported before. Results of the simulations can be seen in Figure 7. Squaring the error avoids the cancellation of positive and negative errors at the moderate ability levels. Along with this measure of bias, the test information is given, allowing one to appraise


FIGURE 7.
Average squared error for the MLE (heavy solid line), the WLE (light solid line), the BME (dotted–dashed line) and the MUE (dashed line) for two tests consisting of 15 (left-hand side) and 30 items (right-hand side). The test information (dotted line) is nearly normally shaped due to the symmetrical arrangement of the item difficulties; its scaling is given by the additional ordinate axis to the right of each plot.

the importance of the error. As can clearly be seen, the MLE exhibits the highest estimated squared error over nearly the entire observed range, whereas the estimated squared error of the BME is smallest, followed by that of the WLE. The estimated squared error of the approximate MUE takes an intermediate level close to that of the MLE. This is essentially true for both test lengths, although the differences between the estimators are minor at intermediate ability levels of the longer test. Note that the declining error at the extreme ability levels is an artifact due to the removal of the zero and perfect patterns: the removed patterns have the highest probability at these levels, and the few remaining patterns make up all the error. Since ML estimates are higher than MUEs, WLEs and BMEs, the error for the MLE declines most for values beyond the realm of measurement of the test.

3.3.3. Standard Deviations

As one reviewer noted, the squared error contains the squared bias as well as the variance of an estimator. Ogasawara (2013) established that the higher-order variance of an estimator is typically larger than the squared bias. Hence, in order to assess the variances and further appraise the efficiencies of the different estimators, standard deviations of the various estimators are given in Figure 8. The patterns are very similar to those of the average squared error.

4. Discussion

Large sample theory states that the maximum likelihood estimator is asymptotically normally distributed. In practical testing situations, however, there is a need to gather the most information available at the least possible cost. These two constraints automatically lead to a preference for short tests. For short tests, however, the asymptotic properties of the estimator of the person parameter will not hold and, as a consequence, confidence intervals based on the asymptotic variance miss the nominal level of significance. Actually, the coverage rate exceeds the intended level, and tests, as well as confidence intervals, are overly conservative. This might not be seen as a problem, but it leads to the paradoxical situation that one cannot be confident about the true level of coverage of the confidence intervals, even more so as the coverage is not constant over the range of the person parameter.


FIGURE 8.
Standard deviations of the estimates for the MLE (heavy solid line), the WLE (light solid line), the BME (dotted–dashed line) and the MUE (dashed line) for two tests consisting of 15 (left-hand side) and 30 items (right-hand side).

Saddlepoint approximations are a convenient solution to this unsatisfactory situation. Their usage is by now widespread in many statistical areas, and their theory is well understood. It was shown that confidence intervals obtained in this way hold their nominal level more consistently than those of the standard approach. For tests encompassing 30 items, the coverage of the Wald confidence intervals operates near the exact level in the moderate range but goes astray at the more extreme levels of the person parameter. In comparison, the confidence intervals obtained from saddlepoint approximations hold the exact level of significance for a much broader range and are almost exact everywhere in this range. Moreover, the computation of the latter intervals is straightforward and easy to implement in any statistical package. Depending on the situation or preferences, it is possible to compute two one-sided confidence intervals or an interval covering the parameters with highest density, following the idea of HPD intervals. Asymptotic cumulants of the studentized person parameter (Ogasawara, 2012) in combination with Hall's (1992) formula for variable transformation provide confidence intervals that are as accurate as those obtained by the Lugannani–Rice formula. While the former method works independently of the kind of model, the latter is only valid for models from the exponential family. Both methods proved to be surprisingly robust to model misspecification. Undoubtedly, the work of Lord (1983) specifying the bias of the maximum likelihood estimate (MLE) of the person parameter and the subsequent development of the weighted likelihood estimate (WLE) (Warm, 1989) mark cornerstones in person parameter estimation. Tacitly, the paradigm of mean-unbiasedness is adopted, even though there is at least one serious alternative: median-unbiasedness. From a theoretical point of view, it is not clear why mean-unbiasedness should be given preference.
Ranking all possible estimators by the implicitly optimized squared error criterion seems somewhat arbitrary, since alternative optimality definitions are available; most of these lead to identical estimates for symmetric distributions. The mathematical accessibility of the squared error criterion makes it the first choice in linear regression with normal error distributions. In large samples, the MLE, the WLE and the MUE will coincide, with optimality properties given by large sample theory, such that the question of mean versus median is restricted to short tests. It was shown that in this case the distribution of the sufficient statistic is not symmetrical, at least not in the more extreme range of the person parameter, and, as commonly known, the median is less sensitive to skewness and kurtosis. In linear regression, the least


squares estimator is also contraindicated if error distributions are skewed. Hence, even though the WLE minimizes the mean-bias, it is not clear why it should be given preference over the MUE. As demonstrated, the WLE and the BME are more likely to overestimate the true parameter for low-scoring subjects and to underestimate it for high-scoring subjects; the behavior of the MLE is just the opposite. The MUE, in contrast, is as likely to overestimate as to underestimate the true parameter over the entire range of proficiency and, in addition, it has clear small-sample optimality properties. When compared to the MLE, it was also shown that the MUE is even less mean-biased, so that there can be no question whether to prefer the MUE over the MLE in small samples. Additionally, the standard error of the MUE is smaller than that of the MLE, suggesting that the MUE is also more efficient in small samples. Note, however, that both measures, mean squared error and standard deviation, are closely linked to the least squares criterion. Concerning the WLE and BME, it remains debatable whether mean-unbiasedness is to be preferred to median-unbiasedness. If mean-unbiasedness or just overall error reduction is chosen, both estimators, the WLE and the BME, outperform the MUE. The BME shows the least estimated mean squared error as well as the least standard deviation.

Appendix A. Derivation of the Saddlepoint Approximation for the 2PL

The probability of the response vector xv in the 2PL is

$$P(x_v \mid \theta_v, \beta, \alpha) = \frac{\exp\left(\theta_v \sum_{i=1}^{n} x_{vi}\alpha_i - \sum_{i=1}^{n} x_{vi}\alpha_i\beta_i\right)}{\prod_{i=1}^{n}\left[1 + \exp\{\alpha_i(\theta_v - \beta_i)\}\right]}, \tag{A.1}$$

and the probability to obtain the weighted sum score $\sum_{i=1}^{n} x_{vi}\alpha_i$ is

$$P\left(\sum_{i=1}^{n} x_{vi}\alpha_i \,\middle|\, \theta_v, \beta, \alpha\right) = \sum_{x_v \mid \sum x_{vi}\alpha_i} P(x_v \mid \theta_v, \beta, \alpha) = \frac{\exp\left(\theta_v \sum_{i=1}^{n} x_{vi}\alpha_i\right) \sum_{x_v \mid \sum x_{vi}\alpha_i} \exp\left(-\sum_{i=1}^{n} x_{vi}\alpha_i\beta_i\right)}{\prod_{i=1}^{n}\left[1 + \exp\{\alpha_i(\theta_v - \beta_i)\}\right]}, \tag{A.2}$$

where $\sum_{x_v \mid \sum x_{vi}\alpha_i}$ means summation over all response vectors having the same weighted sum score $\sum_{i=1}^{n} x_{vi}\alpha_i$. In the context of the Rasch model, this sum is called γrv, the elementary symmetric function. If the weighted sums corresponding to the response vectors are all unique, formulae (A.1) and (A.2) agree exactly. Here we can observe an interesting feature of the exponential family, namely that the density of the data points xv is of the same family as the density of the sufficient statistic (e.g., Casella & Berger, 2002, p. 217):

$$p(x; \theta) = \exp\{\theta x - K(\theta)\}\, f_0(x). \tag{A.3}$$

The density of the sufficient statistic is then

$$p(s; \theta) = \exp\{\theta s - K(\theta)\}\, F_0(s), \tag{A.4}$$

with the sole difference given in the functions $f_0(x)$ and $F_0(s)$. Comparing this to Equations (A.1) and (A.2), we see that $\exp(-\sum_{i=1}^{n} x_{vi}\alpha_i\beta_i) = f_0(x_v)$ and $\sum_{x_v \mid \sum x_{vi}\alpha_i} \exp(-\sum_{i=1}^{n} x_{vi}\alpha_i\beta_i) = F_0(r_v)$. In order to simplify the notation, the weighted sum score $\sum_{i=1}^{n} x_{vi}\alpha_i$ will be abbreviated in the sequel as wv. Because in exponential family theory (e.g., Lehmann & Casella, 1998;


Davison, 2003) the log of the normalizing constants, i.e., the log of the denominators of Equations (A.2) and (A.1), are the cumulant generating functions, we abbreviate $\log \prod_{i=1}^{n}\left[1 + \exp\{\alpha_i(\theta_v - \beta_i)\}\right]$ by K(θv). Exponential tilting is a technique that allows any density to be embedded in an exponential family, and, by tilting an exponential family, a density of the same family is obtained (Davison, 2003, p. 168). Any arbitrary density can be multiplied by exp(sφ), and, by renormalizing this product, a valid density function is obtained. The distribution of a response vector xv in the 2PL model is given in Equation (A.1). Multiplying this density by exp(wvφv) and normalizing the product yields

$$P(x_v \mid \theta_v + \phi_v, \beta, \alpha) = \frac{\exp(\theta_v w_v - K(\theta_v))\, f_0(x_v)\exp(w_v\phi_v)}{\sum_{x_v} \exp(\theta_v w_v - K(\theta_v))\, f_0(x_v)\exp(w_v\phi_v)}. \tag{A.5}$$

The summation in the denominator takes place over all possible response vectors. The above equation can be expressed in a slightly different way by writing the term in the denominator not depending on wv before the summation sign:

$$P(x_v \mid \theta_v + \phi_v, \beta, \alpha) = \frac{\exp(-K(\theta_v)) \exp((\theta_v + \phi_v)w_v)\, f_0(x_v)}{\exp(-K(\theta_v)) \sum_{x_v} \exp((\theta_v + \phi_v)w_v)\, f_0(x_v)}. \tag{A.6}$$

Now, the first factors in the numerator and denominator cancel, and from Equation (A.1) we have the result that the remaining term in the denominator must be

$$
\sum_{x_v} \exp\left((\theta_v + \phi_v) w_v\right) f_0(x_v) = \prod_{i=1}^{n}\left[1 + \exp\left\{\alpha_i\left((\theta_v + \phi_v) - \beta_i\right)\right\}\right] = \exp\left(K(\theta_v + \phi_v)\right). \tag{A.7}
$$

Starting from Equation (A.6), the probability $P(x_v \mid \theta_v + \phi_v, \beta, \alpha)$ can hence be written as

$$
P(x_v \mid \theta_v + \phi_v, \beta, \alpha) = P(x_v \mid \theta_v, \beta, \alpha) \exp\left\{w_v\phi_v - \left(K(\theta_v + \phi_v) - K(\theta_v)\right)\right\}. \tag{A.8}
$$

This equation can be solved for the initial density $P(x_v \mid \theta_v, \beta, \alpha)$, and one obtains

$$
P(x_v \mid \theta_v, \beta, \alpha) = P(x_v \mid \theta_v + \phi_v, \beta, \alpha) \exp\left\{K(\theta_v + \phi_v) - K(\theta_v) - w_v\phi_v\right\}. \tag{A.9}
$$
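The normalization identity (A.7) and the tilting identity (A.9) are easy to verify numerically by brute-force enumeration. The following Python sketch does so for a hypothetical three-item test; the parameter values ($\alpha$, $\beta$, $\theta_v$, $\phi_v$) are illustrative choices, not taken from the article.

```python
import itertools
import math

# Hypothetical item parameters for a short three-item test (illustrative only).
alpha = [1.0, 1.5, 0.8]   # discriminations
beta = [-0.5, 0.0, 0.5]   # difficulties

def K(t):
    # cumulant generating function: log of the denominator of (A.1)
    return sum(math.log(1 + math.exp(a * (t - b))) for a, b in zip(alpha, beta))

def w_of(x):
    # weighted sum score w_v
    return sum(xi * a for xi, a in zip(x, alpha))

def f0(x):
    # carrier f0(x_v) = exp(-sum_i x_vi * alpha_i * beta_i)
    return math.exp(-sum(xi * a * b for xi, a, b in zip(x, alpha, beta)))

def p_pattern(x, t):
    # Equation (A.1) in exponential-family form: exp(t*w_v - K(t)) * f0(x_v)
    return math.exp(t * w_of(x) - K(t)) * f0(x)

theta, phi = 0.2, 0.7
patterns = list(itertools.product([0, 1], repeat=len(alpha)))

# (A.7): summing exp((theta+phi)*w_v) * f0(x_v) over all patterns
# equals exp(K(theta+phi)).
lhs_a7 = sum(math.exp((theta + phi) * w_of(x)) * f0(x) for x in patterns)
print(lhs_a7, math.exp(K(theta + phi)))

# (A.9): the original density is recovered from the tilted one.
x = (1, 0, 1)
lhs_a9 = p_pattern(x, theta)
rhs_a9 = p_pattern(x, theta + phi) * math.exp(K(theta + phi) - K(theta) - w_of(x) * phi)
print(lhs_a9, rhs_a9)
```

Because (A.7) and (A.9) are algebraic identities, the two sides agree to machine precision for any choice of $\theta_v$ and $\phi_v$.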

By comparing Equation (A.9) to Equations (A.3) and (A.4), one sees that, in order to obtain an approximation to the density of the sufficient statistic $w_v$, we have to exchange $P(x_v \mid \theta_v, \beta, \alpha)$ with $P(w_v \mid \theta_v, \beta, \alpha)$ and likewise for $P(x_v \mid \theta_v + \phi_v, \beta, \alpha)$. This last term can be approximated by the first term of an Edgeworth expansion of a standardized sum $S_n^* = (S_n - n\mu)/(\sqrt{n}\sigma)$ of independent variables $Y_1, \ldots, Y_n$ (Pace & Salvan, 1997). Here $S_n = \sum_{i=1}^{n} Y_i$, $\mu = E(Y)$ and $\sigma^2 = \mathrm{Var}(Y)$. The Edgeworth expansion for such a sum is given by

$$
p_{S_n^*}(x) = \phi(x)\left[1 + \frac{\rho_3}{6\sqrt{n}} H_3(x) + \frac{\rho_4}{24n} H_4(x) + \frac{\rho_3^2}{72n} H_6(x)\right] + O\left(n^{-3/2}\right), \tag{A.10}
$$

where $\rho_r(S_n^*) = \kappa_r$ are the respective standardized cumulants. The first two cumulants are given by $\rho_1 = 0$ and $\rho_2 = 1$, and $H_r(x)$ are Tchebycheff–Hermite polynomials (e.g., Stuart & Ord, 1987). The odd-ordered Hermite polynomials vanish at $x = 0$, the mean of the standardized sum. As can be seen, including only the first term of the above expansion approximates the density of the normalized sum as a normal density with error of order $O(n^{-1/2})$, in accordance with the central limit theorem (e.g., Young & Smith, 2005). Including the second and third terms corrects for skewness and kurtosis and the error will drop to $O(n^{-1})$, and with the fourth term to

$O(n^{-3/2})$. Since the Edgeworth expansion achieves greatest accuracy at the mean, where $H_3(x)$ vanishes and leaves an error of $O(n^{-1})$, the mean of $P(w_v \mid \theta_v + \phi_v, \beta, \alpha)$ is needed. Again, from the theory of exponential families we have that the expectation of a variable is given by the first derivative of the cumulant generating function. Equating the expectation to the sufficient statistic gives the so-called maximum likelihood equation

$$
w_v = \frac{\partial}{\partial\theta_v} K(\theta_v) = \sum_{i=1}^{n} \alpha_i \frac{\exp\{\alpha_i(\theta_v - \beta_i)\}}{1 + \exp\{\alpha_i(\theta_v - \beta_i)\}}, \tag{A.11}
$$

which is solved by the maximum likelihood estimator $\hat{\theta}_v$. Since the distribution $P(w_v \mid \theta_v + \phi_v, \beta, \alpha)$ has the same sufficient statistic $w_v$, it is also maximized at $\hat{\theta}_v$, and hence $\phi_v = \hat{\theta}_v - \theta_v$. For standardized sums the first term of the Edgeworth expansion (Pace & Salvan, 1997, Chap. 10) is the standard normal density. The Jacobian of the transformation (Casella & Berger, 2002, pp. 120, 158) for the univariate case is the reciprocal of the standard deviation, which in turn is obtained as the square root of the second derivative of the cumulant generating function. Hence, the approximation to the desired density is written as

$$
P(w_v \mid \hat{\theta}_v, \beta, \alpha) = \frac{1}{\sqrt{2\pi K''(\hat{\theta}_v)}} + \text{Rest}, \tag{A.12}
$$

where Rest stands for the terms not accounted for. Note that $K''(\hat{\theta}_v)$, the variance of the variable $w_v$, must not be confused with the asymptotic variance of the estimate $\hat{\theta}_v$, which of course is given by the reciprocal of $K''(\hat{\theta}_v)$. Keep in mind that $\phi_v = \hat{\theta}_v - \theta_v$, so that Equation (A.9) can now be written as

$$
P(w_v \mid \theta_v, \beta, \alpha) = \frac{\exp\{w_v\theta_v - w_v\hat{\theta}_v + K(\hat{\theta}_v) - K(\theta_v)\}}{\sqrt{2\pi K''(\hat{\theta}_v)}} + \text{Rest}. \tag{A.13}
$$

This is the saddlepoint approximation to the probability of a certain weighted sum score $w_v$ for a given ability level $\theta_v$ and item parameters $\beta, \alpha$.

Appendix B. Polynomial Interpolation

In order to obtain stable values of $r^*$ (Equation (16)), polynomial interpolation is done for values of $\theta$ that lie in an interval $\hat{\theta} \pm 0.5 \cdot j(\hat{\theta})^{-1/2}$. First, the term $r$ for all values of $\theta$ (Equation (10)) is modeled as a tenth-order polynomial in $u$ (Equation (14)):

$$
r = a_1 u + a_2 u^2 + \cdots + a_{10} u^{10}. \tag{B.1}
$$

The coefficients are fitted by the least-squares criterion. Next the fraction $r/u$ is determined by

$$
\frac{r}{u} = a_1 + a_2 u + \cdots + a_{10} u^{9}. \tag{B.2}
$$

After that the logarithm of this fraction in turn is modeled by a polynomial regression model with predictors given by the powers of $r$:

$$
\log\frac{r}{u} = a_1 r + a_2 r^2 + \cdots + a_{10} r^{10}. \tag{B.3}
$$

The fraction of the logarithmic term and $r$ is estimated by

$$
\frac{1}{r}\log\frac{r}{u} = a_1 + a_2 r + \cdots + a_{10} r^{9}. \tag{B.4}
$$

The modified likelihood ratio is finally given by $r$ minus the above term:

$$
r^* = r - \frac{1}{r}\log\frac{r}{u}. \tag{B.5}
$$
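A minimal pure-Python sketch of this scheme is given below. The relation $r(u)$ used here is a hypothetical smooth stand-in for the true statistics of Equations (10) and (14), and the polynomial coefficients are obtained by interpolation through ten sample points rather than by least squares, so this illustrates only the mechanics of Equations (B.1)–(B.5).

```python
import math

# Hypothetical smooth relation with r/u -> 1 at the origin, mimicking the
# behaviour of the signed likelihood root near theta = theta_hat.
def r_of_u(u):
    return u + 0.3 * u**2 + 0.05 * u**3

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[piv] = M[piv], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

# Ten sample points in the interval around theta_hat (here: u in +-0.5, u != 0).
us = [-0.5 + 0.1 * k for k in range(11) if k != 5]
rs = [r_of_u(u) for u in us]

# (B.1): r = a1*u + ... + a10*u^10, fitted through the ten sample points.
A = [[u**p for p in range(1, 11)] for u in us]
a = solve(A, rs)

# (B.2): r/u = a1 + a2*u + ... + a10*u^9 -- finite even as u -> 0.
def r_over_u(u):
    return sum(a[p] * u**p for p in range(10))

# (B.3)/(B.4): fit log(r/u) = b1*r + ... + b10*r^10, then divide by r.
B = [[r**p for p in range(1, 11)] for r in rs]
b = solve(B, [math.log(r_over_u(u)) for u in us])

def r_star(u):
    # (B.5): r* = r - (1/r) * log(r/u), evaluated through the polynomials.
    r = r_of_u(u)
    return r - sum(b[p] * r**p for p in range(10))

# Near u = 0 the direct formula divides a near-zero quantity by another;
# the interpolated version stays finite and smooth.
print(r_star(1e-9))
```

Away from the origin the interpolated $r^*$ agrees with the direct formula $r - (1/r)\log(r/u)$, while at $u \approx 0$, where the direct formula is numerically unstable, the polynomial route returns the finite limiting value. This is precisely the instability the interpolation is designed to avoid.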

References

Agresti, A. (2001). Exact inference for categorical data: recent advances and continuing controversies. Statistics in Medicine, 20, 2709–2722.
Agresti, A. (2002). Categorical data analysis (2nd ed.). Hoboken: Wiley.
Agresti, A., & Gottard, A. (2007). Nonconservative exact small-sample inference for discrete data. Computational Statistics & Data Analysis, 51, 6447–6458.
Agresti, A., & Min, Y. (2001). On small-sample confidence intervals for parameters in discrete distributions. Biometrics, 57, 963–971.
Aït-Sahalia, Y., & Yu, J. (2006). Saddlepoint approximations for continuous-time Markov processes. Journal of Econometrics, 134, 507–551.
Baker, F.B., & Kim, S.H. (2004). Item response theory: parameter estimation techniques (2nd ed.). New York: CRC Press.
Barndorff-Nielsen, O. (1986). Inference on full or partial parameters based on the standardized signed log likelihood ratio. Biometrika, 73(2), 307–322.
Bedrick, E.J. (1997). Approximating the conditional distribution of person fit indexes for checking the Rasch model. Psychometrika, 62(2), 191–199.
Bedrick, E.J., & Hill, J.R. (1992). An empirical assessment of saddlepoint approximations for testing a logistic regression parameter. Biometrics, 48(2), 529–544.
Birnbaum, A. (1964). Median-unbiased estimators. Bulletin of Mathematical Statistics, 11, 25–34.
Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71(3), 425–440.
Brazzale, A.R. (1999). Approximate conditional inference in logistic and loglinear models. Journal of Computational and Graphical Statistics, 8(3), 653–661.
Brazzale, A.R. (2000). Practical small-sample parametric inference. Unpublished doctoral dissertation, École Polytechnique Fédérale de Lausanne, Switzerland.
Brazzale, A.R. (2005). hoa: An R package bundle for higher order likelihood inference. R News, 5(1), 20–27.
Brazzale, A.R., & Davison, A.C. (2008). Accurate parametric inference for small samples. Statistical Science, 23(4), 465–484.
Brazzale, A.R., Davison, A.C., & Reid, N. (2007). Applied asymptotics: case studies in small-sample statistics. Cambridge: Cambridge University Press.
Brown, G.W. (1947). On small-sample estimation. The Annals of Mathematical Statistics, 18(4), 582–585.
Butler, R.W. (2000). Reliabilities for feedback systems and their saddlepoint approximation. Statistical Science, 15(3), 279–298.
Butler, R.W. (2007). Saddlepoint approximations with applications. New York: Cambridge University Press.
Casella, G., & Berger, R. (2002). Statistical inference. Pacific Grove: Duxbury/Thomson Learning.
Chieffo, A., Stankovic, G., Bonizzoni, E., Tsagalou, E., Iakovou, I., Montorfano, M., et al. (2005). Early and mid-term results of drug-eluting stent implantation in unprotected left main. Circulation, 111, 791–795.
Cox, D. (2006). Principles of statistical inference. New York: Cambridge University Press.
Davison, A.C. (1988). Approximate conditional inference in generalized linear models. Journal of the Royal Statistical Society, Series B (Methodological), 50(3), 445–461.
Davison, A.C. (2003). Statistical models. New York: Cambridge University Press.
Davison, A.C., Fraser, D., & Reid, N. (2006). Improved likelihood inference for discrete data. Journal of the Royal Statistical Society, Series B, 68(3), 495–508.
DeMars, C. (2010). Item response theory. New York: Oxford University Press.
Doebler, A., Doebler, P., & Holling, H. (2013). Optimal and most exact confidence intervals for person parameters in item response theory models. Psychometrika, 78(1), 98–115.
Esseen, C.-G. (1945). Fourier analysis of distribution functions. A mathematical study of the Laplace–Gaussian law. Acta Mathematica, 77(1), 1–125.
Fischer, G.H. (2007). Rasch models. In C. Rao & S. Sinharay (Eds.), Handbook of statistics: Vol. 26. Psychometrics (pp. 515–585). Amsterdam: North-Holland.
Fox, J.-P. (2010). Bayesian item response modeling. New York: Springer.

Hall, P. (1982). Improving the normal approximation when constructing one-sided confidence intervals for binomial or Poisson parameters. Biometrika, 69(3), 647–652.
Hall, P. (1992). On the removal of skewness by transformation. Journal of the Royal Statistical Society, Series B (Methodological), 54(1), 221–228.
Hambleton, R.K., & Zhao, Y. (2005). Item response theory (IRT) models for dichotomous data. In B. Everitt & D. Howell (Eds.), Encyclopedia of statistics in behavioral science (pp. 982–990). Chichester: Wiley.
Hirji, K.F. (2006). Exact analysis of discrete data. Boca Raton: Chapman & Hall/CRC Press.
Hirji, K.F., Tsiatis, A.A., & Mehta, C.R. (1989). Median unbiased estimation for binary data. The American Statistician, 43(1), 7–11.
Hoijtink, H., & Boomsma, A. (1995). On person parameter estimation in the dichotomous Rasch model. In G.H. Fischer & I.W. Molenaar (Eds.), Rasch models: foundations, recent developments, and applications (pp. 54–68). New York: Springer.
Johnson, N.L., Kemp, A.W., & Kotz, S. (2005). Univariate discrete distributions (3rd ed.). Hoboken: Wiley.
Kay, S., Nuttall, A., & Baggenstoss, P. (2001). Multidimensional probability density function approximations for detection, classification, and model order selection. IEEE Transactions on Signal Processing, 49(10), 2240–2252.
Klauer, K.C. (1991a). Exact and best confidence intervals for the ability parameter of the Rasch model. Psychometrika, 56(2), 535–547.
Klauer, K.C. (1991b). An exact and optimal standardized person test for assessing consistency with the Rasch model. Psychometrika, 56(2), 213–228.
Kolassa, J. (1997). Infinite parameter estimates in logistic regression, with application to approximate conditional inference. Scandinavian Journal of Statistics, 24(4), 523–530.
Lehmann, E. (1951). A general concept of unbiasedness. The Annals of Mathematical Statistics, 22(4), 587–592.
Lehmann, E. (1999). Elements of large sample theory (1st ed.). New York: Springer.
Lehmann, E., & Casella, G. (1998). Theory of point estimation (2nd ed.). New York: Springer.
Lehmann, E., & Romano, J. (2005). Testing statistical hypotheses (3rd ed.). New York: Springer.
Levin, B. (1990). The saddlepoint correction in conditional logistic likelihood analysis. Biometrika, 77(2), 275–285.
Liou, M., & Yu, L.-C. (1991). Assessing statistical accuracy in ability estimation: a bootstrap approach. Psychometrika, 56(1), 55–67.
Lord, F.M. (1983). Unbiased estimators of ability parameters, of their variance, and of their parallel-forms reliability. Psychometrika, 48(2), 233–245.
Lugannani, R., & Rice, S. (1980). Saddle point approximation for the distribution of the sum of independent random variables. Advances in Applied Probability, 12(2), 475–490.
Molenaar, I., & Hoijtink, H. (1990). The many null distributions of person fit indices. Psychometrika, 55(1), 75–106.
Ogasawara, H. (2012). Asymptotic expansions for the ability estimator in item response theory. Computational Statistics, 27(4), 661–683.
Ogasawara, H. (2013). Asymptotic properties of the Bayes and pseudo Bayes estimators of ability in item response theory. Journal of Multivariate Analysis, 114, 359–377.
Pace, L., & Salvan, A. (1997). Principles of statistical inference from a neo-Fisherian perspective. Singapore: World Scientific.
Pace, L., & Salvan, A. (1999). Point estimation based on confidence intervals: exponential families. Journal of Statistical Computation and Simulation, 64, 1–21.
Pfanzagl, J. (1970a). Median unbiased estimates for m.l.r.-families. Metrika, 15(1), 30–39.
Pfanzagl, J. (1970b). On the asymptotic efficiency of median unbiased estimates. The Annals of Mathematical Statistics, 41(5), 1500–1509.
Pfanzagl, J. (1972). On median unbiased estimates. Metrika, 18(1), 154–173.
Pierce, D.A., & Peters, D. (1992). Practical use of higher order asymptotics for multiparameter exponential families. Journal of the Royal Statistical Society, Series B, 54(3), 701–737.
R Development Core Team (2009). R: a language and environment for statistical computing [Computer software manual]. Vienna, Austria. Available from http://www.R-project.org.
Read, C.B. (2006). Median unbiased estimators. In S. Kotz, L.J. Norman, N. Balakrishnan, C.B. Read, & V. Brani (Eds.), Encyclopedia of statistical sciences (2nd ed., Vol. 7, pp. 4713–4715). New York: Wiley-Interscience.
Reeve, B., & Mâsse, L. (2004). Item response theory modeling for questionnaire evaluation. In S. Presser et al. (Eds.), Methods for testing and evaluating survey questionnaires (pp. 247–273). Hoboken: Wiley.
Reid, N. (1988). Saddlepoint methods and statistical inference. Statistical Science, 3(2), 213–238.
Rogers, L., & Zane, O. (1999). Saddlepoint approximations to option prices. The Annals of Applied Probability, 9(2), 493–503.
Routledge, R. (1994). Practicing safe statistics with the mid-p∗. Canadian Journal of Statistics, 22(1), 103–110.
Salvan, A., & Hirji, K. (1991). Asymptotic equivalence of conditional median unbiased and maximum likelihood estimators in exponential families. Metron, 49, 219–232.
Severini, T.A. (2000). Likelihood methods in statistics. New York: Oxford University Press.
Small, C.G. (2010). Expansions and asymptotics for statistics. Boca Raton: Chapman & Hall/CRC Press.
Srivastava, M., & Yau, W. (1989). Saddlepoint method for obtaining tail probability of Wilks' likelihood ratio test. Journal of Multivariate Analysis, 31, 117–126.
Stuart, A., & Ord, J. (1987). Kendall's advanced theory of statistics (5th ed., Vol. 1). New York: Oxford University Press.
van der Linden, W.J., & Glas, C.A.W. (Eds.) (2000). Computerized adaptive testing: theory and practice. Dordrecht: Kluwer Academic.

Wang, S., & Carroll, R.J. (1999). High-order accurate methods for retrospective sampling problems. Biometrika, 86(4), 881–897.
Warm, T.A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54(3), 427–450.
Young, G., & Smith, R. (2005). Essentials of statistical inference. New York: Cambridge University Press.

Manuscript Received: 21 AUG 2012
