Behavior Research Methods. DOI 10.3758/s13428-014-0512-9

Estimating a DIF decomposition model using a random-weights linear logistic test model approach

Insu Paek & Hirotaka Fukuhara

I. Paek (*)
Educational Psychology & Learning Systems, Florida State University, Tallahassee, FL, USA
e-mail: [email protected]

H. Fukuhara
Pearson, San Antonio, TX, USA

© Psychonomic Society, Inc. 2014

Abstract  A differential item functioning (DIF) decomposition model separates a testlet item DIF into two sources: item-specific differential functioning and testlet-specific differential functioning. This article provides an alternative model-building framework and estimation approach for a DIF decomposition model that was proposed by Beretvas and Walker (2012). Although their model is formulated under multilevel modeling with the restricted pseudolikelihood estimation method, our approach illustrates DIF decomposition modeling that is directly built upon the random-weights linear logistic test model framework with the marginal maximum likelihood estimation method. In addition to demonstrating our approach's performance, we provide detailed information on how to implement this new DIF decomposition model using an item response theory software program; using DIF decomposition may be challenging for practitioners, yet practical information on how to implement it has previously been unavailable in the measurement literature.

Keywords  DIF . IRT . DIF decomposition . Testlet . Testlet DIF . ConQuest

When a test is assembled, an item bias investigation is routinely conducted to ascertain a valid test score inference and to construct a fair test across different ethnicity or gender groups. To declare that an item is biased, a statistical test of differential item functioning (DIF) is applied, and its results are used as part of the supporting evidence for either item bias or no item bias. DIF is defined as differential performance on an item by

different groups that are at the same ability level. A DIF method typically focuses on item-level differential functioning across the reference (e.g., White) and focal (e.g., African American) groups. DIF results for an item include both a statistical testing result and a DIF magnitude estimate that provides the overall amount of DIF at the item level. When a test is composed of item bundles or testlets (e.g., a testlet being a set of items in a reading test that share the same reading passage), DIF may be considered the combination of two elements: differential functioning due to the item itself under investigation (item-specific differential functioning: IS_DF) and differential functioning due to the testlet containing the item under investigation (testlet-specific differential functioning: TS_DF)—that is, DIF = IS_DF + TS_DF.

Beretvas and Walker (2012) proposed a DIF decomposition model that separates IS_DF and TS_DF. Their model building was based on a multilevel modeling approach. For model estimation, they used SAS PROC GLIMMIX (SAS Institute, 2006) with the restricted pseudolikelihood estimation method (RPL: Wolfinger & O'Connell, 1993).

We propose an alternative model-building framework and estimation approach for DIF decomposition modeling. A major difference between our approach and that of Beretvas and Walker (2012) is that our model is built directly upon an item response theory (IRT) framework—more specifically, the random-weights linear logistic test model (RWLLTM: Rijmen & De Boeck, 2002)—and employs the full-information estimation approach based on marginal maximum likelihood using the expectation-maximization algorithm (MML-EM: Bock & Aitkin, 1981; Dempster, Laird, & Rubin, 1977). The RPL method adopts a model linearization approach for the parameter estimation: it linearizes the nonlinear mixed-effects model using a first-order Taylor series with respect to the fixed and random parameters to construct pseudo-responses, which are then used to estimate the model parameters by restricted maximum likelihood, typically known as REML in the multilevel modeling


literature. The MML-EM method, unlike the RPL method, constructs the actual objective function for the marginal likelihood in the estimation. Thus, it allows for familiar statistical tests, such as a likelihood ratio test between nested models (e.g., the DIF decomposition model vs. the testlet DIF model) and the Wald test, which are not available in the RPL method because of its model linearization approach.

The purpose of this article is twofold: (1) we show a DIF decomposition model under the RWLLTM framework and demonstrate the model's parameter recovery under MML-EM through simulations; and (2) for practitioners, we provide detailed information on how to specify and estimate this DIF decomposition model (e.g., construction of a design matrix and an appropriate format of the input data structure) using an IRT program, ConQuest (Wu, Adams, Wilson, & Haldane, 2007).

In the previous study by Beretvas and Walker (2012), the impact factor (which is a group ability difference between the reference and focal groups) was addressed in their modeling; however, in the actual simulation study, it is not clear whether the impact factor was included in the simulation, and there was no report on the recovery of the impact factor parameter by the RPL estimation method under the multilevel modeling framework. Consideration of the impact factor is very important for DIF methods to work properly. DIF is, in essence, the difference in the conditional probabilities of a correct answer (or the conditional expected score) between the compared groups at the same ability level, and it should not be confounded with the group ability difference (i.e., impact) that may exist. This becomes clear when we consider the marginal item response (or proportion correct). The proportion correct of an item i for a group g (denoted by pi+) can be represented as

p_{i+} = \int P(X_i = 1 \mid \theta) f(\theta) \, d\theta,    (1)

where P(Xi = 1|θ) is the conditional probability of a correct answer for item i (i.e., the item response function [IRF] for item i in IRT), and f(θ) is the density function for ability θ. Note that P(Xi = 1|θ) and f(θ) are separately considered in Eq. 1, and DIF is the comparison of P(Xi = 1|θ) between groups—that is, the differential performance on an item between groups at the same ability level beyond the difference in f(θ), which is the impact. The impact factor should be modeled for a complete DIF model, and its performance should be evaluated regarding the separation of DIF and the impact.
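To make the distinction between DIF and impact concrete, the following R sketch (our illustration, not code from the article; the Rasch IRF, the item difficulty of 0.5, and the group means are assumed values) evaluates Eq. 1 numerically for a single DIF-free item under two ability densities that differ only in their means. The two marginal proportions correct differ even though the item functions identically for both groups, which is why DIF must be judged conditionally on ability rather than from marginal proportions.

# Minimal sketch (not from the article): Eq. 1 evaluated numerically for a
# Rasch item with no DIF, showing that impact alone changes proportion correct.
irf <- function(theta, b) plogis(theta - b)        # P(X = 1 | theta) for a Rasch item

# Marginal proportion correct: integrate the IRF against the group ability density
p_plus <- function(b, mean_theta, sd_theta = 1) {
  integrate(function(th) irf(th, b) * dnorm(th, mean_theta, sd_theta),
            lower = -Inf, upper = Inf)$value
}

b <- 0.5                                  # illustrative item difficulty
p_plus(b, mean_theta = 0)                 # reference group, theta ~ N(0, 1)
p_plus(b, mean_theta = -1)                # focal group, theta ~ N(-1, 1): lower p_+,
                                          # even though the IRF (no DIF) is identical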

The second purpose was motivated by our observation that some practitioners and content-oriented researchers have shown interest in the idea of DIF decomposition and have desired to apply the model, but the model implementation could be challenging because DIF decomposition modeling is new and no detailed, practical information has been available in the IRT literature for practitioners regarding how to specify and estimate the model using statistical software. Our provision of detailed information on how to specify and estimate a DIF decomposition model using an IRT program can enhance practitioners' ability to apply the model to their real data. In our view, one of the most challenging and time-consuming aspects of implementing the DIF decomposition model is understanding a design matrix and its construction (which is also true for RPL estimation using the SAS PROC GLIMMIX program). Also, as the number of items increases, manually constructing a correct design matrix becomes more error-prone because of the matrix's increasing complexity. Thus, in addition to all of the file content necessary for specifying the model in ConQuest, we provide a function written in the R language¹ (R Development Core Team, 2013) that automates the construction of a design matrix for the DIF decomposition model for users' convenience.

¹ R is a statistical software language that is freely available. One can simply copy and paste the design matrix function provided here into R and use it to generate an appropriate design matrix.

Method

The DIF decomposition model

Fischer (1973) proposed an item response model called the linear logistic test model (LLTM), in which person and item properties are additively parameterized. Rijmen and De Boeck (2002) extended the LLTM by allowing some of the additive model parameters to be random parameters, calling their model the random-weights LLTM (RWLLTM). The RWLLTM for a correct response to a dichotomously scored item can be expressed as


\mathrm{logit}\left[ P\left( U_{ni} = 1 \mid \theta_n, \psi_p, w_{np}, \mathbf{b}_i' \right) \right] = \mathrm{logit}(P_{ni}) = \theta_n + \sum_{p=1}^{M_1} b_{ip} \psi_p + \sum_{p=M_1+1}^{M} b_{ip} w_{np},    (2)

where Uni is the nth person's response to the ith item, θn is the nth person's ability, M is the total number of parameters for the ith item, M1 is the number of fixed parameters, M − M1 is the number of random parameters, ψp is the pth fixed parameter for the ith item, wnp is the pth random parameter for the nth person and the ith item, bip is an integer weight for the ith item's pth parameter, and b′i is part of a design matrix for the ith item—more precisely, a row vector for the correct response of the ith item in the design matrix that specifies a linear additive parameterization for the ith item. When stacking (0′1×M, b′i)′ across all items (0′1×M being a vector of zeros with M elements), we obtain the whole design matrix for all items. (See Rijmen & De Boeck, 2002, for more details on RWLLTM.)
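The following toy R sketch (our illustration; the weights and parameter values are invented, and this is neither the ConQuest design matrix file format nor the R function provided in the Appendix) shows how the integer weights bip map a small set of fixed parameters onto items, which is the fixed-parameter part of Eq. 2; the random-parameter part works analogously, and stacking the row vectors b′i yields a design matrix.

# Toy illustration (ours): integer weights b_ip for three items and two fixed
# basic parameters.  Each row is a b_i' vector; the matrix product gives each
# item's contribution sum_p b_ip * psi_p to the linear predictor in Eq. 2.
B <- rbind(item1 = c(1, 0),    # item 1 uses basic parameter 1 once
           item2 = c(1, 1),    # item 2 uses both basic parameters
           item3 = c(0, 2))    # item 3 uses basic parameter 2 twice
colnames(B) <- c("psi1", "psi2")

psi <- c(0.4, -0.6)            # illustrative fixed parameter values
drop(B %*% psi)                # fixed-effect part of Eq. 2 for each item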


Adapting the RWLLTM above to our modeling of DIF decomposition for a test with testlets, the effect of a testlet is modeled through the random parameter wnp, whereas IS_DF and TS_DF are modeled by ψp. The impact factor and the item difficulty are also modeled via ψp. The DIF decomposition model for a testlet item is formulated as follows:

\mathrm{logit}(P_{nid}) = \theta_n + \zeta_g - \beta_i - \kappa_{d(i)} + \lambda_d + \gamma_{nd(i)},    (3)

where Pnid is the probability of a correct answer (or of endorsing a positive response) for the ith item in the dth testlet by the nth person, logit(Pnid) = log[Pnid/(1 − Pnid)], θn is the nth person's ability (n = 1, 2, ..., N), ζg is group g's mean parameter [g = R (reference group) or F (focal group)] on the target dimension θ, βi is the ith item difficulty (i = 1, 2, ..., I), −κd(i) is the item-specific differential functioning (IS_DF) for the ith item nested within the dth testlet (i.e., IS_DF = −κd(i)), λd is the dth testlet-specific differential functioning (TS_DF) (d = 1, 2, ..., D), and γnd(i) is the interaction parameter between the nth person and the dth testlet in which the ith item is nested. The random parameters in Eq. 3 are θn and γnd(i), whereas the other parameters are fixed unknown parameters. (Treating θn as a random parameter is due to the use of MML estimation here.) ζg is the impact parameter, and σ²d, the variance of γnd(i), represents the magnitude of the dth testlet effect. For model identification, θn ∼ N(0, σ²θ), where σ²θ is a free parameter; ζg=R = 0; Σi κd(i) = 0 for a given d (or ∃i such that κd(i) = 0 at a given dth testlet); ∃d such that λd = 0; and γnd(i) ∼ N(0, σ²d), where σ²d is a free parameter. The item-level DIF (δi) in this model is equal to

\delta_i = \beta_{i, g=R} - \beta_{i, g=F} = -\kappa_{d(i)} + \lambda_d,    (4)

where βi,g=R and βi,g=F are the item difficulties for the reference and focal groups, respectively. Defining IS_DF as −κd(i) makes the interpretation of IS_DF in terms of its direction (i.e., whether the focal [or reference] group is disadvantaged) have the same meaning as that of TS_DF. When κd(i) > 0, or IS_DF < 0, the item (IS_DF) is biased against the focal group (which has a group membership indicator value of one in the input data matrix). When λd < 0, or TS_DF < 0, the testlet (TS_DF) is biased against the focal group in our modeling. DIF cancellation (δi = 0 in Eq. 4) takes place when IS_DF and TS_DF are the same in absolute value but opposite in sign—that is, when κd(i) = λd. Beretvas and Walker (2012) presented a simulation example of DIF cancellation.

The DIF decomposition model specified in Eq. 3 can be viewed as a direct extension of the marginal likelihood estimation approach to the Rasch DIF model (Paek & Wilson, 2011) and to the Rasch testlet DIF model (Wang & Wilson, 2005a). The model in Eq. 3 reduces to the Rasch DIF model when κd(i), λd, and γnd(i) (or σ²d) are equal to zero, and to the Rasch testlet DIF model when λd is equal to zero and κd(i) is

replaced by δi defined in Eq. 4. Also, if we take a multidimensional IRT perspective on testlet IRT modeling (e.g., Rijmen, 2010), the model in Eq. 3 can be seen as a special case of the multidimensional random-coefficient multinomial logit model (MRCMLM: Adams, Wilson, & Wang, 1997), which subsumes the RWLLTM as a submodel. The program ConQuest (Wu et al., 2007) is the realization of MRCMLM and employs MML-EM for the model parameter estimation; thus, the MML-EM DIF decomposition model in Eq. 3 can be estimated. Because of ConQuest's generality as a Rasch-type generalized item response model program, the specification of nonregular IRT models such as the DIF decomposition model requires a user-supplied design matrix and other necessary files, which can be challenging for practitioners. We provide the necessary file content and an R function that generates a design matrix for the DIF decomposition model for the ConQuest program in the Appendices.

Simulation design

To show the parameter recovery of the DIF decomposition model by the MML-EM method, we simulated a 20-item test under four conditions created by manipulating two simulation factors: group size (500 or 1,000 per group) and testlet size (ten or five items per testlet). Condition 1 had a sample size of 500 per group and ten items per testlet. Condition 2 had a sample size of 1,000 per group and ten items per testlet. Condition 3 had a sample size of 500 per group and five items per testlet. Condition 4 had a sample size of 1,000 per group and five items per testlet. The bias and the root mean squared error (RMSE), defined below, were calculated for each parameter and reported for assessing the model parameter recovery:

\mathrm{Bias} = \frac{\sum_j \hat{\xi}_j}{h} - \xi \quad \text{and} \quad \mathrm{RMSE} = \sqrt{ \frac{\sum_j \left( \hat{\xi}_j - \xi \right)^2}{h} },

where ξ is a true (data-generating) parameter of interest, \hat{\xi}_j is the estimate from the jth replication, h is the total number of replications, and the summation is across j. The number of replications was 200 for Conditions 1 and 2, whereas it was 100 for Conditions 3 and 4. The reduced number of replications in Conditions 3 and 4 was due to the increased model estimation time that resulted from the larger number of testlets, which in turn leads to a larger number of dimensions. (Conditions 3 and 4 had four testlets in a test, whereas Conditions 1 and 2 had two testlets. In both Conditions 3 and 4, fitting the DIF decomposition model was equivalent to estimating a five-dimensional model. A single replication run took about 2.5 h on the personal computers available to the authors, as opposed to a few minutes for a single replication run under Conditions 1 and 2.) Note that varying the sample size and the number of items per testlet was identified by Beretvas and Walker (2012) as part of the future investigation necessary for a full evaluation of DIF decomposition modeling.

In all conditions, the magnitude of the testlet effect, σ²d, was set at 1 across all testlets, and the impact was set at 1 [i.e., θR ∼ N(0, 1) and θF ∼ N(−1, 1) for the reference and focal groups, respectively; the reference group ability mean was higher than the focal group ability mean by one standard deviation]. These chosen magnitudes of the testlet effect and the impact are generally considered large, so our MML-EM DIF decomposition modeling approach was tested under conditions that are unfavorable for DIF methods. Item difficulty parameters were generated from the standard normal distribution [i.e., βi ∼ N(0, 1)]. There was one studied/suspect testlet whose items were the studied items for the investigation of IS_DF (−κd(i)) and TS_DF (λd); across all conditions, the second testlet was the studied testlet. For the five-item-per-testlet case, the IS_DF values were .3, −.5, .3, .1, and −.2 for the five items, respectively, and TS_DF was set at −.3. Thus, the item-level DIF (using Eq. 4) was 0, −.8, 0, −.2, and −.5; that is, the first and third items had no DIF (due to DIF cancellation), whereas the remaining items were simulated as exhibiting DIF against the focal group. Note that the first, third, and fourth IS_DF values (.3, .3, and .1) indicate IS_DF against the reference group. The simulated DIF sizes at the item level correspond to large (the second item, −.8), medium (the fifth item, −.5), and negligible (the first, third, and fourth items, 0, 0, and −.2) DIF under the Rasch DIF classification rules (see, e.g., Paek & Wilson, 2011). These sets of IS_DF and TS_DF parameter values were used twice for the conditions having ten items per testlet.
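To make the data-generating design concrete, here is a short R sketch (our illustration, not the authors' simulation code) of one replication under the five-items-per-testlet condition, together with bias and RMSE helpers matching the formulas above. The IS_DF and TS_DF terms are applied to focal-group respondents only, which is our reading of Eq. 3 in light of Eq. 4 and the group-membership indicator described earlier; the seed and variable names are arbitrary.

# Sketch (ours): one replication under the five-items-per-testlet condition.
set.seed(1)

n_per_group <- 500
n_items     <- 20
testlet     <- rep(1:4, each = 5)                # four testlets of five items

beta   <- rnorm(n_items)                         # item difficulties ~ N(0, 1)
kappa  <- rep(0, n_items)
kappa[testlet == 2] <- -c(.3, -.5, .3, .1, -.2)  # kappa = -IS_DF, studied testlet 2
lambda <- ifelse(testlet == 2, -.3, 0)           # TS_DF for the studied testlet

group <- rep(c(0, 1), each = n_per_group)        # 0 = reference, 1 = focal
theta <- rnorm(2 * n_per_group,
               mean = ifelse(group == 1, -1, 0), sd = 1)            # impact = 1
gamma_nd <- matrix(rnorm(2 * n_per_group * 4, sd = 1), ncol = 4)    # testlet effects, var 1

# Eq. 3: logit(P_nid) = theta_n + zeta_g - beta_i - kappa_d(i) + lambda_d + gamma_nd(i)
eta  <- outer(theta, beta, function(t, b) t - b) +
        gamma_nd[, testlet] +
        group %o% (-kappa + lambda)              # DIF terms enter for the focal group only
resp <- matrix(rbinom(length(eta), 1, plogis(eta)), nrow = nrow(eta))

# Recovery summaries used in the article, applied to estimates from h replications
bias <- function(est, true) mean(est) - true
rmse <- function(est, true) sqrt(mean((est - true)^2))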

Simulation results

The performance of the MML-EM approach for the estimation of the DIF decomposition model was good in terms of bias and RMSE. Across all four conditions, bias ranged from −.02 to .031, with a mean of .002 and a standard deviation (SD) of .01. The RMSE values ranged from .049 to .166, with a mean of .105 and an SD of .026. In all, 65 % of the bias values were less than .01 in absolute value, and about 98 % were less than .02 in absolute value. Regarding RMSE, about 46 % of the RMSE values were less than .1, and about 94 % were less than .15. Tables 1 and 2 show the details of the parameter recovery when the sample size per group was 500.

Table 1  Parameter recovery when the sample size was 500 per group and the number of items per testlet was ten

Parameter No.   Parameter   True Value   Bias    RMSE
 1              β1           −0.188      −.009   .099
 2              β2            1.547      −.003   .121
 3              β3           −1.416       .001   .106
 4              β4            1.074       .006   .102
 5              β5            0.948       .000   .120
 6              β6            0.831       .006   .115
 7              β7           −0.418       .004   .111
 8              β8            0.520       .005   .107
 9              β9           −1.826       .003   .108
10              β10           1.442       .008   .120
11              β11           0.547       .001   .116
12              β12           0.146      −.009   .125
13              β13           0.481      −.003   .133
14              β14           0.000      −.017   .125
15              β15           0.131       .015   .124
16              β16          −0.038       .013   .129
17              β17           0.383       .007   .129
18              β18          −0.247       .000   .133
19              β19           0.845       .001   .123
20              β20           0.056       .015   .130
21              κ2(1)        −0.300       .013   .147
22              κ2(2)         0.500       .015   .154
23              κ2(3)        −0.300       .001   .147
24              κ2(4)        −0.100       .016   .149
25              κ2(5)         0.200      −.013   .154
26              κ2(6)        −0.300      −.020   .143
27              κ2(7)         0.500      −.003   .161
28              κ2(8)        −0.300       .003   .136
29              κ2(9)        −0.100      −.001   .151
30              κ2(10)        0.200       .010   .154
31              ζR − ζF      −1.000      −.003   .077
32              λ2           −0.300       .005   .109
33              σ²θ           1.000       .000   .101
34              σ²d=1         1.000       .008   .136
35              σ²d=2         1.000       .002   .122

Note. βs are item difficulties, κs are −IS_DF parameters, and λ2 is the TS_DF parameter. The impact is ζR − ζF. The variance of the target ability dimension is σ²θ, and the testlet effects are represented by σ²d=1 and σ²d=2.


Table 2  Parameter recovery when the sample size was 500 per group and the number of items per testlet was five

Parameter No.   Parameter   True Value   Bias    RMSE
 1              β1           −0.188      −.008   .106
 2              β2            1.547      −.015   .108
 3              β3           −1.416      −.010   .107
 4              β4            1.074      −.008   .105
 5              β5            0.948      −.006   .099
 6              β6            0.831       .004   .123
 7              β7           −0.418       .001   .129
 8              β8            0.520       .014   .122
 9              β9           −1.826      −.011   .143
10              β10           1.442       .009   .117
11              β11           0.547       .011   .103
12              β12           0.146       .003   .095
13              β13           0.481      −.012   .096
14              β14           0.000      −.016   .101
15              β15           0.131      −.012   .100
16              β16          −0.038      −.018   .092
17              β17           0.383      −.016   .100
18              β18          −0.247      −.007   .104
19              β19           0.845      −.018   .108
20              β20           0.056       .003   .098
21              κ2(1)        −0.300       .008   .159
22              κ2(2)         0.500      −.003   .136
23              κ2(3)        −0.300       .004   .134
24              κ2(4)        −0.100       .021   .138
25              κ2(5)         0.200       .031   .163
26              ζR − ζF      −1.000      −.005   .086
27              λ2           −0.300       .005   .122
28              σ²θ           1.000      −.012   .070
29              σ²d=1         1.000       .017   .165
30              σ²d=2         1.000      −.016   .160
31              σ²d=3         1.000       .014   .141
32              σ²d=4         1.000       .015   .144

Note. βs are item difficulties, κs are −IS_DF parameters, and λ2 is the TS_DF parameter. The impact is ζR − ζF. The variance of the target ability dimension is σ²θ, and the testlet effects are represented by σ²d=1, σ²d=2, σ²d=3, and σ²d=4.

Both the impact parameter and the differential-functioning parameters such as IS_DF and TS_DF were estimated well, indicating that our DIF decomposition modeling was clearly able to separate them out. The other two conditions had an increased sample size per group (i.e., 1,000 per group). The major impact of the increased sample size was to decrease the RMSE values. The results of these increased-sample-size conditions are compared with the conditions having a sample size of 500 per group in Fig. 1.

Figure 1 clearly shows the effect of the increased sample size on the RMSE values (i.e., decreased RMSE values), which follows the usual behavior of statistical model parameter estimation. Overall, all of the DIF decomposition model parameters were estimated successfully by the MML-EM algorithm across the variations in the number of items per testlet and in sample size. To present concrete information on how to implement the DIF decomposition model using ConQuest, Appendices A and B are provided separately.

[Figure 1 comprises four panels: bias (left column) and RMSE (right column) plotted against parameter number, for testlets of ten items (top row) and five items (bottom row), each panel comparing N = 500 and N = 1,000 per group.]

Fig. 1  Bias and RMSE. "Parameter No." on the x-axes in the top row follows the order specified in Table 1, and "Parameter No." in the bottom row follows the order specified in Table 2

Summary and discussion

Beretvas and Walker (2012) introduced a DIF decomposition model formulated using multilevel modeling with the restricted pseudolikelihood (RPL) estimation method. This article has illustrated an alternative DIF decomposition modeling approach built directly upon the IRT framework and its estimation using the MML-EM algorithm. In addition, we provided detailed information for practitioners on how to implement the MML-EM DIF decomposition model estimation using an IRT program, which should fill a void in the IRT literature, where practical information on how to implement the model has been lacking.

Because of the model's complexity and its identification constraints, using the DIF decomposition model in an exploratory manner, with all items parameterized for DIF decomposition, is not recommended, because such an exploratory approach would be unlikely to satisfy a sufficient condition for model identification. We suggest that the DIF decomposition model be used in a confirmatory manner. The first step in this confirmatory approach would be to determine sufficient conditions for the model identification. To this end, results from a conventional (uniform) DIF analysis at the item level [e.g., using the Mantel–Haenszel method (Mantel & Haenszel, 1959), the logistic regression method (Swaminathan & Rogers, 1990), or the Rasch DIF model (Paek & Wilson, 2011)] and content experts' evaluations of the sources of DIF for items could be combined. In the latter analysis, for example, content experts are asked whether an item itself or its reading passage in a reading test is more responsible for differential performance across the compared groups. When the content experts' information is available, some of the IS_DF or TS_DF parameters may be fixed at zero (e.g., IS_DF = 0 for a DIF item if content experts conclude that the DIF for that item is almost solely due to the reading passage). This first step is analogous to what is called a purification procedure in the DIF literature, in which a set of anchor items (i.e., DIF-free items) is statistically located through an iterative process within the test under investigation. Several statistical purification procedures have been proposed for different DIF methods (e.g., Wang & Su, 2004) to increase the power to detect DIF items and to decrease the potential inflation of Type I error rates due to DIF contamination within the test under investigation when internal anchor items are sought. (Locating internal anchor items plays the role of establishing the model identification condition and linking the model parameter estimates across the compared groups in an IRT DIF analysis.) We suggest substituting the first step described above for the purification procedure, since a statistical purification method for the DIF decomposition model has not yet been developed. Note that the use of content experts' evaluations was also suggested by Wang and Wilson (2005a) for the Rasch testlet DIF model identification. Once the model identification constraints are established, the second step is to run the DIF decomposition model to investigate the source(s) of DIF.² Developing a statistical purification method for the DIF decomposition model could be a focus for future research, since it is beyond the scope of the present study.

A useful statistical test in this confirmatory approach would be a model comparison between the DIF decomposition model and the testlet DIF model that has the TS_DF parameters fixed at zero. The testlet DIF model is nested within the DIF decomposition model, and the null hypothesis in this model comparison test is equivalent to H0: λd∈S = 0, where S is the set including all of the studied testlets. Rejection of this test by either the likelihood ratio test or the Wald test would indicate that the DIF decomposition model should be preferred—that is, it would confirm that IS_DF and TS_DF should be parameterized separately.³

² Compared with the constraint "∃i such that κd(i) = 0 at a given dth testlet," it seems to the authors that Σi κd(i) = 0 is more challenging to satisfy for the model identification, even with the aid of the content experts' evaluations. We conjecture that in practice the constraint "∃i such that κd(i) = 0 at a given dth testlet" would be the one most likely to be used.

³ To conduct the model comparison test between the DIF decomposition model and the testlet DIF model—or, equivalently, to test H0: λd∈S = 0—one can use either a likelihood ratio test or a z test as the Wald-test replacement. The z test for a scalar parameter is λ̂d/SE(λ̂d), where SE(λ̂d) is the standard error of λ̂d. We recommend the use of the likelihood ratio test, because the current ConQuest program for the estimation of the DIF decomposition model does not permit full estimation of the information matrix but instead employs a diagonal matrix approximation, which results in the underestimation of standard errors (see Wu, Adams, Wilson, & Haldane, 2007, p. 140) and in turn leads to a liberal z-test (or Wald test) result. To conduct the likelihood ratio test comparing the two models, estimate the DIF decomposition model and the testlet DIF model in which all λd parameters are dropped. (Thus, the "rg" file should be modified appropriately so that all λds are fixed at zero, while the design matrix and all other files stay the same.) The difference in the deviances between the testlet DIF model and the DIF decomposition model is an approximate chi-square value with degrees of freedom (df) equal to the difference in the number of parameters between the models (i.e., the number of λds in the DIF decomposition model). The deviance value for each model is shown as the "Final deviance" in the shw file.
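For completeness, a small R sketch of the deviance comparison described in this footnote follows; the deviance values and the number of λd parameters are hypothetical placeholders to be replaced with the "Final deviance" values from the two ConQuest runs.

# Sketch (ours): likelihood ratio test comparing the testlet DIF model (all
# lambda_d = 0) with the DIF decomposition model, using the "Final deviance"
# values reported in each model's shw file.  The values below are placeholders.
deviance_testlet_dif <- 24615.3   # hypothetical value from the testlet DIF model run
deviance_decomp      <- 24603.8   # hypothetical value from the DIF decomposition run
n_lambda             <- 1         # number of lambda_d parameters (studied testlets)

lr_stat <- deviance_testlet_dif - deviance_decomp
p_value <- pchisq(lr_stat, df = n_lambda, lower.tail = FALSE)
c(LR = lr_stat, df = n_lambda, p = p_value)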


The MML-EM estimation worked well and recovered all model parameters successfully across different numbers of items per testlet and different sample sizes. In particular, we demonstrated successful separation and estimation of the impact and the differential-functioning parameters, which had not been clearly examined in the previous literature on DIF decomposition modeling.

Because the MML-EM estimation process approximates the objective function directly by the quadrature method, the estimation cost (run time) increases as the number of testlets increases. The ConQuest program offers a Monte Carlo approximation as an alternative, to ease the computational burden in some limited model cases, but unfortunately the version used in this study does not support the Monte Carlo approximation for the estimation of the DIF decomposition model. Additional options to enhance the model's estimation efficiency include an adaptive quadrature method (Schilling & Bock, 2005) and the Metropolis–Hastings Robbins–Monro (MH-RM) algorithm (Cai, 2010), both of which have been proposed for reducing the computational burden of high-dimensional IRT model estimation. DIF decomposition modeling and its estimation under these quadrature and/or MH-RM algorithms could be investigated in future studies.

When the number of testlets (and the number of items per testlet) is large, the RPL method could be computationally cheaper than the MML-EM approach because of its model linearization. However, we note that for a given number of testlets (e.g., a two-testlet test with ten items per testlet), our small-scale simulation experience, whose details are not reported here, suggested that the MML-EM estimation (using ConQuest) was more efficient than the RPL method (using SAS) with a relatively large sample size (e.g., 1,000 per group). Comparing the RPL method (or its variations), which is frequently used in multilevel model estimation, with the MML-EM method (direct quadrature, adaptive quadrature) and/or MH-RM could provide useful information for researchers and practitioners regarding model parameter recovery under different estimation algorithms, their estimation costs, and the choice of an estimation procedure.

Appendix A: Implementation of the DIF decomposition model

To estimate the DIF decomposition model using ConQuest, one needs to prepare a command file (which we call a "cm" file), a file specifying the parameters σ²θ and σ²d (which we call a "cov" file), a data file that is formatted specifically for our DIF decomposition model (which we call a "dat" file), a

file specifying the impact parameter ζg and the TS_DF parameter λd (which we call an "rg" file), and, finally, the design matrix file specifying the item parameter βi and the negative of the IS_DF parameter κd(i) (which we call a "dgn" file). (Note that the λd parameter can be specified as part of the ConQuest design matrix, but in our presentation here, to simplify the design matrix, we modeled λd using what is known as the latent regression in ConQuest. Using both a design matrix and the latent regression is equivalent to modeling λd in ConQuest. Using the latent regression for λd also makes convenient the specification and estimation of the Rasch testlet DIF model, which is needed for the likelihood ratio test comparing the DIF decomposition model and the Rasch testlet DIF model, because the same design matrix can be used. If λd is instead specified in the design matrix, the design matrix has to be modified for the Rasch testlet DIF model case. See note 3 for more details on how to conduct the likelihood ratio test for the model comparison.) All of these files should be in the same folder as the ConQuest executable console program. When all of these files are prepared, you can start the estimation by typing "submit" with the cm file name at the prompt after the console version of ConQuest is opened.

The contents of all of these files are provided below. For the purpose of illustration, suppose that we have a nine-item test composed of three testlets, each of which has three items, and that the items in the second and third testlets are the studied items (Items 4, 5, and 6 in the second testlet and Items 7, 8, and 9 in the third testlet).

CM file

reset;
set update = yes;
set warnings = no;
datafile example.dat;
format response 1–18 grp 19 ;
code 0 1;
score (0 1) (0 1) (0 1) ( ) ( ) ! items (1–3, 10–12);
score (0 1) (0 1) ( ) (0 1) ( ) ! items (4–6, 13–15);
score (0 1) (0 1) ( ) ( ) (0 1) ! items (7–9, 16–18);
import designmatrix
