JSLHR

Research Article

Development and Simulation Testing of a Computerized Adaptive Version of the Philadelphia Naming Test

William D. Hula,a,b Stacey Kellough,a and Gerasimos Fergadiotisc

a Geriatric Research, Education, and Clinical Center, VA Pittsburgh Healthcare System, PA
b University of Pittsburgh, PA
c Portland State University, OR

Correspondence to William D. Hula: [email protected]
Editor: Rhea Paul
Associate Editor: Margaret Blake
This article is a companion to Fergadiotis et al., "Item Response Theory Modeling of the Philadelphia Naming Test," JSLHR, doi:10.1044/2015_JSLHR-L-14-0249
Received October 22, 2014; Revision received January 26, 2015; Accepted February 26, 2015
DOI: 10.1044/2015_JSLHR-L-14-0297
Disclosure: The authors have declared that no competing interests existed at the time of publication.

Purpose: The purpose of this study was to develop a computerized adaptive test (CAT) version of the Philadelphia Naming Test (PNT; Roach, Schwartz, Martin, Grewal, & Brecher, 1996), to reduce test length while maximizing measurement precision. This article is a direct extension of a companion article (Fergadiotis, Kellough, & Hula, 2015), in which we fitted the PNT to a 1-parameter logistic item-response-theory model and examined the validity and precision of the resulting item parameter and ability score estimates. Method: Using archival data collected from participants with aphasia, we simulated two PNT-CAT versions and two previously published static PNT short forms, and compared the resulting ability score estimates to estimates obtained from the full 175-item PNT. We used a jackknife procedure to maintain independence of the samples used for item estimation and CAT simulation. Results: The PNT-CAT recovered full PNT scores with equal or better accuracy than the static short forms. Measurement precision was also greater for the PNT-CAT than the static short forms, though comparison of adaptive and static nonoverlapping alternate forms showed minimal differences between the two approaches. Conclusion: These results suggest that CAT assessment of naming in aphasia has the potential to reduce test burden while maximizing the accuracy and precision of score estimates.

In this article, we extend work presented in a companion article (Fergadiotis, Kellough, & Hula, 2015) to construct and evaluate an item response theory (IRT)–based computer adaptive version of the Philadelphia Naming Test (PNT; Roach, Schwartz, Martin, Grewal, & Brecher, 1996). Using simulations based on responses previously collected from participants with aphasia, we evaluated agreement between computer adaptive short forms and the full PNT and compared the results to those obtained using recently developed static short forms (Walker & Schwartz, 2012). We also evaluated the equivalence of alternate test forms created by the adaptive-testing algorithm.

The PNT is a prominent naming test that possesses favorable psychometric properties (Roach et al., 1996; Walker & Schwartz, 2012) and has been widely used in aphasia research (e.g., Beeson et al., 2011; Dell, Schwartz, Martin, Saffran, & Gagnon, 1997; Dell, Schwartz, Nozari, Faseyitan, & Coslett, 2013; Fridriksson et al., 2007; Ruml, Caramazza, Shelton, & Chialant, 2000; Schwartz, Dell, Martin, Gahl, & Sobel, 2006). The PNT stimuli consist of 175 black-and-white line drawings whose target names are common nouns of one to four syllables in length, with frequencies of occurrence (Francis & Kučera, 1982) from 1 to 2,110 per million (Roach et al., 1996). The PNT total correct score has demonstrated excellent test–retest reliability (.99) in adults with aphasia and has been shown to correlate highly with overall aphasia severity (.85) as measured by the Western Aphasia Battery–Aphasia Quotient (Walker & Schwartz, 2012). The PNT has also been shown to be uncorrelated with demographic variables such as race, age, gender, and socioeconomic status (Walker & Schwartz, 2012). These positive psychometric attributes of the PNT and its strong theoretical basis make it a potentially useful clinical tool, but, as noted by Walker and Schwartz, its length makes it unsuitable for most clinical settings. To address this issue, Walker and Schwartz (2012) developed two 30-item PNT short forms (PNT30A and

PNT30B). The items constituting the short forms were selected to produce error distributions that matched the error distribution of the full PNT across severity levels and to match the full PNT in terms of target-word frequency, length in phonemes, and distribution of semantic categories. Both PNT short forms produced scores that correlated highly with the full PNT (r = .93 and .98), with mean percent correct scores that were slightly, though significantly, higher (4 and 6 points, respectively) than that of the full PNT. These results suggested that the PNT could be shortened by up to 80% and still maintain most of its positive features (Walker & Schwartz, 2012).

In the present study, we took an alternative approach to shortening the PNT based on adaptive testing supported by IRT. The major advantage of this approach is that it has the potential to provide better measurement precision than static short forms such as those developed by Walker and Schwartz, especially for participants with particularly mild or severe naming impairment.

IRT

IRT (Lord, Novick, & Birnbaum, 1968) is a framework for the construction and evaluation of psychological tests in which the construct being measured is conceptualized as a latent (unobservable) continuum along which both items and persons can be situated. IRT models generally seek to predict how individual persons will respond to particular items, based on mathematical representations of relevant characteristics of each. IRT and related models are currently widely used in clinical and rehabilitation measurement (e.g., Cella et al., 2010; Eadie et al., 2006; Jette et al., 2007; Reeve et al., 2007) and have recently seen increased use in speech-language pathology for the analysis of both self-reported (Babbitt, Heinemann, Semik, & Cherney, 2011; Baylor et al., 2013; Donovan, Velozo, & Rosenbek, 2007; Doyle, Hula, McNeil, Mikolic, & Matthews, 2005; Hula, Doyle, & Austermann Hula, 2010) and performance-based (Beltyukova, Stone, & Ellis, 2008; del Toro et al., 2011; Doyle et al., 2005; Edmonds & Donovan, 2012; Gutman, DeDe, Michaud, Liu, & Caplan, 2010; Hula, Donovan, Kendall, & Gonzalez-Rothi, 2010; Hula, Doyle, McNeil, & Mikolic, 2006) assessments.

The one-parameter logistic (1-PL) IRT model defines the probability that an examinee responds correctly to an item, given person ability and item difficulty. The 1-PL model can be represented mathematically as

$$P(x_i = 1 \mid q_j) = \frac{e^{a(q_j - d_i)}}{1 + e^{a(q_j - d_i)}},$$

where P(xi = 1|qj) is the probability of a correct response xi = 1 by examinee j given her latent trait qj, a is the item discrimination parameter (assumed to be equal for all items), and di is item i's difficulty parameter. Item difficulty describes the location of an item on the ability spectrum; it is typically defined such that the probability of a correct response is 50% when ability and difficulty are equal. Items with higher difficulty values are more likely to produce an incorrect response from a given person, and correspondingly, persons with higher ability are more likely to respond correctly to a given item. Ability is often scaled to have M = 0 and SD = 1 in the calibration sample, with the item difficulties placed on the same scale. Item discrimination describes how well items distinguish between examinees with different latent-trait values. Whereas the 1-PL model has a single discrimination parameter common to all items, the two-parameter logistic (2-PL) model permits discrimination to vary across items. Figure 1a presents the 1-PL model in graphical terms, in the form of item characteristic functions for two items from the PNT ("key," "bottle"). For the item "bottle" (di = −0.2), the model predicts that 50% of persons with q (ability) = −0.2 will answer correctly, whereas approximately 90% of such persons are expected to name "key" (di = −1.80) correctly.

In a companion article (Fergadiotis et al., 2015), we examined the fit of the 1-PL and 2-PL models to an archival data set of PNT responses collected from 251 persons with aphasia. We found that the 2-PL model fitted the data marginally better, but judged that the fit of the 1-PL model was adequate. The results also suggested that the item difficulty and person ability estimates were quite precise, suggesting good potential to shorten the PNT via IRT-based adaptive testing. A summary of the fit analysis, an item–person map with selected item content, and a complete table of item parameter estimates and item-level fit statistics can be found in the online supplemental materials.

To investigate the validity of the PNT, we also examined the extent to which item difficulty values estimated by the model could be predicted by lexical properties of the target names. Word length in phonemes, rated age of acquisition (Kuperman, Stadthagen-Gonzalez, & Brysbaert, 2012), and a measure of lexical frequency (Lg10CD frequency norms from the SUBTLEXUS corpus; Brysbaert & New, 2009) each contributed significant unique variance, with the overall model accounting for 63% of the variance in item difficulty estimates. Because the effects of these variables on naming performance are either instantiated in or consistent with current models of word production (Dell, 1986, 1990; Levelt, Roelofs, & Meyer, 1999), this result provides support for the construct validity of the PNT as a measure of naming ability. The relatively strong predictive power of these variables may also make them useful for generating items in the context of IRT-based assessment of naming.

Computerized Adaptive Testing

A primary rationale for adaptive testing is that by administering only those items targeted to provide the most information about a particular examinee, one can shorten the test without compromising measurement precision (de Ayala, 2009; Wainer, 2000). Basal and ceiling administration rules such as those included in the Boston Naming Test (Kaplan, Goodglass, & Weintraub, 1983) and Peabody Picture Vocabulary Test–Revised (Dunn & Dunn, 1981) are one way of achieving this. Computerized adaptive testing (CAT) methods take advantage of the features of IRT models to make the process more flexible and powerful.


Figure 1. (a). Item characteristic functions of two items. (b). Item and test information functions and the corresponding standard error of measurement for a two-item test. (c). Item and test information functions and the corresponding standard error of measurement for a four-item test.

Information is a key statistical concept in IRT. Under the theory, each item of the test measures the latent construct, and each item characteristic curve can be transformed into an item information function. The item information function reflects the information (i.e., precision) an item contributes at any level of the latent trait for the estimation of a parameter (e.g., the ability estimate). Thus, the amount of information of any single item can be estimated at any ability level q and is denoted by I_i(q), where i indexes the item. Conceptually, information can be thought of as an index quantifying an item's capacity to decrease the uncertainty in an ability estimate. For the 1-PL model, an item's information at a given q level is defined as

$$I_i(q) = P_i(q)\,Q_i(q),$$

where P_i(q) is the probability of a correct response at a given q level and Q_i(q) is its complement.
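To make these quantities concrete, the following sketch implements the 1-PL response probability and item information, together with the test information and standard error formulas defined in the next few paragraphs, and reproduces the predictions given above for "bottle" and "key." The discrimination value (a = 1.4) and all function and variable names are illustrative assumptions of ours, not estimates or code from the PNT calibration.

```python
import math

# Item difficulties for the two example items discussed in the text (Figure 1a)
EXAMPLE_ITEMS = {"bottle": -0.2, "key": -1.80}

def p_correct(q, d, a=1.4):
    """1-PL probability of a correct response at ability q for an item of difficulty d.
    The common discrimination a = 1.4 is an illustrative placeholder, not the published estimate."""
    return 1.0 / (1.0 + math.exp(-a * (q - d)))

def item_information(q, d, a=1.4):
    """Item information; with a common discrimination a, I_i(q) = a^2 * P_i(q) * Q_i(q).
    The article's expression I_i(q) = P_i(q)Q_i(q) corresponds to a = 1."""
    p = p_correct(q, d, a)
    return a * a * p * (1.0 - p)

def test_information(q, difficulties, a=1.4):
    """Test information is the sum of the item information values at ability level q."""
    return sum(item_information(q, d, a) for d in difficulties)

def sem(q, difficulties, a=1.4):
    """Standard error of measurement: SEM(q) = 1 / sqrt(I(q))."""
    return 1.0 / math.sqrt(test_information(q, difficulties, a))

# A person at q = -0.2 is predicted to name "bottle" correctly about half the time
# and the easier item "key" roughly 90% of the time.
for name, d in EXAMPLE_ITEMS.items():
    print(name, round(p_correct(-0.2, d), 2))
print("SEM of the two-item test at q = -0.2:", round(sem(-0.2, list(EXAMPLE_ITEMS.values())), 2))
```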


Figures 1b and 1c show the item information functions of two and four items, respectively, from the PNT. Note that information depends on how closely the difficulty of the items matches the ability of the person; therefore, items are maximally informative for q levels near each item's difficulty level. When multiple items are administered, information has an additive nature. The test information function represents the sum of the item information estimates of all the items at a given ability level:

$$I(q) = \sum_i I_i(q).$$

Figures 1b and 1c show how test information increases after the administration of additional items (solid line). Importantly, information determines the precision with which a parameter (e.g., a participant's ability level) can be estimated (dashed line).


The standard error of measurement (SEM) at a given q level can be defined as

$$SEM(q) = \frac{1}{\sqrt{I(q)}}.$$

The SEM can be used to build 95% confidence intervals around parameter estimates, given that it is expressed on the same metric as the latent-trait level:

$$95\%\ CI = \text{Ability Estimate}(q) \pm 1.96 \times SEM(q).$$

Because IRT-based person score estimates are independent of the particular items administered, different sets of items may be given to different examinees with scores that are expressed on a single common scale. Also, the fact that IRT models express the precision of score estimates as a function of individual ability level permits the test user to select the items that will contribute the most to the precision of a given examinee's score estimate. In the case of static short forms such as the 30-item PNT short forms by Walker and Schwartz (2012) or the Boston Naming Test short form developed by del Toro et al. (2011), score estimates will be most precise for those participants who are best targeted by the particular items chosen, typically with progressively less precise estimates for participants with milder or more severe impairment. On the other hand, a CAT based on a sufficiently large, well-calibrated item bank can provide score estimates that better maintain their precision across the range of ability (de Ayala, 2009; Wainer, 2000).

A basic CAT administration typically begins with a provisional ability estimate, often the average in the test calibration sample. The item that provides the most information at that average ability estimate is selected from the bank, administered, and scored. Based on that item score, a revised ability estimate is computed. A new item that provides the most information at the revised ability estimate is then selected, administered, and scored. The ability estimate is updated again, now based on the two collected responses. This procedure is repeated until a stopping rule is satisfied. The stopping rule is specified either in terms of a number of items, for a fixed-length CAT; a desired level of measurement precision, for a variable-length CAT; or a combination of the two. With a fixed-length CAT, participants who are best targeted by the item bank will receive more precise score estimates than those with higher or lower scores. With a variable-length CAT, participants who are better targeted by the item bank will receive fewer items than those whose ability levels are less well matched by the item bank. A flowchart for a basic CAT is presented in Figure 2.

Figure 2. Flowchart of a computerized adaptive test.

One obvious potential application of a computer adaptive version of the PNT (PNT-CAT) is as a substitute for the full PNT, in order to reduce response burden and administration time. In order to be valid for this purpose, the PNT-CAT should produce score estimates that have acceptable precision and agree well with the full PNT in both relative and absolute terms. A second potential application is to

provide equivalent alternate short forms with nonoverlapping item content, for the purpose of measuring change over time while minimizing test practice effects. This was an explicit goal of Walker and Schwartz (2012) in developing two distinct static 30-item PNT short forms. In order to be valid for this purpose, the alternate forms should produce score estimates that are acceptably precise and, in the absence of underlying change in ability, show high relative and absolute agreement with each other.

The purpose of the current study was to develop an IRT-based CAT version of the PNT and compare it to the existing static PNT short forms developed by Walker and Schwartz (2012). After we established that the PNT response data could be adequately fitted to an appropriate IRT model (Fergadiotis et al., 2015; see online supplemental materials, Supplemental Materials 1, for summary), we used data previously collected from participants with aphasia on the full 175-item PNT to simulate the PNT-CAT and the static short forms and to compare them in terms of their agreement with scores derived from the full test.


We also compared the simulated adaptive and static short forms in terms of the modeled precision of the ability estimates. Finally, we asked whether the PNT-CAT algorithm could be used to generate equivalent alternate test forms with nonoverlapping item content.

Method

Participants

Archival PNT response data from 251 participants with a complete first administration of the PNT were obtained from the Moss Aphasia Psycholinguistics Project Database (Mirman et al., 2010) on May 6, 2012, for use in the current study. All 251 participants were people living in the community who had experienced a left-hemisphere stroke resulting in aphasia, were right-handed native English speakers, and had no comorbid neurologic illness or history of psychiatric illness (Mirman et al., 2010). See Fergadiotis et al. (2015) for a complete description of the preparation of the data set. Demographic data and descriptive statistics are provided in Table 1. Of note, the current data set includes the 150 participants used by Walker and Schwartz (2012) in their initial development and validation of the static PNT short forms.

Table 1. Demographic and clinical characteristics of the participants.

Characteristic                              Value
Ethnicity
  African American                          34%
  Asian                                     0.4%
  Hispanic                                  1.2%
  White                                     44%
  Missing                                   20%
Education (years)
  M                                         13.6
  SD                                        2.8
  Minimum                                   7
  Maximum                                   21
  Missing                                   20%
Age (years)
  M                                         58.8
  SD                                        13.2
  Minimum                                   22
  Maximum                                   86
  Missing                                   20%
Months since aphasia onset
  M                                         32.9
  SD                                        51.0
  Minimum                                   1
  Maximum                                   381
  Missing                                   20%
Western Aphasia Battery–Aphasia Quotient
  M                                         73.4
  SD                                        16.6
  Minimum                                   27.2
  Maximum                                   97.8
  Missing                                   51%
Philadelphia Naming Test (% correct)
  M                                         61%
  SD                                        28%
  Minimum                                   1%
  Maximum                                   98%

Note. Total sample was N = 251 participants.

PNT-CAT Assessment

Using the 1-PL model, we conducted a real-data simulation study to compare two PNT-CAT versions to the previously published static short forms. In a real-data CAT simulation, actual responses collected in a standard administration are used to simulate the CAT procedure. This is done by selecting from the data set the response to a particular item whenever the CAT algorithm selects that item for administration. Thus, the CAT ability estimates are based on subsets of the response strings given to the full test, and the simulation permits us to estimate how a participant would have scored had she taken only the shorter PNT-CAT.

The software used to conduct the simulations was written by the second author of this article. It provides for simulation of IRT-based adaptive and static tests with a variety of stopping rules and scoring methods (discussed later). The item selection and scoring algorithms were tested against results from published software packages for IRT scoring and CAT simulation, including Multilog (Thissen, Chen, & Bock, 2003) and Firestar (Choi, 2009).

The first version of the PNT-CAT was a fixed-length CAT with the stopping rule set to 30 items in order to make it maximally comparable to the Walker and Schwartz (2012) short forms. We refer to this version as PNT-CAT30. The second version was a variable-length CAT with the stopping rule set to a maximum standard error of 0.33 or administration of 100 items, whichever was achieved first. This stopping rule was set to implement a variable-length CAT that would administer an average of approximately 30 items to the participants in the present test sample. We refer to this variable-length version as PNT-CATVL.

For both CAT versions, the initial item was the one that provided maximum information at the average ability estimate for the calibration sample ("sailor"). In the first iteration of the CAT algorithm for each subject, the response to this item was evaluated and new ability and standard error estimates were computed. Then the next item selected was the one that provided maximum information at the new ability estimate. This procedure was reiterated until the stopping rule was satisfied.

There are multiple methods available for computing score estimates for IRT-based tests. In the present CAT simulations, ability estimates were computed using expected a posteriori scoring whenever response strings were extreme (only for the first few items) and maximum-likelihood estimation whenever the response string included both correct and incorrect responses (after the first few items). This approach is typically employed because maximum-likelihood estimation is free of distributional assumptions, whereas expected a posteriori scoring requires prior assumptions about ability and may introduce bias. However, maximum-likelihood estimation returns infinite ability estimates whenever the response string is extreme (i.e., consists of all correct or all incorrect responses), which often happens during the first few items administered.
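The core of the real-data simulation procedure just described can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' software: items are selected by maximum information at the current estimate, the archived response to the selected item is scored, and a crude grid search with a weak normal prior (used only while the response string is extreme) stands in for the expected a posteriori/maximum-likelihood scoring described above. All names and the placeholder discrimination value are ours.

```python
import math

def p_correct(q, d, a=1.0):
    # 1-PL probability of a correct response (a = 1.0 is a placeholder discrimination)
    return 1.0 / (1.0 + math.exp(-a * (q - d)))

def information(q, d, a=1.0):
    p = p_correct(q, d, a)
    return a * a * p * (1.0 - p)

def estimate_ability(responses, difficulties):
    """Grid-based ability estimate: pure likelihood when responses are mixed,
    with a weak normal prior added only while the response string is extreme."""
    grid = [g / 10.0 for g in range(-60, 61)]          # ability grid from -6.0 to 6.0
    extreme = len(set(responses)) < 2
    best_q, best_ll = 0.0, -float("inf")
    for q in grid:
        ll = -0.5 * q * q if extreme else 0.0          # prior term only for extreme strings
        for x, d in zip(responses, difficulties):
            p = p_correct(q, d)
            ll += math.log(p if x == 1 else 1.0 - p)
        if ll > best_ll:
            best_q, best_ll = q, ll
    return best_q

def simulate_cat(archived_responses, item_bank, max_items=30, se_target=None):
    """Real-data CAT simulation: whenever an item is selected, the participant's
    archived response to that item (0/1) is used in place of a new administration.
    archived_responses: dict item name -> 0/1; item_bank: dict item name -> difficulty."""
    ability, responses, difficulties, administered = 0.0, [], [], []
    available = dict(item_bank)
    standard_error = float("inf")
    while available and len(administered) < max_items:
        # Select the remaining item with maximum information at the current estimate
        item = max(available, key=lambda name: information(ability, available[name]))
        administered.append(item)
        responses.append(archived_responses[item])
        difficulties.append(available.pop(item))
        ability = estimate_ability(responses, difficulties)
        standard_error = 1.0 / math.sqrt(sum(information(ability, d) for d in difficulties))
        if se_target is not None and standard_error <= se_target:
            break
    return ability, standard_error, administered
```

Under these assumptions, the fixed-length PNT-CAT30 corresponds to max_items=30 with no precision target, the variable-length PNT-CATVL to max_items=100 with se_target=0.33, and a complement-test score can be obtained by passing the items that were not administered to the same scoring routine.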


In order to make the data set used to evaluate CAT performance independent of the set used to calibrate the items, we used a leave-one-out, or jackknife, strategy (Miller, 1974). We used 250 participants to estimate the 1-PL model item parameters, used the remaining participant to estimate scores for the full PNT and the simulated adaptive and static short forms, and then reiterated this procedure until each participant had served in the calibration sample 250 times and in the test sample once (251 iterations in total). The advantage of this approach was that it maximized the size of the calibration sample while maintaining independence between the calibration and test samples.

We compared the four short versions to one another in terms of their agreement with the full PNT using four variables: correlation, root-mean-square deviation (RMSD), bias, and variable error. The correlations provide an overall index of relative agreement between the short forms and the full PNT. RMSD is an index of absolute agreement and is calculated as

$$RMSD(\hat{q}) = \sqrt{\frac{\sum_{j=1}^{n} (\hat{q}_j - q_j)^2}{n}},$$

where $\hat{q}_j$ is the ability estimate from the short version of the PNT for person j, $q_j$ is the estimate from the full PNT, and n is the number of persons (n = 251). Low RMSD values indicate more accurate measurement, based on agreement with the full PNT estimates. RMSD represents the total error between the two PNT versions and can be decomposed into constant error (bias), which is the average signed difference score, and variable error, which is the standard deviation of the difference scores (Schmidt & Lee, 2005).

As noted before, the response strings used to estimate scores on the four short versions were subsets of the responses to the full PNT. The estimates of agreement with the full PNT thus likely represent an upper bound of the agreement that would be observed in practice. In order to reduce the dependency between the short and full versions and produce estimates that may be more realistic, we also compared the short-form score estimates to estimates obtained from the nonoverlapping items remaining in the data set after each short-form administration. We refer to these tests composed of the remaining nonoverlapping items as complement tests. For example, when the PNT-CAT30 simulation selected a particular set of 30 items for a participant, we estimated a PNT-CAT30-complement score from the remaining 145 PNT items.

We also computed the marginal reliability for each PNT short form. Marginal reliability is conceptually similar to Cronbach's alpha and is computed as

$$r = 1 - \frac{\sigma^2_\varepsilon}{\sigma^2_\theta},$$

where $\sigma^2_\theta$ is the total variance of the ability estimates and $\sigma^2_\varepsilon$ is their average error variance (Thissen, 2000).

Finally, in order to evaluate the ability of the PNT-CAT to provide equivalent nonoverlapping alternate short forms, we compared score estimates obtained from two simulated PNT-CAT administrations in which the algorithm for the second CAT was adjusted to prevent administration of any items included in the first CAT. In this simulation, the first PNT-CAT was the PNT-CAT30. For the second PNT-CAT, we selected the initial item as the one that provided the most information at the ability estimate obtained from the PNT-CAT30 and set the variable-length stopping rule as precision equal to that obtained with the PNT-CAT30. We compared the agreement between these two simulated PNT-CATs to the agreement between the Walker and Schwartz (2012) short forms (PNT30A, PNT30B).

We estimated the precision of all statistics using two-tailed bootstrapped 95% bias-corrected and accelerated confidence intervals (DiCiccio & Efron, 1996) using the R package boot, version 1.3-11 (Canty & Ripley, 2014). We also computed bootstrapped confidence intervals about the relevant differences between the static and CAT short forms to assess the statistical significance of those comparisons. In addition to the statistical comparisons of the short forms, we inspected scatter plots of the ability estimates provided by each of the shortened versions over those provided by the full 175-item PNT, as well as plots of the standard error curves for each version.
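The agreement and reliability indices described above are straightforward to compute once the full-test and short-form ability estimates are in hand. The sketch below follows the definitions in the text; the function names are ours, and the bias-corrected and accelerated bootstrap confidence intervals (computed in the article with the R package boot) are not reproduced here.

```python
import math

def agreement_stats(short_scores, full_scores):
    """Correlation, RMSD (total error), bias (constant error), and variable error
    between short-form and full-test ability estimates, as defined in the text."""
    n = len(short_scores)
    diffs = [s - f for s, f in zip(short_scores, full_scores)]
    bias = sum(diffs) / n                                         # mean signed difference
    variable_error = math.sqrt(sum((d - bias) ** 2 for d in diffs) / n)
    rmsd = math.sqrt(sum(d * d for d in diffs) / n)               # rmsd^2 = bias^2 + variable_error^2
    ms, mf = sum(short_scores) / n, sum(full_scores) / n
    cov = sum((s - ms) * (f - mf) for s, f in zip(short_scores, full_scores)) / n
    sd_s = math.sqrt(sum((s - ms) ** 2 for s in short_scores) / n)
    sd_f = math.sqrt(sum((f - mf) ** 2 for f in full_scores) / n)
    return {"r": cov / (sd_s * sd_f), "rmsd": rmsd, "bias": bias, "variable_error": variable_error}

def marginal_reliability(ability_estimates, error_variances):
    """Marginal reliability: r = 1 - (average error variance) / (variance of the ability estimates)."""
    n = len(ability_estimates)
    mean_ability = sum(ability_estimates) / n
    total_variance = sum((t - mean_ability) ** 2 for t in ability_estimates) / n
    return 1.0 - (sum(error_variances) / n) / total_variance
```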

Analysis and Results

Agreement of the Adaptive and Static Short Forms With the Full PNT

In order to provide context for the CAT simulation results, Table 2 provides descriptive data for the naming-ability estimates based on the full 175-item PNT (using item estimates obtained from the full 251-person sample and from the 250-person jackknife samples), each simulated short form, and each short form's complement test.

The simulated PNT-CAT30, PNT-CATVL, PNT30A, and PNT30B all correlated highly (> .94) with the full PNT and with their respective complement tests. The correlations are presented in Table 3, along with comparisons between the adaptive and static short forms. For each short form, the correlation with its complement test was lower than the correlation with the full PNT, and the differences were larger for the adaptive forms. In comparing the adaptive and static short forms, both the PNT-CAT30 and the PNT-CATVL obtained a significantly (p < .05) higher correlation with the full PNT than the PNT30A did, and the PNT-CATVL correlated more strongly with the full PNT than the PNT30B did. For the complement-test correlations, none of the comparisons was significant. Scatter plots of CAT and short-form ability estimates over full-PNT scores are presented in Figure 3. It is apparent from the plots that the PNT30A did not discriminate among subjects at the high or low ends of the ability range as well as the CAT versions did.


Table 2. Descriptive statistics for scaled naming-ability scores.

Test version              M     SD    Minimum  Maximum  Mean standard error
Full PNT                  0.10  1.44  −3.55    3.22     0.19
Full PNT (jackknife)      0.04  1.48  −4.31    3.14     0.19
PNT-CAT30                 0.10  1.47  −4.18    2.89     0.34
PNT-CAT30-Complement      0.03  1.51  −4.42    3.46     0.25
PNT-CATVL                 0.08  1.46  −4.75    3.02     0.34
PNT-CATVL-Complement      0.04  1.40  −3.35    2.57     0.27
PNT30A                    0.09  1.47  −3.50    2.11     0.48
PNT30A-Complement         0.04  1.48  −4.15    3.39     0.21
PNT30B                    0.08  1.49  −3.47    2.28     0.47
PNT30B-Complement         0.10  1.68  −4.18    2.87     0.23

Note. Scores on the full 175-item Philadelphia Naming Test (PNT) are based on item parameter estimates from the full sample (N = 251) and from the 251 jackknife samples (n = 250 each). The simulated short forms are the 30-item computerized adaptive PNT (PNT-CAT30), the variable-length computerized adaptive PNT (PNT-CATVL), and the static PNT30A and PNT30B. The complement tests comprise the items remaining from the full test after the simulated administration of each short form.

The results for RMSD, presented in Table 4, were very similar to those for the correlations. When compared to the full PNT, the PNT-CAT30 had significantly lower total error than the PNT30A, whereas the PNT-CATVL had significantly lower error than both static short forms. As with the correlations, none of the comparisons between the adaptive and static forms based on the complement tests was significant.

Figure 4 presents the results for RMSD relative to the full PNT, with the sample stratified by severity of naming impairment, as estimated by the full PNT. We identified three subgroups based on ability level (q): q ≤ −1, severe impairment, n = 52; −1 < q ≤ 1, moderate impairment, n = 131; and q > 1, mild impairment, n = 68. These results, in agreement with the impression provided by the scatter plots in Figure 3, suggest that the lower error associated with the adaptive forms was driven primarily by improved score estimates for participants with mild, and to a lesser extent, severe naming impairment. Figure 4 also shows that all four short forms performed similarly for participants in the middle of the ability distribution. Importantly, the analysis shown in Figure 4 found that the PNT-CATVL obtained significantly lower error than the PNT-CAT30 for participants with mild and severe impairment. This is directly attributable to the fact that the PNT-CATVL administered more items when ability estimates were more extreme. The mean (SD) length in items of the PNT-CATVL was 44 (24), 25 (1), and 46 (30) for the mild-, moderate-, and severe-impairment groups, respectively. Six participants in the mild-impairment group and 10 in the severe-impairment group received the maximum of 100 items.

Results for bias and variable error are provided in the online supplemental materials (see Supplemental Materials 2) and briefly summarized here. All four short forms showed small positive bias with respect to the full PNT, with average ability estimates that were 0.04–0.06 scaled score points higher. The results for variable error were almost identical to those for RMSD, with the exception being that the PNT-CAT30 obtained significantly lower variable error than both static short forms. Marginal reliability was significantly higher for both CATs (.95) than for the static short forms (.89 and .90, respectively, for A and B).

The 1-PL model standard error curves presented in Figure 5 show that the PNT-CAT30 had better expected measurement precision than both static short forms across the full range of ability, except for a narrow range between −1.0 and −0.5, where they performed similarly. The PNT-CATVL approximated the target standard error of 0.33 across most of the ability range, and had better expected precision than the remaining short forms for the most extreme ability estimates. The median number of items administered in the PNT-CATVL simulation across the full sample was 26 (minimum = 24, maximum = 100, interquartile range = 25.0–29.5).

Table 3. Correlations of the simulated short forms with the full 175-item Philadelphia Naming Test (PNT) and with their respective complement tests, and comparisons between adaptive and static short forms.

                         Full PNT                             Complement
                         Estimate   95% confidence interval   Estimate   95% confidence interval
Short form
  PNT-CAT30              .974       [0.967, 0.980]            .950       [0.937, 0.960]
  PNT-CATVL              .978       [0.971, 0.983]            .945       [0.932, 0.955]
  PNT30A                 .958       [0.949, 0.966]            .946       [0.933, 0.956]
  PNT30B                 .967       [0.959, 0.973]            .956       [0.944, 0.964]
Comparison
  PNT-CAT30 − PNT30A     .016*      [0.007, 0.025]            .004       [−0.010, 0.018]
  PNT-CAT30 − PNT30B     .008+      [0.000, 0.016]            −.006      [−0.019, 0.008]
  PNT-CATVL − PNT30A     .020*      [0.011, 0.030]            −.001      [−0.015, 0.012]
  PNT-CATVL − PNT30B     .011*      [0.004, 0.020]            −.011+     [−0.023, 0.001]

Note. The simulated short forms are the fixed-length adaptive PNT (PNT-CAT30), the variable-length adaptive PNT (PNT-CATVL), and the static PNT30A and PNT30B.
+Significantly different from 0, p < .10. *Significantly different from 0, p < .01.


Figure 3. Scatter plots of naming-ability scores for simulated static short forms (PNT30A, PNT30B) and 30-item and variable-length computerized adaptive versions (PNT-CAT30, PNT-CATVL) over scores estimated from the full 175-item Philadelphia Naming Test. The points have been jittered to make overlapping points visible. The line in each plot is the identity line.

Figure 4. Root-mean-square deviation (RMSD) of the simulated adaptive and static short forms with respect to the full Philadelphia Naming Test, stratified by naming impairment (or ability q), as estimated by the full test. Marked comparisons are significantly different from 0 at p < .10 (+), p < .05 (*), and p < .01 (**).

Comparison of Adaptive and Static Equivalent Alternate Forms

For the alternate-form comparisons, we compared scores from the simulated PNT-CAT30 to a second variable-length PNT-CAT simulation in which the item-selection algorithm was adjusted to avoid items given in the first CAT, and for which the stopping rule was set as precision equal to or lower than that obtained by the PNT-CAT30, or a maximum of 100 items. For the static alternate forms, we compared scores from simulated administrations of the PNT30A and PNT30B.

The results, presented in Table 5, indicate that agreement, although lower than that observed with the full PNT, was acceptable and similar for the adaptive and static alternate forms, with one exception. The PNT-CAT alternate forms demonstrated significantly greater bias than the static forms, with a mean difference of 0.063 between the initial PNT-CAT30 and subsequent variable-length alternate form, compared to a difference of 0.001 for the static short forms. The median (interquartile range) number of items administered for the PNT-CAT alternate form was 33 (30–48). The mean was 43, which was significantly greater than 30 (p < .001).

Table 4. Root-mean-square deviations of the simulated adaptive and static short forms with respect to the full Philadelphia Naming Test (PNT) and their respective complement tests, and comparisons between adaptive and static short forms.

                         Full PNT                             Complement
                         Estimate   95% confidence interval   Estimate   95% confidence interval
Short form
  PNT-CAT30              0.338      [0.305, 0.380]            0.476      [0.433, 0.535]
  PNT-CATVL              0.309      [0.283, 0.342]            0.481      [0.439, 0.533]
  PNT30A                 0.427      [0.384, 0.477]            0.487      [0.441, 0.539]
  PNT30B                 0.383      [0.348, 0.428]            0.441      [0.402, 0.491]
Comparison
  PNT-CAT30 − PNT30A     −0.089*    [−0.141, −0.033]          −0.011     [−0.077, 0.061]
  PNT-CAT30 − PNT30B     −0.045+    [−0.095, 0.005]           0.036      [−0.031, 0.104]
  PNT-CATVL − PNT30A     −0.118*    [−0.174, −0.064]          −0.006     [−0.063, 0.053]
  PNT-CATVL − PNT30B     −0.075*    [−0.127, −0.024]          0.041      [−0.015, 0.093]

Note. The simulated short forms are the fixed-length adaptive PNT (PNT-CAT30), the variable-length adaptive PNT (PNT-CATVL), and the static PNT30A and PNT30B.
+Significantly different from 0, p < .10. *Significantly different from 0, p < .01.


Figure 5. Plot of one-parameter logistic model standard error curves as a function of naming ability for the simulated static Philadelphia Naming Test (PNT) short forms (PNT30A, PNT30B), the 30-item and variable-length computerized adaptive PNTs (PNT-CAT30, PNT-CATVL), and the full 175-item PNT. The histogram in the lower panel shows the distribution of naming-ability estimates from the full PNT.


Discussion

The purpose of this study was to evaluate the potential performance of an IRT-based CAT version of a popular test of naming in aphasia as compared with nonadaptive short versions, using real-data simulations. To do this, we used data obtained from 251 participants with aphasia archived in the Moss Aphasia Psycholinguistics Project Database (Mirman et al., 2010). After establishing that the 1-PL model assumptions were met (Fergadiotis et al., 2015), we used the responses from the database to simulate administrations of a 30-item adaptive version of the PNT (PNT-CAT30), a variable-length adaptive version (PNT-CATVL), and the two static 30-item PNT short forms (Walker & Schwartz, 2012). We evaluated the performance of all four versions based on multiple criteria, including correlation with and root-mean-square deviation from the full PNT and complement tests, marginal reliability, and measurement precision conditional on naming ability.

The results suggest that a 30-item computerized adaptive version of the PNT may provide estimates of naming ability that are more accurate and precise than existing static short versions. Although all test forms produced estimates that agreed closely with the full 175-item PNT estimates, the PNT-CAT forms produced estimates that correlated more strongly with, and had less total deviation from, the full 175-item PNT estimates than did the static 30-item short forms. Comparisons of the short forms in subgroups stratified by naming ability showed that the differences between the adaptive and static forms occurred mainly for individuals with high and, to a lesser extent, low naming ability. The four short forms performed similarly for individuals in the middle of the ability distribution.

In addition to comparing the simulated adaptive and static short versions to the full PNT, we compared them to complement tests composed of the items remaining in the data set after simulated administration of each short version. We did this because the responses contributing to the simulated short versions were subsets of the responses to the full PNT, potentially inflating their agreement with one another. The complement tests were a more conservative approach that avoided this limitation by computing score estimates based on nonoverlapping subsets of items, at the potential cost of reducing the quality of the full-PNT score estimates. The complement test comparisons found no significant differences.

Table 5. Correlations, root-mean-square deviations (RMSD), bias, and variable error for simulated adaptive (PNT-CAT) and static (PNT30) alternate forms, and comparisons between them.

                 PNT-CAT                              PNT30                                PNT-CAT − PNT30
Variable         Estimate   95% confidence interval   Estimate   95% confidence interval   Estimate   95% confidence interval
Correlation      .937       [0.922, 0.949]            .943       [0.927, 0.954]            −.006      [−0.021, 0.009]
RMSD             0.528      [0.482, 0.585]            0.501      [0.455, 0.555]            0.027      [−0.039, 0.094]
Bias             0.063+     [−0.003, 0.126]           0.001      [−0.064, 0.064]           0.062*     [0.012, 0.148]
Variable error   0.524      [0.478, 0.582]            0.501      [0.456, 0.555]            0.023      [−0.042, 0.090]

+Significantly different from 0, p < .10. *Significantly different from 0, p < .05.


This greater similarity between the adaptive and static short forms may be explained by considering that adaptive tests are composed of items that are increasingly targeted to the subject's ability level as the test progresses. Thus, the remaining items constituting the complement test are likely to be on average less statistically informative than the items remaining after a static test. Although the complement-test comparisons were less favorable for the adaptive short forms, they do not fundamentally alter the conclusion that adaptive testing may offer psychometric advantages over static short forms.

In a final simulation using the same data set, we asked whether equivalent alternate test forms created by the adaptive-testing algorithm produced better agreement than the recently published static alternate short forms (Walker & Schwartz, 2012). In this simulation, we ran the PNT-CAT30 simulation as before, and then simulated a variable-length PNT-CAT with the stopping rule set to the precision achieved by the PNT-CAT30 and with the item-selection algorithm altered to begin at the ability level estimated by the PNT-CAT30 and to avoid items administered for the PNT-CAT30. The simulation suggested that this is a feasible approach, but it did not significantly improve agreement between the alternate forms relative to the existing static short forms. Indeed, in contrast to the static forms, the PNT-CAT alternate forms actually showed marginally significant bias with respect to one another. One possible reason for this is related to the point raised before for the complement-test comparisons. In the simulation of the alternate forms, the actual underlying ability level (which was unknown but best approximated by the full 175-item PNT) did not change between the two administrations. Thus, the initial PNT-CAT30 "used up" many of the optimal items for each subject, leaving less informative items for the second variable-length PNT-CAT. This was especially the case for participants with mild impairment, because the PNT has only two items with difficulty values > 1.1 ("stethoscope" and "microscope"). The fact that the second PNT-CAT required significantly more items to achieve the same precision as that obtained by the initial PNT-CAT30 suggests that this was the case. If naming ability actually changed between the two assessments, or if the item bank were augmented with additional items, it is possible and even expected that advantages would emerge for the CAT approach. Another possibility would be to modify the item-selection algorithm to reserve a subset of optimally targeted items for later testing with the alternate form. These questions may be addressed by additional Monte Carlo simulation studies but must ultimately be answered by the collection of independent administrations of CAT alternate forms to actual persons with aphasia.

Indeed, the fact that this was a simulation study constitutes its most fundamental limitation. All of the conclusions drawn from the work await confirmation in an independent sample. A second issue that some readers may perceive as a limitation is the lack of correction for the multiple statistical comparisons between the adaptive and static short forms. In adopting this approach, we have followed the logic of Gelman and Hill (2007), who argue that correction for multiple comparisons is unnecessary, provided that one recognizes that statistical tests will sometimes be mistaken and that the consequences of occasional type I errors are not overly concerning. In the context of the present work, where we conducted 49 direct pairwise comparisons between CAT and static short forms on various statistics (including bias and variable error, which are reported in the online supplemental materials), we would have expected between one and seven comparisons to be significant at p < .05 due to chance alone if the tests were independent. Given that the tests were not independent, but instead were likely positively dependent, the expected type I error rate was probably lower (Simes, 1986). In fact, there were 18 significant comparisons. Furthermore, we deemed the consequences of a type II error (declining to pursue further research on the application of CAT procedures in aphasia testing when in fact they provide more accurate and precise score estimates than static short forms) to be less desirable than the consequences of a type I error (pursuing further research on CAT procedures in aphasia testing when in fact they provide score estimates with accuracy and precision comparable to static short forms).

It is also informative to consider the practical, as opposed to statistical, significance of the observed differences between the CAT and static short forms. CAT procedures are essentially targeted at reducing measurement error. In the present situation, where the baseline reliability of the static tests is already relatively high, large proportional reductions in error variance can appear small in correlational terms. Given the RMSD results reported in Table 4, one can estimate that the PNT-CAT forms reduced the error variance relative to the full PNT by 36% over the static short forms, on average (averaging the squared RMSDs in Table 4 gives a mean error variance of approximately 0.105 for the two adaptive forms versus 0.165 for the two static forms, a reduction of about 36%). In the context of static short-form reliabilities in the .9–.95 range, the increased precision offered by CAT may be particularly important for analyses of individual participants, given that .95 has been cited as a desired minimum to support inferences about individuals (Nunnally & Bernstein, 1994). Also, the results presented in Figures 3–5 suggest that the benefits of adaptive testing may be greatest for individuals with particularly mild or severe impairments.

One of the main benefits of IRT-based computer adaptive testing is its flexibility in being tailored to the needs, constraints, and opportunities of specific measurement situations. The efficiency of the CAT algorithm can be improved by expanding the logistic model to build models that incorporate various sources of ancillary information (e.g., de la Torre, 2009). For example, a CAT algorithm can be augmented to start at q values predicted from other tests that may have been administered at the same time, a typical scenario in aphasiology, where confrontation-naming tests are part of larger batteries. Other sources of information could include response times (e.g., van der Linden, 2007) and clinical impressions. Utilizing information from such sources would lead to the administration of more "on-target" items, thus increasing the measurement precision of fixed-length tests or shortening variable-length tests of fixed precision.


The present results suggest, as is typical, that the greatest benefits are likely to accrue in cases of particularly high or low ability, that is, when impairment is mild or severe.

IRT-based computer adaptive tests can be specified even further. For example, in this article we specified the administration algorithm of the PNT-CATVL to obtain constant test information across a wide range of scores. This would be preferred in studies that expect great variability in terms of severity, when the researcher is interested in ensuring that scores are being estimated with equal reliability regardless of level of severity. In other applications, the goal may be to establish whether test takers' latent trait is above or below some predetermined cutoff criterion (e.g., mastery tests; Weiss & Kingsbury, 1984). Such a test could be constructed by selecting items that would maximize information around the specific level of the trait and withholding the rest of the items (e.g., Thompson, 2009). This would reduce test burden without sacrificing classification accuracy. Screening to identify individuals with possible naming impairment is one clinical situation in which this particular application could be useful.

With respect to the PNT, one additional benefit is the potential to expand the item bank beyond the 175 PNT items to include items from other naming tests as well as items not currently included in any naming test. An expanded item bank would offer both increased content coverage and better ability to compare results across studies and clinics. Also, to the extent that new items are targeted toward individuals with particularly mild or severe naming impairment, they will provide increased precision for CAT score estimates. The most straightforward approach to incorporating new items into an item bank is to administer them to a large sample of participants along with a selection of current items, permitting calibration of the new items to the current scale. However, it may be possible to develop automatic item calibration procedures in which the difficulty of new items is estimated from their stimulus properties instead of empirical item response data (e.g., Embretson, 1998; Gorin & Embretson, 2006; Stenner, Burdick, Sanford, & Burdick, 2006). We previously demonstrated that three variables (word length, lexical frequency, and rated age of acquisition) accounted for 63% of the variance in item difficulty estimates (Fergadiotis et al., 2015). If this result can be confirmed and improved, ideally with a small number of additional, inexpensively obtained stimulus variables, automatic item generation for the assessment of naming in aphasia may be possible.

It has also been noted that a major challenge in interpreting results from treatment studies for aphasia is that the methodology used in assessing treatment efficacy differs across studies (Nickels, 2002). Generalizing results across different assessments is difficult in part because of the challenges of accurately interpreting and comparing scores across the different samples used in these studies (Nickels, 2002). IRT-based item-bank calibration, equating, and linking techniques may offer a means of improving the generalizability of research findings across studies.
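As an illustration of the automatic item calibration idea raised above, the sketch below predicts a provisional 1-PL difficulty for a new picture-naming item from its word length, lexical frequency, and rated age of acquisition. The regression coefficients shown are hypothetical placeholders, not the estimates reported by Fergadiotis et al. (2015); any real application would require refitting the model against calibrated PNT item difficulties.

```python
def predict_difficulty(length_in_phonemes, lg10cd_frequency, age_of_acquisition, coefs=None):
    """Provisional item difficulty predicted from stimulus properties.
    The default coefficients are hypothetical placeholders chosen only to show the
    expected direction of the effects (longer and later-acquired words harder,
    more frequent words easier); they are not published estimates."""
    if coefs is None:
        coefs = {"intercept": 0.0, "length": 0.10, "frequency": -0.50, "aoa": 0.20}
    return (coefs["intercept"]
            + coefs["length"] * length_in_phonemes
            + coefs["frequency"] * lg10cd_frequency
            + coefs["aoa"] * age_of_acquisition)
```

Items calibrated provisionally in this way could be flagged for empirical recalibration once enough responses had accumulated.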


Several practical concerns for the application of CAT methods to aphasia assessment arise around the software tools needed to implement them and the feasibility and validity of the procedures surrounding their use. The first and second authors of this article have recently developed a user-friendly software application to support adaptive administration and scoring of a new self-reported outcome measure for aphasia, and they are currently modifying it to support administration of the PNT-CAT. In traditional CAT implementations, item responses are typically entered directly by the participant and scored automatically by the computer. Although there does appear to be promise for automatic scoring of confrontation-naming responses by speakers with aphasia, it is not currently feasible with acceptable accuracy rates (Abad et al., 2013; Messamer, Ramsberger, & Hardin, 2012). Instead, we are designing the PNT-CAT so that the examining clinician will manually enter a dichotomous correct/incorrect judgment as each naming response is given. The CAT procedures will reside solely in the selection and presentation of the items and in the estimation of scores, given the dichotomous judgments entered by the clinician. Of course, it will also be advisable to audio-record the responses so that online scoring may be revised as needed and naming error patterns evaluated.

In conclusion, this study demonstrated that developing and implementing an IRT-based computerized adaptive version of a popular naming test could potentially increase the test's clinical utility without sacrificing the positive psychometric properties associated with the full-length test. IRT-based tests can provide objective evidence of treatment efficacy in a form that may be easily understood by consumers and third-party payers. From a clinician's perspective, obtaining this evidence does not require extraordinary time and effort to administer, score, and accurately interpret the test. From the client's perspective, a well-targeted adaptive test reduces the strain and boredom associated with long tests and with inappropriately difficult or easy items. Importantly, IRT modeling allows flexible use of the tool to suit the needs and abilities of the individual user, while placing the burden of validity and reliability on test developers and designers.

Acknowledgments

This research was supported by VA Rehabilitation Research & Development Career Development Award C7476W, and the VA Pittsburgh Healthcare System Geriatric Research Education and Clinical Center. The contents of this paper do not represent the views of the Department of Veterans Affairs or the United States Government.

References

Abad, A., Pompili, A., Costa, A., Trancoso, I., Fonseca, J., Leal, G., . . . Martins, I. P. (2013). Automatic word naming recognition for an on-line aphasia treatment system. Computer Speech & Language, 27, 1235–1248.
Babbitt, E. M., Heinemann, A. W., Semik, P., & Cherney, L. R. (2011). Psychometric properties of the Communication Confidence Rating Scale for Aphasia (CCRSA): Phase 2. Aphasiology, 25, 727–735.
Baylor, C., Yorkston, K., Eadie, T., Kim, J., Chung, H., & Amtmann, D. (2013). The Communicative Participation Item Bank (CPIB): Item bank calibration and development of a disorder-generic short form. Journal of Speech, Language, and Hearing Research, 56, 1190–1208.
Beeson, P. M., King, R. M., Bonakdarpour, B., Henry, M. L., Cho, H., & Rapcsak, S. Z. (2011). Positive effects of language treatment for the logopenic variant of primary progressive aphasia. Journal of Molecular Neuroscience, 45, 724–736.
Beltyukova, S. A., Stone, G. M., & Ellis, L. W. (2008). Rasch analysis of word identification and magnitude estimation scaling responses in measuring naïve listeners' judgments of speech intelligibility of children with severe-to-profound hearing impairments. Journal of Speech, Language, and Hearing Research, 51, 1124–1137.
Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41, 977–990.
Canty, A., & Ripley, B. (2014). boot: Bootstrap R (S-Plus) functions. R package version 1.3-11 [Computer software]. Available from http://cran.r-project.org/web/packages/boot
Cella, D., Riley, W., Stone, A., Rothrock, N., Reeve, B., Yount, S., . . . Hays, R. (2010). The Patient-Reported Outcomes Measurement Information System (PROMIS) developed and tested its first wave of adult self-reported health outcome item banks: 2005–2008. Journal of Clinical Epidemiology, 63, 1179–1194.
Choi, S. W. (2009). Firestar: Computerized adaptive testing simulation program for polytomous item response theory models. Applied Psychological Measurement, 33, 644–645.
de Ayala, R. J. (2009). The theory and practice of item response theory. New York, NY: Guilford.
de la Torre, J. (2009). Improving the quality of ability estimates through multidimensional scoring and incorporation of ancillary variables. Applied Psychological Measurement, 33, 465–485. doi:10.1177/0146621608329890
del Toro, C. M., Bislick, L. P., Comer, M., Velozo, C., Romero, S., Gonzalez Rothi, L. J., & Kendall, D. L. (2011). Development of a short form of the Boston Naming Test for individuals with aphasia. Journal of Speech, Language, and Hearing Research, 54, 1089–1100.
Dell, G. S. (1986). A spreading-activation theory of retrieval in sentence production. Psychological Review, 93, 283–321.
Dell, G. S. (1990). Effects of frequency and vocabulary type on phonological speech errors. Language and Cognitive Processes, 5, 313–349.
Dell, G. S., Schwartz, M. F., Martin, N., Saffran, E. M., & Gagnon, D. A. (1997). Lexical access in aphasic and nonaphasic speakers. Psychological Review, 104, 801–838.
Dell, G. S., Schwartz, M. F., Nozari, N., Faseyitan, O., & Coslett, H. B. (2013). Voxel-based lesion-parameter mapping: Identifying the neural correlates of a computational model of word production. Cognition, 128, 380–396.
DiCiccio, T. J., & Efron, B. (1996). Bootstrap confidence intervals. Statistical Science, 11, 189–212.
Donovan, N. J., Velozo, C. A., & Rosenbek, J. C. (2007). The Communicative Effectiveness Survey: Investigating its item-level psychometric properties. Journal of Medical Speech-Language Pathology, 15, 433–447.
Doyle, P. J., Hula, W. D., McNeil, M. R., Mikolic, J. M., & Matthews, C. (2005). An application of Rasch analysis to the measurement of communicative functioning. Journal of Speech, Language, and Hearing Research, 48, 1412–1428.
Dunn, L. M., & Dunn, L. M. (1981). Peabody Picture Vocabulary Test–Revised. Circle Pines, MN: AGS.
Eadie, T. L., Yorkston, K. M., Klasner, E. R., Dudgeon, B. J., Deitz, J. C., Baylor, C. R., . . . Amtmann, D. (2006). Measuring communicative participation: A review of self-report instruments in speech-language pathology. American Journal of Speech-Language Pathology, 15, 307–320.
Edmonds, L. A., & Donovan, N. J. (2012). Item-level psychometrics and predictors of performance for Spanish/English bilingual speakers on An Object and Action Naming Battery. Journal of Speech, Language, and Hearing Research, 55, 359–381.
Embretson, S. E. (1998). A cognitive design system approach to generating valid tests: Application to abstract reasoning. Psychological Methods, 3, 380–396.
Fergadiotis, G., Kellough, S., & Hula, W. D. (2015). Item response theory modeling of the Philadelphia Naming Test. Journal of Speech, Language, and Hearing Research, 58, 865–877.
Francis, W. N., & Kučera, H. (1982). Frequency analysis of English usage: Lexicon and grammar. Boston, MA: Houghton Mifflin.
Fridriksson, J., Moser, D., Bonilha, L., Morrow-Odom, K. L., Shaw, H., Fridriksson, A., . . . Rorden, C. (2007). Neural correlates of phonological and semantic-based anomia treatment in aphasia. Neuropsychologia, 45, 1812–1822.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. New York, NY: Cambridge University Press.
Gorin, J. S., & Embretson, S. E. (2006). Item difficulty modeling of paragraph comprehension items. Applied Psychological Measurement, 30, 394–411.
Gutman, R., DeDe, G., Michaud, J., Liu, J. S., & Caplan, D. (2010). Rasch models of aphasic performance on syntactic comprehension tests. Cognitive Neuropsychology, 27, 230–244.
Hula, W., Donovan, N. J., Kendall, D. L., & Gonzalez-Rothi, L. J. (2010). Item response theory analysis of the Western Aphasia Battery. Aphasiology, 24, 1326–1341.
Hula, W. D., Doyle, P. J., & Austermann Hula, S. N. (2010). Patient-reported cognitive and communicative functioning: 1 construct or 2? Archives of Physical Medicine and Rehabilitation, 91, 400–406.
Hula, W., Doyle, P. J., McNeil, M. R., & Mikolic, J. M. (2006). Rasch modeling of Revised Token Test performance: Validity and sensitivity to change. Journal of Speech, Language, and Hearing Research, 49, 27–46.
Jette, A. M., Haley, S. M., Tao, W., Ni, P., Moed, R., Meyers, D., & Zurek, M. (2007). Prospective evaluation of the AM-PAC-CAT in outpatient rehabilitation settings. Physical Therapy, 87, 385–398.
Kaplan, E., Goodglass, H., & Weintraub, S. (1983). Boston Naming Test. Philadelphia, PA: Lea and Febiger.
Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44, 978–990.
Levelt, W. J. M., Roelofs, A., & Meyer, A. S. (1999). A theory of lexical access in speech production. Behavioral and Brain Sciences, 22, 1–38.
Lord, F. M., Novick, M. R., & Birnbaum, A. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Messamer, P., Ramsberger, G., & Hardin, K. (2012, May). Automated assessment of aphasic speech using discrete speech recognition systems. Poster session presented at the 42nd Clinical

Hula et al.: Computer Adaptive Philadelphia Naming Test

Downloaded From: http://jslhr.pubs.asha.org/pdfaccess.ashx?url=/data/journals/jslhr/934203/ by a Univ of York-England User on 06/14/2017 Terms of Use: http://pubs.asha.org/ss/rights_and_permissions.aspx

889

Aphasiology Conference, Lake Tahoe, CA. Retrieved from http://aphasiology.pitt.edu/archive/00002426/01/305-524-1-RV_ (Messamer_Ramsberger_Hardin).pdf Miller, R. G. (1974). The jackknife—A review. Biometrika, 61, 1–15. Mirman, D., Strauss, T. J., Brecher, A., Walker, G. M., Sobel, P., Dell, G. S., & Schwartz, M. F. (2010). A large, searchable, web-based database of aphasic performance on picture naming and other tests of cognitive function. Cognitive Neuropsychology, 27, 495–504. Nickels, L. (2002). Therapy for naming disorders: Revisiting, revising, and reviewing. Aphasiology, 16(10–11), 935–979. Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York, NY: McGraw-Hill. Reeve, B. B., Hays, R. D., Bjorner, J. B., Cook, K. F., Crane, P. K., Teresi, J. A., . . . Cella, D. (2007). Psychometric evaluation and calibration of health-related quality of life item banks: Plans for the Patient-Reported Outcomes Measurement Information System (PROMIS). Medical Care, 45(5, Suppl. 1), S22–S31. Roach, A., Schwartz, M. F., Martin, N., Grewal, R. S., & Brecher, A. (1996). The Philadelphia Naming Test: Scoring and rationale. Clinical Aphasiology, 24, 121–133. Copyright © 1996 Albert Einstein Healthcare Network/Moss Rehabilitation Institute. Ruml, W., Caramazza, A., Shelton, J. R., & Chialant, D. (2000). Testing assumptions in computational theories of aphasia. Journal of Memory and Language, 43, 217–248. Schmidt, R. A., & Lee, T. D. (2005). Motor control and learning: A behavioral emphasis (4th ed.). Champaign, IL: Human Kinetics. Schwartz, M. F., Dell, G. S., Martin, N., Gahl, S., & Sobel, P. (2006). A case-series test of the interactive two-step model

890

of lexical access: Evidence from picture naming. Journal of Memory and Language, 54, 228–264. doi:10.1016/j.jml.2005. 10.001 Simes, R. J. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73, 751–754. Stenner, A. J., Burdick, H., Sanford, E. E., & Burdick, D. S. (2006). How accurate are Lexile text measures? Journal of Applied Measurement, 7, 307–322. Thissen, D. (2000). Reliability and measurement precision. In H. Wainer (Ed.), Computerized adaptive testing: A primer (2nd ed.; pp. 159–184). Mahwah, NJ: Erlbaum. Thissen, D., Chen, W. H., & Bock, R. D. (2003). Multilog (Version 7) [Computer software]. Lincolnwood, IL: Scientific Software International. Thompson, N. A. (2009). Item selection in computerized classification testing. Educational and Psychological Measurement, 69, 778–793. van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72, 287–308. Wainer, H. (2000). Introduction and history. In H. Wainer (Ed.), Computerized adaptive testing: A primer (2nd ed.; pp. 1–21). Mahwah, NJ: Erlbaum. Walker, G. M., & Schwartz, M. F. (2012). Short-form Philadelphia Naming Test: Rationale and empirical evaluation. American Journal of Speech-Language Pathology, 21(Suppl.), S140–S153. Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21, 361–375.
