ORIGINAL ARTICLE OTA Highlight Paper

Computerized Adaptive Testing Using the PROMIS Physical Function Item Bank Reduces Test Burden With Less Ceiling Effects Compared With the Short Musculoskeletal Function Assessment in Orthopaedic Trauma Patients

Man Hung, PhD, Ami R. Stuart, PhD, Thomas F. Higgins, MD, Charles L. Saltzman, MD, and Erik N. Kubiak, MD

Purpose: Patient-reported outcomes are important for assessing the effectiveness of clinical interventions. For orthopaedic trauma patients, the short Musculoskeletal Function Assessment (sMFA) is a commonly used questionnaire. Recently, the Patient-Reported Outcome Measurement Information System (PROMIS) Physical Function Computer Adaptive Test (PF CAT) was developed using item response theory to administer questions efficiently from a calibrated bank of 124 physical function (PF) items. In this study, we compared the sMFA with the PROMIS PF CAT in orthopaedic trauma patients.

Methods: Orthopaedic trauma patients completed the sMFA and the PROMIS PF CAT on a tablet wirelessly connected to the PROMIS Assessment Center. The time for each test administration was recorded. A 1-parameter item response theory model was used to examine the psychometric properties of the instruments, including precision and floor/ceiling effects.

Results: One hundred fifty-three orthopaedic trauma patients participated in the study. Mean test administration time for the PROMIS PF CAT was 44 seconds versus 599 seconds for the sMFA (P < 0.05). Both instruments showed extremely high item reliability (Cronbach alpha = 0.98). In terms of instrument coverage, neither instrument showed any floor effect; however, the sMFA revealed a 14.4% ceiling effect, whereas the PROMIS PF CAT had no appreciable ceiling effect.

Conclusions: Administered by electronic means, the PROMIS PF CAT required less than one-tenth of the time that the sMFA required for patients to complete while achieving equally high reliability and fewer ceiling effects. The PROMIS PF CAT is a very attractive and innovative method for assessing patient-reported outcomes with minimal burden to patients.

Accepted for publication December 18, 2013.
From the Department of Orthopaedic Surgery Operations, University of Utah, Salt Lake City, UT.
Presented at the Orthopaedic Trauma Association Annual Meeting, October 6, 2012, Minneapolis, MN.
None of the authors received payment or support in kind for any aspect of the submitted work.
Supplemental digital content is available for this article. Direct URL citations appear in the printed text and are provided in the HTML and PDF versions of this article on the journal’s Web site (www.jorthotrauma.com).
Reprints: Man Hung, PhD, Department of Orthopaedic Surgery Operations, University of Utah, 590 Wakara Way, Salt Lake City, UT 84108 (e-mail: [email protected]).
Copyright © 2013 by Lippincott Williams & Wilkins

J Orthop Trauma  Volume 28, Number 8, August 2014

Key Words: orthopaedic trauma, PROMIS, PF, computerized adaptive testing, sMFA (J Orthop Trauma 2014;28:439–444)

INTRODUCTION

Understanding the outcomes of treatment is a fundamental step toward improving the care of patients. Patient-reported outcomes (PROs) have been recognized as critically important to guiding treatment decisions and assessing the effectiveness of clinical interventions. Increased use of PROs has resulted in a drive toward elimination of unreliable and invalid assessment tools and development of assessment tools that are not only reliable and valid but also efficient. Numerous PROs have been developed, validated in many different populations, and endorsed by professional societies and agencies; most are subjective and thus have problematic floor and ceiling effects.1

For orthopaedic trauma patients, the short Musculoskeletal Function Assessment (sMFA) is a commonly used PRO questionnaire developed in 1999. The sMFA is meant to be used to determine the effectiveness of treatment and to measure current health status. The sMFA has been endorsed by the American Academy of Orthopaedic Surgeons and has been shown to have excellent internal consistency, stability, content, convergent, and discriminant construct validity.2 Consisting of 46 items, the sMFA has 2 indices: a 34-question dysfunction index and a 12-question bother index.

Completion of patient questionnaires is becoming a technologically advanced endeavor. Hospitals and clinics are using electronic means of completion to make data collection more efficient. Along with the advancement in collection procedures, efforts at developing more precise questionnaires have increased. Item response theory (IRT) is being used to examine response characteristics of individual questions and the relationship between items within a domain.3 This theory has been used to develop Computer Adaptive Testing (CAT), considered a more efficient way of measuring and collecting patient data.
CATs use an iterative testing method that selects each subsequent question based on the patient’s previous responses until predetermined termination criteria are met.3 This decreases the number of questions the patient responds to by eliminating questions that do not pertain to their condition, resulting in shortened questionnaire administration time and reduced burden on the patient.

In an effort to develop improved PROs for a wide variety of chronic diseases and conditions, the National Institutes of Health has funded the Patient-Reported Outcome Measurement Information System (PROMIS). The PROMIS group has developed many assessment tools with calibrated item banks that measure the domains of PF, fatigue, role functioning, pain, and mental health.4,5 The tools developed using IRT and CATs have resulted in assessments that reduce the time burden on patients while retaining precision. PROMIS CATs have been shown to be highly reliable while drastically reducing the number of questions required to complete the questionnaires compared with legacy scores,6 such as the sMFA.

The PROMIS Physical Function (PF) CAT item bank was developed through a rigorous 6-phase process that included (1) identifying, evaluating, and revising 1865 extant items for an item pool; (2) classification and selection of items; (3) review and revision of items; (4) domain focus group input; (5) individual item cognitive interview; and (6) final revision to 124 items prior to field testing.7

Rose et al8 tested the PF domain for musculoskeletal disorders in 2008. The group evaluated 136 items from 9 questionnaires and developed a 70-item question bank that addressed lower extremity function, central body functions, and activities of daily living. The authors found an adequately reliable and precise 10-question instrument using CAT.8 Using the first version of the PROMIS PF item bank containing 124 IRT-calibrated items, Hung et al9 studied the psychometric properties of the items in an outpatient orthopaedic foot and ankle population. This prospective study found the upper extremity items to have a pronounced ceiling effect and recommended development of separate upper and lower extremity banks.
As separate upper and lower extremity PF banks are not yet validated and available for use, this study used the PROMIS PF CAT version 1 to examine the differences in psychometric properties and time to administer between the PROMIS PF CAT and the sMFA. Although validation studies have been completed, few studies have compared the PROMIS CATs with legacy scores. To our knowledge, we present the first results of a prospective study of orthopaedic trauma patients who completed both the PROMIS PF CAT and the sMFA.
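The adaptive item selection described in the Introduction can be illustrated with a minimal sketch. It assumes a dichotomous Rasch model and a maximum-information selection rule, which is a simplification: the PROMIS PF CAT itself uses polytomous items, and its actual selection logic is not described in this article. The three-item bank and all function names are hypothetical.

```python
import math

def rasch_probability(theta, difficulty):
    """Probability of endorsing an item under a dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def item_information(theta, difficulty):
    """Fisher information of a dichotomous Rasch item at ability theta."""
    p = rasch_probability(theta, difficulty)
    return p * (1.0 - p)

def select_next_item(theta, difficulties, administered):
    """Return the index of the unused item that is most informative at theta."""
    candidates = [i for i in range(len(difficulties)) if i not in administered]
    if not candidates:
        return None  # the bank is exhausted
    return max(candidates, key=lambda i: item_information(theta, difficulties[i]))

# Hypothetical three-item bank (difficulties in logits), for illustration only.
bank = [-2.0, 0.0, 2.0]
first = select_next_item(0.1, bank, administered=set())
second = select_next_item(0.1, bank, administered={first})
```

Under this rule the item whose difficulty lies closest to the current ability estimate is asked next, which is why a well-targeted CAT can skip items that are far too easy or far too hard for a given patient.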

PATIENTS AND METHODS

Participants and Data Collection Procedure

A total of 153 patients with orthopaedic trauma conditions completed the sMFA and the PROMIS PF CAT in a university outpatient clinic from August 2011 through February 2012. The default rules for administering the PROMIS PF CAT were used (minimum = 4 items, maximum = 12 items, minimum standard error = 0.33) via the graded response model. Participants included those who were seen for lower extremity postoperative follow-up visits, nonoperative fracture care, and consultation. Patients under the age of 18 years were excluded. This study was conducted with institutional review board approval, and informed consent was obtained from all the patients. Each patient completed demographic information, the sMFA, and the PROMIS PF CAT using a tablet.
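The administration rules quoted above (minimum of 4 items, maximum of 12 items, standard error of 0.33) can be expressed as a simple stopping rule. The sketch below is illustrative only: the function names are hypothetical, the sequence of standard errors is invented, and treating the threshold as "less than or equal to" is an assumption, because the article does not state how the Assessment Center handles the boundary.

```python
def should_stop(items_given, standard_error,
                min_items=4, max_items=12, se_threshold=0.33):
    """Decide whether a CAT session may end after the current item."""
    if items_given >= max_items:
        return True   # hard cap on test length
    if items_given < min_items:
        return False  # always administer the minimum number of items
    # Assumption: the precision target counts as met at or below the threshold.
    return standard_error <= se_threshold

def items_administered(se_after_each_item, **rule):
    """Count the items given before the stopping rule fires.

    se_after_each_item is a hypothetical sequence of standard errors,
    one per administered item, as an ability estimator would report them.
    """
    for count, se in enumerate(se_after_each_item, start=1):
        if should_stop(count, se, **rule):
            return count
    return len(se_after_each_item)

# Invented standard errors that shrink as more items are answered.
example_se = [0.90, 0.55, 0.40, 0.31, 0.28]
n_items = items_administered(example_se)
```

In this invented example the precision target is reached by the fourth item, so the session ends at the minimum test length.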

TABLE 1. Demographic Characteristics (N = 153)

Variable                                   Mean (SD)      Minimum   Maximum
Age (y)                                    48.7 (17.5)    18        91

Variable                                   n       %
Gender
  Male                                     83      54.2
  Female                                   66      43.1
  Missing                                  4       2.6
Race
  White                                    124     81.0
  Asian                                    3       2.0
  American Indian or Alaska native         6       3.9
  Other                                    10      6.5
  Not provided                             2       1.3
  Missing                                  8       5.2
Ethnicity
  Not Hispanic or Latino                   104     68.0
  Hispanic or Latino                       17      11.0
  Missing                                  32      20.9
Insurance
  None                                     7       4.9
  Private                                  75      49.0
  Worker’s Compensation                    10      6.5
  VA                                       1       0.7
  Public employee                          1       0.7
  HMO                                      3       2.0
  Medicaid                                 12      7.8
  Medicare                                 32      20.9
  Missing                                  12      7.8
Injury type
  Lower extremity                          137     51.1
  Lower extremity and pelvis               6       2.2
  Lower and upper extremity and pelvis     1       0.4
  Pelvis                                   15      5.6
  Total hip arthroplasty                   31      11.6
  Total knee arthroplasty                  16      6.0
  Upper extremity                          58      21.6
  Upper extremity and pelvis               2       0.8
  Upper and lower extremity                2       0.8

HMO, health maintenance organization; VA, veterans affairs.

Data Analysis

Standard descriptive statistics were used to describe demographic characteristics for the entire study sample. The average time required for each patient to complete the PROMIS PF CAT and the sMFA was also calculated. The data were analyzed using the Rasch Partial Credit model, which is a 1-parameter IRT model. The Rasch Partial Credit model was selected because of the small sample size in the study and was applied to place both the sMFA and the PROMIS PF CAT on a common metric to facilitate interpretation of results. The analysis was centered on the mean of the sample.

Current psychometric practice strongly encourages the use of IRT models to provide robust support for instrument development and evaluation. Such models allow both the item difficulty and the patients’ physical functioning ability to be measured and placed on the same scale. There are 2 important requirements for the Rasch model: item fit and dimensionality. Item fit to the Rasch model is an indicator of the quality of measurement. It is commonly measured by mean-square residual fit statistics, such as the outfit MNSQ. The outfit MNSQ is sensitive to outliers or unexpected responses; its value ranges from 0 to infinity. An outfit MNSQ of 1.0 indicates good fit, whereas a value greater than 2.0 indicates a lack of fit between the item and the model.7

Dimensionality describes whether the data form a single underlying construct, such as physical functioning, and whether this single construct can explain all the variance in the data (ie, unidimensionality). Dimensionality can be evaluated using principal component analysis of the residuals after removing the initial Rasch factor. A variance explained by the first contrast in the residuals of <10% and an eigenvalue of the first contrast of <3.0 would constitute evidence to support unidimensionality.10

To compare measurement precision, we computed the standard error of measurement (SEM) for the PROMIS PF CAT and the sMFA. SEM is a reliability index that quantifies the degree to which the measurement is free of error. A maximum SEM of 0.33 is generally accepted as excellent measurement precision.11 To examine instrument coverage, person–item maps were constructed for the PROMIS PF CAT and the sMFA.
These maps are essentially vertical scales, or rulers, created from the measurement of patients’ PF in response to the instruments. Before construction of the person–item maps, equating was conducted to adjust for the differences in scoring between the PROMIS PF CAT and the sMFA so that they could be interpreted on a common metric. The person–item map orders the patients’ physical functioning ability in the left column and the difficulty of the items in the right column and yields a graphic representation of the range of coverage of the test. Patients with higher physical functioning ability are placed at the top of the ruler; patients with lower physical functioning ability are at the bottom of the ruler. Items correspond to more difficult tasks as they move from the bottom to the top of the ruler. Items at the top of the ruler contain physical functioning concepts that are more difficult for patients to perform, whereas items at the bottom are easier to perform. To measure the physical functioning ability of all patients, the items in an instrument must cover all areas on the ruler. When there is a lack of items to cover the patients at the top of the ruler, this is called a ceiling effect. In other words, a test with a ceiling effect would be unable to sufficiently differentiate the functional abilities of the patients at the top end of the clinical range of PF. Similarly, a floor effect refers to the extent to which the items are unable to cover the individual patients at the bottom of the ruler.
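As an illustration of the outfit mean-square statistic described in the Data Analysis section, the following sketch computes it as the mean of squared standardized residuals for a single item. It assumes a dichotomous Rasch model for brevity, whereas the study used the polytomous Rasch Partial Credit model; the function names and the example values are hypothetical.

```python
import math

def rasch_probability(theta, difficulty):
    """Probability of endorsing an item under a dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def outfit_mnsq(responses, abilities, difficulty):
    """Outfit mean square for one item: mean of squared standardized residuals."""
    z_squared = []
    for x, theta in zip(responses, abilities):
        p = rasch_probability(theta, difficulty)  # expected score
        variance = p * (1.0 - p)                  # model variance of the response
        z_squared.append((x - p) ** 2 / variance)
    return sum(z_squared) / len(z_squared)

# Two hypothetical patients whose ability equals the item difficulty.
well_targeted = outfit_mnsq([1, 0], [0.0, 0.0], 0.0)

# One hypothetical, highly able patient who unexpectedly fails an easy item.
surprising = outfit_mnsq([0], [3.0], 0.0)
```

The well-targeted responses yield a value of 1.0, the expected value under the model, whereas the single surprising response yields a value far above 2.0, which shows why the statistic is described as sensitive to outliers.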

RESULTS

Demographics

Table 1 shows the demographic characteristics of the participants. Of the 153 patients who participated in the study, 54.2% were male; 81% self-identified as White and 11.1% as Hispanic or Latino; and 49.0% had private insurance, 7.8% had Medicaid, and 20.9% had Medicare.

Instrument Completion Time

On average, each patient responded to 4 questions from the PROMIS PF CAT and to all 46 questions from the sMFA. The mean administration time was significantly shorter for the PROMIS PF CAT than for the sMFA (P < 0.05): 44 seconds (less than 1 minute) versus 599 seconds (approximately 10 minutes), respectively.

FIGURE 1. Instrument precision.

Item Fit and Dimensionality

The outfit MNSQ for the sMFA was 1.07 and for the PROMIS PF CAT was 1.31, indicating that the data from both instruments adequately fit the Rasch model. In terms of dimensionality, the variances explained by the first contrast in residuals were 3.6% for the PROMIS PF CAT and 5.4% for the sMFA, whereas the eigenvalues of the first contrast were 2.1 for the PROMIS PF CAT and 5.5 for the sMFA. This provides strong support for unidimensionality for the PROMIS PF CAT and fair support for the sMFA.
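The two unidimensionality criteria stated in the Data Analysis section (first-contrast variance below 10% and first-contrast eigenvalue below 3.0) can be applied to the reported values in a few lines. This is only a bookkeeping sketch with a hypothetical function name; it does not reproduce the principal component analysis itself.

```python
def unidimensionality_checks(first_contrast_variance_pct, first_contrast_eigenvalue,
                             max_variance_pct=10.0, max_eigenvalue=3.0):
    """Report which of the two criteria a set of reported values satisfies."""
    return {
        "variance_ok": first_contrast_variance_pct < max_variance_pct,
        "eigenvalue_ok": first_contrast_eigenvalue < max_eigenvalue,
    }

# Values reported in this section.
promis = unidimensionality_checks(3.6, 2.1)
smfa = unidimensionality_checks(5.4, 5.5)
```

With the reported values, the PROMIS PF CAT meets both criteria, whereas the sMFA meets the variance criterion only, which is consistent with the "strong" versus "fair" wording used in the text.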

Instrument Precision

High instrument precision is indicated by a low SEM and by the breadth of the range over which the precision curve stays low. Figure 1 presents the instrument precision curves for the sMFA and the PROMIS PF CAT. The instruments appeared to have similar precision levels across the range of the PF trait. Both instruments had good measurement precision (SEM < 0.33) across a wide range of the trait. At the lower end of the trait, both were relatively less precise. We also examined the internal consistency reliability (Cronbach alpha) for the total set of items in the sMFA and the PROMIS PF CAT. The sMFA was expected to have much higher reliability than the PROMIS PF CAT because it had many more items. However, both instruments showed equally high internal consistency reliability of 0.98.
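For readers unfamiliar with Cronbach alpha, a minimal sketch of the standard formula follows. It assumes complete data in which every respondent answers every item; the article does not describe how alpha was obtained for the adaptive test, where patients answer different items, so this is not a reconstruction of the authors' computation. The toy scores are invented.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach alpha for complete data.

    scores is a list of respondents, each a list of item scores of equal length.
    """
    n_items = len(scores[0])
    item_columns = list(zip(*scores))  # one tuple of scores per item
    item_variance_sum = sum(variance(column) for column in item_columns)
    total_variance = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1.0 - item_variance_sum / total_variance)

# Invented data: two items that rank three respondents identically.
toy_scores = [[1, 1], [2, 2], [3, 3]]
alpha = cronbach_alpha(toy_scores)

# Invented data: two items that are unrelated across four respondents.
unrelated = cronbach_alpha([[0, 0], [1, 0], [0, 1], [1, 1]])
```

Perfectly consistent items give an alpha of 1, and unrelated items give an alpha near 0. Because the formula tends to reward longer instruments, a short adaptive test matching a 46-item questionnaire is notable.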

Instrument Coverage

Supplemental Digital Content 1 (http://links.lww.com/BOT/A162) shows the person–item map for the sMFA, and Supplemental Digital Content 2 (http://links.lww.com/BOT/A163) shows the map for the PROMIS PF CAT. In terms of instrument coverage, neither instrument showed any floor effect; however, the sMFA revealed a 14.4% ceiling effect, whereas the PROMIS PF CAT had none. The percentage of ceiling was calculated by dividing the number of patients whose average functioning abilities were above the average item difficulty of the most difficult item in the instrument by the total number of patients and then multiplying by 100. The results indicate that the PROMIS PF CAT is more reliable than the sMFA for measuring patients with a wider range of function, allowing for assessment of treatment effectiveness or improvement of patients’ function. Clinically, this means that among the study population, 14% of patients demonstrated function at too high a level to be adequately differentiated by the sMFA, a limitation not present with the PROMIS PF CAT.
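The ceiling percentage defined above can be sketched directly from person measures and item difficulties placed on a common scale. The function names and the toy values are hypothetical, and the floor calculation is an assumed mirror image of the ceiling definition, since the article defines the floor effect only qualitatively.

```python
def ceiling_percent(person_measures, item_difficulties):
    """Percentage of persons whose measure exceeds the hardest item's difficulty."""
    hardest = max(item_difficulties)
    above = sum(1 for measure in person_measures if measure > hardest)
    return 100.0 * above / len(person_measures)

def floor_percent(person_measures, item_difficulties):
    """Assumed mirror image: percentage of persons below the easiest item."""
    easiest = min(item_difficulties)
    below = sum(1 for measure in person_measures if measure < easiest)
    return 100.0 * below / len(person_measures)

# Invented person measures and item difficulties on a shared logit scale.
toy_persons = [-1.0, 0.0, 1.0, 3.0]
toy_items = [-2.0, 0.0, 2.0]
toy_ceiling = ceiling_percent(toy_persons, toy_items)
toy_floor = floor_percent(toy_persons, toy_items)
```

In this invented example one of four persons lies above the hardest item, giving a 25% ceiling and no floor.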

DISCUSSION

This study shows that the PROMIS PF CAT is an accurate and precise replacement for the sMFA for collection of PF data from orthopaedic trauma patients with lower extremity problems. The PROMIS PF CAT is an assessment tool developed using IRT, the foundation of CAT. CATs make it possible to administer only the items from a bank that are targeted to the person’s likely location on the latent trait.8 Unlike classical test theory and legacy scores, CATs reduce the number of irrelevant questions by basing the selection of the next question on the response to the previous question and by asking questions only until a preset level of precision, chosen by the test administrator, is met. This results in highly precise individual measurements and reduces the time burden on the patient by limiting the number of questions to which the patient is asked to respond.

An instrument derived from classical test theory cannot have a missing response on any item, or the instrument is not valid, because all the items in that instrument function as a whole. However, an IRT-derived instrument can have missing responses and still be considered valid because each item has been validated individually. In other words, each item can function as an instrument by itself. Benefits of using IRT for questionnaire development include the following: having minimal concerns about missing data, facilitating the administration of CAT, providing more insight into each item and the test as a whole, and enabling objective measurement via the Rasch model.

Gosling et al12 compared the administration time of the Short Form Health Survey (SF-12), Sickness Impact Profile, and sMFA and found that the sMFA took participants between 10 and 15 minutes to complete using a paper-based form with multiple questions on one page, consistent with the administration time in this study. Administered by electronic means with 1 question per page on a tablet, the PROMIS PF CAT required less than one-tenth of the time (44 seconds) that the sMFA required (599 seconds) while achieving reliability equal to that of the sMFA. Asking a single question per page and not allowing the patient to advance without completing the last question reduces the risk of human error confounding the test instrument results.
Test–retest reliability is increased with the use of computer-based administration when compared with a paper-based form.13 Computer-based administration has also shown (a) better distributional properties, (b) lower means, (c) more variance, (d) higher internal consistency reliabilities, and (e) stronger intercorrelations.14 The time burden often limits patient enthusiasm for completing the questionnaire during both current and future visits, thereby making the collection of repeated measures difficult to impossible.

The PROMIS PF CAT also showed desirable psychometric properties, whereas the sMFA showed some restrictions in the upper end of the PF continuum. This can be problematic for assessing changes in outcomes and for assessing patients with a high level of physical functioning. In terms of instrument coverage, neither instrument showed any floor effect in this study; however, the sMFA revealed a 14.4% ceiling effect, whereas the PROMIS PF CAT had none. A ceiling effect occurs when an instrument is not precise in assessment at the upper end of functioning. If a patient continues to improve over time and an instrument has a noticeable ceiling effect, the instrument will not capture the change over time. Use of the PROMIS PF CAT rather than the sMFA may eliminate the need to worry about ceiling effects when assessing changes in outcomes over time, allowing more accurate determinations of continued improvement.

A weakness of the present study was the inability to collect data from every patient because of a lack of resources. Despite this limitation, we feel that our test group demographics reflect a true cross section of our trauma patient population. Moving forward, we are collecting serial assessments to further evaluate patient multiple measures, fatigue, and burnout. The improved instrument coverage and diminished time for patient completion may make the PROMIS PF CAT more effective and useful than the sMFA for both clinical and research purposes. However, further research is needed to examine the responsiveness to change of the PROMIS PF CAT over time and to compare responsiveness to change between the PROMIS PF CAT and the sMFA. Current research is being conducted to compare lower versus upper extremity patient responses to the PROMIS PF CAT and the sMFA and to validate a lower extremity PF CAT.11

CONCLUSIONS

Administered by electronic means with 1 question per page, the PROMIS PF CAT required less than one-tenth of the time that the sMFA required for patients to complete while achieving equally high reliability. The PROMIS PF CAT also showed desirable psychometric properties, whereas the sMFA showed some restrictions in the upper end of the PF continuum. This improved instrument coverage and diminished time for patient completion may make the PROMIS PF CAT more effective and useful than the sMFA for both clinical and research purposes.

REFERENCES

1. Fries JF, Spitz P, Kraines RG, et al. Measurement of patient outcome in arthritis. Arthritis Rheum. 1980;23:137–145.
2. Swiontkowski M, Engleberg R, Martin D, et al. Short musculoskeletal function assessment questionnaire: validity, reliability, and responsiveness. J Bone Joint Surg. 1999;81:1245–1260.
3. Revicki DA, Cella DF. Health status assessment for the twenty-first century: item response theory, item banking and computer adaptive testing. Qual Life Res. 1997;6:595–600.
4. Gershon RC, Rothrock NE, Hanrahan RT, et al. The development of a clinical outcomes survey research application: Assessment Center. Qual Life Res. 2010;19:677–685.
5. Gershon RC, Rothrock NE, Hanrahan RT, et al. The use of PROMIS and assessment center to deliver patient-reported outcome measures in clinical research. J Appl Meas. 2012;11:304–314.
6. Cella D, Yount S, Rothrock N, et al. The Patient Reported Outcomes Measurement Information System (PROMIS): progress of an NIH Roadmap Cooperative Group during its first two years. Med Care. 2007;45:S3–S11.
7. DeWalt DA, Rothrock N, Yount S, et al; for the PROMIS Cooperative Group. Evaluation of item candidates: the PROMIS qualitative item review. Med Care. 2007;45:S12–S21.
8. Rose M, Bjorner JB, Becker J, et al. Evaluation of a preliminary physical function item bank supports the expected advantages of the Patient-Reported Outcomes Measurement Information System (PROMIS). J Clin Epidemiol. 2008;61:17–33.
9. Hung M, Clegg DO, Greene T, et al. Evaluation of the PROMIS physical function item bank in orthopaedic patients. J Orthop Res. 2011;29:947–953.
10. Hung M, Carter M, Hayden C, et al. Psychometric assessment of the patient activation measure short form (PAM-13) in rural settings. Qual Life Res. 2013;22:521–529.
11. Hung M, Clegg DO, Greene T, et al. A lower extremity physical function computerized adaptive testing instrument for orthopaedic patients. Foot Ankle Int. 2012;33:326–335.
12. Gosling CM, Gabbe BJ, Williamson OD, et al. Validity of outcome measures used to assess one and six month outcomes in orthopaedic trauma patients. Injury. 2011;42:1443–1448.
13. Truman J, Robinson K, Evans AL, et al. The strengths and difficulties questionnaire: a pilot study of a new computer version of the self-report scale. Eur Child Adolesc Psychiatry. 2003;12:9–14.
14. Ployhart RE, Weekley JA, Holtz BC, et al. Web-based and paper-and-pencil testing of applicants in a proctored setting: are personality, biodata and situational judgment tests comparable? Pers Psychol. 2006;56:733–752.

Invited Commentary

As a technological leap to decrease the time burden of patient-reported outcomes collection, Hung et al demonstrate that the PROMIS PF computer adaptive testing (CAT) can deliver when compared with the 46-question short Musculoskeletal Function Assessment. As such, they should be applauded for taking an important “first step” in the use of CAT modeling in musculoskeletal care. Of the basic broad elements of a quality PRO instrument (validity, responsiveness, and reliability), future work is needed to demonstrate critical factors beyond the authors’ finding of high item reliability. Correlations between scores and the translation into clinically important and relevant meaning are a key next step. How, for example, does a “high score” on the PROMIS PF CAT compare with the disability or bother index of the short Musculoskeletal Function Assessment? For the purpose of future benchmarking and historical comparison, this would be important to understand when comparing treatment protocols, techniques, and procedures.

With that said, it is our belief that the future of the patient-reported outcomes movement will be heavily dependent on advances such as those represented here. Any and all ways to ensure greater patient engagement, clinician utility, and population-based information collection will undoubtedly change the way we understand and perform the work we do.

Daniel S. Horwitz, MD
Michael Suk, MD, JD, MPH
Danville, PA


In response: We thank Dr. Horwitz and Dr. Suk for their comments on our article and the editors for the invitation to respond. We fully agree that the future of patient-reported outcomes (PRO) heavily depends on advances such as those reported here. The ability to conduct comparative effectiveness studies and evaluate patient responses to treatment and intervention rests on the availability of high-quality, valid, and efficient PRO measures. In this study, we have taken the fundamental step in understanding the validity and reliability of the physical function computer adaptive test in orthopaedic trauma patients. Our next steps will focus on enhancing the comparability and interpretability of PRO measures in musculoskeletal care and discerning the influences on these measures. Specifically, our future projects include establishing a common metric between the physical function computer adaptive test and existing measures to facilitate comparability and benchmarking of results from different studies, and establishing the minimum clinically important difference for these measures to enable improved interpretation of scores. We welcome ideas and suggestions from the scientific community and hope to collaborate with providers, patients, researchers, and various stakeholders to accomplish this work and shape the future of value-driven health care.

Man Hung, PhD
Ami R. Stuart, PhD
Thomas F. Higgins, MD
Charles L. Saltzman, MD
Erik N. Kubiak, MD
