Patient DOI 10.1007/s40271-014-0071-2

SHORT COMMUNICATION

Test–Retest Reliability of an Interactive Voice Response (IVR) Version of the EORTC QLQ-C30

J. Jason Lundy • Stephen Joel Coons • Neil K. Aaronson



© Springer International Publishing Switzerland 2014

Abstract

Objective: The objective of this study was to assess the test–retest reliability of an interactive voice response (IVR) version of the European Organisation for Research and Treatment of Cancer (EORTC) QLQ-C30.

Methods: A convenience sample of outpatient cancer clinic patients (n = 127) was asked to complete the IVR version of the QLQ-C30 twice, 2 days apart. The QLQ-C30 is a 30-item, cancer-specific questionnaire composed of single-item and multi-item scales. The instrument has five functional scales (physical, role, cognitive, emotional, and social), three symptom scales (fatigue, pain, and nausea/vomiting), and a global quality-of-life scale. The remaining single items assess dyspnea, appetite loss, insomnia, constipation, diarrhea, and financial problems. The analyses focused on intraclass correlation coefficients (ICCs), comparing the ICC 95 % lower confidence interval (CI) value with a critical value of 0.70.

Results: The ICCs for the nine multi-item scales were all above 0.69, ranging from 0.698 to 0.926 (ICC 95 % lower CI value range 0.588–0.895). The ICCs for eight of the nine scales were significantly different from our threshold reliability of 0.70, with the exception being the cognitive functioning scale. The ICCs for the six single items ranged from 0.741 to 0.883 (ICC 95 % lower CI value range 0.646–0.835), and three of the six were statistically different from 0.70. The evidence supports the stability of 11 of the 15 scores obtained on the IVR version of the QLQ-C30 upon repeated measurement.

Conclusion: The measurement equivalence of the IVR and paper versions of the QLQ-C30 has been reported elsewhere. This analysis provides evidence demonstrating adequate test–retest reliability of the IVR version for 11 of the QLQ-C30's 15 scores.

J. J. Lundy (corresponding author) • S. J. Coons
Patient-Reported Outcome Consortium, Critical Path Institute, 1730 E. River Road, Tucson, AZ 85718-5893, USA
e-mail: [email protected]

N. K. Aaronson
Division of Psychosocial Research and Epidemiology, Department of Psychosocial Research, The Netherlands Cancer Institute, 121 Plesmanlaan, 1066 CX Amsterdam, The Netherlands

Key Points for Decision Makers

• The European Organisation for Research and Treatment of Cancer (EORTC) QLQ-C30 is a widely used cancer-specific instrument, which has been migrated to the interactive voice response (IVR) platform.
• This study assessed the stability of scores produced by the IVR version of the QLQ-C30.
• The results of the test–retest analyses support the stability of the scores derived from the IVR version of the QLQ-C30 in this study sample.

1 Introduction

There has been increasing interest in the migration of paper-based patient-reported outcome (PRO) instruments to electronic modes of data collection. Many studies have assessed the equivalence between the original paper-based


instruments and electronic versions [1]. However, once measurement equivalence between the data collection modes is demonstrated, it is important to assess the stability of scores produced by the electronic version.

The European Organisation for Research and Treatment of Cancer (EORTC) QLQ-C30 is a widely used cancer-specific instrument, which includes five functional scales (physical, role, cognitive, emotional, and social), three symptom scales (fatigue, pain, and nausea/vomiting), and a global quality-of-life scale. The remaining single items assess dyspnea, appetite loss, insomnia, constipation, diarrhea, and financial problems. Each multi-item scale and each single-item score is linearly converted to a 0–100 scale. For the functional scales and the overall quality-of-life scale, a higher score indicates a higher level of functioning and well-being (e.g., better emotional functioning or quality of life). For the symptom scales and items, higher scores indicate a higher level of symptoms or problems (e.g., more pain). The scale and item scores are not combined into an overall score; rather, each is reported individually [2].

Recently, the QLQ-C30 was adapted for use in an interactive voice response (IVR) system. Evidence supporting the equivalence of the scores obtained from the original paper-and-pencil version and the IVR version has been reported previously [3]. The objective of this study was to assess the test–retest reliability of the IVR version of the QLQ-C30.
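The linear 0–100 conversion described above can be sketched in code. The formulas below follow the published EORTC scoring convention (raw score = mean of the item responses; functional scales are reversed so that higher scores mean better functioning); they are an illustrative assumption here, since this article states only that the conversion is linear:

```python
# Sketch of the QLQ-C30 linear 0-100 scoring described in the text.
# Assumed convention (EORTC scoring manual, not stated in this article):
#   raw score RS = mean of the item responses in the scale;
#   functional scales:            score = (1 - (RS - 1) / item_range) * 100
#   symptom scales/items, global: score = ((RS - 1) / item_range) * 100
# Most items use a 1-4 response scale (range 3); the global quality-of-life
# items use 1-7 (range 6).

def raw_score(items):
    """Mean of the item responses for one scale."""
    return sum(items) / len(items)

def functional_score(items, item_range=3):
    """Higher score = better functioning."""
    rs = raw_score(items)
    return (1 - (rs - 1) / item_range) * 100

def symptom_score(items, item_range=3):
    """Higher score = more symptoms/problems (also used for global QoL
    with item_range=6, where higher = better quality of life)."""
    rs = raw_score(items)
    return ((rs - 1) / item_range) * 100
```

For example, a functional scale answered entirely with the best response (all 1s) scores 100, while the worst responses (all 4s) score 0; a single symptom item answered 1 scores 0.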

2 Methods

Test–retest reliability was assessed by administering the IVR version of the QLQ-C30 to a single group of subjects on two occasions, 2 days apart. The primary threat to the validity of this type of study design is a potential carryover or memory effect. Thus, the intervening time between administrations of the QLQ-C30 was chosen to be sufficiently short to assume that the subjects' health status had not changed, but long enough to "wash out" any memory effects [4]; however, it should be noted that an optimal retest period has not been empirically demonstrated.

The study hypotheses tested intraclass correlation coefficients (ICCs) between the two administrations, as well as mean scale/score differences. The required number of subjects was derived from sample size calculations [5] for our tests of the ICC, which was the primary analysis. The recruitment goal was 110 subjects, which was a sufficient sample to be 95 % certain that the ICC was above 0.70.

2.1 Subjects

A non-probability sample was obtained from the patient population at the Arizona Cancer Center outpatient clinics.

To qualify for inclusion in this study, subjects needed to be cancer survivors 18 years of age or older who were not on active cancer treatment (i.e., post-treatment). Further, the subjects had to have access to, and the ability to use, a touch-tone phone, and had to understand written and spoken English. The potential study subjects reflected a variety of cancer types and stages; studying subjects with a variety of cancer types increases the generalizability of the test–retest findings.

The potential subjects were recruited through the outpatient oncology offices/clinics at the Arizona Cancer Center. Interested individuals had the opportunity to enroll on the basis of face-to-face contact with a recruiter at the clinics. Also, individuals who had learned about the study from our flyers were able to enroll by calling a dedicated phone line. All recruitment procedures were in full compliance with applicable Health Insurance Portability and Accountability Act requirements, and the study was conducted under the auspices of the University of Arizona's Human Subjects Protection Program. All subjects who agreed to complete the study questionnaires received a $20 gift card.

2.2 Data Analysis

All statistical analyses were performed using SPSS version 16.0 (SPSS Inc., Chicago, IL, USA) and evaluated using an alpha (α) level of 0.05. The overall research objective was to assess the test–retest reliability of the IVR version of the QLQ-C30. Two measures of test–retest equivalence were assessed: the reliability of the scores as measured by the ICC, and the mean difference between the scores obtained via the two administrations. We used the ICC to test the hypotheses regarding the stability of the scores derived from the IVR version of the QLQ-C30 across the two assessments. The ICC may be conceptualized as the ratio of between-subjects variance to total variance [6].
The ICC is calculated on the basis of an analysis of variance model that includes factors for mode and subject; the analysis is the same regardless of whether mode is treated as a fixed effect or as a random effect. The absolute agreement form of the ICC (form 3,1 in the notation of Shrout and Fleiss [6]) was used in this study. A one-sided 95 % confidence interval (CI) value for the lower bound was computed using the formula provided in McGraw and Wong [7]. Equivalence on this measure was considered to have been established if the lower bound of the 95 % CI exceeded 0.70 [8]. Although the ICC is the preferred means of assessing score stability between two administrations of the same instrument in a test–retest context [9], we also used a paired Student's t test of the mean scores from the two administrations of the QLQ-C30 to assess the robustness of
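As a hedged sketch, the absolute-agreement ICC for an n-subjects × 2-administrations score matrix can be computed from the two-way ANOVA mean squares, following the McGraw and Wong formulation cited above. The function name is illustrative, and the one-sided lower-bound CI computation used in the paper is omitted for brevity:

```python
import numpy as np

# Sketch of an absolute-agreement ICC (McGraw & Wong's ICC(A,1)) from the
# two-way ANOVA decomposition of an n_subjects x k_administrations matrix.
# This illustrates the general approach; the paper's exact SPSS computation
# and CI formula are not reproduced here.
def icc_absolute_agreement(data):
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    grand = data.mean()
    row_means = data.mean(axis=1)   # per-subject means
    col_means = data.mean(axis=0)   # per-administration means

    # Mean squares: subjects (rows), administrations (columns), residual
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)
    sse = ((data - row_means[:, None] - col_means[None, :] + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))

    # Absolute agreement penalizes systematic shifts between administrations
    # via the msc term in the denominator.
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Note that a constant shift between the two administrations lowers this ICC even when the subjects' rank order is preserved, which is exactly why the absolute-agreement form suits a test–retest stability question.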


the ICC findings. We calculated the mean difference together with the associated two-sided 95 % CI for the difference. We considered the scores obtained at the two points in time to be equivalent if the 95 % CI of the mean difference was wholly contained within the equivalence interval. The equivalence interval was estimated a priori on the basis of the estimated minimally important difference (MID) of the scale or item score. A distribution-based approach (i.e., one half of a standard deviation) was used to estimate the MIDs [10] for the QLQ-C30 scale and item scores from data obtained from stage I and II cancer patients [11]. The equivalence intervals listed in Table 2 are based on the full width of the estimated MID for that scale or item (e.g., if the equivalence interval is −4.9 to +4.9, the MID for that scale or item is estimated to be 9.8 points). The QLQ-C30 test–retest results were based on the entire sample of subjects who completed both administrations within 72 h.
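The mean-difference equivalence check described above can be sketched as follows. The bounds of ± half the MID (so that the interval's full width equals the MID, which is itself estimated as one half of a reference standard deviation) follow the paper's description; the function and variable names are illustrative assumptions:

```python
import numpy as np
from scipy import stats

# Sketch of the paired mean-difference equivalence check described in the
# text: the two-sided 95% CI of the mean difference must lie wholly inside
# an equivalence interval whose full width equals the distribution-based
# MID estimate (one half of a reference SD).
def equivalence_test(first, second, reference_sd, alpha=0.05):
    diffs = np.asarray(first, float) - np.asarray(second, float)
    n = diffs.size
    mean_diff = diffs.mean()
    se = diffs.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    lo, hi = mean_diff - t_crit * se, mean_diff + t_crit * se

    mid = 0.5 * reference_sd   # distribution-based MID estimate
    bound = mid / 2            # interval runs -bound..+bound (full width = MID)
    equivalent = (-bound < lo) and (hi < bound)
    return mean_diff, (lo, hi), equivalent
```

With small, symmetric differences between administrations and a reference SD of 20 points (MID = 10, bounds ±5), the CI sits well inside the interval and equivalence is declared.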

3 Results

A total of 163 subjects agreed to participate, with 127 completing both administrations, for a response rate of 78 % (Fig. 1). Of the 163 subjects who were recruited, 86 % (n = 140) were recruited in person, and 14 % (n = 23) were enrolled over the phone after responding to a flyer. Of the 36 subjects who did not complete both IVR administrations, 11 completed the first IVR administration only, while the remaining 25 subjects did not attempt any IVR administration. Thus, the test–retest analytical sample consisted of the 127 subjects who completed both IVR administrations.

The sample was 65.4 % female, and the mean age was 61.6 years (range 20–83 years). The two largest survivorship groups were breast cancer (31.5 %) and melanoma (28.3 %), followed by colorectal cancer and lymphoma (9.5 % and 7.1 %, respectively). The remaining 30 subjects were divided among 19 different cancer types. Of the 127 subjects who completed both assessments, six subjects had a total of 12 missing responses on the IVR version of the QLQ-C30. Of those 12 missing responses, 10 were on the second administration of the QLQ-C30.

3.1 Reliability

The ICCs for the nine multi-item scales in the QLQ-C30 were all above 0.69, ranging from 0.698 to 0.926 (Table 1). The ICCs for eight of the scales were significantly different from our threshold reliability of 0.70, with the only exception being the cognitive functioning scale. The ICCs for the six single items in the QLQ-C30 ranged from 0.741 to 0.883. The ICCs for three of the six single items were

Fig. 1 Study flow diagram. Subjects recruited (n = 163): completed study (n = 127 [78 %]); dropped out (n = 36 [22 %]), of whom 11 (31 %) completed one IVR administration and 25 (69 %) did not attempt either. IVR interactive voice response

significantly higher than the minimally acceptable reliability of 0.70, with the exceptions being the appetite loss, constipation, and diarrhea items. However, the ICC point estimates for these three single items were all above 0.70. Hence, the evidence supported the stability of 11 of the 15 scores derived from the IVR version of the QLQ-C30.

3.2 Means and Mean Differences

The adjusted means and standard deviations for each of the two IVR administrations of the multi-item scales and items in the QLQ-C30 are reported in Table 2, along with the results from the paired t tests for the mean differences and

Table 1 Test–retest QLQ-C30 scale and item intraclass correlation coefficients (ICCs) and 95 % lower confidence interval (CI) values

QLQ-C30 scale or item          ICC (95 % lower CI value)
Quality-of-life scale          0.860 (0.803)
Physical functioning scale     0.926 (0.895)
Role functioning scale         0.793 (0.714)
Emotional functioning scale    0.848 (0.768)
Cognitive functioning scale    0.698 (0.588)*
Social functioning scale       0.804 (0.728)
Fatigue scale                  0.858 (0.799)
Nausea/vomiting scale          0.868 (0.814)
Pain scale                     0.889 (0.843)
Dyspnea item                   0.797 (0.711)
Insomnia item                  0.854 (0.793)
Appetite loss item             0.782 (0.699)*
Constipation item              0.741 (0.646)*
Diarrhea item                  0.777 (0.692)*
Financial problems item        0.883 (0.835)

* ICC not significantly different from 0.70

Table 2 Test of mean score differences

QLQ-C30 scale or item         First admin.,        Second admin.,       Equivalence     Mean difference
                              adjusted mean (SD)   adjusted mean (SD)   interval        (95 % CI)
Quality-of-life scale         78.39 (17.83)        78.83 (17.95)        −5.9 to +5.9    −0.442 (−2.21 to +1.33)
Physical functioning scale    87.42 (15.79)        88.12 (15.91)        −4.7 to +4.7    −0.696 (−1.82 to +0.43)
Role functioning scale        86.23 (22.00)        86.09 (19.86)        −6.9 to +6.9    0.145 (−2.35 to +2.64)
Emotional functioning scale   77.68 (18.26)        81.09 (18.15)        −6.0 to +6.0    −3.406 (−5.17 to −1.64)
Cognitive functioning scale   84.65 (15.22)        87.28 (15.73)        −5.1 to +5.1    −2.632 (−4.83 to −0.43)
Social functioning scale      88.30 (20.19)        90.20 (19.31)        −6.5 to +6.5    −1.901 (−4.18 to +0.38)
Fatigue scale                 24.64 (20.33)        22.51 (20.31)        −6.4 to +6.4    2.126 (+0.15 to +4.10)
Nausea/vomiting scale         4.53 (10.70)         4.39 (9.42)          −3.7 to +3.7    0.146 (−0.82 to +1.11)
Pain scale                    17.54 (23.03)        16.81 (22.13)        −6.7 to +6.7    0.725 (−1.25 to +2.70)
Dyspnea item                  9.57 (18.08)         6.67 (15.42)         −6.8 to +6.8    2.900 (+0.97 to +4.82)
Insomnia item                 27.25 (25.20)        24.35 (23.50)        −7.7 to +7.7    2.900 (+0.51 to +5.29)
Appetite loss item            10.43 (20.88)        9.28 (19.01)         −6.5 to +6.5    1.160 (−1.28 to +3.60)
Constipation item             13.74 (22.96)        11.70 (20.31)        −6.1 to +6.1    2.047 (−0.84 to +4.93)
Diarrhea item                 11.98 (21.78)        9.94 (18.78)         −4.9 to +4.9    2.047 (−0.46 to +4.55)
Financial problems item       15.07 (27.30)        14.49 (27.26)        −6.4 to +6.4    0.580 (−1.86 to +3.02)

CI confidence interval, SD standard deviation


associated 95 % CIs. All 15 of the mean differences tested were within the equivalence intervals set a priori, supporting the equivalence of the means between the two IVR administrations of the QLQ-C30 for all scales and items.

4 Discussion

The results of the test–retest analyses generally support the consistency of the scores produced by the IVR version of the QLQ-C30 upon repeated administration within a short time interval, where it is assumed that no or only very minor changes will have taken place in the subjects' functional health or symptom burden.

There are few previous studies with which to compare the results from this study regarding the test–retest reliability of the QLQ-C30 [12]. Test–retest analyses with earlier versions of the QLQ-C30 [13] used suboptimal (and thus non-comparable) analytic approaches (i.e., Pearson's correlation coefficients rather than ICCs). Our results are comparable with or better than those of earlier studies for which direct comparisons are possible [14–16]. In our study, 11 of the 15 multi-item scales and items had ICCs significantly above our threshold ICC value of 0.70. The cognitive functioning scale and the three single items assessing appetite loss, constipation, and diarrhea did not demonstrate equivalence on the basis of the ICC. However, Uwer and colleagues [16] have previously presented evidence that the cognitive functioning scale and the appetite loss, constipation, and diarrhea items may have lower levels of reliability than the threshold value used in this study (i.e., 0.70). For instance, they reported an ICC lower CI value for the cognitive functioning scale of 0.62, compared with the finding in this study of 0.588. Similarly, their ICC lower CI values for the appetite loss item (0.54), the diarrhea item (−0.003), and the constipation item (0.54) were all lower than those found in this study. Hence, it is possible that the IVR version of the QLQ-C30 is being held to a higher reliability standard than that of the original paper version.
An important limitation of this study was the potential for selection bias, in that we employed a non-probability, convenience sample and we offered a small financial incentive for subjects to participate. One cannot exclude the possibility that individuals who did not volunteer to participate in this study would have generated somewhat different results. Furthermore, the distribution of cancer types within gender did not entirely reflect that of the overall population of cancer patients. That, in part, is due to the specialty areas of the clinicians at the recruitment site (i.e., the Arizona Cancer Center). Also, since we did not recruit subjects undergoing active treatment, these findings may not be generalizable to that patient population. In addition, it was not possible to determine if the QLQ-C30's test–retest reliability varies as a function of certain demographic characteristics (e.g., age, gender) and clinical characteristics (e.g., cancer type/site, stage of disease, performance status, and treatment status); this would require a sample size substantially larger than that of the current study. Despite these limitations, we believe that the sample was sufficiently diverse to test the reliability of the IVR version of the QLQ-C30.

5 Conclusion

The results of the primary test–retest analyses (i.e., the ICCs) support the stability of 11 of the 15 scores derived from the IVR version of the QLQ-C30 in this study sample. Further research will be needed to determine if the test–retest reliability of the IVR version is better or worse than that of the original paper version, since new modes of data collection (e.g., IVR) should not be held to a higher standard of test–retest reliability than the original data collection mode (i.e., paper).

Acknowledgments

The data used for this research were collected as part of a study funded by ClinPhone PLC (now Perceptive Informatics). Additional support was provided by Arizona Cancer Center Support Grant CA023074 from the National Cancer Institute. The authors gratefully acknowledge the staff and facility support provided by the University of Arizona College of Pharmacy and the Arizona Cancer Center's Behavioral Measurements Shared Service. J.J.L. and S.J.C. were employed by the University of Arizona at the time when this study was conducted. The authors have no financial interest in Perceptive Informatics or ClinPhone PLC. Each author made a substantial contribution to the conception, design, and content of the manuscript; was involved in drafting the manuscript and revising it critically for important intellectual content; has given final approval of the version to be published; and has agreed to be accountable for all aspects of the work.

References

1. Gwaltney CJ, Shields AL, Shiffman S. Equivalence of electronic and paper-and-pencil administration of patient-reported outcome measures: a meta-analytic review. Value Health. 2008;11:322–33.
2. Aaronson NK, Ahmedzai S, Bergman B, Bullinger M, Cull A, Duez NJ, Filiberti A, Flechtner H, Fleishman SB, De Haes JC. The European Organization for Research and Treatment of Cancer QLQ-C30: a quality-of-life instrument for use in international clinical trials in oncology. J Natl Cancer Inst. 1993;85:365–76.
3. Lundy JJ, Coons SJ, Aaronson NK. Testing the measurement equivalence of paper and interactive voice response system versions of the EORTC QLQ-C30. Qual Life Res. 2014;23:229–37.
4. Marx RG, Menezes A, Horovitz L, Jones EC, Warren RF. A comparison of two time intervals for test–retest reliability of health status instruments. J Clin Epidemiol. 2003;56:730–5.
5. Streiner DL, Norman GR. Health measurement scales: a practical guide to their development and use. 4th ed. New York: Oxford University Press; 2008.
6. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420–8.
7. McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods. 1996;1(1):30–46 (Correction: Vol. 1, No. 4, 390).
8. Nunnally JC, Bernstein IH. Psychometric theory. 3rd ed. New York: McGraw-Hill; 1994.
9. Coons SJ, Gwaltney CJ, Hays RD, et al. Recommendations on evidence needed to support measurement equivalence between electronic and paper-based patient-reported outcome (PRO) measures: ISPOR ePRO Good Research Practices Task Force report. Value Health. 2009;12(4):419–29.
10. Norman GR, Sloan JA, Wyrwich KW. Interpretation of changes in health-related quality of life: the remarkable universality of half a standard deviation. Med Care. 2003;41:582–92.
11. Scott NW, Fayers PM, Aaronson NK, Bottomley A, de Graeff A, et al. The EORTC QLQ-C30 reference values manual. Brussels: The EORTC Quality of Life Group; 2008.
12. Luckett T, King MT, Butow PN, Oguchi M, Rankin N, Price MA, Hackl NA, Heading G. Choosing between the EORTC QLQ-C30 and FACT-G for measuring health-related quality of life in cancer clinical research: issues, evidence, and recommendations. Ann Oncol. 2011;22:2179–90.
13. Hjermstad MJ, Fossa SD, Bjordal K, Kaasa S. Test–retest study of the European Organisation for Research and Treatment of Cancer core quality of life questionnaire. J Clin Oncol. 1995;13:1249–54.
14. Velikova G, Wright EP, Smith AB, et al. Automated collection of quality-of-life data: a comparison of paper and computer touch-screen questionnaires. J Clin Oncol. 1999;17:998–1007.
15. King M, Winstanley J, Kenny P, Viney R, Zapart S, Boyer M. Validity, reliability and responsiveness of the EORTC QLQ-C30 and EORTC QLQ-LC13 in Australians with early stage non-small cell lung cancer. CHERE Working Paper 2007/13, Centre for Health Economics Research and Evaluation, Sydney. 2007. http://www.chere.uts.edu.au/pdf/wp2007_13.pdf. Accessed 1 Jun 2013.
16. Uwer L, Rotonda C, Guillemin F, Miny J, Kaminsky MC, et al. Responsiveness of EORTC QLQ-C30, QLQ-CR38 and FACT-C quality of life questionnaires in patients with colorectal cancer. Health Qual Life Outcomes. 2011;9:70.
