The Journal of Pain, Vol 15, No 12 (December), 2014: pp 1360-1365

Using Multiple Daily Pain Ratings to Improve Reliability and Assay Sensitivity: How Many Is Enough?

Alicia Heapy,*,† James Dziura,† Eugenia Buta,† Joseph Goulet,*,† Joseph F. Kulas,*,† and Robert D. Kerns*,†

*Veterans Affairs Connecticut Healthcare System, West Haven, Connecticut.
†Yale School of Medicine, New Haven, Connecticut.

Abstract: The Initiative for Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT) has reported diminished assay sensitivity in pain treatment trials and recommended investigation of the causes. Specific recommendations included examination of outcome measure reliability and lengthening the baseline measurement period to allow more measurements to be collected. This secondary data analysis evaluated the minimum number of daily pain intensity ratings required to obtain a reliability of at least .90 and whether a composite of this smaller number of ratings was interchangeable with the composite of all ratings. Veterans Affairs medical center patients made 14 daily calls to an automated telephone system to report their average daily pain intensity rating. A single daily rating produced less than adequate reliability (intraclass correlation coefficient = .65), but a composite of the average of 5 ratings resulted in reliability above .90. A Bland-Altman analysis revealed that the differences between a 5-day composite and the composite of all ratings were small (mean = .09 points, standard deviation = .45; 95% confidence interval = -.05 to .23) and below the threshold for a clinically meaningful difference, indicating that the 2 measurements are interchangeable. Our results support the IMMPACT recommendations for improving assay sensitivity by collecting a multiple-day baseline of pain intensity ratings.

Perspective: This study examined the minimum number of pain ratings required to achieve reliability of .90 and examined whether this smaller subset of ratings could be used interchangeably with a composite of all available ratings. Attention to measure reliability could enhance the assay sensitivity, power, and statistical precision of pain treatment trials.

Published by Elsevier Inc. on behalf of the American Pain Society

Key words: Interactive voice response, assay sensitivity, chronic pain, self-report assessment, reliability.

The measurement of pain intensity is complicated by its very nature as a "personal, subjective experience influenced by cultural learning, the meaning of the situation, attention, and other psychological variables."11 Because pain is subjective and susceptible to the effects of placebo,20 its reliable and valid assessment in clinical trials is particularly important to ensure that any observed changes are attributable to specific rather than nonspecific effects (eg, participant expectancies) and/or study design.

Received May 8, 2014; Revised September 9, 2014; Accepted September 19, 2014.
This material is based on work supported by Janssen Pharmaceuticals. A.H., J.G., and R.D.K. were supported by the Veterans Health Administration Health Services Research and Development Service Center of Innovation (CIN 13-407). The authors have no conflicts of interest to declare. The project was not registered with clinicaltrials.gov because it began in 2003, prior to the requirement to register.
The views expressed in this article are those of the authors and do not necessarily reflect the position or policy of the Department of Veterans Affairs or the U.S. Government.
Address reprint requests to Alicia Heapy, PhD, VA Connecticut Healthcare System (11ACLGS), 950 Campbell Avenue, West Haven, CT 06516. E-mail: [email protected]
1526-5900/$36.00
Published by Elsevier Inc. on behalf of the American Pain Society
http://dx.doi.org/10.1016/j.jpain.2014.09.012


An extensive literature attests to attempts to measure pain intensity reliably (see the review by Jensen and Karoly9). However, a consensus statement by the Initiative for Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT) noted several recent trials in which previously efficacious analgesic medications failed to demonstrate superiority to placebo.3 This raised questions about the assay sensitivity of those trials, that is, "the ability of a trial to distinguish an effective from an ineffective treatment."7 Inadequate assay sensitivity undermines the interpretation of negative findings: a treatment may fail to separate from placebo because the trial could not detect a true difference rather than because the treatment is ineffective. In response, IMMPACT identified potential ways to improve assay sensitivity, including examining outcome measure reliability and lengthening the baseline assessment period to increase reliability.

Improved reliability reduces the occurrence of statistical regression to the mean and reduces the possibility that it will be interpreted as a treatment effect.14 Reliability also sets an upper bound on the power, effect size, and statistical precision of a trial.13 In a simulation, Perkins and colleagues demonstrated that improving outcome reliability from .70 to .90 would result in a 22% decrease in required sample size and an increase in study power from .64 to .72,16 thereby reducing study costs and increasing the likelihood of detecting treatment-related improvement.

Classical test theory dictates that reliability improves as the total number of measurements obtained, or the number of items in a measure, increases.15 Despite the benefits of improved reliability, there is little empirical guidance for determining how many pain intensity measurements are necessary to reach a reliable baseline, or at what point additional measurements no longer improve reliability but simply add to participant burden and trial costs. The IMMPACT recommendation of 1 week of daily baseline measurements (and 2 weeks for longer studies) was based on expert opinion, and further empirical examination was encouraged.

In 1 early study, Kerns and colleagues examined the stability of hourly pain intensity ratings over a 2-week period in 98 individuals with chronic pain.12 Ratings collected during week 1 were highly stable relative to those collected during week 2, both for pain intensity (r = .93) and for variability in pain intensity (r = .92). Jensen and McFarland10 found that single ratings were less reliable than composite scores in a sample of 200 participants who completed hourly pain ratings for 7 days; composite scores containing ratings across multiple days were required to achieve validity and stability above r = .90. Prior work has been criticized, however, for examining reliability using bivariate correlations between a subset of pain ratings and the grand mean of all available ratings.2 An association between 2 measures is not equivalent to demonstrating that the measures agree or can be used interchangeably, which is the presumed end goal of using a composite measure.

Given concerns about assay sensitivity in pain treatment trials, our aims were to 1) evaluate the reliability of a single average daily pain rating among patients with long-term chronic pain, 2) determine the minimum number of daily average pain ratings required to obtain a composite measure with reliability of .90 or higher, and 3) assess whether this composite of fewer than all of the daily pain ratings was interchangeable with the mean of all available ratings.

Method

Procedure Overview

This study was a secondary analysis of data collected from a 13-week randomized controlled open-label clinical trial examining the relative efficacy of transdermal fentanyl (TDF) compared to oral short-acting opioids for the treatment of chronic noncancer pain.


The parent study6 was designed to test the hypothesis that a long-acting medication, TDF, because it is associated with more stable and predictable levels of pain relief and improved sleep, would be associated with increased activity, reduced perceived disability, and improved sleep, functioning, and overall quality of life relative to short-acting opioids.

In the parent study, participants were asked to make daily reports of pain intensity and other pain-related outcomes via automated telephone calls enabled by an interactive voice response (IVR) system. IVR is a computerized interface that allows participants to respond to prerecorded questions using their telephone's numeric keypad. During 3 prespecified 2-week data collection periods, trial participants called the IVR system daily and answered 18 automated questions about their pain, its effect on their physical and emotional functioning, and their adherence to prescribed pain medications.5 The analysis presented here focuses solely on data obtained from the pain intensity question ("On a scale from 0 to 10, with 0 representing no pain and 10 representing the worst pain imaginable, rate your average pain today.") during the first 2-week period, prior to randomization and any associated medication changes.

The project was approved by the Veterans Affairs Connecticut Healthcare System's Human Studies Subcommittee and the Yale School of Medicine Human Investigation Committee. Participants were paid $5 for each completed daily call.

Participants

Participants were veterans receiving care at Veterans Affairs Connecticut Healthcare System who were recruited through advertisements, referrals from primary care providers, and direct invitation to patients seen in a multidisciplinary pain management clinic. Eligibility criteria included 1) presence of pain with average pain intensity of 4 or greater on a 0 (no pain) to 10 (worst pain imaginable) numeric rating scale, 2) daily use of an oral short-acting opioid equivalent of at least 60 mg oral morphine per day for at least 6 consecutive months prior to study enrollment, 3) age 18 or older, 4) no medical contraindications to TDF therapy, and 5) consent to urine drug screen to evaluate potential substance misuse. Persons with evidence of active alcohol or substance abuse or dependence, active psychotic disorder, active suicidal or homicidal risk, back surgery within the past 6 months, pregnancy, or lactation were excluded.

Data Analysis

We analyzed 14 daily baseline pain ratings of study participants, excluding subjects with fewer than 7 ratings during the 14-day period. Daily baseline pain ratings were collected prior to randomization and any randomization-related medication change from oral opioid to TDF. Because our aim was to examine the number of consecutive daily ratings needed to achieve .90 reliability, we used only those ratings made within 14 days of entry into the study. When the analysis required fewer than all of a participant's pain ratings, the first n ratings were used.
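As a concrete illustration of this selection rule (not the authors' code), the following is a minimal Python/pandas sketch, assuming a long-format table with hypothetical columns subject, day, and pain:

```python
import pandas as pd

def first_n_mean(df: pd.DataFrame, n: int) -> pd.Series:
    """Per-subject mean of the first n daily ratings in the 14-day window.

    Assumes columns 'subject', 'day' (1-14), and 'pain' (0-10); subjects
    with fewer than 7 ratings are excluded, as in the analysis above.
    """
    counts = df.groupby("subject")["pain"].count()
    kept = counts[counts >= 7].index
    eligible = df[df["subject"].isin(kept)].sort_values(["subject", "day"])
    first_n = eligible.groupby("subject").head(n)  # first n calls per subject
    return first_n.groupby("subject")["pain"].mean()
```

Calling first_n_mean(df, 5), for example, would return each retained subject's 5-day composite.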


We assumed that missed calls were "missing at random," meaning that on days when a participant missed a call, their pain level did not systematically differ from their pain levels on days when they did call. Although there is no way to fully test whether symptom-contingent calling occurred based on the observed data alone, an investigation of the call time stamps revealed that the distribution of the time of day the calls were made had a median of 6:53 PM (25th percentile = 5:43 PM, 75th percentile = 8:42 PM), indicating a clustering of calls in the early to late evening. We also examined the subject-specific spread, or interquartile range (IQR), of call times. The IQRs had a median of .9 hour (25th percentile = .5 hour, 75th percentile = 2.0 hours); that is, half of the subjects had the middle half of their calls fall within a rather narrow window of .9 hour or less.

First, we evaluated the reliability of a single pain intensity rating using the intraclass correlation coefficient (ICC) calculated from a 1-way random effects analysis of variance model, which assumes that, for each subject, a daily pain rating is the sum of a true underlying pain rating and measurement error.18 The ICC is a measure of reliability used when examining consistency between raters or multiple ratings.

Second, we used the Spearman-Brown Prophecy formula19 to determine the added reliability that could be achieved by successively increasing the number of daily pain ratings included in a composite rating of pain intensity versus a single rating; based on these results, we identified the number of ratings, m, needed to reach a reliability of .90 or greater. A target reliability of .90 was selected based on classical test theory and prior studies finding that reliability of .90 produced desirable effects on study power and the precision of statistical analyses relative to lower levels.13,16 Nunnally recommended selecting a target for reliability based on the objective of the assessment and its stakes: "In those applied settings where important decisions are made with respect to specific test scores, a reliability of .90 is the minimum that should be tolerated, and a reliability of .95 should be considered the desirable standard."15

The Bland-Altman method1 was then used to examine the agreement between the mean of the first m ratings of each subject and the mean of pain ratings across all 14 reporting days, or the "true" mean. By examining the agreement between the shorter composite measure and the 14-day mean, which we consider the gold standard for the purposes of this study, we were able to evaluate whether the shorter composite is interchangeable with the grand mean. The Bland-Altman plot was also used to determine whether systematic or fixed biases existed between the composite averages of pain intensity (ie, fewer than 14 days) and the 14-day mean.
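To make the first two analytic steps concrete, here is a minimal sketch in Python (NumPy), assuming a complete subjects-by-days array with no missing ratings; the function and variable names are hypothetical illustrations of the one-way random-effects ICC18 and the Spearman-Brown projection,19 not the authors' actual code.

```python
import numpy as np

def icc_1_1(ratings: np.ndarray) -> float:
    """ICC(1,1) from a 1-way random-effects ANOVA (Shrout & Fleiss, 1979).

    ratings: array of shape (n_subjects, k_days) with no missing values.
    """
    n, k = ratings.shape
    subject_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()
    # Between-subjects mean square: spread of subject means around the grand mean
    ms_between = k * ((subject_means - grand_mean) ** 2).sum() / (n - 1)
    # Within-subject mean square: day-to-day spread around each subject's mean
    ms_within = ((ratings - subject_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def spearman_brown(r1: float, m: int) -> float:
    """Projected reliability of the mean of m ratings, given single-rating reliability r1."""
    return m * r1 / (1 + (m - 1) * r1)

def min_days_for_reliability(r1: float, target: float = 0.90) -> int:
    """Smallest m whose Spearman-Brown projected reliability reaches the target."""
    m = 1
    while spearman_brown(r1, m) < target:
        m += 1
    return m
```

With the single-rating ICC of .65 reported in the Results, min_days_for_reliability(0.65) returns 5 and spearman_brown(0.65, 2) is approximately .79, matching Table 1.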

Prior reports have used correlation coefficients to evaluate the relationship between a single measurement and all available measurements. As noted by Bolton,2 a large and significant correlation can be present between 2 measures without one being an accurate reflection of the other. Measures may be associated because they are assessing the same variable, yet not be interchangeable. For example, a single pain rating and a mean weekly pain rating may move in the same direction (up or down) and therefore demonstrate a large correlation, but they may not display a strong degree of agreement. We therefore focused on agreement (could one measure be used in place of another?) rather than correlation to evaluate alternate measurement strategies. Whether a composite of fewer than all pain ratings can be used in place of all available ratings without loss of reliability is seldom tested directly, even though such substitution is often the objective of using a shorter or otherwise advantageous alternative measure in place of the gold standard.
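The agreement analysis can be sketched just as briefly. The following hedged example, with hypothetical names, computes the bias and limits of agreement between a shorter composite and the 14-day mean; it uses mean plus or minus 2 SD to match the labels in Fig 2 (1.96 SD is the other common convention).

```python
import numpy as np

def bland_altman(short_means: np.ndarray, full_means: np.ndarray):
    """Bias and limits of agreement between two measurement methods.

    short_means: per-subject mean of the first m ratings (e.g., m = 5).
    full_means:  per-subject mean of all available ratings (the "true" mean).
    Returns (bias, lower_limit, upper_limit) using mean +/- 2 SD.
    """
    diffs = short_means - full_means
    bias = diffs.mean()      # systematic difference between the 2 methods
    sd = diffs.std(ddof=1)   # spread of the subject-level differences
    return bias, bias - 2 * sd, bias + 2 * sd
```

Applied to the values reported below (bias = .09, SD = .45), this yields limits of approximately -.81 and .99, consistent with the -.82 and 1.00 shown in Fig 2.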

Results

Participant Characteristics

Participants were 43 individuals from a randomized clinical trial comparing the relative efficacy of TDF to short-acting opioids for treating chronic noncancer pain who provided at least 7 pain intensity ratings over the 14-day period. Of the 50 participants who provided written informed consent, 4 withdrew prior to completing any IVR calls and 3 were excluded from this analysis because they had fewer than 7 ratings during the 14-day period. Demographic data were not collected at baseline but later in the trial; thus, there are no demographic data for the participants who withdrew early (n = 12).

Participants were on average 54.8 years old (range = 26-79, standard deviation [SD] = 11.96) and had longstanding pain (M = 185.13 months, SD = 142, range = 24-528). Most were male (96.8%) and white (87.1%). During the 14-day reporting window, participants missed 10% of planned calls and completed an average of 12.6 calls (range = 8-14, SD = 1.6). Sixteen participants (37%) completed all 14 calls, 16 (37%) completed 12 to 13 calls, and 11 (26%) completed 8 to 11 calls. The mean of the pain intensity ratings across all 14 days was 6.13 (SD = 1.53), indicating that participants were experiencing moderate levels of pain.17

ICC

First, we calculated the ICC to determine the reliability of a single pain rating, or the average correlation between all possible pairs of measurements from each subject. The ICC was .65, with a 95% confidence interval of .55 to .76 based on Smith's method,3 indicating that approximately 65% of the variation in single pain ratings is variation between subjects and approximately 35% is measurement error.

The reliability that would be achieved by using a mean based on m repeated pain measurements (for several values of m) was calculated according to the Spearman-Brown Prophecy formula and is given in Table 1. Comparison of the ICC for a single rating (.65) and the 2-day composite rating (.79) demonstrates that substantial improvement in reliability was achieved with the addition of a single rating.


Table 1. Reliability of m-Day Mean Pain Ratings for Several Values of m

m     Reliability of m-day mean
2     .79
3     .85
4     .88
5     .90
6     .92
7     .93
8     .94
9     .94
10    .95
11    .95
12    .96
13    .96
14    .96
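As a check on these values, substituting the single-rating ICC of .65 into the Spearman-Brown formula, reliability of the m-day mean = m(.65) / (1 + (m - 1)(.65)), reproduces the table: for m = 2, 1.30/1.65 is approximately .79, and for m = 5, 3.25/3.60 is approximately .90, the first value to reach the .90 threshold.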

Though the largest increase in reliability occurs between 1 and 2 ratings, a 5-day composite rating was needed to surpass the .90 reliability threshold (95% confidence interval = 3-8 days) and 10 days to surpass .95 (95% confidence interval = 6-15 days).

Next, we assessed the agreement between pain rating means calculated based on 5 days of reporting, which resulted in a reliability of .90, and the mean of pain ratings across all available reporting days, or the "true" mean. The plot of the mean based on the first 5 ratings versus the mean of all ratings revealed good agreement (correlation of true mean and 5-day mean = .96; see Fig 1), with most points scattered closely along the equality line. A Bland-Altman plot of the difference between the 2 means (the mean of the first 5 ratings and the mean of all available ratings) versus the average of the 2 means revealed no obvious relationship between the difference and the average, indicating that no systematic bias exists between the shorter composite and total ratings (see Fig 2). On average, the difference between the mean of 5 measurements and the mean of all available measurements was .09 points (SD = .45; 95% confidence interval = -.05 to .23). The limits of agreement displayed in Fig 2 suggest that 95% of the time a participant's mean pain rating over 5 days would differ from their "true" pain rating (over 14 days) by between -.82 and 1.00 points. Because this range is less than 2 points, and therefore below the threshold of being clinically meaningful,4 we conclude that the 2 measurement methods are interchangeable.

Discussion

In this secondary analysis of data from a randomized clinical trial, we investigated the reliability of single and composite pain ratings and the agreement between the overall mean of 14 days of pain ratings and a smaller subset of those ratings. Our results demonstrate the psychometric advantages of obtaining more than 1 daily rating when assessing pain intensity using the numeric rating scale. In a sample of persons with longstanding chronic pain, a single pain rating produced less than adequate reliability (.65), but adding 1 pain rating to form a composite of 2 daily ratings produced a substantial improvement in reliability (.79). Continuing to add pain ratings to form a composite of 5 daily ratings resulted in reliability greater than .90.

Our results support IMMPACT recommendations for improving assay sensitivity in pain treatment trials by collecting a multiple-day baseline of pain intensity ratings. Although our result points to 5 days of ratings, given the small sample size and the possibility that the findings may not generalize to a broader population of patients with pain, we regard these findings as supportive of the IMMPACT recommendation of a multiple-day baseline. Our results, however, do not support obtaining more than a 7-day baseline, a suggested area of investigation in the IMMPACT recommendations for improving assay sensitivity. Adding ratings beyond 7 days did not result in an appreciable improvement in reliability and is therefore not likely justified in terms of the added participant burden and project costs, unless the stakes of the decisions being made on the basis of the ratings are high; in those cases, a reliability of .95 may be desired and a 10-day baseline may be required.

Our results are consistent with those of Jensen and McFarland,10 who also found that several days of ratings were required for test-retest reliability, internal consistency, and validity to reach .90. When examining validity, or the association between a subset of fewer than all ratings and the 1-week average, Jensen and McFarland found that after exceeding .90 with 5 days of daily ratings, adding more ratings resulted in minimal improvement.

Figure 1. Agreement between 5-day composite pain rating values and the "true" mean (5-day mean vs "true" mean, with equality line).

Figure 2. Bland-Altman plot of the difference between the mean of the first 5 ratings and the mean of all available ratings versus the average of the 2 means (mean difference = 0.09; limits of agreement: -0.82 to 1.00).


In addition to examining reliability, we evaluated the interchangeability of alternative measures, or the degree to which a composite of fewer than all available pain ratings could be used interchangeably with the mean of all 14 days of pain ratings. In this way, we extend the earlier work of Jensen and McFarland.10 Using a Bland-Altman plot, we determined that the difference between 1) the mean of 5 daily ratings (the point at which reliability reached .90) and 2) the mean of all available daily ratings was small and not clinically meaningful. We concluded that for this group of patients with chronic pain, the 2 measurement methods were interchangeable. This finding lends further support to the multiple-day composite as a valid measure of the 14-day mean and, by extension, of average pain more broadly.

The collection of a multiple-day baseline could be viewed as cumbersome, but it may be facilitated by technology-assisted assessment methods. A study by Jamison and colleagues demonstrated that participants displayed higher levels of daily reporting when using electronic diaries compared to paper diaries (89.9% vs 55.9%), and the ratings obtained via electronic diaries correlated highly with those from paper diaries.8 Similarly, the present study, using IVR-based assessments, replicated the findings of prior studies that used paper diaries while obtaining high rates of call compliance. The utility of these technology-assisted methods may be most relevant in the research context, where issues of assay sensitivity and statistical power and precision are of particular importance. Beyond the research setting, technology-assisted methods may put multiple-day assessment within reach in the clinical setting. The widespread availability of mobile telephones capable of collecting data via text message or mobile phone application (app) provides a method to facilitate this type of data collection. Wearable devices, such as accelerometers, that allow automated prompting and collection of patient-reported data can also be used to collect daily data.

Daily technology-assisted assessment may facilitate evaluation of the efficacy of new treatments or of modifications to an existing treatment, such as a change in analgesic medication dosage.

Some limitations of the study should be noted. The sample size was small, and although our findings are consistent with those of Jensen and McFarland,10 further replication with a larger sample is warranted. Because participants were responsible for initiating the daily call, it is possible that they were more likely to place a call during times of either high or low pain, violating the assumption that missed calls are missing at random. This concern is attenuated, though not fully dismissed, by our examination of call time patterns, which indicated that participants tended to place their calls within a narrow time window in the evening rather than in a more varied pattern that would suggest calls were associated with periods of either high or low pain. Future studies can avoid this uncertainty by removing the choice of call time from the participant's discretion and instead sending system-initiated calls at scheduled or random times. The almost exclusively male and white sample may also limit the generalizability of the findings.

In summary, the results of this study support IMMPACT recommendations for using a multiple-day composite of pain intensity ratings to enhance the assay sensitivity of pain treatment trials and for using a composite of fewer than all ratings as a valid and reliable substitute for the average of all ratings. Further examination of the reliability of outcome measures, and of the effect of reliability on study findings, would likely enhance assay sensitivity and reveal when inadequate reliability may be implicated in negative study findings. Future efforts to identify more precisely the inflection point at which additional assessments no longer yield benefit in terms of reliability, and how that point is influenced by the characteristics of participants and trials, could prove fruitful for enhancing assay sensitivity, reducing patient burden and trial costs, and promoting participant retention in trials.

References

1. Bland JM, Altman DG: Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 327:307-310, 1986
2. Bolton JE: Accuracy of recall of usual pain intensity in back pain patients. Pain 83:533-539, 1999
3. Dworkin RH, Turk DC, Peirce-Sandner S, Burke LB, Farrar JT, Gilron I, Jensen MP, Katz NP, Raja SN, Rappaport BA, Rowbotham MC, Backonja MM, Baron R, Bellamy N, Bhagwagar Z, Costello A, Cowan P, Fang WC, Hertz S, Jay GW, Junor R, Kerns RD, Kerwin R, Kopecky EA, Lissin D, Malamut R, Markman JD, McDermott MP, Munera C, Porter L, Rauschkolb C, Rice AS, Sampaio C, Skljarevski V, Sommerville K, Stacey B, Steigerwald I, Tobias J, Trentacosti AM, Wasan AD, Wells GA, Williams J, Witter J, Ziegler D: Considerations for improving assay sensitivity in chronic pain clinical trials: IMMPACT recommendations. Pain 153:1148-1158, 2012
4. Farrar JT, Young JP, LaMoreaux L, Werth JL, Poole RM: Clinical importance of changes in chronic pain intensity measured on an 11-point numerical pain rating scale. Pain 94:149-158, 2001
5. Heapy AA, Sellinger JJ, Higgins DM, Chatkoff DK, Bennett TC, Kerns RD: Using interactive voice response to measure pain and quality of life. Pain Med 8:S145-S154, 2007
6. Higgins DH, Sellinger JJ, Chatkoff DK, Heapy AA, Shulman M, Bennett T, Bellmore W, Kerns RD: Feasibility of interactive voice response for monitoring pain treatment effects. Paper presented at the annual meeting of the American Psychological Association, New Orleans, LA, 2006
7. International Conference on Harmonisation. E10: Choice of control groups and related issues in clinical trials. Available at: http://www.ich.org/fileadmin/Public_Web_Site/ICH_Products/Guidelines/Efficacy/E10/Step4/E10_Guideline.pdf. Accessed December 26, 2013
8. Jamison RN, Raymond SA, Levine JG, Slawsby EA, Nedeljkovic SS, Katz NP: Electronic diaries for monitoring chronic pain: 1-year validation study. Pain 91:277-285, 2001
9. Jensen MP, Karoly P: Self-report scales and procedures for assessing pain in adults, in Turk DC, Melzack R (eds): Handbook of Pain Assessment, 3rd ed. New York, Guilford Press, 2011, pp 19-44
10. Jensen MP, McFarland CA: Increasing the reliability and validity of pain intensity measurement in chronic pain patients. Pain 55:195-203, 1993
11. Katz J, Melzack R: Measurement of pain. Surg Clin North Am 79:231-252, 1999
12. Kerns RD, Finn P, Haythornthwaite J: Self-monitored pain intensity: Psychometric properties and clinical utility. J Behav Med 11:71-82, 1988
13. Lachin JM: The role of measurement reliability in clinical trials. Clin Trials 1:553-566, 2004
14. McDonald CJ, Mazzuca SA, McCabe GP: How much of the placebo "effect" is really statistical regression? Stat Med 2:417-427, 1983
15. Nunnally JC, Bernstein IH: Psychometric Theory, 2nd ed. New York, McGraw-Hill, 1978
16. Perkins DO, Wyatt RJ, Bartko JJ: Penny-wise and pound-foolish: The impact of measurement error on sample size requirements in clinical trials. Biol Psychiatry 47:762-766, 2000
17. Serlin RC, Mendoza TR, Nakamura Y, Edwards KR, Cleeland CS: When is cancer pain mild, moderate or severe? Grading pain severity by its interference with function. Pain 61:277-284, 1995
18. Shrout PE, Fleiss JL: Intraclass correlations: Uses in assessing rater reliability. Psychol Bull 86:420-428, 1979
19. Spearman CC: Correlation calculated from faulty data. Br J Psychol 3:271-295, 1910
20. Turner JA, Deyo RA, Loeser JD, Von Korff M, Fordyce W: The importance of placebo effects in pain treatment and research. J Am Med Assoc 271:1609-1614, 1994
