ORIGINAL CONTRIBUTION

Comparison of Physician-Rating and Self-Rating Scales for Patients With Major Depressive Disorder

Ching-Hua Lin, MD, PhD,*† Mei-Jou Lu, MS,* Julielynn Wong, MD, MPH,‡ and Cheng-Chung Chen, MD, PhD*†

Abstract: Physician-rating scales remain the standard in antidepressant clinical trials. The current study aimed to examine the discrepancies between physician-rating scales and self-rating scales for symptoms and functioning, before and after treatment, in newly hospitalized patients. A total of 131 acutely ill inpatients with major depressive disorder were enrolled to receive 20 mg of fluoxetine daily for 6 weeks. Symptom severity and functioning were assessed at baseline and again at week 6. Symptom severity was rated using the 17-item Hamilton Depression Rating Scale (HDRS-17) and the Zung Self-rating Depression Scale (ZDS). Functioning was measured by the Global Assessment of Functioning (GAF) and the Work and Social Adjustment Scale (WSAS). Pearson correlation coefficients (r) between HDRS-17 and ZDS and between GAF and WSAS were calculated at week 0 and week 6. Sensitivity to change was measured using effect sizes. One-hundred twelve patients completed the 6-week trial. After 6 weeks of treatment, correlations between HDRS-17 and ZDS or correlations between GAF and WSAS became larger from baseline to end point. All correlations were statistically significant (P < 0.001). Effect sizes measured by physician-rating scales (ie, HDRS-17 and GAF) were larger than by self-rating scales (ie, ZDS and WSAS). Correlations between baseline physician-rating scale scores and self-rating scale scores improved after 6 weeks of treatment. Physician-rating scales had larger effect sizes than self-rating scales. Physician-rating scales were more sensitive in detecting symptom or functional changes than self-rating scales.

Key Words: major depressive disorder, physician-rating scale, self-rating scale, correlation, effect size

(J Clin Psychopharmacol 2014;34: 716–721)

Physician-rating scales remain the standard in antidepressant clinical trials. Traditionally, physician-rating scales have been considered more reliable measures than self-rating scales for clinical trials.1,2 After intensive training, physician rating reaches a high degree of reliability.3 Assessing severity of illness using physician-rating scales is based on the physician’s own observations together with information given by the patients. There are, however, several limitations to physician-rating scales. First, they are performed by professionally qualified physicians, who use professional time, effort, and expense. Second, physicians may be unintentionally biased, assessing the severity of illness to meet the enrollment criteria for clinical trials.4 Third, psychiatrists may be unwilling to regard patients with specific personality disorders as those with severe depression.3 Fourth, psychiatrists may be inherently biased in their belief that a specific therapy provides greater therapeutic benefit than other therapies.3

From the *Kaohsiung Municipal Kai-Syuan Psychiatric Hospital; †Department of Psychiatry, School of Medicine, College of Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan; and ‡Center for Innovative Technologies and Public Health, Toronto, Canada.
Received March 27, 2014; accepted after revision July 15, 2014.
Reprints: Ching-Hua Lin, MD, PhD, 130 Kai-Syuan 2nd Rd, Ling-Ya District, Kaohsiung 802, Taiwan (e‐mail: [email protected]).
Copyright © 2014 by Lippincott Williams & Wilkins
ISSN: 0271-0749
DOI: 10.1097/JCP.0000000000000229


www.psychopharmacology.com

The increasing importance of self-rating scales was recently demonstrated in the Sequenced Treatment Alternatives to Relieve Depression study,5 which used self-rating scales (ie, Quick Inventory of Depressive Symptomatology) as an outcome measure. As a corollary to the limitations of physician-rating scales, several potential disadvantages have been noted when using self-rating scales. For example, patients can exaggerate or conceal symptoms, or exhibit response bias and social desirability effects.6 Furthermore, self-rating scales have implicit disadvantages in assessing aspects of psychopathology, including the following: appearance of depression, psychomotor retardation, adverse effects of medication, hypochondriacal ideation, or psychosis.7–9 Self-rating scales are appropriate for patients with mild to moderate depression.3 It is unclear, however, whether self-rating scales are truly effective for patients with limited education or diminished concentration spans.10

To our knowledge, no study has as yet compared a physician-rated to a self-rated scale and shown correlations with separate measures of function.7,11–14 The magnitude of clinical improvement has been shown to be greater with physician-rating scales than with self-rating scales.7,13,15–22 To explore these issues for Chinese inpatients, we conducted a post hoc analysis from a previous study.23 The first goal of this study was to replicate previous studies, to determine whether correlations between physician-rating scores and self-rating scores for symptom severity would improve over time. The second goal was to assess whether the physician-rating scale had a greater effect size than the self-rating scale from baseline to end point.

To evaluate the treatment for major depressive disorder in clinical trials, more attention is paid to depressive symptoms than to functional impairments.
The multimodal approach, including both physician-rating and self-rating scales, covering different domains such as depressive symptoms and functioning, has been the preferred method for assessing patients.8 The American Psychiatric Association guideline for the treatment of patients with major depressive disorder24 also emphasizes the importance of adding functional measures to adequately capture the full impact of depression and its treatment. To our knowledge, no study has previously compared a physician-rating scale with a self-rating scale for functioning. Therefore, the third goal was to extend the analysis of the symptom scales by examining correlations at baseline and end point using the 2 functioning scales. We hypothesized that the correlation between the physician-rating scale and the self-rating scale for functioning would become greater after treatment and that the physician-rating scale would be more sensitive at detecting functional change than the self-rating scale. The final goal was to investigate the agreement of the responses between the physician-rating scales and self-rating scales.

Journal of Clinical Psychopharmacology • Volume 34, Number 6, December 2014

Copyright © 2014 Lippincott Williams & Wilkins. Unauthorized reproduction of this article is prohibited.

MATERIALS AND METHODS

Subjects

This study was approved by Kai-Syuan Psychiatric Hospital’s institutional review board and conducted in accordance with Good Clinical Practice procedures and the current revision of the Declaration of Helsinki policy statement. Written informed consent was obtained from the participants after a full explanation of the study’s aims and procedures. This study was registered on ClinicalTrials.gov (Identifier number: NCT01075529).

Han Chinese subjects were recruited from Kai-Syuan Psychiatric Hospital, Kaohsiung, Taiwan. Participants were considered eligible if they were newly hospitalized patients for acute treatment, between 18 and 70 years old, physically healthy, and had a diagnosis of major depressive disorder using the Structured Clinical Interview for Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition.25 The exclusion criteria were as follows: a baseline score of 17-item Hamilton Depression Rating Scale (HDRS-17) of less than 18, Clinical Global Impression of Severity26 of less than 4, psychotic depression, bipolar I or II disorder, schizophrenia or any other psychotic disorder, a Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition diagnosis of substance abuse or dependence (including alcohol) within the past 6 months, mental disorders due to organic factors, severe cognitive impairment, initiating or ending formal psychotherapy within 6 weeks before enrollment, treatment-resistant depression (defined as a lack of response to 2 or more adequate courses of antidepressant treatment), a history of poor response to fluoxetine (20 mg/d for ≥4 weeks), a history of electroconvulsive therapy, and pregnancy or lactation.

Procedures and Assessments

After a washout period of at least 72 hours, the patients received open-label fluoxetine treatment at a fixed dose of 20 mg daily27 for 6 weeks. During the course of treatment, psychiatrists had the option of adding certain anxiolytic and/or sedative-hypnotic medications for brief periods, based on clinical necessity. No other psychotropic agents were used at bedtime for insomnia. Drug adherence was monitored and ensured by psychiatric nurses. Depressive symptom severity was evaluated by board-certified psychiatrists using HDRS-17 and by depressed patients using the Zung Self-rating Depression Scale (ZDS)28 at baseline (pretreatment) and again at the end of week 6 (after treatment).


Seventeen-item HDRS scores range from zero to 52, with higher scores indicating more severe depression. In the HDRS-17, behavioral and somatic symptoms account for at least 50% of the total score. The raters’ intraclass correlation coefficient of interrater reliability was 0.95 for HDRS-17. The ZDS contains 20 items that rate the affective, psychological, and somatic symptoms associated with depression. Item responses are ranked from 1 to 4, with higher scores corresponding to more frequent symptoms. Ten items are positively worded, and the other 10 are negatively worded. Behavioral and somatic symptoms explain about 50% of the total possible score on the ZDS. Therefore, ZDS is considered to match the HDRS-17 more closely than the Beck Depression Inventory (BDI), one of the most commonly used self-rating scales for depression.29

Functioning was measured by the Global Assessment of Functioning (GAF)30 and the Work and Social Adjustment Scale (WSAS).31 The GAF is used to report a physician’s judgment of a patient’s overall functioning. The GAF is scored on a 1–100 scale, with a lower score indicating more severe impairment of functioning. The WSAS is a self-rating scale consisting of 5 Likert-type items that measure an individual’s perception of work and social functioning, with higher scores representing greater impairment of functioning. Each item is scored from zero (not affected at all) to 8 (severely affected), with a maximum total score of 40.
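The item ranges described above can be illustrated with a minimal scoring sketch. It is not taken from the study. The function names are hypothetical, and the reverse-keying of some ZDS items (scoring a response as 5 minus its value) is an assumption based on the mixed item wording; which items are reverse-keyed is left to the caller, and the set used in the example is a placeholder only.

```python
def wsas_total(items):
    """Total WSAS score: 5 items, each rated 0 (not affected) to 8 (severely affected)."""
    if len(items) != 5 or any(not 0 <= item <= 8 for item in items):
        raise ValueError("WSAS expects 5 item ratings between 0 and 8")
    return sum(items)


def zds_raw_total(items, reverse_keyed):
    """Raw ZDS sum: 20 items rated 1 to 4.

    reverse_keyed holds the 0-based positions scored as 5 - response.
    Reverse-keying is an assumption of this sketch, not stated in the article.
    """
    if len(items) != 20 or any(not 1 <= item <= 4 for item in items):
        raise ValueError("ZDS expects 20 item ratings between 1 and 4")
    return sum(5 - item if pos in reverse_keyed else item
               for pos, item in enumerate(items))


# Hypothetical responses, for illustration only.
example_wsas = wsas_total([6, 5, 7, 4, 6])
placeholder_reverse_keyed = set(range(10, 20))  # placeholder, not the real item key
example_zds = zds_raw_total([3] * 20, placeholder_reverse_keyed)
print(example_wsas, example_zds)
```

With 5 items rated 0 to 8, the WSAS total cannot exceed 40, and a raw sum of 20 items rated 1 to 4 lies between 20 and 80.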

Statistical Analysis

We conducted a descriptive analysis by calculating the mean value of each clinical variable. Pearson correlation coefficients (r) between HDRS-17 and ZDS, as well as between GAF and WSAS, were calculated at baseline and again at week 6. The statistician Francis Anscombe32 created his quartet of data sets to emphasize the importance of visualization. Accordingly, each scatter plot was inspected visually to identify any simple linear relationship. Sensitivity to change was measured using effect sizes. We defined the effect size as the difference between the mean baseline score and the mean posttreatment score for each scale divided by the pooled standard deviation.33 A value of 0.20 indicates a small effect, 0.50 a medium effect, and 0.80 a large effect of the intervention.33
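The effect size definition can be sketched as follows. The article does not give the pooling formula, so this sketch assumes the pooled standard deviation is the root mean square of the baseline and posttreatment standard deviations; the function names are hypothetical. The input values are the means and standard deviations reported in Table 1.

```python
import math


def pooled_sd(sd_before, sd_after):
    """Root mean square of the two standard deviations (an assumed pooling formula)."""
    return math.sqrt((sd_before ** 2 + sd_after ** 2) / 2)


def effect_size(mean_before, sd_before, mean_after, sd_after):
    """Baseline mean minus posttreatment mean, divided by the pooled SD."""
    return (mean_before - mean_after) / pooled_sd(sd_before, sd_after)


# (mean before, SD before, mean after, SD after), taken from Table 1.
table1 = {
    "HDRS-17": (31.6, 6.7, 13.6, 8.2),
    "ZDS": (60.3, 8.5, 52.9, 12.1),
    "GAF": (41.0, 9.5, 62.8, 11.5),
    "WSAS": (30.2, 9.7, 21.6, 12.7),
}
effect_sizes = {scale: effect_size(*stats) for scale, stats in table1.items()}

for scale, value in effect_sizes.items():
    print(f"{scale}: {value:.3f}")
```

Under this assumption the sketch yields 2.404, 0.708, −2.067, and 0.761 for HDRS-17, ZDS, GAF, and WSAS, which agree with the effect sizes in Table 1 after rounding. The GAF value is negative because higher GAF scores indicate better functioning, so its magnitude is what should be compared with the other scales.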

TABLE 1. Descriptive Statistics, Effect Size, Correlations Between Physician-Rating Scale Scores and Self-Rating Scale Scores, and Kappa Values Before Treatment and After Treatment (N = 112)

Variable            Before Treatment   After Treatment   Effect Size   Score Change    No. Responders (%)
                    Mean ± SD          Mean ± SD                       Mean ± SD
Symptom severity
  HDRS-17           31.6 ± 6.7         13.6 ± 8.2        2.40          −18.0 ± 8.5     66 (58.9)
  ZDS               60.3 ± 8.5         52.9 ± 12.1       0.708         −7.5 ± 10.7     3 (2.7)
  r*                0.45†              0.68†                           0.46†
  Kappa‡                                                                               0.038
Functioning
  GAF               41.0 ± 9.5         62.8 ± 11.5       −2.067        21.8 ± 12.8     64 (57.1)
  WSAS              30.2 ± 9.7         21.6 ± 12.7       0.761         −8.5 ± 12.1     32 (28.6)
  r                 −0.40†             −0.63†                          −0.49†
  Kappa                                                                                0.192§

*r, Pearson correlation coefficient.
† P < 0.001.
‡ Kappa indicates agreement of response between the physician-rating scale and self-rating scale.
§ P < 0.05.
SD indicates standard deviation.


For purposes of comparing changes on physician-rating scales with changes on self-rating scales, response was defined as an improvement of 50% or more of the baseline total score for each scale after 6 weeks of treatment.2,34 Cohen’s kappa statistics were used to measure the agreement of responses between patient- and physician-rating scales. Cohen’s kappa coefficient can be classed from poor to very good (>0.8).35 All tests were two-tailed, and statistical significance was defined as an alpha of less than 0.05. All data were processed by SPSS version 17.0 for Windows (SPSS Inc, Chicago, IL) and MedCalc (MedCalc Software, Belgium).
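The response definition and the agreement statistic can be sketched as follows. This is an illustration rather than the study's procedure: the scores are invented, the function names are hypothetical, and the handling of scales on which higher scores mean better functioning (such as GAF) is an assumption, because the article does not spell it out.

```python
def is_responder(baseline, endpoint, higher_is_worse=True):
    """True if the score improved by at least 50% of its baseline value.

    For scales where higher scores are better (higher_is_worse=False),
    improvement is counted as an increase; this direction rule is assumed.
    """
    improvement = baseline - endpoint if higher_is_worse else endpoint - baseline
    return improvement >= 0.5 * baseline


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equally long lists of True/False classifications.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    p_a = sum(ratings_a) / n  # proportion classed as responders by rater A
    p_b = sum(ratings_b) / n  # proportion classed as responders by rater B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)


# Invented (baseline, week 6) score pairs for four patients, for illustration only.
physician_scores = [(30, 12), (28, 20), (34, 15), (26, 25)]
patient_scores = [(60, 28), (58, 55), (64, 50), (62, 60)]

physician_response = [is_responder(b, e) for b, e in physician_scores]
patient_response = [is_responder(b, e) for b, e in patient_scores]
kappa = cohen_kappa(physician_response, patient_response)
print(physician_response, patient_response, kappa)
```

In this toy example the two ratings agree on 3 of 4 patients while chance agreement is 0.5, so kappa is 0.5. Kappa corrects raw agreement for the agreement expected by chance, which is why two scales with very different response rates can show a kappa close to zero.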

RESULTS

Characteristics of Subjects

A total of 131 acutely ill inpatients with major depressive disorder were enrolled. One-hundred and twelve (85.5%) of them completed the 6-week fluoxetine trial as well as assessments of depressive symptoms (by HDRS-17 and ZDS) and functioning (by GAF and WSAS) at baseline and the end of week 6. Details of the patient sample have been presented in an earlier study,23 which showed that dropout patients (n = 19) and the completers (n = 112) were comparable for sex, age, age at onset, number of previous episodes, and baseline HDRS-17 scores. In the current study, we also found that the patients who had dropped out and those who completed the trial were comparable for ZDS (62.7 ± 6.7 vs 60.3 ± 8.5; P = 0.34), GAF (36.4 ± 12.3 vs 41.0 ± 9.5; P = 0.07), and WSAS scores (41.9 ± 9.5 vs 37.7 ± 12.1; P = 0.17). Eighty-eight (78.6%) of the completers were women. The mean ± SD age was 45.6 ± 11.0 years.

Table 1 shows the descriptive statistics, effect size, correlation between HDRS-17 and ZDS before treatment and after treatment, correlation between GAF and WSAS before treatment and after treatment, correlation between HDRS-17 score change and ZDS score change, correlation between GAF score change and WSAS score change, and kappa values.

Correlation

Figure 1 presents the scatter diagrams (HDRS-17 score vs ZDS score before treatment and after treatment only). Each scatter plot appears to approximate a simple linear relationship. Correlations between HDRS-17 and ZDS improved after 6 weeks of treatment (from r = 0.45 to r = 0.68). At baseline, the correlation between GAF and WSAS was −0.40. After treatment, the correlation between GAF and WSAS became stronger (r = −0.63). All correlations were statistically significant (P < 0.001).
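For readers unfamiliar with the statistic, a minimal sketch of the Pearson correlation coefficient follows. The scores are invented and the function name is hypothetical; the study itself used SPSS and MedCalc rather than this code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / math.sqrt(ss_x * ss_y)


# Invented scores for five patients, for illustration only.
physician_scores = [1, 2, 3, 4, 5]
patient_scores = [2, 4, 5, 4, 5]
r_same_direction = pearson_r(physician_scores, patient_scores)

# Two scales scored in opposite directions give a negative coefficient.
r_opposite_direction = pearson_r([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])
print(round(r_same_direction, 3), r_opposite_direction)
```

The second call shows why the correlations between GAF and WSAS are negative: higher GAF scores indicate better functioning, whereas higher WSAS scores indicate greater impairment, so the strength of the association is judged by the absolute value of r.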

Effect Size and Response

In Table 1, the effect size measured by HDRS-17 was larger than that measured by ZDS after 6 weeks of treatment. The effect size measured by GAF was also larger in magnitude than that measured by WSAS after 6 weeks of treatment. Physician-rating scales and self-rating scales differed in their sensitivity to change. The response rate measured by HDRS-17 was much greater than that measured by ZDS. The agreement between these 2 scales on response (kappa = 0.038) was not statistically significant. The response rate assessed by GAF was also larger than that assessed by WSAS. In this case, however, the level of agreement was statistically significant (kappa = 0.192; P < 0.05).

FIGURE 1. Scatter plots of the associations between 17-item HDRS and ZDS before treatment (week 0) (A) and after treatment (week 6) (B).

DISCUSSION

In this study, we found that correlations between physician-rating scale scores and self-rating scale scores increased from baseline to week 6, regardless of whether the scales measured symptom severity or functioning. Physician-rating scales had larger effect sizes than self-rating scales; that is, symptoms and functioning improved much more when assessed by physician-rating scales than when assessed by self-rating scales. Our findings are consistent with previously published studies7,11–22 of the scales used to measure symptom severity.

Correlations between HDRS-17 and ZDS and between GAF and WSAS increased substantially from before treatment to after treatment. In Table 1, there are significant correlations between the HDRS-17 score change and the ZDS score change, as well as between the GAF score change and the WSAS score change. Mean score change for HDRS-17 (−18.0 ± 8.5) is larger than that for ZDS (−7.5 ± 10.7). This indicates that a small change in ZDS is relatively well correlated with a marked change in HDRS-17. The work and activities item (item 7) of HDRS-17 has been used to measure patients’ functioning36,37 and ranges from zero (no difficulty) to 4 (stopped working because of current depression).37 When we used item 7 of HDRS-17 to measure patients’ functioning, its correlation with WSAS also became larger from baseline (r = 0.515; P < 0.001) to end point (r = 0.544; P < 0.001). Increased correlation between HDRS and BDI with repeated measurement has also been reported.13,38

One possibility is that physician-rating and self-rating scales measure different aspects.17,39,40 For example, symptoms such as guilt, retardation, hypochondriasis, and insight are not included in ZDS. Severity of illness also affects the level of correlation. Patients with severe illness may have difficulty completing the self-rating scales on their own, and their self-perception and cognitive distortions may lead to ratings that do not reflect reality.8,41,42 Physicians tend to recognize severe illness based on nonverbal evidence.3 As patients’ conditions improve after treatment, their verbal reports become more important because patients with less severe illness are more capable of clearly identifying their problems.42 Additionally, only patients with an HDRS-17 score of at least 18 were enrolled in this clinical trial. This restriction of range would lead to underestimation of the correlations between the scales.42

Effect size is used to measure the magnitude of a treatment effect.43 All effect sizes measured in this study approached or exceeded 0.8 in absolute value. Therefore, all scales were sensitive enough to detect improvements after the intervention. Physician-rating scales had larger effect sizes than self-rating scales in both symptom severity measures and functional measures. Because patients with depression might consider themselves less improved than their physicians do, patient ratings of improvement usually lag behind physician ratings of such change.44 There was also a profound difference between the response rates based on HDRS-17 and those based on ZDS (58.9% vs 2.7%; Table 1). The reason for this difference is that HDRS-17 is more sensitive to change than ZDS.
This is one possible reason that BDI and ZDS have very rarely been used to evaluate the effect of selective serotonin reuptake inhibitors.9 Nevertheless, some authors have argued that the larger effect size of physician-rating scales in clinical trials, compared with self-report, might be due to a physician’s bias in favor of treatment rather than to true sensitivity to change.18,19

Self-rating scales are useful in detecting the presence/absence of symptoms/functional impairment but not for quantifying their severity.1 For example, a previous study29 showed that the ZDS scores of inpatients (51.9 ± 12.1) did not differ significantly from those of the day hospital patients (56.3 ± 10.0) or the outpatients (49.0 ± 8.4). However, the HDRS-17 scores of the inpatients, day hospital patients, and outpatients were 29.5 ± 7.8, 23.7 ± 4.2, and 14.7 ± 5.8, respectively. This indicates that ZDS cannot distinguish the degree of severity of the 3 groups.

Our findings indicate that discrepancies still exist between physician-rating and self-rating scales, regardless of the scales used for measuring symptoms or functioning. Moreover, the level of agreement between HDRS-17 and ZDS to define response (kappa = 0.038) and the level of agreement between GAF and WSAS to define response (kappa = 0.192) were poor (ie, Cohen’s kappa coefficient
