Movement Disorders, Vol. 6, No. 4, 1991, pp. 330-335 © 1991 Movement Disorder Society

Interobserver Reliability Between Neurologists in Training of Parkinson’s Disease Rating Scales: A Multicenter Study

Giuliano Geminiani, *Bruno M. Cesana, †Filippo Tamma, ‡Patrizia Contri, §Claudio Pacchetti, ‡Francesco Carella, †Roberto Piolti, §Emilia Martignoni, Paolo Giovannini, Floriano Girotti, and Tommaso Caraceni

Istituto Nazionale Neurologico “C. Besta,” Milan, *Epidemiologic Laboratory, Ospedale Maggiore I.R.C.C.S., Milan, †Department of Neurology, Ospedale S. Gerardo, Monza, ‡Department of Neurology, Ospedale Sacco, Milan, and §Istituto Neurologico “C. Mondino,” Pavia, Italy

Summary: A multicenter study has been conducted to determine the interobserver reproducibility of four of the most frequently used rating scales for Parkinson’s disease: the Columbia University Rating Scale (CURS) and the Webster Rating Scale (WRS), both for assessing clinical signs; the Northwestern University Disability Scale (NUDS); and the Hoehn and Yahr staging. Four resident neurologists, inexperienced in the use of the four scales, independently examined 48 parkinsonian patients. The extent to which their assessments agreed was determined by calculating the Cohen k index after the scores had been recodified. The physicians’ scores agreed substantially for the CURS and the Hoehn and Yahr scale, while those for the NUDS and the WRS agreed only moderately. Analysis of individual item scores within the scales suggests improvements that would offer greater interobserver consistency.

Key Words: Parkinson’s disease; Rating scale; Reliability.

Address correspondence and reprint requests to Dr. G. Geminiani, Istituto Nazionale Neurologico “C. Besta,” via Celoria 11, 20133 Milan, Italy.

In Parkinson’s disease, quantitative clinical evaluation of illness severity constitutes the basic method by which disease evolution and efficacy of treatments are monitored in patients (1). There are three main types of Parkinson’s disease rating scales. First, there are clinical sign evaluation scales such as the Columbia University Rating Scale (CURS) (2), all items except one of the Webster Rating Scale (WRS) (3), about half the items in the University of California, Los Angeles Scale (4), some parts of the New York University Parkinson’s Disease Disability Scale (5), and part of the King’s College Hospital Parkinson’s Disease Rating Scale (6). Second, there are daily living disability rating scales such as the Northwestern University Disability Scale (NUDS) (7), the Activities of Daily Living Scale (8), part of the University of California, Los Angeles Scale (4), part of the New York University Parkinson’s Disease Disability Scale (5), and part of the King’s College Hospital Parkinson’s Disease Rating Scale (6). Third is the clinical staging of Hoehn and Yahr (9).

Within each category, the scales differ from one another in numerous ways. The number of items used to evaluate a given clinical sign may vary: for example, in the CURS five items evaluate muscular rigidity, while in the WRS only one is used. Other scales, such as the University of California, Los Angeles Scale and the New York University Parkinson’s Disease Disability Scale, employ numerical constants to give different weightings to individual item scores within the total score. The former scale employs a weighting factor of 2 for seborrhea and of 10 for walking, for example. Another difference between scales concerns the number of severity gradations for each item; the University of California, Los Angeles Scale gives only a three-point gradation (absent, present, marked), while the NUDS employs an 11-point gradation; most scales use four or five severity gradations. Scales also differ with respect to the criteria used to evaluate a given sign. For example, in the WRS, posture is assessed in terms of the amount of neck flexion measured in inches, whereas in the CURS, posture is not measured directly but is evaluated as a whole. In the CURS the examiner is instructed to determine whether rigidity can easily be overcome manually, whereas in the WRS there is no such suggestion.

Almost all the scales that have been widely used do not provide for assessment of the clinical phenomena usually encountered after long-term L-dopa treatment or in the advanced phase of the disease: phenomena such as dyskinesias, wearing off, on-off, akinesia paradoxica, autonomic symptoms, mental deterioration, and psychopathological disorders. In some rating scales (e.g., in the New York University Parkinson’s Disease Disability Scale) ad hoc items for assessing dyskinesias have been included; in other cases (King’s College Hospital Parkinson’s Disease Rating Scale), the rating scale is applied in both the “on” and “off” phases. Recently the Unified Parkinson’s Disease Rating Scale has been proposed (10), which includes items designed to assess both motor fluctuations and the complications of treatment.
Studies on Parkinson’s disease today, be they clinical, pharmacological or epidemiological, often require large numbers of patients, entailing collaboration between different neurological centers. Consequently, statistical analyses are conducted on evaluations performed by several examiners. It is important, therefore, to ascertain how uniformly rating scale tests are performed and their results interpreted by clinicians from different institutions.

Few studies have been concerned with reliability of Parkinson’s disease rating scales. In the study by Ginanneschi et al. (11), six neurologists examined the patients together; in other studies (12,13) the assessments were based on video recordings. A methodology involving direct and individual examination of the patient by each clinician seems the correct one for this type of study. The use of videotape cannot be considered satisfactory since it is not sufficient simply to assess a given clinical sign; the diagnostic maneuvers prescribed to elicit that sign must also be studied, since these as well as their interpretation may vary. In a study by Kennard et al. (12) on the reproducibility of the WRS involving the use of video, some neurologists complained that some of the procedures described by Webster and used in the video recording were not their standard technique. The use of video or any methodology not involving the examiner himself carrying out the diagnostic maneuvers may therefore lead to overestimation of the actual uniformity. Moreover, direct examination is necessary to assess important clinical signs such as rigidity that are impossible to gauge from a video recording.

In this study the reliability of four of the more frequently used Parkinson’s disease rating scales was assessed by determining the concordance between examiners from four different neurological centers as they applied the scales independently to evaluate patients’ conditions.

METHODS

Rating Scales

The scales investigated were the CURS (2), the WRS (3), the NUDS (7), and the Hoehn and Yahr clinical staging scale (9). The CURS and WRS evaluate clinical signs. The former contains 25 items and assigns the severity of each to a 0-4 scale; the latter has 10 items employing a 0-3 severity scale. The NUDS, a disability assessment scale, contains five items, each gradable 0-10. The Hoehn and Yahr staging scale assigns the patient to one of five stages indicating the overall severity of his illness. Administration of the entire protocol of four rating scales took on average 20 min for each patient.
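The structure of the three item-based scales described above can be written down compactly. The following is only an illustrative sketch: the class name `Scale`, its field names, and the derived `max_total` property are assumptions introduced here, and the totals are plain unweighted sums of the item ranges given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scale:
    """Hypothetical container for an item-based rating scale."""
    name: str
    n_items: int      # number of items, as stated in the text
    max_score: int    # highest severity gradation per item (lowest is 0)

    @property
    def max_total(self) -> int:
        # Unweighted maximum total: every item at its highest gradation.
        return self.n_items * self.max_score


SCALES = [
    Scale("CURS", n_items=25, max_score=4),
    Scale("WRS", n_items=10, max_score=3),
    Scale("NUDS", n_items=5, max_score=10),
]

for scale in SCALES:
    print(f"{scale.name}: {scale.n_items} items, total range 0-{scale.max_total}")
```

The Hoehn and Yahr staging is omitted from the sketch because it assigns a single stage (one of five) rather than a sum of item scores.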
Subjects

In order that the study test reproducibility in the widest possible neurological context, the four resident neurologists chosen as examiners (each from a different hospital) were not particularly expert in the evaluation of Parkinson’s disease patients. Each examiner was, however, instructed beforehand by neurologists from his/her respective institution experienced in the application of the rating scales. There was no prior accord among the examiners as to how to apply the scales nor on how to interpret the various clinical pictures they might observe. All had a complete written description of the scales to be used. The neurological centers were the “C. Besta” National Neurological Institute, Milan; the “C. Mondino” Neurological Institute, Pavia; the Department of Neurology, Ospedale Sacco, Milan, and the Department of Neurology, Ospedale San Gerardo, Monza, both part of the University of Milan.

Patients

Patients were selected from among those attending the four centers on an outpatient basis. They all suffered from idiopathic Parkinson’s disease but did not have unpredictable or unexpected variations in their clinical state, so that during examination by two neurologists in succession there would be no change in their clinical situation. Forty-eight patients were selected, 27 men and 21 women; their mean age was 58.2 years (SD ±5.0); mean illness duration was 6.0 years (±4.2); mean length of treatment, 4.7 years (±3.0); and mean Hoehn and Yahr stage, 2.5 (±0.7). None of the patients had ever been examined previously by, nor were they known to, any of the testing neurologists.

Procedure

At each neurological center in turn an assessment session was organized in which four patients were examined. There were 12 sessions in all, three at each center. The testing neurologists were divided into six pairs and each pair examined eight patients on each of the four scales. The patients were examined first by one of the pair and immediately after by the other. The order in which the pair examined patients alternated. All six possible examiner pairs were formed during the course of the sessions. Each examiner assessed 24 of the total 48 patients.
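The arithmetic of this design (four examiners, six pairs, eight patients per pair, 24 patients per examiner) can be checked with a short sketch. The examiner labels and the simple alternation rule below are assumptions made for illustration; the allocation of patients to the 12 sessions and four centers is not modelled.

```python
from collections import Counter
from itertools import combinations

EXAMINERS = ["A", "B", "C", "D"]   # hypothetical labels for the four residents
PATIENTS_PER_PAIR = 8              # from the text: each pair examined eight patients

# All unordered pairs of examiners: C(4, 2) = 6.
pairs = list(combinations(EXAMINERS, 2))

# Build a schedule of (patient_id, (first_examiner, second_examiner)).
# The examination order within a pair alternates from one patient to the next.
schedule = []
patient_id = 0
for first, second in pairs:
    for i in range(PATIENTS_PER_PAIR):
        order = (first, second) if i % 2 == 0 else (second, first)
        schedule.append((patient_id, order))
        patient_id += 1

# Number of patients seen by each examiner.
load = Counter(examiner for _, order in schedule for examiner in order)

print(len(pairs), "pairs,", len(schedule), "patients")
print(dict(load))
```

Each patient contributes exactly one pair of ratings, so the agreement statistics are computed on 48 paired assessments per scale.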
Statistical Methods

The extent of agreement between the examiners was analyzed using a model allowing for a fixed number of doctors examining the patients and for more than two severity gradations for the items. The model produces an estimate of agreement for each gradation within an item, as well as for each item as a whole, in terms of the k statistic (14). If k = 0 the agreement between the examiners is not better than that expected by chance; if k = 1 there is perfect agreement, and if k < 0 the agreement is less than that expected by chance and may be interpreted as disagreement. Landis and Koch (15) proposed the following interpretations of the numerical values of k: between 0 and 0.20 the strength of agreement is slight; between 0.21 and 0.40 it is fair; between 0.41 and 0.60 it is moderate; between 0.61 and 0.80 it is substantial; and between 0.81 and 1 it is almost perfect. Of course, the maximum value that the k statistic can achieve, which is limited by the marginal distributions of its contingency table, should also be taken into account.

Given that many of the gradations within items were rarely marked, where appropriate adjacent gradations were run together, taking into account the clinical significance of the score attributed to each. Thus, gradations 1 and 2, and 3 and 4 were united in the CURS; in the WRS, gradations 1 and 2 were joined together; gradations 4 and more were run together in the NUDS; Hoehn and Yahr stages I and II, and IV and V were united. Furthermore, the total CURS and WRS scores, obtained by adding the constituent items recodified as detailed earlier, were themselves assigned to a position on a four-point scale to render the rating scales mutually comparable. Specifically, CURS scores were divided 0-10, 11-20, 21-30, and >30; WRS scores were divided 0-6, 7-12, 13-18, and >18. For the NUDS it was possible to obtain only three gradations: 0-5, 5-10, and >10.
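The recoding rules and the k statistic described above can be sketched as follows. This is a minimal illustration and not the model used in the study: it implements the simple two-rater Cohen k rather than the multi-examiner model with per-gradation estimates, and all function names, the numeric labels given to merged categories, and the example ratings are assumptions introduced here.

```python
from collections import Counter

# Merged gradations as described in the text (labels of merged classes are arbitrary).
CURS_RECODE = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2}
WRS_RECODE = {0: 0, 1: 1, 2: 1, 3: 2}
HY_RECODE = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}   # stages I-II and IV-V united


def recode_nuds(score):
    """Gradations 4 and more are run together."""
    return min(score, 4)


def bin_total(total, upper_bounds):
    """Map a total score to an ordinal class, e.g. (10, 20, 30) -> 0-10, 11-20, 21-30, >30."""
    return sum(total > bound for bound in upper_bounds)


def cohen_kappa(r1, r2):
    """Two-rater Cohen k: (observed - chance agreement) / (1 - chance agreement)."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(r1)
    p_obs = sum(a == b for a, b in zip(r1, r2)) / n
    m1, m2 = Counter(r1), Counter(r2)
    p_exp = sum(m1[c] * m2[c] for c in set(m1) | set(m2)) / n ** 2
    if p_exp == 1.0:
        raise ValueError("k is undefined when both raters use a single identical category")
    return (p_obs - p_exp) / (1 - p_exp)


def kappa_max(r1, r2):
    """Largest k attainable given the two raters' marginal distributions."""
    n = len(r1)
    m1, m2 = Counter(r1), Counter(r2)
    cats = set(m1) | set(m2)
    p_max = sum(min(m1[c], m2[c]) for c in cats) / n
    p_exp = sum(m1[c] * m2[c] for c in cats) / n ** 2
    return (p_max - p_exp) / (1 - p_exp)


def landis_koch(k):
    """Verbal label for k following the Landis and Koch ranges quoted in the text."""
    if k < 0:
        return "less than chance"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if k <= upper:
            return label
    return "almost perfect"


# Hypothetical raw CURS item scores (0-4) from two examiners on eight patients.
rater_1 = [CURS_RECODE[s] for s in [0, 1, 3, 2, 0, 4, 1, 2]]
rater_2 = [CURS_RECODE[s] for s in [0, 2, 4, 3, 0, 3, 1, 0]]
k = cohen_kappa(rater_1, rater_2)
print(round(k, 2), landis_koch(k), round(kappa_max(rater_1, rater_2), 2))
```

Applied to the total-score values reported in this study, `landis_koch` returns "substantial" for 0.62 and 0.68 and "moderate" for 0.42 and 0.50, matching the labels used in the Results.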

RESULTS

From analysis of the total scores for each scale it emerged that the examiners were “substantially” concordant in their CURS (k = 0.62) and Hoehn and Yahr (k = 0.68) assessments. There was “moderate” concordance for the WRS (k = 0.42) and the NUDS (k = 0.50). The k values for individual items are given in Table 1. For only a few of these was there even moderate concordance. There was greatest agreement (k > 0.4) for the items evaluating tremor, both in the WRS and the CURS; for alternating movements, facial expression, posture, and postural stability in the CURS; and for bradykinesia in the WRS. Generally satisfactory concordance was obtained for assessment of disability in daily activity (last item of the WRS and the walking, dressing, and hygiene items of the NUDS).

The least concordant items (k generally less than 0.30) were those assessing rigidity in the CURS and the WRS, posture in the WRS, speech in the NUDS, seborrhea in the CURS and the WRS, and bradykinesia and foot tapping in the CURS. Certain items in the CURS and the WRS that were directly comparable (specifically posture and bradykinesia) showed different concordance within each scale. Speech, seborrhea, and gait, on the other hand, had similar k values in both the CURS and the WRS. For the CURS and WRS items most characteristic of Parkinson’s disease, the k values for each severity gradation within these items were also calculated (Table 2). There was moderate agreement between the examiners for the reported absence of a clinical sign (gradation 0), but for the intermediate gradations of severity there was generally slight or fair concordance.

TABLE 1. Interobserver concordance (k index) for each item of the CURS, WRS, and NUDS

Item

CURS

WRS

Facial expression Seborrhea Sialorrhea Speech disorder Tremor upper R upper L lower R lower L facial Rigidity upper R upper L lower R lower L axial Finger tapping (R-L) Foot tapping (R-L) Succession movements (R-L) Bradykinesia Arising from chair Posture Postural stability Gait/walking Upper extremity swing Self-care Dressing Hygiene Eating and feeding

0.48 0.25 0.36 0.39

0.37 0.29

NUDS

-

0.34 0.68

0.52 0.32 0.73 0.80 0.47 0.20 0.20 0.24 0.30 0.25 0.53 0.47-0.53 0.24-0.23 0.47-0.36 0.28 0.37 0.43 0.36 0.35 -

0.73

-

0.18

-

0.33 0.31 0.70

-

CURS, Columbia University Rating Scale; WRS, Webster Rating Scale; NUDS, Northwestern University Disability Scale; R, right; L, left.

TABLE 2. Interobserver concordance (k index) for each severity gradation within the more important CURS and WRS items

Category

0 (score 0) 1 (scores 1 and 2) 2 (scores 3 and 4)^a CURS rigidity (upper right) tremor (upper right) bradykinesia finger tapping (right) succession movements (right) postural stability gait WRS rigidity tremor bradykinesia gait

0.44 0.58 0.79 0.49 0.51 0.53 0.38

0.16 0.36 0.25 0.40 0.44 0.31 0.34

-0.07

0.66 0.74 0.73 0.22

0.18 0.67 0.73 0.29

-0.05

0.67 0.08 0.56 0.48 0.40 -0.01 0.48 1

CURS, Columbia University Rating Scale; WRS, Webster Rating Scale. ^a The WRS includes only score 3.

DISCUSSION

The interobserver agreement of these scales cannot be considered satisfactory, particularly when it is recalled that the gradations were recodified to give coarser severity distinctions than originally intended. The best interobserver agreement was obtained for the Hoehn and Yahr scale, a clinical staging scale that provides an overall assessment of illness severity based on clinical features and functional disability and that is poorly sensitive to slight changes in clinical status (16).

Interobserver concordance for the CURS and the WRS was lower in this study than reported previously (11,13). At least in part this must be due to the fact that in our study the examiners did not agree on uniform interpretations and techniques beforehand. Moreover, in this study the examiners were not expert in the evaluation of parkinsonian patients. We regarded inexperienced neurologists as appropriate for evaluation of reproducibility of the rating scales in Parkinson’s disease because they are often in charge of the follow-up observation of patients in clinical trials.

The greater reproducibility of the CURS compared to the WRS has been noted by others (11). In principle, scales that contain fewer (coarser) severity gradations are more reproducible. Hence, the individual items in the WRS (severity range, 0-3) achieve higher k values than the corresponding items in the CURS (severity range, 0-4). Analogously, the k values for individual gradations within a range are more discordant in the middle than at the ends of that range. Notwithstanding, the greater number of items in the CURS (25, as opposed to 10 in the WRS) permits the clinical picture to be defined more accurately and leads to greater concordance between examiners. Accurate definition of clinical status is connected with consideration of a range of different diagnostic features, not with the fractionation of a single character into subitems: if the several items in the CURS that assess a single clinical feature (tremor or rigidity) are compared with the corresponding single item in the WRS, it is seen that the reproducibility of the CURS is not substantially superior for those items (11).

In both the CURS and the WRS the items assessing rigidity were among the least reproducible. There thus appear to be clinical features that, though fundamental in defining the disease, produce poorly concordant evaluations of their severity, whatever rating scale is used.
By contrast, the items concerned with tremor seem much more reproducible, in all probability because tremor is such a distinct and obvious symptom. The conspicuously low k value for the “posture” item in the WRS suggests that accurately determined measurements (in this case the extent of head flexion measured in inches) that nonetheless evaluate a clinical feature partially and somewhat artificially are of less use than more general criteria commonly used in clinical practice (overall impression of forward flexion of the trunk), even though the latter are less accurately formulated. The consistent interobserver assessment of the “bradykinesia” item in the WRS seems related to its specification by several evaluation criteria.

The evaluation of disability in daily activities is based on anamnestic information, so good interobserver reliability is expected. However, our results showed that the NUDS was only moderately reliable; a reason for this could be the large number of severity gradations in this scale.

It is evident that many of the difficulties encountered today when using the better-known assessment scales derive from the fact that after a few years of levodopa treatment the parkinsonian patient presents a clinical picture that is profoundly different from any envisaged when the scales were devised. Even where both off- and on-phase evaluations are considered, it is problematic to apply the classical scales to parkinsonian patients with motor fluctuations. In these patients it is difficult to quantify dyskinesia, to distinguish the intermediate phases of motor performance, and to evaluate disability in those activities that are avoided in off phases. The Unified Parkinson’s Disease Rating Scale (10) does not completely overcome these difficulties. Moreover, it consists of items from the long-established scales (including the CURS and the Hoehn and Yahr) whose reproducibility leaves something to be desired.

Our conclusions resulting from this study can be summarized as follows. Clinical sign scales should include only items that are clearly relevant to defining the clinical state and are easily quantifiable. A cardinal diagnostic sign such as rigidity that is poorly reproducible could be excluded from scales whose objective is not diagnosis. This type of rating scale would remain fundamental for evaluating slightly compromised patients in the uncomplicated phase of their illness. For patients with motor fluctuations it seems helpful to use self-assessment items more extensively. Self-assessment may be limited by patient compliance and the possibility of disagreement between patient and examiner (17).
On the other hand, Brown and co-workers (18) reported that parkinsonian patients can provide accurate self-report of their level of disability. Self-report of on-off hours might allow easy and reproducible quantification of motor fluctuations. The use of simple objective motor tests (e.g., count of finger tapping or screw time) would allow more accurate evaluation of akinesia. The disability rating scale and clinical staging of Hoehn and Yahr should be retained with some modifications to accommodate patients with motor fluctuations.

Finally, with regard to the conduct of multicenter studies, we feel it is important that all examiners have equal familiarity with the rating scales. In particular, it is important that examiner training sessions be held prior to the trial, to harmonize criteria for both the application and interpretation of the rating scales.

Acknowledgment: Research was supported by Regione Lombardia Grant no. 956, 1988.

REFERENCES

1. Larsen TA, LeWitt PA, Calne DB. Theoretical and practical issues in assessment of deficits and therapy in parkinsonism. In: Calne DB, Horowski R, McDonald RJ, Wuttke W, eds. Lisuride and other dopamine agonists. New York: Raven Press, 1983:363-373.
2. Duvoisin RC. The evaluation of extrapyramidal disease. In: de Ajuriaguerra J, Gauthier G, eds. Monoamines noyaux gris centraux et syndrome de Parkinson. Geneve: George & Cie, 1971:313-325.
3. Webster DD. Critical analysis of the disability in Parkinson’s disease. Mod Treat 1968;5:257-282.
4. Markham CH. The choreoathetoid movement disorder induced by levodopa. Clin Pharmacol Ther 1971;12:340-343.
5. Lieberman A. Parkinson’s disease: a clinical review. Am J Med Sci 1974;277:66-80.
6. Parkes JD, Zilkha KJ, Calver DM, Knill-Jones RP. Controlled trial of amantadine hydrochloride in Parkinson’s disease. Lancet 1970;1:259-262.
7. Canter GJ, de La Torre R, Mier M. A method for evaluating disability in patients with Parkinson’s disease. J Nerv Ment Dis 1961;133:143-147.
8. England AC, Schwab RS. Post-operative evaluation of selected patients with Parkinson’s disease. J Am Geriatr Soc 1956;4:1219-1232.
9. Hoehn MM, Yahr MD. Parkinsonism: onset, progression and mortality. Neurology 1967;17:427-442.
10. Fahn S, Elton RL, members of the UPDRS Development Committee. The Unified Parkinson’s Disease Rating Scale. In: Fahn S, Marsden CD, Calne DB, Goldstein M, eds. Recent developments in Parkinson’s disease. Vol 2. Florham Park, New Jersey: Macmillan Healthcare Information, 1987:153-163, 293-304.
11. Ginanneschi A, Degl’Innocenti F, Magnolfi S, et al. Evaluation of Parkinson’s disease: reliability of three rating scales. Neuroepidemiology 1988;7:38-41.
12. Kennard C, Munro AJ, Park DM. The reliability of clinical assessment of Parkinson’s disease. J Neurol Neurosurg Psychiatry 1984;47:322-323.
13. Montgomery GK, Reynolds NC, Warren RM. Qualitative assessment of Parkinson’s disease: study of reliability and data reduction with an abbreviated Columbia Scale. Clin Neuropharmacol 1985;18:83-92.
14. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Measurement 1960;20:37-46.
15. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159-174.
16. Diamond SG, Markham CH. Evaluating the evaluations, or how to weigh the scales of parkinsonian disability. Neurology 1983;33:1098-1099.
17. Golbe LI, Pae J. Validity of mailed epidemiological questionnaire and physical self-assessment in Parkinson’s disease. Mov Disord 1988;3:245-254.
18. Brown RG, MacCarthy B, Jahanshahi M, Marsden CD. Accuracy of self-reported disability in patients with parkinsonism. Arch Neurol 1989;46:955-959.

