SCIENTIFIC ARTICLE

Inter- and Intrarater Reliability of Osteoarthritis Classification at the Trapeziometacarpal Joint

R. M. Choa, MBChB, H. P. Giele, MS

Purpose: To assess the reliability of the Eaton and Glickel classification for base of thumb osteoarthritis.

Methods: The interrater and intrarater reliability of this classification were assessed by comparing ratings from 6 raters using quadratic weighted kappa scores.

Results: Median interrater reliability ranged from a kappa of .53 to .54; intrarater reliability ranged from a kappa of .60 to .82. Using unweighted kappa, interrater reliability was “slightly” reliable, and intrarater reliability was “fairly” reliable. Overall, the value of the intraclass correlation for all 6 raters was .56.

Conclusions: This radiological classification does not describe all stages of carpometacarpal joint osteoarthritis accurately enough to permit reliable and consistent communication between clinicians. Therefore, we believe it should be used with an understanding of its limitations when communicating disease severity between clinicians or as a tool to assist in clinical decision making. (J Hand Surg Am. 2015;40(1):23-26. Copyright © 2015 by the American Society for Surgery of the Hand. All rights reserved.)

Type of study/level of evidence: Diagnostic III.

Key words: Trapeziometacarpal joint, osteoarthritis, thumb.

From the Department of Plastic Reconstructive and Hand Surgery, Oxford University Hospitals, Oxford, UK.
Received for publication March 10, 2014; accepted in revised form September 4, 2014.
The authors are grateful to Dr Peter Nightingale for statistical support.
No benefits in any form have been received or will be received related directly or indirectly to the subject of this article.
Corresponding author: Robert Choa, MBChB, Department of Plastic Reconstructive and Hand Surgery, Level LG1, West Wing, Oxford Radcliffe Hospital, Headley Way, Oxford OX3 9DU; e-mail: [email protected].
http://dx.doi.org/10.1016/j.jhsa.2014.09.007

Trapeziometacarpal (TMC) joint osteoarthritis (OA) is a common clinical problem, most frequently encountered in the dominant hand and in women. Various classification systems exist for the purpose of facilitating communication between clinicians regarding TMC joint OA disease stage and treatment options. The Burton classification1 takes into account clinical and radiographic findings. Eaton and Littler2 devised a purely radiological classification system for staging the severity of OA at the TMC joint.


Eaton and Glickel3 later refined their classification to include scaphotrapezial joint degeneration. This refined classification system is the most commonly used scale in clinical practice. A number of studies have investigated the interrater and intrarater reliability of this classification system.4,5 These studies used unweighted kappa scores, whereas we believe the kappa scores should be weighted. The Eaton and Glickel classification has ordinal categories that reflect increasing levels of joint disease. Disagreement by 1 scale point (eg, grade 1 vs grade 2) is less serious than disagreement by 2 scale points (eg, grade 1 vs grade 3). To reflect the degree of disagreement, kappa can be weighted so that it attaches greater emphasis to large differences between ratings than to small differences. Weighted kappa penalizes disagreements in proportion to their seriousness, whereas unweighted kappa treats all disagreements equally. Unweighted kappa, therefore, is inappropriate for ordinal scales. Through the use of weighted kappa, we believe that the levels of interrater and intrarater agreement can be assessed more accurately than in previous studies.6
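To illustrate the difference concretely, the short Python sketch below (our own, not part of the original article) compares unweighted and quadratic weighted kappa on two hypothetical rating vectors, one in which all disagreements are off by 1 grade and one in which they are off by 2 grades; scikit-learn's cohen_kappa_score is assumed as the implementation.

    # Hypothetical Eaton grades (0-4) from two raters. In both scenarios the
    # raters agree on 5 of 10 radiographs; only the size of the disagreements
    # differs. Unweighted kappa ignores that size; quadratic weighted kappa does not.
    from sklearn.metrics import cohen_kappa_score

    rater_a    = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
    off_by_one = [0, 1, 2, 3, 4, 1, 2, 3, 4, 3]  # disagreements differ by 1 grade
    off_by_two = [0, 1, 2, 3, 4, 2, 3, 4, 1, 2]  # disagreements differ by 2 grades

    for name, rater_b in [("off by one", off_by_one), ("off by two", off_by_two)]:
        unweighted = cohen_kappa_score(rater_a, rater_b)
        weighted = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
        print(f"{name}: unweighted = {unweighted:.2f}, quadratic weighted = {weighted:.2f}")

With these particular vectors the unweighted kappa is identical in both scenarios, whereas the quadratic weighted kappa is substantially lower when the disagreements span 2 grades, which is the behavior required for an ordinal scale.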




METHODS

Six raters, including 3 hand surgery consultants and 3 registrars in hand surgery, participated in the study. All 3 consultants were plastic surgeons with a subspecialty interest in hand and upper limb surgery who frequently treat this condition. The 3 registrars included one year-3 trainee, one year-4 trainee, and a senior fellow in hand surgery. This permitted comparisons of agreement to be drawn within individuals, within the consultant and registrar groups, and between these 2 groups. The purpose of making comparisons both within and between groups of doctors of different grades was to identify whether greater experience in interpreting TMC joint radiographs correlated with better grading reliability.

Quadratic weighted kappa was used to assess the correlation between 2 ratings at a time, which allowed inter- and intrarater comparisons to be made. It is not simple to extend quadratic weighted kappa to more than 2 raters; therefore, to assess correlation between all the raters, the intraclass correlation was used. The justification for using intraclass correlation coefficients was that the calculated differences between quadratic weighted kappa and intraclass correlation coefficients were less than .005. Calculating weighted kappa using quadratic weights is virtually the same as calculating intraclass correlation coefficients provided the sample size is large enough, as it was in this study.7

Each rater was asked to score base of thumb radiographs from 52 patients arranged into a slideshow. The sample size was calculated based on the width of confidence intervals for the intraclass correlation using previously devised formulas.8 A sample size of 52 was sufficient for the width of the 95% confidence interval to be .25 or less if the intraclass correlation was greater than .5, and .2 or less if the intraclass correlation was greater than .68. Because the radiographs were de-identified, our institution did not require institutional review board approval.

At the outset of the slideshow, a presentation explaining the scale to be evaluated was given. Radiographs for each grade were shown, highlighting the pertinent features that distinguish them. Each rater's understanding of the scale was checked before proceeding with the slideshow. The posteroanterior and lateral radiographs included a spectrum of images, ranging from normal joints to stage 4 disease, with at least 8 of each grade.
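A rough sketch of this pairwise analysis is given below. This is illustrative code of our own, not the authors' analysis script: the rating matrix is simulated, scikit-learn's cohen_kappa_score is assumed as the kappa implementation, and the overall intraclass correlation across all 6 raters would in practice be obtained from the same matrix with a standard two-way ICC routine (for example, pingouin.intraclass_corr).

    # Pairwise quadratic weighted kappa across 6 raters (illustrative only).
    from itertools import combinations

    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    rng = np.random.default_rng(0)
    n_patients, n_raters = 52, 6

    # Simulated Eaton grades (0-4): each rater's score is a "true" grade
    # perturbed by at most one grade, then clipped back into range.
    true_grade = rng.integers(0, 5, size=n_patients)
    noise = rng.integers(-1, 2, size=(n_patients, n_raters))
    ratings = np.clip(true_grade[:, None] + noise, 0, 4)

    # Interrater reliability: quadratic weighted kappa for every pair of raters.
    pairwise = {
        (i, j): cohen_kappa_score(ratings[:, i], ratings[:, j], weights="quadratic")
        for i, j in combinations(range(n_raters), 2)
    }
    print("median pairwise quadratic weighted kappa:",
          round(float(np.median(list(pairwise.values()))), 2))

In the study itself, these pairwise scores were summarized as medians within and between the consultant and registrar groups (Table 1), and intrarater reliability was obtained by applying the same statistic to each rater's two scoring sessions.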

TABLE 1. Interrater Reliability Between Different Clinician Groups

Clinician Group               Median Score    Range
Consultant vs consultant      .55             .51-.66
Consultant vs registrar       .54             .49-.65
Registrar vs registrar        .53             .50-.60
Overall score: all raters     .56             N/A

Median quadratic weighted kappa scores (interrater reliability) between different clinician groups and overall score between all raters (intraclass correlation coefficient). N/A, not applicable.

The radiographs had been obtained from 23 men and 29 women with a mean age of 58 years (range, 35-82 y). The distribution of right:left TMC joint radiographs was 30:22. The images were randomly selected over a 6-month period from patients known to have TMC joint OA and from patients with normal joints who had had appropriate radiographs taken for other reasons. The slideshow was shown to all raters simultaneously to ensure that each rater spent the same amount of time analyzing each patient's radiographs. After an interval of 1 week, the same raters were shown the same radiographs arranged in a different order and asked to score them again. Raters were asked not to confer with each other regarding the scoring of the radiographs, to reduce bias.

RESULTS

The median results and ranges for the quadratic weighted kappa scores in the different clinician groups are shown in Table 1, which also includes an overall score for all 6 raters. To allow comparison with previous studies, we also derived unweighted kappa scores from our data; these are shown in Table 2. We calculated quadratic weighted kappa scores for each individual clinician, which are equivalent to intrarater reliability; these results are shown in Table 3. Additional analyses were undertaken to evaluate whether the raters showed greater reliability when scoring grades at the extremes of the scale (ie, grades 0, 1, and 4) compared with those in the middle (grades 2 and 3). These analyses are presented in Table 4.

DISCUSSION

Quadratic weighted kappa scores are valuable because they demonstrate the proportion of the total variability in the grades that is due to variability in the rated radiographs. If every radiograph was given the same grade by rater 1 and every radiograph was given the same grade by rater 2, but the grades given by the 2 raters were different, none of the variability in the grades would be due to variability in the radiographs, so the quadratic weighted kappa score would be 0 because the raters were in total disagreement. Conversely, if not all the radiographs were given the same grade by rater 1 and rater 2 gave the same grade as rater 1 in every case, all the variability in the grades would be due to variability in the radiographs, so the quadratic weighted kappa score would be 1 because the raters were in total agreement. The overall score of .56 across all the raters means that approximately half of the total variability in the grades is due to variability in the radiographs being graded, with the remainder being due to the raters themselves. We interpret this as inadequate agreement for the grading system to be deemed useful.
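To make this variance interpretation explicit, the intraclass correlation can be written as a ratio of variance components. The two-way form below is a generic one (the article does not state which ICC model was used), so it should be read as an interpretive aid rather than as the authors' exact formula:

    \[
    \mathrm{ICC} \;=\; \frac{\sigma^2_{\text{radiograph}}}{\sigma^2_{\text{radiograph}} + \sigma^2_{\text{rater}} + \sigma^2_{\text{error}}} \;\approx\; .56
    \]

That is, approximately 56% of the observed variance in the assigned grades is attributable to genuine differences between the radiographs, with the remainder attributable to the raters and residual error.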


TABLE 2. Comparison With Previous Studies

Study                            Year    No. Patients    Observers                                      Intrarater Reliability    Interrater Reliability
Kubik and Lubahn4                2002    40              3 hand surgeons, 3 orthopedic residents        .66                       .53
Dela Rosa et al9                 2004    40              6 hand surgeons                                .59-.61                   .37-.56
Spaans et al5                    2011    40              5 radiologists, 8 hand surgeons                NA                        .50
Hansen et al10                   2013    43              2 hand surgeons, 2 residents, 1 radiologist    .54                       .11
Choa and Giele (current study)   2014    52              3 hand surgeons, 3 hand surgery registrars     .37 (.17-.53)             .22 (.05-.38)

Comparison with previous studies on intra- and interrater reliability of the Eaton classification using unweighted kappa. NA, not available.


The anticipated biases in this study were those involving study design. To reduce this type of bias a validated outcome measure would normally be used, but because the Eaton scale itself was being assessed, this was not possible. It is conceivable that the participants understood the nature of the study or had preconceptions about the applicability of the Eaton classification, which could have led to bias when scoring the radiographs. The participants, however, were not asked to comment on the scale themselves. It is difficult to control for any preconceptions; nevertheless, by asking participants not to confer with each other, these sources of bias may have been minimized.

It is not surprising that intrarater reliability was better than interrater reliability, because the same person is likely to rate the same radiograph with the same score. In general, consultants had higher levels of intrarater reliability than registrars, possibly reflecting greater experience with interpretation of radiographs in the allotted time.

All raters graded the radiographs at the extremes of the rating scale (grades 0, 1, and 4) more reliably than the central grades (Table 4). Although the scoring of the radiographs at the extremes of the scale was not perfect, the reduced reliability of scoring grades 2 and 3 appears to affect the reliability of the scale to a greater extent.

TABLE 3. Individual Intrarater Reliability

Clinician       Intrarater Reliability
Consultant 1    .82
Consultant 2    .78
Consultant 3    .76
Registrar 1     .60
Registrar 2     .74
Registrar 3     .77

Individual quadratic weighted kappa scores (intrarater reliability).

TABLE 4. Reliability of Grading Radiographs at the Extremes of the Scale (Grades 0, 1, and 4) Compared With the Central Grades (2 and 3)

Rater           Group 1 (grades 0, 1, 4)    Group 2 (grades 2, 3)
Consultant 1    .84                         .78
Consultant 2    .85                         .68
Consultant 3    .79                         .54
Registrar 1     .67                         .36
Registrar 2     .75                         .70
Registrar 3     .80                         .69



Previous studies used various scales for categorizing their unweighted kappa scores to determine the strength of inter- and intrarater agreement.4,5 The Landis and Koch scale is one of these, assigning the following descriptive terms for strength of agreement to various kappa scores: < .00 = poor; .00 to .20 = slight; .21 to .40 = fair; .41 to .60 = moderate; .61 to .80 = substantial; and .81 to 1.00 = almost perfect.11 The problem with using the type of scale proposed by Landis and Koch is that the categories are arbitrary and the effects of prevalence and bias on kappa must be considered when judging its magnitude.6,12 We therefore did not apply a rating to our quadratic weighted results. Instead, we believe that the interrater reliability is inadequate and the intrarater reliability only marginally better.

A systematic review of intra- and interrater reliability of the Eaton classification13 compared 4 studies and found interrater reliability to be poor to fair, with intrarater reliability being fair to moderate. The system used to classify these amalgamated results was not specified, however. When we applied the Landis and Koch scale to our unweighted kappa values, there was slight to fair interrater agreement between all the clinician groups, and the intrarater reliabilities ranged from slight to moderate. We believe that quadratic weighted kappa scores are more appropriate for testing inter- and intrarater reliabilities for this type of clinical test with multiple ordered categories. Regardless of the type of test used, the level of interrater agreement was only in the range from slight to moderate when a scale such as the one described previously was applied to the unweighted kappa data. Given this relatively low level of interrater reliability, we do not believe this classification should be used to communicate disease severity or as a tool to assist in clinical decision making if the limitations of the test are not fully understood.
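The Landis and Koch bands quoted above are easy to apply mechanically; the helper below is a minimal sketch of our own (the function name and the example value are illustrative, not taken from the article), and, as discussed, such descriptive labels should be interpreted with caution.

    # Map a kappa value to the Landis and Koch descriptive band quoted in the text.
    def landis_koch_category(kappa: float) -> str:
        if kappa < 0.00:
            return "poor"
        if kappa <= 0.20:
            return "slight"
        if kappa <= 0.40:
            return "fair"
        if kappa <= 0.60:
            return "moderate"
        if kappa <= 0.80:
            return "substantial"
        return "almost perfect"

    # Example with an arbitrary value: a kappa of 0.45 falls in the "moderate" band.
    print(landis_koch_category(0.45))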


Furthermore, we do not believe that the decision to treat patients should be based on a radiographic system alone, given that some patients with severe radiographic findings are asymptomatic and vice versa. Instead, the decision to treat should be based on a combination of clinical and radiographic findings.

REFERENCES

1. Burton RI. Basal joint arthrosis of the thumb. Orthop Clin North Am. 1973;4(2):331-338.
2. Eaton RG, Littler JW. Ligament reconstruction for the painful thumb carpometacarpal joint. J Bone Joint Surg Am. 1973;55(8):1655-1666.
3. Eaton RG, Glickel SZ. Trapeziometacarpal osteoarthritis: staging as a rationale for treatment. Hand Clin. 1987;3(4):455-471.
4. Kubik NJ, Lubahn JD. Intra-rater and inter-rater reliability of the Eaton classification of basal joint arthritis. J Hand Surg Am. 2002;27(5):882-885.
5. Spaans AJ, van Laarhoven CM, Schuurman AH, van Minnen LP. Inter-observer agreement of the Eaton-Littler classification system and treatment strategy of thumb carpometacarpal joint osteoarthritis. J Hand Surg Am. 2011;36(9):1467-1470.
6. Sim J, Wright CC. The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Phys Ther. 2005;85(3):257-268.
7. Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ Psychol Meas. 1973;33(3):613-619.
8. Shoukri MM, Asyali MH, Donner A. Sample size requirements for the design of reliability study: review and new results. Stat Methods Med Res. 2004;13(4):251-271.
9. Dela Rosa TL, Vance MC, Stern PJ. Radiographic optimization of the Eaton classification. J Hand Surg Br. 2004;29(2):173-177.
10. Hansen TB, Sørensen OG, Kirkeby L, Homilius M, Amstrup AL. Computed tomography improves intra-observer reliability, but not the inter-observer reliability of the Eaton-Glickel classification. J Hand Surg Eur Vol. 2013;38(2):187-191.
11. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159-174.
12. Brennan P, Silman A. Statistical methods for assessing observer variability in clinical measures. BMJ. 1992;304(6840):1491-1494.
13. Berger AJ, Momeni A, Ladd AL. Intra- and inter-observer reliability of the Eaton classification for trapeziometacarpal arthritis. Clin Orthop Relat Res. 2014;472(4):1155-1159.
