
A study of whether video scoring is a reliable option for blinded scoring of the Gross Motor Function Measure-88

Clinical Rehabilitation 1–7
© The Author(s) 2014
Reprints and permissions: sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/0269215514558642
cre.sagepub.com

Inge Franki1,2, Chris Van den Broeck1, Josse De Cat2,3, Guy Molenaers3, Guy Vanderstraeten1 and Kaat Desloovere2,3

Abstract

Objective: To investigate the agreement between live and video scores of the Gross Motor Function Measure-88.
Design: Reliability study.
Subjects: Forty children with bilateral spastic cerebral palsy.
Interventions: Fifty evaluations were administered according to the test guidelines and were videotaped. After a minimum interval of one month, the video recordings were rated again by the same assessor. Two physical therapy students also each scored the recordings twice, with a minimal interval of one month.
Main measures: Agreement between live and video scores, as well as inter-rater and intra-rater agreement of the video scores, was assessed using intraclass correlation coefficients (ICC), standard errors of measurement (SEM), and smallest detectable changes (SDC). Weighted kappa coefficients were used to analyse individual items.
Results: The live and video scores from the same assessor showed good to very good agreement for the total score (ICC, 0.973; SEM, 2.28; SDC, 6.32) and dimensions B (ICC, 0.938), D (ICC, 0.965), and E (ICC, 0.992), but lower agreement for A (ICC, 0.720) and C (ICC, 0.667). Live-versus-video agreement for the total score was higher than inter-rater agreement by video (ICC, 0.949; SEM, 3.15; SDC, 8.73) but lower than intra-rater agreement by video (ICC, 0.989; SEM, 1.42; SDC, 3.96).
Conclusion: The Gross Motor Function Measure-88 can be reliably scored using video recordings. The agreement between live and video scores is lower than the intra-rater reliability using video recordings only. Future clinical trial results should be interpreted using the appropriate SEM and SDC values.

Keywords: Video scoring, gross motor function, cerebral palsy, physical therapy

Received: 14 May 2014; accepted: 12 October 2014

1 Department of Rehabilitation Sciences and Physiotherapy, Ghent University, Belgium
2 Department of Rehabilitation Sciences, KU-Leuven, Leuven, Belgium
3 Clinical Motion Analysis Laboratory, University Hospital Pellenberg, Belgium

Corresponding author: Inge Franki, Ghent University - Rehabilitation Sciences and Physiotherapy, University Hospital Ghent, Campus Heymans 1B3, De Pintelaan 185, 9000 Ghent, Belgium. Email: [email protected]


Introduction

The Gross Motor Function Measure is a criterion-referenced observational measure of gross motor function that is validated for use in children with cerebral palsy, Down syndrome, and traumatic brain injury.1–3 The original Gross Motor Function Measure-88 comprises 88 items, divided into five dimensions.4 The Gross Motor Function Measure-66 was subsequently developed; it is shorter and more reliable, but it may be less descriptive for children functioning at lower ability levels and requires specific software to calculate the child's total score.5 The Gross Motor Function Measure-88 is therefore still very commonly used.

A well-known challenge in paediatric research is the requirement of blinded assessment. Blinded evaluation can reduce tester bias and thereby increase study quality. However, gross motor function evaluation requires that the child feels comfortable. Especially among young children, an unknown assessor might make a child insecure and frightened and, thereby, less compliant, which could reduce performance. Video recordings can therefore be a useful tool to create an optimal, high-quality testing situation: video scoring allows blinded evaluation and may also increase reliability by enabling the person scoring the test to pause and to review items in case of doubt.

Several researchers have used video recordings and have investigated the inter-rater reliability of the Gross Motor Function Measure-88 scored from video.1,6–8 To our knowledge, however, none of these studies has evaluated the agreement between live and video scores. There may be limitations specific to scoring video recordings, particularly for the Gross Motor Function Measure-88, which is known to be vulnerable to missing items.1,4 It is important to investigate whether scoring from video recordings increases the chance of missing items and whether this creates systematic error.

The present study aimed to examine the agreement between the results of the Gross Motor Function Measure-88 when scored live and by video. The agreement between the live and video scores was compared with the inter-rater and intra-rater agreement.

Methodology

Children were recruited at the Cerebral Palsy Reference Centre of the University Hospital Pellenberg, using data from an ongoing intervention study. This study included 4- to 9-year-old children with bilateral spastic cerebral palsy who were ambulant and classified at Gross Motor Function Classification System levels I–III. Children were excluded if they showed severe problems that could influence test results, such as blindness, deafness, or severe cognitive limitations. All caretakers of the children signed an informed consent form.

Each child was tested by the first author of the study, a paediatric physical therapist acquainted with the test items and manual of the Gross Motor Function Measure-88. The English version of the test manual was used, and all evaluations were administered accordingly, with all 88 items tested for each assessment. Testing was performed in the paediatric physical therapy room where the child's usual therapist worked, so that the setting was comfortable and familiar. During the tests, children were barefoot and no assistive devices were used. Each session took 40–60 minutes. All assessments were videotaped for subsequent extra scoring. To avoid distracting the child, the camera was positioned on a tripod and no additional person was present for filming. Video recordings were not edited.

After a minimal interval of one month, the video recordings of the test were scored again by the same paediatric physical therapist. The recordings were also copied and distributed to two physical therapy Master's degree students specializing in paediatric physical therapy. Each student scored each video-recorded test twice, again with a minimal interval of one month between scoring moments. These students were also familiar with the test and the guidance manual. As training, they scored five video recordings with the opportunity to ask questions. During video scoring, raters were allowed to pause the video recordings and to review items as many times as necessary.

According to the test guidelines, all items were tested and scored on an ordinal scale of zero (did not initiate) to three (completed item). Subsequently, the dimension scores were calculated and expressed as percentage scores, and the total score was calculated as the average of the five dimension percentage scores.
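As an illustration of this scoring procedure, the following is a minimal sketch in Python. It assumes the item counts per dimension of the published GMFM-88 (A: 17, B: 20, C: 14, D: 13, E: 24 items, each scored 0–3); it is not the official scoring program, and the example scores are hypothetical.

```python
# Minimal sketch of GMFM-88 percentage scoring (assumed item counts per
# dimension: A=17, B=20, C=14, D=13, E=24; every item scored 0-3).

def dimension_percentage(item_scores):
    """Dimension score as a percentage of the maximum attainable (3 per item)."""
    return 100.0 * sum(item_scores) / (3 * len(item_scores))

def total_score(dimension_percentages):
    """Total GMFM-88 score: the unweighted average of the five dimension percentages."""
    return sum(dimension_percentages) / len(dimension_percentages)

# Hypothetical child: full scores on dimensions A and B, lower scores on C-E.
example = {
    "A": [3] * 17,
    "B": [3] * 20,
    "C": [3] * 10 + [2] * 4,
    "D": [2] * 13,
    "E": [1] * 24,
}
percentages = {dim: dimension_percentage(items) for dim, items in example.items()}
print({dim: round(p, 1) for dim, p in percentages.items()})
print("Total:", round(total_score(list(percentages.values())), 1))
```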


Data analysis began with determining the agreement between the live and video scores of the first evaluator. Secondly, we calculated the intra-rater agreement between the first and second video scores of the two independent raters. Next, the inter-rater agreement was calculated between the second video scores of all three raters.

The association between the total and dimension percentage scores was evaluated using intraclass correlation coefficients (ICC(3,1)) for absolute agreement, with 95% confidence intervals.9 The ICC can be defined as a measure of association, but it specifically assesses the consistency or reproducibility of the same quantitative measurements made by different observers; it will therefore be described here as an agreement measure. Bland-Altman plots were used to investigate systematic differences between the measurements and to identify possible outliers; data were reported as the 95% limits of agreement (mean difference ±1.96 SD).10 Additionally, the standard error of measurement (SEM) and the smallest detectable change (SDC) were calculated from the variance components.11 The SEM provides an absolute index of reliability, indicating the variability of scores around the subject's true score and thereby expressing measurement error in the same unit as the measurement itself. The SDC represents the amount of change in a patient's score that is too large to be explained by measurement error.

The agreement between the live and video ratings was also assessed at the item level. As the item scores used an ordinal scale, weighted kappa coefficients (κ) were applied. However, weighted kappa coefficients cannot be calculated when there is a limited range of scores. Therefore, the proportion of positive agreement was used as an additional measure for item analysis,12 calculated as the number of scores with positive agreement divided by the total number of scores. We additionally registered the number of items that could not be scored from the video recordings for practical reasons, for example because the child was out of the video range.
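The following sketch illustrates, under simplified assumptions, how these agreement statistics can be obtained from a subjects-by-raters matrix of percentage scores: a two-way single-measure ICC, the SEM from the rater and residual variance components (as in the footnotes to Tables 1 to 3), and the SDC. The original analyses were run in SPSS; this reimplementation is illustrative only and the example scores are hypothetical.

```python
import numpy as np

def icc_sem_sdc(scores):
    """scores: (n_subjects, k_raters) array of GMFM-88 percentage scores."""
    n, k = scores.shape
    grand = scores.mean()
    subj_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)

    # Two-way ANOVA sums of squares and mean squares.
    ss_subj = k * ((subj_means - grand) ** 2).sum()
    ss_rater = n * ((rater_means - grand) ** 2).sum()
    ss_err = ((scores - grand) ** 2).sum() - ss_subj - ss_rater
    ms_subj = ss_subj / (n - 1)
    ms_rater = ss_rater / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    # Two-way single-measure ICC (consistency form, ICC(3,1) in the
    # Shrout and Fleiss notation).
    icc = (ms_subj - ms_err) / (ms_subj + (k - 1) * ms_err)

    # SEM = sqrt(rater variance + residual variance); SDC = 1.96 * sqrt(2) * SEM.
    var_rater = max((ms_rater - ms_err) / n, 0.0)
    sem = np.sqrt(var_rater + ms_err)
    sdc = 1.96 * np.sqrt(2) * sem
    return icc, sem, sdc

# Hypothetical live and video total scores (percent) for five children.
scores = np.array([[83.1, 82.4],
                   [91.0, 90.2],
                   [64.5, 66.0],
                   [75.2, 74.8],
                   [88.3, 87.9]])
print(icc_sem_sdc(scores))
```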

Statistical analyses were performed using IBM SPSS Statistics for Windows, version 22.0 (IBM Corp., Armonk, NY). Weighted kappa coefficients were obtained using MedCalc for Windows, version 12.5 (MedCalc Software, Ostend, Belgium).
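For the item-level analysis, a weighted kappa and a proportion of agreement for a single item could be computed along the following lines. The weighting scheme (linear versus quadratic) is not specified in the text, so linear weights are assumed here, and the proportion of positive agreement is read simply as the share of identical item scores; the scores below are hypothetical.

```python
from sklearn.metrics import cohen_kappa_score

def item_agreement(live, video):
    """live, video: lists of 0-3 scores for one GMFM-88 item across children."""
    # Linear weighting is an assumption; the paper only states "weighted kappa".
    kappa = cohen_kappa_score(live, video, labels=[0, 1, 2, 3], weights="linear")
    # Proportion of positive agreement: share of children with identical scores.
    p_agree = sum(a == b for a, b in zip(live, video)) / len(live)
    return kappa, p_agree

# Hypothetical live and video scores for one item in ten children.
live  = [3, 3, 2, 3, 1, 3, 2, 3, 3, 0]
video = [3, 3, 2, 2, 1, 3, 2, 3, 3, 0]
print(item_agreement(live, video))
```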

Results

The study included 40 children with a mean age at testing of 5.8 years (SD, 1.3 years). All children were diagnosed with bilateral spastic cerebral palsy; 12 children were classified at Gross Motor Function Classification System level I, 18 at level II, and 10 at level III. For 10 children, two separate assessments were included, administered at a minimal interval of 10 weeks, resulting in a total of 50 test results. None of the children had previously been tested with the Gross Motor Function Measure-88.

Table 1 presents an overview of the agreement between the live and video scores of the first evaluator. These results showed good to very good agreement for the total score (ICC, 0.973) and for dimensions B (ICC, 0.938), D (ICC, 0.965), and E (ICC, 0.992), and good agreement for dimensions A (ICC, 0.720) and C (ICC, 0.667). The SEM for the total score was 2.28%, with an SDC of 6.32%. The Bland-Altman plot showed limits of agreement for the total score ranging from −5.28% to 7% (Figure 1 and Table 1), with a mean difference of 0.81%. Weighted kappa coefficients (see online Appendix) revealed moderate to high agreement for all except four items: item 32 (attaining 4 points over left side; κ = −0.101; positive agreement, 0.96), item 38 (prone, forward creeping; κ = −0.101; positive agreement, 0.96), item 40 (4 points, attains sitting arms free; κ = −0.204; positive agreement, 0.92), and item 45 (crawling reciprocally; κ = 0.310; positive agreement, 0.96).

Table 2 presents the intra-rater reliability of the video scoring by the two independent evaluators. The ICC could be interpreted as very good for both the total score (ICC, 0.989) and all dimension scores (ICC ranging from 0.894 to 0.993). The SEM for the total score was only 1.43%, with an SDC of 3.96%.
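The SDC values reported here follow directly from the corresponding SEM through the relationship given in the table footnotes; for the live-versus-video total score, for example:

SDC = 1.96 × √2 × SEM = 1.96 × √2 × 2.28% ≈ 6.32%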


Table 1. Live and video scores of the GMFM-88 with corresponding agreement values (first rater).

| GMFM-88 parameter | Total score | Dimension A | Dimension B | Dimension C | Dimension D | Dimension E |
|---|---|---|---|---|---|---|
| Live score, mean % (SD) | 83.03 (14.01) | 99.25 (2.13) | 96.8 (6.26) | 90.71 (12.46) | 70.87 (25.97) | 57.53 (28.12) |
| Video score, mean % (SD) | 82.22 (13.74) | 99.25 (1.63) | 97.3 (5.15) | 88.00 (15.04) | 69.28 (25.51) | 57.28 (27.78) |
| Mean difference, live minus video (SD) | 0.81 (3.15) | 0 (1.43) | −0.5 (1.97) | 2.71 (11.11) | 1.59 (6.65) | 0.25 (3.61) |
| Limits of agreement (Bland-Altman) | −5.28 to 7.00 | −2.80 to 2.80 | −4.36 to 3.36 | −19.08 to 24.50 | −11.44 to 14.63 | −6.83 to 7.33 |
| ICC agreement (95% CI) | 0.973 (0.952–0.985) | 0.720 (0.553–0.831) | 0.938 (0.893–0.965) | 0.667 (0.481–0.796) | 0.965 (0.939–0.980) | 0.992 (0.986–0.995) |
| SEM, % | 2.28 | 1.00 | 1.42 | 8.01 | 4.79 | 2.53 |
| SDC, % | 6.32 | 2.77 | 3.94 | 22.02 | 13.27 | 7.01 |

GMFM-88: Gross Motor Function Measure-88; SD: standard deviation; ICC: intraclass correlation coefficient (ICC between 0.00–0.19 is considered as very weak, 0.20–0.39 as weak, 0.40–0.59 as moderate, 0.60–0.79 as strong, and 0.80–1.00 as very strong agreement);11 95% CI: 95% confidence interval; SEM: standard error of measurement, calculated as √(σ²PT + σ²residual); SDC: smallest detectable change, calculated as 1.96 × √2 × SEM.

Figure 1. Bland-Altman plot representing the difference between the live and video scoring of the total score of the Gross Motor Function Measure-88. The average difference and the 95% limits of agreement (±1.96 SD) of the difference are indicated.
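A plot of this kind can be produced as sketched below: differences between live and video total scores are plotted against their means, with the mean difference and the ±1.96 SD limits of agreement drawn as horizontal lines. The data arrays are hypothetical and stand in for the 50 paired assessments.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical paired live and video total scores (percent).
live  = np.array([83.1, 91.0, 64.5, 75.2, 88.3, 70.4, 95.0, 81.7])
video = np.array([82.4, 90.2, 66.0, 74.8, 87.9, 69.1, 94.2, 80.5])

diff = live - video
mean = (live + video) / 2
bias = diff.mean()
loa = 1.96 * diff.std(ddof=1)  # half-width of the limits of agreement

plt.scatter(mean, diff)
plt.axhline(bias, label=f"mean difference = {bias:.2f}%")
plt.axhline(bias + loa, linestyle="--", label="+1.96 SD")
plt.axhline(bias - loa, linestyle="--", label="-1.96 SD")
plt.xlabel("Mean of live and video total score (%)")
plt.ylabel("Live minus video total score (%)")
plt.legend()
plt.show()
```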


Table 2. First and second ratings of the two independent video raters with corresponding agreement values.

| GMFM-88 parameter | Rater | Score 1, mean % (SD) | Score 2, mean % (SD) | Mean difference, % | Limits of agreement (Bland-Altman) | ICC agreement (95% CI) | SEM, % | SDC, % |
|---|---|---|---|---|---|---|---|---|
| Total score | PT 2 | 78.75 (13.78) | 77.82 (13.55) | 0.93 | −2.63 to 4.50 | 0.989 (0.974–0.995) | 1.43 | 3.96 |
| Total score | PT 3 | 77.89 (13.80) | 78.78 (14.78) | −0.89 | −4.44 to 2.66 | 0.989 (0.976–0.995) | 1.42 | 3.93 |
| Dimension A | PT 2 | 97.53 (6.51) | 96.55 (6.78) | 0.98 | −4.01 to 5.97 | 0.918 (0.847–0.995) | 1.91 | 5.29 |
| Dimension A | PT 3 | 94.17 (8.70) | 92.57 (9.29) | 1.60 | −7.96 to 11.16 | 0.842 (0.729–0.909) | 3.6 | 9.97 |
| Dimension B | PT 2 | 95.41 (6.38) | 96.51 (5.55) | −1.10 | −6.14 to 3.94 | 0.894 (0.797–0.943) | 1.96 | 5.43 |
| Dimension B | PT 3 | 93.17 (9.64) | 93.57 (9.24) | −0.40 | −6.02 to 5.22 | 0.954 (0.920–0.974) | 2.03 | 5.62 |
| Dimension C | PT 2 | 83.48 (15.71) | 85.57 (14.79) | −2.10 | −13.55 to 9.36 | 0.919 (0.853–0.955) | 4.35 | 12.05 |
| Dimension C | PT 3 | 82.19 (16.68) | 82.19 (16.41) | 0.62 | −11.95 to 13.19 | 0.926 (0.873–0.957) | 4.51 | 12.49 |
| Dimension D | PT 2 | 67.59 (25.91) | 66.36 (25.22) | 1.23 | −6.91 to 9.36 | 0.986 (0.975–0.992) | 3.03 | 8.39 |
| Dimension D | PT 3 | 65.64 (24.87) | 66.77 (24.81) | −1.13 | −9.81 to 7.56 | 0.983 (0.971–0.991) | 3.2 | 8.86 |
| Dimension E | PT 2 | 51.75 (26.58) | 51.47 (26.80) | 0.28 | −7.24 to 7.80 | 0.990 (0.982–0.994) | 2.69 | 7.45 |
| Dimension E | PT 3 | 51.64 (26.47) | 51.42 (26.23) | 0.22 | −6.03 to 6.48 | 0.993 (0.987–0.996) | 2.24 | 6.2 |

GMFM-88: Gross Motor Function Measure-88; SD: standard deviation; mean difference: calculated as score 1 minus score 2; ICC: intraclass correlation coefficient (ICC between 0.00–0.19 is considered as very weak, 0.20–0.39 as weak, 0.40–0.59 as moderate, 0.60–0.79 as strong, and 0.80–1.00 as very strong agreement);11 95% CI: 95% confidence interval; SEM: standard error of measurement, calculated as √(σ²PT + σ²residual); SDC: smallest detectable change, calculated as 1.96 × √2 × SEM.


Table 3. Overall inter-rater agreement values for the second ratings of all three raters.

| GMFM-88 parameter | Total score | Dimension A | Dimension B | Dimension C | Dimension D | Dimension E |
|---|---|---|---|---|---|---|
| ICC agreement (95% CI) | 0.949 (0.808–0.979) | 0.374 (0.200–0.548) | 0.603 (0.417–0.746) | 0.836 (0.723–0.905) | 0.962 (0.938–0.977) | 0.966 (0.894–0.985) |
| SEM, % | 3.15 | 4.23 | 5.34 | 6.35 | 4.93 | 4.99 |
| SDC, % | 8.73 | 11.72 | 14.8 | 17.6 | 13.67 | 13.83 |

GMFM-88: Gross Motor Function Measure-88; ICC: intraclass correlation coefficient (ICC between 0.00–0.19 is considered as very weak, 0.20–0.39 as weak, 0.40–0.59 as moderate, 0.60–0.79 as strong, and 0.80–1.00 as very strong agreement);11 95% CI: 95% confidence interval; SEM: standard error of measurement, calculated as √(σ²PT + σ²residual); SDC: smallest detectable change, calculated as 1.96 × √2 × SEM.

Each independent assessor reported an average of 4.14 items per assessment that could not be scored from the video recordings. Table 3 presents the inter-rater agreement for the second scores of all three raters. These results revealed weak agreement for dimension A (ICC, 0.374), good agreement for dimension B (ICC, 0.603), and very good agreement for all other dimensions and for the total score (ICC ranging from 0.836 to 0.966). The overall SEM was 3.15%, with an SDC of 8.73%.

Discussion

The present study aimed to investigate the agreement between live and video scores of the Gross Motor Function Measure-88. The results showed very good agreement between the total scores obtained by live and video rating, with an ICC of 0.973. Additionally, the mean difference (0.81%, as shown in the Bland-Altman plot) did not differ significantly from 0, indicating no fixed or systematic bias. The ICC for the live versus video comparison was lower than the intra-rater value for video ratings (ICC, 0.989) but higher than the inter-rater value for video ratings (ICC, 0.949). Similarly, the SEM for the live versus video comparison (SEM, 2.28%) was higher than the intra-rater value (SEM, 1.43%) but lower than the inter-rater value (SEM, 3.15%).

In general, the lowest ICC values were found for dimensions A and C. The lower agreement for dimension A is consistent with previous reports by other authors, likely because these items are more difficult to score and more prone to subjectivity.4,6 The lower agreement for dimension C was probably due to missing items, as most of the items that could not be scored from the video recordings belonged to dimension C. Items from dimension C usually require more active involvement of, and demonstration by, the tester, so the tester more frequently blocked the camera. These results suggest that it may be beneficial to use an additional person for filming instead of a tripod. However, the presence of an additional person might distract the child and therefore negatively influence the test results.

In general, our results showed lower inter-rater and intra-rater agreement values than previous studies using experienced raters.5,6,13 However, our agreement values were higher than those reported by Nordmark et al., who investigated the agreement between inexperienced raters.8 As the test manual could be consulted during the video ratings, the rater's experience might be of less importance for reliability. In the present study, the Master's students were trained in the observation of children's motor functioning and were familiar with the test, but they were not as experienced in rating the Gross Motor Function Measure-88 as the first assessor.

Since the present study included only ambulant children with bilateral spastic cerebral palsy, the study population was relatively homogeneous. This limitation specifically relates to dimensions A and B of the test, in which very high scores were registered.

Downloaded from cre.sagepub.com at UCSF LIBRARY & CKM on March 20, 2015

7

Furthermore, the SEM and SDC values varied considerably between live and video scoring, which requires cautious interpretation of clinical trial results. For the total test score, the SEM of 2.28% should be considered the measurement error when comparing live scores with video scores. When using video scores only, the corresponding SEM is 1.43% for intra-rater comparisons and 3.15% for inter-rater comparisons. For interpreting changes in Gross Motor Function Measure-88 scores in future intervention studies, the appropriate SEM and SDC should be used, chosen according to the type of scoring.

Clinical messages

•• There was a high agreement between the live and video scores of the Gross Motor Function Measure-88.
•• Agreement between live and video scores was lower than the intra-rater reliability of video scores.
•• When using video scores in a clinical trial, the appropriate reliability measures should be considered.

Acknowledgements

The authors would like to acknowledge Sophie Van Hoey and Sofie Chevalier for their important contributions to the scoring of the Gross Motor Function Measure-88 recordings.

Conflict of interest

The authors declare that there is no conflict of interest.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

References

1. Bjornson KF, Graubert CS, Buford V and McLaughlin J. Validity of the Gross Motor Function Measure. Pediatr Phys Ther 1998; 10: 43–47.
2. Russell DJ, Palisano R, Walter SD, Rosenbaum P, Gemus M, Gowland C, et al. Evaluating motor function in children with Down syndrome: validity of the GMFM. Dev Med Child Neurol 1998; 40: 693–701.
3. Linder-Lucht M, Othmer V, Walther M, Vry J, Michaelis U, Stein S, et al. Validation of the Gross Motor Function Measure for use in children and adolescents with traumatic brain injuries. Pediatrics 2007; 120: e880–e886.
4. Russell DJ, Rosenbaum PL, Cadman DT, Gowland C, Hardy S and Jarvis S. The Gross Motor Function Measure: a means to evaluate the effects of physical therapy. Dev Med Child Neurol 1989; 31: 341–352.
5. Russell DJ, Avery LM, Rosenbaum PL, Raina PS, Walter SD and Palisano RJ. Improved scaling of the Gross Motor Function Measure for children with cerebral palsy: evidence of reliability and validity. Phys Ther 2000; 80: 873–885.
6. Ko J and Kim M. Reliability and responsiveness of the Gross Motor Function Measure-88 in children with cerebral palsy. Phys Ther 2013; 93: 393–400.
7. Lundkvist Josenby A, Jarnlo GB, Gummesson C and Nordmark E. Longitudinal construct validity of the GMFM-88 total score and goal total score and the GMFM-66 score in a 5-year follow-up study. Phys Ther 2009; 89: 342–350.
8. Nordmark E, Hagglund G and Jarnlo GB. Reliability of the Gross Motor Function Measure in cerebral palsy. Scand J Rehabil Med 1997; 29: 25–28.
9. Landis JR and Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977; 33: 159–174.
10. Bland JM and Altman DG. Measuring agreement in method comparison studies. Stat Methods Med Res 1999; 8: 135–160.
11. de Vet HCW, Terwee CB, Knol DL and Bouter LM. When to use agreement versus reliability measures. J Clin Epidemiol 2006; 59: 1033–1039.
12. Fleiss JL. Analysis of data from multiclinic trials. Control Clin Trials 1986; 7: 267–275.
13. Russell DJ, Rosenbaum PL and Gowland C. Gross Motor Function Measure manual. Ontario, Canada: Wiley Blackwell Publishing, 1993.
