Potential for inter-observer and intra-observer variability in x-ray review to establish stone-free rates after lithotripsy.

0022-5347/92/1473-0559$03.00/0
THE JOURNAL OF UROLOGY
Vol. 147, 559-562, March 1992
Copyright © 1992 by AMERICAN UROLOGICAL ASSOCIATION, INC.
Printed in U.S.A.

MICHAEL A. S. JEWETT, CLAIRE BOMBARDIER, DOMINIQUE CARON, MICHELE R. RYAN, ROBIN R. GRAY, EUGENE L. ST. LOUIS, STEPHEN J. WITCHELL, SANJIVE KUMRA AND KOSTANTINOS E. PSIHRAMIS

From the Divisions of Urology, Clinical Epidemiology and Department of Radiology, The Wellesley Hospital and Lithotripsy/Urolithiasis Program, University of Toronto, Toronto, Ontario, Canada

ABSTRACT

The potential for variability among observers interpreting diagnostic tests is well known but has not been well established for radiological imaging of urolithiasis. We measured the inter-observer and intra-observer variability in the reporting of plain abdominal films and tomograms from patients who had undergone extracorporeal shock wave lithotripsy (ESWL*). Unlabeled copies of the plain abdominal films and tomograms from 58 patients were individually submitted to 3 different radiologists. Selected films from 25 patients were resubmitted to the same radiologists. We found differences among radiologists reporting plain abdominal films alone 52% of the time and even by the same radiologist rereading the films 24% of the time. Tomograms alone decreased the uncertainty but differences still occurred among radiologists 24% of the time and with themselves 16% of the time. When plain abdominal films and tomograms were read together there were differences among radiologists 28% of the time and with themselves 7% of the time but these were usually minor. We concluded from this study that the plain abdominal film alone was frequently difficult to interpret, resulting in uncertainty about the presence or absence of residual stone fragments. Tomograms alone or a plain abdominal film plus tomograms are superior to a plain abdominal film alone. Finally, radiological assessment with all modalities probably overestimates stone-free rates after ESWL even without consideration of the potential for reporting variability among observers.

KEY WORDS: extracorporeal shockwave lithotripsy, urinary calculi, radiography

The definition of success for extracorporeal shock wave lithotripsy (ESWL) is not uniform. Classical stone treatment has been lithotomy with removal of an intact stone. Therefore, a successful outcome was defined as stone-free and this was easily determined with x-ray imaging. With ESWL stones are no longer removed intact but are reduced to fragments for spontaneous excretion. Some time must elapse for this to occur. The detection of residual fragments after lithotripsy depends on the accuracy of the imaging technology as well as its interpretation.1, 2 Although the tradition of measuring success as a stone-free rate continues with ESWL, fragmentation to a size that may be excreted (that is 4 or 5 mm. or less) is often included in reports as a successful outcome. Imaging (plain abdominal film of the kidneys, ureters and bladder, plain linear tomography or ultrasound) is used to determine the stone status of a treated patient. The interval at which imaging is performed varies but assessment at 3 and 12 months after completion of treatment is commonly used. It has been reported that tomography may be more accurate than plain abdominal films in the detection of renal calculi.3-5 The potential for variability among observers is well known for other diagnostic tests but it has not been extensively measured in this context.6 Therefore, we measured the inter-observer and intra-observer variability when reporting plain abdominal films and tomograms from patients who had undergone ESWL.

METHODS AND MATERIAL

Accepted for publication August 2, 1991.
* Dornier Medical Systems, Inc., Marietta, Georgia.

X-rays from 58 patients who had undergone ESWL for renal calculi 3 months previously were reviewed to detect residual fragments. Each patient had a plain abdominal film and 3 adjacent, 1 cm. thick, plain, 20-degree linear tomograms through the kidney with the levels determined by an experienced technologist. This was routine in all patients undergoing ESWL at our unit during that time. The film selection was done by a urologist (M. A. S. J.) who interpreted the films with clinical records. Of the patients 40 were believed to have residual stone fragments and 18 appeared to be clear.

Unlabeled copies of the plain abdominal films and tomograms were individually submitted to 3 different radiologists, designated as raters 1, 2 and 3, for interpretation. They were instructed to report the film(s) as 1) yes when residual fragments were noted, 2) no for clear films without residual fragments or 3) unsure if they could not be certain of the presence or absence of residual fragments for whatever reason, for example overlying gas, poor film quality and so forth. The plain abdominal films and tomograms were then combined for each patient and resubmitted to the same radiologists.

The frequencies of each interpretation for the 3 readings, that is plain abdominal films alone, tomograms alone, and the combined sets of plain abdominal films and tomograms, provided a measure of inter-observer reliability. Complete agreement occurred when all 3 radiologists agreed on the individual film, that is all 3 said yes, all 3 said no or all 3 said unsure. Partial agreement occurred when at least 1 radiologist responded unsure while the other 2 said yes or no. Finally, disagreement occurred when at least 1 radiologist responded yes while at least 1 other responded no. The possible combinations of yes, no or unsure among the 3 raters and the agreement that they achieve are shown in table 1. Percentages of complete agreement were calculated and reported along with the 95% confidence intervals to provide a measure of inter-rater reliability. Kappa statistics were also calculated. This last statistic measures the portion of the agreement that can be credited to true agreement above chance alone.
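The agreement rules and the chance-corrected statistic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say which multi-rater kappa was used, so Fleiss' kappa is assumed here, and the function names `classify_agreement` and `fleiss_kappa` are hypothetical.

```python
from collections import Counter

CATEGORIES = ("yes", "no", "unsure")


def classify_agreement(ratings):
    """Classify one film's ratings using the rules stated in the text."""
    values = set(ratings)
    if len(values) == 1:
        return "complete"      # all raters gave the same answer
    if "yes" in values and "no" in values:
        return "disagreement"  # at least 1 yes and at least 1 no
    return "partial"           # unsure mixed with only yes or only no


def fleiss_kappa(films):
    """Fleiss' kappa (an assumed choice) for films rated by the same number of raters."""
    n_films = len(films)
    n_raters = len(films[0])
    totals = Counter()
    observed = 0.0
    for ratings in films:
        counts = Counter(ratings)
        totals.update(counts)
        # Proportion of rater pairs that agree on this film.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        observed += agreeing_pairs / (n_raters * (n_raters - 1))
    observed /= n_films
    # Agreement expected by chance from the overall category proportions.
    expected = sum((totals[c] / (n_films * n_raters)) ** 2 for c in CATEGORIES)
    if expected == 1.0:
        return float("nan")    # kappa is undefined when only one category is used
    return (observed - expected) / (1.0 - expected)
```

Each film is passed as a tuple of 3 answers, for example `("yes", "unsure", "yes")`, which the rules above classify as partial agreement.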
The scale used to interpret kappa values was less than 0 (poor), 0 to 0.20 (slight), 0.21 to 0.40 (fair), 0.41 to 0.60 (moderate), 0.61 to 0.80 (substantial) and 0.81 to 1.00 (almost perfect).7 The kappa statistics for the 3 raters were also calculated with a 95% confidence interval. Of the patient films 25 were reduplicated and mixed with the original films for a second interpretation by the 3 radiologists, that is 25 plain abdominal films, 25 tomograms and their combined sets. The frequencies of the repeated interpretations and kappa statistics of the 2 ratings provided measures of intra-observer variability by comparison with the individual's initial report from the same 25 patients.8 Agreement was complete if both reports were the same. Partial agreement occurred if 1 of the reports was unsure while the other was a definite yes or a definite no. Disagreement occurred when a film was considered as yes in 1 instance and no in another. The possible combinations of yes, no or unsure between 2 replications and the agreement that they achieve are shown in table 2.

TABLE 1. Possible combinations of yes, no or unsure between 3 raters and the agreement that they achieve

Combination No.      Rater 1   Rater 2   Rater 3
Complete agreement
  1                  Yes       Yes       Yes
  2                  No        No        No
  3                  Unsure    Unsure    Unsure
Partial agreement
  4                  Yes       Unsure    Yes
  5                  Yes       Unsure    Unsure
  6                  No        Unsure    No
  7                  No        Unsure    Unsure
Disagreement
  8                  Yes       Yes       No
  9                  Yes       No        Unsure
  10                 Yes       No        No

TABLE 2. Possible combinations of yes, no or unsure between 2 replications and the agreement that they achieve

Combination No.      Review 1   Review 2
Complete agreement
  1                  Yes        Yes
  2                  No         No
  3                  Unsure     Unsure
Partial agreement
  4                  Yes        Unsure
  5                  No         Unsure
Disagreement
  6                  Yes        No

RESULTS

When the 3 raters, who were diagnostic radiologists familiar with genitourinary imaging, examined the unlabeled plain abdominal films and tomograms, and the paired sets of plain abdominal films plus tomograms from the 58 patients, there were differences in the frequencies of interpretation (table 3). The major differences occurred with raters 1 and 3 who were more frequently unsure about the presence or absence of residual fragments with the plain abdominal films. The most important factor in their uncertainty was overlying intestinal gas and feces, which obscured the kidneys. Uncertainty due to possible vascular calcifications was also cited. There was a

TABLE 3. Frequencies of interpretation by the 3 raters for the 58 patients (caption, rater labels and film-type labels not recoverable). Surviving count triplets, each totaling 58: 29 3 26; 38 14 6; 44 12 2; 39 13 6; 38 10 10; 40 15 3

TABLE 4. Frequency of agreement among 3 raters for 58 patient x-rays

                           Plain Film   Tomogram    Both
Complete agreement         28           44          42
Partial agreement          28           10          13
Disagreement               2            4           3
% complete agreement       48           76          72
95% confidence interval    35-61        65-87       60-84
Kappa                      0.41         0.64        0.61
95% confidence interval    0.29-0.54    0.45-0.83   0.41-0.80

TABLE 5. Frequencies of repeated x-ray interpretation by 3 raters for 25 patients (surviving column headings: Plain Film, Tomogram; table body not recoverable)

Test of difference between proportions: plain film versus tomogram p
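As a rough check on the table 4 figures, the sketch below recomputes the percentage of complete agreement and a 95% confidence interval from the complete-agreement counts for the 58 patients (28, 44 and 42), and maps the reported kappa values onto the interpretation scale given in the methods. The text does not state how the intervals were calculated, so the normal approximation used here is an assumption, and the function names `percent_agreement_ci` and `interpret_kappa` are hypothetical.

```python
import math


def percent_agreement_ci(complete, total, z=1.96):
    """Percent complete agreement with a normal-approximation 95% CI (assumed method)."""
    p = complete / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return 100.0 * p, 100.0 * (p - half_width), 100.0 * (p + half_width)


def interpret_kappa(kappa):
    """Map a kappa value onto the scale quoted in the methods."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Complete-agreement counts and kappa values as reported in table 4.
reported = {"plain film": (28, 0.41), "tomogram": (44, 0.64), "both": (42, 0.61)}
for label, (complete, kappa) in reported.items():
    pct, low, high = percent_agreement_ci(complete, 58)
    print(f"{label}: {pct:.0f}% ({low:.0f}-{high:.0f}), kappa {kappa} is {interpret_kappa(kappa)}")
```

Under this assumption the recomputed intervals fall within about 1 percentage point of the published ones, and the kappa values read as moderate for plain films and substantial for tomograms and for the combined sets.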
