Clinical Radiology 69 (2014) 397–402


Mammography test sets: Reading location and prior images do not affect group performance

B.P. Soh a,b,*, W.B. Lee c, M.F. McEntee a, P.L. Kench a, W.M. Reed a, R. Heard a, D.P. Chakraborty d, P.C. Brennan a

a Medical Image Optimisation and Perception Group (MIOPeG), Discipline of Medical Radiation Sciences, University of Sydney, Sydney, NSW, Australia
b Department of Diagnostic Radiology, Singapore General Hospital, Singapore
c Cancer Institute NSW, Alexandria, NSW, Australia
d Department of Radiology, University of Pittsburgh, Pittsburgh, PA, USA

Article information
Article history: Received 5 August 2013; Received in revised form 29 October 2013; Accepted 13 November 2013

AIM: To examine how the location where reading takes place and the availability of prior images can affect performance in breast test-set reading.

MATERIALS AND METHODS: Under optimized viewing conditions, 10 expert screen readers each interpreted a reader-specific set of images containing 200 mammographic cases. Readers, randomly divided into two groups, read images under one of two pairs of conditions: clinical read with prior images and laboratory read with prior images; or laboratory read with prior images and laboratory read without prior images. Region-of-interest (ROI) figure-of-merit (FOM) was analysed using JAFROC software. Breast side-specific sensitivity and specificity were tested using Wilcoxon matched-pairs signed rank tests. Agreement between pairs of readings was measured using Kendall's coefficient of concordance.

RESULTS: Group performances between test-set readings demonstrated similar ROI FOMs and similar median sensitivity and specificity values, and acceptable levels of agreement between pairs of readings were shown (W = 0.75–0.79, p < 0.001) for both pairs of reading conditions. At the individual reader level, two readers demonstrated significant decreases (p < 0.05) in ROI FOMs when prior images were unavailable. Reading location had an inconsistent impact on individual performance.

CONCLUSION: Reading location and availability of prior images did not significantly alter group performance.

© 2013 The Royal College of Radiologists. Published by Elsevier Ltd. All rights reserved.

Introduction

* Guarantor and correspondent: B.P. Soh, Medical Image Optimisation and Perception Group (MIOPeG), Discipline of Medical Radiation Sciences (C42), University of Sydney, East Street, Lidcombe, NSW 2141, Australia. Tel.: +61 468968223; fax: +61 293519146. E-mail addresses: [email protected], paulinesoh85@yahoo.com.sg (B.P. Soh).

Test-sets are commonly used to evaluate the performance of radiologists and are often read in an environment other than radiologists' reading rooms. Results from these test-set evaluations can be used for a variety of applications, from quality assurance of radiology performance to clinical audits.1 The artificiality of the environment in which test-sets are read may introduce a number of confounding factors, such as the nature and extent of scrutiny of one's actions, the over-simplification of responses, and the prevalence of abnormality,1 and each of these can potentially affect performance scores and the value of subsequent conclusions. Although a recently published study demonstrated that test-set conditions can predict actual clinical reporting under specific circumstances to a reasonable level, no consideration was given to how different test reading conditions can affect readers' performance even when viewing monitors and ambient light conditions are optimized.2 Two such conditions are the location where test-set readings are performed and the availability of prior images.

With regard to location, there are good reasons why researchers prefer readings to take place in non-clinical environments, including controlling viewing conditions, minimizing clinical interruption, and encouraging focused attention on the task in hand3–7; however, to the authors' knowledge, no work is available to describe how well the laboratory reading location represents tests performed in the clinic. With regard to priors, these are often omitted from reader studies owing to the difficulty in obtaining and presenting these images and the increase in study times when they are available. Some workers have explored the impact of the availability of prior images and, although this does seem to have an impact, particularly on mammographic specificity,8–11 the relevance of these studies to currently performed research is questionable, as the previous work was carried out using film,8–12 with high abnormality prevalences that exceed clinical prevalences by up to a factor of 100,8–13 stated normal-to-abnormal ratios,10,12 and varying types of prior images.9 Also, little consideration has been given to the standardization of viewing conditions.8–12

Therefore, the purpose of the present study was to establish how the reading location and the availability of prior images can affect performance in breast test-set reading in the digital environment when viewing conditions are optimized.

Materials and methods

The study was granted institutional ethics approval and informed consent was obtained from each participating screen reader. The need for informed consent pertaining to the use of patient materials was waived.

General study design

Ten expert screen readers from BreastScreen New South Wales (BSNSW), with 4–38 years (median 19 years) of experience in interpreting screening mammograms and reading 2000–20,000 (median 6000) breast examinations each year, participated in this study (none of the authors participated in reading). The readers, all with expertise in screen reading, consisted of nine radiologists and one non-radiologist physician, the latter being responsible for the clinical management of screening service centres. Each reader interpreted a mammographic screening test-set under two of the following conditions (Fig 1) between February and August 2012: (1) reading condition 1: interpretation of a test-set in the radiologist's own clinical reporting environment, with the most recent prior images provided; (2) reading condition 2: interpretation of a test-set in a laboratory setting that simulated the clinical environment, with the most recent prior images provided; and (3) reading condition 3: interpretation of a test-set in a laboratory setting that simulated the clinical environment, without prior images.

Based on the above three conditions, readers were randomly allocated, in equal numbers, to one of the following two experimental groupings (Fig 1): (a) reading location: the difference in test-set readings when performed in a clinic or a laboratory (reading conditions 1 and 2, comparison A); or (b) prior images: the impact of prior images on test-set readings (reading conditions 2 and 3, comparison B). For each comparison, the reading conditions were separated by 4 months. Readers in each comparison group were counterbalanced between reading conditions, so that three out of five readers started with the first reading condition whilst the other two began with the second. The sequence of images in each test-set was randomly re-ordered for each reading condition.

Test-set reading sessions took place in a clinical environment (determined by the readers' usual reporting site within the BSNSW) for reading condition 1, where quality assurance of viewing conditions was performed on a regular basis, and in a laboratory environment (Brain and Mind Research Institute, University of Sydney) for reading conditions 2 and 3 (Fig 1). Ambient lighting levels in the laboratory were measured with an illuminance meter (CL-200; Konica Minolta, Osaka, Japan) and kept at 25–40 lux throughout the study, in line with available evidence-based guidelines.5,14 To closely resemble the clinical workstations used in BSNSW, the laboratory workstation was set up with a pair of EIZO Radiforce GS510 (EIZO, Ishikawa, Japan) five-megapixel medical-grade monochrome liquid crystal display monitors. Performance of the laboratory monitors was optimized to ensure that display efficacy was comparable to that available within BSNSW, as assessed by the authors in another recent study.7 The laboratory display monitors were calibrated using an EIZO UX1 Sensor and the quality-control software RadiCS15 to adhere to the greyscale display function.16 Mammographic hanging protocols and the picture archiving and communication systems (Sectra Imtec, Linköping, Sweden) were the same in both the clinical and laboratory environments. It should be emphasized that although the laboratory was designed to closely resemble the clinic, interruptions such as noise and distraction, which are common in the latter environment, were eliminated in the former setting.

Before the start of their initial reading condition, each reader was asked to complete a questionnaire on the approximate amount of clinical time they spent interpreting screening and diagnostic mammograms (Table 1).
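The allocation scheme lends itself to a short illustration. The following is a minimal Python sketch of the random grouping, counterbalancing, and per-condition re-ordering described above; the reader identifiers, seed, and function name are hypothetical, as the study did not publish its randomization procedure in code form.

```python
import random

rng = random.Random(2012)  # seed chosen only to make this illustration repeatable

# Hypothetical identifiers for the 10 participating readers.
readers = list(range(1, 11))
rng.shuffle(readers)

# Random allocation in equal numbers to the two experimental groupings.
comparison_a = readers[:5]  # reading location: conditions 1 vs 2
comparison_b = readers[5:]  # prior images: conditions 2 vs 3

def counterbalance(group, first, second, rng):
    """Three of five readers start with the first condition and two with
    the second, mirroring the counterbalancing described in the text."""
    orders = [(first, second)] * 3 + [(second, first)] * 2
    rng.shuffle(orders)
    return dict(zip(group, orders))

print(counterbalance(comparison_a, "condition 1", "condition 2", rng))
print(counterbalance(comparison_b, "condition 2", "condition 3", rng))

# The sequence of the 200 cases in a test-set is randomly re-ordered
# for each reading condition.
case_order = rng.sample(range(200), k=200)
```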


Figure 1 Comparisons employed in this study.

Two out of 10 readers were solely involved in breast screen reading, whereas the rest spent a varying proportion, ranging from 34–95% (median 60%), of their clinical time interpreting screening mammograms, with the remainder of their time spent reading diagnostic mammograms (Table 1).

Readers were asked to follow their usual clinical practice and to rate each side of the breast separately: return a case to routine screening for normal or benign findings, or recall it to assessment if further diagnostic investigation was required. For examinations recalled for further assessment, readers, in line with Australian reporting practice, were asked to rate the lesion (equivocal, suspicious, or malignant) and to provide information on the lesion such as the side, site, and nature of the abnormality. To further reflect usual clinical practice, a "technical recall" decision was provided as one of the options, and any technical recalls were excluded from the analysis; technical recall decisions were given only in reading condition 1 (four normal and one abnormal case) and reading condition 2 (four normal and two abnormal cases). Throughout the study, readers were unaware of the composition of the test-sets and of the specific aims of the study, and no time limit was imposed on reading sessions.

Table 1 Percentage of clinical time spent interpreting screening and diagnostic mammography examinations.

Reader   Screening mammography examinations (%)   Diagnostic mammography examinations (%)
1        50                                        50
2        34                                        66
3        70                                        30
4        95                                        5
5        80                                        20
6        100                                       0
7        50                                        50
8        80                                        20
9        40                                        60
10       100                                       0

Selection of mammograms

Each reader had a unique set of test images comprising 200 reader-specific, digitally acquired mammography examinations, which the individual had read clinically over the preceding 5 years (between January 2007 and October 2011), forming 10 unique reader-specific test-sets. Although the aim of each reader-specific test-set was to represent clinical prevalence closely, the number of abnormalities had to be increased for statistical purposes, resulting in a target of 10 true-positive (TP), 20 false-positive (FP), 160 true-negative (TN), and 10 false-negative (FN) cases for each test-set.

Definitions for each case type made use of the independent double-reading conditions in the screening service and were based on actual clinical reports: TP was defined as a pathologically confirmed cancer with the correct side of the breast recalled by the reader; FP was defined as an examination that was incorrectly recalled by the reader and found to be normal by two other readers or through subsequent diagnostic investigation; TN was defined as an examination that was correctly reported to be normal by the participating reader and verified either through concordant agreement by one other reader or through further diagnostic investigation; and FN was defined as a pathologically confirmed cancer that was detected at screening by the second and third readers and confirmed at the recall assessment clinic, but missed by the reader responsible for the test-set. A follow-up normal screening round was not required to confirm TN status, as the FN categorization within the screening service's clinical audit only considered cancers missed by one reader and detected by the second reader in the current screening round.

Each mammography examination comprised two-view, digitally acquired, bilateral mammograms (craniocaudal and mediolateral oblique). They were obtained from the BreastScreen Digital Imaging Library with all health record data de-identified, and were gathered by B.P.S. (one of the authors, a radiological technologist and researcher with 3 years of experience). Images with visible post-biopsy markers or surgical scars were excluded from the test-sets to minimize any memory effect.17 Five of the 10 reader-specific test-sets were presented with six to nine (an average of seven) FN cases instead of the targeted 10, owing to the rarity of these types of images; the shortfall was replaced by TN examinations to maintain the total number within each test-set.
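To make the case-mix logic concrete, here is a minimal sketch of assembling one reader-specific test-set. The function name and pool structure are hypothetical; only the target counts and the substitution of scarce FN cases with TN cases follow the description above.

```python
import random

TARGET = {"TP": 10, "FP": 20, "TN": 160, "FN": 10}  # per 200-case test-set

def build_test_set(pools, target=TARGET, seed=0):
    """Assemble one reader-specific test-set from pools of eligible cases.

    `pools` maps case type to a list of case IDs drawn from the reader's
    clinical history, e.g. {"TP": [...], "FP": [...], "TN": [...], "FN": [...]}.
    If a pool (in practice, FN) falls short of its target, the shortfall is
    made up with extra TN cases so the total stays at 200, mirroring the
    substitution described in the text.
    """
    rng = random.Random(seed)
    test_set, shortfall = [], 0
    for case_type, n in target.items():
        available = pools[case_type]
        take = min(n, len(available))
        shortfall += n - take
        test_set.extend(rng.sample(available, take))
    # Replace missing cases (e.g., rare FNs) with additional TN examinations.
    extra_tn = [c for c in pools["TN"] if c not in test_set]
    test_set.extend(rng.sample(extra_tn, shortfall))
    rng.shuffle(test_set)
    return test_set
```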

Data analysis

Following the method used in the BreastScreen clinical audit, this study focused on a side-specific analysis, in which an examination consisted of two sides of the breast and a TP score was recorded when a recall rating was correctly given to the side of the breast containing the pathologically confirmed cancer. Lesion- and site-specific analyses are not performed in the clinical audit. Readers assigned the following confidence scores to either one or both sides of the breast: return to routine screening (score 1); equivocal (score 2); suspicious (score 3); malignant (score 4).

The performance of readers in reading conditions 1–3 was assessed through the region-of-interest (ROI) figure-of-merit (FOM), side-specific sensitivity, and specificity. The dataset follows the ROI paradigm, in which each patient contributes two ROIs, i.e., a rating for each side of the breast; the ROI paradigm is therefore appropriate for analysing such clustered data.18 The true diagnostic status was known for each breast, which was classified as either abnormal (contained cancer) or normal (no cancer). The ROI FOM is defined as the empirical probability that a cancer-containing ROI is rated higher than a normal ROI.18 This method is implemented in JAFROC software, version 4.1 (D.P.C., Pittsburgh, PA, USA). Details of the ROI analysis, which yielded a p-value and confidence intervals for the ROI FOM for each reader-specific dataset, are given in the appendix of a study published elsewhere.2

Side-specific sensitivity was defined as the proportion of abnormal mammographic examinations with the side of the breast correctly detected and recalled (confidence score of 2, 3, or 4), and specificity as the proportion of normal mammographic examinations recognized and given a non-recall rating (confidence score of 1). Significance was then tested with a non-parametric Wilcoxon matched-pairs signed rank test.

Kendall's coefficient of concordance (also known as Kendall's W)19 was used to analyse the confidence scores assigned to the respective decisions made by readers, to evaluate the level of agreement between pairs of reading conditions (Fig 1). According to the present authors' interpretation of Charters' debate20 regarding various proposed cut-offs for Kendall's W, a score of more than 0.9 is excellent; 0.8 ≤ W < 0.9 is good; 0.7 ≤ W < 0.8 is acceptable; 0.6 ≤ W < 0.7 is moderate; and less than 0.6 indicates no agreement.

Statistical significance was set at p < 0.05 for all statistical comparisons. ROI analysis was performed using JAFROC software, version 4.1, whereas SPSS (SPSS, Chicago, IL, USA) software was employed for the remaining analyses.
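These definitions can be made concrete with a short sketch. The Python code below implements only the empirical ROI FOM as defined above (with ties counted as 0.5, a common convention; JAFROC's full analysis, which also yields p-values and confidence intervals, is not reproduced here), together with the side-specific sensitivity and specificity and an illustrative Wilcoxon matched-pairs test. All function names and scores are hypothetical.

```python
import numpy as np
from scipy.stats import wilcoxon

def roi_fom(abnormal, normal):
    """Empirical probability that a cancer-containing ROI (breast side) is
    rated higher than a normal ROI; ties count as 0.5 (a common convention;
    JAFROC's exact tie handling may differ)."""
    a = np.asarray(abnormal, float)[:, None]
    n = np.asarray(normal, float)[None, :]
    return (a > n).mean() + 0.5 * (a == n).mean()

def side_sensitivity(abnormal):
    """Proportion of abnormal sides recalled (confidence score 2, 3, or 4)."""
    return np.mean(np.asarray(abnormal) >= 2)

def side_specificity(normal):
    """Proportion of normal sides given a non-recall rating (score 1)."""
    return np.mean(np.asarray(normal) == 1)

# Illustrative confidence scores (1 = return to routine screening ... 4 = malignant).
abnormal_sides = [4, 3, 2, 1, 3]          # sides containing confirmed cancer
normal_sides = [1, 1, 2, 1, 1, 3, 1, 1]   # sides without cancer

print(roi_fom(abnormal_sides, normal_sides))
print(side_sensitivity(abnormal_sides), side_specificity(normal_sides))

# Paired comparison of a group's per-reader values across two reading
# conditions, as in the study's Wilcoxon matched-pairs signed rank test
# (values below are illustrative, not the study's data):
sens_cond1 = [0.80, 0.70, 0.90, 0.60, 0.85]
sens_cond2 = [0.74, 0.73, 0.82, 0.69, 0.84]
stat, p = wilcoxon(sens_cond1, sens_cond2)
print(stat, p)
```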

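Kendall's W can likewise be computed directly from the paired confidence scores. The sketch below uses the standard tie-corrected (Siegel and Castellan) form, which may differ in detail from the SPSS routine used in the study; the function name and the scores are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings):
    """Kendall's coefficient of concordance for an (m x n) array:
    m raters (here, the two reading conditions) rating n items
    (here, breast sides), with a correction for tied ratings."""
    ratings = np.asarray(ratings, float)
    m, n = ratings.shape
    ranks = np.vstack([rankdata(row) for row in ratings])  # ranks within each rater
    totals = ranks.sum(axis=0)                             # rank sum per item
    s = ((totals - totals.mean()) ** 2).sum()
    # Tie correction: sum of (t^3 - t) over tied groups within each rater.
    t = sum((c ** 3 - c).sum()
            for c in (np.unique(row, return_counts=True)[1] for row in ratings))
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * t)

# Two readings of the same breast sides (illustrative scores only):
condition_2 = [1, 1, 2, 4, 1, 3, 1, 2]
condition_3 = [1, 2, 2, 4, 1, 3, 1, 1]
w = kendalls_w([condition_2, condition_3])
print(w)  # interpret against the cut-offs above (e.g., W >= 0.7 is acceptable)
```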
Results

The ROI FOMs, side-specific sensitivity, specificity, and level-of-agreement results for all readers for comparisons A and B are shown in Table 2. Although the group values were similar, significant changes in individual ROI FOM were shown for the clinical versus laboratory comparison, with one reader demonstrating an increase (p < 0.05) and another a decrease (p < 0.05) in the latter environment (Table 2). When prior images were unavailable, the group value was similar to that demonstrated when prior images were present; however, a significant decrease (p < 0.05) in ROI FOM was noted for two readers. There were no significant sensitivity or specificity findings for any of the comparisons.

When agreement was considered, significant levels of agreement (p < 0.001), varying from an acceptable to a good level for individual readers, were demonstrated in all reading-condition comparisons, with W values ranging from 0.72–0.82 (group value 0.79) and 0.72–0.81 (group value 0.75) for the reading location and prior image comparisons, respectively (Table 2).

Discussion

Screening test-sets such as BREAST (BreastScreen REader Assessment STrategy)21 and PERFORMS (PERsonal perFORmance in Mammographic Screening)22 are increasingly utilized as tools for assessing readers' performance. These strategies were developed to augment clinical audit owing to its associated limitations, including the length of time required for underperformance to be identified given the low incidence of breast cancers.1 With this increasing

Table 2 Results for region-of-interest (ROI) figure-of-merit (FOM) for comparison A (reading condition 1: clinical read with prior images; reading condition 2: laboratory read with prior images) and comparison B (reading condition 2: laboratory read with prior images; reading condition 3: laboratory read without prior images). Values are ROI FOM (95% confidence interval).

A (reading location)
Reader   Reading condition 1    Reading condition 2
1        0.84 (0.72, 0.96)      0.72 (0.59, 0.85)
2        0.76 (0.64, 0.88)      0.86 (0.75, 0.96)
3        0.74 (0.62, 0.86)      0.85 (0.74, 0.95)
4        0.95 (0.88, 1.02)      0.91 (0.82, 1.01)
5        0.94 (0.86, 1.01)      0.88 (0.78, 0.98)
Median   0.84                   0.86

B (prior images)
Reader   Reading condition 2    Reading condition 3
6        0.86 (0.75, 0.98)      0.73 (0.60, 0.87)
7        0.88 (0.78, 0.97)      0.85 (0.75, 0.95)
8        0.78 (0.67, 0.90)      0.78 (0.66, 0.90)
9        0.89 (0.79, 0.98)      0.86 (0.75, 0.97)
10       0.99 (0.99, 1.00)      0.88 (0.78, 0.98)
Median   0.88                   0.85
