Carl J. D’Orsi, MD • David J. Getty, PhD • John A. Swets, PhD • Ronald M. Pickett, PhD • Steven E. Seltzer, MD • Barbara J. McNeil, MD, PhD

Reading and Decision Aids for Improved Accuracy and Standardization of Mammographic Diagnosis

Image-reading and decision aids were designed to improve the accuracy of mammogram interpretation. The reading aid was a list of diagnostic radiographic features and scales for quantification of each feature. The decision aid, a computer program, converted the reader’s scaled values, weighted for predictive power, into an advisory estimate of the probability of malignancy. The features were identified and their importance was assigned in four steps: (a) interviews of five expert readers to establish an initial set of features, (b) perceptual tests to refine the feature set, (c) a consensus meeting to refine this set and establish nomenclature and scales, and (d) the experts’ scaling of each feature in a set of 150 mammograms. Those scaled judgments were analyzed to provide the final list of features and their relative importance and to program the computer decision aid. To test the enhancement effect, six other radiologists interpreted a different set of mammograms without, and later with, the two aids. Receiver operating characteristic analysis showed a gain of approximately 0.05 in sensitivity or specificity when the other value remained at 0.85. In a subset of the more difficult cases, the enhancement effect was approximately 0.15 in either sensitivity or specificity.

Index terms: Breast neoplasms, diagnosis, 00.31, 00.32 • Breast neoplasms, radiography, 00.11 • Breast radiography, quality assurance, 00.11 • Computers, diagnostic aid • Receiver operating characteristic curve (ROC)

Radiology 1992; 184:619-622
Approximately 175,000 new cases of breast cancer will be diagnosed this year in the United States, and approximately 45,000 women will die of the disease (1). Early discovery with mammography can reduce the number of deaths caused by breast cancer. However, this gain comes at a price: an increased number of biopsies performed because of mammograms that are suggestive of, but not diagnostic for, malignancy. Nationally, the positive predictive value of a mammogram with suspected malignancy ranges from 15% to 30% (2-4). This relatively low value is caused in part by limitations of the examination, the use of substandard mammography units, and lack of expertise among technologists and radiologists (5). Efforts continue to design and distribute better equipment, and awareness of the need to standardize existing techniques of imaging has been growing (6). Despite these efforts, a need still exists for procedures that can improve the accuracy of radiologists’ performance and standardize their practice.

This article describes a means of standardization and enhancement of diagnostic interpretation and reporting and the application of such procedures to the problem of improved mammographic diagnosis of breast cancer. Our approach includes the following: (a) systematic determination of the relevant perceptual features in mammograms, (b) a checklist of these features along with a scale for each so that the radiologist can assess every feature quantitatively, (c) a computer program that accepts scaled values from the radiologist as a case is read and immediately issues an advisory probability of malignancy (based on the relative value of the features and the optimal combination of the scaled values), and (d) a computer-based report writer that converts the scaled values assigned by the radiologist into a standardized prose report for the referring physician and surgeon.

The potential usefulness of this approach is twofold. First, experiments indicate that it can substantially increase the radiologist’s diagnostic accuracy (7). That is, the specificity of the examination can be increased while the desired sensitivity is maintained, or vice versa. Second, it may be useful to explicitly and quantitatively standardize the diagnostic and reporting process at the fundamental level of feature analysis rather than tacitly assume connections between radiologists’ perceptual tendencies and their vocabulary or lexicon.

Our study followed the same general approach as a previous study (8) and a more recent study (9), both of which demonstrated marked gains in accuracy. Our approach advances the technique in three important respects: It explores radiographic features more extensively, uses a more quantitative and perceptually effective method of feature scaling, and applies a more general receiver operating characteristic (ROC)-based analysis of the gains in accuracy than previous approaches.

From the Department of Radiology, University of Massachusetts Medical Center, 55 Lake Ave N, Worcester, MA 01655 (C.J.D.); Department of Experimental Psychology, Bolt Beranek and Newman Inc, Cambridge, Mass (C.J.D., D.J.G., J.A.S., R.M.P., S.E.S.); and the Departments of Health Care Policy (J.A.S., B.J.M.) and Radiology (S.E.S., B.J.M.), Harvard Medical School, Boston. From the 1990 RSNA scientific assembly. Received November 15, 1991; revision requested January 14, 1992; revision received March 9; accepted April 9. Address reprint requests to C.J.D.
© RSNA, 1992

MATERIALS AND METHODS
The experiment summarized here is an extension of one reported in 1988 (9). In our previous experiment, we used xeromammograms that depicted known lesions at indicated locations; thus, the radiologist’s judgment was the classification of a specified lesion as benign or malignant. In our current experiment, we used film mammograms and a case set that included healthy patients; the radiologist studied the mammogram and made a detection, as well as a classification, judgment. A primary, detailed description of this experiment has appeared in a journal in another field (10); the experiment is reviewed here to present the approach to a wider radiologic audience and to support a fuller discussion of its clinical implications.

Abbreviation: ROC = receiver operating characteristic.

Film Material

Cases for this study were collected from the University of Cincinnati; the University of Massachusetts Medical Center, Worcester, Mass; and the Mary Hitchcock Clinic, Dartmouth, Mass. At these institutions, a total of 2,763 patients were invited to participate in a collaborative study of diaphanography, or light scanning. Mammograms were obtained with a screen-film technique and consisted of a craniocaudal and medial-oblique view of each breast. Malignant and benign status was determined by means of biopsy, and normal status was judged on the basis of unchanged mammographic findings for 2 years or longer. We used all 96 cases of malignancy available and 100 each of benign and normal cases matched for age. (Cysts were excluded because biopsy samples were not obtained from them.)

The 296 cases were divided into two sets. The first set (50 cases with findings of malignancy, 50 cases with benign findings, and 50 cases of normalcy; n = 150) was a “training” set, in that it was used to train the classifier to assign appropriate weights to the features. Our second set (47 cases with findings of malignancy, 49 cases with benign findings, and 50 cases of normalcy; n = 146) was a “test” set (ie, a separate set used to provide a realistic, independent test of the efficacy of the checklist plus the accuracy of the classifier).

Analytic Measures

Four steps were taken to identify the minimal but sufficient set of relevant features, design scales for them, and determine their relative value. (In general, the features were categorized as masses, clusters of calcifications, and secondary signs of malignancy, as detailed below.)

In step one, five specialists in mammography were interviewed to compile a comprehensive set of possibly relevant features.

In step two, the specialists rated (on a 10-point scale) the similarity of members of pairs of a subset of cases relative to their indication of malignancy. Multidimensional scaling analysis (11), a statistical analysis of the ratings, was performed to provide a second method of identifying possibly relevant features. This analysis constructs a multidimensional, perceptual space in which the locations and distances between cases reflect the similarity ratings. The several axes of this space can be interpreted as independent perceptual features. This analysis confirmed and refined the features mentioned in the interviews and indicated which of the features were so highly correlated that they were redundant. The set of features that remained is listed in Figure 1.

Figure 1. First refinement of feature list.
Focal abnormality
Mass: size (cranial-caudad view); size (oblique view); inclusion of fat; degree of irregularity of shape; type of shape irregularity; indistinct border; spiculated border; indistinctness due to invasion; intramammary node
Cluster of calcifications: not skin artifact; size of cluster (cranial-caudad view); size of cluster (oblique view); number of elements; shape of cluster; variability of size of elements; irregular shape of elements; linear or branching elements
Secondary signs: architectural distortion; asymmetric density; skin thickening or retraction; regional calcifications
Other data: age

In step three, this remaining set of features was discussed at a meeting of the five mammography specialists to establish consensus on what these features should be called and on how they should be scaled. The scales were either (a) a physical measure or (b) a 10-point judgment of the extent to which a feature was present or the reader’s confidence that it was present (Fig 2).

In step four, the specialists individually gave scale values for each feature for each of 150 cases; these scale values were analyzed by means of discriminant analysis (10,12) relative to the actual diagnosis to determine the capability of each feature to indicate malignancy. This analysis (specifically a stepwise, linear discriminant analysis) indicated that 12 features were independent and were of sufficient diagnostic relevance to be retained (Fig 3). The actual weights assigned to the various features are not listed here because they are specific to the particular way each feature is scaled; the order in which the features are listed indicates the ranking of importance of the 12 features.

Discriminant analysis was the basis for the computer program that accepts scaled values from the reader for any case and computes an estimated probability of malignancy. The checklist may be considered a reading aid, and the computer program (often called a pattern classifier, or “classifier”), a decision aid. Together, they take advantage of the human’s relative advantage in perceptual skills and the computer’s relative advantage in memory and computational skills. We tested the two aids jointly and hence do not have data on their separate contributions to the reader’s improved diagnostic accuracy.

Reading Appraisal

Six radiologists, highly experienced but not specialists in mammography, were recruited to interpret the second set of 146 cases in order to test the efficacy of the checklist plus classifier. In a first, “standard” condition, they read mammograms as they would under normal circumstances, and then, in the single breast specified by the test administrator, they estimated the probability of malignancy on a 100-point scale. Several months later, they read the same mammograms in the “enhanced” condition, always using the complete checklist and using the classifier’s advisory probability estimate to whatever extent they wished.

Between the two readings, the readers were given brief instructions on how to scale the features and were given 15 practice cases. After they completed a case, they were told the median of the specialists’ ratings for each feature and the status of the case (malignant, benign, or normal). At the conclusion of the training session, the test administrator reviewed with each reader any important differences between his or her ratings and those of the specialists that might have been observed across cases.

In the enhanced condition, the test administrator typed into the computer the scaled value assigned to each feature while the reader both verbalized it and recorded it by pencil on a form. As soon as the last scale value was entered for a case, the classifier produced its estimate of malignancy. The reader recorded the classifier’s estimate on the form and then his or her own estimate. For each reader, ROCs were calculated from the probability estimates of malignancy (10,13) in both the standard and enhanced reading conditions in the 146 test
cases.
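As a rough illustration of the ROC step just described, the following sketch computes an empirical ROC from 100-point probability estimates and reads off sensitivity at a fixed specificity. It is only a sketch under stated assumptions: the function names (`roc_points`, `sensitivity_at`) and all ratings are hypothetical, and the empirical threshold sweep shown here is a simple stand-in, not the ROC analysis of references 10 and 13.

```python
from typing import List, Tuple


def roc_points(ratings: List[float], malignant: List[bool]) -> List[Tuple[float, float]]:
    """Empirical ROC: one (false-positive fraction, sensitivity) pair per threshold."""
    n_pos = sum(malignant)
    n_neg = len(malignant) - n_pos
    points = [(0.0, 0.0)]  # strictest criterion: no case is called malignant
    # Sweep the decision criterion from strict to lenient; a case is called
    # malignant when its rating is at or above the threshold.
    for threshold in sorted(set(ratings), reverse=True):
        tp = sum(1 for r, m in zip(ratings, malignant) if m and r >= threshold)
        fp = sum(1 for r, m in zip(ratings, malignant) if not m and r >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points


def sensitivity_at(points: List[Tuple[float, float]], specificity: float) -> float:
    """Highest sensitivity among empirical points that meet the target specificity."""
    max_fpf = 1.0 - specificity
    return max(tpf for fpf, tpf in points if fpf <= max_fpf + 1e-9)


# Hypothetical data for one reader: 5 malignant cases followed by 7 nonmalignant cases.
malignant = [True] * 5 + [False] * 7
standard = [95, 80, 55, 40, 20, 70, 50, 35, 30, 15, 10, 5]
enhanced = [95, 85, 75, 45, 20, 60, 40, 30, 25, 15, 10, 5]

standard_sens = sensitivity_at(roc_points(standard, malignant), 0.85)
enhanced_sens = sensitivity_at(roc_points(enhanced, malignant), 0.85)
gain = enhanced_sens - standard_sens
print(f"standard {standard_sens:.2f}, enhanced {enhanced_sens:.2f}, gain {gain:.2f}")
```

The same comparison can be made in the other direction (specificity at a fixed sensitivity) by swapping the roles of the two fractions.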
In addition, a subset of the total case set, the 51 most difficult abnormal cases (approximately half benign and half malignant), was also examined. Difficulty was indexed according to the average divergence of the readers’ probability estimates from 100 in cases of malignancy and from 0 for benign cases.
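The difficulty index can be sketched as follows. This is a minimal illustration, assuming the divergence is the absolute difference between each reader's estimate and the correct extreme of the 100-point scale; the text does not state which reading condition supplied the estimates, and the case identifiers, ratings, and function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def difficulty(estimates: List[float], malignant: bool) -> float:
    """Average divergence of the readers' estimates from 100 (malignant) or 0 (benign)."""
    target = 100.0 if malignant else 0.0
    return mean(abs(target - e) for e in estimates)


def most_difficult(cases: Dict[str, dict], k: int) -> List[str]:
    """Return the identifiers of the k cases with the largest difficulty index."""
    ranked = sorted(
        cases,
        key=lambda cid: difficulty(cases[cid]["estimates"], cases[cid]["malignant"]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical estimates from six readers for four abnormal cases.
cases = {
    "A": {"malignant": True, "estimates": [90, 95, 85, 100, 80, 90]},
    "B": {"malignant": True, "estimates": [30, 40, 20, 50, 35, 25]},
    "C": {"malignant": False, "estimates": [10, 5, 20, 0, 15, 40]},
    "D": {"malignant": False, "estimates": [60, 70, 55, 80, 65, 50]},
}

hardest = most_difficult(cases, k=2)
print(hardest)
```

In this toy set, cases B and D are the ones on which the readers' estimates lie farthest from the truth, so they would be the ones selected for a "difficult" subset.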
RESULTS

The pooled ROC data for the six readers, for the full set of 146 cases, are shown in Figure 4. These ROCs show that the gain achieved by use of the aids was approximately 0.05 in sensitivity (specificity, 0.85) or approximately 0.05 in specificity (sensitivity, 0.85). In a one-tailed Student t test, the first quantity has a P value less than .05, and the second quantity, a P value equal to .10. Figure 5 represents the 51 most difficult cases. These ROCs show a larger gain of 0.15 in sensitivity or specificity (for sensitivity, P < .02; for specificity, P < .01). Another way to assess the gain in performance is noted in Figure 5, which demonstrates a simultaneous gain in sensitivity and specificity of approximately 0.08. The gain for the most difficult cases demonstrates that the aids were effective even when the standard reading performance was so low that it was at or near the chance level (along the major diagonal of Figure 5).

Figure 2. Example of scaled feature, with scaling for the shape of calcific elements to ascertain confidence levels that at least some indication of branching or curvilinearity exists. (Scale runs from -5 to 5, from punctate to branching/curvilinear: -5 = definitely no indication of branching, 0 = uncertain, 5 = definitely some indication of branching.)

Figure 3. Final feature list. Features on final list are ranked in order of importance for discrimination between benign status and malignancy. 1 = most important, 12 = least important.
1. Focal abnormality
2. Age
3. Architectural distortion or retraction
4. Indistinct border due to invasion (spiculation)
5. Shape of calcium cluster
6. Number of elements of calcium within cluster
7. Size of mass
8. Size of calcium cluster
9. Indistinct border with no evidence of invasion
10. Skin thickening
11. Linear or branching calcific elements
12. Irregular or variable shape and size of calcific elements

Figures 4, 5. (4) ROCs for standard and enhanced reading of 146 cases in the test set. In 4 and 5, the diagonal line from top to bottom indicates chance performance; the dashed arrows indicate differences in sensitivity and specificity (cf Results). (Adapted, with permission, from reference 10.) (5) ROCs for standard and enhanced reading of a subset of 51 difficult cases. (Adapted, with permission, from reference 10.) (Axes: False-Positive Fraction; True-Negative Fraction [Specificity].)

DISCUSSION

Our study indicates that a systematic approach to scaling the relevant perceptual features and combining the evidence across these features into a probability estimate or decision enhances diagnostic accuracy to a considerable degree. In a group of patients similar to the group we sampled, the aids would enable detection of approximately five more malignancies among every 100 patients with malignancy; alternatively, approximately five biopsies could be avoided in every 100 patients without malignancy. These percentages are substantially (approximately three times) higher when one considers the most difficult cases. Moreover, these estimates are likely to be conservative because our test readers were highly experienced in mammography and received only minimal training with reading and decision aids. In our previous study, performed with readers less experienced in mammography (but who had read approximately five mammograms a week for several years), the gains resulting from the aids were somewhat higher: approximately 0.12 in sensitivity or specificity (7). Whenever such gains are multiplied by the large number of women who could benefit from annual mammography, the effect is potentially great. The enhancement techniques should be tested further on a larger number of cases and directly in a clinical setting. We are planning such a study.

Presently, mammograms are typically read with a very lenient decision criterion for recommendation of biopsy, so that malignancy is found in only two or three of every 10 patients who undergo biopsy. As the number of mammograms continues to increase, one may experience various kinds of social pressure to increase the yield of biopsy. The aids described here are a means by which to increase the yield without reducing the rate of detection of malignancies. It is to be emphasized that the aids did not simply affect movement along a single ROC, which would be another way to increase the biopsy yield, but rather caused a shift to a higher ROC, an ROC that represents a greater capacity to discriminate between malignant and nonmalignant cases.

Routine use of a computer-based system like the one we have developed for experiments would bring several benefits. Radiologists could be trained to evaluate the perceptual features appropriately. A growing data base of an individual’s performance, including pathology reports and follow-up examinations, would provide updated measures of his or her performance. Review of problematic cases by specialists could provide detail about which features are being evaluated inadequately and hence give specific guidance for further training. Radiologists and radiology departments could keep track of biopsy yield apart from overall accuracy and reach agreement on how to implement changes in decision criteria in order to alter that yield if desirable.

These aids would help develop a standardization of film reading that, with additions to the computer system, could help standardize the reports written by the radiologist for the clinician and thereby enhance the quality of the clinician’s recommendations for action. An automatic report writer would generate an appropriate selection from a set of standardized phrases that are individually associated with the appropriate ratings for each diagnostic feature of the checklist. Thus, a low numerical rating for spiculation might trigger the phrase “No definite evidence for spiculation seen.” Relatively few keystrokes on the computer keyboard (the dozen or so keystrokes made to assign scaled values to relevant features) could thus quickly produce a standardized prose mammography report, including an advisory estimate of the probability of malignancy. Provision could be made for the radiologist to edit the report on-line.

The standardization of film reading and of action recommendations, along with the assessment of diagnostic performance, that these aids would enable should be particularized and guided in detail by the radiology community (eg, through committees of the American College of Radiology). We believe that, properly managed, the techniques described herein to delimit and assign degrees of relevancy to perceptual features of a diagnostic image, present a checklist of features for the radiologist’s routine use, combine (in an optimal manner) scaled values of the features as assigned by the radiologist, and generate a perception-based, standardized radiologic report can markedly improve quality assurance in mammography performed to diagnose breast cancer. The same principle applies to other imaging modalities used to detect other diseases.
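The report-writer idea discussed above (standardized phrases keyed to feature ratings) might be sketched as follows. Everything here is an assumption made for illustration: the 1-to-10 rating bands, the feature keys, the function names, and all phrases other than the spiculation example quoted in the text are hypothetical and are not taken from the system described in this article.

```python
from typing import Dict, List, Tuple

# Each feature maps to (upper bound of rating band, phrase) pairs in ascending
# order. Bands and wording are hypothetical; only the first spiculation phrase
# is the example quoted in the text.
PHRASES: Dict[str, List[Tuple[int, str]]] = {
    "spiculation": [
        (3, "No definite evidence for spiculation seen."),
        (7, "Possible spiculation of the border."),
        (10, "Definite spiculation of the border."),
    ],
    "skin_thickening": [
        (3, "No skin thickening seen."),
        (7, "Possible skin thickening."),
        (10, "Definite skin thickening."),
    ],
}


def phrase_for(feature: str, rating: int) -> str:
    """Select the standardized phrase whose rating band contains the rating."""
    for upper, phrase in PHRASES[feature]:
        if rating <= upper:
            return phrase
    raise ValueError(f"rating {rating} is outside the scale for {feature}")


def write_report(ratings: Dict[str, int], probability: float) -> str:
    """Join one phrase per rated feature and append the advisory estimate."""
    sentences = [phrase_for(feature, rating) for feature, rating in ratings.items()]
    sentences.append(
        f"Advisory estimate of the probability of malignancy: {probability:.2f}."
    )
    return " ".join(sentences)


report = write_report({"spiculation": 2, "skin_thickening": 9}, probability=0.12)
print(report)
```

Keeping the phrase table separate from the selection logic would let a committee revise the wording or the rating bands without touching the program, which matches the standardization role the text assigns to the radiology community.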
References

1. Boring CC, Squires TS, Tong T. Cancer statistics. CA 1991; 41:19-36.
2. Tabar L, Dean PB. The control of breast cancer through mammography screening. Radiol Clin North Am 1987; 25:933-1005.
3. Moskowitz M. Impact of a priori medical decisions on screening for breast cancer. Radiology 1989; 171:605-608.
4. Hall FM, Storella JM, Silverstone DZ, Wyshak G. Nonpalpable breast lesions: recommendations for biopsy based on suspicion of carcinoma at mammography. Radiology 1988; 167:353-358.
5. Hendrick RE. Standardization of image quality and radiation dose in mammography. Radiology 1990; 174:648-654.
6. Gaw VP, Bush SM, D’Orsi CJ, et al. A program to improve mammography skills of practicing radiologic technologists. QRB 1991; 17:48-53.
7. Getty DJ, Pickett RM, D’Orsi CJ, Swets JA. Enhanced interpretation of diagnostic images. Invest Radiol 1988; 23:240-252.
8. Ackerman LV, Mucciardi AN, Gose EE, Alcorn FS. Classification of benign and malignant breast tumors on the basis of 36 radiographic properties. Cancer 1973; 31:342-352.
9. Gale AG, Roebuck EJ, Riley P, Worthington BS. Computer aids to mammographic diagnosis. Br J Radiol 1987; 60:887-891.
10. Swets JA, Getty DJ, Pickett RM, D’Orsi CJ, Seltzer SE, McNeil BJ. Enhancing and evaluating diagnostic accuracy. Med Decis Making 1991; 11:9-18.
11. Schiffman SS, Reynolds ML, Young FW. Introduction to multidimensional scaling. New York: Academic Press, 1981; 3-88, 169-210.
12. Dawes RM, Corrigan B. Linear models in decision making. Psychol Bull 1974; 81:95-106.
13. Begg CB, McNeil BJ. Assessment of radiologic tests: control of bias and other design considerations. Radiology 1988; 167:565-569.