Reading and decision aids for improved accuracy and standardization of mammographic diagnosis.

Carl J. D’Orsi, MD • David J. Getty, PhD • John A. Swets, PhD • Ronald M. Pickett, PhD • Steven E. Seltzer, MD • Barbara J. McNeil, MD, PhD

Image-reading and decision aids were designed to improve the accuracy of mammogram interpretation. The reading aid was a list of diagnostic radiographic features and scales for quantification of each feature. The decision aid, a computer program, converted the reader’s scaled values, weighted for predictive power, into an advisory estimate of the probability of malignancy. The features were identified and their importance was assigned in four steps: (a) interviews of five expert readers to establish an initial set of features, (b) perceptual tests to refine the feature set, (c) a consensus meeting to refine this set and establish nomenclature and scales, and (d) the experts’ scaling of each feature in a set of 150 mammograms. Those scaled judgments were analyzed to provide the final list of features and their relative importance and to program the computer decision aid. To test the enhancement effect, six other radiologists interpreted a different set of mammograms without, and later with, the two aids. Receiver operating characteristic analysis showed a gain of approximately 0.05 in sensitivity or specificity when the other value remained at 0.85. In a subset of the more difficult cases, the enhancement effect was approximately 0.15 in either sensitivity or specificity.

Index terms: Breast neoplasms, diagnosis, 00.31, 00.32 • Breast neoplasms, radiography, 00.11 • Breast radiography, quality assurance, 00.11 • Computers, diagnostic aid • Receiver operating characteristic curve (ROC)

Radiology 1992; 184:619-622

From the Department of Radiology, University of Massachusetts Medical Center, 55 Lake Ave N, Worcester, MA 01655 (C.J.D.); Department of Experimental Psychology, Bolt Beranek and Newman Inc, Cambridge, Mass (C.J.D., D.J.G., J.A.S., R.M.P., S.E.S.); and the Departments of Health Care Policy (J.A.S., B.J.M.) and Radiology (S.E.S., B.J.M.), Harvard Medical School, Boston. From the 1990 RSNA scientific assembly. Received November 15, 1991; revision requested January 14, 1992; revision received March 9; accepted April 9. Address reprint requests to C.J.D. © RSNA, 1992

Approximately 175,000 new cases of breast cancer will be diagnosed this year in the United States, and approximately 45,000 women will die of the disease (1). Early discovery with mammography can reduce the number of deaths caused by breast cancer. However, this gain comes at a price: an increased number of biopsies performed because of mammograms that are suggestive of, but not diagnostic for, malignancy. Nationally, the positive predictive value of a mammogram with suspected malignancy ranges from 15% to 30% (2-4). This relatively low value is caused in part by limitations of the examination, the use of substandard mammography units, and lack of expertise among technologists and radiologists (5). Efforts continue to design and distribute better equipment, and awareness of the need to standardize existing techniques of imaging has been growing (6). Despite these efforts, a need still exists for procedures that can improve the accuracy of radiologists’ performance and standardize their practice.

This article describes a means of standardization and enhancement of diagnostic interpretation and reporting and the application of such procedures to the problem of improved mammographic diagnosis of breast cancer. Our approach includes the following: (a) systematic determination of the relevant perceptual features in mammograms, (b) a checklist of these features along with a scale for each so that the radiologist can assess every feature quantitatively, (c) a computer program that accepts scaled values from the radiologist as a case is read and immediately issues an advisory probability of malignancy (based on the relative value of the features and the optimal combination of the scaled values), and (d) a computer-based report writer that converts the scaled values assigned by the radiologist into a standardized prose report for the referring physician and surgeon.

The potential usefulness of this approach is twofold. First, experiments indicate that it can substantially increase the radiologist’s diagnostic accuracy (7). That is, the specificity of the examination can be increased while the desired sensitivity is maintained, or vice versa. Second, it may be useful to explicitly and quantitatively standardize the diagnostic and reporting process at the fundamental level of feature analysis rather than tacitly assume connections between radiologists’ perceptual tendencies and their vocabulary or lexicon.

Our study followed the same general approach of a previous study (8) and a more recent study (9), both of which demonstrated marked gains in accuracy. Our approach advances technique in three important respects: It explores radiographic features more extensively, uses a more quantitative and perceptually effective method of feature scaling, and applies a more general receiver operating characteristic (ROC)-based analysis of the gains in accuracy than previous approaches.

MATERIALS AND METHODS

The experiment summarized here is an extension of one reported in 1988 (9). In our previous experiment, we used xeromammograms that depicted known lesions at indicated locations; thus, the radiologist’s judgment was the classification of a specified lesion as benign or malignant. In our current experiment, we used film mammograms and a case set that included healthy patients; the radiologist studied the mammogram and made a detection, as well as a classification, judgment. A primary, detailed description of this experiment has appeared in a journal in another field (10); the experiment is reviewed here to present the approach to a wider radiologic audience and to support a fuller discussion of its clinical implications.

Abbreviation: ROC = receiver operating characteristic.

Film Material

Cases for this study were collected from the University of Cincinnati; the University of Massachusetts Medical Center, Worcester, Mass; and the Mary Hitchcock Clinic, Dartmouth, Mass. At these institutions, a total of 2,763 patients were invited to participate in a collaborative study of diaphanography, or light scanning. Mammograms were obtained with a screen-film technique and consisted of a craniocaudal and medial-oblique view of each breast. Malignant and benign status was determined by means of biopsy, and normal status was judged on the basis of unchanged mammographic findings for 2 years or longer. We used all the 96 cases of malignancy available and 100 each of benign and normal cases matched for age. (Cysts were excluded because biopsy samples were not obtained from them.)

Analytic Measures

Four steps were taken to identify the minimal but sufficient set of relevant features, design scales for them, and determine their relative value. (In general, the features were categorized as masses, clusters of calcifications, and secondary signs of malignancy, as detailed below.) In step one, five specialists in mammography were interviewed to compile a comprehensive set of possibly relevant features. In step two, the specialists rated (on a 10-point scale) the similarity of members of pairs of a subset of cases relative to their indication of malignancy. Multidimensional scaling analysis (11), a statistical analysis of the ratings, was performed to provide a second method of identifying possibly relevant features. This analysis constructs a multidimensional, perceptual space in which the locations and distances between cases reflect the similarity ratings. The several axes of this space can be interpreted as independent perceptual features. This analysis confirmed and refined the features mentioned in the interviews and indicated which of the features were so highly correlated that they were redundant. The set of features that remained is listed in Figure 1.

Figure 1. First refinement of feature list. The list covers the focal abnormality; the mass (size in the cranial-caudad and oblique views, inclusion of fat, shape irregularity, spiculated border, indistinct border due to invasion); the cluster of calcifications (size of cluster in the cranial-caudad and oblique views, number of elements, shape of cluster, variability of size of elements, irregular shape of elements, linear or branching elements); secondary signs (architectural distortion, asymmetric density, skin thickening or retraction, regional calcifications); and other data (age).

In step three, this remaining set of features was discussed at a meeting of the five mammography specialists to establish consensus on what these features should be called and on how they should be scaled. The scales were either (a) a physical measure or (b) a 10-point judgment of the extent to which a feature was present or the reader’s confidence that it was present (Fig 2). In step four, the specialists individually gave scale values for each feature for each of 150 cases; these scale values were analyzed by means of discriminant analysis (10,12) relative to the actual diagnosis to determine the capability of each feature to indicate malignancy. This analysis (specifically a stepwise, linear discriminant analysis) indicated that 12 features were independent and were of sufficient diagnostic relevance to be retained (Fig 3). The actual weights assigned to the various features are not listed here because they are specific to the particular way each feature is scaled; the order in which the features are listed indicates the ranking of importance of the 12 features.

Discriminant analysis was the basis for the computer program that accepts scaled values from the reader for any case and computes an estimated probability of malignancy. The checklist may be considered a reading aid, and the computer program (often called a pattern classifier, or “classifier”), a decision aid. Together, they take advantage of the human’s relative advantage in perceptual skills and the computer’s relative advantage in memory and computational skills. We tested the two aids jointly and hence do not have data on their separate contributions to the reader’s improved diagnostic accuracy.

The 296 cases were divided into two sets. The first set (50 cases with findings of malignancy, 50 cases with benign findings, and 50 cases of normalcy; n = 150) was a “training” set, in that it was used to train the classifier to assign appropriate weights to the features. Our second set (47 cases with findings of malignancy, 49 cases with benign findings, and 50 cases of normalcy; n = 146) was a “test” set (ie, a separate set used to provide a realistic, independent test of the efficacy of the checklist plus the accuracy of the classifier).

Reading Appraisal

Six radiologists, highly experienced but not specialists in mammography, were recruited to interpret the second set of 146 cases in order to test the efficacy of the checklist plus classifier. In a first, “standard” condition, they read mammograms as they would under normal circumstances, and then, in the single breast specified by the test administrator, they estimated the probability of malignancy on a 100-point scale. Several months later, they read the same mammograms in the “enhanced” condition, always using the complete checklist and using the classifier’s advisory probability estimate to whatever extent they wished. Between the two readings, the readers were given brief instructions on how to scale the features and were given 15 practice cases. After they completed a case, they were told the median of the specialists’ ratings for each feature and the status of the case (malignant, benign, or normal). At the conclusion of the training session, the test administrator reviewed with each reader any important differences between his or her ratings and those of the specialists that might have been observed across cases.

In the enhanced condition, the test administrator typed into the computer the scaled value assigned to each feature while the reader both verbalized it and recorded it by pencil on a form. As soon as the last scale value was entered for a case, the classifier produced its estimate of malignancy. The reader recorded the classifier’s estimate on the form and then his or her own estimate.

For each reader, ROCs were calculated from the probability estimates of malignancy (10,13) in both the standard and enhanced reading conditions in the 146 test cases. In addition, a subset of the total case set, the 51 most difficult abnormal cases (approximately half benign and half malignant), was also examined. Difficulty was indexed according to the average divergence of the readers’ probability estimates from 100 in cases of malignancy and from 0 for benign cases.
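The difficulty index described above can be illustrated with a short sketch. This is only one plausible reading of "average divergence": it assumes the absolute difference between each reader's 0-100 estimate and the correct extreme, averaged over readers. The function names, the case records, and the sample estimates are hypothetical and are not taken from the study.

```python
from statistics import mean


def difficulty(estimates, malignant):
    """Average divergence of the readers' 0-100 estimates from the correct
    extreme: 100 for a malignant case, 0 for a benign case (assumed reading)."""
    target = 100 if malignant else 0
    return mean(abs(target - e) for e in estimates)


def most_difficult(cases, k):
    """Return the k cases with the largest difficulty index, hardest first."""
    ranked = sorted(
        cases,
        key=lambda c: difficulty(c["estimates"], c["malignant"]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical estimates from six readers for four abnormal cases.
cases = [
    {"id": "A", "malignant": True, "estimates": [90, 80, 95, 85, 70, 90]},
    {"id": "B", "malignant": True, "estimates": [20, 35, 10, 40, 25, 30]},
    {"id": "C", "malignant": False, "estimates": [5, 10, 0, 15, 5, 10]},
    {"id": "D", "malignant": False, "estimates": [60, 55, 70, 40, 65, 50]},
]

hardest = most_difficult(cases, 2)
print([c["id"] for c in hardest])
```

With these made-up numbers, cases B and D are selected, because the readers' estimates for them lie farthest from the correct end of the scale.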

Figure 2. Example of scaled feature, with scaling for the shape of calcific elements (punctate or branching/curvilinear) to ascertain confidence levels that at least some indication of branching or curvilinearity exists. The scale runs from -5 (definitely no indication of branching) through 0 (uncertain) to 5 (definitely some indication of branching).

Final Feature List

1. Focal abnormality

2. Age

3. Architectural distortion

4. Indistinct border, no evidence of invasion

5. Shape of calcium cluster

6. Number of elements within cluster

7. Size of mass

8. Size of calcium cluster

9. Indistinct border due to invasion (spiculation)

10. Skin thickening or retraction

11. Linear or branching calcific elements

12. Irregular or variable shape and size of calcific elements

Figure 3. Features on final list are ranked in order of importance for discrimination between benign status and malignancy. 1 = most important, 12 = least important.

RESULTS

The pooled ROC data for the six readers, for the full set of 146 cases, are shown in Figure 4. These ROCs show that the gain achieved by use of the aids was approximately 0.05 in sensitivity (specificity, 0.85) or approximately 0.05 in specificity (sensitivity, 0.85). In a one-tailed Student t test, the first quantity has a P value less than .05, and the second quantity, a P value equal to .10. Figure 5 represents the 51 most difficult cases. These ROCs show a larger gain of 0.15 in sensitivity or specificity (for sensitivity, P < .02; for specificity, P < .01).

Another way to assess the gain in performance is noted in Figure 5, which demonstrates a simultaneous gain in sensitivity and specificity of approximately 0.08. The gain for the most difficult cases demonstrates that the aids were effective even when the standard reading performance was so low that it was near the chance level (along the major diagonal of Figure 5).

Figures 4, 5. (4) ROCs for standard and enhanced reading of 146 cases in the test set. In 4 and 5, the diagonal line from top to bottom indicates chance performance; the dashed arrows indicate differences in sensitivity and specificity (cf Results). (Adapted, with permission, from reference 10.) (5) ROCs for standard and enhanced reading of a subset of 51 difficult cases. (Adapted, with permission, from reference 10.) Axes shown: False-Positive Fraction and True-Negative Fraction (Specificity).

DISCUSSION

Our study indicates that a systematic approach to scaling the relevant perceptual features and combining the evidence across these features into a probability estimate or decision enhances diagnostic accuracy to a considerable degree. In a group of patients similar to the group we sampled, the aids would enable detection of approximately five more malignancies among every 100 patients with malignancy; alternatively, approximately five biopsies could be avoided in every 100 patients without malignancy. These percentages are substantially (approximately three times) higher when one considers the most difficult cases. Moreover, these estimates are likely to be conservative because our test readers were highly experienced in mammography and received only minimal training with the reading and decision aids. In our previous study, performed with test readers somewhat less experienced in mammography (but who had read approximately five mammograms a week for several years), the gains resulting from the aids were higher: approximately 0.12 in sensitivity or specificity (7).

Whenever such gains are multiplied by the large number of women who could benefit from annual mammography, the effect is potentially great. The enhancement techniques should be tested further on a larger number of cases and directly in a clinical setting. We are planning such a study.

Presently, mammograms are typically read with a very lenient decision criterion for recommendation of biopsy, so that malignancy is found in only two or three of every 10 patients who undergo biopsy. As the number of mammograms continues to increase, one may experience various kinds of social pressure to increase the yield of biopsy. The aids described here are a means by which to increase the yield without reducing the rate of detection of malignancies. It is to be emphasized that the aids did not simply effect a movement along a single ROC, which would be another way to affect the biopsy yield, but rather caused a shift to a higher ROC, an ROC that represents a greater capacity to discriminate between malignant and nonmalignant cases.

Routine use of a computer-based system like the one we have developed for experiments would bring several benefits. Radiologists could be trained to evaluate the perceptual features appropriately. A growing data base of an individual’s performance, including pathology reports and follow-up examinations, would provide updated measures of his or her performance. Review of problematic cases by specialists could provide detail about which features are being evaluated inadequately and hence give specific guidance for further training. Radiologists and radiology departments could keep track of biopsy yield apart from overall accuracy and reach agreement on how to implement changes in decision criteria in order to alter that yield if desirable.

These aids would help develop a standardization of film reading that, with additions to the computer system, could help standardize the reports written by the radiologist for the clinician and thereby enhance the quality of the clinician’s recommendations for action. An automatic report writer would generate an appropriate selection from a set of standardized phrases that are individually associated with the appropriate ratings for each diagnostic feature of the checklist. Thus, a low numerical rating for spiculation might trigger the phrase “No definite evidence for spiculation seen.” Relatively few keystrokes on the computer keyboard (the dozen or so keystrokes made to assign scaled values to relevant features) could thus quickly produce a standardized prose mammography report, including an advisory estimate of the probability of malignancy. Provision could be made for the radiologist to edit the report on-line.

The standardization of film reading and of action recommendations, along with the assessment of diagnostic performance, that these aids would enable should be particularized and guided in detail by the radiology community (eg, through committees of the American College of Radiology). We believe that, properly managed, the techniques described herein to delimit and assign degrees of relevancy to perceptual features of a diagnostic image, present a checklist of features for the radiologist’s routine use, combine (in an optimal manner) scaled values of the features as assigned by the radiologist, and generate a perception-based, standardized radiologic report can markedly improve quality assurance in mammography performed to diagnose breast cancer. The same principle applies to other imaging modalities used to detect other diseases. •
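The report-writer idea in the Discussion, in which each feature rating selects one standardized phrase, can be sketched as a simple lookup. This is an illustrative sketch only, not the authors' program. The -5 to 5 rating scale is borrowed from the Figure 2 example and assumed here to apply to every feature. The phrase table, the rating ranges, and the function names are hypothetical; only the sentence "No definite evidence for spiculation seen." comes from the text.

```python
# Hypothetical phrase table: each feature maps to (low, high, phrase) ranges
# on an assumed -5..5 rating scale. Only the first spiculation phrase is
# quoted from the article; every other entry is invented for illustration.
PHRASES = {
    "spiculation": [
        (-5, -2, "No definite evidence for spiculation seen."),
        (-1, 1, "Spiculation is uncertain."),
        (2, 5, "Spiculation is present."),
    ],
    "architectural distortion": [
        (-5, -2, "No architectural distortion seen."),
        (-1, 1, "Architectural distortion is uncertain."),
        (2, 5, "Architectural distortion is present."),
    ],
}


def phrase_for(feature, rating):
    """Return the standardized phrase whose range contains the rating."""
    for low, high, text in PHRASES[feature]:
        if low <= rating <= high:
            return text
    raise ValueError(f"rating {rating} is outside the scale for {feature}")


def write_report(ratings, probability):
    """Join one phrase per rated feature and append the advisory estimate."""
    sentences = [phrase_for(f, r) for f, r in ratings.items()]
    sentences.append(
        f"Advisory estimate of the probability of malignancy: {probability:.2f}."
    )
    return " ".join(sentences)


print(write_report({"spiculation": -4, "architectural distortion": 3}, 0.35))
```

Keeping the phrases in a table separate from the selection logic is one way to let a professional body revise the standardized wording without changing the program.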

References

1. Boring CC, Squires TS, Tong T. Cancer statistics. CA 1991; 41:19-36.

2. Tabar L, Dean PB. The control of breast cancer through mammography screening. Radiol Clin North Am 1987; 25:933-1005.

3. Moskowitz M. Impact of a priori medical decisions on screening for breast cancer. Radiology 1989; 171:605-618.

4. Hall FM, Storella JM, Silverstone DZ, Wyshak G. Nonpalpable breast lesions: recommendations for biopsy based on suspicion of carcinoma at mammography. Radiology 1988; 167.

5. Hendrick RE. Standardization of image quality and radiation dose in mammography. Radiology 1990; 174:648-654.

6. Gaw VP, Bush SM, D’Orsi CJ, et al. A program to improve mammography skills of practicing radiologic technologists. QRB 1991; 17:48-53.

7. Getty DJ, Pickett RM, D’Orsi CJ, Swets JA. Enhanced interpretation of diagnostic images. Invest Radiol 1988; 23:240-252.

8. Ackerman LV, Mucciardi AN, Gose EE, Alcorn FS. Classification of benign and malignant breast tumors on the basis of 36 radiographic properties. Cancer 1973; 31:342-352.

9. Gale AG, Roebuck EJ, Riley P, Worthington BS. Computer aids to mammographic diagnosis. Br J Radiol 1987; 60:887-891.

10. Swets JA, Getty DJ, Pickett RM, D’Orsi CJ, Seltzer SE, McNeil BJ. Enhancing and evaluating diagnostic accuracy. Med Decis Making 1991; 11:9-18.

11. Schiffman SS, Reynolds ML, Young FW. Introduction to multidimensional scaling. New York: Academic Press, 1981; 3-88, 169-210.

12. Dawes RM, Corrigan B. Linear models in decision making. Psychol Bull 1974; 81:95-106.

13. Begg CB, McNeil BJ. Assessment of radiologic tests: control of bias and other design considerations. Radiology 1988; 167:565-569.
