Downloaded from www.ajronline.org by NYU Langone Med Ctr-Sch of Med on 07/20/15 from IP address 128.122.253.212. Copyright ARRS. For personal use only; all rights reserved
Perspective

Strategies for Improving Power in Diagnostic Radiology Research

Craig A. Beam¹

Research studies in diagnostic radiology often compare the diagnostic abilities of two imaging techniques. The "power" of such studies is the probability that they will detect a difference in abilities of a certain amount when, indeed, such a difference does exist. This article outlines several strategies that can be used to assess and improve the power of radiologic diagnostic studies. These strategies include selection of cases and controls, matching, use of one-tailed tests, selection of significance level, and choice of sample size.
Received June 24, 1991; accepted after revision February 25, 1992.
¹ Department of Radiology and Division of Biometry, Department of Community and Family Medicine, Box 3808, Duke University Medical Center, Durham, NC 27710.
AJR 159:631-637, September 1992 0361-803X/92/1593-0631 © American Roentgen Ray Society

A story told in statistical circles goes like this: There once was an investigator presenting his findings about the safety of a new compound to a scientific gathering. At one point the investigator reports to his audience, "Thirty-three percent of the rats died within 24 hr after administration of the agent, and 33% survived at least 24 hr after administration. Unfortunately, the third rat got away." The joke, of course, is the incredulity of findings based on so small a sample and the use of "statistification" to dress things up. The fact that almost everyone who hears this story appreciates its point shows that we all somehow understand that larger sample sizes give more reliable results and that, indeed, sometimes sample sizes can be too small to arrive at sound scientific conclusions.

The reliability of a study can be viewed in two ways. In one, we consider the ability the study has to find something if, indeed, it is there to be found. This is the power of the study, and the purpose of this paper is to familiarize the radiologist with several strategies that can be used to assess and improve the power of studies that compare imaging techniques. This familiarity will promote and facilitate interaction with statisticians at the design stage of a research study. The other aspect of a study's reliability is the precision with which estimates of diagnostic ability are made. Although space limitations do not allow review of this other aspect of a study's reliability, strategies that increase the power of a study often also increase the precision of estimates. Table 1 presents some of the concepts used in the analysis of the power of diagnostic studies.

Strategies for Increasing Power

Quite often in diagnostic imaging research, studies are conducted to determine which of two techniques has better sensitivity or specificity. Power here, then, is the probability that the experiment will find a difference in the two sensitivities (or specificities) when, in fact, there is a difference of a certain amount to be found. The power of a diagnostic study depends to a large extent on its design and analysis. Accordingly, several strategies can be used to increase power via study design and analysis.

Selection of Cases and Control Subjects

One way to optimize the power to detect a difference in sensitivities is to select cases that will be neither too easy nor too difficult to diagnose. This method is based on the notion
that larger differences in diagnostic performance ought to be easier to detect than smaller differences and that tests can be made to look similar in their performance if the cases selected are all too easy or too difficult to diagnose. For example, inclusion of patients who have only large lesions will make sensitivities of both tests nearly 100% and, hence, very close. Similar concerns apply to the selection of control subjects for the comparison of specificities.

TABLE 1: Concepts Used in Power Analysis

Null hypothesis: The statement that the two imaging techniques are equivalent in their ability to diagnose or that one technique (the "contender") is no better than the other (the "reference" technique).
p value: The probability of observing data as extreme or more extreme than those observed in the study, assuming the null hypothesis is true.
Significance level: The cutoff for deciding which p values lead to rejection of the null hypothesis.
Power: The probability that the study will detect a difference in diagnostic abilities of a certain amount given such a difference actually exists.
Study Design

Another way to increase power is by choosing the most judicious study design and selecting the most sensitive method of analysis. Study design is often more important than the method of analysis because the latter depends on the characteristics of the data, which are often determined by the design of the study. For example, diagnostic studies in which both techniques are evaluated in the same subjects ("paired" studies) are always at least as powerful, and usually more powerful, than studies in which separate groups of subjects are used for each technique [1]. This statement is true, however, only insofar as the method of analyzing the data makes good use of the additional information that is acquired through pairing. Hence, it is essential to choose the most sensitive statistical method appropriate to the study design.
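The pairing advantage can be illustrated with a small Monte Carlo sketch (this simulation is not from the article; the shared per-patient random draw is a simplifying assumption that makes the two readings agree as much as possible). It compares an unpaired Yates-corrected chi-square analysis of two separate groups with a McNemar analysis of the same per-group number of paired patients:

```python
import random
from math import sqrt, erf

def chi2_sf_1df(x):
    # Survival function of a chi-square variable with 1 df: P(X > x)
    return 1.0 - erf(sqrt(x / 2.0))

def yates_p(x1, x2, n):
    # Yates-corrected chi-square test on a 2x2 table of
    # correct/incorrect counts, with n patients per group
    a, b, c, d = x1, n - x1, x2, n - x2
    total = 2 * n
    den = (a + b) * (c + d) * (a + c) * (b + d)
    if den == 0:
        return 1.0
    num = total * max(abs(a * d - b * c) - total / 2.0, 0.0) ** 2
    return chi2_sf_1df(num / den)

def mcnemar_p(b, c):
    # McNemar test uses only the discordant pairs b and c
    if b + c == 0:
        return 1.0
    return chi2_sf_1df((b - c) ** 2 / (b + c))

random.seed(1)
sens_ref, sens_con, n, reps = 0.80, 0.95, 40, 2000
hits_unpaired = hits_paired = 0
for _ in range(reps):
    # Unpaired design: two independent groups of n patients each
    x1 = sum(random.random() < sens_ref for _ in range(n))
    x2 = sum(random.random() < sens_con for _ in range(n))
    hits_unpaired += yates_p(x1, x2, n) < 0.05
    # Paired design: the same n patients imaged by both techniques;
    # one shared draw per patient induces strong positive correlation
    b = c = 0
    for _ in range(n):
        u = random.random()
        ref_ok, con_ok = u < sens_ref, u < sens_con
        b += con_ok and not ref_ok
        c += ref_ok and not con_ok
    hits_paired += mcnemar_p(b, c) < 0.05
print(hits_paired / reps, hits_unpaired / reps)  # paired power is larger
```

With these (assumed) sensitivities of 80% and 95%, the paired design rejects far more often than the unpaired design at the same group size, in line with the statement above.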
Specification of the Null Hypothesis

The null hypothesis is the position of the doubting Thomas. But, not only is the scientific Thomas a doubter, he also comes from Missouri, and the goal of the scientific experiment is often to acquire the evidence that shows the doubting Thomas he is wrong. In other words, the null hypothesis is the assumption that no differences exist until proved otherwise. In regard to studies comparing two diagnostic techniques, the null hypothesis is either the assertion of equality in abilities between the techniques or the assertion that one of the techniques (the "contender") is simply no better than another technique (the "reference" technique).

The statistical method of hypothesis testing rejects the null hypothesis whenever the differences in the results of a study are too unlikely relative to the outcome expected by the null hypothesis. The "unlikeliness" of the data in relation to the null hypothesis is measured by the p value, which is the probability of observing data as extreme or more extreme than those observed in the study. This probability is computed under the assumption that the null hypothesis is true. Largely by convention, data that have p values less than .05 are considered too extreme and lead to rejection of the null hypothesis. But, in fact, the cutoff used to define which p values are too small (the "significance level" of the hypothesis test) is arbitrary and can be selected by the researcher to be a value different from the conventional 5%. The next section discusses this strategy in greater detail.

When the null hypothesis claims that the two diagnostic techniques are equivalent, more extreme data come from performance by the contender that is either better or worse than the performance of the reference. As such, evidence against the null hypothesis is to be found in either extreme or "tail" of the probability distribution of the statistic used to compare the techniques. This type of hypothesis test has come to be known as a "two-tailed" test. On the other hand, if the null hypothesis states that the contender is simply no better than the reference, evidence contrary to this hypothesis can come only when the contender outperforms the reference. Hence, here the evidence against the null hypothesis comes from only one extreme or "tail" of the probability distribution of the statistic used to compare the techniques. This type of hypothesis test is known as a one-tailed test.

One-tailed hypothesis tests are usually more powerful than two-tailed tests, and thus another tool that we can use to influence power is our specification of the null hypothesis. When diagnostic techniques are compared, a one-tailed hypothesis test seeks to show superiority of one particular technique over another. Thus, one-tailed tests are used when we wish to show that a new technique (the "contender") is better than a "reference" technique. In this case, the null hypothesis states that the contender is no better (could even be worse!) than the reference. If we reject this null hypothesis, we then conclude that the contender is, indeed, superior. On the other hand, if we simply seek to decide whether the two techniques differ diagnostically, then we would use a two-tailed hypothesis test with the null hypothesis stating that they are equal.

The investigator must be sure not to use this strategy indiscriminately, for there are situations in which the one-tailed hypothesis test is not appropriate, as in those cases in which our interest is in finding the best diagnostic technique, whichever that might be. The one-tailed test fails to be adequate here because it can provide only evidence of the superiority of one particular technique over the other and cannot demonstrate superiority the other way around. Sometimes "weeding out" inferior techniques is as important as finding better ones, and in these cases the two-tailed test is required.
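The power gain from the one-tailed specification can be sketched with the usual normal approximation for comparing two proportions (an illustrative back-of-the-envelope calculation, not a formula from this article; the 80%/95% sensitivities and 72 subjects per group anticipate the worked example given later):

```python
from math import sqrt, erf

def phi(z):
    # Standard normal cumulative distribution function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def approx_power(p1, p2, n, z_crit):
    # Normal-approximation power to detect p2 > p1 with n subjects
    # per group, rejecting when the z statistic exceeds z_crit
    se = sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    return phi((p2 - p1) / se - z_crit)

n = 72  # subjects per group
one_tailed = approx_power(0.80, 0.95, n, 1.645)  # 5% level, one tail
two_tailed = approx_power(0.80, 0.95, n, 1.960)  # 5% level, two tails
assert one_tailed > two_tailed
```

At the same sample size and significance level, the one-tailed test has noticeably higher power than the two-tailed test, which is the trade-off described above.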
Selection of Significance Level

A fourth way to influence power is by the choice of the "significance level" of our hypothesis test. The significance level defines which p values will be considered "too small" and acts as a cutoff to decide whether or not to reject the null hypothesis. The p value from the experiment is compared with this cutoff and, if it is smaller, we then declare the results too extreme, decide against the null hypothesis, and conclude that the diagnostic techniques differ.

Power and the significance level share an important relationship: as the significance level of a test increases, the power of the test increases. For example, consider increasing a significance level from 5% to 10%. In the first case, only data that have less than a 5% (i.e., a one in 20) chance of occurring are considered too "unlikely" and lead to the rejection of the null hypothesis. In the second case, the data can be even more likely to occur (as likely as having a one in 10 chance of occurring) and will still be considered too "unlikely." Thus, we have an easier time rejecting the null hypothesis in the 10% case than in the 5% case and so have greater power.

Yet, although one might be tempted to use a larger significance level in order to increase the chance of finding something, the other side of the issue is that the significance level is also the probability of rejecting a true null hypothesis; that is, of making the error of declaring there is a difference when in fact none exists. Hence, although we do gain power by increasing the significance level from 5% to 10%, we also increase the probability of making a false-positive type of finding in our study from 5% to 10%. The 5% significance level (p < .05) currently exists in the literature as the standard, and so the use of a different level might arouse suspicion. Nonetheless, the technique of changing the significance level should not be overlooked as a means of increasing power when designing a study in those cases in which, for example, every bit of power that can be mustered is crucial and the penalty arising from a false-positive conclusion is minimal. Of course, this alteration must be done at the study design stage, because it is not ethical to adjust the significance level after the fact in order to yield statistically "significant" results.

Finally, as is well known, one can influence the power of a study via the choice of sample size. However, use of this strategy to increase power requires a bit more in-depth consideration on our part. The next section discusses the selection of sample size to achieve desired levels of power in radiologic diagnostic studies.

Power and Sample Size

Comparing Sensitivities and Specificities
An important initial consideration in the design of studies that compare the sensitivity or specificity of radiologic techniques is whether or not both techniques will be applied to the same set of subjects. A study that uses separate groups of subjects for each imaging technique would, in standard terminology, be called an "unmatched-groups" or "independent-groups" study. A study in which both techniques are applied to the same group of subjects would be called a "matched-groups" or, more specifically, a "paired" study. Aside from ethical considerations, each type of study has its own statistical considerations that influence the choice of sample size for obtaining a certain probability, or power, to detect differences between techniques. As will be seen later, these different considerations arise from different statistical approaches to the comparison of the techniques.

A standard method for comparing sensitivities or specificities of techniques in unmatched-groups diagnostic studies is to assign separate samples of patients to be imaged by each of the two techniques and then summarize diagnostic performance in a table such as given in Figure 1. Here we have n1 patients being imaged by the reference technique and n2 by the contender. Proportions of correct diagnoses in the two samples of patients can then be compared statistically by using the Yates continuity-corrected χ² test [2].

                +     −
Reference       a1    b1    n1
Contender       a2    b2    n2

Fig. 1.-Typical summary of diagnostic performance in an unmatched-groups study design: n1 patients have been imaged by the reference technique and n2 have been imaged by the contender. Letters in squares indicate numbers of patients. If all patients have the abnormality, then the sensitivity of the reference is estimated by a1/n1 and that of the contender by a2/n2. If all patients are control subjects, then the specificity of the reference is estimated by b1/n1 and that of the contender by b2/n2.

Important assumptions underlie this statistical test. If they are not met, the results obtained will be questionable. The first assumption is that each sample of subjects represents a random sample from the same population. If this is not the case, besides invalidating the statistical properties of the test, any observed differences could be due instead to differences between the populations and not due to real differences in diagnostic performance of the techniques. A second assumption is that the probability of correct diagnosis is the same from patient to patient within each sample. This assumption is largely met by randomly sampling each population. However, it does also require constancy and uniformity in imaging and image interpretation. For example, this statistical procedure will be invalid if learning accompanies the interpretation. A final assumption is that the images are interpreted independently; that is, how one image is interpreted in no way influences the interpretation of another. Absence of either of these two latter assumptions invalidates the estimate of variability used to form the statistic and the calculation of its p value.

Casagrande and Pike [3] provide a formula to estimate the common sample size for each imaging technique required to obtain a desired level of power for the comparison of sensitivities (or specificities) in an unmatched-groups diagnostic study. This formula is presented in Table 2 and pertains to the one-tailed situation, in which we are interested in determining only if one test (the contender) is better (e.g., more
sensitive or specific) than another test (the reference). Also, although sensitivity is used in the examples, everything developed applies equally to the analysis of specificity as well. Also note that none of the formulas introduced here can be used to compare diagnostic accuracy or predictive values, as these quantities require the additional estimation of prevalence rates, which is not typically done in these studies.

As can be seen from Table 2, the formula from Casagrande and Pike requires specification of: (a) the significance level of the hypothesis test, (b) the desired power, and (c) the sensitivities of the two techniques. The last item required for specification shows that sample size determination depends on subjective appraisal and prior knowledge, because to use this formula we must be able to suggest values for the sensitivities of the two techniques that are not only plausible but also represent the smallest clinically important difference. Obviously, to consider implausible values is a waste of time and to consider differences too small to be of practical interest is a waste of resources. Table 3 provides an example of the use of the Casagrande and Pike formula to estimate sample size.

TABLE 2: Sample Size Estimation When Comparing Sensitivities (Specificities) in the Unmatched-Groups Diagnostic Study

Data: As in Figure 1
Test: χ² with Yates correction (see Snedecor and Cochran [2])
Sample size: Using the method given by Casagrande and Pike [3], the sample size for each group is estimated to be

n = A × [1 + √(1 + 4 × (P2 − P1)/A)]² / [4 × (P2 − P1)²]

where:
P1 = sensitivity (specificity) of the reference technique,
P2 = sensitivity (specificity) of the contender,
A = [SLF × (2 × P̄ × Q̄)^½ + PF × (P1 × Q1 + P2 × Q2)^½]²,
P̄ = (P1 + P2)/2, Q̄ = 1 − P̄, Q1 = 1 − P1, and Q2 = 1 − P2.

Factors: The following factors associated with the significance level and power are used in the formulas presented in this paper for estimating sample size:

Significance Level (%)   SLF        Power (%)   PF
 1                       2.325      99          2.325
 5                       1.645      95          1.645
10                       1.280      90          1.280
                                    80          0.840
                                    70          0.525

TABLE 3: Example of Sample Size Estimation in the Unmatched-Groups Diagnostic Study

Suppose we want to design our study so that there is an 80% chance of detecting a difference in the two imaging techniques when the sensitivity of the reference technique is 80% and the sensitivity of the contender is 95%.

From Table 2, the power factor (PF) for 80% power is found to be 0.840. Also, for a 5% significance level, the significance level factor (SLF) is 1.645. The following quantities are required for the formula in Table 2: P̄ = 0.875, Q̄ = 0.125, Q1 = 0.20, and Q2 = 0.05.

A = [1.645 × (2 × 0.875 × 0.125)^½ + 0.840 × (0.8 × 0.2 + 0.95 × 0.05)^½]² = 1.327

Then, the estimated sample size is

n = 1.327 × [1 + √(1 + 4(0.15)/1.327)]² / [4(0.15)²] = 71.7

As we cannot have a fraction of a patient, we should round this estimate up to ensure our study has 80% power.
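The Table 2 formula is easy to compute directly, as a check on the arithmetic in Table 3 (a sketch; the default factors assume a 5% one-tailed significance level and 80% power, per the factors listed in Table 2):

```python
from math import sqrt, ceil

def unmatched_n(p1, p2, slf=1.645, pf=0.840):
    """Per-group sample size from the Casagrande-Pike formula (Table 2)
    for comparing two sensitivities (or specificities) in an
    unmatched-groups diagnostic study."""
    pbar = (p1 + p2) / 2.0
    a = (slf * sqrt(2.0 * pbar * (1.0 - pbar))
         + pf * sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))) ** 2
    d = p2 - p1
    n = a * (1.0 + sqrt(1.0 + 4.0 * d / a)) ** 2 / (4.0 * d * d)
    return ceil(n)  # round up; a fraction of a patient is impossible

print(unmatched_n(0.80, 0.95))  # -> 72 per group, matching Table 3
```

The unrounded value is 71.7, which rounds up to the 72 patients per technique used in the discussion that follows.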
Here we suppose we wish to compare two techniques in which, from our experience, we expect the reference technique to have approximately 80% sensitivity. Also suppose that we are interested in the contender only if its sensitivity exceeds 95% (only then will the cost of this new procedure be worthwhile to recommend it in place of the reference). Also, suppose we wish to have at least 80% power in picking up this type of a difference between the two techniques. From the formula, we find we would need about 72 patients per technique (i.e., 144 patients total) to have an 80% chance of finding such an improvement over the reference.

Certainly this is not an encouraging result from the standpoint of practical study design. For, how often is it reasonable to expect total sample sizes to exceed 100 in those ubiquitous situations in which ethical and economic considerations severely restrict enrollment of patients? Fortunately, the matched-groups design offers an alternative that usually provides greater power per sample size than afforded by the unmatched-groups study.

In the matched-groups imaging study each patient is imaged by both techniques. A radiologist then gives two diagnoses for each patient based on separate interpretations of the two images. The resulting data from n patients can then be presented as in the cross-tabulation illustrated in Figure 2. An important piece of information obtainable from this table, and the reason this table should be presented whenever reporting diagnostic results from matched-groups studies, is the patterns of agreement and disagreement between the two techniques. For example, from this table we see that in (a + d) of the n patients the contender agreed with the reference. Supposing that all n patients had the abnormality (i.e., positives or +), then we would also observe that in the cases of disagreement, b patients were correctly classified by the contender and incorrectly by the reference. Similarly, c patients were correctly classified by the reference and not by the contender.

                    Reference
                    +     −
Contender     +     a     b     a+b
              −     c     d
                    a+c         n

Fig. 2.-Typical summary of diagnostic performance in a matched-groups study design: n patients have been imaged by both techniques. Letters in squares indicate numbers of patients. If all patients have the abnormality, then the sensitivity of the reference is estimated by (a + c)/n and that of the contender by (a + b)/n. Difference in sensitivities would be estimated by (b − c)/n. If all patients are control subjects, then the specificity of the reference is estimated by (b + d)/n and that of the contender by (c + d)/n. Difference in specificities would be estimated by (b − c)/n.

If all n patients indeed have the abnormality, then the sensitivity of the reference is estimated by (a + c)/n, the sensitivity of the contender is estimated by (a + b)/n, and the difference in sensitivities is then estimated by (b − c)/n. Hence, when we compare the sensitivities of these two techniques, the only useful information we have in deciding between the two resides in the instances of their disagreement. Furthermore, if the techniques really are equal, then we should expect disagreement to occur solely by chance, with an equal chance of disagreement in either possible direction (either b or c in Figure 2). Assuming this null hypothesis to be the case, we would expect to observe about equal values for b and c in the cross-classification table when the techniques are equal. This is the idea behind the McNemar test.

The McNemar test is used for the analysis of the matched-groups imaging study and assumes that the n patients are a random sample from some population. As before, this study design also assumes the sensitivity or specificity of each technique is constant. Violations of these assumptions can compromise the reliability of this statistical procedure. Further discussion about the use of the McNemar test in radiologic research can be found in Dwyer [1].

Conner [4] provides a formula that can be used to estimate sample size in the matched-groups study. Although this formula can be a bit too optimistic by underestimating the sample size, those rare cases in which high power might be obtained from small sample sizes are unlikely to occur in radiologic research and can be ignored. Table 4 summarizes this formula. In addition to specifying plausible values for the two tests' sensitivities, as in the unmatched-groups case, the matched-groups design also requires specification of the probability of disagreement between the tests. In fact, our choice of sensitivities (or specificities) for the two tests determines the range of values possible for this probability. Table 4 provides a formula for the limits on the probability of disagreement, and Table 5 provides an example.

TABLE 4: Sample Size Estimation When Comparing Sensitivities (Specificities) in the Matched-Groups Diagnostic Study

Data: As in Figure 2
Test: McNemar (see Dwyer [1])
Sample size: Using the method given by Conner [4], the sample size for the entire study is estimated to be

n = [SLF × Ψ^½ + PF × (Ψ − δ²)^½]² / δ²

where: δ = P2 − P1, Ψ = probability of disagreement between the techniques, P1 = sensitivity (specificity) of the reference technique, P2 = sensitivity (specificity) of the contender, SLF = significance level factor (see Table 2), and PF = power factor (see Table 2).

Bounds on the Probability of Disagreement (Ψ): Given values for P1 and P2, the minimum probability of disagreement equals P2 − P1. For practical purposes, the maximum probability of disagreement is when agreement occurs solely by chance. In this case, the probability of disagreement equals P1 × (1 − P2) + (1 − P1) × P2.

TABLE 5: Example of Sample Size Estimation in the Matched-Groups Diagnostic Study

As in Table 3, suppose we want to design our study so that there is an 80% chance of detecting a difference in two imaging techniques when the sensitivity of the reference technique is 80% and the sensitivity of the contender is 95%. As in the earlier example, PF = 0.840 and SLF = 1.645. From Table 4, we see that the lowest possible probability of disagreement between the two techniques is 0.95 − 0.80 = 0.15. The greatest probability of disagreement is 0.80 × (0.05) + (0.20) × 0.95 = 0.23.

Assuming the lowest possible probability of disagreement, our estimated sample size is

n = [1.645 × 0.15^½ + 0.840 × (0.15 − 0.15²)^½]² / 0.15²

Rounding up to ensure 80% power, we would estimate 40 patients would be required if there is the lowest possible level of disagreement between the two techniques. If we assume the two techniques agree only at the level expected by chance, then, using 0.23 as the probability of disagreement, we would estimate that 62 patients would be required to ensure an 80% chance of detecting this particular difference between the two techniques.

For example, suppose as before that our reference test has 80% sensitivity and that we wish to detect an improvement in the contender at least as great as 95%. On the basis of these values, we know from Table 4 that the probability of disagreement ranges from a low of 15% to a high of 23%. If we assume the lowest probability of disagreement between the two techniques, then from the formula we would estimate that about 40 patients would be required to achieve 80% power. On the other hand, if we assume the highest probability of disagreement (which, for practical purposes, occurs when the techniques agree solely by chance), we would estimate that about 62 patients would be needed.

Compare these sample size requirements with those from the unmatched-groups design. There we estimated 144 patients would be required to achieve 80% power. Thus, in going to a matched-groups design we save at least 82 patients in the situation of high probability of disagreement and up to 102 patients in the very plausible situation of low probability of disagreement between the two tests.
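Conner's matched-groups estimate can be sketched the same way (again a check on the published arithmetic; psi is the assumed probability of disagreement, and the defaults assume a 5% one-tailed level and 80% power):

```python
from math import sqrt, ceil

def matched_n(p1, p2, psi, slf=1.645, pf=0.840):
    """Total sample size from Conner's formula (Table 4) for a
    matched-groups study with disagreement probability psi."""
    d = p2 - p1
    return ceil((slf * sqrt(psi) + pf * sqrt(psi - d * d)) ** 2 / (d * d))

psi_low = 0.95 - 0.80                    # minimum possible disagreement
psi_high = 0.80 * 0.05 + 0.20 * 0.95     # agreement solely by chance
print(matched_n(0.80, 0.95, psi_low))    # -> 40 patients
print(matched_n(0.80, 0.95, psi_high))   # -> 62 patients
```

These reproduce the 40- and 62-patient estimates quoted in the text, versus 144 total patients for the unmatched design at the same power.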
One can readily see from this example the efficiency (and economy) that can be obtained through the use of matched groups in the study design. However, also note that although the savings to be had is in the number of patients imaged, it is not necessarily in the number of images to be interpreted. In the unmatched-groups study, 144 total images had to be interpreted. But, although fewer patients needed to be enrolled in the matched-groups study, if the probability of disagreement is "high," then 124 images (two from each patient) will have to be interpreted. Thus, ignoring the issue of patients, the major benefit to be had from the matched-groups design comes when the chance of disagreement between the tests is likely to be low. In fact, as seen in our example, if the chance of disagreement is as low as 15%, then we require 80 images (since n = 40) to be interpreted, still a big savings over the requirements of the unmatched-groups design.

** This power analysis is for the comparison of a technique having 95.0%
   sensitivity/specificity with a REFERENCE technique having 80.0%
   sensitivity/specificity when using a 5% Significance Level.
Based upon the values you selected:
   The LOWEST possible probability of disagreement is .150
   The highest probability of disagreement considered by this analysis
   occurs when the techniques agree solely by chance. For your values
   this HIGHEST probability of disagreement is .230
   SSADSIR has used .190 as a MEDIUM amount of disagreement

ESTIMATED TOTAL NUMBER OF PATIENTS

            UNMATCHED            MATCHED STUDY
   POWER    STUDY                (Probability of Disagreement)
                                 LOW     MED     HIGH
   60.00     96. ( 48./group)    24.     30.     37.
   65.00    106. ( 53./group)    27.     34.     42.
   70.00    116. ( 58./group)    31.     39.     47.
   75.00    130. ( 65./group)    35.     44.     54.
   80.00    144. ( 72./group)    40.     51.     62.
   85.00    164. ( 82./group)    46.     58.     71.
   90.00    190. ( 95./group)    54.     69.     84.
   95.00    232. (116./group)    67.     86.    106.
{Use PRINTSCREEN button to print this analysis}
If you want to run SSADSIR, press to continue.

Fig. 3.-Example of output from a sample size estimation program (SSADSIR). In the example given, the reference technique has 80% sensitivity and is being compared with a contender that has 95% sensitivity. The output shows sample size estimates for both unmatched- and matched-groups studies that are required to obtain powers ranging from 60% to 95%. The program also reports the lowest and highest probabilities of disagreement possible and reports sample sizes for these extremes as well as for a probability of disagreement halfway between the extremes.

SSADSIR

Figure 3 shows output from an interactive FORTRAN program that does the sample size estimation described in this article. This program (called SSADSIR, for Sample Size Analysis of Diagnostic Studies in Radiology) asks the user to enter values for the sensitivity or specificity of the reference technique and that of the contender. The program then prints a table of sample-size estimates for both unmatched- and matched-groups studies to obtain powers ranging from 60% to 95%. The program also reports the lowest and highest probabilities of disagreement possible for the set of sensitivities or specificities specified by the user, and reports sample sizes for matched-groups studies using these extreme values as well as sample size using a value of disagreement halfway between the extremes.*

The example shown in Figure 3 continues the earlier examples with the reference technique having 80% sensitivity and the contender having 95% sensitivity. The sample sizes required for 80% power are the same as in the earlier examples. We also see in this output sample sizes for other powers. Thus, we observe that increasing our desired level of power increases our sample-size requirement. For example, going from 80% power to 95% power will cost us an additional 88 patients in the unmatched study. The same increase in power, however, costs only an additional 27 patients in the matched-groups study when the disagreement between the techniques is at the lowest possible level.

* A free copy of the SSADSIR program will be provided upon receipt of a formatted diskette and mailer. Please send requests to: SSADSIR, Section of Imaging Research Statistics, Box 3808, Department of Radiology, Duke University Medical Center, Durham, NC 27710.
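The table in Figure 3 can be approximately re-created by combining the Table 2 and Table 4 formulas (a sketch, not the SSADSIR source; small rounding differences from the FORTRAN output are possible, and only the 80% and 95% power rows are shown here):

```python
from math import sqrt, ceil

SLF = 1.645                      # 5% one-tailed significance level factor
PF = {80: 0.840, 95: 1.645}      # power factors from Table 2

def unmatched_per_group(p1, p2, pf):
    # Casagrande-Pike per-group sample size (Table 2)
    pbar = (p1 + p2) / 2.0
    a = (SLF * sqrt(2.0 * pbar * (1.0 - pbar))
         + pf * sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))) ** 2
    d = p2 - p1
    return ceil(a * (1.0 + sqrt(1.0 + 4.0 * d / a)) ** 2 / (4.0 * d * d))

def matched_total(p1, p2, psi, pf):
    # Conner total sample size (Table 4)
    d = p2 - p1
    return ceil((SLF * sqrt(psi) + pf * sqrt(psi - d * d)) ** 2 / (d * d))

p1, p2 = 0.80, 0.95
lo = p2 - p1                          # lowest disagreement, .150
hi = p1 * (1 - p2) + (1 - p1) * p2    # chance agreement, .230
med = (lo + hi) / 2.0                 # halfway between, .190
for power, pf in sorted(PF.items()):
    print(power, 2 * unmatched_per_group(p1, p2, pf),
          matched_total(p1, p2, lo, pf),
          matched_total(p1, p2, med, pf),
          matched_total(p1, p2, hi, pf))
# 80% row: 144 unmatched total; 40 / 51 / 62 matched, as in Figure 3
```

The 95% power row likewise reproduces the Figure 3 values of 232 unmatched and 67/86/106 matched patients.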
Summary

I reviewed several strategies for an investigator to use when designing a study to optimize the chances of a successful outcome. Methods for estimating required sample size were reviewed for study situations and designs commonly reported in the radiologic literature. Other methods are available (see, for example, Snedecor and Cochran [2] and Cohen [5]).

The accuracy of estimates of sample size depends on how closely the required assumptions are met. Investigators should be aware of these assumptions. Failure to meet these assumptions does not eliminate the possibility of doing the investigation. Alternative procedures with their own methods to estimate sample size often exist for the statistical procedures I have described. In these cases, the investigator should consult with a statistician.

ACKNOWLEDGMENTS

I thank Mark E. Baker, Duke University Medical Center, for suggesting this paper and for helpful discussions. I also thank Dan Sullivan, Susan Paine, and the other reviewers of this manuscript for their suggestions.

REFERENCES

1. Dwyer AJ. Matchmaking and McNemar in the comparison of diagnostic modalities. Radiology 1991;178:328-330
2. Snedecor GW, Cochran WG. Statistical methods, 7th ed. Ames: The Iowa State University Press, 1980:124-125, 129-130
3. Casagrande JT, Pike MC. An improved approximate formula for calculating sample sizes for comparing two binomial distributions. Biometrics 1978;34:483-486
4. Conner RJ. Sample size for testing differences in proportions for the paired-sample design. Biometrics 1987;43:207-211
5. Cohen J. Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum Associates, 1988