SPECIAL CONTRIBUTION biostatistics

Introduction to Biostatistics: Part 3, Sensitivity, Specificity, Predictive Value, and Hypothesis Testing

Diagnostic tests guide physicians in assessment of clinical disease states, just as statistical tests guide scientists in the testing of scientific hypotheses. Sensitivity and specificity are properties of diagnostic tests and are not predictive of disease in individual patients. Positive and negative predictive values are predictive of disease in patients and are dependent on both the diagnostic test used and the prevalence of disease in the population studied. These concepts are best illustrated by study of a two by two table of possible outcomes of testing, which shows that diagnostic tests may lead to correct or erroneous clinical conclusions. In a similar manner, hypothesis testing may or may not yield correct conclusions. A two by two table of possible outcomes shows that two types of errors in hypothesis testing are possible. One can falsely conclude that a significant difference exists between groups (type I error). The probability of a type I error is α. One can falsely conclude that no difference exists between groups (type II error). The probability of a type II error is β. The consequence and probability of these errors depend on the nature of the research study. Statistical power indicates the ability of a research study to detect a significant difference between populations, when a significant difference truly exists. Power equals 1 - β. Because hypothesis testing yields "yes" or "no" answers, confidence intervals can be calculated to complement the results of hypothesis testing. Finally, just as some abnormal laboratory values can be ignored clinically, some statistical differences may not be relevant clinically. [Gaddis GM, Gaddis ML: Introduction to biostatistics: Part 3, sensitivity, specificity, predictive value, and hypothesis testing. Ann Emerg Med May 1990;19:591-597.]

Gary M Gaddis, MD, PhD* Monica L Gaddis, PhD† Kansas City, Missouri

From the Departments of Emergency Health Services* and Surgery,† University of Missouri -- Kansas City School of Medicine, Truman Medical Center, Kansas City. Received for publication September 1, 1989. Accepted for publication January 30, 1990. Address for reprints: Gary M Gaddis, MD, PhD, Department of Emergency Health Services, University of Missouri -- Kansas City School of Medicine, Truman Medical Center, 2301 Holmes, Kansas City, Missouri 64108.

INTRODUCTION
Diagnostic tests guide the physician in assessment of clinical disease entities. In a similar manner, statistical inference theory guides the scientist in the testing of scientific hypotheses. Before discussing inferential techniques (parts 4 and 5 of this series), it is necessary to understand the basis of hypothesis testing, to gain an appreciation of the types of questions inferential statistics help answer. Clinical diagnostic testing and hypothesis testing have many parallels, but most clinicians are more familiar with diagnostic testing than with hypothesis testing. Therefore, this article will focus on the components of diagnostic testing theory, including sensitivity, specificity, and predictive value. This will be followed by analogies to facilitate understanding of hypothesis testing.

EVALUATION OF DIAGNOSTIC TESTS
Sensitivity and Specificity
Physicians make medical diagnoses with the aid of the patient history, physical examination, and diagnostic testing. Numerous new diagnostic tests are presented each year in the medical literature, and each must be evaluated before it is introduced into the clinical setting. Most new diagnostic tests are evaluated in relation to another older, previously accepted, often more invasive, and historically reliable test (the "gold standard" test). Common examples of gold standards include the use of ECG changes plus cardiac enzyme levels to diagnose acute myocardial infarction, or pulmonary angiography to diagnose pulmonary embolism. For the purposes of

19:5 May 1990

Annals of Emergency Medicine

591/145

BIOSTATISTICS Gaddis & Gaddis

our discussion, it will be assumed that results obtained by the gold standard test are always correct.

Hypothetically, imagine that a new magnetic resonance imaging (MRI) venogram has been proposed as a noninvasive means of evaluating patients suspected by clinical criteria of having a deep venous thrombosis. The MRI venogram, the proposed new diagnostic test, will be evaluated against the traditional and widely used gold standard, the intravenous contrast venogram. Table 1 shows that there are four possible outcomes of diagnostic testing. Patients can be diagnosed as having deep venous thrombosis or not having deep venous thrombosis by both the gold standard test and by the new MRI diagnostic test, if patients undergo both tests.

In Table 1, 250 patients clinically suspected of having deep venous thrombosis undergo both tests. Of the 250 patients clinically suspected to have deep venous thrombosis, 150 actually do have deep venous thrombosis, with 130 shown to have deep venous thrombosis by both the gold standard test and by the new MRI test. This group of 130 is termed the true positive (TP) group by the new diagnostic test because they are shown to have disease by the new test and are also proven to have disease by the gold standard test. However, 20 of the 150 patients who are proven by the gold standard test to have deep venous thrombosis had a negative MRI diagnostic test. These 20 are termed the false negative (FN) group because they were classified incorrectly as disease free by the new MRI test. Similarly, 100 of the patients were judged disease free by the contrast venogram, but of these, only 87 had a negative MRI test. This group of 87 constitutes the true negative (TN) group. The remaining 13 were incorrectly classified by the new MRI test as having a deep venous thrombosis, when in fact they did not have the disease. This constitutes the false positive (FP) group.
The two by two outcome table in Table 1 can now be used to help us evaluate how well the new MRI test does in detecting deep venous thrombosis. We want to know the answers to two questions: Is the test sensitive enough to detect the presence of a deep venous thrombosis in a diseased

TABLE 1. Gold standard versus diagnostic test

                                  Gold Standard Test (Contrast Venogram)
Diagnostic Test (MRI Venogram)    Disease Evident    No Disease Evident    Total
  Disease Evident                 TP (130)           FP (13)               143
  No Disease Evident              FN (20)            TN (87)               107
  Total                           150                100                   250

TABLE 2. Gold standard versus diagnostic test

                                  Gold Standard Test (Contrast Venogram)
Diagnostic Test (MRI Venogram)    Disease Evident    No Disease Evident    Total
  Disease Evident                 TP (35)            FP (21)               56
  No Disease Evident              FN (5)             TN (139)              144
  Total                           40                 160                   200

TABLE 3. Possible outcomes of hypothesis testing

                                  Reality
Decision From Statistical Test    Ho False, H1 True              Ho True, H1 False
  Reject Ho, Accept H1            Correct, No Error (A)          Incorrect, Type I Error (B)
  Accept Ho, Reject H1            Incorrect, Type II Error (C)   Correct, No Error (D)

patient? Is the test specific enough to indicate the absence of deep venous thrombosis only in patients who in fact are not afflicted by it?

Sensitivity, which can be thought of as "positivity (of the test) in disease," is derived by working down the first column of Table 1:

Sensitivity (%) = 100 x TP/(TP + FN)

In this example, sensitivity equals 100 x 130/(130 + 20), or 86.7%.

Specificity, which can be thought of as "negativity (of the test) in health," is also derived by working vertically, in the second column of Table 1:

Specificity (%) = 100 x TN/(TN + FP)

Here, specificity equals 100 x 87/(87 + 13), or 87.0%. The ideal diagnostic test would be 100% sensitive and 100% specific, and thus would have no FP or FN

FIGURE 1. Operating characteristic curve. β is dependent on α, n, and Δ. In this example, α is fixed at .05. All else held constant, increasing Δ or increasing n decreases β. [Figure: β (vertical axis, 0.0 to 1.0) plotted against Δ (horizontal axis, low to high) for two sample sizes, n2 > n1, with α = .05.]

outcomes. Because virtually all diagnostic tests have some FP and FN outcomes, they do not have 100% sensitivity and specificity. Unfortunately, many clinicians believe that sensitivity and specificity can be used to predict whether an individual patient is diseased or disease free. This is an error. Sensitivity and specificity are merely properties of a test. Sensitivity and specificity should not be used to make predictive statements about an individual patient.

TABLE 4. Prior probability and chance of error

                       Chance of Error
Prior Probability      Type I    Type II
  Low                  High      Low
  High                 Low       High

Predictive Value
Predictive values can be used to help predict the likelihood of disease in an individual. A positive predictive value (PPV) is useful to indicate the proportion of individuals who actually have the disease when the diagnostic test indicates the presence

of that disease. A negative predictive value (NPV) is useful to determine the proportion of individuals who are truly free of the disease tested for when the diagnostic test indicates the absence of that disease. Predictive values are derived by working horizontally on the two by two outcome table in Table 1:

PPV (%) = 100 x TP/(TP + FP)
NPV (%) = 100 x TN/(TN + FN)

From the example in Table 1, PPV = 100 x 130/(130 + 13), or 90.9%, and NPV = 100 x 87/(87 + 20), or 81.3%. PPV and NPV are affected by the prevalence of disease in the population. Prevalence is defined as the proportion of the population afflicted by the disease in question. In the example in Table 1, the prevalence of deep venous thrombosis when it was clinically suspected was 60% because the total number of patients studied was 250, and the number of patients who actually had a contrast venogram (the gold standard test) indicative of deep venous thrombosis was 150.

Next, the effects of decreased prevalence of deep venous thrombosis on the predictive value of the MRI venogram test will be examined. Imagine a sample of 200 patients, only 20% of whom have a deep venous thrombosis (prevalence, 20%). This group is depicted in Table 2. Because 20% of the patients have a deep venous thrombosis, the sum of TP + FN in column 1 must be 0.2 x 200, or 40. Of these, about 35 will constitute the TP group because the sensitivity of the test has already been shown to be 86.7% (0.867 x 40 = 34.7). The remaining five can be expected to be in the FN group because sensitivity is a property of the test independent of disease prevalence. Because the prevalence of deep venous thrombosis is only 20%, the remaining 0.8 x 200, or 160, will not have a deep venous thrombosis, so the sum of TN + FP results in column 2 will be 160. Of this set of 160, 87%, or about 139, will be in the TN group, and the remaining 21 will be in the FP group because specificity is also a property of the test, independent of disease prevalence. The change of prevalence markedly influences the PPV and NPV values obtained (Table 2). With a 20% prevalence, the PPV falls to 100 x 35/(35 + 21), or 62.5%, while the NPV increases to 100 x 139/(139 + 5), or 96.5%. Note that as disease prevalence falls, the PPV of any test will fall and the NPV of any test will increase.
From this, it is easy to see why many new diagnostic tests that seem from initial reports to be useful may not represent a diagnostic improvement when in common use. Many diagnostic tests are validated in settings on populations with a high prevalence of the disease for which testing is done. However, when the new test is used in different clinical settings with a lower prevalence of


FIGURE 2. Clinical testing.

that disease, the test does not perform up to reported expectations. A clinical example of the interrelationship between prevalence of disease and predictive value is the use of amylase levels to screen for pancreatitis. An elevated amylase level is more likely indicative of pancreatitis in persons previously afflicted with pancreatitis than it is predictive of pancreatitis among all patients with abdominal pain or other possible causes of an elevated serum amylase level.

In summary, sensitivity and specificity are properties that indicate the degree of reliability of a diagnostic test. Sensitivity and specificity do not indicate predictive value. Predictive values can be applied to an individual patient's test result and are affected by the prevalence of the disease in the population to which the test is applied. The PPV will fall and the NPV will rise as the prevalence of disease decreases.

HYPOTHESIS TESTING
Formulation of the Hypothesis
Statistical inference involves the testing of hypotheses. A hypothesis is a numerical statement about an unknown parameter.3 Just as a two by two table can be constructed for the four possible outcomes of a clinical diagnostic test, a two by two table can be constructed for the four possible outcomes of hypothesis testing. Before constructing this table, it is necessary to understand what a hypothesis states.

The first step in hypothesis testing is a statement of a hypothesis in positive terms. This defines the "research" or "alternative" hypothesis, H1.2 For example, one could hypothesize that experienced emergency physicians (those with more than five years of full-time postgraduate emergency department experience) can examine, diagnose, and treat more patients per hour than inexperienced emergency physicians (less than five years of full-time ED experience). The next step is to state the "null" or "statistical" hypothesis, Ho, which follows logically from H1.1,2 The hypothesis tested statistically is Ho. In this example, Ho would state "Experienced emergency physicians and inexperienced emergency physi-

Sensitivity: The ability of a test to reliably detect the presence of disease (positivity in disease). Sensitivity (%) = 100 x TP/(TP + FN)

Specificity: The ability of a test to reliably detect the absence of disease (negativity in health). Specificity (%) = 100 x TN/(TN + FP)

Prevalence: The proportion of the population with disease. Prevalence (%) = 100 x (TP + FN)/n

Positive Predictive Value: The proportion of individuals with disease when the presence of disease is indicated by the diagnostic test. PPV = 100 x TP/(TP + FP)

Negative Predictive Value: The proportion of individuals free of disease when the absence of disease is indicated by the diagnostic test. NPV = 100 x TN/(TN + FN)

TN, true negative; FN, false negative; TP, true positive; FP, false positive.

cians do not differ significantly in the number of patients they can examine, diagnose, and treat per hour." We "reject" or "fail to reject" ("accept") Ho based on our inferential statistical testing.1-3

Ho hypothesizes a difference of zero between population samples tested, while H1 hypothesizes a nonzero difference between population samples tested. There exist an infinite number of possible nonzero differences between populations. Therefore, the reason that Ho rather than H1 is tested is that mathematically, Ho theorizes a single magnitude of difference between populations studied, and it is possible to statistically assess this single hypothesis. In contrast, H1 is actually an infinite number of hypotheses because there exist an infinite number of possible magnitudes of difference between populations.4 It would be impossible to calculate the required statistics for each of the infinite number of possible magnitudes of difference between population samples H1 hypothesizes.

If Ho is "accepted" as tenable, then H1 must be "rejected," and vice versa, because the two hypotheses are mutually exclusive. When Ho is tested, the probability that numerical differences between population samples are not due strictly to chance is assessed.2 Ho does recognize that nonzero differences between groups are possible, even if two samples of the same population are tested, simply due to random scatter of the data.2 If Ho is "accepted" as tenable, this signifies the likelihood that no significant difference exists between

the populations studied and that any numerical differences between groups are due to chance alone. If Ho is rejected, this signifies that a significant difference does exist between the populations studied and that the numerical differences between the groups are not due to chance alone.
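The role of chance under Ho can be made concrete by simulation. The sketch below (an illustration constructed for this discussion, assuming a large-sample two-tailed z test at α = .05) draws both samples from the same population, so Ho is true by construction; every rejection is therefore a type I error, and the long-run rejection rate approximates α.

```python
import random
import statistics

def two_sample_z(x, y):
    """Two-sample z statistic for the difference between sample means
    (adequate for the sample sizes simulated here)."""
    se = (statistics.variance(x) / len(x) + statistics.variance(y) / len(y)) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / se

random.seed(1)
trials, rejections = 1000, 0
for _ in range(trials):
    # Ho is true by construction: both samples come from the same population,
    # so any numerical difference between them is due to random scatter alone.
    a = [random.gauss(0.0, 1.0) for _ in range(50)]
    b = [random.gauss(0.0, 1.0) for _ in range(50)]
    if abs(two_sample_z(a, b)) > 1.96:  # two-tailed test at alpha = .05
        rejections += 1

rate = rejections / trials
print(rate)  # the observed type I error rate, approximately alpha
```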

Errors in Hypothesis Testing
Hypothesis testing may lead to erroneous inferential statistical conclusions, just as diagnostic testing may lead to erroneous diagnostic conclusions. Just as a two by two table of possible outcomes of diagnostic tests can be constructed, so can a two by two table of possible outcomes of inferential statistical tests be constructed (Table 3). Two types of incorrect conclusions are possible. Box B of Table 3 indicates cases in which the statistical test falsely indicates that a significant difference exists between groups, when in fact no true difference exists. It is analogous to a false-positive diagnostic test result. In other words, box B shows cases where Ho is rejected when it is in fact true. This rejection of Ho when Ho is true is arbitrarily called a type I error.1-3

Box C of Table 3 indicates cases in which the statistical test falsely indicates the lack of a significant difference between groups, when in fact a true difference exists (H1 is true). This is analogous to a false-negative diagnostic test result. In other words, box C shows cases in which Ho is accepted when it is in fact false. The acceptance of Ho when Ho is false is arbitrarily called a type II error.1-3

FIGURE 3. Hypothesis testing.

Research (Alternative) Hypothesis (H1): An hypothesis that states a difference exists between two (or more) populations studied. H1 is a positive statement that a difference exists between groups.

Null (Statistical) Hypothesis (Ho): An hypothesis of no difference between two or more populations studied. Ho is a negative statement, that no difference exists between groups.

Type I Error: To reject the null hypothesis (Ho) when in fact Ho is true; to falsely conclude that a significant difference exists between populations.

Type II Error: To accept the null hypothesis (Ho) when in fact Ho is false; to falsely conclude that no significant difference exists between populations.

Alpha (α): The probability of making a type I error. P < .05 indicates that statistical calculations from the experimental data show the probability of making a type I error is less than 5%.

Beta (β): The probability of making a type II error.

Power: The ability of an experiment to find a significant difference between populations, when in fact a significant difference truly exists. Power = 1 - β.

Delta (Δ): The degree of difference between populations tested.

Operating Characteristic Curve: A function that relates the dependent variable β to independent values of α, Δ, and n.

Prior Probability: The likelihood that an hypothesized difference between populations is in fact correct.

Box A and box D of Table 3 denote correct conclusions, analogous to true-positive and true-negative diagnostic test results. Thus, Table 3 shows that there exist two correct and two incorrect conclusions possible whenever Ho is tested.

Next, the probability of making incorrect conclusions must be assessed. The probability of making a type I error is defined as alpha (α).1,2,4 α is derived from the raw data, statistical calculations, and statistical tables appropriate for the inferential statistical test used. By convention, statistical significance is generally accepted if the probability α of making a type I error is less than 0.05, which is commonly denoted on figures and tables as P < .05.3,4

Though conventional, selection of an alpha level of .05 as the crucial level of significance is arbitrary. Accepting significance at α = .05 means that it is recognized that one time out of 20, a type I error will be committed, a consequence that the investigator is willing to accept. If the consequences of making a type I error are judged to be sufficiently severe, it may be appropriate to select more stringent levels of α, such as .01, as the cutoff for statistical significance. When a caption or text indicates that for some statistical comparison, P = .XY, the probability of a type I error, based on the calculations performed for that inferential statistical test, is 0.XY, and the reader is left to judge whether this level of α is indicative of a true difference between populations tested. Another advantage of the reporting of P values is that the arbitrary designation of significance at .05, and the improper and arbitrary designation of a trend if .10 > P > .05, can be avoided.

The probability of making a type II error is defined as beta (β).1,2,4 β is more difficult to derive than α, and unlike α, actually is not one single probability value. β is often ignored by researchers.5 However, it is important. If some treatment yields a 10% increase in survival or a 10% decrease in some complication, it would likely be readily incorporated into medical practice. Unfortunately, numerous clinical trials have suffered from errors of experimental design that cause β to be unacceptably high, such that type II errors are easily made, and treatments that are significantly better than older methods are rejected because of statistical artifact resulting from poor experimental design.5 By convention, β should be less than .20, and ideally less than .10, to minimize the chance of making a type II error.6

α and β are interrelated. All else held constant (such as the populations studied, the number of subjects, and the method of testing), as α is arbitrarily decreased, β is increased. As α is increased, β is decreased.1,2

Statistical power is defined as (1 - β).1,2,4 Because β indicates the probability of making a type II error, power indicates mathematically the probability of not making a type II error. Power is analogous to sensitivity in hypothesis testing. Sensitivity indicates the probability that the diagnostic test can detect disease when it is present. Power indicates the probability that the statistical test can detect significant differences between populations, when in fact such differences truly exist.

Power depends on several variables:1,2,4,7

α: As α increases, β decreases, and power increases.

n (sample size): As n increases, power increases.

The magnitude of the difference actually present between the populations tested, delta (Δ): Just as it is easier to find a pitchfork than a needle in a haystack, so it is easier to find a large difference than it is to find a small difference between populations tested.

One-tailed versus two-tailed tests: One-tailed tests are more powerful than two-tailed tests, because a statistical test result must not vary as much from the mean to achieve significance at any level of α chosen.
(If α is .05, for a two-tailed test, a result must fall in either the top or bottom 2.5% of results to achieve significance, but for a one-tailed test, the result must merely fall in either the top or bottom 5% of a distribution.) In the original hypothesis example about how quickly emergency physicians can treat patients, the appropriate test would be one-tailed, because H1 specifies the direction of the difference between groups hypothesized.

Parametric versus nonparametric statistical testing: Parametric tests are generally more powerful. (This will be further discussed in Part 4 of this series.)

Use of proper experimental design and statistics: Errors in these areas decrease power.

Because so many variables can affect β, β is not one single value. This follows from the fact that α is the probability of erroneously concluding that Ho is false, and Ho specifies a single magnitude of difference between populations. However, as has been explained, β is the probability of erroneously concluding that H1 is false, and H1 hypothesizes an infinite number of possible magnitudes of difference between populations tested. β is expressed as a function of Δ, n, and α by a function called the operating characteristic curve of the test5 (Figure 1).

The most common use of β is in the calculation of the approximate number of subjects that must be studied to keep α and β acceptably small. This calculation uses estimates of population standard deviations and estimates of Δ, acceptable values of α and β, and numbers from statistical tables, to derive a value of n of sufficient size. The determination of adequate sample size for an experiment is readily referenced.8-10
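The dependence of power on n and Δ can also be previewed by brute force: estimate power directly as the fraction of simulated experiments that reject Ho when a true difference exists. The sketch below is illustrative only (it assumes a large-sample two-tailed z test, with Δ expressed in standard deviation units), not a substitute for the referenced sample-size methods.

```python
import random
import statistics

def power_sim(delta, n, z_crit=1.96, trials=2000):
    """Estimate power by simulation: the fraction of experiments that reject Ho
    when the two populations truly differ by delta (in SD units)."""
    hits = 0
    for _ in range(trials):
        a = [random.gauss(0.0, 1.0) for _ in range(n)]
        b = [random.gauss(delta, 1.0) for _ in range(n)]
        se = (statistics.variance(a) / n + statistics.variance(b) / n) ** 0.5
        if abs((statistics.mean(b) - statistics.mean(a)) / se) > z_crit:
            hits += 1
    return hits / trials

random.seed(2)
# All else held constant, power rises with sample size n and with delta.
p_small = power_sim(delta=0.5, n=25)   # modest power
p_big_n = power_sim(delta=0.5, n=100)  # larger n, higher power
p_big_d = power_sim(delta=1.0, n=25)   # larger delta, higher power
print(p_small, p_big_n, p_big_d)
```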

P Values Versus Confidence Intervals
Hypothesis testing yields yes or no answers about statistical significance, answers that can be fraught with errors, and answers that may represent oversimplifications. P values imply little about the magnitude of difference present between populations. Therefore, some feel that the use of confidence intervals (CIs) is complementary or even preferable to the use of P values in reporting clinical data.11 (Confidence intervals were discussed in part 2 of this series.12) It is correct to report both CI and P values for scientific data, and the two are often complementary.1,11
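As a sketch of the extra information a CI carries, the example below applies a normal-approximation 95% CI to hypothetical data for the earlier emergency physician example (the numbers are invented for illustration). The interval reports the magnitude and precision of the difference, not merely whether it is significant.

```python
import statistics

# Hypothetical data: patients treated per hour by two groups of physicians.
experienced   = [2.9, 3.4, 3.1, 3.6, 3.3, 3.0, 3.5, 3.2, 3.4, 3.1]
inexperienced = [2.8, 3.0, 2.7, 3.1, 2.9, 3.2, 2.6, 3.0, 2.8, 2.9]

diff = statistics.mean(experienced) - statistics.mean(inexperienced)
se = (statistics.variance(experienced) / len(experienced)
      + statistics.variance(inexperienced) / len(inexperienced)) ** 0.5

# 95% CI by normal approximation; unlike a bare P value, the interval
# conveys how large the difference is and how precisely it is estimated.
low, high = diff - 1.96 * se, diff + 1.96 * se
print(f"difference {diff:.2f} patients/hour, 95% CI ({low:.2f}, {high:.2f})")
```

Here the interval excludes zero, which corresponds to significance at α = .05, but it also shows the plausible range of the difference itself.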

Clinical Versus Statistical Significance
Statistically significant numerical differences between study groups may not be clinically significant or relevant. An analogy to clinical testing is again useful. It is common experience to ignore or place little emphasis on a single diagnostic test result that lies outside the expected range for that test when large numbers of tests are done. An example is the interpretation of an isolated elevated amylase level in a patient having otherwise normal routine laboratory data after a normal screening physical examination at his family physician's office. Many experienced clinicians can intuitively sense when to place little emphasis on isolated laboratory test results outside the normal range when an abnormal result is not expected. Alternatively stated, when there is very little prior probability of disease, an isolated abnormal laboratory value is generally not cause for great concern, and the clinician avoids a clinical error analogous to a type I error by avoiding concluding that disease is present in a disease-free patient.

Similarly, if enough statistical comparisons are made, eventually type I and type II statistical errors are inevitable. The problem comes in discerning which statistically significant differences are meaningful and which are meaningless. Just as prevalence affects the predictive value of a positive diagnostic test, so the prior probability of a difference affects the predictive value of a statistical test. Prior probability is an expression of how likely an hypothesis will be true when assessed before doing statistical calculations. Prior probability is derived from previously available knowledge that led to the formulation of the hypothesis being tested.
When a hypothesis has a low prior probability of being true, yet achieves statistical significance, such as a link between coffee consumption and pancreatic cancer,13 a significant result must be interpreted cautiously. Furthermore, if a type I error is being made, repetitive study will probably not replicate a significant difference, as subsequently occurred in the case of the alleged link between coffee consumption and pancreatic cancer.14 However, in cases of high prior probability, a significant statistical difference is usually correct, just as in cases of high disease prevalence, a positive clinical test result is more likely to be correct. Table 4 summarizes the interrelationship between prior probability and the chance of making a type I or

type II error. This relationship is further explained by Bayes theorem, which the reader is invited to explore.
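Bayes theorem makes the prevalence-predictive value relationship explicit. The sketch below (a recasting of the earlier predictive value formulas, not code from the article) recovers the PPVs of Tables 1 and 2 directly from prevalence, sensitivity, and specificity.

```python
def positive_predictive_value(prior, sensitivity, specificity):
    """Bayes theorem: probability of disease given a positive test,
    as a function of the prior probability (here, disease prevalence)."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1.0 - specificity  # false-positive rate
    p_pos = prior * p_pos_given_disease + (1.0 - prior) * p_pos_given_healthy
    return prior * p_pos_given_disease / p_pos

# The MRI venogram example: sensitivity 86.7%, specificity 87.0%.
for prevalence in (0.60, 0.20):
    ppv = positive_predictive_value(prevalence, 0.867, 0.870)
    print(f"prevalence {prevalence:.0%} -> PPV {ppv:.1%}")
```

At 60% prevalence this yields a PPV of about 90.9%, and at 20% prevalence about 62.5%, matching Tables 1 and 2; the same reasoning applies when "prevalence" is replaced by the prior probability of a hypothesized difference.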

SUMMARY
An understanding of the interpretation of diagnostic tests facilitates an understanding of hypothesis testing. A diagnostic test result may be a true-positive, true-negative, false-positive, or false-negative result. For diagnostic tests, sensitivity and specificity are properties of the diagnostic test and do not indicate predictive value. Prevalence of disease is a determinant of the predictive value of both positive and negative test results.

Similarly, hypothesis testing can yield erroneous results. A false-positive result, which accepts the presence of a significant difference between populations when in fact no significant difference exists (type I error), occurs with a probability of α. A false-negative result, rejecting the presence of a significant difference between populations when in fact they actually do differ (type II error), occurs with a probability of β. Power is 1 - β, and is analogous to the sensitivity of a diagnostic test in that both sensitivity and power address whether a test can detect what it is designed to detect. As sensitivity and specificity are not predictive, so also power is not predictive. As prevalence of disease affects the predictive value of a positive test result, so the prior probability of a difference being present affects the predictive value of a significant statistical test result. Figures 2 and 3 summarize these points.

REFERENCES

1. Hopkins KD, Glass GV: Basic Statistics for the Behavioral Sciences. Englewood Cliffs, New Jersey, Prentice-Hall, Inc, 1978.
2. Keppel G: Design and Analysis: A Researcher's Handbook. Englewood Cliffs, New Jersey, Prentice-Hall, Inc, 1978.
3. Elenbaas RM, Elenbaas JK, Cuddy PG: Evaluating the medical literature, Part II: Statistical analysis. Ann Emerg Med 1983;12:610-620.
4. Sokal RR, Rohlf FJ: Biometry (ed 2). New York, WH Freeman and Co, 1981.
5. Freiman JA, Chalmers TC, Smith H, et al: The importance of beta, the type II error, and sample size in the design and interpretation of the randomized clinical trial. N Engl J Med 1978;299:690-694.
6. Reed JF, Slaichert W: Statistical proof in inconclusive "negative" trials. Arch Intern Med 1981;141:1307-1310.
7. Cohen J: Differences between proportions, in Statistics in Medicine. Boston, Little, Brown & Co, 1974.
8. Arkin CG, Wachtel MS: How many patients are necessary to assess test performance? JAMA 1990;263:275-278.
9. Fleiss JL: Statistical Methods for Rates and Proportions (ed 2). New York, John Wiley & Sons, 1981.
10. Young MJ, Bresnitz EA, Strom BL: Sample size nomograms for interpreting negative clinical studies. Ann Intern Med 1983;99:248-251.
11. Gardner MJ, Altman DG: Confidence intervals rather than P values: Estimation rather than hypothesis testing. Br Med J 1986;292:746-750.
12. Gaddis GM, Gaddis ML: Introduction to biostatistics: Part 2, descriptive statistics. Ann Emerg Med 1990;19:309-315.
13. MacMahon B, Yen S, Trichopoulos D, et al: Coffee and cancer of the pancreas. N Engl J Med 1981;304:630-633.
14. Gorham ED, Garland CF, Garland FL, et al: Coffee and pancreatic cancer in a rural California county. West J Med 1988;148:48-51.
