Original Article

Verification and Classification Bias Interactions in Diagnostic Test Accuracy Studies for Fine-Needle Aspiration Biopsy

Robert L. Schmidt, MD, PhD, MBA1; Brandon S. Walker, MS2; and Michael B. Cohen, MD1,2

1Department of Pathology, University of Utah School of Medicine, Salt Lake City, Utah; 2ARUP Laboratories, Salt Lake City, Utah

Corresponding author: Robert Schmidt, MD, PhD, MBA, Department of Pathology, 15 N Medical Drive East, University of Utah, Salt Lake City, UT 84112; Fax: (801) 585-2805; [email protected]

Received: August 26, 2014; Revised: October 25, 2014; Accepted: November 19, 2014. DOI: 10.1002/cncy.21503. Additional Supporting Information may be found in the online version of this article.

BACKGROUND: Reliable estimates of accuracy are important for any diagnostic test. Diagnostic accuracy studies are subject to unique sources of bias. Verification bias and classification bias are 2 sources of bias that commonly occur in diagnostic accuracy studies. Statistical methods are available to estimate the impact of these sources of bias when they occur alone. The impact of interactions when these types of bias occur together has not been investigated. METHODS: We developed mathematical relationships to show the combined effect of verification bias and classification bias. A wide range of case scenarios was generated to assess the impact of bias components and interactions on total bias. RESULTS: Interactions between verification bias and classification bias caused overestimation of sensitivity and underestimation of specificity. Interactions had more effect on sensitivity than specificity. Sensitivity was overestimated by at least 7% in approximately 6% of the tested scenarios. Specificity was underestimated by at least 7% in less than 0.1% of the scenarios. CONCLUSIONS: Interactions between verification bias and classification bias create distortions in accuracy estimates that are greater than would be predicted from each source of bias acting independently. Cancer (Cancer Cytopathol) 2014;000:000-000. © 2014 American Cancer Society.

INTRODUCTION

Reliable estimates of sensitivity and specificity are important for any diagnostic test. Such estimates help to determine the value of the test relative to other diagnostic methods and help guide the development of diagnostic algorithms. Estimates of sensitivity and specificity are obtained from diagnostic test accuracy (DTA) studies. However, DTA studies are subject to unique sources of bias that can distort estimates of a method's sensitivity and specificity.1 Bias occurs when there is a systematic difference between the true value and the observed value. Partial verification bias and classification bias are types of bias that commonly occur in DTA studies for fine-needle aspiration biopsy (FNAB).

Partial verification occurs when only a portion of the index test (ie, the test under investigation) results are verified by the reference test. This leads to a type of bias known as partial verification bias. Ideally, DTA studies should be designed so that all cases that receive the index test are verified by a reference test (eg, histopathology and/or clinical follow-up). However, full verification is often not practical or ethical because the reference test is costly or causes patient harm. Thus, only a portion of FNAB results are verified by the reference test. The fraction of FNAB results verified by the reference test is called the verification rate. The verification rate often depends on the result of the index test. For example, cases with a positive index test result (eg, FNAB diagnosis of malignant) could have a higher verification rate than cases with a negative index test result. In general, partial verification bias occurs whenever the verification rate depends on the result of the index test.2,3 Partial verification bias causes inaccurate estimates of both sensitivity and specificity. More specifically, partial verification bias generally causes overestimation of sensitivity and underestimation of specificity.4 Partial verification is a common source of bias in DTA studies of FNAB.3-5

Classification bias, the other common form of bias, occurs when the reference test is imperfect and results in misclassification of cases. Histopathology and clinical follow-up are the usual reference standards in DTA studies for FNAB. The impact of misclassification depends on the relationship between the index test (ie, FNAB) and the reference test.6 There are 2 forms of misclassification. In the first form, known as nondifferential misclassification, the misclassification rate in the reference test is independent of the index test results (ie, the misclassification rate of the reference test is the same for positive and negative index test results). Nondifferential misclassification typically produces estimates of sensitivity and specificity that are lower than the true values. Differential misclassification occurs when the error rate in the reference test depends on the index test result. For example, differential misclassification would occur if histopathology errors occurred more frequently in cases with a positive FNAB result. Differential misclassification can cause overestimation or underestimation of both sensitivity and specificity, depending on the type of misclassification.

Partial verification bias and classification bias are common in FNAB DTA studies. A recent review estimated that 50% of FNAB DTA studies had partial verification bias.5 A focused, systematic review found that 84% of FNAB DTA studies published in otolaryngology journals had partial verification bias.3 Misclassification is also common in DTA studies. Misclassification rates are estimated by disagreement studies. The disagreement rate depends on the definition of disagreement. Although there is variation in classification schemes, studies generally divide disagreements into major and minor disagreements. Major disagreements are generally defined as changes in diagnosis that result in changes in therapy or prognosis.7 Major disagreement rates show wide variability, ranging from 0.3% to 10%, depending on many factors such as the tissue type, the type of referral (physician requested or routine transfer), and the institution.8-17

We are not aware of any studies that have explored the interactions between these 2 types of bias in FNAB DTA studies.

TABLE 1. Analysis Framework

                                Verification Bias Absent          Verification Bias Present
Classification Bias Absent      Scenario 1: no bias               Scenario 2: partial verification bias only
                                Observed: Sn, Sp                  Observed: Sn^v, Sp^v
Classification Bias Present     Scenario 3: classification        Scenario 4: classification and
                                bias only                         verification bias
                                Observed: Sn^c, Sp^c              Observed: Sn^vc, Sp^vc

The table shows the 4 scenarios analyzed and the notation for the observed accuracy statistics in each scenario. For example, with no bias, one would observe the true sensitivity and specificity, Sn and Sp. When partial verification bias is present, one would observe biased estimates of sensitivity and specificity, Sn^v and Sp^v. When classification bias is present, one would observe biased estimates of sensitivity and specificity, Sn^c and Sp^c. When both classification and verification bias are present, one would observe the biased estimates Sn^vc and Sp^vc.

In addition, and importantly, verification bias and classification bias may interact in unpredictable ways, so that it is difficult to estimate the total bias from the estimated bias due to each component; for example, the 2 biases may be synergistic or antagonistic. To address this, we developed equations to determine the interaction between verification bias and classification bias.

MATERIALS AND METHODS

We derived analytical expressions for the effect of verification bias, the effect of classification bias, and the combined effect of verification bias and classification bias (Table 1). In each case, the bias was expressed in terms of primary input variables (ie, sensitivity, specificity, prevalence, verification rates, and misclassification rate). We then illustrate the use of these expressions with example calculations that show the combined effect of verification bias and classification bias on estimates of sensitivity and specificity.

We used computer experiments to investigate how frequently bias interactions might have sufficient magnitude to have an appreciable effect on diagnostic accuracy studies. To that end, we used Monte Carlo simulation to generate 100,000 case scenarios that spanned a plausible range of values of the primary input variables (see Table 2 for the ranges). Monte Carlo simulation is a technique in which calculations are repeated many times by taking random samples for the input variables (see Supplement 1). A case scenario was defined as a set of values for the primary input variables; each case represents a DTA study. Each individual scenario was formed by taking a random sample from the distribution for each primary input variable, using the ranges shown in Table 2.

TABLE 2. Sampling Distributions for Input Parameters

Parameter                     Symbol   Minimum   Mean   Maximum   References
Sensitivity                   Sn       60%       80%    100%      1-3
Specificity                   Sp       80%       90%    100%      1-3
Prevalence                    u        0.1%      25%    50%       1-3
Positive verification rate    a        60%       80%    100%      4,5
Negative verification rate    b        4%        42%    80%       4,5
Misclassification rate        f        0.3%      5%     10%       6-15

Scenarios were produced by sampling each parameter from a scaled beta distribution bounded by the minimum and maximum indicated in this table (see text).

Using this technique, we generated results for a large number of cases (DTA studies) with a wide range of properties. We then examined this set of generated studies to see how often interactions were important. We assumed that each input parameter followed a beta distribution centered at the mean of the plausible range, with width equal to the plausible range. (Each input variable had the form Y = L + (U − L)·X, where L is the lower limit of the plausible range, U is the upper limit of the plausible range, and X is a beta-distributed random variable with both parameters of the beta distribution set equal to 2.) Each case scenario had a unique value for the input parameters (Sn, Sp, u, a, b, f). For each case scenario, the total bias, the bias due to partial verification acting alone, and the bias due to classification acting alone were calculated using the equations developed below (equations (2.9), (2.10), (3.9), (3.10), (4.9), and (4.10)). For each case (Sn, Sp, u, a, b, f), we calculated the bias components (B^v_Sn, B^c_Sn, B^v_Sp, B^c_Sp) and the total bias (B^vc_Sn, B^vc_Sp), and then investigated the relationship between the total bias and the bias components. (Definitions for these quantities are provided below in the Theoretical Framework section.) Calculations were conducted using Stata 12 (StataCorp, College Station, TX).
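For concreteness, the sampling scheme and the interaction calculation can be sketched in a few lines of code. The sketch below is illustrative only: it assumes the scaled Beta(2,2) sampling and Table 2 ranges described above, uses the sensitivity equations derived in the Results (equations 2.5, 3.5, 4.5, and 4.13, with f_a = f_b = f), and is written in Python/numpy rather than the Stata 12 used for the original analysis, so its random draws will not reproduce the paper's exact percentages.

```python
# Illustrative re-implementation of the Monte Carlo experiment (assumption:
# Python/numpy stand-in for the original Stata analysis; sensitivity only).
import numpy as np

rng = np.random.default_rng(2014)  # arbitrary seed, for reproducibility only
N = 100_000

def scaled_beta(lo, hi, size):
    # Y = L + (U - L) * X with X ~ Beta(2, 2), as described in the Methods.
    return lo + (hi - lo) * rng.beta(2.0, 2.0, size)

Sn = scaled_beta(0.60, 1.00, N)    # true sensitivity (Table 2 range)
Sp = scaled_beta(0.80, 1.00, N)    # true specificity
u  = scaled_beta(0.001, 0.50, N)   # prevalence
a  = scaled_beta(0.60, 1.00, N)    # positive verification rate
b  = scaled_beta(0.04, 0.80, N)    # negative verification rate
f  = scaled_beta(0.003, 0.10, N)   # misclassification rate (f_a = f_b = f)

# Observed sensitivity under each scenario (equations 2.5, 3.5, and 4.5).
Sn_v = u*Sn*a / (u*Sn*a + u*(1 - Sn)*b)
Sn_c = (Sn*u*(1 - f) + f*(1 - Sp)*(1 - u)) / (
        Sn*u*(1 - f) + f*(1 - Sp)*(1 - u)
        + (1 - f)*(1 - Sn)*u + f*Sp*(1 - u))
Sn_vc = (u*Sn*a*(1 - f) + f*(1 - u)*(1 - Sp)*a) / (
         u*Sn*a*(1 - f) + f*(1 - u)*(1 - Sp)*a
         + (1 - f)*u*(1 - Sn)*b + f*(1 - u)*Sp*b)

# Interaction (unexplained) bias for sensitivity, equation 4.13.
UB_Sn = (Sn_vc - Sn) - (Sn_v - Sn) - (Sn_c - Sn)
print(f"|UB_Sn| > 7% in {np.mean(np.abs(UB_Sn) > 0.07):.1%} of scenarios")
```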

RESULTS

Theoretical Framework

We analyzed 4 scenarios: 1) no bias present, 2) partial verification bias present, 3) nondifferential classification bias present, and 4) both partial verification and nondifferential classification bias present (Table 1). For each scenario, we developed equations for the true positives, true negatives, false positives, and false negatives that would be observed under each scenario. We then used these results to calculate the accuracy statistics (sensitivity, specificity, positive predictive value, and negative predictive value) that would be observed under each scenario. In general, the observed values will differ from the true values. Bias is defined as the difference between the observed value and the true value.

Scenario 1 (No Bias)

Consider a diagnostic test with sensitivity Sn, specificity Sp, and disease prevalence u. For such a test, the true-positive rate, TP, false-positive rate, FP, false-negative rate, FN, and true-negative rate, TN, are:1

$$TP = Sn \cdot u \tag{1.1}$$

$$FP = (1 - u)(1 - Sp) \tag{1.2}$$

$$FN = u(1 - Sn) \tag{1.3}$$

$$TN = (1 - u) \cdot Sp \tag{1.4}$$

In this scenario, the observed values equal the true values. The statistics in 1.1-1.4 are based on the patients at risk (ie, those who present for an FNAB). For example, TP represents the fraction of patients at risk who will receive a true-positive diagnosis.

Scenario 2 (Verification Bias Present)

The verification rate is the proportion of samples that are verified by a reference test. The positive verification rate, a, is the proportion of positive index test results (eg, FNAB diagnosis of malignant) that are verified. The negative verification rate, b, is the proportion of negative index test results (eg, FNAB diagnosis of benign) that are verified. Verification bias occurs when the negative and positive verification rates differ (ie, a ≠ b). We use the superscript v to designate the observed outcomes when partial verification bias is present. For example, TP^v is the true-positive rate observed under verification bias. When verification bias is present, the observed outcomes are as follows (see flow diagram in Supplementary Fig. 1):

$$TP^{v} = Sn \cdot u \cdot a \tag{2.1}$$

$$FP^{v} = (1 - u)(1 - Sp) \cdot a \tag{2.2}$$

$$FN^{v} = u(1 - Sn) \cdot b \tag{2.3}$$

$$TN^{v} = (1 - u) \cdot Sp \cdot b \tag{2.4}$$
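As a quick numeric illustration of equations 1.1-1.4 and 2.1-2.4 (and a preview of the over- and underestimation pattern derived next), consider the short sketch below. The parameter values are arbitrary assumptions chosen for the example, not data from the study.

```python
# Hypothetical inputs chosen for illustration only.
Sn, Sp, u, a, b = 0.90, 0.95, 0.25, 0.90, 0.40

# Scenario 1: cell rates among all patients at risk (equations 1.1-1.4);
# the four rates partition the population, so they sum to 1.
TP, FP = Sn * u, (1 - u) * (1 - Sp)
FN, TN = u * (1 - Sn), (1 - u) * Sp
assert abs(TP + FP + FN + TN - 1.0) < 1e-12

# Scenario 2: only verified cases are observed, so test-positive cells are
# scaled by a and test-negative cells by b (equations 2.1-2.4).
TPv, FPv, FNv, TNv = TP * a, FP * a, FN * b, TN * b

print(f"true Sn = {TP / (TP + FN):.3f}, observed Sn^v = {TPv / (TPv + FNv):.3f}")
print(f"true Sp = {TN / (TN + FP):.3f}, observed Sp^v = {TNv / (TNv + FPv):.3f}")
```

With a > b, as here, the observed sensitivity (0.953) exceeds the true value (0.900) and the observed specificity (0.894) falls below the true value (0.950), which is the general pattern of partial verification bias.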

The observed sensitivity and specificity under verification bias (for a, b > 0) are:

$$Sn^{v} = \frac{TP^{v}}{TP^{v} + FN^{v}} = \frac{u\,Sn\,a}{u\,Sn\,a + u(1 - Sn)\,b} = \frac{1}{1 + \left(\frac{1 - Sn}{Sn}\right)\left(\frac{b}{a}\right)} \tag{2.5}$$

$$Sp^{v} = \frac{TN^{v}}{TN^{v} + FP^{v}} = \frac{(1 - u)\,Sp\,b}{(1 - u)\,Sp\,b + (1 - u)(1 - Sp)\,a} = \frac{1}{1 + \left(\frac{1 - Sp}{Sp}\right)\left(\frac{a}{b}\right)} \tag{2.6}$$

The observed positive and negative predictive values are:

$$PPV^{v} = \frac{TP^{v}}{TP^{v} + FP^{v}} = \frac{Sn\,u\,a}{Sn\,u\,a + (1 - u)(1 - Sp)\,a} = PPV \tag{2.7}$$

$$NPV^{v} = \frac{TN^{v}}{TN^{v} + FN^{v}} = \frac{(1 - u)\,Sp\,b}{(1 - u)\,Sp\,b + u(1 - Sn)\,b} = NPV \tag{2.8}$$

Thus, the observed predictive values are equal to the true predictive values when partial verification bias is present. The predictive values are unaffected by partial verification bias because the numerator and denominator of the formula for each predictive value are both multiplied by the same verification rate (a or b), which cancels out.

When verification bias is present, the bias in sensitivity and specificity is found by taking the difference between the observed and true values:

$$B_{Sn}^{v} = Sn^{v} - Sn = \frac{1}{1 + \left(\frac{1 - Sn}{Sn}\right)\left(\frac{b}{a}\right)} - Sn \tag{2.9}$$

$$B_{Sp}^{v} = Sp^{v} - Sp = \frac{1}{1 + \left(\frac{1 - Sp}{Sp}\right)\left(\frac{a}{b}\right)} - Sp \tag{2.10}$$

$$B_{PPV}^{v} = 0 \tag{2.11}$$

$$B_{NPV}^{v} = 0 \tag{2.12}$$
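Before turning to classification bias, a minimal spot check is useful. The sketch below uses the same hypothetical study values as Case A in the Example Calculations and verifies that the closed form in equation 2.5 matches the defining ratio, and that the predictive values are unchanged (equations 2.7 and 2.11).

```python
# Inputs match Case A below (hypothetical study values from the text).
Sn, Sp, u, a, b = 0.88, 0.95, 0.03, 0.89, 0.48

Sn_v_closed = 1.0 / (1.0 + ((1 - Sn) / Sn) * (b / a))        # equation 2.5, closed form
Sn_v_ratio = (u * Sn * a) / (u * Sn * a + u * (1 - Sn) * b)  # definition TP^v / (TP^v + FN^v)
assert abs(Sn_v_closed - Sn_v_ratio) < 1e-12

PPV = Sn * u / (Sn * u + (1 - u) * (1 - Sp))                 # true PPV
PPV_v = Sn * u * a / (Sn * u * a + (1 - u) * (1 - Sp) * a)   # equation 2.7
assert abs(PPV - PPV_v) < 1e-12                              # B^v_PPV = 0 (equation 2.11)

print(f"B^v_Sn = {Sn_v_closed - Sn:+.3f}")                   # about +0.05, as in Case A
```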

Scenario 3 (Classification Bias Present)

Let f_a and f_b designate the misclassification rates of the reference method on samples with positive and negative index test results, respectively (f_a is the probability of misclassification by the reference method given a positive index test result; f_b is the probability of misclassification given a negative index test result). There are, consequently, 2 types of classification bias. In the first, termed nondifferential classification bias, the misclassification rate is independent of the result of the index test (f_a = f_b = f). In the second form, differential classification bias, the misclassification rate depends on the result of the index test (f_a ≠ f_b). We signify the outcome statistics observed in the presence of classification bias with a superscript c. For example, the observed true-positive rate under classification bias is TP^c. For simplicity, we will assume nondifferential classification; however, we derive formulas for the general case for completeness (see flow diagram in Supplementary Fig. 2). The expected outcomes are:

$$TP^{c} = Sn\,u\,(1 - f_a) + f_a(1 - Sp)(1 - u) \tag{3.1}$$

$$FP^{c} = (1 - Sp)(1 - u)(1 - f_a) + f_a\,Sn\,u \tag{3.2}$$

$$FN^{c} = (1 - f_b)(1 - Sn)\,u + f_b\,Sp\,(1 - u) \tag{3.3}$$

$$TN^{c} = Sp\,(1 - u)(1 - f_b) + f_b(1 - Sn)\,u \tag{3.4}$$

When classification bias is present, the observed sensitivity and specificity are:

$$Sn^{c} = \frac{TP^{c}}{TP^{c} + FN^{c}} = \frac{Sn\,u(1 - f_a) + f_a(1 - Sp)(1 - u)}{Sn\,u(1 - f_a) + f_a(1 - Sp)(1 - u) + (1 - f_b)(1 - Sn)u + f_b\,Sp(1 - u)} \tag{3.5}$$

$$Sp^{c} = \frac{TN^{c}}{TN^{c} + FP^{c}} = \frac{Sp(1 - u)(1 - f_b) + f_b(1 - Sn)u}{Sp(1 - u)(1 - f_b) + f_b(1 - Sn)u + (1 - Sp)(1 - u)(1 - f_a) + f_a\,Sn\,u} \tag{3.6}$$

The observed positive and negative predictive values are:

$$PPV^{c} = \frac{TP^{c}}{TP^{c} + FP^{c}} = \frac{Sn\,u(1 - f_a) + f_a(1 - Sp)(1 - u)}{Sn\,u(1 - f_a) + f_a(1 - Sp)(1 - u) + (1 - Sp)(1 - u)(1 - f_a) + f_a\,Sn\,u} \tag{3.7}$$

$$NPV^{c} = \frac{TN^{c}}{TN^{c} + FN^{c}} = \frac{Sp(1 - u)(1 - f_b) + f_b(1 - Sn)u}{Sp(1 - u)(1 - f_b) + f_b(1 - Sn)u + (1 - f_b)(1 - Sn)u + f_b\,Sp(1 - u)} \tag{3.8}$$
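As a sanity check on equations 3.5 and 3.6, the short sketch below (with arbitrary illustrative values, not data from the study) shows the characteristic signature of nondifferential misclassification: both observed statistics fall below their true values.

```python
# Hypothetical inputs for illustration; nondifferential case f_a = f_b = f.
Sn, Sp, u, f = 0.90, 0.95, 0.25, 0.05

Sn_c = (Sn*u*(1 - f) + f*(1 - Sp)*(1 - u)) / (
        Sn*u*(1 - f) + f*(1 - Sp)*(1 - u)
        + (1 - f)*(1 - Sn)*u + f*Sp*(1 - u))        # equation 3.5
Sp_c = (Sp*(1 - u)*(1 - f) + f*(1 - Sn)*u) / (
        Sp*(1 - u)*(1 - f) + f*(1 - Sn)*u
        + (1 - Sp)*(1 - u)*(1 - f) + f*Sn*u)        # equation 3.6

print(f"Sn = {Sn:.2f} -> Sn^c = {Sn_c:.3f}")        # 0.784, below the true 0.90
print(f"Sp = {Sp:.2f} -> Sp^c = {Sp_c:.3f}")        # 0.935, below the true 0.95
```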

When classification bias is present, the bias in each accuracy statistic is:

$$B_{Sn}^{c} = Sn^{c} - Sn \tag{3.9}$$

$$B_{Sp}^{c} = Sp^{c} - Sp \tag{3.10}$$

$$B_{PPV}^{c} = PPV^{c} - PPV \tag{3.11}$$

$$B_{NPV}^{c} = NPV^{c} - NPV \tag{3.12}$$

Combined Verification Bias and Classification Bias

We designate combined verification bias and classification bias with the superscript vc. The observed outcomes under combined verification bias and classification bias are (see Supplementary Fig. 3):

$$TP^{vc} = TP^{v}(1 - f_a) + f_a\,FP^{v} = u\,Sn\,a(1 - f_a) + f_a(1 - u)(1 - Sp)\,a \tag{4.1}$$

$$FP^{vc} = FP^{v}(1 - f_a) + f_a\,TP^{v} = (1 - u)(1 - Sp)\,a(1 - f_a) + f_a\,u\,Sn\,a \tag{4.2}$$

$$FN^{vc} = (1 - f_b)\,FN^{v} + f_b\,TN^{v} = (1 - f_b)\,u(1 - Sn)\,b + f_b(1 - u)\,Sp\,b \tag{4.3}$$

$$TN^{vc} = TN^{v}(1 - f_b) + f_b\,FN^{v} = (1 - f_b)(1 - u)\,Sp\,b + f_b\,u(1 - Sn)\,b \tag{4.4}$$

With combined verification and classification bias, the observed sensitivity and specificity (for a, b > 0) are:

$$Sn^{vc} = \frac{TP^{vc}}{TP^{vc} + FN^{vc}} = \frac{u\,Sn\,a(1 - f_a) + f_a(1 - u)(1 - Sp)a}{u\,Sn\,a(1 - f_a) + f_a(1 - u)(1 - Sp)a + (1 - f_b)\,u(1 - Sn)b + f_b(1 - u)Sp\,b} \tag{4.5}$$

$$Sp^{vc} = \frac{TN^{vc}}{TN^{vc} + FP^{vc}} = \frac{(1 - f_b)(1 - u)Sp\,b + f_b\,u(1 - Sn)b}{(1 - f_b)(1 - u)Sp\,b + f_b\,u(1 - Sn)b + (1 - u)(1 - Sp)a(1 - f_a) + f_a\,u\,Sn\,a} \tag{4.6}$$

The positive and negative predictive values are:

$$PPV^{vc} = \frac{TP^{vc}}{TP^{vc} + FP^{vc}} = \frac{u\,Sn\,a(1 - f_a) + f_a(1 - u)(1 - Sp)a}{u\,Sn\,a(1 - f_a) + f_a(1 - u)(1 - Sp)a + (1 - u)(1 - Sp)a(1 - f_a) + f_a\,u\,Sn\,a} = PPV^{c} \tag{4.7}$$

$$NPV^{vc} = \frac{TN^{vc}}{TN^{vc} + FN^{vc}} = \frac{(1 - f_b)(1 - u)Sp\,b + f_b\,u(1 - Sn)b}{(1 - f_b)(1 - u)Sp\,b + f_b\,u(1 - Sn)b + (1 - f_b)u(1 - Sn)b + f_b(1 - u)Sp\,b} = NPV^{c} \tag{4.8}$$

As in Scenario 2, the verification rates multiply both the numerator and denominator of each predictive value and cancel out, so the predictive values observed under combined bias equal those observed under classification bias alone.

When both verification bias and classification bias are present, the bias in sensitivity and specificity is:

$$B_{Sn}^{vc} = Sn^{vc} - Sn \tag{4.9}$$

$$B_{Sp}^{vc} = Sp^{vc} - Sp \tag{4.10}$$

$$B_{PPV}^{vc} = PPV^{vc} - PPV = PPV^{c} - PPV = B_{PPV}^{c} \tag{4.11}$$

$$B_{NPV}^{vc} = NPV^{vc} - NPV = NPV^{c} - NPV = B_{NPV}^{c} \tag{4.12}$$

We defined the interaction bias as the portion of the bias that is not accounted for by the independent effects of verification bias and classification bias:

$$UB_{Sn}^{vc} = B_{Sn}^{vc} - B_{Sn}^{v} - B_{Sn}^{c} \tag{4.13}$$

$$UB_{Sp}^{vc} = B_{Sp}^{vc} - B_{Sp}^{v} - B_{Sp}^{c} \tag{4.14}$$

Example Calculations

Case A: bias in sensitivity

Consider a diagnostic accuracy study in which the true sensitivity and specificity are 88% and 95%, respectively. The true underlying prevalence of malignancy is 3%, the positive verification rate is 89%, the negative verification rate is 48%, and the misclassification rate for the reference method is 4%. With these values, the observed sensitivity, Sn^vc, would be 56% (equation 4.5). The total bias in sensitivity is the difference between the observed and true values (equation 4.9):

$$B_{Sn}^{vc} = Sn^{vc} - Sn = 0.56 - 0.88 = -0.32$$

The individual components of the total bias are:

$$B_{Sn}^{v} = Sn^{v} - Sn = 0.93 - 0.88 = 0.05$$

$$B_{Sn}^{c} = Sn^{c} - Sn = 0.40 - 0.88 = -0.48$$

The unexplained bias is:

$$UB_{Sn}^{vc} = B_{Sn}^{vc} - B_{Sn}^{v} - B_{Sn}^{c} = -0.32 - 0.05 - (-0.48) = 0.11$$

Thus, the total bias differs from the bias that would occur if each component acted independently. In this case, there was an interaction effect of 0.11.

Case B: bias in specificity

Consider a case with sensitivity = 0.77, specificity = 0.99, prevalence = 0.49, misclassification rate = 0.093, positive verification rate = 0.90, and negative verification rate = 0.31. With these values, the observed specificity, Sp^vc, would be 0.80 (equation 4.6). The total bias in specificity is the difference between the observed and true values (equation 4.10):

$$B_{Sp}^{vc} = Sp^{vc} - Sp = 0.80 - 0.99 = -0.19$$

The individual components of the total bias are:

$$B_{Sp}^{v} = Sp^{v} - Sp = 0.98 - 0.99 = -0.01$$

$$B_{Sp}^{c} = Sp^{c} - Sp = 0.92 - 0.99 = -0.07$$

The unexplained bias is:

$$UB_{Sp}^{vc} = B_{Sp}^{vc} - B_{Sp}^{v} - B_{Sp}^{c} = -0.19 - (-0.01) - (-0.07) = -0.11$$
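Both cases can be reproduced with a short script. The sketch below re-implements equations 2.5/2.6, 3.5/3.6, and 4.5/4.6 under nondifferential misclassification (f_a = f_b = f); small differences from the numbers above reflect the rounding of intermediate values in the text.

```python
# Numeric check of Cases A and B; parameter values are taken from the text.

def observed(Sn, Sp, u, a, b, f):
    """Return (Sn_v, Sp_v, Sn_c, Sp_c, Sn_vc, Sp_vc) for one scenario."""
    Sn_v = u*Sn*a / (u*Sn*a + u*(1 - Sn)*b)                        # eq 2.5
    Sp_v = (1 - u)*Sp*b / ((1 - u)*Sp*b + (1 - u)*(1 - Sp)*a)      # eq 2.6
    Sn_c = (Sn*u*(1 - f) + f*(1 - Sp)*(1 - u)) / (
            Sn*u*(1 - f) + f*(1 - Sp)*(1 - u)
            + (1 - f)*(1 - Sn)*u + f*Sp*(1 - u))                   # eq 3.5
    Sp_c = (Sp*(1 - u)*(1 - f) + f*(1 - Sn)*u) / (
            Sp*(1 - u)*(1 - f) + f*(1 - Sn)*u
            + (1 - Sp)*(1 - u)*(1 - f) + f*Sn*u)                   # eq 3.6
    Sn_vc = (u*Sn*a*(1 - f) + f*(1 - u)*(1 - Sp)*a) / (
             u*Sn*a*(1 - f) + f*(1 - u)*(1 - Sp)*a
             + (1 - f)*u*(1 - Sn)*b + f*(1 - u)*Sp*b)              # eq 4.5
    Sp_vc = ((1 - f)*(1 - u)*Sp*b + f*u*(1 - Sn)*b) / (
             (1 - f)*(1 - u)*Sp*b + f*u*(1 - Sn)*b
             + (1 - u)*(1 - Sp)*a*(1 - f) + f*u*Sn*a)              # eq 4.6
    return Sn_v, Sp_v, Sn_c, Sp_c, Sn_vc, Sp_vc

# Case A: interaction bias in sensitivity; about +0.10 (0.11 in the text,
# which rounds the intermediate values).
Sn, Sp = 0.88, 0.95
Sn_v, _, Sn_c, _, Sn_vc, _ = observed(Sn, Sp, u=0.03, a=0.89, b=0.48, f=0.04)
UB_Sn = (Sn_vc - Sn) - (Sn_v - Sn) - (Sn_c - Sn)                   # eq 4.13
print(f"Case A: Sn_vc = {Sn_vc:.2f}, UB_Sn = {UB_Sn:+.2f}")

# Case B: interaction bias in specificity; about -0.10 (-0.11 in the text).
Sn, Sp = 0.77, 0.99
_, Sp_v, _, Sp_c, _, Sp_vc = observed(Sn, Sp, u=0.49, a=0.90, b=0.31, f=0.093)
UB_Sp = (Sp_vc - Sp) - (Sp_v - Sp) - (Sp_c - Sp)                   # eq 4.14
print(f"Case B: Sp_vc = {Sp_vc:.2f}, UB_Sp = {UB_Sp:+.2f}")
```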

Practical Significance and Implications of Interactions

The previous results provide an existence proof for interactions between partial verification bias and classification bias. We provided analytical expressions for the unexplained bias and example calculations to show that such interactions can be present under conditions seen in DTA studies. Although these examples show that interactions can exist, they do not show whether the interactions are large enough to have practical significance or how frequently such interactions occur. To that end, we next investigated the frequency at which the magnitude of these interaction effects might reach the level of practical significance.

We defined practical significance as a level of bias that would generally be appreciable relative to the error in estimates of sensitivity and specificity in a typical DTA study. A typical DTA study for FNAB might have a sample size of 200 cases. Assuming a 50/50 split between malignant and benign cases (ie, 100 cases per group), the binomial standard error, $\sqrt{p(1-p)/n}$, in the sensitivity and specificity would be approximately 3.5%. An interaction bias of 7% would be appreciable relative to this background error. We therefore considered interactions practically significant if they accounted for at least 7% of the total bias.

The unexplained bias represents the bias due to interaction effects between verification bias and classification bias. Histograms of the unexplained bias are presented in Figure 1. Interactions caused overestimation of sensitivity (positive bias) and underestimation of specificity (negative bias). Interactions had a greater impact on the estimated sensitivity than on the estimated specificity. For sensitivity, the absolute magnitude of the unexplained bias was greater than 7% in 6.2% of the computer-generated scenarios. For specificity, less than 0.1% of the computer-generated scenarios had an unexplained bias of at least 7%.

Figure 1. Histogram of unexplained bias. The unexplained bias is the portion of the bias due to interactions between verification bias and classification bias. The dotted lines indicate the threshold at which unexplained bias might be appreciable in a typical diagnostic test accuracy study.

DISCUSSION

Our study shows that interactions occur between verification bias and classification bias when both are present. In addition, we found that interaction effects sometimes reach magnitudes that could be considered practically significant. These magnitudes were reached in realistic scenarios that were designed to mimic the conditions frequently seen in FNAB DTA studies. In addition, we showed that practically significant interactions could be relatively common. Thus, our results suggest that bias interactions have the potential to have a practical impact on diagnostic accuracy studies.

We found that the interactions more frequently had a significant effect (ie, greater than 7% bias) on sensitivity than on specificity. Interactions caused overestimation of sensitivity and underestimation of specificity after accounting for the individual effects of each bias. This pattern is similar to that caused by verification bias, which suggests that interactions exacerbate the effects of verification bias, at least in the scenarios generated in this study.

Our findings have several implications. Verification bias is common in FNAB DTA studies, and classification bias is omnipresent. This implies that interactions are likely to be a common feature of FNAB DTA studies. There are 2 approaches that can be applied: prevention and mitigation. Prevention is the preferred approach.

Studies produce statistics such as sensitivity and specificity. These statistics represent estimates of the "true" sensitivity and specificity, which are regarded as unknown fixed parameters. If a study were repeated multiple times, we would obtain slightly different estimates each time. In the absence of bias, the average of multiple estimates will converge to the true value of the parameter. When bias is present, the average of multiple estimates will converge to some other value. Partial verification and misclassification are defined as biases because they cause systematic distortions in the estimates of sensitivity and specificity. With partial verification, sensitivity is falsely inflated and specificity is falsely deflated. With misclassification, both sensitivity and specificity are falsely deflated. There are many other forms of bias that can distort estimates in DTA studies, including spectrum bias, review bias, incorporation bias, and timing bias.6,18,19 We selected partial verification bias and classification bias because they commonly occur in FNAB DTA studies.

Verification bias can be prevented by designing DTA studies with complete follow-up of cases. This can be done by combining histopathology with clinical follow-up. Classification bias and the resulting interactions can be reduced by using blinded secondary review of cases. At present, neither of these is common practice in FNAB DTA studies. As a consequence, many of the accuracy estimates in the literature are unreliable. A rigorously performed study should include a blinded review, but blinded review is costly. The benefits would probably outweigh the costs in settings in which misclassification rates are relatively high; blinded review may not be necessary in settings in which misclassification is low. Unfortunately, this information is not readily available and can vary by site according to the skill of the pathologist. Therefore, well-conducted studies should probably include blinded review.

The Standards for Reporting of Diagnostic Accuracy Studies (STARD) guidelines can also help to reduce bias in diagnostic accuracy studies.20,21 The STARD guidelines provide a checklist of data that should be reported in diagnostic accuracy studies. Although the STARD guidelines do not directly prevent publication of poorly designed studies, they do make such studies easier to identify. Unfortunately, it appears that the cytopathology community has been slow to adopt the STARD guidelines.

Bias can be mitigated by statistical adjustment. Statistical methods have been developed to adjust for both verification bias and classification bias; however, these methods depend on stringent assumptions and, in practice, are rarely applied in the FNAB DTA literature. Our study raises the possibility that these methods would be insufficient when interactions are present. We are not aware of any methods that correct for the combined effect of verification bias and classification bias.6 Researchers should be aware that existing methods may not fully correct for bias when interactions are present.

We examined the impact of bias interactions in a limited context. We chose this context because we know that verification bias and classification bias are common in FNAB DTA studies. We believe our findings can be generalized to other settings where verification bias and classification bias are relatively common. More broadly, our study applies to general concerns about diagnostic error in medicine. Error can only be improved if it can be measured, and DTA studies are the primary means by which error rates are evaluated. Thus, our study addresses fundamental issues in the design of studies that assess diagnostic error.

Our study demonstrates the potential for significant distortion in accuracy estimates due to the combined effects of verification bias and classification bias. Our study is limited because we could only provide a rough estimate of the frequency at which such interactions might have clinical significance. We did not have probability distributions for the input parameters (Sn, Sp, u, f, a, b), so we assumed these parameters were independent and followed scaled beta distributions over plausible ranges obtained from the literature. We then determined the fraction of cases that had appreciable (ie, greater than 7%) interaction effects. Using these assumptions, we found that approximately 6% of estimates of sensitivity and less than 0.1% of estimates of specificity would have appreciable error. These are very approximate estimates and serve only to show that practically significant interaction effects can occur. Our study is also limited because we created a wide range of scenarios, and many of these scenarios contain combinations of input parameters that are unlikely to occur in real studies. More precise estimates of the frequency would require much more detailed information about the distributions of the input parameters and their correlations.

We used data from secondary review studies to estimate misclassification rates. The misclassification rates in such studies may not be the same as the misclassification rates that one would observe in a diagnostic accuracy study. We might expect that misclassification (disagreement) rates within an institution would be somewhat lower than the disagreement rates from secondary review. In that case, we likely overestimated the rate at which clinically significant interactions occur. This is not a serious limitation because our objective was to show that such effects can occur in plausible scenarios, as demonstrated by our 2 examples. As noted above, our computer experiments do not precisely predict the rate at which significant interactions occur, but the ease with which it was possible to find plausible cases suggests that bias interactions most likely exist in published studies.

CONCLUSIONS

We have shown that interactions between partial verification bias and classification bias can make a clinically significant contribution to the total bias. Our study has demonstrated the potential for interactions to have a significant impact on total bias; however, there is a need to characterize the impact of interactions in actual diagnostic accuracy studies. Misclassification bias and the resulting interactions could be substantially reduced if histopathology diagnoses were subject to blinded review.

FUNDING SUPPORT

No specific funding was disclosed.

CONFLICT OF INTEREST DISCLOSURES

The authors made no disclosures.

REFERENCES

1. Schmidt RL, Factor RE. Understanding sources of bias in diagnostic accuracy studies. Arch Pathol Lab Med. 2013;137:558-565.
2. Knottnerus JA, Buntinx F. The Evidence Base of Clinical Diagnosis: Theory and Methods of Diagnostic Research. Wiley; 2011.
3. Schmidt RL, J.D. J, Allred RJ, Masuoka S, Witt BL. Verification bias in diagnostic accuracy studies for fine needle and core needle biopsy of salivary gland lesions in otolaryngology journals: a systematic review. Head Neck. 2014;36:1654-1661.
4. Newman TB, Kohn MA. Evidence-Based Diagnosis. Cambridge University Press; 2009.
5. Schmidt RL, Factor RE, Witt BL, Layfield LJ. Quality appraisal of diagnostic accuracy studies in fine-needle aspiration cytology: a survey of risk of bias and comparability. Arch Pathol Lab Med. 2013;137:566-575.
6. Zhou X-H, Obuchowski N, McClish D. Statistical Methods in Diagnostic Medicine. 2nd ed. Hoboken, NJ: John Wiley and Sons; 2011.
7. Renshaw AA. Comparing methods to measure error in gynecologic cytology and surgical pathology. Arch Pathol Lab Med. 2006;130:626-629.
8. Hahm GK, Niemann TH, Lucas JG, Frankel WL. The value of second opinion in gastrointestinal and liver pathology. Arch Pathol Lab Med. 2001;125:736-739.
9. Raab SS, Grzybicki DM, Janosky JE, et al. Clinical impact and frequency of anatomic pathology errors in cancer diagnoses. Cancer. 2005;104:2205-2213.
10. Coblentz TR, Mills SE, Theodorescu D. Impact of second opinion pathology in the definitive management of patients with bladder carcinoma. Cancer. 2001;91:1284-1290.
11. Tsung JSH. Institutional pathology consultation. Am J Surg Pathol. 2004;28:399-402.
12. Swapp RE, Aubry MC, Salomão DR, Cheville JC. Outside case review of surgical pathology for referred patients: the impact on patient care. Arch Pathol Lab Med. 2013;137:233-240.
13. Abt AB, Abt LG, Olt GJ. The effect of interinstitution anatomic pathology consultation on patient care. Arch Pathol Lab Med. 1995;119:514-517.
14. Renshaw AA, Gould EW. Comparison of disagreement and amendment rates by tissue type and diagnosis. Am J Clin Pathol. 2006;126:736-739.
15. Selman AE, Niemann TH, Fowler JM, Copeland LJ. Quality assurance of second opinion pathology in gynecologic oncology. Obstet Gynecol. 1999;94:302-306.
16. Weydert JA, De Young BR, Cohen MB. A preliminary diagnosis service provides prospective blinded dual-review of all general surgical pathology cases in an academic practice. Am J Surg Pathol. 2005;29:801-805.
17. Bruner JM, Inouye L, Fuller GN, Langford LA. Diagnostic discrepancies and their clinical impact in a neuropathology referral practice. Cancer. 1997;79:796-803.
18. Begg CB. Biases in the assessment of diagnostic tests. Stat Med. 1987;6:411-423.
19. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press; 2003.
20. Bossuyt PM, Reitsma JB, Bruns DE, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Ann Clin Biochem. 2003;40:357-363.
21. Bossuyt PM, Reitsma JB, Bruns DE, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Ann Intern Med. 2003;138:W1-W12.
