Biometrical Journal 57 (2015) 5, 797–807

DOI: 10.1002/bimj.201400210

Assessing the predictive power of newly added biomarkers

Zhanfeng Wang¹, Xiangyu Luo¹, and Yuan-chin I. Chang∗,²

¹ Department of Statistics and Finance, University of Science and Technology of China, Hefei 230026, China
² Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan

Received 18 September 2014; revised 3 January 2015; accepted 16 February 2015

As medical research and technology advance, new biomarkers are continually discovered and new predictive models proposed to improve diagnostic performance. How to assess a new biomarker in the presence of existing biomarkers and predictive models is therefore an important research problem. Many classification performance measures, which are usually based on performance over the whole range of cut-off values, have been applied directly to this type of problem. However, in medical diagnosis some cut-off points are more important than others, such as those within the range of high specificity. Thus, in analogy with the relation of the partial area under the ROC curve to the area under the ROC curve, we study the partial integrated discriminant improvement (pIDI) for evaluating the predictive ability of a newly added marker over a prespecified range of cut-offs. A theoretical property of the estimate of the proposed measure is reported. The performance of this new measure is then compared with that of the partial area under the ROC curve. Numerical results using synthesized data are presented, and a liver cancer dataset is used for demonstration purposes.

Keywords: AUC; High specificity; Partial AUC; Predictive power; ROC curve.



Additional supporting information including source code to reproduce the results may be found in the online version of this article at the publisher’s web-site

1 Introduction

In medical and clinical studies, new biomarkers are continually being discovered, and it is important to study whether a newly discovered biomarker has a significant impact beyond that of the existing biomarkers and models. D’Agostino and Nam (2004) pointed out that some initial procedural decisions need to be made correctly. Here, we are particularly interested in how to evaluate the incremental gain resulting from the new marker. Assessing the performance of classification rules is an important topic in medical studies (Hand, 2012) and in many other fields. Among all performance measures, the receiver operating characteristic (ROC) curve and the area under the curve (AUC) are two of the most popular tools used for this purpose. General reviews and practical suggestions for using the existing measures to assess the predictive ability of models and newly added biomarkers can be found in Pletcher and Pignone (2011), McGreechan et al. (2008), Vickers et al. (2011), and Baker et al. (2014). As discussed in McGreechan et al. (2008) and Pencina et al. (2008), in clinical practice, evaluating a newly added marker solely with an AUC-type measure may not be the best strategy. A thorough comparison of the AUC (also known as the C-index or C-statistic), the net reclassification improvement (NRI), and the integrated discriminant improvement (IDI) can be found in Pencina et al. (2008) and the commentaries that follow it.

∗ Corresponding author: e-mail: [email protected], Phone: +886-2-2787-1950, Fax: +886-2-2783-1523

© 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim


Z. Wang et al.: Assessing new biomarkers

In particular, Chi and Zhou (2008) pointed out that IDI can be viewed as the integrated difference in Youden’s indices and, therefore, may also suffer from the shortcomings of Youden’s index. In the same commentary, they also suggested that a partial integrated difference in Youden’s indices may be useful when some cut-offs are irrelevant. This point is also mentioned in Pencina et al. (2008), and it is reported in the literature that in medical and clinical practice some ranges of cut-offs may be more important than others. This situation parallels the relation between the partial area under the ROC curve (pAUC) and the corresponding AUC in evaluating classifiers, because both IDI and AUC are defined over the whole range of cut-off points. This motivates us to study the properties of an IDI-type index for a prespecified range of cut-off points and to compare it with the pAUC. In the rest of this paper, after brief reviews of AUC, pAUC, and IDI, we define a novel measure called the partial IDI (pIDI for short), which focuses only on a confined range of cut-off points. Comparisons of pAUC and pIDI using synthesized data and a liver dataset are presented. The estimate of the proposed measure and its statistical properties are reported, and the proof of its consistency is given in the Appendix.

2 Method

Let $x$ denote the vector of measured variables of a subject, and let $L$ be a classification function that maps $x$ to a risk score $L(x) \in \mathbb{R}$. In practice, $x$ can be an $\mathbb{R}^p$ random vector, and the classification function $L(x)$ transforms $x$ into a real-valued random variable. Let $F$ be the cumulative distribution function of $L(x)$. Then $F(L(x)) \in [0, 1]$ rescales the risk score $L(x)$ to a probability scale. Hence, for simplicity and without loss of generality, we assume that $L(x) \in [0, 1]$ throughout this paper. Then, for a binary classification problem, based on the risk score $L(x)$ and a given cut-off point, each subject is assigned to one of two classes, say, the diseased group ($D$) and the nondiseased group ($\bar{D}$); for example, a subject with vector $x$ is assigned to group $D$ if $L(x) \ge c$ for some $c \in [0, 1]$, and to $\bar{D}$ otherwise. Let the notation $x \in D$ (or $x \in \bar{D}$) denote the true group to which a subject with $x$ belongs. Define $S_1(t) = P(L(x) > t \mid x \in D)$ and $S_0(t) = P(L(x) > t \mid x \in \bar{D})$ for a cut-off point $t \in (0, 1)$. Then $S_1(t)$ and $S_0(t)$ are, respectively, the sensitivity and one minus the specificity based on the risk score $L(x)$ at $t$, and the ROC curve of $L(x)$ is defined as the plot of $\{(S_0(t), S_1(t)) : t \in [0, 1]\}$. Because the ROC curve and AUC are scale-invariant, the corresponding AUC is

\[
\mathrm{AUC} \equiv \int_0^1 S_1(t) f_0(t)\,dt = P(L(x_1) > L(x_0) \mid x_1 \in D,\ x_0 \in \bar{D}) \equiv E\big(I(L(x_1) > L(x_0)) \mid x_0 \in \bar{D},\ x_1 \in D\big), \tag{1}
\]

where $f_0$ is the probability density function of $L(x)$ for $x \in \bar{D}$. It follows from Pencina et al. (2008) that the integral of sensitivity (IS) and the integral of 1–specificity (IP) are defined as

\[
\mathrm{IS} = \int_0^1 S_1(t)\,dt = \int_0^1 P(L(x) > t \mid x \in D)\,dt = E(L(x) \mid x \in D) \tag{2}
\]

\[
\mathrm{IP} = \int_0^1 S_0(t)\,dt = \int_0^1 P(L(x) > t \mid x \in \bar{D})\,dt = E(L(x) \mid x \in \bar{D}). \tag{3}
\]

Suppose that there are $n$ diseased and $m$ normal subjects observed. Let $\{x_{1,i} : i = 1, \ldots, n\}$ and $\{x_{0,j} : j = 1, \ldots, m\}$ be their corresponding vectors of measured variables. Based on these observations, the empirical estimates of AUC, IS, and IP of the classification function $L$ are

\[
\widehat{\mathrm{AUC}} = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} I\big(L(x_{1,i}) > L(x_{0,j})\big), \tag{4}
\]

\[
\widehat{\mathrm{IS}} = \frac{1}{n} \sum_{i=1}^{n} L(x_{1,i}) \tag{5}
\]

and

\[
\widehat{\mathrm{IP}} = \frac{1}{m} \sum_{j=1}^{m} L(x_{0,j}). \tag{6}
\]

It is clear that $\widehat{\mathrm{IS}}$, $\widehat{\mathrm{IP}}$, and $\widehat{\mathrm{AUC}}$ are consistent estimates of IS, IP, and AUC, respectively. Suppose that there is a “standard” risk function built on some recognized risk factors, and a newly discovered biomarker is measured for each subject. Let $L_{\mathrm{old}}$ and $L_{\mathrm{new}}$ stand for the standard risk function and the risk function with the added biomarker, respectively. Then the difference of the AUCs of these two classification functions is

\[
\widehat{\mathrm{AUC}}_d = \widehat{\mathrm{AUC}}_{\mathrm{new}} - \widehat{\mathrm{AUC}}_{\mathrm{old}}. \tag{7}
\]

Similarly, let the subscripts “new” and “old” denote the statistics related to the new and old classification functions. Then IDI (see Pencina et al., 2008) is defined as

\[
\mathrm{IDI} = (\mathrm{IS}_{\mathrm{new}} - \mathrm{IP}_{\mathrm{new}}) - (\mathrm{IS}_{\mathrm{old}} - \mathrm{IP}_{\mathrm{old}}) = (\mathrm{IS}_{\mathrm{new}} - \mathrm{IS}_{\mathrm{old}}) - (\mathrm{IP}_{\mathrm{new}} - \mathrm{IP}_{\mathrm{old}}). \tag{8}
\]

It follows that

\[
\widehat{\mathrm{IDI}} = (\widehat{\mathrm{IS}}_{\mathrm{new}} - \widehat{\mathrm{IS}}_{\mathrm{old}}) - (\widehat{\mathrm{IP}}_{\mathrm{new}} - \widehat{\mathrm{IP}}_{\mathrm{old}}) \tag{9}
\]

is a consistent estimate of IDI. In Pencina et al. (2008), the authors proposed the following estimate of the variance of $\widehat{\mathrm{IDI}}$ (see Eq. (15) of Pencina et al., 2008):

\[
(\widehat{se}_1)^2 + (\widehat{se}_0)^2. \tag{10}
\]
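The empirical estimates in Eqs. (4)–(9) can be illustrated with a minimal sketch. The function names and the toy risk scores below are hypothetical and are not taken from the paper's supplementary code; the sketch only assumes that the risk scores of the diseased and nondiseased subjects are available as arrays.

```python
import numpy as np

def auc_hat(s1, s0):
    """Eq. (4): share of (diseased, nondiseased) pairs with L(x1) > L(x0)."""
    s1 = np.asarray(s1, dtype=float)
    s0 = np.asarray(s0, dtype=float)
    return float(np.mean(s1[:, None] > s0[None, :]))

def is_hat(s1):
    """Eq. (5): mean risk score among diseased subjects."""
    return float(np.mean(np.asarray(s1, dtype=float)))

def ip_hat(s0):
    """Eq. (6): mean risk score among nondiseased subjects."""
    return float(np.mean(np.asarray(s0, dtype=float)))

def idi_hat(new1, new0, old1, old0):
    """Eq. (9): (IS_new - IS_old) - (IP_new - IP_old)."""
    return (is_hat(new1) - is_hat(old1)) - (ip_hat(new0) - ip_hat(old0))

# Hypothetical risk scores in [0, 1] (illustration only).
old1, old0 = [0.6, 0.7, 0.4], [0.5, 0.3, 0.2]   # old model: diseased, nondiseased
new1, new0 = [0.8, 0.9, 0.6], [0.4, 0.2, 0.3]   # new model: diseased, nondiseased

auc_d = auc_hat(new1, new0) - auc_hat(old1, old0)   # Eq. (7)
idi = idi_hat(new1, new0, old1, old0)
print(f"AUC_d = {auc_d:.4f}, IDI = {idi:.4f}")
```

Ties between a diseased and a nondiseased score are counted as zero here, which follows the strict inequality in Eq. (4).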

2.1 Partial IDI and partial AUC

To evaluate the impact of an added biomarker on diagnostic performance, Pencina et al. (2008) studied the effects of using the estimated difference of AUCs and IDI, and used an example to illustrate that when IDI indicates that a new marker offers a statistically significant improvement, it is possible that no significant improvement is detected using AUC (see also Vickers et al., 2011 and Baker et al., 2014 for some comments). However, as mentioned in Walter (2005) and Thompson et al. (2001) and by many other authors, the AUC over a specific range of specificity is more important than over others; especially when a rare disease is studied, only the range with high specificity is of interest. Because the IDI is also defined over all possible cut-off points, these comments also apply to it. Hence, our goal is to measure how a newly added biomarker improves the diagnostic performance, within a given range of specificity, of the model based on the existing biomarkers. We first review the pAUC below and then define the partial IDI as an extension of the IDI. A pAUC with a specificity no less than $1 - u$ for a $u \in (0, 1)$ is defined as

\[
\mathrm{pAUC}(u) = \int_0^u S_1\,dS_0 = \int_1^{S_0^{-1}(u)} S_1(t)\,dS_0(t) = \int_{S_0^{-1}(u)}^1 S_1(t) f_0(t)\,dt = E\big(I(L(x_1) > L(x_0),\ L(x_0) \ge S_0^{-1}(u)) \mid x_1 \in D,\ x_0 \in \bar{D}\big). \tag{11}
\]

Let $L_0(x)$ and $L_1(x)$ be the risk scores $L(x)$ for $x \in \bar{D}$ and $x \in D$, respectively. Alternatively, a pAUC can be rewritten as

\begin{align*}
\mathrm{pAUC} &= \int_1^{S_0^{-1}(u)} S_1(t)\,dS_0(t) = S_1(t) S_0(t)\Big|_1^{S_0^{-1}(u)} - \int_1^{S_0^{-1}(u)} S_0(t)\,dS_1(t) \\
&= S_1(S_0^{-1}(u)) \cdot u + \int_{S_0^{-1}(u)}^1 (-S_0(t)) f_1(t)\,dt \\
&= \int_{S_0^{-1}(u)}^1 f_1(t)\,u\,dt + \int_{S_0^{-1}(u)}^1 (-S_0(t)) f_1(t)\,dt \\
&= \int_{S_0^{-1}(u)}^1 [u - S_0(t)] f_1(t)\,dt = \int_0^1 \max\{u - S_0(t), 0\} f_1(t)\,dt \\
&= E[\max\{u - S_0(L_1(x)), 0\}] = E[u - \min\{S_0(L_1(x)), u\}],
\end{align*}

where $f_1$ is the density function of $L(x)$ for $x \in D$. Then pAUC can be estimated using the following statistic:

\[
\widehat{\mathrm{pAUC}}(u) = \frac{1}{n} \sum_{i=1}^{n} \left[ u - \min\left\{ \frac{1}{m} \sum_{j=1}^{m} I\big(L(x_{0,j}) > L(x_{1,i})\big),\ u \right\} \right]. \tag{12}
\]

Thus, the difference between the pAUCs of the old (standard) and new models is

\[
\widehat{\mathrm{pAUC}}_d = \widehat{\mathrm{pAUC}}_{\mathrm{new}} - \widehat{\mathrm{pAUC}}_{\mathrm{old}}. \tag{13}
\]
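As an illustration of Eq. (12), the following sketch computes the pAUC estimate by evaluating the empirical $S_0$ at each diseased score. The function name and the toy scores are hypothetical; with $u = 1$ and no ties between groups, the estimate reduces to the empirical AUC of Eq. (4), which gives a simple sanity check.

```python
import numpy as np

def pauc_hat(s1, s0, u):
    """Eq. (12): mean over diseased subjects of u - min{S0_hat(L(x1_i)), u}."""
    s1 = np.asarray(s1, dtype=float)
    s0 = np.asarray(s0, dtype=float)
    # Empirical S0 at each diseased score: share of nondiseased scores above it.
    s0_at_s1 = np.mean(s0[None, :] > s1[:, None], axis=1)
    return float(np.mean(u - np.minimum(s0_at_s1, u)))

# Hypothetical risk scores (illustration only).
old1, old0 = [0.6, 0.7, 0.4], [0.5, 0.3, 0.2]
new1, new0 = [0.8, 0.9, 0.6], [0.4, 0.2, 0.3]

u = 0.2  # specificity of at least 0.8
pauc_d = pauc_hat(new1, new0, u) - pauc_hat(old1, old0, u)   # Eq. (13)
print(f"pAUC_d(u={u}) = {pauc_d:.4f}")
```

Because the maximum attainable value of pAUC(u) is $u$, a perfectly separating score yields an estimate equal to $u$.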

This difference measures the improvement due to the newly added biomarker in terms of the partial AUC. It was proven in Wang and Chang (2011) that (12) is a strongly consistent estimate of pAUC(u) under some conditions on $L(x)$. It implies that Eq. (13) is also a consistent estimate of the difference in pAUC between the new and old classification functions. For the same specificity range $(1 - u, 1)$ as above, the partial IS (pIS) and partial IP (pIP) are

\[
\mathrm{pIS}(u) = \int_{S_0^{-1}(u)}^1 S_1(t)\,dt \quad \text{and} \quad \mathrm{pIP}(u) = \int_{S_0^{-1}(u)}^1 S_0(t)\,dt. \tag{14}
\]

Hence,

\begin{align*}
\mathrm{pIS}(u) &= \int_{S_0^{-1}(u)}^1 S_1(t)\,dt = \int_{S_0^{-1}(u)}^1 \int_t^1 f_1(v)\,dv\,dt \\
&= \int_{S_0^{-1}(u)}^1 \int_{S_0^{-1}(u)}^1 f_1(v) I_{[t,1]}(v)\,dv\,dt = \int_{S_0^{-1}(u)}^1 \int_{S_0^{-1}(u)}^{v} f_1(v)\,dt\,dv \\
&= \int_{S_0^{-1}(u)}^1 \big(v - S_0^{-1}(u)\big) f_1(v)\,dv = E\big[\big(L(x) - S_0^{-1}(u)\big)^+ \mid x \in D\big], \tag{15}
\end{align*}

where $(w)^+ = w$ if $w \ge 0$ and $(w)^+ = 0$ if $w < 0$. Similarly,

\[
\mathrm{pIP}(u) = \int_{S_0^{-1}(u)}^1 S_0(t)\,dt = E\big[\big(L(x) - S_0^{-1}(u)\big)^+ \mid x \in \bar{D}\big]. \tag{16}
\]

As before, we use the subscripts “new” and “old” to denote the classification functions from which the statistics are derived. Then a partial IDI is defined as

\[
\mathrm{pIDI}(u) = (\mathrm{pIS}_{\mathrm{new}} - \mathrm{pIP}_{\mathrm{new}}) - (\mathrm{pIS}_{\mathrm{old}} - \mathrm{pIP}_{\mathrm{old}}) = (\mathrm{pIS}_{\mathrm{new}} - \mathrm{pIS}_{\mathrm{old}}) - (\mathrm{pIP}_{\mathrm{new}} - \mathrm{pIP}_{\mathrm{old}}). \tag{17}
\]

A natural estimate of pIDI is defined as

\[
\widehat{\mathrm{pIDI}}(u) = \big(\widehat{\mathrm{pIS}}_{\mathrm{new}}(u) - \widehat{\mathrm{pIS}}_{\mathrm{old}}(u)\big) - \big(\widehat{\mathrm{pIP}}_{\mathrm{new}}(u) - \widehat{\mathrm{pIP}}_{\mathrm{old}}(u)\big), \tag{18}
\]

where

\begin{align*}
\widehat{\mathrm{pIS}}_{\mathrm{new}}(u) &= \frac{1}{n} \sum_{i=1}^{n} \big(L_{\mathrm{new}}(x_{1,i}) - \hat{S}_{\mathrm{new},0}^{-1}(u)\big)^+, &
\widehat{\mathrm{pIS}}_{\mathrm{old}}(u) &= \frac{1}{n} \sum_{i=1}^{n} \big(L_{\mathrm{old}}(x_{1,i}) - \hat{S}_{\mathrm{old},0}^{-1}(u)\big)^+, \\
\widehat{\mathrm{pIP}}_{\mathrm{new}}(u) &= \frac{1}{m} \sum_{j=1}^{m} \big(L_{\mathrm{new}}(x_{0,j}) - \hat{S}_{\mathrm{new},0}^{-1}(u)\big)^+, &
\widehat{\mathrm{pIP}}_{\mathrm{old}}(u) &= \frac{1}{m} \sum_{j=1}^{m} \big(L_{\mathrm{old}}(x_{0,j}) - \hat{S}_{\mathrm{old},0}^{-1}(u)\big)^+,
\end{align*}

and $\hat{S}_{\mathrm{new},0}^{-1}(u)$ and $\hat{S}_{\mathrm{old},0}^{-1}(u)$ are the $1 - u$ quantiles of $\{L_{\mathrm{new}}(x_{0,j}),\ j = 1, \ldots, m\}$ and $\{L_{\mathrm{old}}(x_{0,j}),\ j = 1, \ldots, m\}$, respectively.

In the Appendix, we prove that if the density function $f_0$ is continuous and nonzero at $S_0^{-1}(u)$, then $\widehat{\mathrm{pIS}} = \frac{1}{n} \sum_{i=1}^{n} (L(x_{1i}) - \hat{S}_0^{-1}(u))^+$ converges to pIS in probability. Because pIS and pIP have the same form, the same arguments can be applied to prove the consistency of the estimate of pIP. Since pIDI is defined through differences of pIS and pIP, the consistency of the estimates of pIS and pIP implies that $\widehat{\mathrm{pIDI}}$ is a consistent estimate of pIDI. Thus, only the proof of the consistency of the estimate of pIS is presented, and its details are given in the Appendix.

It is difficult to compute the variance of $\widehat{\mathrm{pIDI}}(u)$ due to its complicated form. In addition, in our numerical study we found that the variance estimate of $\widehat{\mathrm{IDI}}$ used in Pencina et al. (2008), given as (10) here, might overestimate the variance of $\widehat{\mathrm{IDI}}$, and the difference is not negligible. Therefore, we employ a bootstrap resampling method to estimate the variances of the four statistics AUC, IDI, pAUC(u), and pIDI(u), and Wald-type confidence intervals for these indices can be constructed from these estimates.

If a new biomarker is useful, then it is likely to be associated with the old ones, because all these markers are related to the same disease of interest. That is why studying the increment in predictive power due to adding a new biomarker to an existing model/classifier is a hard problem. Moreover, the decision of whether or not to adopt a new biomarker in an existing model is not just a simple statistical issue; several factors should be taken into account. However, if a statistic is used to compare the impact of several new biomarkers on an existing classifier, then statistics such as pAUC and pIDI are legitimate methods. The numerical results below illustrate the performance of the proposed measure in this kind of scenario.
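The pIDI estimate in Eq. (18) and a bootstrap standard error can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the empirical $1 - u$ quantile uses NumPy's default linear interpolation (the paper does not specify a quantile convention), and the bootstrap resamples diseased and nondiseased subjects separately while keeping each subject's old and new scores paired (the paper only states that a bootstrap resampling method is used).

```python
import numpy as np

def mean_excess(scores, cutoff):
    """Sample mean of (L(x) - cutoff)^+, the building block of pIS and pIP."""
    return float(np.mean(np.maximum(np.asarray(scores, dtype=float) - cutoff, 0.0)))

def pidi_hat(new1, new0, old1, old0, u):
    """Eq. (18), with model-specific cutoffs taken from the nondiseased scores."""
    c_new = np.quantile(new0, 1.0 - u)   # assumed quantile convention (linear)
    c_old = np.quantile(old0, 1.0 - u)
    pis_diff = mean_excess(new1, c_new) - mean_excess(old1, c_old)
    pip_diff = mean_excess(new0, c_new) - mean_excess(old0, c_old)
    return pis_diff - pip_diff

def bootstrap_se(stat, new1, new0, old1, old0, n_boot=200, seed=0):
    """Bootstrap standard error of stat(new1, new0, old1, old0)."""
    rng = np.random.default_rng(seed)
    new1, new0, old1, old0 = (np.asarray(a, dtype=float) for a in (new1, new0, old1, old0))
    n, m = len(new1), len(new0)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        i = rng.integers(0, n, size=n)   # resampled diseased subjects
        j = rng.integers(0, m, size=m)   # resampled nondiseased subjects
        reps[b] = stat(new1[i], new0[j], old1[i], old0[j])
    return float(np.std(reps, ddof=1))

# Hypothetical risk scores (illustration only).
old1, old0 = [0.6, 0.7, 0.4], [0.5, 0.3, 0.2]
new1, new0 = [0.8, 0.9, 0.6], [0.4, 0.2, 0.3]

u = 0.5
est = pidi_hat(new1, new0, old1, old0, u)
se = bootstrap_se(lambda a, b, c, d: pidi_hat(a, b, c, d, u), new1, new0, old1, old0)
print(f"pIDI({u}) = {est:.4f}, bootstrap SE = {se:.4f}")
```

A Wald-type interval would then take the form estimate ± z × SE for a chosen normal quantile z; with samples as small as this toy example, such an interval is for illustration only.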

Remark 2.1. There have been several follow-up studies on how to assess a newly added biomarker since Pencina et al. (2008). In particular, Pepe et al. (2014) assert that using the net reclassification index (NRI) and its p-values to detect a biomarker should be treated with skepticism. In a supplementary document to this paper, we present some numerical results regarding the size and power of the statistical test based on pIDI using a permutation test method. Whether IDI exhibits the same phenomenon as NRI will be discussed and reported elsewhere.

3 Numerical study

We compare the performances of pIDI and pAUC using synthesized datasets in this section. The values of AUC and IDI are also summarized for completeness. Samples are generated from normal distributions for both the diseased and nondiseased groups, and two logistic regression models are fitted to calculate the predictive probabilities of individuals; these two models, denoted as old and new, are (1) the model using the “old” biomarkers only, and (2) the model with an added “new” biomarker in addition to the old biomarkers. Using the predictive probabilities of these two models, we then calculate the estimates of AUC, pAUC, IDI, and pIDI. Because the scales of pAUC and pIDI have different meanings, we report the “relative difference ratio” of the estimates of the new and old models instead of their direct difference. The “relative difference ratio” is defined as (new–old)/old. Thus, the larger this ratio is, the greater the performance improvement gained with the newly added biomarker.

We generate 50 diseased (n = 50) and 50 normal (m = 50) samples as a training dataset and another 50 diseased and 50 normal independent samples as a testing set. We use the training set to build a classification (risk score) function L(x), and this function L(x) is then applied to the testing data. The “old biomarker” variables of the diseased subjects are generated from N(μ1, 1). Those of the normal subjects are generated from a normal mixture; that is, (100 − α)% of the normal subjects are sampled from N(0, 1) and α% are from N(μ1, 1). The “new biomarker” values of the diseased and nondiseased subjects are generated from N(μ2, 1) and N(0, 1), respectively. Let μ1 = μ2 = μ ∈ {1, 2} and choose α ∈ {10, 20}. We then consider the following four combinations: (μ, α) = (1, 10), (1, 20), (2, 10), and (2, 20), and conduct 500 replicates for each of them.
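One replicate of this simulation design can be sketched as below. Several details are assumptions, since the text does not specify them: a single “old” biomarker is used, exactly α% of the normal subjects (rather than a random number) are drawn from N(μ, 1), and the logistic regression is fitted by plain Newton iterations with a tiny ridge term added only for numerical stability. All function names are hypothetical.

```python
import numpy as np

def simulate(n, m, mu, alpha, rng):
    """Return (X, y); columns of X are [old marker, new marker], y = 1 if diseased."""
    k = int(round(m * alpha / 100.0))   # normal subjects drawn from N(mu, 1)
    old_d = rng.normal(mu, 1.0, size=n)
    old_nd = np.concatenate([rng.normal(mu, 1.0, size=k),
                             rng.normal(0.0, 1.0, size=m - k)])
    new_d = rng.normal(mu, 1.0, size=n)
    new_nd = rng.normal(0.0, 1.0, size=m)
    X = np.column_stack([np.concatenate([old_d, old_nd]),
                         np.concatenate([new_d, new_nd])])
    y = np.concatenate([np.ones(n), np.zeros(m)])
    return X, y

def fit_logistic(X, y, n_iter=25):
    """Logistic regression with intercept, fitted by Newton's method."""
    Z = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        w = p * (1.0 - p)
        hessian = Z.T @ (Z * w[:, None]) + 1e-8 * np.eye(Z.shape[1])  # stability only
        beta = beta + np.linalg.solve(hessian, Z.T @ (y - p))
    return beta

def predict(beta, X):
    """Predicted probabilities, used as risk scores L(x) in [0, 1]."""
    Z = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-Z @ beta))

def auc_hat(s1, s0):
    """Empirical AUC as in Eq. (4)."""
    return float(np.mean(s1[:, None] > s0[None, :]))

rng = np.random.default_rng(2015)
X_train, y_train = simulate(50, 50, mu=1.0, alpha=10, rng=rng)
X_test, y_test = simulate(50, 50, mu=1.0, alpha=10, rng=rng)

beta_old = fit_logistic(X_train[:, :1], y_train)   # old biomarker only
beta_new = fit_logistic(X_train, y_train)          # old + new biomarker
p_old = predict(beta_old, X_test[:, :1])
p_new = predict(beta_new, X_test)

auc_old = auc_hat(p_old[y_test == 1], p_old[y_test == 0])
auc_new = auc_hat(p_new[y_test == 1], p_new[y_test == 0])
ratio = (auc_new - auc_old) / auc_old              # relative difference ratio
print(f"AUC old = {auc_old:.4f}, new = {auc_new:.4f}, ratio = {ratio:.4f}")
```

Repeating this 500 times per (μ, α) combination and summarizing the ratios by their mean and quartiles would mirror the layout of the reported tables, although the exact numbers depend on the unspecified details listed above.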
Tables 1 and 2 summarize the results for the training and testing datasets, respectively, under the different combinations, where “old” and “new” denote the estimated index values using the “old biomarkers” only and the “old biomarkers” plus the “new biomarker,” respectively. As mentioned before, the relative difference ratio, “(new–old)/old,” is the ratio of the difference between the estimated index values of the new and old models to the estimated index value of the old model. The γ% quantile denotes the empirical γ% quantile of the 500 values of the relative difference ratio (new–old)/old. From Tables 1 and 2, we found that pAUC has a larger (new–old)/old value than AUC. Similarly, this ratio for pIDI is larger than that for IDI. These results suggest that when high specificity is of interest, the pAUC and pIDI indices are more sensitive than their counterparts, AUC and IDI. From these tables, we also found that pIDI has larger ratios than pAUC in all cases of our numerical studies. This suggests that pIDI is more sensitive than pAUC in measuring the change when a new biomarker is integrated into a model. Hence, when a specific range of cut-off points is of interest, pIDI will be a good addition to the conventional pAUC.

Table 1 Simulation results for training datasets.

μ  α   Statistic        AUC                pAUC (u = 0.1)    pAUC (u = 0.2)    IDI               pIDI (u = 0.1)    pIDI (u = 0.2)
1  10  old              0.7330 (0.0466)∗   0.0204 (0.0081)   0.0627 (0.0163)   0.1707 (0.0638)   0.0300 (0.0191)   0.0562 (0.0308)
       new              0.8314 (0.0371)    0.0374 (0.0102)   0.0992 (0.0180)   0.3366 (0.0738)   0.0838 (0.0380)   0.1476 (0.0563)
       (new–old)/old    0.1370             1.0806            0.6589            1.2140            3.3272            2.3512
       25% quantile+    0.0932             0.4515            0.3630            0.6358            0.9130            1.0093
       50% quantile     0.1272             0.7671            0.5529            0.9418            1.9024            1.6516
       75% quantile     0.1726             1.3825            0.8423            1.4829            3.4818            2.7850
2  10  old              0.8809 (0.0303)    0.0387 (0.0124)   0.1113 (0.0192)   0.4481 (0.0760)   0.1122 (0.0562)   0.2280 (0.0808)
       new              0.9676 (0.0159)    0.0782 (0.0097)   0.1723 (0.0126)   0.7307 (0.0717)   0.4197 (0.1429)   0.6029 (0.1236)
       (new–old)/old    0.0994             1.2471            0.5881            0.6652            3.7616            1.9521
       25% quantile     0.0769             0.6802            0.4048            0.4895            1.8012            1.1637
       50% quantile     0.0964             1.0194            0.5262            0.6285            2.8882            1.6704
       75% quantile     0.1182             1.5308            0.7489            0.8089            4.5913            2.3980
1  20  old              0.7086 (0.0463)    0.0169 (0.0072)   0.0551 (0.0150)   0.1383 (0.0572)   0.0217 (0.0164)   0.0417 (0.0254)
       new              0.8216 (0.0403)    0.0355 (0.0105)   0.0953 (0.0185)   0.3178 (0.0783)   0.0760 (0.0384)   0.1363 (0.0569)
       (new–old)/old    0.1624             1.4475            0.8238            1.6950            7.6251            3.7033
       25% quantile     0.1130             0.6158            0.4581            0.8556            1.3065            1.3812
       50% quantile     0.1520             1.0853            0.7094            1.2947            2.5975            2.3120
       75% quantile     0.2013             1.8569            1.0760            2.1007            5.7473            4.3124
2  20  old              0.8367 (0.0314)    0.0253 (0.0103)   0.0838 (0.0187)   0.3475 (0.0671)   0.0525 (0.0344)   0.1304 (0.0545)
       new              0.9582 (0.0165)    0.0734 (0.0097)   0.1652 (0.0126)   0.6875 (0.0684)   0.3444 (0.1231)   0.5285 (0.1143)
       (new–old)/old    0.1465             2.5163            1.0767            1.0409            11.8748           3.8838
       25% quantile     0.1180             1.2858            0.7245            0.7620            3.5173            2.1832
       50% quantile     0.1458             1.9935            0.9865            0.9802            6.2213            3.2868
       75% quantile     0.1717             3.1478            1.2918            1.2405            11.0715           4.4677

∗ Standard deviations are in parentheses.
+ The γ% quantile denotes the empirical γ% quantile of the 500 simulated values of (new–old)/old.

Table 2 Simulation results for testing datasets.

μ  α   Statistic        AUC                pAUC (u = 0.1)    pAUC (u = 0.2)    IDI               pIDI (u = 0.1)    pIDI (u = 0.2)
1  10  old              0.7350 (0.0484)∗   0.0204 (0.0088)   0.0632 (0.0172)   0.1661 (0.0489)   0.0282 (0.0186)   0.0538 (0.0269)
       new              0.8281 (0.0383)    0.0373 (0.0100)   0.0986 (0.0178)   0.3286 (0.0556)   0.0812 (0.0369)   0.1429 (0.0479)
       (new–old)/old    0.1295             1.1293            0.6390            1.1087            4.4817            2.2308
       25% quantile+    0.0863             0.4557            0.3400            0.7026            1.0259            1.0310
       50% quantile     0.1215             0.8015            0.5516            0.9798            1.9037            1.7191
       75% quantile     0.1657             1.4923            0.8566            1.3883            3.5524            2.7864
2  10  old              0.8792 (0.0299)    0.0378 (0.0124)   0.1102 (0.0196)   0.4413 (0.0564)   0.1058 (0.0555)   0.2196 (0.0689)
       new              0.9641 (0.0160)    0.0761 (0.0100)   0.1694 (0.0129)   0.7169 (0.0538)   0.3822 (0.1435)   0.5750 (0.1097)
       (new–old)/old    0.0975             1.2391            0.5812            0.6444            3.8042            1.8441
       25% quantile     0.0727             0.6721            0.3888            0.4983            1.5601            1.1704
       50% quantile     0.0939             1.0533            0.5399            0.6285            2.7043            1.6469
       75% quantile     0.1167             1.5157            0.7002            0.7630            4.5821            2.2655
1  20  old              0.7112 (0.0497)    0.0168 (0.0071)   0.0554 (0.0157)   0.1335 (0.0431)   0.0202 (0.0138)   0.0396 (0.0218)
       new              0.8162 (0.0408)    0.0348 (0.0100)   0.0938 (0.0183)   0.3059 (0.0595)   0.0723 (0.0356)   0.1275 (0.0482)
       (new–old)/old    0.1509             1.3922            0.7949            1.4996            5.6441            3.3559
       25% quantile     0.1054             0.6187            0.4431            0.9102            1.3793            1.3679
       50% quantile     0.1440             1.0993            0.6892            1.2852            2.7139            2.3222
       75% quantile     0.1905             1.8133            1.0325            1.8279            5.0634            3.9098
2  20  old              0.8357 (0.0338)    0.0253 (0.0104)   0.0830 (0.0197)   0.3423 (0.0516)   0.0491 (0.0329)   0.1261 (0.0478)
       new              0.9545 (0.0176)    0.0717 (0.0106)   0.1626 (0.0137)   0.6767 (0.0519)   0.3217 (0.1267)   0.5083 (0.1047)
       (new–old)/old    0.1438             2.3851            1.0635            1.0175            10.0589           3.6394
       25% quantile     0.1149             1.2709            0.7213            0.7887            3.4642            2.2348
       50% quantile     0.1414             1.9239            0.9758            0.9635            6.3688            3.1863
       75% quantile     0.1713             2.8431            1.3055            1.2102            11.2339           4.3987

∗ Standard deviations are in parentheses.
+ The γ% quantile denotes the empirical γ% quantile of the 500 simulated values of (new–old)/old.

4 Real example

We apply the proposed methods to a liver cancer dataset collected from the northeast of Andhra Pradesh, India, which is available from Bache and Lichman (2013). In this dataset, 10 variables are measured for each individual: age, gender, total bilirubin (TB), direct bilirubin (DB), alkaline phosphatase (Alkphos), alanine aminotransferase (Sgpt), aspartate aminotransferase (Sgot), total proteins (TP), albumin (ALB), and the albumin and globulin ratio (A/G). It contains 583 subjects with 441 males and 142 females. Among them, there are 416 liver patients and 167 normal subjects. Except for gender, all variables are standardized before we calculate the indices. For illustration, we use the whole dataset, excluding gender, to select the preexisting biomarkers (old markers) using the logistic regression method. There are four markers, Age, Sgpt, TP, and ALB, whose estimated coefficients have p-values less than 0.05. These four markers are then treated as the “old” biomarkers, and the remaining five variables, TB, DB, Alkphos, Sgot, and A/G, are treated as new biomarkers. We treat the male and female patients separately to calculate the index values for the five “new” biomarkers. In this case, each measure can be used as an index to compare the impacts of these five new biomarkers and select the ones with high impact. That is, these indices are used to compare the diagnostic power that each new biomarker adds to the existing model.

Table 3 Results of measurement indices for liver cancer data.

         Marker    AUC      IDI      pAUC (u = 0.1)   pAUC (u = 0.2)   pIDI (u = 0.1)   pIDI (u = 0.2)
Female   Old∗      0.6885   0.1028   0.0282           0.0722           0.0399           0.0583
         TB        0.7143   0.1319   0.0309           0.0736           0.0624           0.0808
         DB        0.7165   0.1344   0.0318           0.0753           0.0613           0.0862
         Alkphos   0.7161   0.1281   0.0297           0.0718           0.0607           0.0704
         Sgot      0.6997   0.1179   0.0292           0.0707           0.0529           0.0587
         A/G       0.6946   0.1105   0.0291           0.0700           0.0494           0.0597
Male     Old       0.7610   0.1596   0.0309           0.0881           0.0390           0.0699
         TB        0.7777   0.1835   0.0410           0.0976           0.0681           0.0922
         DB        0.7807   0.1867   0.0414           0.0988           0.0728           0.1025
         Alkphos   0.7679   0.1655   0.0322           0.0892           0.0425           0.0803
         Sgot      0.7617   0.1626   0.0370           0.0926           0.0478           0.0741
         A/G       0.7629   0.1659   0.0325           0.0879           0.0448           0.0691

∗ “Old” denotes measurement indices when only using markers Age, Sgpt, TP, and ALB.

Table 3 presents the results of AUC, pAUC, IDI, and pIDI for each “new” biomarker, where “old” stands for the old model with the variables Age, Sgpt, TP, and ALB. We can easily see that pAUC has a larger value of the relative difference ratio, (new–old)/old, than AUC. Similarly, the ratio for pIDI is also larger than that for IDI. For example, with the variable Sgot as the new biomarker, the ratio for pIDI (u = 0.1) is equal to 0.3261 and 0.2261 for the female and male groups, respectively. These values are higher than their counterparts for AUC and IDI, which are 0.0163 and 0.1477 for the female group, and 0.0008 and 0.0188 for the male group. Table 4 shows which new biomarkers significantly increase the power of the model. In Table 4, we also found that AUC fails to detect any of the new variables in both groups, and IDI detects two variables, TB and DB, which significantly improve the predictive power of the old model. Within a given specificity range, pAUC detects none of the new biomarkers for the female dataset and detects some of them when the male dataset is used. Note that the variable Sgot is ignored when AUC and IDI are used as measures. However, for a confined specificity range, pAUC and pIDI can successfully detect it. Moreover, in the female dataset, the pIDI index shows that the Sgot variable can improve the predictive ability of the model with only the old biomarkers.

When determining the usefulness of a newly discovered biomarker in the presence of an existing model, it is better to be conservative. So, as a short summary, we recommend that when a new biomarker is evaluated, it is better to report several indices, including those with and without a confined range of specificity. Our numerical results show that the proposed index, pIDI, is a very good addition in this kind of situation.
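The relative difference ratios quoted in the text for Sgot can be approximately reproduced from the rounded entries of Table 3. The sketch below uses the female-group values of Table 3; the dictionary layout and function name are hypothetical. Because the table entries are rounded to four decimals, the recomputed ratios agree with the quoted values only up to rounding error.

```python
def relative_difference(new, old):
    """Relative difference ratio (new - old) / old."""
    return (new - old) / old

# Female-group entries of Table 3 (rounded values as printed).
female_old = {"AUC": 0.6885, "IDI": 0.1028, "pIDI(u=0.1)": 0.0399}
female_sgot = {"AUC": 0.6997, "IDI": 0.1179, "pIDI(u=0.1)": 0.0529}

ratios = {name: relative_difference(female_sgot[name], female_old[name])
          for name in female_old}
for name, value in ratios.items():
    print(f"{name}: {value:.4f}")
```

The ordering of the three ratios (pIDI largest, AUC smallest) matches the comparison made in the text for the female group.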

Table 4 Results of identification of markers for liver cancer data.

         Marker    AUC   IDI   pAUC (u = 0.1)   pAUC (u = 0.2)   pIDI (u = 0.1)   pIDI (u = 0.2)
Female   TB        0∗    1     0                0                1                1
         DB        0     1     0                0                1                1
         Alkphos   0     0     0                0                1                0
         Sgot      0     0     0                0                1                0
         A/G       0     0     0                0                1                0
Male     TB        0     1     1                1                1                1
         DB        0     1     1                1                1                1
         Alkphos   0     0     0                0                1                1
         Sgot      0     0     1                0                1                1
         A/G       0     0     0                0                1                0

∗ “1” denotes a new marker that significantly increases the power of the existing model; “0” denotes a new marker that does not significantly increase the power of the existing model.

5 Conclusion

In this paper, we propose a new measure, pIDI, for evaluating the improvement in predictive ability due to a newly added biomarker within a given specificity range. This measure allows us to focus on a specific cut-off range that is more important for disease diagnosis in medical or clinical practice. An estimate of pIDI is proposed and proven to be consistent. Numerically, we show that pIDI can detect the improvement due to the newly added biomarker when IDI, AUC, and pAUC fail to do so. A liver disease dataset is used for illustration purposes. In line with the comments on indices that use the whole range of cut-off points, we also found that it is better to report more than one index when evaluating the predictive ability of a newly added marker with a particular cut-off range of interest. This new measure, pIDI, is a good addition to current indices for measuring the impact of a newly added biomarker.

Conflict of interest

The authors have declared no conflict of interest.

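Appendix

Proof of Consistency

Suppose $S_0^{-1}(u)$ is a known constant for the moment. Then, according to the Kolmogorov strong law of large numbers,

\[
\frac{1}{n} \sum_{i=1}^{n} \big(L(x_{1i}) - S_0^{-1}(u)\big)^+ \to \mathrm{pIS} \quad \text{almost surely.} \tag{A.1}
\]

It follows from the properties of quantile estimates that $\hat{S}_0^{-1}(u) \to S_0^{-1}(u)$ in probability. Thus, we have the following inequalities:

\[
\frac{1}{n} \sum_{i=1}^{n} \big(L(x_{1i}) - S_0^{-1}(u)\big)^+ - \big|\hat{S}_0^{-1}(u) - S_0^{-1}(u)\big| \le \frac{1}{n} \sum_{i=1}^{n} \big(L(x_{1i}) - \hat{S}_0^{-1}(u)\big)^+ \le \frac{1}{n} \sum_{i=1}^{n} \big(L(x_{1i}) - S_0^{-1}(u)\big)^+ + \big|\hat{S}_0^{-1}(u) - S_0^{-1}(u)\big|. \tag{A.2}
\]

Hence, for any $\varepsilon > 0$,

\[
P\left( \left| \frac{1}{n} \sum_{i=1}^{n} \big(L(x_{1i}) - \hat{S}_0^{-1}(u)\big)^+ - \frac{1}{n} \sum_{i=1}^{n} \big(L(x_{1i}) - S_0^{-1}(u)\big)^+ \right| \ge \varepsilon \right) \le P\big( \big|\hat{S}_0^{-1}(u) - S_0^{-1}(u)\big| \ge \varepsilon \big).
\]

Then, when $n \to +\infty$,

\[
\lim_{n \to +\infty} P\left( \left| \frac{1}{n} \sum_{i=1}^{n} \big(L(x_{1i}) - \hat{S}_0^{-1}(u)\big)^+ - \frac{1}{n} \sum_{i=1}^{n} \big(L(x_{1i}) - S_0^{-1}(u)\big)^+ \right| \ge \varepsilon \right) = 0.
\]

It implies that as $n \to +\infty$,

\[
\frac{1}{n} \sum_{i=1}^{n} \big(L(x_{1i}) - \hat{S}_0^{-1}(u)\big)^+ - \frac{1}{n} \sum_{i=1}^{n} \big(L(x_{1i}) - S_0^{-1}(u)\big)^+ \to 0 \quad \text{in probability.} \tag{A.3}
\]

Hence, it follows from Eqs. (A.1) and (A.3) that

\[
\frac{1}{n} \sum_{i=1}^{n} \big(L(x_{1i}) - \hat{S}_0^{-1}(u)\big)^+ \to \mathrm{pIS} \quad \text{in probability.}
\]

Similarly, we have that $\widehat{\mathrm{pIP}} \to \mathrm{pIP}$ in probability, which implies that $\widehat{\mathrm{pIDI}}$ is a consistent estimate of pIDI.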

References

Baker, S., Schuit, E., Steyerberg, E., Pencina, M., Vickers, A., Moons, K., Mol, B. and Lindeman, K. (2014). How to interpret a small increase in AUC with an additional risk prediction marker: decision analysis comes through. Statistics in Medicine 33, 3946–3959.

Chi, Y.-Y. and Zhou, X.-H. (2008). Commentary: The need for reorientation toward cost-effective prediction: Comments on ‘Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond’ by Pencina et al., Statistics in Medicine. Statistics in Medicine 27, 182–184.

D’Agostino, R. B. and Nam, B. (2004). Evaluation of the performance of survival analysis models: discrimination and calibration measures. Handbook of Statistics 23, 1–25.

Hand, D. (2012). Assessing the performance of classification methods. International Statistical Review 80, 400–414.

McGreechan, K., Macaskill, P., Irwig, L., Liew, G. and Wong, T. (2008). Assessing new biomarkers and predictive models for use in clinical practice. Archives of Internal Medicine 21, 2304–2310.

Pencina, M. J., D’Agostino Sr, R. B., D’Agostino Jr, R. B. and Vasan, R. S. (2008). Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Statistics in Medicine 27, 157–172.

Pepe, M., Janes, H. and Li, C. (2014). Net risk reclassification p values: valid or misleading? Journal of the National Cancer Institute 106, 1–6.

Pletcher, M. and Pignone, M. (2011). Evaluating the clinical utility of a biomarker: a review of methods for estimating health impact. Circulation 123, 1116–1124.

Thompson, I. M., Resnick, M. and Klein, E. (2001). Prostate Cancer Screening. Humana Press, New York, NY.

Vickers, A., Cronin, A. and Begg, C. (2011). One statistical test is sufficient for assessing new predictive markers. BMC Medical Research Methodology 11, 11–13.

Walter, S. (2005). The partial area under the summary ROC curve. Statistics in Medicine 24, 2025–2040.

Wang, Z. and Chang, Y.-c. I. (2011). Marker selection via maximizing the partial area under the ROC curve of linear risk scores. Biostatistics 12, 369–385.