Biometrics 71, 404–416 June 2015

DOI: 10.1111/biom.12265

Generalized t-Statistic for Two-Group Classification

Osamu Komori,1,* Shinto Eguchi,1 and John B. Copas2

1 Institute of Statistical Mathematics, Midori-cho, Tachikawa, Tokyo 190-8562, Japan
2 Department of Statistics, University of Warwick, Coventry CV4 7AL, U.K.
* email: [email protected]

Summary. In the classic discriminant model of two multivariate normal distributions with equal variance matrices, the linear discriminant function is optimal both in terms of the log likelihood ratio and in terms of maximizing the standardized difference (the t-statistic) between the means of the two distributions. In a typical case–control study, normality may be sensible for the control sample but heterogeneity and uncertainty in diagnosis may suggest that a more flexible model is needed for the cases. We generalize the t-statistic approach by finding the linear function which maximizes a standardized difference but with data from one of the groups (the cases) filtered by a possibly nonlinear function U. We study conditions for consistency of the method and find the function U which is optimal in the sense of asymptotic efficiency. Optimality may also extend to other measures of discriminatory efficiency such as the area under the receiver operating characteristic curve. The optimal function U depends on a scalar probability density function which can be estimated non-parametrically using a standard numerical algorithm. A lasso-like version for variable selection is implemented by adding L1-regularization to the generalized t-statistic. Two microarray data sets in the study of asthma and various cancers are used as motivating examples.

Key words: Area under the ROC curve; Asymptotic variance; Fisher linear discriminant function; Kullback–Leibler divergence; Lasso; t-Statistic.

1. Introduction

Discriminant analysis is one of the best known techniques in applied statistics, going right back to the pioneering work of Fisher (1936). In the two-group case, we have a set of p measurements or variables x taken on samples from two separate populations, y = 0, 1. If we denote by p_0(x) and p_1(x) the two population density functions, the classic model is

\[ p_0(x) \sim N(\mu_0, \Sigma_0), \qquad p_1(x) \sim N(\mu_1, \Sigma_0), \tag{1} \]

for which the log likelihood ratio log{p_1(x)/p_0(x)} is, up to a linear transformation, just the linear discriminant function β_0'x with

\[ \beta_0 = \frac{\Sigma_0^{-1}(\mu_1 - \mu_0)}{\{(\mu_1 - \mu_0)^\top \Sigma_0^{-1} (\mu_1 - \mu_0)\}^{1/2}}. \tag{2} \]

An alternative motivation for β_0, probably closer to Fisher's original idea, is to note that β_0'x is the linear score which gives maximum standardized separation between the two means, or

\[ \beta_0 = \operatorname*{argmax}_{\beta} L_I(\beta), \tag{3} \]

where

\[ L_I(\beta) = \frac{\beta^\top(\mu_1 - \mu_0)}{(\beta^\top \Sigma_0 \beta)^{1/2}}. \tag{4} \]

We refer to L_I(β) as the t-statistic by analogy with the familiar univariate t-test.
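As a concrete illustration of (2) and (4), the short sketch below computes β_0 and L_I(β) from given means and a common covariance matrix. It is written in Python with NumPy purely for exposition; the function names are ours and this is not the authors' implementation (their code, in R, is listed in Web Appendix L).

```python
import numpy as np

def linear_discriminant_direction(mu0, mu1, sigma0):
    """beta_0 of (2): Sigma_0^{-1}(mu_1 - mu_0), scaled by the denominator
    {(mu_1 - mu_0)' Sigma_0^{-1}(mu_1 - mu_0)}^{1/2}."""
    diff = mu1 - mu0
    b = np.linalg.solve(sigma0, diff)          # Sigma_0^{-1}(mu_1 - mu_0)
    return b / np.sqrt(diff @ b)

def t_statistic(beta, mu0, mu1, sigma0):
    """L_I(beta) of (4): standardized separation between the two means."""
    return beta @ (mu1 - mu0) / np.sqrt(beta @ sigma0 @ beta)

# toy check: with mu0 = 0, mu1 = (1, 1)' and Sigma_0 = I_2, beta_0 is (1, 1)'/sqrt(2)
beta0 = linear_discriminant_direction(np.zeros(2), np.ones(2), np.eye(2))
print(beta0, t_statistic(beta0, np.zeros(2), np.ones(2), np.eye(2)))
```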


The aim of this article is to suggest how a generalized version of (4) can still be used to find a good linear discriminant score in cases where the very strong assumptions of model (1) no longer apply. We have in mind a typical case–control study where the sample of cases (y = 1) are individuals with a particular medical condition or exposure, to be compared with the same measurements taken on a sample from the general population (y = 0). Our motivating example, discussed in more detail in Section 5, is a set of microarray data in which measurements of allergen markers for asthmatic patients are compared with those for a sample of controls (Dottorini et al., 2011). The upper panel in Figure 2 in Section 5 shows a typical scatter plot for a pair of markers. The points shown as circles (controls) seem reasonably homogeneous, plausibly modeled by a bivariate normal distribution, but the cases (shown as crosses) show very distinct clustering, with one cluster almost indistinguishable from the controls. There are also more outlying observations. This suggests both heterogeneity and uncertainty in diagnosis, not uncommon in medical studies especially in microarray data analysis (Tibshirani and Hastie, 2007; Wu, 2007; Lian, 2008; Bravo et al., 2012). We are therefore interested in discriminant models which retain normality for p_0(x) as in the classic model (1) but allow more flexible models for p_1(x). For such models we consider generalizing (4) to

\[ L_U(\beta) = E_1\, U\!\left\{\frac{\beta^\top(x - \mu_0)}{(\beta^\top \Sigma_0 \beta)^{1/2}}\right\}, \tag{5} \]

where E_1 denotes expectation over p_1(x) and U is some generator function. The normalization in the denominator in (5) is

now based on the variance in the control group only, reflecting the essential asymmetry in the way we are now treating the two groups. The argument of U in (5) is the standard normal score for β'x amongst the controls; we are then using U to filter the values of this score to find the average for the cases. The argument is essentially the same as before: the larger is (5), the greater is the separation of values of β'x between y = 0 and 1, but now we are filtering the data for the cases to allow for non-normality and heterogeneity in p_1(x). If U is the identity function, denoted as I, then (5) is just (4), but other possibilities for U are of interest. For example, if U = Φ, the standard normal distribution function, then

\[ \Phi\!\left\{\frac{\beta^\top(x - \mu_0)}{(\beta^\top \Sigma_0 \beta)^{1/2}}\right\} = P(\beta^\top X_0 < \beta^\top x \mid x), \tag{6} \]

where X_0 is a random value of x from p_0(x). If X_1 is an independent random variable from p_1(x), then

\[ L_\Phi(\beta) = E_1\{P(\beta^\top X_0 < \beta^\top X_1 \mid X_1)\} = P(\beta^\top X_0 < \beta^\top X_1), \tag{7} \]

which is just the area under the receiver operating characteristic curve (or AUC), a standard measure of the effectiveness of β'x as a discriminant function (Su and Liu, 1993; Pepe, 2003). Another example of a function U leads to the standard Fisher linear discriminant function, as discussed in Sections 2.2 and 3.

To estimate these quantities, suppose the two sample sizes are n_0 and n_1 and let x̄_0 and S_0 be respectively the sample mean and sample covariance matrix of the control sample. Assuming normality for the controls these are the sufficient statistics for μ_0 and Σ_0. Our proposal is to define the generalized t-statistic estimate β̂_U to be

\[ \hat\beta_U = \operatorname*{argmax}_{\beta} \hat{L}_U(\beta), \tag{8} \]

where L̂_U(β) is the natural sample version of L_U(β),

\[ \hat{L}_U(\beta) = \frac{1}{n_1} \sum_{j=1}^{n_1} U\!\left\{\frac{\beta^\top(x_{1j} - \bar{x}_0)}{(\beta^\top S_0 \beta)^{1/2}}\right\}. \tag{9} \]

As the generalized t-statistic (9) has scale invariance in the sense that L̂_U(β) = L̂_U(cβ) for all c > 0, there is no loss of generality if we normalize β̂_U to satisfy β̂_U'S_0β̂_U = 1.

Properties of β̂_U are discussed in the later sections of this article. Section 2 concerns asymptotic consistency and optimality. We discuss assumptions on p_1(x) and U such that β̂_U is a consistent estimator for β_0 in (2), assumptions which are trivially true for the classic model (normality and U = I) but continue to hold for a wide range of non-normal distributions for p_1(x). Under these assumptions we find the function U_opt which minimizes the asymptotic variance of β̂_U as both n_0 and n_1 go to infinity. Section 3 gives a numerical algorithm for estimating β̂_{U_opt}. In Section 4 we propose a lasso-like version of the generalized t-statistic (U-lasso) which leads to an algorithm for variable selection in high-dimensional cases. Section 5 illustrates the generalized t-statistic method by applying it to a pair of allergen markers in the asthma example mentioned earlier, using a parametric bootstrap to compare the mean squared error and mean AUC of β̂_{U_opt} with the corresponding estimates given by other methods. Applications of the U-lasso are illustrated by the second example of Section 5 and by the two data examples in Section 6. The article concludes with a suggestion in Section 7 of how our method might be extended to cover Bayes risk consistency, and with further discussion in Section 8. A technical appendix (web supplement) gives the proofs of some of our main results and provides further details of the examples.

2. Consistency and Optimality of the Generalized t-Statistic Estimate

2.1. Consistency

As we are only considering linear discriminant functions, the problem of estimating β is invariant under linear transformations of x, and so there is no loss of generality if we first standardize x when y = 0 such that μ_0 = (0, ..., 0)' and Σ_0 = I_p, where I_p is the p × p unit matrix. We now define two statistics, w = β_0'x and g = Q_0x, where Q_0 = I_p − β_0β_0'. The linear score w is proportional to the linear discriminant function under the classic model with β_0 = μ_1/(μ_1'μ_1)^{1/2}, which we can think of as the standardized linear projection of x onto μ_1. Statistic g is then the residual vector in this projection, the component of x which is not explained by w. Clearly, when y = 0, w and g are independent with E_0(w) = 0, E_0(g) = 0 and var_0(g) = Q_0. When y = 1, E_1(w) = (μ_1'μ_1)^{1/2} ≥ 0 and E_1(g) = 0. If the classic model (1) is true then w and g are independent with var_y(g) = Q_0 under both y = 0, 1, and so the following two key assumptions (A) and (B) also hold, namely

(A) E_1(g | w = a) = 0 for all a ∈ R,

and

(B) var_1(g | w = a) = Q_0 for all a ∈ R.

Assumption (A) means that the residual g is distributed around a line aβ_0 (a ∈ R) and has mean 0, leading to the unbiased estimation equation for β_0 evident in Theorem 1. Assumption (B) means that the conditional variance of g with respect to y = 1 is constant for every value of w ∈ R and is the same as the variance of g with respect to y = 0. If p_1(x) is symmetrical with respect to the line aβ_0 (a ∈ R) then (A) is satisfied, as shown in Web Appendix A. This covers a wide range of distributions including elliptical distributions such as the multivariate t-distribution with mean μ_1 and precision matrix I_p. However (A) and (B) also hold under more general conditions for p_1(x), for example the mislabel model

\[ p_1(x) = (1 - \varepsilon)\,\phi(x, \nu, I_p) + \varepsilon\,\phi(x, 0, I_p), \tag{10} \]

where φ(·, ν, V) denotes the normal probability density function with mean ν and variance V. This means there is some


uncertainty in the diagnosis of the cases given by the mislabel probability ε. Here μ_1 = (1 − ε)ν. Both assumptions (A) and (B) continue to hold as w and g are stochastically independent under both components of (10).

Asymptotic properties of β̂_U are given by the derivative and Hessian of L_U(β). At β = β_0 these are

\[ \frac{\partial L_U(\beta)}{\partial \beta} = E_1\{U'(w)\, g\} \tag{11} \]

and

\[ \frac{\partial^2 L_U(\beta)}{\partial \beta\, \partial \beta^\top} = E_1\bigl[\{U''(w) - U'(w)\, w\}\, g g^\top\bigr], \tag{12} \]

where the Hessian is derived using assumption (A). Under assumption (A), (11) is zero, so β_0 is a stationary point for all (smooth) functions U. Under (A) and (B), (12) is E_1{U''(w) − U'(w)w} Q_0, which is negative definite if we also have the condition

(C) E_1{U''(w) − U'(w)w} < 0.

As L_U(β) is the asymptotic limit of L̂_U(β) as the sample sizes tend to infinity, this establishes

Theorem 1. Under assumptions (A), (B), and (C), β̂_U is a consistent estimate of β_0.

When p_1(x) satisfies assumptions (A) and (B), β_0 can be regarded as the target value of the estimator β̂_U for any generator function U satisfying (C). Condition (C) is an assumption about the shape of U, which holds for the identity and standard normal distribution functions mentioned earlier, as well as other functions U introduced in the subsequent discussion.

Let us consider the relationship to the standard logistic regression model

\[ \log\frac{p_1(x)}{p_0(x)} = c + \beta^\top x. \tag{13} \]

If p_0(x) ∼ N(0, I_p) then we have the classic model (1) with p_1(x) ∼ N(β, I_p) and c = −β'β/2. The target parameter β_0 is proportional to β in (13). In this situation Efron (1975) showed that Fisher linear discriminant analysis, equivalent to using the quadratic U-function in (22), is more efficient than logistic regression. This discussion can be extended to the more general model with p_0(x) ∼ N(0, I_p) and

\[ \psi\!\left\{\frac{p_1(x)}{p_0(x)}\right\} = c + \beta^\top x, \tag{14} \]

for any strictly increasing function ψ(·). Web Appendix B establishes that for this model

Theorem 2. The target parameter β_0 is proportional to β in (14) and both assumptions (A) and (B) hold.

In this case the estimated linear function β̂_U'x converges to β_0'x, which has a one-to-one correspondence with p_1(x)/p_0(x), and so is Bayes risk consistent (see Section 7 for further discussion).

For another example, consider extending the mislabel model (10) to

\[ p_1(x) = \sum_{k=1}^{\infty} \varepsilon_k\, \phi(x, \nu_k, V_k), \tag{15} \]

where the ε_k are non-negative mixture coefficients with Σ_{k=1}^{∞} ε_k = 1. Allowing for a potentially infinite number of components means that (15) can be viewed as a nonparametric approximation to essentially any conditional probability density function of x given y = 1. Figure 1 shows contours of three bivariate (p = 2) examples of mixture models (15) which satisfy (A) and (B). Distribution p_0(x) is N(0, I_2) in each case. The first two examples are two-component mixtures showing different degrees of bi-modality for p_1(x), and the third example with 5 components shows a distinctly non-normal shape for p_1(x). Actually, we can rewrite assumptions (A) and (B) based on the mixture model in (15) as follows.

Theorem 3. Assumptions (A) and (B) under the infinite mixture model in (15) are equivalent to

\[ (A')\quad \sum_{k \in K_\ell} \varepsilon_k (Q_0 - Q_k) = 0, \qquad \sum_{k \in K_\ell} \varepsilon_k Q_k \nu_k = 0 \quad \text{for any } \ell \in \mathbb{N}, \]

and

\[ (B')\quad \sum_{k \in K_\ell} \varepsilon_k (Q_k V_k Q_0 - Q_0) = 0 \quad \text{for any } \ell \in \mathbb{N}, \]

where Q_k = I_p − V_kβ_0β_0'/(β_0'V_kβ_0) and K_ℓ = {k | β_0'ν_k = β_0'ν_ℓ, β_0'V_kβ_0 = β_0'V_ℓβ_0}.

In this way, assumptions (A) and (B) can be expressed according to the index set K_ℓ, where the components of the infinite mixture in (15) are categorized so that they satisfy the mean and variance conditions regarding β_0'x. See Web Appendix C for a proof. The pair of assumptions (A') and (B') permits a specific type of multimodal distribution, allowing a good deal of flexibility for p_1(x).

Remark 1. It is clear that if V_k = I_p for all k then (B') is satisfied because Q_k = Q_0 in this case. If ν_k ∝ μ_1 and V_k = α_{k,0}β_0β_0' + Σ_{ℓ=1}^{p−1} α_{k,ℓ} b_{k,ℓ} b_{k,ℓ}', where β_0'b_{k,ℓ} = 0 for all k and ℓ, then (A') holds because

\[ Q_k = Q_0, \qquad Q_k \nu_k \propto Q_0 \mu_1 = 0. \tag{16} \]

Moreover, if Σ_{ℓ=1}^{p−1} α_{k,ℓ} b_{k,ℓ} b_{k,ℓ}' = Q_0 for all k, then (B') also holds because

\[ Q_k V_k Q_0 = Q_0(\alpha_{k,0}\beta_0\beta_0^\top + Q_0)Q_0 = Q_0. \tag{17} \]
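Assumptions (A) and (B) are statements about the conditional mean and variance of g given w under p_1(x), so for any particular case distribution they can be probed by simulation: draw from p_1(x), form w = β_0'x and g = Q_0x, bin on w, and inspect the within-bin moments of g. The Python sketch below does this for a two-component mixture with the parameters quoted for the left panel of Figure 1; it is an illustration under those stated parameters only, not part of the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# left panel of Figure 1: p1(x) = 0.2 N(nu1, I2) + 0.8 N(nu2, I2), nu1 = (0,0)', nu2 = (2,2)'
eps = np.array([0.2, 0.8])
nus = np.array([[0.0, 0.0], [2.0, 2.0]])
n1 = 200_000

comp = rng.choice(2, size=n1, p=eps)
x1 = nus[comp] + rng.standard_normal((n1, 2))

mu1 = eps @ nus                               # E_1(x)
beta0 = mu1 / np.sqrt(mu1 @ mu1)              # beta_0 = mu_1 / (mu_1' mu_1)^{1/2}
Q0 = np.eye(2) - np.outer(beta0, beta0)       # residual projector Q_0

w = x1 @ beta0                                # w = beta_0' x
g = x1 @ Q0                                   # g = Q_0 x  (Q_0 is symmetric)

# bin on w and inspect conditional moments of g: (A) mean near 0, (B) covariance near Q_0
edges = np.quantile(w, np.linspace(0, 1, 11))
for lo, hi in zip(edges[:-1], edges[1:]):
    sel = (w >= lo) & (w < hi)
    print(np.round(g[sel].mean(axis=0), 3), np.round(np.cov(g[sel].T), 3))
```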



Figure 1. Examples of probability density functions which satisfy assumptions (A') and (B'). Contours of p_0(x) are shown in gray, contours of p_1(x) in black. For all three panels, V_k = I_2 for all components. For the left panel, ν_1 = (0, 0)', ν_2 = (2, 2)', ε_1 = 0.2 and ε_2 = 0.8. For the center panel, ν_1 = (−1, −1)', ν_2 = (2, 2)', ε_1 = 0.2 and ε_2 = 0.8. For the right panel, ν_1 = (1, 1)', ν_2 = (1.5, 0.5)', ν_3 = (0.5, 1.5)', ν_4 = (−0.5, 2.5)', ν_5 = (2, 2)', ε_1 = ε_3 = ε_4 = 0.1, ε_2 = 0.4 and ε_5 = 0.3.

2.2. Asymptotic Normality and Optimality

Under the classic model (1), properties of the asymptotic variance of the coefficients of the Fisher linear discriminant function are fully investigated by Efron (1975). Some extensions to non-normal distributions are discussed by O'Neill (1980). Here we consider the asymptotic variance of β̂_U under more general assumptions. It turns out that under assumptions (A), (B) and (C) defined above, the asymptotic variance of β̂_U can be expressed as a functional of U', the first derivative of U, and that the optimal U function achieving minimum asymptotic variance is closely related to f_1(w), the probability density function of w when y = 1.

Theorem 4. Under assumptions (A), (B), and (C), n_1^{1/2}(β̂_U − β_0) is asymptotically distributed as N(0, Σ_U) when n_0 → ∞ and n_1 → ∞ with a ratio n_1/n_0 = π_1/π_0, where

\[ \Sigma_U = c_U\, Q_0^{-}, \tag{18} \]

\[ c_U = \frac{E_1\{U'(w)^2\} + (\pi_1/\pi_0)\,[E_1\{U'(w)w\}]^2 + (\pi_1/\pi_0)\,[E_1\{U'(w)\}]^2}{[E_1\{U''(w)\} - E_1\{U'(w)w\}]^2}, \tag{19} \]

and where π_0 = P(Y = 0), π_1 = P(Y = 1) and Q_0^− is the generalized inverse of Q_0.

See Web Appendix D for a proof. Since the matrix Q_0 is independent of U, this shows that the problem of finding the optimal function U which minimizes the asymptotic variance Σ_U reduces to the functional minimization of the scalar c_U with respect to U'. This gives

Theorem 5. Under assumptions (A), (B), the optimal U function is

\[ U_{\mathrm{opt}}(w) = \log\frac{f_1(w)}{\phi(w, m, s^2)}, \tag{20} \]

where m = E(w) and s² = var(w) are the marginal mean and variance of w.

See Web Appendix E for a proof, in which U_opt is constructed such that (C) is satisfied. Note that U_opt(w) is not the logarithm of the likelihood ratio between f_1(w) and f_0(w) because f_0(w) = φ(w, 0, 1) ≠ φ(w, m, s²).

Remark 2. The expectation of the generalized t-statistic based on U_opt is the Kullback–Leibler divergence

\[ L_{U_{\mathrm{opt}}}(\beta) = \int f_1(w)\, \log\frac{f_1(w)}{\phi(w, m, s^2)}\, dw. \tag{21} \]

Maximizing the generalized t-statistic can therefore be thought of as maximizing the Kullback–Leibler divergence from f_1(w) to φ(w, m, s²).

Assumptions (A) and (B) are sufficient but clearly not necessary conditions for these results. An example where neither (A) nor (B) holds is the classic model of two multivariate distributions with unequal variances, where x is transformed so that p_0(x) is N(0, I_p) and p_1(x) is N(μ_1, Σ_1). In this case we consider

\[ U_{\beta_0}(w) = -(w - c_{\beta_0})^2, \tag{22} \]


where m_1 = β_0'μ_1, s_1² = β_0'Σ_1β_0, and c_{β_0} = {π_0 + π_1(s_1² + m_1²)}/(π_1 m_1). Then we have

\[ \beta_F = \frac{(\pi_0 I_p + \pi_1 \Sigma_1)^{-1}\mu_1}{\{\mu_1^\top(\pi_0 I_p + \pi_1 \Sigma_1)^{-1}\mu_1\}^{1/2}}, \tag{23} \]

where β_F is a fixed point satisfying β_F = argmax_β L_{U_{β_F}}(β) under certain regularity conditions. See Web Appendix F for a proof in the empirical version. The sample equivalent of this, found by maximizing L̂_U(β) instead of L_U(β), is just the vector of coefficients of the Fisher linear discriminant function,

\[ \hat\beta_F = \frac{(\hat\pi_0 S_0 + \hat\pi_1 S_1)^{-1}(\bar{x}_1 - \bar{x}_0)}{\{(\bar{x}_1 - \bar{x}_0)^\top(\hat\pi_0 S_0 + \hat\pi_1 S_1)^{-1}(\bar{x}_1 - \bar{x}_0)\}^{1/2}}, \tag{24} \]

where π̂_0 = n_0/n, π̂_1 = n_1/n and n = n_0 + n_1. Although the two formulae for β_0 in (2) and β_F in (23) are different, in practice we often tacitly assume the homoscedastic model (1) but use (24) as the natural sample estimate of (2), taking the pooled within-group variance as an efficient estimate of the supposedly common variance Σ_0. In this sense, if we are only interested in linear discriminant functions, the difference between the two models may be of limited practical importance.

2.3. Optimality in Terms of AUC

By concentrating on the asymptotic variance of β̂_U, the discussion so far has defined optimality in terms of asymptotic efficiency. However, it turns out that U_opt may also be optimum when judged by other measures useful in discriminant analysis, such as AUC. To see this, suppose that β̂ is any consistent estimator of β_0 with asymptotic variance n_1^{-1} var_A(β̂). Then the expected generalized t-statistic is asymptotically

\[ E\{L_U(\hat\beta)\} = L_U(\beta_0) + \frac{1}{2n_1}\,\mathrm{tr}\{H_U(\beta_0)\,\mathrm{var}_A(\hat\beta)\}, \tag{25} \]

where H_U is the Hessian matrix of L_U. If U satisfies condition (C), then H_U(β_0) is negative definite and so this implies that if β̂_1 and β̂_2 are two consistent estimators with β̂_1 asymptotically more efficient than β̂_2, then

\[ E\{L_U(\hat\beta_1)\} - E\{L_U(\hat\beta_2)\} = \frac{1}{2n_1}\,\mathrm{tr}\bigl[H_U(\beta_0)\{\mathrm{var}_A(\hat\beta_1) - \mathrm{var}_A(\hat\beta_2)\}\bigr] \ge 0 \tag{26} \]

for sufficiently large n_0 and n_1. Hence, from Theorem 5 with assumptions (A) and (B), for any consistent estimate β̂ we have

\[ E\{L_U(\hat\beta_{U_{\mathrm{opt}}})\} \ge E\{L_U(\hat\beta)\}. \tag{27} \]

For example, if we take U(w) to be Φ(w), then we have

\[ H_\Phi(\beta_0) = -2\left\{\int w\,\phi(w, 0, 1)\, f_1(w)\, dw\right\} Q_0. \tag{28} \]

But ∫ w f_1(w) dw = E_1(w) = (μ_1'μ_1)^{1/2} > 0. This implies that the integral within the brackets in (28) must also be positive because of the symmetry of φ(w, 0, 1). Hence from (27) the estimator β̂_{U_opt} is also asymptotically optimal in the sense of maximizing AUC. Optimality in this sense extends to any function U satisfying condition (C). It is also noteworthy that this optimality holds for the extended logistic model in (14) because (27) holds for any estimate β̂ satisfying assumptions (A) and (B). This discussion is parallel to that on the asymptotic error rate of discriminant functions given by Efron (1975) and O'Neill (1980).

If assumptions (A) and (B) are not satisfied then in general β_U = argmax_β L_U(β) ≠ β_0 and so such a unified understanding of estimation performance may no longer hold. However, the example of the quadratic function (22) suggests that reasonable estimation may still be possible even if both assumptions (A) and (B) fail. If π_0 = π_1 then β_F in (23) is the coefficient vector of the linear function which is best in the sense of maximizing AUC under normality assumptions with heterogeneous variances Σ_1 ≠ I_p (Su and Liu, 1993). Moreover, there is a close relationship with the likelihood ratio as, up to constant factors,

\[ \beta_F = \frac{\partial}{\partial x}\left[\log\frac{\phi(x, \mu_1, \Sigma_1)}{\phi(x, 0, I_p)}\right]_{x = x_0}, \tag{29} \]

where x_0 = (I_p + Σ_1)^{-1}μ_1.

3. Computation of Optimal Generalized t-Statistic Estimates

If we have a parametric model for p_1(x) and reasonably tractable formulae for the terms f_1(w), m and s² in (20) then maximizing L̂_{U_opt} may be straightforward using a standard numerical optimization routine. However, note that U_opt in (20) depends on β_0, which is unknown, so we have to estimate Û and to maximize it to give β̂ at the same time. For example, suppose we want to estimate β_0 using the quadratic U-function (22). Then a natural fixed point algorithm would sequentially update the sequence β^(t) by

\[ \beta^{(t)} = \operatorname*{argmax}_{\beta} \hat{L}_{\hat{U}_{\beta^{(t-1)}}}(\beta), \tag{30} \]

where

Û_{β^(t−1)}(w) = −(w − ĉ_{t−1})²,
ĉ_{t−1} = [π̂_0 + π̂_1{ŝ²_{1,t−1} + m̂²_{1,t−1}}]/{π̂_1 m̂_{1,t−1}},
m̂_{1,t−1} = β^(t−1)'(x̄_1 − x̄_0)/{β^(t−1)'S_0β^(t−1)}^{1/2},
ŝ²_{1,t−1} = β^(t−1)'S_1β^(t−1)/{β^(t−1)'S_0β^(t−1)}.

It can be shown directly that the sequence β^(t) converges to the Fisher coefficients in (24), as shown in Web Appendix F. Incidentally, using this method we can estimate the slope of the Fisher linear discriminant function without any computation of the matrix inverse of (π̂_0 S_0 + π̂_1 S_1).
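To make the fixed-point recipe (30) concrete, the sketch below iterates the quadratic-U update on two data matrices x0 (controls) and x1 (cases) and can be compared against the closed-form Fisher coefficients (24). It is an illustrative Python version, not the authors' R implementation; the inner maximization simply uses a general-purpose optimizer, which is one of several reasonable choices.

```python
import numpy as np
from scipy.optimize import minimize

def fisher_direct(x0, x1):
    """Fisher coefficients (24)."""
    n0, n1 = len(x0), len(x1)
    pi0, pi1 = n0 / (n0 + n1), n1 / (n0 + n1)
    d = x1.mean(axis=0) - x0.mean(axis=0)
    W = pi0 * np.cov(x0.T) + pi1 * np.cov(x1.T)
    b = np.linalg.solve(W, d)
    return b / np.sqrt(d @ b)

def fisher_by_fixed_point(x0, x1, T=20):
    """Iterate (30) with the quadratic U of (22); the limit should agree with (24)."""
    n0, n1 = len(x0), len(x1)
    pi0, pi1 = n0 / (n0 + n1), n1 / (n0 + n1)
    xbar0, xbar1 = x0.mean(axis=0), x1.mean(axis=0)
    S0, S1 = np.cov(x0.T), np.cov(x1.T)

    beta = np.linalg.solve(S0, xbar1 - xbar0)             # starting value
    for _ in range(T):
        m1 = beta @ (xbar1 - xbar0) / np.sqrt(beta @ S0 @ beta)
        s1sq = beta @ S1 @ beta / (beta @ S0 @ beta)
        c = (pi0 + pi1 * (s1sq + m1 ** 2)) / (pi1 * m1)   # c-hat_{t-1}

        def neg_criterion(b, c=c):
            # minus the sample statistic (9) with U(w) = -(w - c)^2, c held fixed
            w = (x1 - xbar0) @ b / np.sqrt(b @ S0 @ b)
            return np.mean((w - c) ** 2)

        beta = minimize(neg_criterion, beta, method="Nelder-Mead").x
    return beta / np.sqrt(beta @ S0 @ beta)               # normalize as in the text
```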

More generally, we can estimate β̂_{U_opt} non-parametrically by using a non-parametric estimate of the density f_1(w) required in Theorem 5. For this we have to maximize the generalized t-statistic L̂_U(β) while simultaneously estimating the function U_opt. This can be done using the following fixed point algorithm:

(1) Set β^(1) = S_0^{−1}(x̄_1 − x̄_0).
(2) For t = 2, ..., T,
    (a) Estimate Û_{β^(t−1)}(w) = log{f̂_{1,t−1}(w)/φ(w, m̂_{t−1}, ŝ²_{t−1})} based on β^(t−1).
    (b) Update β^(t−1) to β^(t) by

\[ \beta^{(t)} = \operatorname*{argmax}_{\beta} \frac{1}{n_1}\sum_{j=1}^{n_1} \hat{U}_{\beta^{(t-1)}}\!\left\{\frac{\beta^\top(x_{1j} - \bar{x}_0)}{(\beta^\top S_0 \beta)^{1/2}}\right\}. \tag{31} \]

(3) Output β̂_{U_opt} = β^(T)/[{β^(T)}'S_0β^(T)]^{1/2}.

Note that the initial value β^(1) in step 1 above could be replaced by any other value, so avoiding the need to calculate the inverse of S_0. Or we can add a positive constant to the diagonal of S_0 before inversion, in a similar way to the estimation in ridge regression (Hoerl and Kennard, 1970). This suggests that the algorithm works even in settings such as p > n_0 and p > n_1. After computing, for j = 1, ..., n_1, the scores

w_{1j,t−1} = β^(t−1)'(x_{1j} − x̄_0)/{β^(t−1)'S_0β^(t−1)}^{1/2}

in the same way as in (30), we estimate f̂_{1,t−1}(w) using a kernel density estimation method such as

\[ \hat{f}_{1,t-1}(w) = \frac{1}{n_1}\sum_{j=1}^{n_1} \phi(w, w_{1j,t-1}, \hat{h}_{t-1}), \tag{32} \]

where the bandwidth ĥ_{t−1} is estimated using an asymptotic mean integrated squared error criterion (Duong and Hazelton, 2003). The form of f̂_{1,t−1}(w) determines that of the estimated optimal U function. For example, if f̂_{1,t−1}(w) is close to a normal density then Û_{β^(t−1)}(w) would approximate the quadratic U function defined in (22). In general, the form of the estimated optimal U function is complicated, typically multimodal if p_1(x) is a mixture distribution such as (10) or (15). Once we have the estimate Û_{β^(t−1)} we estimate β^(t) in (31) by direct numerical optimization. See Web Appendix L for a listing of the R-code we have used to implement all of these calculations.
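A compact version of the fixed-point algorithm above, with the kernel estimate (32) plugged into U_opt of (20), might look as follows. This Python sketch makes two simplifying assumptions that depart from the paper: the bandwidth uses Silverman's rule rather than the plug-in criterion of Duong and Hazelton (2003), and m̂ and ŝ² are taken as the pooled sample mean and standard deviation of the scores; step (b) is handled by a generic optimizer. The authors' own implementation is the R code of Web Appendix L.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def estimate_beta_uopt(x0, x1, T=10):
    """Fixed-point algorithm of Section 3 with U_opt estimated via a Gaussian KDE."""
    xbar0 = x0.mean(axis=0)
    S0 = np.cov(x0.T)
    n1 = len(x1)

    def scores(b, x):
        # standardized scores beta'(x - xbar0) / (beta' S0 beta)^{1/2}
        return (x - xbar0) @ b / np.sqrt(b @ S0 @ b)

    beta = np.linalg.solve(S0, x1.mean(axis=0) - xbar0)        # step (1)
    for _ in range(T):                                          # step (2)
        w1 = scores(beta, x1)
        h = 1.06 * w1.std() * n1 ** (-0.2)                      # Silverman bandwidth (assumption)
        w_all = np.concatenate([scores(beta, x0), w1])
        m, s = w_all.mean(), w_all.std()                        # marginal mean and sd of w (assumption)

        def U_hat(w, w1=w1, h=h, m=m, s=s):
            # step (a): U(w) = log{f1_hat(w)} - log{phi(w, m, s^2)}, with f1_hat as in (32)
            f1 = norm.pdf((w[:, None] - w1[None, :]) / h).mean(axis=1) / h
            return np.log(f1) - norm.logpdf(w, loc=m, scale=s)

        # step (b): maximize the sample criterion (31) by direct numerical optimization
        beta = minimize(lambda b: -U_hat(scores(b, x1)).mean(), beta,
                        method="Nelder-Mead").x

    return beta / np.sqrt(beta @ S0 @ beta)                     # step (3)
```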

4. The U-Lasso

To deal with multiple variables or to rank them appropriately, the lasso (Tibshirani, 1996) and other variants (Zou and Hastie, 2005; Yuan and Lin, 2006; Zou, 2006) are widely employed because they automatically select the useful variables by L_1-regularization. Similarly, we can define the U-lasso by adding an L_1-regularization term to the generalized t-statistic to give

\[ \hat{L}^{\lambda}_U(\beta) = \hat{L}_U(\beta) - \lambda \sum_{k=1}^{p} |\beta_k|, \tag{33} \]

where β = (β_1, ..., β_p)', L̂_U(β) is defined in (9) and λ is a nonnegative parameter that controls the shrinkage of β. As with the lasso, the absolute value of β_k gets smaller as λ increases. However, for the U-lasso we need another constraint because of the scale invariance of L̂_U(β). Previously we resolved this by arbitrarily scaling β so that β'S_0β = 1 (like L_2 regularization). But now λ is also involved, since L̂^{cλ}_U(β) = L̂^{λ}_U(cβ) for any positive scalar c > 0. Following Ma and Huang (2005) and Wang et al. (2007) we avoid this unidentifiability of β by choosing an anchor variable x_{k*} and fixing |β_{k*}| = 1 (a device also commonly used in AUC analysis). We can assume k* = 1 without loss of generality.

To estimate β in (33), we take the gradient ascent approach employed by Goeman (2010), where the gradient g(β) = (0, g_2(β), ..., g_p(β))' is defined by

\[ g_k(\beta) = \begin{cases} \dfrac{\partial \hat{L}_U(\beta)}{\partial \beta_k} - \lambda\,\mathrm{sign}(\beta_k) & \text{if } \beta_k \ne 0, \\[2ex] \dfrac{\partial \hat{L}_U(\beta)}{\partial \beta_k} - \lambda\,\mathrm{sign}\!\left\{\dfrac{\partial \hat{L}_U(\beta)}{\partial \beta_k}\right\} & \text{if } \beta_k = 0 \text{ and } \left|\dfrac{\partial \hat{L}_U(\beta)}{\partial \beta_k}\right| > \lambda, \\[2ex] 0 & \text{otherwise}, \end{cases} \tag{34} \]

for k = 2, ..., p, where the sign function is given by sign(u) = 1 if u > 0, 0 if u = 0 and −1 otherwise. The second formula in (34) determines the active set of variables whose coefficients are non-zero. To keep the sign of β_k unchanged with a fixed value of λ, we consider the range of optimization for a scalar ρ in the gradient ascent method such as

\[ \rho_{\mathrm{edge}}(\beta) = \min_{k=2,\ldots,p}\left\{ -\frac{\beta_k}{g_k(\beta)} \;\middle|\; \mathrm{sign}(\beta_k) = -\mathrm{sign}\{g_k(\beta)\} \ne 0 \right\}. \tag{35} \]

Based on g(β) and ρ_edge(β), the anchor-based algorithm for the U-lasso is given as follows (a code sketch follows the list).

(1) Fix β_1 = 1 or β_1 = −1 and initialize β^(1) with (β_2, ..., β_p)' = 0.
(2) For t = 2, ..., T,
    (a) Estimate Û(w) based on β^(t−1).
    (b) Update β^(t) = β^(t−1) + ρ_opt g(β^(t−1)), where

\[ \rho_{\mathrm{opt}} = \operatorname*{argmax}_{0 \le \rho \le \rho_{\mathrm{edge}}(\beta^{(t-1)})} \hat{L}^{\lambda}_{\hat{U}}\{\beta^{(t-1)} + \rho\, g(\beta^{(t-1)})\}. \tag{36} \]

(3) Output β̂^λ_U = β^(T).
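One step of the anchor-based algorithm above can be sketched as follows. This Python fragment is illustrative only: the partial derivatives in (34) are taken numerically rather than analytically, the fallback step range when no coefficient is about to change sign is our own choice, and `LU` stands for whatever sample criterion (9) is being used with the current Û; none of this is the authors' R code.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ulasso_step(beta, lam, LU, eps=1e-6):
    """One update of the U-lasso: gradient (34), edge (35), line search (36).
    beta[0] is the anchor coefficient, held fixed at +1 or -1."""
    p = len(beta)
    # numerical partial derivatives of LU for k = 2, ..., p (indices 1, ..., p-1 here)
    grad = np.zeros(p)
    for k in range(1, p):
        e = np.zeros(p); e[k] = eps
        grad[k] = (LU(beta + e) - LU(beta - e)) / (2 * eps)

    g = np.zeros(p)                                    # soft-thresholded gradient (34)
    for k in range(1, p):
        if beta[k] != 0:
            g[k] = grad[k] - lam * np.sign(beta[k])
        elif abs(grad[k]) > lam:
            g[k] = grad[k] - lam * np.sign(grad[k])

    # (35): largest step before an active coefficient would change sign
    hits = [-beta[k] / g[k] for k in range(1, p)
            if g[k] != 0 and np.sign(beta[k]) * np.sign(g[k]) == -1]
    rho_edge = min(hits) if hits else 1.0              # fallback range is our assumption

    # (36): line search for rho on the penalized criterion (33)
    penalized = lambda rho: -(LU(beta + rho * g) - lam * np.abs(beta + rho * g).sum())
    rho_opt = minimize_scalar(penalized, bounds=(0.0, rho_edge), method="bounded").x
    return beta + rho_opt * g
```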


This updating process is repeated for different values of λ such as λ^(1) > λ^(2) > ... > λ^(n_λ) = 0, where λ^(1) = max_{2≤k≤p} |∂L̂_U(β^(1))/∂β_k| and n_λ is appropriately determined accounting for the number of variables p. Each time, the resultant estimate β̂^{λ^(ℓ)}_U is used as a starting value for the estimation of β̂^{λ^(ℓ+1)}_U for ℓ = 1, ..., n_λ − 1. We follow Meier, van de Geer, and Bühlmann (2008) by using multiplicative grids between λ^(1) and λ^(n_λ).

5. Examples Showing the Performance of Optimal-U and U-Lasso

5.1. Estimation Accuracy of β̂_U

5.1.1. Asthmatic Data. In this section we return to the motivating example already mentioned in Section 1, a set of microarray data in which measurements of allergen markers for asthmatic patients are compared with those for a sample of controls (Dottorini et al., 2011). These data can be accessed from Gene Expression Omnibus (GEO), National Center for Biotechnology Information (NCBI), accession number GSE20020. Here we illustrate the performance of β̂_U for the pair of variables (D03, X907) where there is marked non-normality of the cases distribution p_1(x). As discussed in Section 2 for the case of population distributions p_0(x) and p_1(x), there is similarly no loss of generality here if we first linearly transform the data so that x̄_0 = 0 and S_0 = I_2. We consider a fitted three-component normal mixture as described in Section 2,

\[ x_0 \sim N(0, I_2), \qquad x_1 \sim (1 - \hat\varepsilon_1 - \hat\varepsilon_2)\, N(0, I_2) + \hat\varepsilon_1 N(\hat\nu_1, \hat{V}_1) + \hat\varepsilon_2 N(\hat\nu_2, \hat{V}_2), \tag{37} \]

where (ν̂_1, ν̂_2, V̂_1, V̂_2, ε̂_1, ε̂_2) are estimated by maximizing the likelihood based on (37) with a constraint that ν_k ∝ μ_1 and V_k = Σ_{j=1}^{2} α_{jk} e_j e_j', where e_1 = (1, 0)' and e_2 = (0, 1)'. The constraint means that p_1(x) satisfies assumption (A) but not necessarily assumption (B). The first mixture component of p_1(x) is the control distribution and so (37) can be thought of as a generalization of the mislabel model (10).

The upper panel in Figure 2 illustrates the fitted model for the chosen allergen pair, using the plotting co-ordinates w = β_0'x (horizontal axis) and its orthogonal complement g = Q_0x (vertical axis). Here, β_0 is the value of (2) for the fitted model. The figure shows scatter plots for the case and control samples with corresponding contour plots of the fitted model. The control data (gray circles) are reasonably consistent with the standard bivariate normal distribution (gray contours), but the case data (black crosses) show strong clustering in line with the contours (black lines) of the mixture model. With this choice of plotting axes, assumption (A) means that the conditional mean of g lies on the horizontal line g = 0 for all fixed values of w, and condition (B) means that the variance in the vertical direction must be the same for all horizontal values. Symmetry of the contours about the line g = 0 confirms (A), also reasonably consistent with the data, but in this case it is clear that neither data nor the fitted model are consistent with (B).

We illustrate the performance of the generalized t-statistic by simulating independent data sets from the fitted model (37) using the same sample sizes as in the actual data (a parametric bootstrap). Firstly, the bold black curve in the lower panel in Figure 2 is an example of an estimated U_opt, which follows closely the dashed curve, the corresponding optimum function U for the fitted model. Both curves capture the heterogeneity of the asthmatic cases shown by the bar plots along the top (controls) and bottom (cases) of the graph. The other curves shown on the plot are the straight line for the usual t-statistic, −(w − ĉ_0)² for the Fisher linear discriminant function, and Φ(w) for U_auc. In some cases the U functions have been arbitrarily scaled to fit within the vertical scale of the graph.

Taking the fitted model in (37) as if it were the true model gives us the 'true' value of β_0 and hence allows us to simulate the squared error (β̂ − β_0)'(β̂ − β_0)/p of our method for various choices of U. We compare six different estimates: the estimate given by Û_opt derived from (31) and using the algorithm of Section 3; U(w) = Φ(w), suggested earlier for maximizing AUC (U_auc); the coefficients of the Fisher linear discriminant function (fisher), equivalent to using the quadratic U discussed earlier (U_quad); the usual t-statistic (U_linear with U = I) estimating (4); logistic regression (glm); and a support vector machine using linear kernels (svm). The resulting sample mean squared errors (×10³) based on 1000 independent data sets were respectively 1.2 (1.3), 33.4 (36.4), 8.5 (11.1), 3.4 (4.1), 8.2 (10.8), and 22.0 (27.9) (with their corresponding simulation standard errors in parentheses). In this case Û_opt, by explicitly modeling the marked non-normality of the y = 1 sample, clearly outperforms the other methods in terms of mean squared error.

Section 2.3 shows that, under the assumptions (A), (B), and (C), estimates which are optimum in terms of mean squared error will also be asymptotically optimum in terms of AUC. For model (37), the AUC of the discriminant function β'x is

\[ \mathrm{AUC}(\beta) = \frac{1 - \hat\varepsilon_1 - \hat\varepsilon_2}{2} + \hat\varepsilon_1\, \Phi\!\left[\frac{\beta^\top \hat\nu_1}{\{\beta^\top(I_2 + \hat{V}_1)\beta\}^{1/2}}\right] + \hat\varepsilon_2\, \Phi\!\left[\frac{\beta^\top \hat\nu_2}{\{\beta^\top(I_2 + \hat{V}_2)\beta\}^{1/2}}\right]. \tag{38} \]

This similarly allows us to compare the sample averages of AUC(β̂) for the different methods. The observed average values and their standard errors (×10⁴) were respectively 0.7408 (0.26), 0.7393 (1.8), 0.7404 (4.9), 0.7405 (1.7), 0.7405 (4.7), and 0.7398 (13.5). Again Û_opt is best in terms of maximizing AUC, although the smoothing inherent in the AUC measure shows that on this scale there is very little to choose between them. Further simulation studies based on using the parametric bootstrap with larger values of p are reported in Web Appendix G.
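Formula (38) is easy to evaluate for any candidate direction: each non-control component of the fitted mixture contributes a normal probability. A minimal Python sketch, taking the fitted parameters of (37) as inputs (the names are ours):

```python
import numpy as np
from scipy.stats import norm

def auc_mixture(beta, eps1, eps2, nu1, nu2, V1, V2):
    """AUC(beta) of (38) under the fitted three-component mixture (37)."""
    I2 = np.eye(2)
    def component(nu, V):
        return norm.cdf(beta @ nu / np.sqrt(beta @ (I2 + V) @ beta))
    return (1 - eps1 - eps2) / 2 + eps1 * component(nu1, V1) + eps2 * component(nu2, V2)
```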



Figure 2. Standardized scatter plots and contour plots for D03 and X907. Non-asthmatic samples are denoted by gray circles, asthmatic samples by black crosses. The x-axis and y-axis are values of w = β_0'x and g = Q_0x, respectively. The corresponding U functions are shown in the lower panel, where the gray bar plot indicates the non-asthmatic observations and the black one the asthmatic observations.

5.1.2. Synthetic Data. To further demonstrate the performance of the generalized t-statistic we look at three other simulation settings similar to (37) but with different choices of the parameters:

\[ x_0 \sim N(0, I_p), \qquad x_1 \sim (1 - \varepsilon_1 - \varepsilon_2)\, N(\nu_0, V_0) + \varepsilon_1 N(\nu_1, V_1) + \varepsilon_2 N(\nu_2, V_2), \tag{39} \]

where ε_1 = ε_2 = 0.1, ν_0 = (−1, −0.1, ..., −0.1)' ∈ R^p, p = 10 and n = 200. The three settings are:

1. ν_1 = (1, 0.1, ..., 0.1)', ν_2 = (3, 0.3, ..., 0.3)' ∈ R^p, V_0 = V_1 = V_2 = I_p and π_0 = π_1.
2. ν_1 = (1, 0.1, ..., 0.1)', ν_2 = (3, 0.3, ..., 0.3)' ∈ R^p, V_0 = I_p, V_1 = 2I_p, V_2 = I_p/2 and π_0 = 4π_1.
3. ν_1 = {1, 0.1, 0.2, ..., 0.1 × (p − 1)}', ν_2 = {3, 0.3, 0.4, ..., 0.1 × (p + 2)}' ∈ R^p, V_0 = V_1 = V_2 = T_p and π_0 = π_1, where T_p is a Toeplitz matrix with elements t_ij = 1/|i − j|² (i = 1, ..., p; j = 1, ..., p).

Values of AUC(β̂) were calculated as in (38). Both assumptions (A) and (B) hold for setting 1, but only assumption (A) for setting 2. In setting 3 neither assumption (A) nor (B) holds. Boxplots of the sample squared errors of β̂ and of the sample AUCs are shown in Figure 3, although the squared errors are omitted in setting 3 because there is no consistency to the target parameter β_0 in this case. As a whole, optimal-U works well for all three simulation settings, especially in settings 1 and 2 where optimal-U clearly outperforms the others in terms of both squared error and AUC. In setting 3, the auc-U is marginally better.


Figure 3. The squared errors and AUCs for Uopt , Uauc , fisher, Ulinear , glm and svm in simulation settings 1, 2, and 3.


Table 1
Average values of the true discovery rate (TDR) and the false discovery rate (FDR) for 100 repetitions

Sample size        Method            TDR                          FDR
                                     p=50    p=100   p=500        p=50    p=100   p=500
n0=200, n1=200     Uopt-lasso        0.700   0.554   0.390        0.537   0.704   0.914
                   Uquad-lasso       0.528   0.446   0.344        0.817   0.911   0.977
                   Uauc-lasso        0.406   0.354   0.282        0.844   0.925   0.977
                   Ulinear-lasso     0.624   0.540   0.418        0.654   0.801   0.934
                   glm-lasso         0.584   0.520   *            0.767   0.869   *
                   glm-elastic-net   0.638   0.552   0.344        0.670   0.816   0.963
n0=500, n1=200     Uopt-lasso        0.846   0.750   0.586        0.254   0.433   0.693
                   Uquad-lasso       0.692   0.608   0.472        0.586   0.718   0.921
                   Uauc-lasso        0.510   0.434   0.314        0.809   0.893   0.974
                   Ulinear-lasso     0.788   0.698   0.536        0.399   0.538   0.811
                   glm-lasso         0.730   0.670   0.538        0.531   0.645   0.862
                   glm-elastic-net   0.726   0.662   0.522        0.592   0.674   0.863
n0=200, n1=500     Uopt-lasso        0.846   0.746   0.528        0.313   0.493   0.789
                   Uquad-lasso       0.546   0.498   0.378        0.758   0.877   0.972
                   Uauc-lasso        0.458   0.432   0.334        0.815   0.900   0.968
                   Ulinear-lasso     0.722   0.672   0.512        0.504   0.643   0.868
                   glm-lasso         0.650   0.610   0.490        0.674   0.780   0.910
                   glm-elastic-net   0.686   0.606   0.504        0.662   0.755   0.915
n0=500, n1=500     Uopt-lasso        0.974   0.964   0.896        0.045   0.078   0.263
                   Uquad-lasso       0.806   0.734   0.586        0.475   0.613   0.878
                   Uauc-lasso        0.602   0.530   0.412        0.778   0.849   0.960
                   Ulinear-lasso     0.916   0.886   0.744        0.139   0.219   0.549
                   glm-lasso         0.854   0.814   0.688        0.363   0.445   0.744
                   glm-elastic-net   0.868   0.822   0.746        0.265   0.432   0.647

5.2. Variable Selection Using the U-Lasso

In this section we give a simulation example of variable selection by U-lasso in the high dimensional settings p = 50, 100, and 500. In each case assume that there are only five significant variables, following the normal mixture model

\[ x_0 \sim N(0, I_5), \qquad x_1 \sim (1 - \varepsilon_1 - \varepsilon_2)\, N(0, I_5) + \varepsilon_1 N(\nu_1, I_5) + \varepsilon_2 N(\nu_2, I_5), \tag{40} \]

with ε_1 = ε_2 = 0.1, ν_1 = (1, 1, 1, 1, 1)' and ν_2 = 1.5ν_1. The remaining (p − 5) variables are completely uninformative with the same distribution N(0, I_{p−5}) for both cases and controls. For each sample simulated from this model we use the U-lasso to select the top five variables. As in Meier et al. (2008) this gives the true discovery rate (TDR), defined here as the proportion of the top five variables which are amongst the five truly significant variables in the model. We can also select variables sequentially until the five truly significant variables have all been chosen, and then define the false discovery rate (FDR) as the proportion of the selected variables which are non-significant. For each simulation we choose the anchor variable to be the variable which gives the largest empirical AUC.

For comparison purposes we also show corresponding values of TDR and FDR for the L_1-regularization path algorithm for logistic regression taken from the generalized linear model work of Park and Hastie (2007) and Goeman (2010). The objective function to be maximized is

\[ \frac{1}{n}\sum_{i=1}^{n_0+n_1} \log\{p_i^{y_i}(1 - p_i)^{1 - y_i}\} - \lambda_1 \sum_{k=1}^{p} |\beta_k| - \lambda_2 \sum_{k=1}^{p} \beta_k^2, \tag{41} \]

where, for the ith observation, logit(p_i) = α + β'x_i and y_i ∈ {0, 1} is the class label. Two versions are of interest: glm-lasso with λ_2 = 0 (L_1 penalty only), and glm-elastic-net with both L_1 and L_2 penalties. The λ's are chosen by cross-validation in the same way as Zou and Hastie (2005): a fine grid search for λ_1 and, in the case of the glm-elastic-net, a search amongst λ_2 = 0, 0.01, 0.1, 1, 10, and 100.

Simulation results for various sample sizes are summarized in Table 1. When n_0 = n_1 = 200 there are only quite small differences between the methods, with Uopt-lasso giving the largest TDR when p = 50 but Ulinear-lasso marginally best when p = 500. The mark * in Table 1 means that the glm-lasso implemented by the R package in Goeman (2010) fails when the sample sizes are too small relative to p = 500. When n_0 = n_1 = 500, the performance of Uopt-lasso is noticeably better than all the other methods. Uopt-lasso shows the smallest FDR for all settings. These results suggest that the argument which we used in Section 2.2 for defining U_opt in terms of minimizing asymptotic variance also leads to stable variable selection in the corresponding lasso version. See Web Appendix H for further results allowing for correlations amongst the variables in (40). Of the models studied we find that Uopt-lasso continues to give the highest TDR in most cases and the smallest FDR in all cases.
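The two selection error rates can be made concrete with a small simulation from (40). In the sketch below the variables are ranked by their univariate empirical AUCs; this is only a stand-in for the U-lasso path actually evaluated in Table 1, and the helper names are ours.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def simulate_model40(n0, n1, p, rng):
    """Model (40): five informative variables, the remaining p - 5 pure noise."""
    eps1 = eps2 = 0.1
    nu1 = np.ones(5)
    x0 = rng.standard_normal((n0, p))
    x1 = rng.standard_normal((n1, p))
    comp = rng.choice(3, size=n1, p=[1 - eps1 - eps2, eps1, eps2])
    x1[comp == 1, :5] += nu1          # component N(nu1, I5)
    x1[comp == 2, :5] += 1.5 * nu1    # component N(nu2, I5), nu2 = 1.5 nu1
    return x0, x1

rng = np.random.default_rng(1)
x0, x1 = simulate_model40(200, 200, 50, rng)

# rank variables by univariate empirical AUC (a stand-in for the U-lasso ranking)
n0, n1 = len(x0), len(x1)
auc = np.array([mannwhitneyu(x1[:, k], x0[:, k]).statistic / (n0 * n1)
                for k in range(x0.shape[1])])
order = np.argsort(-auc)

tdr = np.isin(order[:5], np.arange(5)).mean()              # TDR: informative fraction of the top five
n_selected = 1 + np.where(np.isin(order, np.arange(5)))[0].max()
fdr = 1 - 5 / n_selected                                   # FDR: non-significant fraction once all five are chosen
print(tdr, fdr)
```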


6. Practical Applications of U-Lasso

The first example in Section 5 discussed a mixture model for a specific pair of variables in the asthmatic data set of Dottorini et al. (2011). Now we look at these data more generally to investigate the classification performance of U-lasso. We consider six groups of variables labeled D, X, G, T, M, and F which were identified by Dottorini et al. (2011) as relevant for differentiating gene reactivity between asthmatic and non-asthmatic individuals.

Figure 4 shows the path solution for Uopt-lasso using the variables in the X group. This plots values of β̂^λ_U, defined just after (36) in the anchor-based algorithm of Section 4, against values of λ. The anchor variable (selected because it gives the largest AUC) is X904, hence for this variable β_1 = 1 for all λ as shown. The graph shows how the number of variables with non-zero coefficients decreases from 8 (the total number in the group) to 1 (the anchor variable alone) as λ increases.

Figure 4. Solution paths of Uopt-lasso for the variables in the X group.

The classification performance of U-lasso for different choices of U is shown in Table 2, together with corresponding results for glm-lasso and glm-elastic-net. For each subset of variables, the data are divided in the ratio of 2 to 1 into two separate samples, the training data and the test data. Estimation of β̂^λ_U is done by using only the training data, with the tuning parameter λ determined by fivefold cross-validation, leading to the corresponding value of training AUC. Test AUC is then the value of AUC for β̂^λ_U calculated for the test data. This process is repeated 100 times for each method, resulting in the figures quoted for each group in Table 2: the average values of training AUC, test AUC, the number of selected variables and the total number of variables. We observe that the performance of Uopt-lasso is comparable with the others despite the fact that assumptions (A) and (B) are unlikely to be satisfied for any of these data sets. The table suggests that U_quad is best in the D group, U_auc best in the X and M groups and U_linear best in the T group. For the F group, with the largest number of variables, the slightly smaller value of test AUC for Uopt-lasso needs to be set against the fifth column of the table, which shows that the classifier using Uopt-lasso uses by far the smallest average number of variables.

See Web Appendix I for a second application of U-lasso using the expression profiles of 863 microRNAs discussed by Keller et al. (2011).

7. An Extended Version of the Generalized t-Statistic

We have assumed throughout that only linear discriminant functions of the form β'x are of interest. More generally, a discriminant function can be any scalar function of x, F(x) say. If there is a one-to-one relation between F(x) and the log likelihood ratio log{p_1(x)/p_0(x)} then the Neyman–Pearson fundamental lemma shows that classifying the sample using F(x) will have strong optimal properties, for instance giving the maximum possible value of AUC (Eguchi and Copas, 2002). A simple example is when p_0(x) and p_1(x) are both multivariate normal but with different variances, in which case the log likelihood ratio is a quadratic and not a linear function of x. The generalized t-statistic formulation extends directly to the more general discriminant function F(x) by defining

\[ L_U(F) = \int U\!\left\{\frac{F(x) - \mu_0(F)}{\sigma_0(F)}\right\} p_1(x)\, dx, \]

where μ_0(F) = ∫ F(x)p_0(x) dx and σ_0(F) = [∫ {F(x) − μ_0(F)}² p_0(x) dx]^{1/2}. Then we have the following theorem.

Theorem 6. Let F* = argmax_F L_U(F). Then the necessary and sufficient condition that there exists a one-to-one correspondence between F*(x) and log{p_1(x)/p_0(x)} is that

\[ U'(w) - w\,U''(w) > 0 \quad \text{for all } w \in \mathbb{R}. \tag{42} \]

See Web Appendix J for a proof. We observe that (42) is a condition on the function U, relating to an arbitrary discriminant function F(x) without any assumptions on p_0(x) and p_1(x). By contrast, assumptions (A) and (B) describe conditions on p_1(x) which relate to a linear function β'x without any assumptions on U. If assumptions (A), (B), and the normality for the control group are all satisfied, then U_opt is optimal in the sense of asymptotic variance and β̂_{U_opt} is best in terms of the value of AUC amongst linear discriminant functions. However, if any one of the above assumptions fails, then there is no longer guaranteed optimality of asymptotic variance or AUC. In that case, the condition (42) becomes relevant to guarantee the classification performance of the discriminant function F(x). It is easily shown that U_linear, U_auc, and U_quad all satisfy (42). However, the sign of (42) for U_opt depends on f_1(w). One of the questions for future research is whether restricting U_opt to satisfy (42) would lead to a useful improvement in classification performance.
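The claim that U_linear, U_auc and U_quad satisfy (42) is easy to spot-check numerically on a grid of w; for U_quad we use a positive centering constant, as implied by c_{β_0} > 0 when m_1 > 0. The Python snippet below is only an illustration of this check, not a proof.

```python
import numpy as np
from scipy.stats import norm

w = np.linspace(-6.0, 6.0, 2001)
c = 2.0  # any positive centering constant for the quadratic U

# U'(w) - w U''(w) for each generator function
condition_42 = {
    "U_linear": np.ones_like(w),                         # U' = 1, U'' = 0
    "U_auc": norm.pdf(w) + w ** 2 * norm.pdf(w),         # U' = phi(w), U'' = -w phi(w)
    "U_quad": -2.0 * (w - c) + 2.0 * w,                  # U' = -2(w - c), U'' = -2
}
for name, values in condition_42.items():
    print(name, "satisfies (42) on the grid:", bool(np.all(values > 0)))
```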


Table 2
Mean values of training AUC, test AUC, and the number of selected variables for U-lasso, glm-lasso, and glm-elastic-net based on 100 repetitions

Group   Method             Training AUC   Test AUC   No. of variables   Total No.
D       Uopt-lasso         0.821          0.821      6.10               7
        Uquad-lasso        0.836          0.832      6.64
        Uauc-lasso         0.837          0.830      6.45
        Ulinear-lasso      0.831          0.827      4.11
        glm-lasso          0.837          0.832      6.21
        glm-elastic-net    0.837          0.832      6.36
X       Uopt-lasso         0.708          0.702      6.25               8
        Uquad-lasso        0.772          0.757      7.81
        Uauc-lasso         0.774          0.758      7.9
        Ulinear-lasso      0.739          0.727      6.77
        glm-lasso          0.770          0.757      6.84
        glm-elastic-net    0.771          0.758      7.17
G       Uopt-lasso         0.665          0.654      7.63               11
        Uquad-lasso        0.683          0.649      8.51
        Uauc-lasso         0.681          0.642      6.64
        Ulinear-lasso      0.672          0.650      7.28
        glm-lasso          0.675          0.662      3.78
        glm-elastic-net    0.676          0.659      4.7
T       Uopt-lasso         0.568          0.552      6.56               11
        Uquad-lasso        0.618          0.570      9.45
        Uauc-lasso         0.619          0.561      9.57
        Ulinear-lasso      0.616          0.580      9.73
        glm-lasso          0.601          0.565      5.43
        glm-elastic-net    0.611          0.569      7.65
M       Uopt-lasso         0.560          0.543      6.16               10
        Uquad-lasso        0.627          0.585      9.73
        Uauc-lasso         0.629          0.591      9.71
        Ulinear-lasso      0.612          0.573      9.45
        glm-lasso          0.599          0.558      5.86
        glm-elastic-net    0.620          0.582      8.72
F       Uopt-lasso         0.607          0.601      4.89               30
        Uquad-lasso        0.680          0.620      23.85
        Uauc-lasso         0.681          0.615      24.61
        Ulinear-lasso      0.665          0.613      19.95
        glm-lasso          0.650          0.622      6.59
        glm-elastic-net    0.658          0.622      8.93

8. Discussion

In this article we focus on the heterogeneity or variability of a sample of cases (y = 1), where the distribution of a sample of controls (y = 0) is assumed to be homogeneous. This situation is seen in Figure 1 and illustrated in a practical example in Figure 2. The assumption that the distribution of x given y = 1 is heterogeneous is widely recognized and applied to detect outliers in microarray data analysis (Tibshirani and Hastie, 2007; Wu, 2007; Lian, 2008). This idea has also been employed in the classification framework. MammaPrint (van't Veer et al., 2002), developed for classification between tumors of good and poor prognoses, uses a correlation derived from the average profile of the good prognosis sample (y = 0). More recently, Bravo et al. (2012) proposed the anti-profiles method to take advantage of gene expression heterogeneity, claiming good classification performance for several colon cancer data sets. As mentioned by Bravo et al. (2012), the assumption of a stable and robust reference profile (y = 0) plays an essential role in the success of their method.

We formulate the problem of classification in terms of the generalized t-statistic in (9), using this to derive a linear discriminant function β̂_U'x. We find that two assumptions (A) and (B) are important. The consistency of β̂_U for the true value of β_0 in (2) is guaranteed by (A), with assumption (B) used for finding the optimal U in terms of minimizing the asymptotic variance. However, examples such as those reported in Section 5 and elsewhere suggest that our proposed method is relatively robust to departures from these assumptions.

An alternative to our method in cases where heterogeneity of probability distributions is observed is the mixture discriminant analysis of Hastie and Tibshirani (1996). Assuming mixtures of multivariate t-distributions, we find in the numerical examples of Web Appendix K that the generalized t-statistic outperforms Hastie and Tibshirani's method in all the cases considered. Here we have used the mda package in R (Hastie and Tibshirani, 1996) with default values of the tuning parameters. In contrast to the generalized t-statistic,


our examples suggest that the performance of the mixture discriminant analysis method can depend sensitively on the Gaussianity of the distributions in the model. Also, our method does not require the number of components in the mixture to be specified.

We have also developed the U-lasso by putting an L_1 penalty term on the generalized t-statistic. Our examples suggest that the minimum asymptotic variance property of U_opt also results in the stable estimation of β̂^λ_U and in a good true discovery rate for variable selection in high dimensional settings. This method employs the concept of an anchor variable, as is commonly used in AUC-based analysis (Ma and Huang, 2005; Wang et al., 2007). The idea seems particularly suitable in practice when we can identify a clinically important variable beforehand and when the aim is to find out which other variables are useful for classification in combination with the selected anchor variable.

9. Supplementary Materials

Web Appendices, Tables, and Figures referenced in Sections 2, 3, 5–8 are available with this paper at the Biometrics website on Wiley Online Library.

Acknowledgements

This research was supported by Japan Science and Technology Agency (JST), Core Research for Evolutionary Science and Technology (CREST).

References

Bravo, H. C., Pihur, V., McCall, M., Irizarry, R. A., and Leek, J. T. (2012). Gene expression anti-profiles as a basis for accurate universal cancer signatures. BMC Bioinformatics 13, 272.
Dottorini, T., Sole, G., Nunziangeli, L., Baldracchini, F., Senin, N., Mazzoleni, G., Proietti, C., Balaci, L., and Crisanti, A. (2011). Serum IgE reactivity profiling in an asthma affected cohort. PLoS ONE 6, e22319.
Duong, T. and Hazelton, M. L. (2003). Plug-in bandwidth matrices for bivariate kernel density estimation. Nonparametric Statistics 15, 17–30.
Efron, B. (1975). The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association 70, 892–898.
Eguchi, S. and Copas, J. (2002). A class of logistic-type discriminant functions. Biometrika 89, 1–22.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, 179–188.
Goeman, J. J. (2010). L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal 52, 70–84.
Hastie, T. and Tibshirani, R. (1996). Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B 58, 155–176.
Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12, 55–67.
Keller, A., Leidinger, P., Bauer, A., ElSharawy, A., Haas, J., Backes, C., Wendschlag, A., Giese, N., Tjaden, C., Ott, K., Werner, J., Hackert, T., Ruprecht, K., Huwer, H., Huebers, J., Jacobs, G., Rosenstiel, P., Dommisch, H., Schaefer, A., Müller-Quernheim, J., Wullich, B., Keck, B., Graf, N., Reichrath, J., Vogel, B., Nebel, A., Jager, S. U., Staehler, P., Amarantos, I., Boisguerin, V., Staehler, C., Beier, M., Scheffler, M., Büchler, M. W., Wischhusen, J., Haeusler, S. F. M., Dietl, J., Hofmann, S., Lenhof, H.-P., Schreiber, S., Katus, H. A., Rottbauer, W., Meder, B., Hoheisel, J. D., Franke, A., and Meese, E. (2011). Toward the blood-borne miRNome of human diseases. Nature Methods 8, 841–843.
Lian, H. (2008). Most: Detecting cancer differential gene expression. Biostatistics 9, 411–418.
Ma, S. and Huang, J. (2005). Regularized ROC method for disease classification and biomarker selection with microarray data. Bioinformatics 21, 4356–4362.
Meier, L., van de Geer, S., and Bühlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B 70, 53–71.
O'Neill, T. J. (1980). The general distribution of the error rate of a classification procedure with application to logistic regression discrimination. Journal of the American Statistical Association 75, 154–160.
Park, M. Y. and Hastie, T. (2007). L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society, Series B 69, 659–677.
Pepe, M. S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press.
Su, J. Q. and Liu, J. S. (1993). Linear combinations of multiple diagnostic markers. Journal of the American Statistical Association 88, 1350–1355.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.
Tibshirani, R. and Hastie, T. (2007). Outlier sums for differential gene expression analysis. Biostatistics 8, 2–8.
van't Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A. M., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R., and Friend, S. H. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536.
Wang, Z., Chang, Y. I., Ying, Z., Zhu, L., and Yang, Y. (2007). A parsimonious threshold-independent protein feature selection method through the area under receiver operating characteristic curve. Bioinformatics 23, 2788–2794.
Wu, B. (2007). Cancer outlier differential gene expression detection. Biostatistics 8, 566–575.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B 68, 49–67.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67, 301–320.

Received January 2014. Revised July 2014. Accepted October 2014.

407KB Sizes 0 Downloads 4 Views