
slack variables, or errors, in the SVM formulation. Incorporating these constraints results in the SVM+ formulation [13], [18]

    minimize_{w, b, w*, d}   (1/2)(w · w) + (γ/2)(w* · w*) + C Σ_{i=1}^{n} ξ_i*
    s.t.   y_i((w · z_i) + b) ≥ 1 − ξ_i*,
           ξ_i* = (w* · z_i*) + d ≥ 0,   i = 1, . . . , n.                      (29)

The dual optimization problem of (29) is

    minimize_{α, β}   − Σ_{i=1}^{n} α_i + (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j (z_i · z_j)
                      + (1/(2γ)) Σ_{i,j=1}^{n} (α_i + β_i − C)(α_j + β_j − C)(z_i* · z_j*)
    s.t.   Σ_{i=1}^{n} y_i α_i = 0,   Σ_{i=1}^{n} (α_i + β_i − C) = 0,
           0 ≤ α_i ≤ C,   0 ≤ β_i ≤ C,   i = 1, . . . , n.                      (30)
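To make (29) concrete, the following is a minimal Python sketch that prototypes the SVM+ primal with the CVXPY modeling package (the Python counterpart of the CVX toolbox cited in [19]); the function name, the dense-matrix interface, and the default parameter values are illustrative assumptions, not part of the original formulation.

```python
import cvxpy as cp

def svm_plus_primal(Z, Z_star, y, C=1.0, gamma=1.0):
    """Prototype of the SVM+ primal (29).

    Z      : n x d NumPy array of decision-space vectors z_i (rows)
    Z_star : n x d* NumPy array of privileged-space vectors z*_i (rows)
    y      : length-n NumPy array of labels in {-1, +1}
    """
    n, d = Z.shape
    d_star = Z_star.shape[1]
    w, b = cp.Variable(d), cp.Variable()
    w_star, d_corr = cp.Variable(d_star), cp.Variable()
    # the correcting function (w* . z*_i) + d plays the role of the slack xi_i*
    xi = Z_star @ w_star + d_corr
    objective = (0.5 * cp.sum_squares(w)
                 + 0.5 * gamma * cp.sum_squares(w_star)
                 + C * cp.sum(xi))
    constraints = [cp.multiply(y, Z @ w + b) >= 1 - xi, xi >= 0]
    problem = cp.Problem(cp.Minimize(objective), constraints)
    problem.solve()
    return w.value, b.value, w_star.value, d_corr.value
```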

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their helpful technical feedback.

REFERENCES

[1] B. Bakker and T. Heskes, “Task clustering and gating for Bayesian multitask learning,” J. Mach. Learn. Res., vol. 4, pp. 83–89, Dec. 2003.
[2] R. Ando and T. Zhang, “A framework for learning predictive structures from multiple tasks and unlabeled data,” J. Mach. Learn. Res., vol. 6, pp. 1817–1853, Dec. 2005.
[3] X. Liao and L. Carin, “Radial basis function network for multi-task learning,” in Proc. Adv. Neural Inf. Process. Syst., Vancouver, BC, Canada, Dec. 2005, pp. 795–802.
[4] R. Raina, A. Y. Ng, and D. Koller, “Constructing informative priors using transfer learning,” in Proc. Int. Conf. Mach. Learn., Pittsburgh, PA, Jun. 2006, pp. 713–720.
[5] N. D. Lawrence and J. Platt, “Learning to learn with the informative vector machine,” in Proc. Int. Conf. Mach. Learn., Banff, AB, Canada, 2004, pp. 1–65.
[6] A. Argyriou, T. Evgeniou, and M. Pontil, “Multi-task feature learning,” in Proc. Adv. Neural Inf. Process. Syst., Vancouver, BC, Canada, Dec. 2006, pp. 41–48.
[7] G. Obozinski, B. Taskar, and M. Jordan, “Multi-task feature selection,” in Proc. Int. Conf. Mach. Learn. Workshop Struct. Knowl. Transfer Mach. Learn., Pittsburgh, PA, Jun. 2006.
[8] Y. Amit, M. Fink, N. Srebro, and S. Ullman, “Uncovering shared structures in multiclass classification,” in Proc. Int. Conf. Mach. Learn., Corvallis, OR, Jun. 2007, pp. 17–24.
[9] L. Liang and V. Cherkassky, “Connection between SVM+ and multitask learning,” in Proc. IEEE Int. Joint Conf. Neural Netw., Hong Kong, Jun. 2008, pp. 2048–2054.
[10] L. Liang, F. Cai, and V. Cherkassky, “Predictive learning with structured (grouped) data,” Neural Netw., vol. 22, nos. 5–6, pp. 766–773, 2009.
[11] T. Evgeniou and M. Pontil, “Regularized multi-task learning,” in Proc. 10th Conf. Knowl. Disc. Data Mining, Seattle, WA, Aug. 2004, pp. 109–117.
[12] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, Sep. 1995.
[13] V. N. Vapnik, Empirical Inference Science: Afterword of 2006. New York: Springer-Verlag, 2006.
[14] I. W. Tsang, J. T. Kwok, and P. Cheung, “Core vector machines: Fast SVM training on very large data sets,” J. Mach. Learn. Res., vol. 6, pp. 363–392, Apr. 2005.
[15] J. Platt, “Sequential minimal optimization: A fast algorithm for training support vector machines,” Microsoft Research, Redmond, WA, Tech. Rep. MSR-TR-98-14, 1998.
[16] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, “Improvements to Platt’s SMO algorithm for SVM classifier design,” Dept. Mech. Prod. Eng., National Univ. Singapore, Singapore, Tech. Rep. CD-99-14, 1999.
[17] R.-E. Fan, P.-H. Chen, and C.-J. Lin, “Working set selection using second order information for training SVM,” J. Mach. Learn. Res., vol. 6, pp. 243–264, Nov. 2005.
[18] D. Pechyony, R. Izmailov, A. Vashist, and V. N. Vapnik, “SMO-style algorithms for learning using privileged information,” in Proc. Int. Conf. Data Mining, Las Vegas, NV, Jul. 2010, pp. 1–7.
[19] M. Grant and S. Boyd. (2010, Dec.). CVX: MATLAB Software for Disciplined Convex Programming, Version 1.21 [Online]. Available: http://cvxr.com/cvx
[20] J. Platt, “Using sparseness and analytic QP to speed training of support vector machines,” in Advances in Neural Information Processing Systems, vol. 11. Cambridge, MA: MIT Press, 1999.
[21] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram, “Multi-task learning for classification with Dirichlet process priors,” J. Mach. Learn. Res., vol. 8, pp. 35–63, Jan. 2007.

Complexity-Reduced Scheme for Feature Extraction With Linear Discriminant Analysis

Yuxi Hou, Iickho Song, Fellow, IEEE, Hwang-Ki Min, and Cheol Hoon Park, Senior Member, IEEE

Abstract— Owing to the singularity of the within-class scatter, linear discriminant analysis (LDA) becomes ill-posed for small sample size (SSS) problems. Null-space-based LDA (NLDA), an extension of LDA, provides good discriminant performance for SSS problems. Yet, as the original scheme for obtaining the feature extractor (FE) of NLDA suffers from a heavy complexity burden, a few modified schemes have since been proposed for complexity reduction. In this brief, by transforming the problem of finding the FE of NLDA into a linear equation problem, a novel scheme is derived that offers a further reduction of the complexity.

Index Terms— Feature extractor, linear equation problem, null-space-based linear discriminant analysis.

Manuscript received April 27, 2011; revised December 6, 2011 and March 27, 2012; accepted April 5, 2012. Date of publication April 27, 2012; date of current version May 10, 2012. This work was supported by the National Research Foundation of Korea with funding from the Ministry of Education, Science, and Technology under Grant 2011-0016462. The authors are with the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 305-701, Korea (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/TNNLS.2012.2194793

I. INTRODUCTION

Linear discriminant analysis (LDA) [1]–[4] is one of the most fundamental and powerful tools in feature extraction, and a number of feature extraction schemes [5]–[8] have been designed based on it. Because of the singularity of the within-class scatter, unfortunately, we cannot employ LDA for pattern recognition (PR) problems in which the dimension d of the data is larger than the number N of data. Normally, a PR problem is called a small sample size (SSS) problem if d > N − c, where c is the number of classes. Frequently, in SSS problems, d is in the tens of thousands and N is up to several thousand.
To extend the applicability of LDA to SSS problems, several LDA-related methods have been proposed, including
principal component analysis plus LDA [9], direct LDA [10], null-space-based LDA (NLDA) [11], [12], orthogonal LDA (OLDA) [13], and spectral regression discriminant analysis (SRDA) [14]. The schemes in [11] and [12] generally result in a good PR performance, yet they suffer from a complexity of O(2Nd^2). Simplifying the method in [11], a QR factorization scheme is proposed and analyzed in [15] to provide a complexity of O(2N^2 d + (c − 1)Nd + (2/3)N^3). The scheme in [16], which is based on the eigendecomposition technique, exhibits a complexity of O((5/2)N^2 d + (c − (5/2))Nd + (16/3)N^3). In the meantime, OLDA can provide a complexity of O(7N^2 d + 2(c − 1)Nd − N^3) at the same performance as NLDA, and SRDA exhibits a complexity of O((1/2)N^2 d + (2c − (3/2))Nd + (1/6)N^3) by casting the LDA into a regression framework.
In this brief, by making use of the symmetry and positive semidefiniteness of the within-class scatter, we first express the null space of the within-class scatter via an equation of the data and labels. We then transform the problem of finding the feature extractor (FE) of the NLDA into a linear equation problem that can be solved by Cholesky decomposition. The proposed scheme for the FE of the NLDA is shown to yield a complexity of O((1/2)N^2 d + (c + (1/2))Nd + 2cd + (1/6)N^3) and O((1/2)N^2 d + (3/2)Nd + 2d + (1/6)N^3) for c-class (c ≥ 3) and two-class problems, respectively. These complexities are lower than those of the schemes in [11]–[16].

II. PRELIMINARIES

Since a c-class PR problem can be decomposed into c two-class PR problems by the one-versus-all approach [17], we will focus mainly on two-class PR problems. Most commonly, a linear FE can be described as F(x) = w^T x + θ, where x ∈ R^d denotes the input, the d × 1 vector w is called the linear discriminant vector (LDV), and the scalar θ is the bias.
Let us assume a set D = {x_i, y_i}_{i=1}^{N} of data x_i ∈ R^d and labels y_i ∈ {1, −1}. For simplicity, we assume that {x_i} have been ordered so that the label vector can be expressed as y = [y_1 y_2 ··· y_N]^T = [1_{1×N_1}  −1_{1×N_2}]^T, with 1_{a×b} the a × b matrix of 1s. NLDA produces the LDV

    w_N = arg max_{‖w‖=1} w^T S_B w                                            (1)

with the constraint w^T S_W w = 0, where ‖·‖ indicates the Euclidean norm, S_B = (m_1 − m_2)(m_1 − m_2)^T is the d × d between-class scatter with rank c − 1, and

    S_W = X X^T − N_1 m_1 m_1^T − N_2 m_2 m_2^T                                (2)

is the d × d within-class scatter with rank min(d, N − c). Here, X = [x_1 x_2 ··· x_N] denotes the d × N matrix of data, Z_l = {i : x_i ∈ class l} denotes the index set for class l, and m_l = (1/N_l) Σ_{i∈Z_l} x_i denotes the sample mean vector for class l, where N_l is the number of elements in Z_l. Once the LDV w_N is determined, the bias θ_N can be obtained by θ_N = (1/N) Σ_{i=1}^{N} (y_i − w_N^T x_i).
In OLDA, the LDV w_O is determined as

    w_O = arg max_{‖w‖=1} (w^T S_B w) / (w^T S_T w)                            (3)

under the constraint w^T S_T w ≠ 0, where S_T = S_W + S_B is the total scatter. Similarly, SRDA produces the LDV

    w_S = arg min_w { ‖X̃ w − ȳ‖^2 + α_S ‖w‖^2 }                               (4)

with X̃ = X̄^T, X̄ = [x_1 − m  x_2 − m  ···  x_N − m], m = (1/N) Σ_{i=1}^{N} x_i the global mean vector, and ȳ = [(N/N_1)1_{1×N_1}  −(N/N_2)1_{1×N_2}]^T the response vector. In (4), the regularization parameter α_S is a nonnegative number, depending on which the LDV varies. Specifically, when α_S = 0, we will obtain

    w_S = (X̄ X̄^T)^+ X̄ ȳ                                                      (5)

where the superscript + denotes the pseudoinverse.
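As a concrete reference for the quantities defined above, here is a small NumPy sketch (the function names and the column-per-sample convention are our own choices, not the paper's) that forms S_B, S_W, and S_T as in (2) and evaluates the SRDA solution (5) for α_S = 0.

```python
import numpy as np

def scatter_matrices(X, y):
    """Two-class scatter matrices of Section II.

    X : d x N data matrix (columns are samples); y : labels in {+1, -1}.
    Returns S_B, S_W as in (2), and S_T = S_W + S_B.
    """
    X1, X2 = X[:, y == 1], X[:, y == -1]
    N1, N2 = X1.shape[1], X2.shape[1]
    m1, m2 = X1.mean(axis=1), X2.mean(axis=1)
    S_B = np.outer(m1 - m2, m1 - m2)
    S_W = X @ X.T - N1 * np.outer(m1, m1) - N2 * np.outer(m2, m2)
    return S_B, S_W, S_W + S_B

def srda_ldv(X, y):
    """SRDA solution (5) for alpha_S = 0: w_S = (Xbar Xbar^T)^+ Xbar ybar."""
    N = X.shape[1]
    N1, N2 = int((y == 1).sum()), int((y == -1).sum())
    Xbar = X - X.mean(axis=1, keepdims=True)
    ybar = np.where(y == 1, N / N1, -N / N2)   # the response vector of (4)
    return np.linalg.pinv(Xbar @ Xbar.T) @ (Xbar @ ybar)
```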

III. COMPLEXITY REDUCTION OF THE FE

In this section, we will establish a new scheme to obtain the LDV w_N. The new scheme will be shown to require a lower complexity than the schemes in [11]–[16] for SSS problems.

A. FE of the NLDA

Since 1/(w^T S_B w) = w^T w / ({(m_1 − m_2)^T w}^T (m_1 − m_2)^T w) from w^T w = ‖w‖^2 = 1, (1) can be rewritten as

    γ_H = arg min_γ (1/4) γ^T γ                                                (6)

where γ = 2w / ((m_1 − m_2)^T w) and

    γ_H = 2w_N / ((m_1 − m_2)^T w_N).                                          (7)

In addition, the constraint w^T S_W w = 0 can be expressed, as shown in Appendix A, as

    X^T γ + b 1_{N×1} = y                                                      (8)

where b = −((m_1 + m_2)^T w) / ((m_1 − m_2)^T w). Now, the Lagrange function for the problem of (6) and (8) can be set as

    L_1 = (1/4) γ^T γ + α^T (X^T γ + b 1_{N×1} − y)                            (9)

where the N × 1 vector α denotes the Lagrange multiplier. Setting the partial derivatives of L_1 with respect to b and γ equal to zero, we get

    1_{1×N} α = 0                                                              (10)

and

    γ = −2Xα.                                                                  (11)

With (10) and (11), the Lagrange function (9) can be rewritten as L_1 = −α^T (X^T X α + y). Thus, the problem of (6) under the constraint (8) can be reformulated as

    α_H = arg min_α α^T (X^T X α + y)                                          (12)

subject to (10), where α_H and γ_H are related, from (11), as

    γ_H = −2Xα_H.                                                              (13)
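For reference, the reformulated problem (12) subject to (10) can also be solved directly from its KKT system. The sketch below is our own illustration (not the reduced scheme developed in the next subsection); when X^T X is nonsingular it should return the same α_H as the linear-equation solution derived there.

```python
import numpy as np

def alpha_direct_kkt(X, y):
    """Reference solve of (12) subject to (10) via the full KKT system:
       minimize a^T (X^T X a + y)  s.t.  1^T a = 0.
    X : d x N data matrix; y : length-N label vector in {+1, -1}.
    """
    N = X.shape[1]
    G = X.T @ X
    K = np.block([[2.0 * G, np.ones((N, 1))],
                  [np.ones((1, N)), np.zeros((1, 1))]])
    rhs = np.concatenate([-np.asarray(y, dtype=float), [0.0]])
    sol = np.linalg.solve(K, rhs)
    return sol[:N]   # alpha_H; sol[N] is the multiplier for the constraint (10)

# The LDV is then obtained by normalizing X @ alpha, as derived in the sequel (eq. (18)).
```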


After some steps again with the Lagrange technique, we can rewrite (12) with the constraint (10) as

    α_H = arg min_α { α^T X^T X α + α^T y − (2/N)(1_{1×N} α)(1_{1×N} X^T X α)
                      − (1/N)(1_{1×N} α)(1_{1×N} y) }.                         (14)

Setting the derivative of the objective function in (14) with respect to α equal to zero, we can show that

    A_L α_H = y_H                                                              (15)

where

    A_L = X^T X − (1/N) 1_{N×N} X^T X − (1/N) X^T X 1_{N×N}                   (16)

and

    y_H = (1/2) ( (1/N) 1_{N×N} − I_N ) y = −(N_1 N_2 / N^2) ȳ                (17)

with I_N denoting the N × N identity matrix. Now, since w_N = −{(m_1 − m_2)^T w_N} Xα_H from (7) and (13), and ‖w_N‖ = 1 from (1), the value of the scalar (m_1 − m_2)^T w_N is obtained as ±1/‖Xα_H‖. Thus, we have

    w_N = ± Xα_H / ‖Xα_H‖                                                      (18)

by which the LDV w_N can be evaluated once α_H is obtained. Note that (15) is derived under the conditions that the vector m_1 − m_2 is not orthogonal to the null space of S_W and that the matrix X^T X is of full rank. Clearly, when m_1 − m_2 is orthogonal to the null space of S_W, there is no solution since w^T S_B w = |(m_1 − m_2)^T w|^2 in (1) is always zero. On the other hand, when X^T X is singular, α_H could be obtained by the perturbation method of replacing X^T X with a full rank matrix X^T X + μ I_N, where μ is a small positive number.

B. Evaluation of α_H

The direct evaluation of α_H from (15) by α_H = A_L^{−1} y_H requires approximately (4/3)N^3 + O(N^2) multiplications [18]. To reduce the computational burden, we instead employ the following method for the evaluation of α_H. With 1_{1×N} α_H = 0 from (10), it is straightforward to see that the matrix

    Ã_L = A_L + μ_H 1_{N×N}                                                    (19)

satisfies

    Ã_L α_H = y_H.                                                             (20)

Since Ã_L is positive definite for a real number μ_H satisfying μ_H > m^T m, as shown in Appendix B, we can solve (20) for α_H by Cholesky decomposition. As we shall see shortly, this procedure will provide us with some saving in complexity. The learning algorithm of the proposed scheme is summarized in Algorithm 1.

Algorithm 1 Learning algorithm of the proposed scheme
Input: data matrix X, label vector y, number of classes c
Output: {w_N^(j), θ_N^(j)}_{j=1}^{c} if c ≥ 3; {w_N, θ_N} if c = 2
1. Decompose y into c two-class label vectors {y^(j)}_{j=1}^{c} if c ≥ 3 by setting the labels to 1 for the samples in the j-th class and −1 for the others (one-versus-all approach);
2. Calculate A_L by (16);
3. Calculate Ã_L by (19) with a value of μ_H > m^T m;
4. Perform the Cholesky decomposition Ã_L = R_L^T R_L;
5. For each two-class label vector y^(j):
   1) Calculate y_H^(j) by (17);
   2) Solve (20) for α_H^(j) using R_L;
   3) Calculate the LDV w_N^(j) by (18);
   4) Calculate the bias θ_N^(j) = (1/N) 1_{1×N} (y^(j) − X^T w_N^(j)).

C. Complexity Analysis

Since the multiplication operation consumes considerably more CPU time than the addition operation, we will consider only the number of multiplications (NOM), and denote one multiplication operation by complexity 1. It is, by the way, noteworthy that the NOM is not significantly different from the number of additions in the proposed and conventional schemes. Assuming that the complexity of a division or a square root is the same as that of a multiplication [19], [20], we now evaluate the complexity of the proposed scheme.
Step 1: First, thanks to the symmetry of X^T X, we can obtain X^T X with (1/2)N(N + 1)d multiplications and (1/2)N(N + 1)(d − 1) additions, since d multiplications and d − 1 additions are necessary for each of the (1/2)N(N + 1) elements to be calculated. Second, the matrix Ã_L can be obtained from X^T X with O(N^2) multiplications by (16) and (19). Noting that m^T m = (1/N^2) 1_{1×N} X^T X 1_{N×1}, we can calculate m^T m with 1 division and N^2 − 1 additions by dividing the sum of all elements in X^T X by N^2.
Step 2: Using the method of Gaussian elimination with partial pivoting, it requires (1/6)N^3 + (1/2)N^2 − (2/3)N multiplications and (1/6)N^3 − (1/6)N additions [18] to decompose the matrix Ã_L into Ã_L = R_L^T R_L, where R_L is an upper triangular matrix with positive diagonal elements. Following the Cholesky decomposition, (20) can be solved for α_H by successively employing the algorithms of backward and forward substitution with O(N^2) multiplications [18].
Step 3: From α_H, we can obtain Xα_H with Nd multiplications and (N − 1)d additions. Then, the value ‖Xα_H‖ can be obtained from Xα_H with d multiplications, d − 1 additions, and 1 square root operation. Finally, we can obtain w_N from Xα_H and ‖Xα_H‖ with d divisions.
In summary, the complexity requirement of the proposed scheme is (1/2)N^2 d + (3/2)Nd + 2d + (1/6)N^3 + O(N^2) for two-class PR problems. Next, each of the c two-class PR problems decomposed from a c-class problem can be transformed into a linear equation problem (20) with the same Ã_L but a different y_H. Therefore, once one of the two-class PR problems is solved, we can solve the rest of the two-class PR problems with an additional complexity of O(N^2) in Step 2 and (c − 1)(Nd + 2d) in Step 3. Note that the calculation of X^T X and the Cholesky decomposition of Ã_L are not required after the first two-class problem. In short, the complexity of the proposed scheme for c-class problems is (1/2)N^2 d + (1/2)Nd + c(N + 2)d + (1/6)N^3 + O(N^2).
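To make the procedure concrete, the following is a compact NumPy/SciPy sketch of Algorithm 1 as reconstructed above. The function name, the particular choice μ_H = m^T m + 1, and fixing the sign in (18) to + are illustrative assumptions; the MATLAB implementation used in Section IV is not reproduced here.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def nlda_fe(X, label_vectors):
    """Sketch of Algorithm 1 (assumes X^T X is nonsingular, i.e., the SSS setting d >= N).

    X             : d x N data matrix
    label_vectors : list of two-class label vectors y^(j) in {+1, -1}^N
                    (one per class for one-versus-all; a single vector when c = 2)
    Returns a list of (w_N, theta_N) pairs.
    """
    d, N = X.shape
    G = X.T @ X                                   # Step 1: N x N Gram matrix
    J = np.ones((N, N)) / N
    A_L = G - J @ G - G @ J                       # eq. (16)
    mu_H = G.sum() / N**2 + 1.0                   # any mu_H > m^T m works; m^T m = sum(G)/N^2
    A_tilde = A_L + mu_H * np.ones((N, N))        # eq. (19), positive definite
    factor = cho_factor(A_tilde)                  # Step 2: Cholesky, done only once
    extractors = []
    for y in label_vectors:
        N1, N2 = int((y == 1).sum()), int((y == -1).sum())
        y_H = -(N1 * N2 / N**2) * np.where(y == 1, N / N1, -N / N2)   # eq. (17)
        alpha = cho_solve(factor, y_H)            # solve (20), Step 3 per label vector
        Xa = X @ alpha
        w = Xa / np.linalg.norm(Xa)               # eq. (18); the + sign is chosen here
        theta = np.mean(y - X.T @ w)              # bias, item 4) of Step 5 in Algorithm 1
        extractors.append((w, theta))
    return extractors
```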


Fig. 1. Ratio of the NOMs versus ln N for CRANMED (d = 41681). Legend: OLDA [13], SRDA [14], NLDA scheme [15], NLDA scheme [16]. Marked points: A1(5.99, 3.98), B1(5.99, 5.05), C1(5.99, 1.00), D1(5.99, 13.84), A2(7.09, 3.99), B2(7.09, 5.24), C2(7.09, 1.00), D2(7.09, 13.78).

Fig. 2. Ratio of the NOMs versus ln N for Leukemia (d = 7129). Legend: OLDA [13], SRDA [14], NLDA scheme [15], NLDA scheme [16]. Marked points: A1(2.30, 3.13), B1(2.30, 3.67), C1(2.30, 1.12), D1(2.30, 10.74), A2(3.69, 3.76), B2(3.69, 4.66), C2(3.69, 1.04), D2(3.69, 13.05).

Fig. 3. Ratio of the NOMs versus ln N for TDT2 (d = 36771). Legend: OLDA [13], SRDA [14], NLDA scheme [15], NLDA scheme [16]. Marked points: A1(5.70, 3.48), B1(5.70, 4.37), C1(5.70, 1.15), D1(5.70, 11.90), A2(7.31, 3.88), B2(7.31, 5.19), C2(7.31, 1.04), D2(7.31, 13.27).

Fig. 4. Ratio of the NOMs versus ln N for ORL (d = 10304). Legend: OLDA [13], SRDA [14], NLDA scheme [15], NLDA scheme [16]. Marked points: A1(4.79, 2.76), B1(4.79, 3.40), C1(4.79, 1.37), D1(4.79, 9.04), A2(5.89, 3.44), B2(5.89, 4.51), C2(5.89, 1.17), D2(5.89, 11.60).

IV. EXPERIMENTAL RESULTS

Let us demonstrate the complexity of the proposed scheme in comparison with other schemes for four sets of samples.
1) Sample Set CRANMED [21]: The sample set CRANMED consists of 1398 aerodynamic samples from the Cranfield collection and 1033 biomedical samples from the Medlars collection. After 41681 professional terms appearing frequently in aerodynamics and biomedicine are collected, each of the 2431 documents is transformed into a 41681 × 1 vector associated with a label (1 or −1) by counting the frequency with which each of the 41681 terms appears in the document (an illustrative sketch of this term-counting step follows the sample-set descriptions).
2) Sample Set Leukemia [22]: The sample set Leukemia is from a study of gene expression on two types of acute leukemia, with 47 samples from acute lymphoblastic leukemia (ALL) and 25 samples from acute myeloblastic leukemia (AML). By measuring the quantitative expression levels of 7129 genes, a sample can be represented by a 7129 × 1 vector with a corresponding label 1 for ALL or −1 for AML.


3) Sample Set TDT2 [23]: The sample set of the second phase of the Topic Detection and Tracking (TDT2) project consists of 9394 samples in 30 classes collected from two newswires (APW, NYT), two radio programs (VOA, PRI), and two television programs (CNN, ABC). By counting the frequency of the 36771 terms appearing at least twice in the 9394 samples, each document is transformed into a 36771 × 1 vector. In short, we have 9394 column vectors of dimension 36771 and the corresponding labels.
4) Sample Set ORL [24]: The Olivetti-Oracle Research Lab (ORL) sample set consists of 400 face images of 10304 pixels for 40 individuals, with variation in pose, illumination, facial expression, and facial details. By measuring the gray level in each pixel, an image can be transformed into a 10304 × 1 vector. We then have 400 column vectors of dimension 10304 and the corresponding labels for 40 classes.
All the schemes are implemented in MATLAB and run on an Intel i7 PC with a 2.93 GHz CPU and 4 GB RAM.
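As referenced in the description of the CRANMED set, the term-counting transformation used for the text sample sets can be sketched as follows; this is an illustration under our own simplifying assumptions about tokenization, and the original term lists and preprocessing are not reproduced.

```python
import numpy as np

def term_frequency_vectors(documents, terms):
    """Each document (a string) becomes a d x 1 column of term counts, d = len(terms)."""
    index = {t: i for i, t in enumerate(terms)}
    X = np.zeros((len(terms), len(documents)))
    for j, doc in enumerate(documents):
        for token in doc.lower().split():
            i = index.get(token)
            if i is not None:
                X[i, j] += 1
    return X   # d x N data matrix, to be paired with labels in {+1, -1}
```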


TABLE I
AVERAGE CPU TIME (IN SECONDS) CONSUMED TO OBTAIN THE LDV FOR CRANMED
(THE NUMBERS IN PARENTHESES DENOTE THE RATIOS TO THE PROPOSED SCHEME)

  k    OLDA [13]        SRDA [14]      NLDA [15]      NLDA [16]      Proposed
 200     7.89 (11.11)   0.85 (1.20)     4.07 (5.73)    3.11 (4.38)   0.71 (1.00)
 300    16.59 (11.29)   1.67 (1.14)     7.69 (5.23)    6.67 (4.54)   1.47 (1.00)
 400    30.00 (12.24)   2.72 (1.11)    12.56 (5.13)   11.69 (4.77)   2.45 (1.00)
 500    47.69 (12.96)   4.02 (1.09)    18.21 (4.95)   18.10 (4.92)   3.68 (1.00)
 600    71.23 (13.62)   5.59 (1.07)    25.80 (4.93)   26.38 (5.04)   5.23 (1.00)

TABLE II
AVERAGE CPU TIME (IN MILLISECONDS) CONSUMED TO OBTAIN THE LDV FOR LEUKEMIA
(THE NUMBERS IN PARENTHESES DENOTE THE RATIOS TO THE PROPOSED SCHEME)

  k    OLDA [13]        SRDA [14]      NLDA [15]      NLDA [16]      Proposed
   5     4.24 (4.93)    1.13 (1.31)     1.99 (2.31)    1.96 (2.28)   0.86 (1.00)
  10    10.83 (6.77)    2.50 (1.56)     5.23 (3.27)    4.56 (2.85)   1.60 (1.00)
  15    19.30 (7.42)    3.86 (1.48)     9.42 (3.62)    7.91 (3.04)   2.60 (1.00)
  20    29.45 (9.95)    4.77 (1.61)    15.05 (5.08)   11.13 (3.76)   2.96 (1.00)

TABLE III
AVERAGE CPU TIME (IN SECONDS) CONSUMED TO OBTAIN THE LDV FOR TDT2
(THE NUMBERS IN PARENTHESES DENOTE THE RATIOS TO THE PROPOSED SCHEME)

  k    OLDA [13]        SRDA [14]      NLDA [15]      NLDA [16]      Proposed
  10     4.41 (8.65)    0.67 (1.31)     2.59 (5.08)    2.42 (4.75)   0.51 (1.00)
  20    14.64 (9.57)    1.88 (1.23)     7.10 (4.64)    7.72 (5.05)   1.53 (1.00)
  30    33.70 (11.09)   3.54 (1.16)    13.95 (4.59)   16.45 (5.41)   3.04 (1.00)
  40    62.92 (12.43)   5.74 (1.13)    24.42 (4.83)   29.65 (5.86)   5.06 (1.00)
  50   103.50 (13.22)   8.55 (1.09)    38.43 (4.91)   48.24 (6.16)   7.83 (1.00)

TABLE IV
AVERAGE CPU TIME (IN SECONDS) CONSUMED TO OBTAIN THE LDV FOR ORL
(THE NUMBERS IN PARENTHESES DENOTE THE RATIOS TO THE PROPOSED SCHEME)

  k    OLDA [13]        SRDA [14]      NLDA [15]      NLDA [16]      Proposed
   3     0.33 (6.60)    0.06 (1.20)     0.22 (4.40)    0.21 (4.20)   0.05 (1.00)
   5     0.59 (7.38)    0.11 (1.38)     0.33 (4.13)    0.44 (5.50)   0.08 (1.00)
   7     1.07 (7.64)    0.18 (1.29)     0.58 (4.14)    0.75 (5.36)   0.14 (1.00)
   9     1.61 (8.05)    0.25 (1.25)     0.81 (4.05)    1.13 (5.65)   0.20 (1.00)

TABLE V
PR ACCURACY FOR CRANMED

  k    OLDA [13]   SRDA [14]   NLDA [15]   NLDA [16]   Proposed
 200    96.66%      96.66%      96.66%      96.66%      96.66%
 300    97.15%      97.15%      97.15%      97.15%      97.15%
 400    97.48%      97.48%      97.48%      97.48%      97.48%
 500    97.67%      97.67%      97.67%      97.67%      97.67%
 600    97.87%      97.87%      97.87%      97.87%      97.87%

D. Comparison of Complexity

Figs. 1–4 show the ratios of the NOMs in the various schemes to that in the proposed scheme for the four sample sets. It is observed that the complexity of the proposed scheme is clearly lower than that of the schemes in [13], [15], and [16], and is practically the same as that of the scheme in [14]. The points A1–D1 and A2–D2 in Figs. 1–4 are marked for later use in the comparison with the ratios of consumed CPU time.
Next, by randomly choosing k samples (and their labels) from each of the c classes, we have built training data matrices of size 41681 × 2k, 7129 × 2k, 36771 × 30k, and 10304 × 40k from CRANMED, Leukemia, TDT2, and ORL, respectively. Note that N = ck here. The CPU time consumed to obtain the LDVs for the N training data is then measured. Tables I–IV, showing the CPU time averaged over 250 repetitions for each k, clearly indicate that the ratio of CPU time consumption is in general accordance with the expectation from the complexity analysis in Figs. 1–4. For example: 1) when k = 600, we have 5.24 at point B2 in Fig. 1 for the theoretical ratio and 5.04 for the CPU time ratio in Table I; 2) when k = 5, we have 3.13 at point A1 in Fig. 2 and 2.31 in Table II; and 3) when k = 50, we have 13.27 at point D2 in Fig. 3 and 13.22 in Table III. In summary, the proposed scheme provides a significant saving in CPU time for obtaining the FE when compared with the schemes in [13], [15], and [16]. In addition, it has some gain in CPU time over the scheme in [14].
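The theoretical curves in Figs. 1–4 can be reproduced from the NOM expressions quoted in Sections I and III-C. The sketch below is our own illustration (it keeps only the terms stated in the text and omits lower-order contributions); the scheme labels are shorthand for the cited references.

```python
def nom(scheme, N, d, c):
    """Leading NOM terms quoted in Sections I and III-C."""
    if scheme == "proposed":
        if c >= 3:
            return N**2 * d / 2 + N * d / 2 + c * (N + 2) * d + N**3 / 6
        return N**2 * d / 2 + 1.5 * N * d + 2 * d + N**3 / 6
    if scheme == "OLDA":
        return 7 * N**2 * d + 2 * (c - 1) * N * d - N**3
    if scheme == "SRDA":
        return N**2 * d / 2 + (2 * c - 1.5) * N * d + N**3 / 6
    if scheme == "NLDA15":
        return 2 * N**2 * d + (c - 1) * N * d + 2 * N**3 / 3
    if scheme == "NLDA16":
        return 2.5 * N**2 * d + (c - 2.5) * N * d + 16 * N**3 / 3
    raise ValueError(scheme)

# Example: point A1 in Fig. 1 (CRANMED: c = 2, d = 41681, k = 200, so N = 2k = 400)
ratio = nom("NLDA15", 400, 41681, 2) / nom("proposed", 400, 41681, 2)
print(round(ratio, 2))   # about 3.98, matching the marked point A1(5.99, 3.98)
```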


E. Comparison of PR Performance

The PR accuracy of the FEs, evaluated via the nearest neighbor (NN) classifier, is investigated; clearly, a higher PR accuracy indicates a better FE. Here, the PR accuracy denotes the ratio of the number of vectors (called testing samples) whose labels are estimated correctly by the NN classifier to the number M of vectors used in the estimation, where M = 2431 − 2k, 72 − 2k, 9394 − 30k, and 400 − 40k for CRANMED, Leukemia, TDT2, and ORL, respectively (a sketch of this evaluation protocol is given after the observations below). Based on the simulation results shown in Tables V–VIII, we can make the following observations.
1) For the two-class PR problems of CRANMED and Leukemia, it turns out that the proposed scheme and the schemes in [13]–[16] exhibit the same PR accuracy. This can be anticipated indirectly by noting that the LDVs (1) of NLDA and (3) of OLDA are equivalent as shown in [13], and that the proposed scheme and SRDA are equivalent for two-class problems as shown in Appendix C.
2) On the other hand, the proposed scheme, when adopting the one-versus-all approach for multiclass problems, no longer has the same objective function as the schemes in [13], [15], and [16]. Consequently, the PR performances of these schemes would naturally be different, as is observed clearly in Tables VII and VIII. Note that the proposed scheme generally provides a better PR accuracy than the other schemes for the 30-class set TDT2. For the 40-class set ORL, the proposed scheme performs slightly worse than the schemes in [13], [15], and [16], but better than the scheme in [14]. Apparently, the proposed scheme with the one-versus-all approach is not expected to always produce the highest PR accuracy for multiclass problems. In passing, let us note that "one can never claim to have a universally superior feature extraction method in practice," as stated in [25]. Nevertheless, the proposed scheme should be among the most appealing and attractive feature extraction schemes when both the PR accuracy and the complexity are taken into consideration simultaneously.
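The evaluation protocol referenced above can be sketched as follows; this is our own illustration, and it assumes that the NN rule is applied in the space of the extracted features, with one LDV per class under the one-versus-all approach.

```python
import numpy as np

def nn_pr_accuracy(W, X_train, y_train, X_test, y_test):
    """PR accuracy with a 1-NN rule in the extracted feature space.

    W       : d x c matrix whose columns are the LDVs (a single column when c = 2)
    X_train : d x N training data, y_train : their labels
    X_test  : d x M testing data,  y_test  : their labels
    """
    F_train = W.T @ X_train          # feature vectors of the training samples
    F_test = W.T @ X_test            # feature vectors of the testing samples
    correct = 0
    for j in range(F_test.shape[1]):
        dists = np.linalg.norm(F_train - F_test[:, [j]], axis=0)
        if y_train[np.argmin(dists)] == y_test[j]:
            correct += 1
    return correct / F_test.shape[1]
```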


TABLE VI
PR ACCURACY FOR LEUKEMIA

  k    OLDA [13]   SRDA [14]   NLDA [15]   NLDA [16]   Proposed
   5    89.41%      89.41%      89.41%      89.41%      89.41%
  10    94.32%      94.32%      94.32%      94.32%      94.32%
  15    95.79%      95.79%      95.79%      95.79%      95.79%
  20    96.12%      96.12%      96.12%      96.12%      96.12%

TABLE VII
PR ACCURACY FOR TDT2

  k    OLDA [13]   SRDA [14]   NLDA [15]   NLDA [16]   Proposed
  10    70.15%      83.50%      70.15%      68.46%      83.45%
  20    79.33%      87.02%      79.33%      79.36%      87.12%
  30    80.97%      86.67%      80.97%      80.73%      87.02%
  40    82.12%      86.79%      82.12%      82.12%      87.20%
  50    82.17%      86.24%      82.17%      82.12%      86.86%

TABLE VIII
PR ACCURACY FOR ORL

  k    OLDA [13]   SRDA [14]   NLDA [15]   NLDA [16]   Proposed
   3    91.15%      87.20%      91.15%      90.49%      87.54%
   5    96.04%      92.20%      96.04%      95.39%      93.90%
   7    97.71%      94.25%      97.71%      97.20%      96.83%
   9    98.20%      95.57%      98.20%      97.76%      97.65%

V. CONCLUSION

By reformulating the constraint in NLDA, we have transformed the problem of obtaining the FE of NLDA into a constrained optimization problem, and eventually into a linear equation problem that can be solved by the technique of Cholesky decomposition. The complexity analysis and the simulation results have confirmed that the proposed scheme, while delivering a competent PR performance, in general offers a significantly lower complexity than the conventional schemes by avoiding the computational burden of eigendecomposition and QR factorization.

APPENDIX A

Theorem 1: When ζ = (m_1 − m_2)^T w is nonzero, X^T X is of full rank, and 0 denotes the all-zero vector, we have

    w^T S_W w = 0  ⇔  S_W w = 0  ⇔  a X^T w + b 1_{N×1} = y                    (21)

where a = 2/ζ and b = −((m_1 + m_2)^T w)/((m_1 − m_2)^T w) = −((m_1 + m_2)^T w)/ζ.
Proof: It is well known [26] that w^T S_W w = 0 ⇔ S_W w = 0. Thus, let us prove only a X^T w + b 1_{N×1} = y ⇔ S_W w = 0.
1) When a X^T w + b 1_{N×1} = y, we have

    a X X^T w + b X 1_{N×1} = X y                                              (22)

if we multiply by X from the left. Since X 1_{N×1} = N_1 m_1 + N_2 m_2 and X y = N_1 m_1 − N_2 m_2, we can rewrite (22), after substituting a and b and multiplying by the scalar ζ, as 2 X X^T w − 2 N_1 m_1^T w m_1 − 2 N_2 m_2^T w m_2 = 0. Consequently, we have (X X^T − N_1 m_1 m_1^T − N_2 m_2 m_2^T) w = S_W w = 0.
2) Assume S_W w = 0. Then, using the definition of S_W and multiplying by 2/ζ, we have (2/ζ) X X^T w − (2/ζ) m_1^T w N_1 m_1 − (2/ζ) m_2^T w N_2 m_2 = 0, which can be rewritten as (2/ζ) X X^T w − N_1 m_1 − (((m_1 + m_2)^T w)/ζ) N_1 m_1 + N_2 m_2 − (((m_1 + m_2)^T w)/ζ) N_2 m_2 = 0, or as a X X^T w − N_1 m_1 + b N_1 m_1 + N_2 m_2 + b N_2 m_2 = 0. Then, noting that X 1_{N×1} = N_1 m_1 + N_2 m_2 and X y = N_1 m_1 − N_2 m_2, we obtain a X X^T w + b X 1_{N×1} = X y, from which we have a X^T w + b 1_{N×1} = y by multiplying by (X^T X)^{−1} X^T from the left. Combining the results in 1) and 2), we have Theorem 1.

APPENDIX B

Theorem 2: When X^T X is of full rank, the matrix Ã_L is positive definite if μ_H > m^T m.
Proof: Since X̄ = X(I_N − (1/N)1_{N×N}), we can rewrite (19), using (16) and the definition of m, as

    Ã_L = X̄^T X̄ + (μ_H − m^T m) 1_{N×N}.                                      (23)

When X^T X is of full rank, rank(X) = rank(X^T X) = N. We also observe rank(X̄) = N − 1 from rank(X̄) = rank(X(I_N − (1/N)1_{N×N})) ≤ rank(I_N − (1/N)1_{N×N}) = N − 1 and rank(X̄) ≥ rank(X) + rank(I_N − (1/N)1_{N×N}) − N = N − 1. Since z^T X̄^T X̄ z = ‖X̄ z‖^2 ≥ 0 for any N × 1 nonzero real vector z, the matrix X̄^T X̄ is positive semidefinite, and the eigenvalues of X̄^T X̄ are all nonnegative. Combining this with the fact that rank(X̄^T X̄) = rank(X̄) = N − 1, the matrix X̄^T X̄ has N − 1 positive eigenvalues λ̄_1, λ̄_2, . . . , λ̄_{N−1}, with the remaining eigenvalue λ̄_N = 0. Let v̄_1, v̄_2, . . . , v̄_N be the normalized eigenvectors of X̄^T X̄ corresponding to λ̄_1, λ̄_2, . . . , λ̄_N, respectively. We have v̄_i^T v̄_j = 1 if i = j and 0 otherwise, since the normalized eigenvectors of a symmetric matrix are orthonormal [18]. In addition, since X̄^T X̄ (1/√N) 1_{N×1} = 0 and (1/N) 1_{1×N} 1_{N×1} = 1, it is easy to get

    v̄_N = (1/√N) 1_{N×1}.                                                      (24)

Now, noting that {v̄_i}_{i=1}^{N} are the orthonormal eigenvectors of X̄^T X̄, we have from (23)

    Ã_L v̄_i = X̄^T X̄ v̄_i + N(μ_H − m^T m) v̄_N v̄_N^T v̄_i
             = λ̄_i v̄_i,                    for i = 1, 2, . . . , N − 1,
             = N(μ_H − m^T m) v̄_N,         for i = N.                           (25)

Equation (25) implies that the matrix Ã_L has the eigenvalues {λ̄_1, λ̄_2, . . . , λ̄_{N−1}, N(μ_H − m^T m)}, and is therefore positive definite when μ_H > m^T m.

APPENDIX C

Theorem 3: When X^T X is of full rank and α_S = 0 in a two-class PR problem, we have w_N = ± w_S / ‖w_S‖.
Proof: When X^T X is of full rank, {λ̄_1, λ̄_2, . . . , λ̄_{N−1}, 0} and {λ̄_1, λ̄_2, . . . , λ̄_{N−1}, N(μ_H − m^T m)} are the eigenvalues of X̄^T X̄ and Ã_L, respectively, corresponding to the eigenvectors {v̄_1, v̄_2, . . . , v̄_N}, as shown in Appendix B. Thus

    (X̄^T X̄)^+ = Σ_{i=1}^{N−1} (1/λ̄_i) v̄_i v̄_i^T                              (26)

and we can express α_H = Ã_L^{−1} y_H as

    α_H = Σ_{i=1}^{N−1} (1/λ̄_i) v̄_i v̄_i^T y_H + (v̄_N^T y_H / (N(μ_H − m^T m))) v̄_N.   (27)

Now, from (17) and (24), we have v̄_N^T y_H = 0. Using this fact and (26), we can rewrite (27) as α_H = (X̄^T X̄)^+ y_H, which can subsequently be expressed as

    w_S = −(N^2 / (N_1 N_2)) X̄ α_H                                             (28)

using (17) and then (5) when α_S = 0. Next, noting that X̄ = X − m 1_{1×N}, we can rewrite (28) as w_S = −(N^2/(N_1 N_2)) Xα_H + (N^2/(N_1 N_2)) m 1_{1×N} α_H, or, since 1_{1×N} α_H = 0 from (10), as

    Xα_H = −(N_1 N_2 / N^2) w_S.                                                (29)

Finally, substituting (29) into (18), we have Theorem 3.
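Theorem 3 can also be spot-checked numerically. The following self-contained sketch is our own addition; it uses the same reconstruction of (16)–(20), (18), and (5) as above on a randomly generated SSS problem, and the returned value should be close to zero.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def check_theorem_3(d=50, N1=8, N2=6, seed=0):
    """For a random two-class SSS problem (d > N), the NLDA direction obtained through
    (16)-(20) and (18) coincides, up to sign, with the normalized SRDA direction of (5)."""
    rng = np.random.default_rng(seed)
    N = N1 + N2
    X = rng.standard_normal((d, N))
    y = np.concatenate([np.ones(N1), -np.ones(N2)])
    # proposed scheme: (16), (19), (20), (18)
    G = X.T @ X
    J = np.ones((N, N)) / N
    A_tilde = G - J @ G - G @ J + (G.sum() / N**2 + 1.0) * np.ones((N, N))
    y_H = -(N1 * N2 / N**2) * np.where(y == 1, N / N1, -N / N2)
    alpha = cho_solve(cho_factor(A_tilde), y_H)
    w_N = X @ alpha / np.linalg.norm(X @ alpha)
    # SRDA direction (5), normalized
    Xbar = X - X.mean(axis=1, keepdims=True)
    ybar = np.where(y == 1, N / N1, -N / N2)
    w_S = np.linalg.pinv(Xbar @ Xbar.T) @ (Xbar @ ybar)
    w_S /= np.linalg.norm(w_S)
    return min(np.linalg.norm(w_N - w_S), np.linalg.norm(w_N + w_S))   # close to 0
```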

ACKNOWLEDGMENT

The authors would like to thank the Associate Editor and three anonymous reviewers for their constructive suggestions and helpful comments.


REFERENCES

[1] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: Wiley, 2001, pp. 1–38.
[2] L. Wang and X. Fu, Data Mining With Computational Intelligence. New York: Springer-Verlag, 2005, pp. 1–10.
[3] P. Howland, J. Wang, and H. Park, “Solving the small sample size problem in face recognition using generalized discriminant analysis,” Pattern Recognit., vol. 39, no. 2, pp. 277–287, Feb. 2006.
[4] W. Zheng, “Heteroscedastic feature extraction for texture classification,” IEEE Signal Process. Lett., vol. 16, no. 9, pp. 766–769, Sep. 2009.
[5] S. Huh and D. Lee, “Linear discriminant analysis for signatures,” IEEE Trans. Neural Netw., vol. 21, no. 12, pp. 1990–1996, Dec. 2010.
[6] C. S. Dhir and S.-Y. Lee, “Discriminant independent component analysis,” IEEE Trans. Neural Netw., vol. 22, no. 6, pp. 845–857, Jun. 2011.
[7] Z. Fan, Y. Xu, and D. Zhang, “Local linear discriminant analysis framework using sample neighbors,” IEEE Trans. Neural Netw., vol. 22, no. 7, pp. 1119–1132, Jul. 2011.
[8] L. Zhang, L. Wang, and W. Lin, “Semi-supervised biased maximum margin analysis for interactive image retrieval,” IEEE Trans. Image Process., vol. 21, no. 4, pp. 2294–2308, Apr. 2012.
[9] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces versus fisherfaces: Recognition using class specific linear projection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 711–720, Jul. 1997.
[10] H. Yu and J. Yang, “A direct LDA algorithm for high-dimensional data with application to face recognition,” Pattern Recognit., vol. 34, no. 10, pp. 2067–2070, Oct. 2001.
[11] L.-F. Chen, H.-Y. M. Liao, M.-T. Ko, J.-C. Lin, and G.-J. Yu, “A new LDA-based face recognition system which can solve the small sample size problem,” Pattern Recognit., vol. 33, no. 10, pp. 1713–1726, Oct. 2000.
[12] P. Howland and H. Park, “Generalizing discriminant analysis using the generalized singular value decomposition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 8, pp. 995–1006, Aug. 2004.
[13] J. Ye and T. Xiong, “Computational and theoretical analysis of null space and orthogonal linear discriminant analysis,” J. Mach. Learn. Res., vol. 7, no. 7, pp. 1183–1204, Jul. 2006.
[14] D. Cai, X. He, and J. Han, “SRDA: An efficient algorithm for large-scale discriminant analysis,” IEEE Trans. Knowl. Data Eng., vol. 20, no. 1, pp. 1–12, Jan. 2008.
[15] D. Chu and G. S. Thye, “A new and fast implementation for null space based linear discriminant analysis,” Pattern Recognit., vol. 43, no. 4, pp. 1373–1379, Apr. 2010.
[16] R. Huang, Q. Liu, H. Lu, and S. Ma, “Solving the small sample size problem of LDA,” in Proc. Int. Conf. Pattern Recognit., Quebec City, QC, Canada, Aug. 2002, pp. 29–32.
[17] R. Rifkin and A. Klautau, “In defense of one-vs-all classification,” J. Mach. Learn. Res., vol. 5, no. 1, pp. 101–141, Jan. 2004.
[18] G. W. Stewart, Matrix Algorithms, Vol. 1: Basic Decompositions. Philadelphia, PA: SIAM, 1998, pp. 1–478.
[19] H. Alt, “Square rooting is as difficult as multiplication,” Computing, vol. 21, no. 3, pp. 221–232, Sep. 1979.
[20] D. E. Knuth, The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, 3rd ed. Reading, MA: Addison-Wesley, 1997, pp. 1–121.
[21] V. Tunali, CRANMED [Online]. Available: http://www.dataminingresearch.com/index.php/category/dataset/
[22] Leukemia [Online]. Available: http://www.genome.wi.mit.edu/MPR
[23] TDT2 [Online]. Available: http://www.zjucadcg.cn/dengcai/Data/TextData.html
[24] ORL [Online]. Available: http://www.cl.cam.ac.uk/ResLearch/DTG/attarchive:pub/data/att_faces.tar
[25] A. Antos, L. Devroye, and L. Györfi, “Lower bounds for Bayes error estimation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 7, pp. 643–645, Jul. 1999.
[26] K. Liu, Y.-Q. Cheng, J.-Y. Yang, and X. Liu, “An efficient algorithm for Foley-Sammon optimal set of discriminant vectors by algebraic method,” Int. J. Pattern Recognit. Artif. Intell., vol. 6, no. 5, pp. 817–829, Dec. 1992.
