
Robust L1-norm two-dimensional linear discriminant analysis

Chun-Na Li^a, Yuan-Hai Shao^a, Nai-Yang Deng^b

^a Zhijiang College, Zhejiang University of Technology, Hangzhou, 310024, P.R. China
^b College of Science, China Agricultural University, Beijing, 100083, P.R. China

Abstract

In this paper, we propose an L1-norm two-dimensional linear discriminant analysis (L1-2DLDA) with robust performance. Different from the conventional two-dimensional linear discriminant analysis with L2-norm (L2-2DLDA), where the optimization problem is transformed into a generalized eigenvalue problem, the optimization problem in our L1-2DLDA is solved by a simple, justifiable iterative technique, and its convergence is guaranteed. Compared with L2-2DLDA, our L1-2DLDA is more robust to outliers and noise since the L1-norm is used. This is supported by our preliminary experiments on a toy example and on face datasets, which show the improvement of our L1-2DLDA over L2-2DLDA.

Key words: linear discriminant analysis; two-dimensional linear discriminant analysis; L1-norm two-dimensional linear discriminant analysis; dimensionality reduction; iterative technique

1. Introduction

In data mining, the input samples are traditionally one-dimensional (1D) vectors, and an important processing step is dimensionality reduction, especially for high-dimensional problems. Among various methods, principal component analysis (PCA) [1, 2] and linear discriminant analysis (LDA) [3-5] are the most popular ones. LDA is particularly interesting since it is directly useful for supervised learning and for studying the class relationship between data points, and hence is of great importance for multivariate data analysis. In fact, LDA has been widely used and developed in classification problems and has shown promising results [6-9].

When the input samples are two-dimensional (2D) matrices, the prevailing approach used to be transforming the matrix samples into corresponding vectors [1, 10-13]. However, as pointed out in [14, 15], this transformation may lose useful structural information. Moreover, converting 2D matrices into 1D vectors leads to very high dimensionality, making it difficult to estimate the involved covariance matrices accurately from a relatively small number of training samples [16]. Recently, much effort has been devoted to performing dimensionality reduction directly on matrix samples, without any vectorization. For unsupervised learning, 2D-PCA was first proposed by Yang et al. [17], and a generalization called bilateral-projection-based 2D-PCA (B2DPCA) was subsequently described in [16, 18, 19]. For supervised learning, L2-norm 2DLDA (L2-2DLDA) has also been studied. The idea of L2-2DLDA was first raised in [20], and extensions and modifications were developed afterwards [21-27]. As we know, LDA does not work, or performs poorly, on the small sample size (SSS) problem: the within-class scatter matrix in LDA may be singular when the training sample size is smaller than the dimension of the data. Compared with LDA, L2-2DLDA [22, 28] can avoid the SSS problem under a weak assumption that is satisfied for most real problems. L2-2DLDA-based algorithms have been shown to be more computationally efficient than, and to outperform, the conventional LDA [21-23].

However, similar to LDA with the L2-norm, the current L2-2DLDA and its modifications behave badly on problems with outliers and noise. To achieve robustness, various approaches have been proposed for both LDA and L2-2DLDA. Lanckriet et al. [29] obtain a robust version of two-class LDA (i.e., FDA) by employing a minimax optimization technique from a probabilistic point of view. In [30], the authors show that robust FDA can be carried out using convex optimization. Li et al. [31, 32] develop incremental tensor-based subspace learning algorithms to achieve robust visual tracking and foreground segmentation. A rotational invariant L1-norm (i.e., R1-norm) based discriminant criterion, together with its 2D forms, has also been proposed to reduce the influence of outliers [33]. The R1-norm is determined by the sum of elements rather than the sum of squared elements, and thus is less sensitive to outliers than the L2-norm.

Above all, applying the L1-norm to LDA-related methods is usually deemed an effective way to deal with outliers and noise.

[Figure 1 here: the curves y = x^2 (dotted) and y = |x| (solid) plotted over x from −6 to 6.]

Figure 1: Illustration of the exaggeration effect of the L2-norm and comparison with that of the L1-norm.

In fact, the robust property of the L1-norm has been addressed and well developed [34-39]. For LDA, the authors in [40-42] consider maximizing the LDA ratio using the L1-norm rather than the L2-norm, which results in more robust performance than conventional LDA. We have to notice that the L1-norm is different from the R1-norm; in particular, the R1-norm is not as robust as the L1-norm, as can be seen from their definitions [34]. The relationship between the L1-norm and its robustness is explained intuitively in Figure 1, where the dotted line and the solid line correspond to the L2-norm and the L1-norm, respectively. From Figure 1, we observe that, compared with the L1-norm distance, the L2-norm exaggerates the influence of outliers or noise to some extent.

As pointed out above, 2DLDA has its own merits over LDA but is also sensitive to outliers and noise. Therefore, it is natural to consider an L1-norm 2DLDA. In addition, Li et al. [14] proposed an L1-norm-based 2DPCA, which is more robust than L2-norm 2DPCA. Motivated by these, this paper proposes an L1-norm 2DLDA (L1-2DLDA), which aims at providing a robust 2DLDA based on the L1-norm. In contrast to the current L2-2DLDA, where the optimization problem is transformed into a generalized eigenvalue problem, our L1-2DLDA needs to solve an L1-norm optimization problem.

This problem is difficult to solve by conventional optimization techniques because it involves a division of two L1-norm formulations, and the objective is neither convex nor differentiable. Here a simple but efficient iterative algorithm is constructed. Furthermore, the rationality of the algorithm is proved theoretically, which implies that the algorithm can be expected to achieve a reasonable local optimum. These conclusions are supported by experimental results on a toy example and three human face databases, which demonstrate the effectiveness of the proposed algorithm and show the superiority of our L1-2DLDA over other related dimensionality reduction methods.

The paper is organized as follows: Section 2 briefly reviews the conventional L2-2DLDA. Section 3 proposes our L1-2DLDA and gives the corresponding theoretical analysis. Section 4 compares our L1-2DLDA with the conventional L2-2DLDA and some related approaches. Finally, concluding remarks are given in Section 5.

2. L2-norm 2DLDA

In this paper, we consider a supervised learning problem in the $d \times n$-dimensional matrix space $\mathbb{R}^{d\times n}$. The training dataset is given by $T = \{(X_1, y_1), \ldots, (X_N, y_N)\}$, where $X_l \in \mathbb{R}^{d\times n}$ is the input matrix and $y_l \in \{1, \ldots, c\}$ is the corresponding label, $l = 1, \ldots, N$. Assume that the $i$-th class contains $N_i$ samples, so that $\sum_{i=1}^{c} N_i = N$. We further write the samples in the $i$-th class as $\{X_{ij}\}$, $j = 1, \ldots, N_i$, $i = 1, \ldots, c$. Let $\bar{X} = \frac{1}{N}\sum_{l=1}^{N} X_l$ be the mean of all sample matrices and $\bar{X}_i = \frac{1}{N_i}\sum_{j=1}^{N_i} X_{ij}$ be the mean of the sample matrices in the $i$-th class.

L2-2DLDA aims at extracting features that well discriminate a set of matrices belonging to $c$ classes. For a matrix sample $X \in \mathbb{R}^{d\times n}$, the idea is to project $X$ onto $W$ by the transformation

$$U = W^T X, \qquad (1)$$

where $W = (w_1, \ldots, w_m) \in \mathbb{R}^{d\times m}$ is the projection matrix and $U \in \mathbb{R}^{m\times n}$ is the projected matrix. We now turn to the specific details of L2-2DLDA. As in conventional one-dimensional LDA, L2-2DLDA finds the

optimal projection matrix $W$, which is usually supposed to have orthogonal columns, by maximizing the between-class scatter and meanwhile minimizing the within-class scatter of the projected samples. Specifically, if

$$S_b = \frac{1}{N}\sum_{i=1}^{c} N_i (\bar{X}_i - \bar{X})(\bar{X}_i - \bar{X})^T \qquad (2)$$

and

$$S_w = \frac{1}{N}\sum_{i=1}^{c}\sum_{j=1}^{N_i} (X_{ij} - \bar{X}_i)(X_{ij} - \bar{X}_i)^T \qquad (3)$$

are the between-class and within-class scatter matrices, then L2-2DLDA tries to find the optimal $W = (w_1, \ldots, w_m) \in \mathbb{R}^{d\times m}$ that satisfies the following optimization problem:

$$\max_W \; \frac{\mathrm{tr}(W^T S_b W)}{\mathrm{tr}(W^T S_w W)} \quad \text{s.t.} \; W^T W = I. \qquad (4)$$
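For concreteness, solving (4) amounts to a generalized symmetric eigenvalue problem. The following is a minimal sketch, assuming NumPy/SciPy and illustrative variable names; it is an illustration, not the authors' code.

```python
# Minimal sketch of L2-2DLDA: form Sb and Sw from the matrix samples
# and keep the m leading eigenvectors of the pencil (Sb, Sw) as W.
import numpy as np
from scipy.linalg import eigh

def l2_2dlda(X, y, m):
    """X: (N, d, n) stack of matrix samples; y: (N,) labels; returns W of shape (d, m)."""
    N, d, n = X.shape
    X_bar = X.mean(axis=0)                   # mean of all sample matrices
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        Xc_bar = Xc.mean(axis=0)             # class mean matrix
        D = Xc_bar - X_bar
        Sb += len(Xc) * (D @ D.T)
        for Xij in Xc:
            Z = Xij - Xc_bar
            Sw += Z @ Z.T
    Sb /= N
    Sw /= N
    # Generalized symmetric eigenproblem Sb w = lambda Sw w; eigh returns
    # eigenvalues in ascending order, so keep the last m eigenvectors
    # (a tiny ridge keeps Sw positive definite).
    vals, vecs = eigh(Sb, Sw + 1e-10 * np.eye(d))
    return vecs[:, ::-1][:, :m]
```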

It is proved that the solution of the optimization problem (4) is given by $W = (w_1, \ldots, w_m)$, where $w_1, \ldots, w_m$ are the orthonormal eigenvectors of $S_w^{-1} S_b$ corresponding to the first $m$ largest eigenvalues [22]. It should be pointed out that by performing the left-hand transformation (1), a $d \times n$ matrix $X$ is projected to an $m \times n$ matrix $U$, and thus the column dimension of $X$ is reduced. Corresponding to (1), a similar right-hand transformation $V = XW$ can also be introduced, where $W = (w_1, \ldots, w_m) \in \mathbb{R}^{n\times m}$. In this case, a $d \times n$ matrix $X$ is projected to a $d \times m$ matrix $V$, and thus the row dimension of $X$ is reduced. Since the implementation procedure for the right-hand transformation is almost the same, we only consider implementing (1) here. Note, however, that their performance is usually different, although they sometimes attain the same recognition rate according to the results in [28].

3. The proposed L1-norm 2DLDA

In this section, we propose our L1-norm 2DLDA (L1-2DLDA). As with L2-2DLDA, only the left-hand linear transformation is considered. In addition, different from the 2D form of the R1-norm LDA, our proposed L1-2DLDA uses the L1-norm distance, and the scatter ratio criterion is adopted rather than the scatter difference criterion, which avoids the selection of a tuning parameter.

3.1. Problem formulation

The proposed L1-norm 2DLDA is a 2D generalization of L1-LDA [40, 42]. We consider the same problem as in Section 2. Denote $Y_i = \bar{X}_i - \bar{X} = (Y_{i1}, \ldots, Y_{in}) \in \mathbb{R}^{d\times n}$ and $Z_{ij} = X_{ij} - \bar{X}_i = (Z_{ij1}, \ldots, Z_{ijn}) \in \mathbb{R}^{d\times n}$, $i = 1, \ldots, c$, $j = 1, \ldots, N_i$, where $X_{ij}$, $\bar{X}_i$, $\bar{X}$ are defined as above. By the properties of the matrix trace, for any matrix $S$ we have $\mathrm{tr}(SS^T) = \|S\|_F^2$, where $\|\cdot\|_F$ denotes the Frobenius norm. Thus, (4) can be rewritten as

$$\max_W \; \frac{\sum_{i=1}^{c} N_i \|W^T Y_i\|_F^2}{\sum_{i=1}^{c}\sum_{j=1}^{N_i} \|W^T Z_{ij}\|_F^2} \quad \text{s.t.} \; W^T W = I. \qquad (5)$$

As we can see, the objective of (5) is based on the L2-norm in nature, since $\|S\|_F^2$ is built from the L2-norms of the columns of $S$. However, it is known that the L2-norm is sensitive to outliers and noise, as shown in Figure 1. In order to reduce this sensitivity, a powerful strategy is to replace the L2-norm by the L1-norm. Thus, corresponding to problem (5), we construct the proposed L1-norm 2DLDA as follows:

$$\max_W \; \frac{\sum_{i=1}^{c} N_i \|W^T Y_i\|_1}{\sum_{i=1}^{c}\sum_{j=1}^{N_i} \|W^T Z_{ij}\|_1} \quad \text{s.t.} \; W^T W = I. \qquad (6)$$

Note that $Y_i$ is the difference between the $i$-th class center and the center of the whole data, while $Z_{ij}$ is the difference between the $j$-th sample in the $i$-th class and the $i$-th class center. The objective of (6) thus seeks a projection matrix $W$ such that the L1-norm between-class distance is maximized while the L1-norm within-class distance is minimized.

As can be seen in (6), the proposed L1-2DLDA requires solving an L1-norm problem. Since the objective of (6) involves a division of two L1-norm formulations and is not differentiable, finding a global solution of (6) for $m > 1$ is difficult. Therefore, we simplify the problem into a series of $m = 1$ optimization problems by using a greedy search strategy.

Thus, in the following, we first solve the problem

$$\max_w \; J(w) = \frac{\sum_{i=1}^{c} N_i \|w^T Y_i\|_1}{\sum_{i=1}^{c}\sum_{j=1}^{N_i} \|w^T Z_{ij}\|_1} \quad \text{subject to} \; w^T w = 1, \qquad (7)$$

which is equivalent to

$$\max_w \; J(w) = \frac{\sum_{i=1}^{c}\sum_{k=1}^{n} N_i |w^T Y_{ik}|}{\sum_{i=1}^{c}\sum_{j=1}^{N_i}\sum_{k=1}^{n} |w^T Z_{ijk}|} \quad \text{subject to} \; w^T w = 1. \qquad (8)$$
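For later reference, the criterion (8) is cheap to evaluate. A minimal sketch, assuming the deviations are stacked as NumPy arrays (Y of shape (c, d, n), Z a list whose i-th entry stacks the $Z_{ij}$ of class i as an (N_i, d, n) array; names are illustrative):

```python
import numpy as np

def l1_objective(w, Y, Z, Ni):
    """J(w) of (8): w is a (d,) vector, Ni holds the class sizes."""
    num = sum(Ni[i] * np.abs(w @ Y[i]).sum() for i in range(len(Ni)))
    den = sum(np.abs(w @ Zij).sum() for Zi in Z for Zij in Zi)
    return num / den
```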

3.2. Algorithm of L1-norm 2DLDA with one projection vector

Problem (7) is difficult to solve by traditional optimization techniques in view of its L1-norm construction in both the numerator and the denominator. Inspired by the ideas used in [34, 40], we take a similar approach here and solve problem (8) by the following iterative algorithm.

Algorithm 1. L1-2DLDA algorithm with one projection vector
Input: Data matrices $Y_i$ and $Z_{ij}$, where $i = 1, \ldots, c$, $j = 1, \ldots, N_i$, and the maximum number of iterations itmax.
Output: $w^*$.
Process:
Step 1: Set the iteration number $t = 0$ and initialize $w(0)$ as a random vector.
Step 2: For $i = 1, \ldots, c$, $j = 1, \ldots, N_i$ and $k = 1, \ldots, n$, define $s_{ik}(t) = \mathrm{sgn}(w(t)^T Y_{ik})$, $r_{ijk}(t) = \mathrm{sgn}(w(t)^T Z_{ijk})$ and set

$$p(t) = \sum_{i=1}^{c}\sum_{k=1}^{n} N_i s_{ik}(t) Y_{ik}, \qquad b(t) = \sum_{i=1}^{c}\sum_{j=1}^{N_i}\sum_{k=1}^{n} r_{ijk}(t) Z_{ijk}. \qquad (9)$$

Step 3: Update $w(t)$ by $w(t+1) = w(t) + \delta g(w(t))$, where $\delta > 0$ is a learning-rate parameter and $g(w(t))$ is given by

$$g(w(t)) = \frac{p(t)}{w(t)^T p(t)} - \frac{b(t)}{w(t)^T b(t)}. \qquad (10)$$

Set $w(t+1) \leftarrow w(t+1)/\|w(t+1)\|_2$. If $w(t)^T p(t) = 0$ or $w(t)^T b(t) = 0$, set $w(t) \leftarrow (w(t) + \Delta w)/\|w(t) + \Delta w\|$, where $\Delta w$ is a small nonzero random vector, and go to Step 2.
Step 4: Convergence check: if $J(w(t))$ does not increase significantly, or $\|w(t+1) - w(t)\|$ is small enough, or the total iteration number is greater than itmax, go to Step 5; otherwise, go to Step 2.
Step 5: Stop the iteration and set $w^* = w(t)$.
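A compact NumPy sketch of Algorithm 1, under the assumptions above (learning rate delta, random perturbation when a denominator vanishes); it reuses the l1_objective helper sketched after (8) and is illustrative, not the authors' released Matlab code:

```python
import numpy as np

def l1_2dlda_one_vector(Y, Z, Ni, delta=0.05, itmax=200, tol=1e-6, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    c, d, n = Y.shape
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)                   # Step 1: random unit start
    J_old = -np.inf
    for t in range(itmax):
        # Step 2: signs s_ik(t), r_ijk(t) and the vectors p(t), b(t) of (9)
        p = np.zeros(d)
        b = np.zeros(d)
        for i in range(c):
            p += Ni[i] * (Y[i] * np.sign(w @ Y[i])).sum(axis=1)
            for Zij in Z[i]:
                b += (Zij * np.sign(w @ Zij)).sum(axis=1)
        wp, wb = w @ p, w @ b
        if np.isclose(wp, 0.0) or np.isclose(wb, 0.0):
            w = w + 1e-3 * rng.standard_normal(d)   # small random Delta-w
            w /= np.linalg.norm(w)
            continue
        # Step 3: move along g(w(t)) of (10), then renormalize
        w = w + delta * (p / wp - b / wb)
        w /= np.linalg.norm(w)
        # Step 4: stop once J(w(t)) no longer increases significantly
        J = l1_objective(w, Y, Z, Ni)
        if abs(J - J_old) < tol:
            break
        J_old = J
    return w
```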

Although the above algorithm is similar to those of [40-42], there are two obvious differences: it deals with matrix inputs instead of vector ones, and a reasonable stopping criterion is explicitly introduced.

3.3. Justification of Algorithm 1

In this section, we validate the rationality of the L1-2DLDA algorithm. Let $w(t)$ be the $t$-th iterate of $w$, and

$$J(w(t)) = \frac{\sum_{i=1}^{c}\sum_{k=1}^{n} N_i |w(t)^T Y_{ik}|}{\sum_{i=1}^{c}\sum_{j=1}^{N_i}\sum_{k=1}^{n} |w(t)^T Z_{ijk}|}. \qquad (11)$$

Theorem 1. By implementing Algorithm 1 in Section 3.2, $J(w(t))$ is nondescending at each iteration.

Proof 1. We use a strategy similar to that in [40]. To prove the theorem, we need the following equality [43], which holds for any vector $v = (v_1, \ldots, v_n)^T \in \mathbb{R}^n$:

$$\|v\|_1 = \min_{u \in \mathbb{R}^n_+} \; \frac{1}{2}\sum_{k=1}^{n} \frac{v_k^2}{u_k} + \frac{1}{2}\|u\|_1. \qquad (12)$$

The minimum is uniquely achieved at $u_k = |v_k|$ for $k = 1, \ldots, n$, where $u = (u_1, \ldots, u_n)^T \in \mathbb{R}^n$. We now rewrite the denominator and the numerator of $J(w(t))$. On one hand, the denominator of (11) is

$$\sum_{i=1}^{c}\sum_{j=1}^{N_i}\sum_{k=1}^{n} |w(t)^T Z_{ijk}| = \frac{1}{2}\sum_{i=1}^{c}\sum_{j=1}^{N_i}\sum_{k=1}^{n} \frac{|w(t)^T Z_{ijk}|^2}{|w(t)^T Z_{ijk}|} + \frac{1}{2}\sum_{i=1}^{c}\sum_{j=1}^{N_i}\sum_{k=1}^{n} |w(t)^T Z_{ijk}|$$
$$= \frac{1}{2}\, w(t)^T \left( \sum_{i=1}^{c}\sum_{j=1}^{N_i}\sum_{k=1}^{n} \frac{Z_{ijk} Z_{ijk}^T}{|w(t)^T Z_{ijk}|} \right) w(t) + \frac{1}{2}\sum_{i=1}^{c}\sum_{j=1}^{N_i} \|a_{ij}(t)\|_1$$
$$= \frac{1}{2}\, w(t)^T C(t)\, w(t) + \frac{1}{2}\sum_{i=1}^{c}\sum_{j=1}^{N_i} \|a_{ij}(t)\|_1, \qquad (13)$$

where

$$C(t) = \sum_{i=1}^{c}\sum_{j=1}^{N_i}\sum_{k=1}^{n} \frac{Z_{ijk} Z_{ijk}^T}{|a_{ij}(t)_k|} \qquad (14)$$

and $a_{ij}(t) = (a_{ij}(t)_1, \ldots, a_{ij}(t)_n)^T$ with $a_{ij}(t)_k = w(t)^T Z_{ijk}$, $k = 1, \ldots, n$. On the other hand, the numerator satisfies $\sum_{i=1}^{c}\sum_{k=1}^{n} N_i |w(t)^T Y_{ik}| = w(t)^T p(t)$ by the definition of $p(t)$ in Step 2 of the algorithm. Thus,

$$J(w(t)) = \frac{w(t)^T p(t)}{\frac{1}{2}\, w(t)^T C(t)\, w(t) + \frac{1}{2}\sum_{i=1}^{c}\sum_{j=1}^{N_i} \|a_{ij}(t)\|_1}. \qquad (15)$$

Since the derivative of $J$ with respect to $w(t)$ is complicated, we introduce a new function:

$$L(\tau) = \ln\!\left( \frac{\tau^T p(t)}{\tau^T C(t)\tau + \sum_{i=1}^{c}\sum_{j=1}^{N_i} \|a_{ij}(t)\|_1} \right), \qquad (16)$$

where $\tau \in \mathbb{R}^d$ is a column vector. Then the gradient of $L(\tau)$ with respect to

$\tau$ is as follows:

$$\frac{\partial L(\tau)}{\partial \tau} = \frac{p(t)}{\tau^T p(t)} - \frac{C(t)\tau}{\tau^T C(t)\tau + \sum_{i=1}^{c}\sum_{j=1}^{N_i} \|a_{ij}(t)\|_1}. \qquad (17)$$

Let $g(\tau) = \frac{\partial L(\tau)}{\partial \tau}$, that is,

$$g(\tau) = \frac{p(t)}{\tau^T p(t)} - \frac{C(t)\tau}{\tau^T C(t)\tau + \sum_{i=1}^{c}\sum_{j=1}^{N_i} \|a_{ij}(t)\|_1}. \qquad (18)$$

Note that, by replacing $\tau$ with $w(t)$, $g(w(t))$ is just as in (10). Then $g(w(t))$ points in a nondescending direction of $L(\tau)$ at $w(t)$. That means that by defining $w(t+1) = w(t) + \delta g(w(t))$, we have $L(w(t)) \le L(w(t+1))$, i.e.,

$$\ln\!\left( \frac{w(t)^T p(t)}{w(t)^T C(t)\, w(t) + \sum_{i=1}^{c}\sum_{j=1}^{N_i} \|a_{ij}(t)\|_1} \right) \le \ln\!\left( \frac{w(t+1)^T p(t)}{w(t+1)^T C(t)\, w(t+1) + \sum_{i=1}^{c}\sum_{j=1}^{N_i} \|a_{ij}(t)\|_1} \right). \qquad (19)$$

This in turn gives

$$\frac{w(t)^T p(t)}{w(t)^T C(t)\, w(t) + \sum_{i=1}^{c}\sum_{j=1}^{N_i} \|a_{ij}(t)\|_1} \le \frac{w(t+1)^T p(t)}{w(t+1)^T C(t)\, w(t+1) + \sum_{i=1}^{c}\sum_{j=1}^{N_i} \|a_{ij}(t)\|_1}. \qquad (20)$$

We now prove $J(w(t)) \le J(w(t+1))$. Firstly, by the definition of $s_{ik}(t+1)$,

$$w(t+1)^T p(t) \le w(t+1)^T p(t+1). \qquad (21)$$

Secondly, by (14) and (12),

$$w(t+1)^T C(t)\, w(t+1) + \sum_{i=1}^{c}\sum_{j=1}^{N_i} \|a_{ij}(t)\|_1 = \sum_{i=1}^{c}\sum_{j=1}^{N_i}\sum_{k=1}^{n} \frac{|w(t+1)^T Z_{ijk}|^2}{|a_{ij}(t)_k|} + \sum_{i=1}^{c}\sum_{j=1}^{N_i} \|a_{ij}(t)\|_1$$
$$\ge \sum_{i=1}^{c}\sum_{j=1}^{N_i} \min_{u \in \mathbb{R}^n_+} \left( \sum_{k=1}^{n} \frac{|w(t+1)^T Z_{ijk}|^2}{u_k} + \|u\|_1 \right)$$
$$= \sum_{i=1}^{c}\sum_{j=1}^{N_i} \left( w(t+1)^T \Big( \sum_{k=1}^{n} \frac{Z_{ijk} Z_{ijk}^T}{|w(t+1)^T Z_{ijk}|} \Big)\, w(t+1) + \|w(t+1)^T Z_{ij}\|_1 \right)$$
$$= w(t+1)^T C(t+1)\, w(t+1) + \sum_{i=1}^{c}\sum_{j=1}^{N_i} \|a_{ij}(t+1)\|_1. \qquad (22)$$

Combining (21) and (22), and in view of (20), we get

$$\frac{w(t)^T p(t)}{w(t)^T C(t)\, w(t) + \sum_{i=1}^{c}\sum_{j=1}^{N_i} \|a_{ij}(t)\|_1} \le \frac{w(t+1)^T p(t+1)}{w(t+1)^T C(t+1)\, w(t+1) + \sum_{i=1}^{c}\sum_{j=1}^{N_i} \|a_{ij}(t+1)\|_1}. \qquad (23)$$

Thus, $J(w(t)) \le J(w(t+1))$ is proved, and the theorem is established.

The above theorem shows that each iteration moves in a nondescending direction. The probability that the trivial case happens, that is, that the objective function (11) remains unchanged during the iteration procedure, is very low. This can be seen from (19), (21) and (22). Specifically, equality in (19) holds when $g(w(t)) = 0$, equality in (21) holds when $w(t)^T Y_{ik} > 0$ for each $i = 1, \ldots, c$, $k = 1, \ldots, n$, and equality in (22) is achieved only when $u_k = |w(t+1)^T Z_{ijk}|$ for each $k = 1, \ldots, n$ and all $i = 1, \ldots, c$, $j = 1, \ldots, N_i$. Since the objective function (11) stands still only when all three equalities hold at the same time, this is a small-probability event. This implies that the algorithm can be expected to achieve a local maximum point of (8). The experimental results in the following section show that our solver indeed reaches local maximum points of problem (8).

3.4. Algorithm of L1-norm 2DLDA with multiple projection vectors

By implementing Algorithm 1, we can obtain the first projection vector $w_1$. To get more than one projection vector, a greedy search strategy is applied: to compute $w_r$ for $r > 1$, we use a deflation technique to extract the remaining basis vectors. Specifically, if $X_l = (x_{l1}, \ldots, x_{ln}) \in \mathbb{R}^{d\times n}$ for $1 \le l \le N$, then the $r$-th ($1 < r \le m$) basis vector $w_r$ is computed from the deflated samples

$$x_{ij}^{\mathrm{new}} = x_{ij} - \sum_{l=1}^{r-1} w_l (w_l^T x_{ij}), \qquad (24)$$

where each $w_l$ is normalized to unit length. Equation (24) means that the new data samples are computed such that the information contained in the previously obtained projection vectors is removed. As proved in [42], $\{w_r\}_{r=1}^{m}$ is orthogonal and thus satisfies $W^T W = I$. This gives the following recursive L1-2DLDA algorithm with multiple projection vectors.

Algorithm 2. L1-2DLDA algorithm with $m$ projection vectors
Input: Data matrices $Y_i$ and $Z_{ij}$, where $i = 1, \ldots, c$, $j = 1, \ldots, N_i$, and the desired number of projection vectors $m$.
Output: $w_1, w_2, \ldots, w_m$ and thus $W = (w_1, w_2, \ldots, w_m)$.
Process:
Step 1: Set $w_0 = 0 \in \mathbb{R}^{d\times 1}$ and $T^0 = \{X_i^0 = X_i\}_{i=1}^{N}$.
Step 2: For $h = 1, \ldots, m$, do the following iteration:
(1) Compute $T^h = \{X_i^h\}_{i=1}^{N}$, where $X_i^h = (x_{i1}^h, x_{i2}^h, \ldots, x_{in}^h)$ is defined by $x_{ij}^h = x_{ij}^{h-1} - \sum_{l=1}^{h-1} w_l (w_l^T x_{ij}^{h-1})$;

(2) Apply Algorithm 1 to the data set $T^h$ and obtain $w_h$.
End for
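A sketch of Algorithm 2 follows. Since $Y_i$ and $Z_{ij}$ are linear in the samples, deflating the samples by (24) is equivalent to applying the projector $I - w_h w_h^T$ to $Y_i$ and $Z_{ij}$ directly, which is what this illustrative version does (names are assumptions, not the authors' code):

```python
import numpy as np

def l1_2dlda(Y, Z, Ni, m, **kwargs):
    """Greedy deflation: returns W of shape (d, m) with orthonormal columns."""
    Y = Y.copy()
    Z = [Zi.copy() for Zi in Z]
    d = Y.shape[1]
    W = np.zeros((d, m))
    for h in range(m):
        w = l1_2dlda_one_vector(Y, Z, Ni, **kwargs)  # Algorithm 1 on T^h
        W[:, h] = w
        P = np.eye(d) - np.outer(w, w)               # deflation of (24)
        Y = np.einsum('de,ien->idn', P, Y)
        Z = [np.einsum('de,jen->jdn', P, Zi) for Zi in Z]
    return W
```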


4. Experiments

4.1. Experimental results

In this section, we experimentally compare our L1-2DLDA with several existing dimensionality reduction methods, including PCA [1], LDA [3-5], PCA+LDA [10], L1-LDA [40, 42], 2D-PCA [16], and L2-2DLDA [20]. All of our experiments are carried out on a PC with a P4 2 GHz CPU and 2 GB of RAM under the Matlab 2012b platform. The performance of the proposed method and of the related methods is evaluated on a toy example as well as on the ORL database^1, the Yale database^2, and the FERET database^3.

The toy example is a three-class dataset with ten samples in each class^4. The first, second, and third classes contain samples of the form

$$\begin{pmatrix} 2 & i \\ 2 & 2 \end{pmatrix}, \qquad \begin{pmatrix} 2 & 2 \\ j & 2 \end{pmatrix}, \qquad \text{and} \qquad \begin{pmatrix} 2 & 2 \\ 2 & k \end{pmatrix},$$

respectively, where $i, j, k = 2.1, 2.2, \ldots, 2.7, 2.8, 40, 50$. Evidently, each class contains two outliers.

For the benchmark face data, the original face images are cropped to 32 × 32 pixels with 256 gray levels per pixel. For the 2D methods, each face image is represented by a 32 × 32 matrix; for the 1D methods, each image is represented by a 1024-dimensional row vector. Each face dataset is preprocessed by scaling the features to [0, 1], that is, dividing by 255, at the beginning of each experiment. To test the performance of the various methods, we first project the test images into the new space obtained by the above dimensionality reduction methods on the training samples; then the nearest neighbor classifier with the Euclidean metric is applied to identify the new face images. To evaluate the robustness of our proposed L1-2DLDA compared with the other methods, we covered each face with blank on the second quarter and on the left half, as shown in Figure 2 and Figure 3, respectively. Since the size of each image is 32 × 32, we reduce the row dimension of each face image to 5, 10, 15, 20, 25, and 30 for the 2D methods, and reduce the dimension of each image to at most 30 for the 1D methods. The learning rate parameter $\delta > 0$ in Algorithm 1 of our L1-2DLDA is set to $\delta = 0.05$ in all the experiments, considering both accuracy and speed. A similar learning rate also appears in L1-LDA and is likewise set to 0.05.

^1 http://www.uk.research.att.com/facedatabase.html
^2 http://cvc.yale.edu/projects/yalefaces/yalefaces.html
^3 http://www.itl.nist.gov/iad/humanid/feret/
^4 http://www.optimal-group.org/Resource/L12DLDA.html
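Under the matrix forms reconstructed above, the toy dataset can be generated as follows (an illustrative sketch):

```python
import numpy as np

def toy_dataset():
    """Three classes of 2x2 matrices: eight 'clean' samples (parameter
    2.1, ..., 2.8) and two outliers (40, 50) per class; only one entry
    of the all-twos matrix varies per class."""
    values = [2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 40.0, 50.0]
    varying_entry = [(0, 1), (1, 0), (1, 1)]     # (row, col) per class
    X, y = [], []
    for label, (r, c) in enumerate(varying_entry):
        for v in values:
            M = np.full((2, 2), 2.0)
            M[r, c] = v
            X.append(M)
            y.append(label)
    return np.array(X), np.array(y)              # shapes (30, 2, 2), (30,)
```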

Figure 2: Sample faces with the second quarter covered with blank, for the first ten images from the ORL, Yale, and FERET databases, respectively.

4.1.1. Results on the toy example

In this subsection, we conduct experiments on the toy example described at the beginning of Section 4.1. We compare our L1-2DLDA with PCA, LDA, PCA+LDA, L1-LDA, 2D-PCA, and L2-2DLDA through the corresponding classification accuracies on the projected dataset, using the nearest neighbor classifier. For L1-2DLDA, 2D-PCA, and L2-2DLDA, each sample was projected into a new space of size 1 × 2; for PCA, LDA, PCA+LDA, and L1-LDA, each sample was reshaped into a 4-dimensional row vector and then projected into a 1 × 1 space and a 1 × 2 space, respectively. The projected samples of the various methods are shown in Figure 6. A random subset with l (= 2, 3, . . . , 7) samples was taken from each class to form the training set, and the rest of the data constituted the testing set. The experiments were repeated ten times, and the accuracy averaged over l = 2, 3, . . . , 7 training samples per class is reported.

Table 1 summarizes the results on the toy example, including mean accuracies and their standard deviations. From Table 1, we can see that our L1-2DLDA attains the best classification accuracy on this synthetic dataset, with the smallest variance. We can also observe that the performance of PCA, LDA, and PCA+LDA is influenced greatly by the number of extracted features; even when two features are selected, they still perform the worst, which indicates that these three methods perform poorly on such a dataset with outliers.
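The evaluation protocol just described (project with $U = W^T X$, then classify by the nearest training neighbour under the Euclidean metric) can be sketched as follows; names are illustrative:

```python
import numpy as np

def nn_accuracy(W, X_train, y_train, X_test, y_test):
    """Project matrix samples and apply the 1-nearest-neighbor classifier."""
    U_tr = np.einsum('dm,ldn->lmn', W, X_train).reshape(len(X_train), -1)
    U_te = np.einsum('dm,ldn->lmn', W, X_test).reshape(len(X_test), -1)
    pred = np.array([y_train[np.argmin(np.linalg.norm(U_tr - u, axis=1))]
                     for u in U_te])
    return float((pred == y_test).mean())
```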

Figure 3: Sample faces with the left half covered with blank, for the first ten images from the ORL, Yale, and FERET databases, respectively.

[Figure 4 here: curves of J(w(t)) against the iteration number t.]

Figure 4: Value of the L1-2DLDA objective function J(w(t)) with respect to t on the toy example. Twenty random initials are taken.

Table 1: Comparisons of different methods in terms of accuracies (%) on the synthetic dataset. For 1D methods, (1) denotes the situation when data are projected into a 1 × 1 space, and (2) denotes the situation when data are projected into a 1 × 2 space.

Methods      PCA(1)       LDA(1)       PCA+LDA(1)   L1-LDA(1)
Accuracy(%)  49.58±9.45   89.81±7.53   51.60±7.87   92.87±7.75
Methods      PCA(2)       LDA(2)       PCA+LDA(2)   L1-LDA(2)
Accuracy(%)  93.33±6.24   94.52±6.83   91.75±9.17   95.50±8.43
Methods      2D-PCA       L2-2DLDA     L1-2DLDA
Accuracy(%)  95.55±6.57   95.50±7.18   96.92±5.74

On the other hand, the 2D methods obviously obtain higher accuracies. Besides, it can be observed from Table 1 that L1-LDA also performs better than PCA, LDA, and PCA+LDA when dealing with outliers. Hence, in the following experiments on the three face databases, we only compare L1-LDA, 2D-PCA, and L2-2DLDA with L1-2DLDA.

To observe the convergence of the proposed algorithm, we present the variation of the objective function with respect to the iteration number t in Figure 4, where twenty random initials are taken. As can be seen, J(w(t)) increases monotonically with respect to t and converges after 6 to 14 iterations. This implies that the algorithm is able to find a local optimum. To further show the convergence of our L1-2DLDA, we present the influence of random initial values on the classification accuracy and on the optimal objective value in Figure 5 (a) and (b), respectively. The experiments are repeated twenty times. In Figure 5(a), we see that with different random initials, the corresponding accuracies are concentrated in a rather small interval [95.40, 97.92]. From Figure 5(b), we see that when the random initial changes, the optimal objective value of L1-2DLDA is very stable and varies around 0.8768. Thus, our L1-2DLDA performs steadily under random initialization.

4.1.2. Results on the ORL database

In this subsection, we compare the performance of our L1-2DLDA with L1-LDA, 2D-PCA, and L2-2DLDA on the ORL database.

[Figure 5 here: panel (a) accuracies (median 96.34) and panel (b) optimal objective values (median 0.87682) of L1-2DLDA against random initials.]

Figure 5: (a) Accuracies of L1-2DLDA on the toy example with respect to random initials; (b) optimal objective values of L1-2DLDA on the toy example with respect to random initials.

The ORL database is used to test the behaviour of the proposed method under minor variations of rotation and scaling. It contains 400 grayscale images, of size 112 × 92, of 40 individuals; each image is cropped to the size 32 × 32. There are 10 images per subject, one per facial expression or lighting configuration. All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position, with tolerance for some side movement. A random subset with l (= 2, 3, . . . , 7) images per individual was taken to form the training set, and the rest of the data constituted the testing set. For each given l, the average results over 50 random splits are considered.

Table 2 lists the results on the ORL database. The recognition accuracies and the reduced dimensions are shown in the table, with the dimensions given in brackets under the corresponding accuracies. The best mean accuracy is shown in bold. We can see from Table 2 that our L1-2DLDA performs the best in both circumstances, i.e., when the second quarter and when the left half of the faces are covered with blank. Meanwhile, we notice that the performance of our L1-2DLDA is better than those of L1-LDA and L2-2DLDA, which indicates that the 2D treatment is better than the 1D one when the L1-norm is applied, and that applying the L1-norm to 2DLDA is more robust than the conventional L2-2DLDA. In addition, the recognition accuracies for the ORL faces covered on the second quarter are higher than those for faces covered on the left half, which is consistent with common sense.

[Figure 6 here: scatter plots of the projected synthetic dataset (three classes) for each method; panels (a) PCA(1), (b) LDA(1), (c) PCA+LDA(1), (d) L1-LDA(1), (e) PCA(2), (f) LDA(2), (g) PCA+LDA(2), (h) L1-LDA(2), (i) 2D-PCA, (j) L2-2DLDA, (k) L1-2DLDA.]

Figure 6: Projected synthetic dataset for various methods.

Table 2: Comparison of different methods in terms of average accuracies (%) and reduced dimensions on the ORL database, with the second quarter (marked by (1)) and the left half (marked by (2)) of the faces covered with blank, respectively.

Number of training samples per class (1)
Methods      2        3        4        5        6        7        Mean
L1-LDA       26.54    31.90    36.49    40.30    44.59    47.68    37.92
             (30)     (30)     (30)     (30)     (30)     (30)
2D-PCA       81.15    87.98    91.25    93.50    96.46    96.39    91.12
             (25×32)  (15×32)  (10×32)  (15×32)  (25×32)  (30×32)
L2-2DLDA     81.43    88.22    92.11    94.68    95.81    97.43    91.61
             (30×32)  (30×32)  (30×32)  (25×32)  (25×32)  (30×32)
L1-2DLDA     82.67    88.99    92.79    95.39    96.38    97.77    92.33
             (30×32)  (20×32)  (10×32)  (10×32)  (15×32)  (30×32)

Number of training samples per class (2)
Methods      2        3        4        5        6        7        Mean
L1-LDA       24.93    30.29    34.68    38.54    42.21    45.42    36.01
             (30)     (30)     (30)     (30)     (30)     (30)
2D-PCA       73.96    82.13    86.69    90.29    91.70    94.58    86.55
             (15×32)  (15×32)  (15×32)  (20×32)  (25×32)  (30×32)
L2-2DLDA     73.58    81.72    86.43    90.09    91.68    93.33    86.13
             (30×32)  (10×32)  (10×32)  (15×32)  (30×32)  (30×32)
L1-2DLDA     74.09    82.38    86.89    90.64    91.81    94.90    86.79
             (30×32)  (30×32)  (30×32)  (15×32)  (10×32)  (30×32)

4.1.3. Results on the Yale database

In this subsection, we compare the performance of our L1-2DLDA with L1-LDA, 2D-PCA, and L2-2DLDA on the Yale database.

The Yale database contains 165 grayscale images, of size 225 × 195, of 15 individuals; each image is cropped to the size 32 × 32. There are 11 images per subject, one per facial expression or lighting configuration. This database is considered in order to evaluate the performance of the methods when facial expression and lighting conditions are changed.

A random subset with l (= 2, 3, . . . , 7) images per individual was taken to form the training set, and the rest of the database constituted the testing set. As mentioned earlier, for each given l, the average results over 50 random splits are considered.

Table 3: Comparison of different methods in terms of average accuracies (%) and reduced dimensions on the Yale database, with the second quarter (marked by (1)) and the left half (marked by (2)) of the faces covered with blank, respectively.

Number of training samples per class (1)
Methods      2        3        4        5        6        7        Mean
L1-LDA       28.03    31.25    34.74    37.71    39.31    43.33    35.73
             (30)     (25)     (30)     (30)     (30)     (30)
2D-PCA       63.21    72.22    72.70    75.19    77.33    78.89    73.25
             (10×32)  (20×32)  (10×32)  (25×32)  (25×32)  (25×32)
L2-2DLDA     67.56    72.00    74.04    75.33    75.89    76.67    73.58
             (20×32)  (15×32)  (30×32)  (30×32)  (30×32)  (20×32)
L1-2DLDA     67.38    72.40    74.84    76.04    76.67    77.37    74.12
             (10×32)  (10×32)  (10×32)  (10×32)  (10×32)  (10×32)

Number of training samples per class (2)
Methods      2        3        4        5        6        7        Mean
L1-LDA       21.92    26.20    29.75    33.49    35.23    38.43    30.84
             (30)     (30)     (30)     (30)     (25)     (30)
2D-PCA       63.56    70.25    71.24    72.22    70.00    73.67    70.15
             (20×32)  (20×32)  (25×32)  (25×32)  (25×32)  (25×32)
L2-2DLDA     62.67    70.83    75.85    78.83    79.86    80.11    74.69
             (15×32)  (15×32)  (15×32)  (15×32)  (15×32)  (15×32)
L1-2DLDA     64.43    68.95    71.28    72.73    72.85    73.30    70.59
             (25×32)  (25×32)  (30×32)  (30×32)  (30×32)  (20×32)

The recognition results are listed in Table 3, with the best mean accuracy shown in bold. Similar to the experimental results obtained on the ORL database, for all the methods the recognition accuracies for faces covered on the second quarter are higher than those for faces covered on the left half. One can observe from Table 3 that when the second quarter of the faces is covered, our L1-2DLDA performs the best.

In particular, in this case the performance of our L1-2DLDA is better than those of L1-LDA and L2-2DLDA, which further confirms the results of the above section. For the situation where the left half of the faces is covered, our L1-2DLDA is only the second best, performing worse than L2-2DLDA. But, as we will see in Table 5, the overall average accuracy of L1-2DLDA is better than that of L2-2DLDA on the Yale database in both cases.

4.1.4. Results on the FERET database

In this subsection, we compare the performance of our L1-2DLDA with L1-LDA, 2D-PCA, and L2-2DLDA on the FERET database.

From the FERET database, we select a subset containing 1400 grayscale images of 200 individuals, each image cropped to the size 32 × 32. For each individual there are 7 face images with expression, illumination, and age variation. A random subset with l (= 2, 3, . . . , 6) images per individual was taken to form the training set, and the rest of the database constituted the testing set. For each given l, the average results over 50 random splits are considered.

The recognition results are shown in Table 4. From Table 4, we can see that when the second quarter of a face is covered, our L1-2DLDA is the second best, while L2-2DLDA attains the highest accuracy. However, when the left half of the faces is covered, our L1-2DLDA outperforms the other three methods. This indicates that on the FERET dataset, when more noise is added, our L1-2DLDA behaves more robustly. Moreover, as we will see in Table 5, the overall average accuracy of L1-2DLDA is better than that of L2-2DLDA on the FERET database in both cases.

4.2. Results analysis

In this subsection, we analyze the performance of our L1-2DLDA in comparison with L1-LDA, 2D-PCA, and L2-2DLDA from different aspects.

We first evaluate the robustness of our L1-2DLDA to outliers and noise. To this end, we compare the mean accuracies over all reduced dimensions and all l training sizes for the different methods on the ORL, Yale, and FERET databases, with the second quarter and the left half of the faces covered with blank, respectively. Here l = 2, . . . , 7 for the ORL and Yale databases, and l = 2, . . . , 6 for the FERET database. The corresponding classification accuracies and standard deviations are listed in Table 5. We also list the p-values at the 5% significance level, with ten repeated experiments on each dataset. Each p-value was calculated by performing a paired t-test [44], comparing our L1-2DLDA with each of the other methods under the null hypothesis that there is no difference between the test-set accuracy distributions.


Table 4: Comparison of different methods in terms of average accuracies (%) and reduced dimensions on the FERET database, with the second quarter (marked by (1)) and the left half (marked by (2)) of the faces covered with blank, respectively.

Number of training samples per class (1)
Methods      2        3        4        5        6        Mean
L1-LDA       6.09     6.99     7.78     8.53     8.58     7.59
             (30)     (20)     (20)     (25)     (25)
2D-PCA       30.19    36.97    41.94    46.40    50.27    41.15
             (10×32)  (10×32)  (15×32)  (15×32)  (15×32)
L2-2DLDA     31.57    38.91    44.13    48.96    52.66    43.24
             (25×32)  (20×32)  (25×32)  (25×32)  (25×32)
L1-2DLDA     31.14    37.93    42.86    47.38    51.01    42.06
             (30×32)  (30×32)  (30×32)  (30×32)  (20×32)

Number of training samples per class (2)
Methods      2        3        4        5        6        Mean
L1-LDA       5.56     6.29     6.88     7.15     8.40     6.86
             (25)     (30)     (25)     (25)     (25)
2D-PCA       21.59    26.28    30.36    34.14    36.59    29.79
             (20×32)  (20×32)  (20×32)  (15×32)  (20×32)
L2-2DLDA     21.88    26.77    30.13    33.52    33.66    29.19
             (20×32)  (30×32)  (30×32)  (25×32)  (30×32)
L1-2DLDA     22.73    27.45    31.40    35.02    37.37    30.79
             (20×32)  (30×32)  (20×32)  (20×32)  (30×32)

From Table 5, we can see that our L1-2DLDA outperforms the other three methods on all three datasets when the overall average accuracy is considered. This indicates the robustness of L1-2DLDA when the dimensions and the training sample numbers are varied. Besides, we see that the variances of L1-2DLDA on all the datasets are much smaller than those of L2-2DLDA, confirming the robustness improvement of our L1-2DLDA over L2-2DLDA. From the p-values, we see that the performance of our L1-2DLDA is statistically different from that of the other three methods on most datasets.
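A sketch of how such p-values can be computed with SciPy; acc_ours and acc_other stand for the ten per-repeat accuracies of L1-2DLDA and of one competing method on one dataset (illustrative names):

```python
from scipy.stats import ttest_rel

def significance(acc_ours, acc_other, alpha=0.05):
    """Paired t-test over repeated-split accuracies of two methods."""
    t_stat, p = ttest_rel(acc_ours, acc_other)
    return p, p < alpha      # reject "no difference" at the 5% level
```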


Figure 7: Sample faces with the second quarter covered with random Gaussian noise, for the first ten images from the ORL, Yale, and FERET databases, respectively.

Figure 8: Sample faces with the left half covered with random Gaussian noise, for the first ten images from the ORL, Yale, and FERET databases, respectively.


Table 5: Comparison of mean accuracies (%) over all reduced dimensions and all l training sizes for different methods on the ORL, Yale, and FERET databases, with the second quarter (marked by (1)) and the left half (marked by (2)) of the faces covered with blank, respectively.

Datasets\Methods      L1-LDA        2D-PCA        L2-2DLDA      L1-2DLDA
ORL database(1)       33.82±3.59    91.07±0.23    84.25±7.13    91.72±0.54
                      0.0000        0.0220        0.0284        –
ORL database(2)       32.75±3.75    85.65±1.05    75.15±11.99   86.08±1.10
                      0.0000        0.5133        0.0505        –
Yale database(1)      34.22±1.24    72.87±0.17    72.25±0.97    73.60±0.26
                      0.0000        0.0002        0.0081        –
Yale database(2)      28.90±2.14    69.06±1.47    69.68±4.34    69.91±0.73
                      0.0000        0.2370        0.9010        –
FERET database(1)     6.90±0.94     40.60±0.91    35.71±9.01    41.30±1.22
                      0.0000        0.2856        0.1631        –
FERET database(2)     5.96±0.76     29.16±1.18    24.37±6.12    29.77±1.84
                      0.0000        0.5114        0.0652        –

To further evaluate the robustness of our L1-2DLDA, we add random Gaussian noise of mean 0 and variance 0.01 to the covered faces above, as shown in Figure 7 and Figure 8, respectively. Specifically, we compare the mean accuracies over all reduced dimensions and all l training sizes for the different methods on the ORL, Yale, and FERET databases, with the second quarter and the left half of the faces covered with random Gaussian noise, respectively. Here l = 2, . . . , 7 for the ORL and Yale databases, and l = 2, . . . , 6 for the FERET database. The corresponding results are listed in Table 6. As in the former case, we observe that the performance of our L1-2DLDA is again better than that of the other three methods on all the databases. In particular, for the Yale database with the second quarter of the faces covered, although L1-2DLDA and L2-2DLDA possess the same mean accuracy, the variance of our L1-2DLDA is 0.29, much smaller than the 3.96 of L2-2DLDA. All in all, our L1-2DLDA is superior to the other three methods on the noisy datasets.

Now we consider the influence of the reduced dimensions on our L1-2DLDA as well as on the other three methods. In order to study this relationship, we compare the recognition abilities of our L1-2DLDA and of the other methods for dimensionality reduction on the above three face databases, with the feature dimensions varied.

[Figure 9 here: classification accuracy against reduced dimension for L1-2DLDA, L2-2DLDA, L1-LDA, and 2D-PCA; panels (a)/(d) ORL database, (b)/(e) Yale database, (c)/(f) FERET database.]

Figure 9: Relationship of dimension and classification accuracy when the left half and the second quarter of the faces are covered with blank, respectively.

[Figure 10 here: classification accuracy against reduced dimension for L1-2DLDA, L2-2DLDA, L1-LDA, and 2D-PCA; panels (a)/(d) ORL database, (b)/(e) Yale database, (c)/(f) FERET database.]

Figure 10: Relationship of dimension and classification accuracy when the left half and the second quarter of the faces are covered with Gaussian noise, respectively.

Table 6: Comparison of mean accuracies (%) over all reduced dimensions and all l training sizes for different methods on the ORL, Yale, and FERET databases, with the second quarter (marked by (1)) and the left half (marked by (2)) of the faces covered with Gaussian noise, respectively.

Datasets\Methods      L1-LDA        2D-PCA        L2-2DLDA      L1-2DLDA
ORL database(1)       34.75±2.47    43.29±34.24   84.21±6.56    91.22±0.84
                      0.0000        0.0065        0.0267        –
ORL database(2)       24.95±2.02    85.13±0.81    74.30±13.32   85.72±1.51
                      0.0000        0.4223        0.0635        –
Yale database(1)      33.62±2.38    41.39±29.09   73.82±3.96    73.82±0.29
                      0.0000        0.0212        0.9998        –
Yale database(2)      26.27±2.20    68.76±0.80    65.26±11.76   69.92±0.39
                      0.0000        0.0323        0.3810        –
FERET database(1)     6.43±1.01     39.73±0.88    27.48±13.36   40.24±0.74
                      0.0000        0.3000        0.0416        –
FERET database(2)     5.65±1.02     28.3±1.07     19.74±8.78    38.14±2.59
                      0.0000        0.0000        0.0006        –

For each method and a fixed dimension p, e.g., p = 10, the corresponding recognition accuracy is the average over the l-training accuracies, where l = 2, . . . , 7 for the ORL and Yale databases, and l = 2, . . . , 6 for the FERET database. As before, each experiment is repeated 50 times with random splits. The corresponding results are given in Figures 9 and 10, for the situations where the left half and the second quarter of the faces are covered with blank and with Gaussian noise, respectively. For these two cases, it can be seen from Figure 9 and Figure 10 that when the dimension is low, our L1-2DLDA performs the best in most cases. Also, from these two sets of figures we observe that the variation of dimensions has little influence on the performance of our L1-2DLDA; thus L1-2DLDA is the most stable of the four methods on these datasets. The slight influence of dimensions on L1-2DLDA means that small dimensions are usually enough to obtain good recognition accuracy, which benefits high-dimensional computation. In fact, if we view the objective of problem (6) as a function of m, where m = 1, 2, . . . , n, and denote it by J(m), then if for some m, |J(m) − J(m−1)| < ε while |J(m−1) − J(i)| > ε for some ε > 0 and i = 1, 2, . . . , m − 1, we may deem m projection vectors adequate to contain almost all of the discriminative information.
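This heuristic can be sketched as follows, where J is the list [J(1), ..., J(n)]; note that we read the condition on the earlier indices as i = 1, ..., m − 2 (an assumption, since |J(m−1) − J(m−1)| = 0 could never exceed ε):

```python
def choose_m(J, eps):
    """Pick the number m of projection vectors from the objective values
    of (6); J[k] holds J(k+1), i.e., J(1) = J[0]."""
    n = len(J)
    for m in range(2, n + 1):                    # candidate m, 1-based
        gap_small = abs(J[m - 1] - J[m - 2]) < eps
        earlier_large = all(abs(J[m - 2] - J[i]) > eps for i in range(m - 2))
        if gap_small and earlier_large:
            return m
    return n
```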

It is also worth mentioning that L2-2DLDA is vulnerable to the variation of dimensions, and its recognition accuracy may descend as the dimension increases. For example, in Figure 9(e), L2-2DLDA outperforms our L1-2DLDA only when the dimension is 15, and its accuracy descends afterwards. We speculate that the reason is that when the reduced dimension is higher than 15, some useless or interfering information is also retained, which causes a negative influence. The same phenomenon happens when the faces are covered with Gaussian noise. In contrast, our L1-2DLDA is very stable with respect to the variation of dimensions, with a steady uptrend along the dimensions.

5. Conclusions

For two-dimensional (2D) matrix learning, a robust dimensionality reduction method termed L1-2DLDA is proposed in this paper. The main difference between our L1-2DLDA and the existing two-dimensional LDA is that the latter considers the L2-norm in its objective function while the former uses the L1-norm. This makes our method remarkably more robust to outliers and noise, and the preliminary experiments show the improvement of our L1-2DLDA over L2-2DLDA. The corresponding L1-2DLDA Matlab code can be downloaded from http://www.optimal-group.org/Resource/L12DLDA.html. In the future, we will consider more efficient algorithms for solving our L1-2DLDA. In addition, a two-sided L1-2DLDA, corresponding to the two-sided L2-2DLDA in [21], is also of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 11201426, No. 11426200, and No. 11371365), the Zhejiang Provincial Natural Science Foundation of China (No. LQ12A01020, No. LQ13F030010, and No. LQ14G010004), the Ministry of Education, Humanities and Social Sciences Research Project of China (No. 13YJC910011), and the Scientific Research Fund of Zhejiang Provincial Education Department (No. Y201432746).


References

[1] Turk M, Pentland A. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 1991, 3(1): 71-86.
[2] Martínez A M, Kak A C. PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23(2): 228-233.
[3] Fisher R A. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 1936, 7(2): 179-188.
[4] Fukunaga K. Introduction to Statistical Pattern Recognition, second edition. Academic Press, New York, 1991.
[5] Swets D L, Weng J J. Using discriminant eigenfeatures for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1996, 18(8): 831-836.
[6] Ye J, Li Q. A two-stage linear discriminant analysis via QR-decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(6): 929-941.
[7] Lu J, Plataniotis K N, Venetsanopoulos A N. Regularization studies of linear discriminant analysis in small sample size scenarios with application to face recognition. Pattern Recognition Letters, 2005, 26(2): 181-191.
[8] Ye J. Least squares linear discriminant analysis. Proceedings of the 24th International Conference on Machine Learning, ACM, 2007: 1087-1093.
[9] Guo Y, Hastie T, Tibshirani R. Regularized linear discriminant analysis and its application in microarrays. Biostatistics, 2007, 8(1): 86-100.
[10] Belhumeur P N, Hespanha J P, Kriegman D J. Eigenfaces vs. fisherfaces: recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997, 19(7): 711-720.
[11] Shakunaga T, Shigenari K. Decomposed eigenface for face recognition under various lighting conditions. Proceedings of the Computer Society Conference on Computer Vision and Pattern Recognition, 2001, 1: 864-871.

[12] Batur A U, Hayes M H. Linear subspace for illumination robust face recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2001, 2: 296-301.
[13] Cai D, He X, Han J, Zhang H J. Orthogonal laplacianfaces for face recognition. IEEE Transactions on Image Processing, 2006, 15(11): 3608-3614.
[14] Li X, Pang Y, Yuan Y. L1-norm-based 2DPCA. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 2010, 40(4): 1170-1175.
[15] Wan M, Lai Z, Jin Z. Feature extraction using two-dimensional local graph embedding based on maximum margin criterion. Applied Mathematics and Computation, 2011, 217(23): 9659-9668.
[16] Yang J, Zhang D, Frangi A F, et al. Two-dimensional PCA: a new approach to appearance-based face representation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 26(1): 131-137.
[17] Yang J, Yang J. From image vector to matrix: a straightforward image projection technique: IMPCA vs. PCA. Pattern Recognition, 2002, 35(9): 1997-1999.
[18] Kong H, Li X, Wang L, et al. Generalized 2D principal component analysis. Proceedings of the International Joint Conference on Neural Networks, 2005, 1: 108-113.
[19] Kong H, Wang L, Teoh E K, Li X, Wang J G, Venkateswarlu R. Generalized 2D principal component analysis for face image representation and recognition. Neural Networks, 2005, 18(5-6): 585-594.
[20] Liu K, Cheng Y Q, Yang J Y. Algebraic feature extraction for image recognition based on an optimal discriminant criterion. Pattern Recognition, 1993, 26(6): 903-911.
[21] Yang J, Zhang D, Yong X, Yang J Y. Two-dimensional discriminant transform for face recognition. Pattern Recognition, 2005, 38: 1125-1129.

[22] Li M, Yuan B. 2D-LDA: a statistical linear discriminant analysis for image matrix. Pattern Recognition Letters, 2005, 26(5): 527-532.
[23] Xiong H, Swamy M N S, Ahmad M O. Two-dimensional FLD for face recognition. Pattern Recognition, 2005, 38(7): 1121-1124.
[24] Noushath S, Hemantha Kumar G, Shivakumara P. (2D)2 LDA: an efficient approach for face recognition. Pattern Recognition, 2006, 39(7): 1396-1400.
[25] Jing X Y, Wong H S, Zhang D. Face recognition based on 2D Fisherface approach. Pattern Recognition, 2006, 39(4): 707-710.
[26] Chawla N V, Bowyer K W. Random subspaces and subsampling for 2D face recognition. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, 2: 582-589.
[27] Fan Z, Xu Y, Zhang D. Local linear discriminant analysis framework using sample neighbors. IEEE Transactions on Neural Networks, 2011, 22(7): 1119-1132.
[28] Kong H, Wang L, Teoh E K, Wang J G, Venkateswarlu R. A framework of 2D Fisher discriminant analysis: application to face recognition with small number of training samples. Proceedings of Computer Vision and Pattern Recognition, 2005, 2: 1083-1088.
[29] Lanckriet G R G, Ghaoui L E, Bhattacharyya C, et al. A robust minimax approach to classification. The Journal of Machine Learning Research, 2003, 3: 555-582.
[30] Kim S J, Magnani A, Boyd S. Robust Fisher discriminant analysis. Advances in Neural Information Processing Systems, 2005: 659-666.
[31] Li X, Hu W, Zhang Z, Zhang X. Robust foreground segmentation based on two effective background models. Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, 2008: 223-228.
[32] Li X, Hu W, Zhang Z, Luo G. Robust visual tracking based on incremental tensor subspace learning. IEEE 11th International Conference on Computer Vision, 2007: 1-8.

[33] Li X, Hu W, Wang H, Zhang Z. Linear discriminant analysis using rotational invariant L1 norm. Neurocomputing, 2010, 73(13): 2571-2579.
[34] Kwak N. Principal component analysis based on L1-norm maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, 30(9): 1672-1680.
[35] De la Torre F, Black M J. A framework for robust subspace learning. International Journal of Computer Vision, 2003, 54: 117-142.
[36] Aanæs H, Fisker R, Åström K, Carstensen J M. Robust factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(9): 1215-1225.
[37] Kim S J, Koh K, Lustig M, Boyd S, Gorinevsky D. An interior-point method for large-scale l1-regularized least squares. IEEE Journal of Selected Topics in Signal Processing, 2007, 1(4): 606-617.
[38] Jeyakumar V, Li G, Suthaharan S. Support vector machine classifiers with uncertain knowledge sets via robust optimization. Optimization, 2014, 63(7): 1099-1116.
[39] Deng N Y, Tian Y J, Zhang C H. Support Vector Machines: Optimization Based Theory, Algorithms, and Extensions. CRC Press, Boca Raton, FL, 2012.
[40] Wang H, Tang Q, Zheng W. L1-norm-based common spatial patterns. IEEE Transactions on Biomedical Engineering, 2012, 59(3): 653-662.
[41] Wang H, Lu X, Hu Z, Zheng W. Fisher discriminant analysis with L1-norm. IEEE Transactions on Cybernetics, 2014, 44(6): 828-842.
[42] Zhong F, Zhang J. Linear discriminant analysis based on L1-norm maximization. IEEE Transactions on Image Processing, 2013, 22(8): 3018-3027.
[43] Jenatton R, Obozinski G, Bach F. Structured sparse principal component analysis. International Conference on Artificial Intelligence and Statistics, 2010: 1-8.


[44] Nene S A, Nayar S K, Murase H. Columbia Object Image Library (COIL-20). New York (US): Columbia University, Technical Report CUCS-005-96, February 1996.


708KB Sizes 0 Downloads 7 Views