
Robust Nonnegative Patch Alignment for Dimensionality Reduction

Xinge You, Senior Member, IEEE, Weihua Ou, Chun Lung Philip Chen, Fellow, IEEE, Qiang Li, Ziqi Zhu, and Yuanyan Tang, Fellow, IEEE

Abstract—Dimensionality reduction is an important method to analyze high-dimensional data and has many applications in pattern recognition and computer vision. In this paper, we propose a robust nonnegative patch alignment for dimensionality reduction, which includes a reconstruction error term and a whole-alignment term. We use the correntropy-induced metric to measure the reconstruction error, in which the weight is learned adaptively for each entry. For the whole alignment, we propose locality-preserving robust nonnegative patch alignment (LP-RNA) and sparsity-preserving robust nonnegative patch alignment (SP-RNA), which are unsupervised and supervised, respectively. In the LP-RNA, we propose a locally sparse graph to encode the local geometric structure of the manifold embedded in high-dimensional space. In particular, we select a relatively large number p of nearest neighbors for each sample and then obtain the sparse representation with respect to these neighbors. The sparse representation is used to build a graph, which simultaneously enjoys locality, sparseness, and robustness. In the SP-RNA, we simultaneously use local geometric structure and discriminative information, in which the sparse reconstruction coefficient is used to characterize the local geometric structure and a weighted distance is used to measure the separability of different classes. For the induced nonconvex objective function, we formulate it as a weighted nonnegative matrix factorization based on half-quadratic optimization. We propose a multiplicative update rule to solve this function and show that the objective function converges to a local optimum. Several experimental results on synthetic and real data sets demonstrate that the learned representation is more discriminative and robust than most existing dimensionality reduction methods.

Index Terms—Correntropy-induced metric (CIM), dimensionality reduction, locality preserving (LP), robust nonnegative patch alignment (RNPA), sparsity preserving (SP).

Manuscript received August 20, 2013; revised January 11, 2015; accepted January 12, 2015. Date of publication May 4, 2015; date of current version October 16, 2015. This work was supported in part by the National Natural Science Foundation of China under Grant 61272203 and Grant 61402122, in part by the International Scientific and Technological Cooperation Project under Grant 2011DFA12180, and in part by the Ph.D. Programs Foundation through the Ministry of Education of China under Grant 20110142110060.

X. You and Z. Zhu are with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: [email protected]; [email protected]). W. Ou is with the School of Mathematics and Computer Science, Guizhou Normal University, Guiyang 550001, China (e-mail: [email protected]). C. L. P. Chen and Y. Tang are with the Faculty of Science and Technology, University of Macau, Macau 853, China (e-mail: [email protected]; [email protected]). Q. Li is with the Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology at Sydney, Sydney, NSW 2007, Australia (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNNLS.2015.2393886

I. INTRODUCTION

IN MANY applications, real-world data, such as digital images, videos, and speech signals, usually have a high dimensionality. Dimensionality reduction is an important method to understand the phenomena underlying such high-dimensional data because it mitigates the curse of dimensionality and other undesired properties of high-dimensional spaces [1]. In the past decades, many methods for dimensionality reduction have been proposed; these can be categorized into two classes: 1) supervised and 2) unsupervised. Representative supervised methods include marginal Fisher analysis [2], the maximum margin criterion [3], linear discriminant analysis (LDA) [4], and its variations [5]. Unsupervised methods include principal component analysis (PCA) [6], Laplacian eigenmaps [7], locally linear embedding (LLE) [8], and locality preserving projections (LPP) [9]. These methods have wide applications in dimensionality reduction but are not robust against noisy data. Recently, great efforts have been made to improve the robustness of these methods, for example, robust PCA [10], [11]. Although successful in many applications, they are inconsistent with the psychological evidence of parts-based representation in the human brain.

Nonnegative matrix factorization (NMF) [12] is a powerful dimensionality reduction method, which automatically extracts parts-based and meaningful features under nonnegativity constraints [13]. Learning parts-based representations based on NMF provides a new way for dimensionality reduction, and many related NMF methods have been proposed. They can be categorized into two main classes. The first class improves the robustness of NMF. For example, the $\ell_{2,1}$-norm [14], [15], the $\ell_1$-norm [16], the earth mover's distance metric [17], and the correntropy-induced metric (CIM) [18] have been adopted as the error function. Of these, CIM-NMF performs best, but the local geometric structure is not considered in this method. The second class considers the geometric structure and supervised information. For example, Zhang et al. [19] proposed topology-preserving NMF by minimizing the constrained gradient distance. Cai et al. [20] proposed graph-regularized NMF (GNMF), in which the geometric structure is encoded via a k-nearest neighbor (NN) graph. However, it is difficult to find a suitable graph and the associated parameters. Wang et al. [21] proposed multiple-graph-regularized NMF by combining multiple graphs. While it is more adaptive than GNMF, it is not robust against occlusion. Yang et al. [22] proposed nonnegative graph embedding (NGE) within the graph-embedding framework.



NGE preserves the favored similarities and penalizes the unfavored similarities via intrinsic and penalty graphs. However, the graph in NGE becomes unreliable for noisy data, and an unreliable graph results in an indistinct local structure. Zhang et al. [23], [24] proposed a robust NGE by replacing the $\ell_2$-norm with the $\ell_1$-norm. However, the $\ell_1$-norm might not be suitable for occlusion [25]. To better understand the common properties of different NMF algorithms, Guan et al. [26] proposed a nonnegative patch alignment framework (NPAF). NPAF constructs a part optimization by building a patch for each sample and obtains the global coordinates by a whole-alignment strategy. NPAF shows that the intrinsic differences between the various NMF variants lie in the patches that they build. It provides a systematic framework to understand different NMF algorithms and to develop new ones. However, NPAF still assumes that the approximation errors follow a Gaussian distribution.

In this paper, we propose a robust nonnegative patch alignment (RNPA) for dimensionality reduction, which is a general framework and can incorporate all types of supervised or unsupervised information. We utilize the CIM to measure the approximation errors and formulate the induced problem as a weighted NMF with a whole-alignment term via the half-quadratic minimization technique. We show that the derived solution has simple multiplicative update rules and converges to a local optimum. As applications of this framework, we propose the LP-RNA and SP-RNA algorithms, which are unsupervised and supervised, respectively. In the locality-preserving robust nonnegative alignment (LP-RNA), we propose a locally sparse graph to encode the geometric structure of high-dimensional data in the part optimization. In particular, we select a relatively large number p of nearest neighbors for each sample and then obtain the sparse representation with respect to these neighbors. The sparse representation is utilized to build a graph that simultaneously enjoys locality, sparseness, and robustness. In the SP-RNA, we simultaneously utilize local geometric structure and discriminative information. Several experimental results on synthetic and real data sets demonstrate that the learned low-dimensional representation is more robust against occlusion and large-magnitude noise than most existing dimensionality reduction methods.

The rest of this paper is organized as follows. We present the RNPA framework in Section II-A, followed by the algorithm in Section II-B and the proof of convergence in Section II-C. Based on this framework, we propose LP-RNA in Section III-A and SP-RNA in Section III-B. Finally, extensive experimental results are shown in Section IV, and the conclusions are presented in Section V.

II. ROBUST NONNEGATIVE PATCH ALIGNMENT BASED ON CORRENTROPY-INDUCED METRIC

Let us begin by establishing some notation. We denote a vector by $x$ and a matrix by $X$. For a matrix $X$, $X_{ij}$ denotes the entry in the $i$th row and $j$th column, $X_{i,*}$ denotes the $i$th row, and $X_{*,j}$ denotes the $j$th column.


The function $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, $\|\cdot\|_2$ denotes the $\ell_2$-norm, and $\|\cdot\|_1$ denotes the $\ell_1$-norm of a vector. The operator $\odot$ denotes the Hadamard (entrywise) matrix multiplication, and $\mathrm{Tr}(\cdot)$ denotes the trace operator.

PAF [27] is a systematic framework by which to understand the essential differences of various dimensionality reduction algorithms. It contains part optimization and whole alignment.

Part Optimization: Let $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{m\times N}$, where $N$ is the total number of samples. For a sample $x_i \in X$, the $K$ related samples $x_{i_1}, x_{i_2}, \ldots, x_{i_K}$ and $x_i$ itself form a local patch $X_i = [x_i, x_{i_1}, x_{i_2}, \ldots, x_{i_K}]$. The corresponding low-dimensional representation of this local patch $X_i$ is denoted by $H_i = [h_i, h_{i_1}, \ldots, h_{i_K}]$. The part optimization aims to find $H_i$ by the following model:

$$\arg\min_{H_i} \mathrm{Tr}\big(H_i L_i H_i^T\big) \tag{1}$$

where $L_i \in \mathbb{R}^{(K+1)\times(K+1)}$ encodes the objective function for the patch $X_i$, which varies for different algorithms.

Whole Alignment: All $H_i$'s can be unified together as a whole based on the assumption that the coordinates of the $i$th patch $H_i$ are selected from the global coordinates $H = [h_1, h_2, \ldots, h_N]$:

$$H_i = H S_i \tag{2}$$

where $S_i \in \mathbb{R}^{N\times(K+1)}$ is the selection matrix whose entries are defined as

$$(S_i)_{pq} = \begin{cases} 1 & \text{if } p = F_i(q) \\ 0 & \text{otherwise} \end{cases}$$

and $F_i = \{i, i_1, i_2, \ldots, i_K\}$ denotes the set of indices for the $i$th patch $X_i$. Replacing $H_i$ with $H S_i$, the global coordinates can be obtained by the alignment strategy [28] as follows:

$$\arg\min_{H_i} \sum_{i=1}^{N} \mathrm{Tr}\big(H_i L_i H_i^T\big) = \arg\min_{H} \sum_{i=1}^{N} \mathrm{Tr}\big(H S_i L_i S_i^T H^T\big) = \arg\min_{H} \mathrm{Tr}\big(H L H^T\big) \tag{3}$$

where $L = \sum_{i=1}^{N} S_i L_i S_i^T$. According to the alignment strategy, $L$ can be obtained via an iterative procedure $L(F_i, F_i) \leftarrow L(F_i, F_i) + L_i$ for $i = 1, 2, \ldots, N$ with the initialization $L = 0$.

Inspired by the patch alignment framework, Guan et al. [26] proposed NPAF, which offers a new viewpoint from which to understand the common properties of different NMF algorithms. Its objective function is defined as

$$\min_{W\ge 0,\, H\ge 0} D(X, WH) + \frac{\gamma}{2}\mathrm{Tr}\big(H L H^T\big) \tag{4}$$

where $W \in \mathbb{R}^{m\times r}$ is the basis matrix, $H \in \mathbb{R}^{r\times N}$ is the coordinate matrix, and $D(X, WH)$ measures the approximation error between $X$ and $WH$.
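To make the whole-alignment bookkeeping in (2) and (3) concrete, the following NumPy sketch builds the alignment matrix $L$ from the per-patch matrices $L_i$ with the iterative rule $L(F_i, F_i) \leftarrow L(F_i, F_i) + L_i$; the function and variable names are ours, not from the paper.

```python
import numpy as np

def assemble_alignment_matrix(patch_indices, patch_matrices, n_samples):
    """Whole-alignment step of PAF/NPAF.

    patch_indices[i]  : list F_i = [i, i_1, ..., i_K] of sample indices in patch i
    patch_matrices[i] : (K+1) x (K+1) matrix L_i from the part optimization
    Returns the N x N alignment matrix L = sum_i S_i L_i S_i^T.
    """
    L = np.zeros((n_samples, n_samples))
    for F_i, L_i in zip(patch_indices, patch_matrices):
        idx = np.asarray(F_i)
        # L(F_i, F_i) <- L(F_i, F_i) + L_i
        L[np.ix_(idx, idx)] += L_i
    return L
```

Explicitly forming the selection matrices $S_i$ is unnecessary in practice; indexing with np.ix_ realizes $S_i L_i S_i^T$ directly.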


A. Definition of Robust Nonnegative Patch Alignment

In NPAF, the Kullback–Leibler divergence or the Frobenius matrix norm is adopted to measure the reconstruction errors. Although they have nice mathematical properties and are effective in applications, they are not the best choices for non-Gaussian noise. In practice, the noise is much more complex, and it is not suitable to model it with a single fixed distribution. How the data noise is modeled is vital for dimensionality reduction of corrupted high-dimensional data. Recently, the CIM proposed in [29] has exhibited very robust performance in face recognition [30], feature extraction [31], and NMF [18]. It adaptively approximates an unknown error distribution via an iterative process. The CIM is defined as

$$\mathrm{CIM}(x, y) = \left( k_\theta(0) - \frac{1}{m}\sum_{i=1}^{m} k_\theta\big[x(i) - y(i)\big] \right)^{1/2}$$

where $\{x(i), y(i)\}_{i=1}^{m}$ are the samples and $k_\theta(\cdot)$ is a kernel function that satisfies the Mercer theorem [32]. In this paper, we only consider the Gaussian kernel, i.e., $g_\sigma(e) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{e^2}{2\sigma^2}\right)$. The success of the CIM is that it is dominated by small errors and is insensitive to the large errors that correspond to corruptions or occlusions. As shown in Fig. 1, the value of the $\ell_2$-norm function increases quadratically with the error, whereas the CIM approaches the $\ell_0$-norm as the error increases.

Fig. 1. Three different error functions. L2 is the $\ell_2$-norm and L1 is the $\ell_1$-norm.

Utilizing the CIM to measure the reconstruction errors, we obtain the objective function of RNPA:

$$\min_{W\ge 0,\, H\ge 0} \big[ J(X, WH) + \lambda\, \mathrm{Tr}(H L H^T) \big] \tag{5}$$

where $J(X, WH) = \sum_{i,j} \big\{1 - g_\sigma\big[X_{ij} - (WH)_{ij}\big]\big\}$.

B. Algorithm for Robust Nonnegative Patch Alignment

The presence of the nonconvex term in the objective function (5) makes it difficult to minimize directly. The half-quadratic technique, first proposed to solve constrained image restoration in [33], is suitable for optimizing this class of problems. By introducing additional auxiliary variables, it reformulates the nonconvex term as an augmented objective function in an enlarged parameter space. The local optimum is then found in the augmented parameter space by alternating between the auxiliary variables and the optimization variables. According to the properties of the convex conjugate function [34] and half-quadratic theory [35], we obtain the following equation for the nonconvex term:

$$J(X, WH) = \min_{P_{ij}\in\mathbb{R}} \left\{ \sum_{i,j} P_{ij}\big[X_{ij} - (WH)_{ij}\big]^2 + \varphi(P_{ij}) \right\} \tag{6}$$

where $P_{ij}$ is the associated auxiliary variable and $\varphi(\cdot)$ is the conjugate function of $g_\sigma(e)$. For a fixed error $E_{ij} = X_{ij} - (WH)_{ij}$, the minimum is reached at $P_{ij} = g_\sigma\big[X_{ij} - (WH)_{ij}\big]$ [31]. Substituting (6) into (5), we obtain the following augmented objective function:

$$\min_{W\ge 0,\, H\ge 0,\, P\ge 0} \left\{ \sum_{i,j} \Big( P_{ij}\big[X_{ij} - (WH)_{ij}\big]^2 + \varphi(P_{ij}) \Big) + \lambda\, \mathrm{Tr}(H L H^T) \right\}. \tag{7}$$

Problem (7) can be solved recursively by optimizing one variable while keeping the others fixed.

1) Optimize P for Given W and H: Given $W$ and $H$, (7) can be solved separately with respect to each $P_{ij}$. The solution is

$$P_{ij} = g_\sigma\big[X_{ij} - (WH)_{ij}\big]. \tag{8}$$
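As a small illustration of the weight update in (8), the sketch below computes the Gaussian-kernel weights $P_{ij} = g_\sigma[X_{ij} - (WH)_{ij}]$, with $\sigma^2$ set adaptively from the current residual as in step 1 of Algorithm 1 presented later in this section. The names and the small epsilon guard are ours.

```python
import numpy as np

def gaussian_kernel(e, sigma):
    """g_sigma(e) = 1/(sqrt(2*pi)*sigma) * exp(-e^2 / (2*sigma^2))."""
    return np.exp(-e**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def cim_weights(X, W, H, eps=1e-12):
    """Half-quadratic auxiliary weights P_ij = g_sigma(E_ij), cf. eq. (8).

    sigma^2 is chosen adaptively from the current residual, as in step 1
    of Algorithm 1: sigma^2 = ||X - WH||_F^2 / (2 m N).
    """
    E = X - W @ H
    m, N = X.shape
    sigma2 = (E**2).sum() / (2.0 * m * N) + eps
    return gaussian_kernel(E, np.sqrt(sigma2))
```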

2) Optimize W for Given H and P: Given $H$ and $P$, (7) can be solved by optimizing each row of $W$ separately:

$$\mathcal{L}(W, \Theta) = \sum_{i=1}^{m} (X_{i,*} - W_{i,*}H)\, A_i\, (X_{i,*} - W_{i,*}H)^T + \mathrm{Tr}(\Theta^T W)$$

where $A_i = \mathrm{diag}(P_{i,*}) \in \mathbb{R}^{N\times N}$ and $\Theta = [\theta_{ik}] \in \mathbb{R}^{m\times r}$ contains the Lagrange multipliers for the nonnegativity constraints $W \ge 0$. The partial derivative of $\mathcal{L}(W, \Theta)$ with respect to $W_{ik}$ is

$$\frac{\partial \mathcal{L}(W, \Theta)}{\partial W_{ik}} = -2\big(X_{i,*} A_i H^T\big)_k + 2\big(W_{i,*} H A_i H^T\big)_k + \theta_{ik}. \tag{9}$$

Setting (9) to zero and utilizing the Karush–Kuhn–Tucker (KKT) conditions $\theta_{ik} W_{ik} = 0$, we get the following equation for $W_{ik}$:

$$\big[-2\big(X_{i,*} A_i H^T\big)_k + 2\big(W_{i,*} H A_i H^T\big)_k + \theta_{ik}\big]\, W_{ik} = 0$$

which leads to the following update rule for $W_{ik}$:

$$W_{ik} = W_{ik}\, \frac{\big(X_{i,*} A_i H^T\big)_k}{\big(W_{i,*} H A_i H^T\big)_k} = W_{ik}\, \frac{\big[(X \odot P) H^T\big]_{ik}}{\big\{[(WH) \odot P] H^T\big\}_{ik}}. \tag{10}$$


3) Optimize H for Given W and P: In this case, (7) is equivalent to minimizing the following objective function:

$$\mathcal{L}(H, \Psi) = \sum_{j=1}^{N} (X_{*,j} - W H_{*,j})^T B_j (X_{*,j} - W H_{*,j}) + \mathrm{Tr}(\Psi^T H) + \lambda\, \mathrm{Tr}(H L H^T)$$

where $B_j = \mathrm{diag}(P_{*,j}) \in \mathbb{R}^{m\times m}$ and $\Psi = [\psi_{kj}] \in \mathbb{R}^{r\times N}$ contains the Lagrange multipliers for the nonnegativity constraints $H \ge 0$. The partial derivative of $\mathcal{L}(H, \Psi)$ with respect to $H_{kj}$ is

$$\frac{\partial \mathcal{L}(H, \Psi)}{\partial H_{kj}} = -2\big(W^T B_j X_{*,j}\big)_k + 2\big(W^T B_j W H_{*,j}\big)_k + \psi_{kj} + 2\lambda (H L)_{kj}. \tag{11}$$

Setting (11) to zero and utilizing the KKT conditions $\psi_{kj} H_{kj} = 0$, we get the following equation for $H_{kj}$:

$$-2\big(W^T B_j X_{*,j}\big)_k H_{kj} + 2\big(W^T B_j W H_{*,j}\big)_k H_{kj} + \psi_{kj} H_{kj} + 2\lambda (H L)_{kj} H_{kj} = 0. \tag{12}$$

By separating $L$ into two parts, i.e., $L = L^+ - L^-$ with $L^+_{ij} = (|L_{ij}| + L_{ij})/2$ and $L^-_{ij} = (|L_{ij}| - L_{ij})/2$, and with some simple calculus, (12) leads to the update rule for $H_{kj}$:

$$H_{kj} = H_{kj}\, \frac{\big[W^T B_j X_{*,j} + \lambda (H L^-)_{*,j}\big]_k}{\big[W^T B_j W H_{*,j} + \lambda (H L^+)_{*,j}\big]_k} = H_{kj}\, \frac{\big[W^T (X \odot P) + \lambda H L^-\big]_{kj}}{\big[W^T ((WH) \odot P) + \lambda H L^+\big]_{kj}}. \tag{13}$$

This procedure is repeated until convergence. The complete algorithm is summarized in Algorithm 1.

Algorithm 1: Robust Nonnegative Patch Alignment (RNPA)

Input: the nonnegative matrix $X \in \mathbb{R}_+^{m\times N}$, the alignment matrix $L$, and the parameters $\lambda$ and $r$.
Output: the nonnegative basis matrix $W \in \mathbb{R}_+^{m\times r}$ and the coefficient matrix $H \in \mathbb{R}_+^{r\times N}$.
Initialization: randomly initialize $W^0$ and $H^0$.
While not converged:
  1: Compute $\sigma^2 = \frac{1}{2mN}\sum_{i=1}^{m}\sum_{j=1}^{N}\big[X_{ij} - (WH)_{ij}\big]^2$;
  2: Update $P$ by $P_{ij} = g_\sigma(E_{ij})$;
  3: Update $W$ by $W_{ik} \leftarrow W_{ik}\,\frac{[(X\odot P)H^T]_{ik}}{\{[(WH)\odot P]H^T\}_{ik}}$;
  4: Update $H$ by $H_{kj} \leftarrow H_{kj}\,\frac{[W^T(X\odot P)+\lambda H L^-]_{kj}}{[W^T((WH)\odot P)+\lambda H L^+]_{kj}}$.
end
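A minimal NumPy sketch of Algorithm 1 under the update rules (10) and (13) might look as follows. It is a simplified reading of the algorithm: the random initialization, the fixed iteration count standing in for a convergence test, and the epsilon guard are our choices.

```python
import numpy as np

def rnpa(X, L, r, lam=100.0, n_iter=200, eps=1e-10, seed=0):
    """Robust Nonnegative Patch Alignment (Algorithm 1), illustrative sketch.

    X : (m, N) nonnegative data, L : (N, N) alignment matrix,
    r : latent dimension, lam : regularization weight lambda.
    Returns nonnegative factors W (m, r) and H (r, N).
    """
    rng = np.random.default_rng(seed)
    m, N = X.shape
    W = rng.random((m, r))
    H = rng.random((r, N))
    # split L into nonnegative parts: L = L_plus - L_minus
    L_plus = (np.abs(L) + L) / 2.0
    L_minus = (np.abs(L) - L) / 2.0
    for _ in range(n_iter):
        E = X - W @ H
        sigma2 = (E**2).sum() / (2.0 * m * N) + eps          # step 1
        P = np.exp(-E**2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)  # step 2
        # step 3: multiplicative update of W, eq. (10)
        W *= ((X * P) @ H.T) / (((W @ H) * P) @ H.T + eps)
        # step 4: multiplicative update of H, eq. (13)
        H *= (W.T @ (X * P) + lam * (H @ L_minus)) / \
             (W.T @ ((W @ H) * P) + lam * (H @ L_plus) + eps)
    return W, H
```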

C. Convergence Proof for RNPA

Following the strategy suggested in [12], we prove that the value of the objective function (5) is nonincreasing under the update rules (10) and (13).

Theorem 1: Let $W^{(t)}$ and $H^{(t)}$ be the updated results at iteration $t$, and let $W^{(t+1)}$ and $H^{(t+1)}$ be the results at iteration $t+1$. Then, we have the following.

A) Under update rule (10), the following inequality holds:

$$\|X - W^{(t+1)} H\|_P^2 \le \|X - W^{(t)} H\|_P^2. \tag{14}$$

B) Under update rule (13), the following inequality holds:

$$\|X - W H^{(t+1)}\|_P^2 + \lambda\, \mathrm{Tr}\big[H^{(t+1)} L (H^{(t+1)})^T\big] \le \|X - W H^{(t)}\|_P^2 + \lambda\, \mathrm{Tr}\big[H^{(t)} L (H^{(t)})^T\big] \tag{15}$$

where $P = [P_{ij}] \in \mathbb{R}^{m\times N}$ and $\|X\|_P^2 = \sum_{ij} X_{ij}^2 P_{ij}$.

Thus, the objective function (5) does not increase under the update rules (10) and (13), and it converges because objective (5) is nonnegative. The detailed proof is presented in the Appendix.

III. LP-RNA AND SP-RNA

Recently, numerous research results have demonstrated that high-dimensional data possibly reside on nonlinear submanifolds [8], [36]. How to detect the essential local geometric structure of those submanifolds is important for dimensionality reduction. As applications of the RNPA framework, we propose LP-RNA for unlabeled data and SP-RNA for labeled data. In Section III-A, we present LP-RNA, which preserves the local geometric structure by constructing a locally sparse graph. In Section III-B, we present SP-RNA, which simultaneously utilizes local geometric structure and discriminative information.

A. Locality-Preserving Robust Nonnegative Alignment

1) Locally Sparse Graph Construction: A graph is an effective tool to characterize the underlying manifold. Traditional k-NN graphs or ε-ball graphs are sensitive to noise or outliers, and their parameters are difficult to set. When the high-dimensional data are corrupted by large-magnitude noise, the graph constructed by traditional methods becomes unreliable. As pointed out in [37], the key issues in graph construction are robustness, sparsity, and locality. The $\ell_1$-graph [38] enjoys sparsity by encoding each sample as a sparse representation of the remaining samples, but it ignores locality and might wrongly select intersubspace samples for the representation. Cheng et al. [39] proposed the sparsity-induced similarity (SIS) measure by introducing locality constraints into the sparse representation [40]. However, the graph based on SIS is not robust because noise and outliers are not considered in their model.

In this section, we propose a locally sparse graph by considering locality and sparsity constraints simultaneously. We are motivated by the assumption that, for a given data point, there exists a small neighborhood in which only the points that come from the same subspace or manifold lie approximately in a low-dimensional affine subspace, as shown in Fig. 2. This assumption is widely adopted in manifold learning [8], [41] and subspace clustering [42]. The exact neighborhood size is usually unknown for each sample $x_i$. So, we select a relatively large number $p$ of its neighbors and sort them in descending order of Euclidean distance, e.g., $X_i^p = [x_{i_1}, x_{i_2}, \ldots, x_{i_p}]$. Based on the above assumption, sample $x_i$ can be sparsely represented by a few samples from its neighborhood $X_i^p$. Taking occlusion and outliers into account, this can be formulated as

$$x_i = X_i^p \alpha_i + e_i \tag{16}$$


where $\alpha_i \in \mathbb{R}^p$ contains the sparse representation coefficients and $e_i \in \mathbb{R}^m$ models occlusion or large-magnitude noise. If only some entries of $x_i$ are corrupted, then $[\alpha_i; e_i]$ is sparse. To preserve the stability of the sparse coding, we constrain the representation coefficients of two adjacent atoms to be similar, i.e., we penalize $\sum_{j=1}^{p-1}[\alpha_i(j) - \alpha_i(j+1)]^2$. Thus, we obtain $\alpha_i$ by solving the following problem:

$$\min\ \|[\alpha_i; e_i]\|_1 + \xi_1 \sum_{j=1}^{p-1} \big[\alpha_i(j) - \alpha_i(j+1)\big]^2 \quad \text{s.t.} \quad x_i = \big[X_i^p,\ I_m\big]\begin{bmatrix}\alpha_i \\ e_i\end{bmatrix} \tag{17}$$

where $I_m \in \mathbb{R}^{m\times m}$ is an identity matrix.

Fig. 2. Motivation for the locally sparse graph. Provided sufficient data (such that the manifold is well sampled), each data point and its neighbors lie on or close to a locally linear patch of the manifold.

Considering the representation errors, (17) can be reformulated with Lagrange multipliers as

$$\min\ \|x_i - B_i \beta_i\|_2^2 + \xi_1 \|R\beta_i\|_2^2 + \xi_2 \|\beta_i\|_1 \tag{18}$$

where $\beta_i = [\alpha_i; e_i]$ is the sparse-representation coefficient vector, $B_i = [X_i^p, I_m]$ is the expanded dictionary, and $R$ is a $(p+m)\times(p+m)$ matrix with

$$R_{i,j} = \begin{cases} 1 & \text{if } j < p \text{ and } j = i \\ -1 & \text{if } j < p \text{ and } j = i+1 \\ 0 & \text{otherwise.} \end{cases}$$

Because the second term in (18) is quadratic with respect to the coefficients $\beta_i$, we can incorporate it into the first term:

$$\min\ \left\| \begin{bmatrix} x_i \\ 0 \end{bmatrix} - \begin{bmatrix} B_i \\ \sqrt{\xi_1}\, R \end{bmatrix} \beta_i \right\|_2^2 + \xi_2 \|\beta_i\|_1. \tag{19}$$

This is a standard lasso problem, which can be solved efficiently by available software, such as $\ell_1$-ls [43]. After that, we utilize the sparse-representation coefficients to define the affinity matrix $S \in \mathbb{R}^{N\times N}$ as

$$S_{i,i_j} = \begin{cases} \sigma(\beta_i(j)) & \text{if } i_j \in \{i_1, \ldots, i_p\} \\ 1 & \text{if } i_j = i \\ 0 & \text{otherwise} \end{cases}$$

where $j = 1, 2, \ldots, p$, and $\sigma(t) = |t|$ if $|t| \ge \tau$ and $\sigma(t) = 0$ otherwise. The threshold $\tau$ keeps the strongest connections for each data point and sets the others to zero. Finally, we perform a symmetrization step and obtain the similarity matrix $S \leftarrow (S + S^T)/2$.

2) Part Optimization: Given a sample $x_i$, we utilize $x_i$ and its NNs to form the local patch $X_i = [x_i, x_{i_1}, x_{i_2}, \ldots, x_{i_p}] \in \mathbb{R}^{m\times(p+1)}$ and denote the corresponding low-dimensional representation by $H_i = [h_i, h_{i_1}, h_{i_2}, \ldots, h_{i_p}]$. If $x_i$ and $x_{i_j}$ are close, then the corresponding representations $h_i$ and $h_{i_j}$ should be close as well. Based on the similarity matrix $S$, we obtain the following objective function for patch $X_i$:

$$\min \sum_{j=1}^{p} \|h_i - h_{i_j}\|_2^2\, S_{i,i_j}. \tag{20}$$

Formulation (20) can be rewritten as

$$\min\ \mathrm{Tr}\left( \begin{bmatrix} (h_i - h_{i_1})^T \\ \vdots \\ (h_i - h_{i_p})^T \end{bmatrix} \big[\, h_i - h_{i_1}, \ldots, h_i - h_{i_p} \,\big]\, \mathrm{diag}(s_i) \right)$$

where $s_i = [S_{i,i_1}, S_{i,i_2}, \ldots, S_{i,i_p}]^T \in \mathbb{R}^p$. With simple calculus, we obtain the part optimization

$$\arg\min_{H_i} \mathrm{Tr}\left( H_i \begin{bmatrix} e_p^T \\ -I_p \end{bmatrix} \mathrm{diag}(s_i)\, \big[\, e_p,\ -I_p \,\big]\, H_i^T \right) = \arg\min_{H_i} \mathrm{Tr}\big( H_i L_i H_i^T \big)$$

where $e_p = [1, 1, \ldots, 1]^T \in \mathbb{R}^p$, $I_p$ is a $p\times p$ identity matrix, and

$$L_i = \begin{bmatrix} e_p^T \\ -I_p \end{bmatrix} \mathrm{diag}(s_i)\, \big[\, e_p,\ -I_p \,\big] = \begin{bmatrix} \sum_{j=1}^{p} s_i(j) & -s_i^T \\ -s_i & \mathrm{diag}(s_i) \end{bmatrix}.$$

Using the whole-alignment strategy [28], we can construct the alignment matrix $L$. Finally, the low-dimensional representation $H$ can be obtained using (7).
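A rough sketch of the locally sparse graph construction of Section III-A is given below; scikit-learn's Lasso is used here as a stand-in for the $\ell_1$-ls solver [43] cited above, the penalty scaling is only approximate, and the second helper assembles the per-patch matrix $L_i$ defined above. All names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def locally_sparse_coefficients(x_i, X_nbrs, xi1=0.01, xi2=0.01):
    """Solve the locally sparse coding problem (19) for one sample (sketch).

    x_i    : (m,) sample
    X_nbrs : (m, p) its p nearest neighbours
    Returns alpha (p,), the codes over the neighbours; the error part e_i is dropped.
    """
    m, p = X_nbrs.shape
    B = np.hstack([X_nbrs, np.eye(m)])        # expanded dictionary [X_i^p, I_m]
    R = np.zeros((p + m, p + m))              # smoothness operator on adjacent codes
    for j in range(p - 1):
        R[j, j], R[j, j + 1] = 1.0, -1.0
    A = np.vstack([B, np.sqrt(xi1) * R])      # stacked design of (19)
    y = np.concatenate([x_i, np.zeros(p + m)])
    # alpha is rescaled so the l1 penalty roughly matches xi2 in (19)
    lasso = Lasso(alpha=xi2 / (2 * len(y)), fit_intercept=False, max_iter=5000)
    beta = lasso.fit(A, y).coef_
    return beta[:p]

def lp_patch_matrix(s_i):
    """Part-optimization matrix L_i of LP-RNA built from the patch weights s_i."""
    s_i = np.asarray(s_i, dtype=float)
    p = len(s_i)
    L_i = np.zeros((p + 1, p + 1))
    L_i[0, 0] = s_i.sum()
    L_i[0, 1:] = -s_i
    L_i[1:, 0] = -s_i
    L_i[1:, 1:] = np.diag(s_i)
    return L_i
```

Looping the first function over all samples, thresholding $|\beta_i(j)|$ by $\tau$, and symmetrizing then yields the similarity matrix $S \leftarrow (S + S^T)/2$ used in the part optimization.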

B. Sparsity-Preserving Robust Nonnegative Alignment

According to the discussion above, the locally sparse representation is an effective tool to characterize local geometric structure. In this section, we propose SP-RNA, which simultaneously utilizes the local geometric structure and the discriminative information.

Given a sample $x_i \in X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{m\times N}$, we group the other samples into two classes according to the label of $x_i$: 1) samples in the same class as $x_i$, denoted by $X^s$, and 2) samples from classes different from that of $x_i$, denoted by $X^d$. We utilize $x_i$ and its $k_1$ NNs in $X^s$ to form the within-class patch $X_i^w = [x_i, x_{i^1}, x_{i^2}, \ldots, x_{i^{k_1}}]$, and use $x_i$ and its $k_2$ NNs in $X^d$ to form the between-class patch $X_i^b = [x_i, x_{i_1}, x_{i_2}, \ldots, x_{i_{k_2}}]$. The corresponding low-dimensional representations of $X_i^w$ and $X_i^b$ are denoted by $H_i^w = [h_i, h_{i^1}, h_{i^2}, \ldots, h_{i^{k_1}}]$ and $H_i^b = [h_i, h_{i_1}, h_{i_2}, \ldots, h_{i_{k_2}}]$, respectively. Thus, the low-dimensional representation of the whole patch is $H_i = [h_i, h_{i^1}, \ldots, h_{i^{k_1}}, h_{i_1}, \ldots, h_{i_{k_2}}]$.

For $x_i$, the locally sparse representation in $X_i^w$ can be formulated as follows:

$$s_i^w = \arg\min\ \|[s_i; e_i]\|_1 \quad \text{s.t.} \quad \begin{bmatrix} x_i \\ 1 \end{bmatrix} = \begin{bmatrix} X_i^w & I_m \\ \mathbf{1}^T & \mathbf{0}^T \end{bmatrix} \begin{bmatrix} s_i \\ e_i \end{bmatrix} \tag{21}$$


where $\mathbf{1} \in \mathbb{R}^{k_1+1}$ is an all-ones vector, $\mathbf{0} \in \mathbb{R}^{m}$ is a zero vector, and $I_m$ is an $m\times m$ identity matrix. Here $e_i$ represents noise, and $s_i = [0, s_i(1), \ldots, s_i(k_1)]^T \in \mathbb{R}^{k_1+1}$ is a column vector whose first element is zero, indicating that $x_i$ has been excluded from $X_i^w$. Similar to [44], we constrain $\mathbf{1}^T s_i = 1$ to achieve invariance to translations.

Motivated by LLE [8], we expect such a linear reconstruction relationship $s_i^w$ to be preserved in the low-dimensional space:

$$\min\ \|h_i - H_i^w s_i^w\|_2^2. \tag{22}$$

With simple calculus, we obtain the part optimization on the within-class patch $X_i^w$:

$$\min_{H_i^w} \mathrm{Tr}\big( H_i^w L_i^w (H_i^w)^T \big) \tag{23}$$

where $L_i^w = c c^T$ with $c^T = [1, -s_i^{k_1}] \in \mathbb{R}^{k_1+1}$ and $s_i^{k_1} = [s_i^w(1), \ldots, s_i^w(k_1)] \in \mathbb{R}^{k_1}$.

Fig. 3. Motivation for class separability. Left: data distribution in the original space, where the thickness of the boundary indicates the Euclidean distance. Right: data distribution in the low-dimensional space, where the thickness of the boundary indicates the weight in the part optimization.

For the between-class patch $X_i^b$, we expect that if $x_i$ and a neighbor $x_{i_j}$ are close, then the distance between the corresponding $h_i$ and $h_{i_j}$ should be as large as possible, as shown in Fig. 3. Motivated by the weighted pairwise Fisher criteria [45], we formulate the part optimization on patch $X_i^b$ as follows:

$$\max \sum_{j=1}^{k_2} \omega_{i,i_j} \|h_i - h_{i_j}\|_2^2 \tag{24}$$

where $\omega_{i,i_j} = \frac{1}{2 (d_{i,i_j})^2}\,\mathrm{erf}\!\left(\frac{d_{i,i_j}}{2\sqrt{2}}\right)$, $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x \exp(-t^2)\,dt$, and $d_{i,i_j} = \|x_i - x_{i_j}\|_2$ is the distance between $x_i$ and $x_{i_j}$. With a simple algebraic reformulation, the part optimization becomes

$$\max_{H_i^b} \mathrm{Tr}\big( H_i^b L_i^b (H_i^b)^T \big) \tag{25}$$

where

$$L_i^b = \begin{bmatrix} e_{k_2} \\ -I_{k_2} \end{bmatrix} \mathrm{diag}(\omega_i) \begin{bmatrix} e_{k_2} \\ -I_{k_2} \end{bmatrix}^T \tag{26}$$

$e_{k_2} = [1, 1, \ldots, 1] \in \mathbb{R}^{k_2}$ is a row vector, $I_{k_2}$ is a $k_2\times k_2$ identity matrix, and $\omega_i = [\omega_{i,i_1}, \ldots, \omega_{i,i_{k_2}}]^T$.

Using the whole-alignment strategy [28], we obtain the following two objective functions:

$$\min_{H} \sum_{i=1}^{N} \mathrm{Tr}\big( H_i^w L_i^w (H_i^w)^T \big) = \min_{H} \mathrm{Tr}\big(H L^w H^T\big) \tag{27}$$

$$\max_{H} \sum_{i=1}^{N} \mathrm{Tr}\big( H_i^b L_i^b (H_i^b)^T \big) = \max_{H} \mathrm{Tr}\big(H L^b H^T\big) \tag{28}$$

where $L^w = \sum_{i=1}^{N} S_i^w L_i^w (S_i^w)^T$ and $L^b = \sum_{i=1}^{N} S_i^b L_i^b (S_i^b)^T$ are the within-class and between-class alignment matrices, respectively. $S_i^w \in \mathbb{R}^{N\times(k_1+1)}$ and $S_i^b \in \mathbb{R}^{N\times(k_2+1)}$ are the selection matrices for the within-class and between-class patches, defined as

$$(S_i^w)_{pq} = \begin{cases} 1 & \text{if } p = F_i^w(q) \\ 0 & \text{otherwise} \end{cases} \qquad (S_i^b)_{jk} = \begin{cases} 1 & \text{if } j = F_i^b(k) \\ 0 & \text{otherwise} \end{cases}$$

where $F_i^w = [i, i^1, \ldots, i^{k_1}]$ and $F_i^b = [i, i_1, \ldots, i_{k_2}]$ are the sets of indices of the within-class patch $X_i^w$ and the between-class patch $X_i^b$, respectively. As suggested in [26], we obtain the whole alignment by combining (27) and (28):

$$\min_{H} \mathrm{Tr}\Big( H\, (L^b)^{-\frac{1}{2}}\, L^w\, \big[(L^b)^{-\frac{1}{2}}\big]^T H^T \Big). \tag{29}$$

Let $L = (L^b)^{-1/2} L^w \big[(L^b)^{-1/2}\big]^T$; the low-dimensional representation $H$ can then be obtained using (7).
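The SP-RNA alignment step can be sketched as follows, assuming the within-class and between-class alignment matrices $L^w$ and $L^b$ have already been assembled patch by patch (the between-class patch matrix $L_i^b$ has the same structure as the LP-RNA patch matrix, with $\omega_i$ in place of $s_i$). The eigendecomposition-based inverse square root and the small ridge term are our choices for numerical stability, not details from the paper.

```python
import numpy as np
from scipy.special import erf

def between_class_weight(d):
    """Weighted pairwise Fisher weight of [45]: omega(d) = erf(d/(2*sqrt(2))) / (2*d**2)."""
    return erf(d / (2.0 * np.sqrt(2.0))) / (2.0 * d**2)

def within_class_patch_matrix(s_w):
    """L_i^w = c c^T with c = [1, -s_i^{k1}]^T, as in (23)."""
    c = np.concatenate(([1.0], -np.asarray(s_w, dtype=float)))
    return np.outer(c, c)

def combine_alignment_matrices(L_w, L_b, reg=1e-8):
    """L = (L_b)^{-1/2} L_w [(L_b)^{-1/2}]^T, as in (29)."""
    n = L_b.shape[0]
    vals, vecs = np.linalg.eigh(L_b + reg * np.eye(n))        # symmetric eigendecomposition
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, reg))) @ vecs.T
    return inv_sqrt @ L_w @ inv_sqrt.T
```

The resulting matrix L is then passed to the RNPA solver in place of the generic alignment matrix.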

IV. EXPERIMENTS

In this section, we report the results of extensive experiments to evaluate the effectiveness and robustness of the proposed methods LP-RNA and SP-RNA. We compare the results with those of the following representative algorithms: 1) PCA [6]; 2) LPP [36]; 3) NMF [12]; 4) LDA [4]; 5) GNMF [20]; 6) L21-NMF [14]; 7) CIM-NMF [18]; 8) group-sparse graph-regularized nonnegative matrix factorization (GS-GNMF) [37]; and 9) nonnegative discriminative locality alignment (NDLA) [26]. To evaluate the effectiveness of the locally sparse graph, we also combine the k-NN graph with the CIM and denote the result as CIM-k-NN graph-regularized nonnegative matrix factorization (CIM-KGNMF). In Section IV-A, we describe the data sets and evaluation metrics. In Section IV-B, we conduct image clustering on the Carnegie Mellon University Pose, Illumination, and Expression (CMU-PIE) and extended Yale B data sets under different noise levels and occlusions. In Section IV-C, we perform image recognition to evaluate the discriminative power of the learned representation. We analyze the influence of the parameters in Section IV-D and show the convergence behavior in Section IV-E.

A. Data and Evaluation Metrics

Four benchmark image data sets, i.e., CMU-PIE, extended Yale B, ORL, and COIL20, were used in our experiments. They are described in detail as follows.

1) CMU-PIE: The CMU-PIE database contains 41 368 images of 68 persons under 13 different poses, 43 different illumination conditions, and 4 different expressions. In our experiment, we chose 42 images at pose 27 for each person under different lighting, at 32×32 resolution; these can be downloaded from http://www.zjucadcg.cn/dengcai/Data/GNMF.html. There are 2856 images in all.

2) Extended Yale B: The extended Yale B database contains 161 289 images of 38 persons under 9 poses and 64 different illuminations, cropped to 32×32. In our experiments, we chose about 64 frontal-pose images under different illuminations for each person. There are 2414 images in all.

3) ORL: The ORL data set contains 400 images, each of 32×32 pixels. In all, there are 40 subjects with 10 images per subject.

4) COIL20: The COIL20 data set contains a total of 1440 images of 20 different objects, depicted from 72 different viewpoints. The image size is 32×32.

For each data set, 20% of the images are selected at random and corrupted by salt&pepper noise or occluded by a white block. For salt&pepper noise, the noise level takes one of the values in {5%, 10%, 20%, 30%, 40%, 50%}. For occlusion, the block size varies within {6×6, 8×8, 10×10, 12×12, 14×14, 16×16} pixels and the position is random. Some examples with different noise levels or block sizes are shown in Fig. 4.

Fig. 4. From top to bottom: sample images from the CMU-PIE, extended Yale B, ORL, and COIL20 data sets, respectively. Left-most column: clean images. Upper row for each data set: images corrupted by salt&pepper noise with levels varying from 5% to 50%. Lower row for each data set: images occluded by a random white block varying in size from 6×6 to 16×16.

To evaluate the clustering performance, we compare the generated clusters with the ground truth by computing the clustering accuracy (ACC) and the normalized mutual information (NMI); see [46] for details.
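The corruption protocol described above might be reproduced with a sketch like the following; the function name and the exact way pixels are flipped are our assumptions, and images are taken to lie in [0, 1].

```python
import numpy as np

def corrupt_images(images, noise_level=0.2, block=10, mode="salt_pepper", frac=0.2, seed=0):
    """Corrupt a random fraction of images, as in the experimental protocol.

    images : (n, h, w) array with pixel values in [0, 1].
    mode   : "salt_pepper" (flip pixels to 0/1) or "occlusion" (random white block).
    """
    rng = np.random.default_rng(seed)
    out = images.copy()
    n, h, w = images.shape
    picked = rng.choice(n, size=int(frac * n), replace=False)
    for i in picked:
        if mode == "salt_pepper":
            mask = rng.random((h, w)) < noise_level
            out[i][mask] = rng.integers(0, 2, size=mask.sum())   # pepper (0) or salt (1)
        else:
            r = rng.integers(0, h - block + 1)
            c = rng.integers(0, w - block + 1)
            out[i, r:r + block, c:c + block] = 1.0                # white block
    return out
```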

B. Image Clustering

In this section, we perform image clustering in three parts. In Section IV-B1, we visualize the clustering results on CMU-PIE. In Section IV-B2, we conduct image clustering on CMU-PIE with the hidden factor fixed at r = 68 while changing the noise level. In Section IV-B3, we fix the salt&pepper noise level at 20% and the occlusion size at 10×10, while the hidden factor r varies within {10, 20, 30, 40, 50, 60, 68} for CMU-PIE and {4, 8, 12, 16, 20, 24, 28, 32, 36} for extended Yale B.

We adopt k-means as the baseline and run it in the original space. For all other methods, the representations are first learned, and then k-means is applied in the new representation space. The hidden factor r is set to the number of clusters. Each experiment is repeated 10 times, and the average value and standard deviation are reported. For GNMF, we use k = 5 as the neighbor size and λ = 100 as the regularization parameter according to [20]. For NMF, we run the GNMF program with the regularization parameter λ set to 0. For GS-GNMF, we use an elastic net as the group-sparse regularization and set λ1 = 0.0001, λ2 = 0.001, and λ = 100 according to [37]. For L21-NMF and CIM-NMF, we use the same settings as suggested in the original papers. In LPP, the graph is constructed using the k NNs with k = 5. For LP-RNA, we use cross validation to estimate the parameters λ, ξ1, ξ2, and the neighbor size p. The candidate set for λ, ξ1, and ξ2 is $\{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}, 10^{4}, 10^{5}\}$, and the candidate set for p is {50, 100, 150, 200, 250, 300, 400, 500}. For each subject, seven samples are selected at random to constitute the training set and six samples to constitute the validation set, and the rest are used as the test set. According to the results, we set the neighbor size to 100 for CMU-PIE and to 200 for extended Yale B, with ξ1 = ξ2 = 0.01. The regularization parameter λ is set to 100 in LP-RNA. The maximum iteration number is adopted as the stopping criterion and is set to 1500 for CIM-NMF and to 200 for the other NMF methods. The influence of the parameters is discussed in Section IV-D.

1) Visualization of Clustering Results: To visualize the clustering results, we select at random three subjects from the CMU-PIE data set with 20% salt&pepper noise and occlusion size 10×10. There are 42 samples for each subject and 126 samples in all. The hidden factor r is set to 3. The learned representations in the new space are shown in Figs. 5(a)–(h) and 6(a)–(h). LP-RNA clearly performs best in these two cases. For salt&pepper noise, as shown in Fig. 5, LPP, GNMF, GS-GNMF, and LP-RNA obtain better results than the other methods. Note that all four methods explicitly exploit the geometric structure. LP-RNA and GS-GNMF both achieve a value of one on ACC and NMI, whereas the points are much better separated by LP-RNA. The results of GNMF are worse than those of the other three methods. These results demonstrate that the discriminative ability can be improved if the geometric structure is utilized efficiently. For occlusion, as shown in Fig. 6, the performance of LPP, GNMF, and GS-GNMF is worse than that for salt&pepper noise. However, LP-RNA still achieves an ACC of 0.96 and an NMI of 0.85, and the points are much better separated by LP-RNA. This demonstrates that LP-RNA is more robust than the other methods.
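The clustering protocol used throughout this section, running k-means on the learned coordinates H and scoring with ACC and NMI, might be sketched as follows; the Hungarian-matching ACC is the standard construction rather than code from the paper, and labels are assumed to be integers starting at 0.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Best-match clustering accuracy via the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    row, col = linear_sum_assignment(-cost)   # maximize the matched counts
    return cost[row, col].sum() / len(y_true)

def evaluate_representation(H, y_true, n_clusters, seed=0):
    """Run k-means on the columns of H and report (ACC, NMI)."""
    y_pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(H.T)
    return clustering_accuracy(y_true, y_pred), normalized_mutual_info_score(y_true, y_pred)
```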


Fig. 5. Visualization of clustering results for three clusters selected at random from the CMU-PIE data set with a 20% salt&pepper noise level. The hidden factor r = 3, and the different shapes indicate different subjects. Clustering results of (a) PCA (0.556, 0.282), (b) LPP (0.984, 0.912), (c) NMF (0.611, 0.322), (d) L21-NMF (0.579, 0.310), (e) CIM-NMF (0.571, 0.325), (f) GNMF (0.905, 0.769), (g) GS-GNMF (1, 1), and (h) LP-RNA (1, 1). Numbers in parentheses: associated values of ACC and NMI, respectively.

Fig. 6. Visualization of clustering results for three clusters selected at random from CMU-PIE data set with occlusion size 10 × 10. The hidden factor r = 3 and different shapes of the points indicate different subjects. Clustering results of (a) PCA (0.397, 0.024), (b) LPP (0.714, 0.431), (c) NMF (0.556, 0.305), (d) L21-NMF (0.611, 0.344), (e) CIM-NMF (0.532, 0.198), (f) GNMF (0.524, 0.511), (g) GS-GNMF (0.587, 0.326), and (h) LP-RNA (0.960, 0.849). Numbers in the parentheses: associated values of ACC and NMI, respectively.

2) Fixed Hidden Factor: Furthermore, while the noise level is varied, we conduct clustering on CMU-PIE with the hidden factor r fixed at 68 for the two types of noise. The results are listed in Tables I and II.

TABLE I: ACC AND NMI ON THE CMU-PIE DATA SET WITH SALT&PEPPER NOISE

TABLE II: ACC AND NMI ON THE CMU-PIE DATA SET WITH OCCLUSIONS

Overall, for all methods, the performance under salt&pepper noise is better than that under occlusion. LP-RNA achieves the best performance in both cases for all noise levels. For occlusion, LP-RNA outperforms the second-best method, LPP, by up to 6% on ACC and 2% on NMI. For salt&pepper noise, LP-RNA outperforms the second-best method, LPP, by up to 5% on average on both metrics.

3) Fixed Noise Level: In this section, we vary the hidden factor with the noise level fixed and conduct clustering on CMU-PIE and extended Yale B. The clustering results are shown in Fig. 7. For salt&pepper noise, as shown in Fig. 7(a) and (b), LP-RNA, LPP, GNMF, GS-GNMF, and CIM-KGNMF perform better than the other methods because the geometric structure is explicitly exploited in those five methods. For occlusion, the performance of GS-GNMF drops sharply, while LP-RNA still performs well, as shown in Fig. 7(c) and (d). This demonstrates that the graph in GS-GNMF becomes unreliable when occlusion or outliers appear. The extended Yale B data set is more difficult than CMU-PIE because of the extreme variation in illumination, and the performance of all methods decreases. However, as shown in Fig. 7(e) and (f), LP-RNA performs much better than the others. This verifies that the locally sparse graph can efficiently capture the geometric structure. For occlusion on this data set, as shown in Fig. 7(g) and (h), the performance is worse than LPP in low dimensions. This might be because the sparseness assumption is violated in the low-dimensional feature space.

C. Image Recognition

In this section, we discuss image recognition on the CMU-PIE, ORL, and COIL20 data sets. For these data sets, we fix the noise levels to be the same as in Section IV-B3. The subspace dimension r ranges over {30, 60, 90, 120, 150}. For PCA, LDA, and LPP, the NN rule is used for classification. For the other methods, we adopt the testing algorithm proposed in [17], following [16], in which the CIM distance is used for SP-RNA. We divided each data set into three separate sets: 1) a training set; 2) a validation set; and 3) a testing set, selected in the same way as in Section IV-B. The training set was used to learn the low-dimensional subspace, and the validation set was used to determine the optimal parameters. According to the results on the validation set, we use k1 = 6, k2 = 20, and λ = 0.0001 for SP-RNA and NDLA on these three data sets. For GNMF and GS-GNMF, we set λ = 0.001 for all three data sets. For NDLA, we utilize the multiplicative update rule obtained by separating the whole-alignment matrix L into two parts, according to [26]. We run each experiment 10 times independently and report the average recognition accuracy and the standard deviation on the testing set. The other settings are the same as in Section IV-B.
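For the NMF-based methods, a hedged sketch of the nearest-neighbor classification step with the CIM distance (as mentioned above for SP-RNA) is given below; the kernel width σ and the way test coordinates are obtained are assumptions on our part, not details from [16], [17].

```python
import numpy as np

def cim_distance(x, y, sigma=1.0):
    """CIM(x, y) = sqrt( mean_i [ g_sigma(0) - g_sigma(x_i - y_i) ] )."""
    g0 = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)
    g = np.exp(-(x - y)**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return np.sqrt(np.mean(g0 - g))

def nn_classify(H_train, y_train, H_test, sigma=1.0):
    """Nearest-neighbor classification in the learned coordinate space (columns = samples)."""
    preds = []
    for h in H_test.T:
        d = [cim_distance(h, ht, sigma) for ht in H_train.T]
        preds.append(y_train[int(np.argmin(d))])
    return np.array(preds)
```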

As shown in Fig. 8, all the NMF-based methods perform better as the subspace dimension r increases. On the CMU-PIE data set, SP-RNA achieves the best result, and the best recognition rate is 98.87% for salt&pepper noise. The best results of PCA and LPP, however, are still 20. The test results show that the selected model is effective.

Fig. 13 shows the performance for different values of the parameter λ. Fig. 13(a) and (b) shows the clustering performance on CMU-PIE and extended Yale B for occlusion size 10×10. LP-RNA achieves consistently good performance over the wide range of $[10^{2}, 10^{5}]$.


Fig. 13(c) and (d) shows the recognition rate when the subspace dimension r is set to 30. The performance of the three regularized NMF methods falls sharply when λ is larger than 1. SP-RNA achieves the best performance over the wide range of $[10^{-5}, 10^{-2}]$, which means that the proposed methods LP-RNA and SP-RNA perform robustly over a wide range of λ, although the performance degrades severely when λ lies outside this range.

Fig. 13. Performance versus the parameter λ for dimension r = 30. (a) and (b) Clustering results on CMU-PIE and extended Yale B for occlusion of size 10×10, respectively. (c) and (d) Recognition results on CMU-PIE and extended Yale B for occlusion of size 10×10, respectively.

E. Algorithmic Convergence

As proved in Section II-C, the update rules of Algorithm 1 guarantee a locally optimal solution of the objective function in (5). Fig. 14 shows how the objective function decreases with increasing iteration number on the CMU-PIE and extended Yale B data sets. Compared with GNMF, LP-RNA converges slightly more slowly, which is caused by the nonconvex term in the objective function. However, LP-RNA is faster than CIM-NMF because the geometric structure of the data is considered explicitly in the model. As suggested in [20], the maximum iteration number is adopted as the stopping criterion and is set to 1500 for CIM-NMF and to 200 for the other NMF methods in our experiments.

Fig. 14. Convergence curves of GNMF, CIM-NMF, and LP-RNA. (a) Results on CMU-PIE. (b) Results on the extended Yale B.

V. CONCLUSION

In this paper, we propose RNPA for dimensionality reduction based on the CIM. It is a general dimensionality reduction method because it can incorporate supervised and unsupervised information. It includes a reconstruction term and a whole-alignment term. We propose an unsupervised alignment term using the local geometric structure, i.e., LP-RNA, and a supervised alignment term that incorporates label information, i.e., SP-RNA. In LP-RNA, we propose the locally sparse graph to capture the local geometric structure based on the submanifold assumption. In SP-RNA, we simultaneously utilize the local geometric structure and discriminative information: sparsity preserving is used to characterize the local geometric structure, and the weighted distance is used to measure the separability of different classes. We propose a simple update algorithm and prove that it converges to a local optimum. Extensive experiments show that the learned representation is more robust against large-magnitude noise, occlusion, and extreme variations in illumination than most existing dimensionality reduction methods.

APPENDIX

PROOF OF THEOREM A

We rewrite (14) as

$$\big\|\sqrt{P} \odot \big(X - W^{(t+1)} H\big)\big\|_F^2 \le \big\|\sqrt{P} \odot \big(X - W^{(t)} H\big)\big\|_F^2.$$

Let

$$F(W) = \mathrm{Tr}\big( (P \odot (WH)) H^T W^T - 2 (P \odot X) H^T W^T \big). \tag{30}$$

We need to find an auxiliary function for $F(W)$ and then find the minimum of that auxiliary function. By applying the matrix inequality proposed in [14], we have

$$\mathrm{Tr}\big( (P \odot (WH)) H^T W^T \big) \le \sum_{ij} \frac{\big[(P \odot (W'H)) H^T\big]_{ij}\, W_{ij}^2}{W'_{ij}}.$$

Based on the fact that $z \ge 1 + \log z$ for all $z > 0$, the following inequality holds for the second term in (30):

$$\mathrm{Tr}\big( -2 (P \odot X) H^T W^T \big) \le -2 \sum_{ij} \big[(P \odot X) H^T\big]_{ij}\, W'_{ij} \left( 1 + \log \frac{W_{ij}}{W'_{ij}} \right).$$


Thus, it is easy to verify that the function

$$Z(W, W') = \sum_{ij} \frac{\big[(P \odot (W'H)) H^T\big]_{ij}\, W_{ij}^2}{W'_{ij}} - 2 \sum_{ij} \big[(P \odot X) H^T\big]_{ij}\, W'_{ij} \left( 1 + \log \frac{W_{ij}}{W'_{ij}} \right)$$

is an auxiliary function for $F(W)$. The gradient of $Z(W, W')$ with respect to $W_{ij}$ is

$$\frac{\partial Z(W, W')}{\partial W_{ij}} = 2\,\frac{\big[(P \odot (W'H)) H^T\big]_{ij}\, W_{ij}}{W'_{ij}} - 2\,\big[(P \odot X) H^T\big]_{ij}\,\frac{W'_{ij}}{W_{ij}}.$$

By setting the gradient of $Z(W, W')$ to zero, we obtain the update rule (10). Utilizing the property of the auxiliary function proposed in [12] and the results above, we have the following inequality:

$$F\big(W^{(t+1)}\big) = Z\big(W^{(t+1)}, W^{(t+1)}\big) \le Z\big(W^{(t+1)}, W^{(t)}\big) \le Z\big(W^{(t)}, W^{(t)}\big) = F\big(W^{(t)}\big).$$

Thus, inequality (14) holds because $F(W)$ is bounded from below.

PROOF OF THEOREM B

Similar to the proof of Theorem A, we rewrite (15) as

$$\big\|\sqrt{P} \odot \big(X - W H^{(t+1)}\big)\big\|_F^2 + \lambda\,\mathrm{Tr}\big(H^{(t+1)} L (H^{(t+1)})^T\big) \le \big\|\sqrt{P} \odot \big(X - W H^{(t)}\big)\big\|_F^2 + \lambda\,\mathrm{Tr}\big(H^{(t)} L (H^{(t)})^T\big).$$

Let

$$F(H) = \mathrm{Tr}\big( (P \odot (WH)) H^T W^T - 2 (P \odot X) H^T W^T \big) + \lambda\,\mathrm{Tr}(H L H^T) = \mathrm{Tr}\big( (P \odot (WH)) H^T W^T \big) + \lambda\,\mathrm{Tr}(H L^+ H^T) + \mathrm{Tr}\big( -2 (P \odot X) H^T W^T \big) - \lambda\,\mathrm{Tr}(H L^- H^T).$$

The main step is to find an auxiliary function for $F(H)$ and to find the minimum of that auxiliary function. By applying the matrix inequality proposed in [14], we have

$$\mathrm{Tr}\big( W^T (P \odot (WH)) H^T \big) \le \sum_{ij} \frac{\big[W^T (P \odot (WH'))\big]_{ij}\, H_{ij}^2}{H'_{ij}}, \qquad \mathrm{Tr}\big(\lambda H L^+ H^T\big) \le \lambda \sum_{ij} \frac{(H' L^+)_{ij}\, H_{ij}^2}{H'_{ij}}.$$

Using the inequality $z \ge 1 + \log z$ for all $z > 0$, the following inequalities hold:

$$\mathrm{Tr}\big( -2 W^T (P \odot X) H^T \big) \le -2 \sum_{ij} \big[W^T (P \odot X)\big]_{ij}\, H'_{ij} \left( 1 + \log \frac{H_{ij}}{H'_{ij}} \right)$$

$$-\mathrm{Tr}\big(\lambda H L^- H^T\big) \le -\lambda \sum_{ijk} L^-_{jk}\, H'_{ij} H'_{ik} \left( 1 + \log \frac{H_{ij} H_{ik}}{H'_{ij} H'_{ik}} \right).$$

By summing all the bounds, we can verify that the following function is an auxiliary function of $F(H)$:

$$Z(H, H') = -2 \sum_{ij} \big[W^T (P \odot X)\big]_{ij}\, H'_{ij} \left( 1 + \log \frac{H_{ij}}{H'_{ij}} \right) - \lambda \sum_{ijk} L^-_{jk}\, H'_{ij} H'_{ik} \left( 1 + \log \frac{H_{ij} H_{ik}}{H'_{ij} H'_{ik}} \right) + \sum_{ij} \frac{\big[W^T (P \odot (WH'))\big]_{ij}\, H_{ij}^2}{H'_{ij}} + \lambda \sum_{ij} \frac{(H' L^+)_{ij}\, H_{ij}^2}{H'_{ij}}.$$

Similar to the proof of Theorem A, we obtain the update rule (13) by setting the gradient of $Z(H, H')$ to zero. Utilizing the property of the auxiliary function proposed in [12] and the results above, the following inequality holds:

$$F\big(H^{(t+1)}\big) = Z\big(H^{(t+1)}, H^{(t+1)}\big) \le Z\big(H^{(t+1)}, H^{(t)}\big) \le Z\big(H^{(t)}, H^{(t)}\big) = F\big(H^{(t)}\big).$$

Thus, inequality (15) holds because $F(H)$ is bounded from below.

REFERENCES

[1] L. O. Jimenez and D. A. Landgrebe, "Supervised classification in high-dimensional space: Geometrical, statistical, and asymptotical properties of multivariate data," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 28, no. 1, pp. 39–54, Jan. 1998.
[2] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, "Graph embedding and extensions: A general framework for dimensionality reduction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 40–51, Jan. 2007.
[3] X. R. Li, T. Jiang, and K. Zhang, "Efficient and robust feature extraction by maximum margin criterion," IEEE Trans. Neural Netw., vol. 17, no. 1, pp. 157–165, Jan. 2006.
[4] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces vs. fisherfaces: Recognition using class specific linear projection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 711–720, Jul. 1997.
[5] J. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, "Face recognition using LDA-based algorithms," IEEE Trans. Neural Netw., vol. 14, no. 1, pp. 195–200, Jan. 2003.
[6] M. A. Turk and A. P. Pentland, "Face recognition using eigenfaces," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 1991, pp. 586–591.
[7] M. Belkin and P. Niyogi, "Laplacian eigenmaps and spectral techniques for embedding and clustering," in Advances in Neural Information Processing Systems, vol. 14. Cambridge, MA, USA: MIT Press, 2001, pp. 585–591.
[8] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[9] X. He and P. Niyogi, "Locality preserving projections," in Advances in Neural Information Processing Systems, vol. 16. Cambridge, MA, USA: MIT Press, 2003, pp. 153–160.
[10] E. J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?" J. ACM, vol. 58, no. 3, 2011, Art. ID 11.
[11] G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, "Subspace learning from image gradient orientations," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 12, pp. 2454–2466, Dec. 2012.
[12] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, Oct. 1999.
[13] N. K. Logothetis and D. L. Sheinberg, "Visual object recognition," Annu. Rev. Neurosci., vol. 19, no. 1, pp. 577–621, 1996.


[14] D. Kong, C. Ding, and H. Huang, "Robust nonnegative matrix factorization using L21-norm," in Proc. 20th ACM Int. Conf. Inf. Knowl. Manage., 2011, pp. 673–682.
[15] J. Huang, F. Nie, H. Huang, and C. Ding, "Robust manifold nonnegative matrix factorization," ACM Trans. Knowl. Discovery Data, vol. 8, no. 3, 2014, Art. ID 11.
[16] N. Guan, D. Tao, Z. Luo, and J. Shawe-Taylor, "MahNMF: Manhattan non-negative matrix factorization," Mach. Learn., vol. 1, no. 5, pp. 11–43, 2012.
[17] R. Sandler and M. Lindenbaum, "Nonnegative matrix factorization with earth mover's distance metric for image analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1590–1602, Aug. 2011.
[18] L. Du, X. Li, and Y. D. Shen, "Robust nonnegative matrix factorization via half-quadratic minimization," in Proc. IEEE 12th Int. Conf. Data Mining, Dec. 2012, pp. 201–210.
[19] T. Zhang, B. Fang, Y. Y. Tang, G. He, and J. Wen, "Topology preserving non-negative matrix factorization for face recognition," IEEE Trans. Image Process., vol. 17, no. 4, pp. 574–584, Apr. 2008.
[20] D. Cai, X. He, J. Han, and T. S. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1548–1560, Aug. 2011.
[21] J. J.-Y. Wang, H. Bensmail, and X. Gao, "Multiple graph regularized nonnegative matrix factorization," Pattern Recognit., vol. 46, no. 10, pp. 2840–2847, Oct. 2013.
[22] J. Yang, S. Yang, Y. Fu, X. Li, and T. Huang, "Non-negative graph embedding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[23] H. Zhang, Z.-J. Zha, S. Yan, M. Wang, and T.-S. Chua, "Robust nonnegative graph embedding: Towards noisy data, unreliable graphs, and noisy labels," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2464–2471.
[24] H. Zhang, Z.-J. Zha, Y. Yang, S. Yan, and T.-S. Chua, "Robust (semi) nonnegative graph embedding," IEEE Trans. Image Process., vol. 23, no. 7, pp. 2996–3012, Jul. 2014.
[25] M. Yang, L. Zhang, J. Yang, and D. Zhang, "Regularized robust coding for face recognition," IEEE Trans. Image Process., vol. 22, no. 5, pp. 1753–1766, May 2013.
[26] N. Guan, D. Tao, Z. Luo, and B. Yuan, "Non-negative patch alignment framework," IEEE Trans. Neural Netw., vol. 22, no. 8, pp. 1218–1230, Aug. 2011.
[27] T. Zhang, D. Tao, X. Li, and J. Yang, "Patch alignment for dimensionality reduction," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1299–1313, Sep. 2009.
[28] Z. Zhang and H. Zha, "Principal manifolds and nonlinear dimensionality reduction via tangent space alignment," SIAM J. Sci. Comput., vol. 26, no. 1, pp. 313–338, 2005.
[29] W. Liu, P. P. Pokharel, and J. C. Principe, "Correntropy: Properties and applications in non-Gaussian signal processing," IEEE Trans. Signal Process., vol. 55, no. 11, pp. 5286–5298, Nov. 2007.
[30] R. He, W.-S. Zheng, and B.-G. Hu, "Maximum correntropy criterion for robust face recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1561–1576, Aug. 2011.
[31] X.-T. Yuan and B.-G. Hu, "Robust feature extraction via information theoretic learning," in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 1193–1200.
[32] R. Herbrich, Learning Kernel Classifiers: Theory and Algorithms. Cambridge, MA, USA: MIT Press, 2001.
[33] D. Geman and G. Reynolds, "Constrained restoration and the recovery of discontinuities," IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 3, pp. 367–383, Mar. 1992.
[34] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[35] M. Nikolova and R. H. Chan, "The equivalence of half-quadratic minimization and the gradient linearization iteration," IEEE Trans. Image Process., vol. 16, no. 6, pp. 1623–1627, Jun. 2007.
[36] X. He, S. Yan, Y. Hu, P. Niyogi, and H. Zhang, "Face recognition using Laplacianfaces," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 3, pp. 328–340, Mar. 2005.
[37] Y. Fang, R. Wang, and B. Dai, "Graph-oriented learning via automatic group sparsity for data analysis," in Proc. IEEE 12th Int. Conf. Data Mining, Dec. 2012, pp. 251–259.
[38] B. Cheng, J. Yang, S. Yan, Y. Fu, and T. S. Huang, "Learning with $\ell_1$-graph for image analysis," IEEE Trans. Image Process., vol. 19, no. 4, pp. 858–866, Apr. 2010.


[39] H. Cheng, Z. Liu, and J. Yang, "Sparsity induced similarity measure for label propagation," in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep./Oct. 2009, pp. 317–324.
[40] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.
[41] E. Elhamifar and R. Vidal, "Sparse manifold clustering and embedding," in Advances in Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates, Inc., 2011, pp. 55–63.
[42] E. Elhamifar and R. Vidal, "Sparse subspace clustering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 2790–2797.
[43] K. Koh, S.-J. Kim, and S. Boyd, "An interior-point method for large-scale $\ell_1$-regularized logistic regression," J. Mach. Learn. Res., vol. 8, pp. 1519–1555, Jul. 2007.
[44] L. Qiao, S. Chen, and X. Tan, "Sparsity preserving projections with applications to face recognition," Pattern Recognit., vol. 43, no. 1, pp. 331–341, Jan. 2010.
[45] M. Loog, R. P. W. Duin, and R. Haeb-Umbach, "Multiclass linear dimension reduction by weighted pairwise Fisher criteria," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 7, pp. 762–766, Jul. 2001.
[46] M. Zheng et al., "Graph regularized sparse coding for image representation," IEEE Trans. Image Process., vol. 20, no. 5, pp. 1327–1336, May 2011.
[47] D. Donoho and V. Stodden, "When does non-negative matrix factorization give a correct decomposition into parts?" in Advances in Neural Information Processing Systems, vol. 16. Cambridge, MA, USA: MIT Press, 2003, pp. 1141–1148.
[48] P. O. Hoyer, "Non-negative matrix factorization with sparseness constraints," J. Mach. Learn. Res., vol. 5, pp. 1457–1469, Nov. 2004.

Xinge You (M’08–SM’10) received the B.S. and M.S. degrees in mathematics from Hubei University, Wuhan, China, in 1990 and 2000, respectively, and the Ph.D. degree from the Department of Computer Science, The Hong Kong Baptist University, Hong Kong, in 2004. He is currently a Professor with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan. His current research interests include wavelets and its application, signal and image processing, pattern recognition, machine learning, and computer vision.

Weihua Ou received the M.S. degree in mathematics from Southeast University, Nanjing, China, in 2006, and the Ph.D. degree in information and communication engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2014. He is currently an Associate Professor with the School of Mathematics and Computer Science, Guizhou Normal University, Guiyang, China. His current research interests include sparse (low-rank) representation and its applications, multiview learning, image processing, and computer vision.


Chun Lung Philip Chen (S'88–M'88–SM'94–F'07) received the M.S. degree in electrical engineering from the University of Michigan, Ann Arbor, MI, USA, and the Ph.D. degree in electrical engineering from Purdue University, West Lafayette, IN, USA, in 1985 and 1988, respectively. He was a Tenured Professor, a Department Head, and an Associate Dean at two different universities in the U.S. for 23 years. He is currently the Dean of the Faculty of Science and Technology, and a Chair Professor with the Department of Computer and Information Science, University of Macau, Macau, China. His current research interests include systems, cybernetics, and computational intelligence. Dr. Chen is a fellow of the American Association for the Advancement of Science. He has been an Editor-in-Chief of the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS since 2014. He is currently an Associate Editor of several IEEE transactions. He is also the Chair of Technical Committee 9.1 (Economic and Business Systems) of the International Federation of Automatic Control.

Qiang Li received the B.Eng. degree in electronics and information engineering and the M.Eng. degree in signals and information processing from the Huazhong University of Science and Technology, Wuhan, China, in 2010 and 2013, respectively. He is currently pursuing the Ph.D. degree with the University of Technology at Sydney, Sydney, NSW, Australia. His current research interests include probabilistic inference, probabilistic graphical models, image processing, video surveillance, and computer vision.

Ziqi Zhu received the B.S. degree in computer science from Wuhan University, Wuhan, China, in 2005, and the Ph.D. degree in computer science from the Huazhong University of Science and Technology, Wuhan, in 2011. He currently holds a post-doctoral position with the Huazhong University of Science and Technology. His current research interests include topics in pattern recognition and machine learning, such as texture analysis and face recognition.

Yuanyan Tang (S’88–M’88–SM’96–F’04) received the B.S. degree in electrical and computer engineering from Chongqing University, Chongqing, China, the M.Eng. degree in electrical engineering from the Graduate School of Post and Telecommunications, Beijing, China, and the Ph.D. degree in computer science from Concordia University, Montréal, QC, Canada. He is currently the Chair Professor with the Faculty of Science and Technology, University of Macau, Macau, China, a Professor with the Department of Computer Science, Chongqing University, and an Adjunct Professor of Computer Science with Concordia University. He is also an Honorary Professor with Hong Kong Baptist University, Hong Kong, and an Advisory Professor at many institutes in China.
